add plugin tutorial section integrating metadata #122

Merged
merged 2 commits into from
Sep 2, 2024
2 changes: 2 additions & 0 deletions book/_config.yml
@@ -98,3 +98,5 @@ parse:
dwq2_add_parallel_pipeline_commit_2_url: "{{dwq2_gh_commit_url.format(dwq2_add_parallel_pipeline_commit_2)}}"
dwq2_add_parallel_pipeline_commit_3: "590263ee9bb8c48df09fe62c0e966acfa99f9aff"
dwq2_add_parallel_pipeline_commit_3_url: "{{dwq2_gh_commit_url.format(dwq2_add_parallel_pipeline_commit_3)}}"
dwq2_integrate_metadata_commit: "ab244d80340f323ec05d294cca1b3cebc52f0d17"
dwq2_integrate_metadata_commit_url: "{{dwq2_gh_commit_url.format(dwq2_integrate_metadata_commit)}}"
1 change: 1 addition & 0 deletions book/_toc.yml
@@ -19,6 +19,7 @@ parts:
- file: plugins/tutorials/add-2nd-transformer
- file: plugins/tutorials/add-pipeline
- file: plugins/tutorials/add-parallel-pipeline
- file: plugins/tutorials/integrate-metadata
- file: plugins/tutorials/conclusion
- file: plugins/how-to-guides/intro
sections:
3 changes: 1 addition & 2 deletions book/plugins/tutorials/conclusion.md
@@ -10,5 +10,4 @@ The easiest way to do that is to go back to the first section of the tutorial, [
Start working on adding your methods and visualizers, and don't forget your unit tests and usage examples!
Come join the {{ developer_discussion }} if you need help, have feedback on this tutorial, or have more general feedback on developing QIIME 2 plugins.

Thanks for reading and for your interest in QIIME 2!
Happy coding!
Thanks for reading and for your interest in QIIME 2 - happy coding!
153 changes: 153 additions & 0 deletions book/plugins/tutorials/integrate-metadata.md
@@ -0,0 +1,153 @@
(integrate-metadata)=
# Integrate metadata in Actions

We've done a lot so far in q2-dwq2, but we've left out one thing that is common in almost all *real* plugins: the integration of metadata in Actions.
In this relatively brief section, we'll enable users to optionally provide metadata associated with reference sequences, and we'll integrate that metadata into the tables generated by the `tabulate-las-results` Visualizer and the `search-and-summarize` Pipeline.
This metadata could be just about anything.
For example, reference metadata could be taxonomic information about the reference sequences, enabling viewers of the resulting visualization to infer the taxonomic origin of their query sequences from the taxonomy associated with the closest matching reference sequences.
(That example is also used in the *Sequence Homology Searching* chapter of [An Introduction to Applied Bioinformatics](https://readIAB.org) {cite}`iab-2`, so you can refer there if the idea isn't familiar.)

Let's get started.

(integrate-metadata-commits)=
```{admonition} tl;dr
:class: tip

The code that I wrote for this section can be found here: {{ dwq2_integrate_metadata_commit_url }}.
```

## The input metadata

In this example, the reference metadata will be a QIIME 2 metadata file that has reference sequence ids as its identifiers, and some number of additional columns containing the actual metadata.
For example, the tab-separated text (`.tsv`) metadata file could look like:

```text
id domain phylum genus species
ref-seq1 Bacteria Bacillota Bacillus_A paranthracis
ref-seq2 Bacteria Pseudomonadota Photobacterium kishitanii
ref-seq3 Bacteria Actinomycetota Cryptosporangium arvum
```

This metadata file will be passed into `tabulate_las_results` along with the `LocalAlignmentSearchResults` artifact that it already takes.
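
To make the shape of this input concrete, here's a minimal sketch that reads a small tab-separated table with pandas. This is only an illustration: in the plugin you won't parse the file yourself, because the framework hands your function a `qiime2.Metadata` object. The reduced set of columns (`genus` and `species`) and the in-memory text are assumptions made to keep the example short.

```python
import io

import pandas as pd

# Hypothetical, reduced version of the metadata file shown above.
tsv_text = (
    "id\tgenus\tspecies\n"
    "ref-seq1\tBacillus_A\tparanthracis\n"
    "ref-seq2\tPhotobacterium\tkishitanii\n"
    "ref-seq3\tCryptosporangium\tarvum\n"
)

# Use the first column (the identifiers) as the index of the DataFrame.
metadata_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(metadata_df.index.name)         # id
print(metadata_df.columns.to_list())  # ['genus', 'species']
```

The identifiers end up as the DataFrame's index, and the name of that index comes from whatever the file's first column header happens to be.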

```{note}
By convention in QIIME 2, all relevant identifiers (in our case, those that could be present in the `LocalAlignmentSearchResults` artifact) should be represented in the metadata, or an error should be raised.
However, "extra" identifiers in the metadata that do not show up in the corresponding artifacts (again, the `LocalAlignmentSearchResults` in this example) are allowed.

The goal of this convention is that we want to facilitate users maintaining only a single metadata file.
If an upstream artifact - in this case the sequences in the `FeatureData[Sequence]` artifact that is passed into `local-alignment-search` as `reference_seqs` - is ever filtered to remove some sequences, the user shouldn't have to create a filtered version of their metadata file.
The reason for this is two-fold.
First of all, that would generally be an extra, unnecessary step, so it complicates users' workflows.
But, more importantly, that could result in a proliferation of metadata files that need to be kept in sync.
This is another application of the {term}`DRY` principle: don't encourage your users to duplicate information represented in their metadata file.
```
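
The convention described in the note boils down to a one-directional set comparison. Here's a minimal sketch using plain Python sets; the function name `find_missing_ids` and the ids are hypothetical and exist only for illustration.

```python
def find_missing_ids(artifact_ids, metadata_ids):
    """Return the artifact ids that have no entry in the metadata."""
    return set(artifact_ids) - set(metadata_ids)


metadata_ids = {"ref-seq1", "ref-seq2", "ref-seq3"}

# "Extra" ids in the metadata (here, ref-seq2) are allowed.
print(find_missing_ids({"ref-seq1", "ref-seq3"}, metadata_ids))  # set()

# An id that is absent from the metadata should trigger an error.
print(find_missing_ids({"ref-seq1", "ref-seq9"}, metadata_ids))  # {'ref-seq9'}
```

Only ids missing from the metadata count as a problem; ids missing from the artifact never do.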

## Add an optional input to `tabulate_las_results`

Now that we know what the metadata will look like, let's add a new optional input to our `tabulate_las_results` function.
The new input we add will be called `reference_metadata`, and it will be received as a `qiime2.Metadata` object.
Our updated function signature in `_visualizers.py` will look like this:

```python
def tabulate_las_results(output_dir: str,
                         hits: pd.DataFrame,
                         title: str = _tabulate_las_defaults['title'],
                         reference_metadata: qiime2.Metadata = None) \
        -> None:
```

If `reference_metadata` is provided, we'll take a few steps to integrate it into `hits` before we write out our HTML file.
Otherwise, `tabulate_las_results` will behave exactly as it did before.
I added the following `if` block:

```python
if reference_metadata is not None:
    reference_metadata = reference_metadata.to_dataframe()

    hits.reset_index(inplace=True)

    metadata_index = reference_metadata.index.name
    metadata_columns = reference_metadata.columns.to_list()
    reference_metadata.reset_index(inplace=True)

    missing_ids = \
        set(hits['reference id']) - set(reference_metadata[metadata_index])
    if len(missing_ids) > 0:
        raise KeyError(
            f"The following {len(missing_ids)} IDs are missing from "
            f"reference metadata: {missing_ids}.")

    hits = pd.merge(hits, reference_metadata,
                    left_on='reference id',
                    right_on=metadata_index,
                    how='inner')

    hits.set_index(['query id', 'reference id'], inplace=True)
    column_order = \
        ['percent similarity', 'alignment length', 'score'] + \
        metadata_columns + ['aligned query', 'aligned reference']
    hits = hits.reindex(columns=column_order)
```

This looks like a lot, but it's really a few simple actions.

1. First, convert the `qiime2.Metadata` object to a `pd.DataFrame` using the built-in method on `qiime2.Metadata`.
1. Then, remove the index from `hits` to prepare it for a `pd.merge` operation.
1. Next, cache the metadata's index name because - importantly - we don't know exactly what this index name will be.
The QIIME 2 metadata format allows for a few available options, including `id`, `sample-id`, `feature-id`, and several others, and we don't want to restrict a user to providing any specific one of these.
We also cache the list of metadata columns, and remove the index to prepare it for the `pd.merge` operation with `hits`.
Remember that we also don't know what metadata columns the user will provide, and in this example we're not putting any restrictions on this.
1. Then, we confirm that all ids that are represented in `hits` are present in `reference_metadata`, and we throw an informative error message if any are missing.
Our error message should help a user identify what's wrong, so I chose to indicate how many ids were missing, and provide a list of them.
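1. Next, we merge our hits and the reference metadata on the `reference id` column of `hits` and the identifier column of our metadata. Because we've just confirmed that no ids are missing, the inner join won't drop any rows from `hits`.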
Unlike for the metadata, we *do* know what the index name of `hits` will be because we were explicit about this when we defined our `LocalAlignmentSearchResultsFormat`, so we can refer to it directly.
This is an important distinction that differentiates metadata from {term}`File Formats <File Format>` we define: if we need the flexibility to allow for arbitrary column names, we're generally working with metadata. On the other hand, if our column names are predefined, we should generally be working with a File Format.
1. Then, we prepare the `hits` DataFrame for use downstream.
To do this, we first set the MultiIndex.
I also sorted the columns such that all of the metadata columns come between the *percent similarity*, *alignment length*, and *score* columns and the *aligned query* and *aligned reference* columns of the original `hits` DataFrame.
When developing, I found this column order to be useful for reviewing the results.

After this, our action proceeds the same as if no metadata was provided.

```python
with open(os.path.join(output_dir, "index.html"), "w") as fh:
    fh.write(_html_template % (title, hits.to_html()))
```

## Update `search-and-summarize`

We'll also want this option to be available to users of `search-and-summarize`.
Since `tabulate-las-results` is already part of the `search-and-summarize` Pipeline, all we need to do in our `_pipelines.py` file is add the new optional parameter (`reference_metadata`), and pass it through in our call to `tabulate_las_results_action`.
Try to do that yourself, and refer to my code ({{ dwq2_integrate_metadata_commit_url }}) as needed.
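
If you get stuck, the pass-through pattern can be sketched without QIIME 2 installed. Everything here is a stand-in, not the code from my commit: `FakeContext`, the stub actions, the string "artifacts", and the default title are all hypothetical, and the Pipeline's other parameters are omitted. The only point is that `reference_metadata` defaults to `None` and is forwarded unchanged.

```python
class FakeContext:
    """Hypothetical stand-in for the ctx object passed to a Pipeline."""

    def __init__(self, actions):
        self._actions = actions

    def get_action(self, plugin, action):
        return self._actions[(plugin, action)]


def search_and_summarize(ctx, query_seqs, reference_seqs,
                         title="Placeholder title",
                         reference_metadata=None):
    # Look up the two actions that this Pipeline chains together.
    search_action = ctx.get_action("dwq2", "local_alignment_search")
    tabulate_las_results_action = ctx.get_action(
        "dwq2", "tabulate_las_results")

    hits, = search_action(query_seqs=query_seqs,
                          reference_seqs=reference_seqs)
    # The new part: forward the optional metadata without touching it.
    hits_table, = tabulate_las_results_action(
        hits=hits, title=title, reference_metadata=reference_metadata)
    return hits, hits_table


# Stub actions that return placeholder results and record their inputs.
received = {}


def fake_search(query_seqs, reference_seqs):
    return ("hits-artifact",)


def fake_tabulate(hits, title, reference_metadata=None):
    received["reference_metadata"] = reference_metadata
    return ("visualization",)


ctx = FakeContext({
    ("dwq2", "local_alignment_search"): fake_search,
    ("dwq2", "tabulate_las_results"): fake_tabulate,
})
result = search_and_summarize(ctx, "queries", "references",
                              reference_metadata="metadata-object")
print(result)    # ('hits-artifact', 'visualization')
print(received)  # {'reference_metadata': 'metadata-object'}
```

Because the Pipeline doesn't inspect the metadata itself, all of the validation stays in `tabulate_las_results`, in one place.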

## Update `plugin_setup.py`

Next, we'll need to make the plugin aware of this new parameter when we register the `tabulate-las-results` and `search-and-summarize` actions.
Metadata files are provided as {term}`Parameters <Parameter>` of type `qiime2.plugin.Metadata` on action registration.

In `plugin_setup.py`, add `Metadata` to the list of imports from `qiime2.plugin`.
Then, add the new parameter and description to the dictionaries we created to house these for `tabulate-las-results`.

```python
_tabulate_las_parameters = {'title': Str,
                            'reference_metadata': Metadata}
_tabulate_las_parameter_descriptions = {
    'title': 'Title to use inside visualization.',
    'reference_metadata': 'Reference metadata to be integrated in output.'
}
```

Recall that we reuse these dictionaries when we register `search-and-summarize`, so we only need to add this parameter and description in this one place.

At this point, you should be ready to use this new functionality with your plugin.
Open a terminal in an environment where your implementation of `q2-dwq2` is installed, run `qiime dev refresh-cache`, and you should see the new parameter in calls to `qiime dwq2 tabulate-las-results --help` and `qiime dwq2 search-and-summarize --help`.

## Add unit tests and update the `search-and-summarize` usage example

Finally, wrap this up by adding new unit tests for the functionality.
Do this on your own, and then refer to my code ({{ dwq2_integrate_metadata_commit_url }}) to see how I did it.

In my code for this section, you'll also find a metadata file corresponding to the example reference sequences provided earlier.
Use that file to update the `search-and-summarize` usage example, and then test your code on the command line with the new usage example.

