Updated documentation and inventory #39

Merged
merged 11 commits into from
Sep 8, 2016
24 changes: 20 additions & 4 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -31,9 +31,9 @@ mockrobiota/
└── example-1
├── dataset-metadata.tsv # dataset metadata
├── greengenes # database name
-   └── 13_8 # database version
-   ├── database-identifiers.tsv # database identifiers associated with each mock community member (optional)
-   └── expected-taxonomy.tsv # per-sample taxonomic abundances
+└── 13_8 # database version
+├── database-identifiers.tsv # database identifiers associated with each mock community member (optional)
+└── expected-taxonomy.tsv # per-sample taxonomic abundances
├── sample-metadata.tsv # QIIME-compatible mapping file
└── source
└── taxonomy.tsv # per-sample taxonomic abundances
@@ -51,7 +51,7 @@ The required fields are:

* ``citation``: DOI, PubMed Identifier (PMID), or direct link for original citation.
* ``qiita-id``: Study ID for raw data submitted to [QIITA database](https://qiita.ucsd.edu/).
-* ``raw-data-url``: Direct link to raw data submitted to other public repositories. A valid, working URL must be provided. ``NA`` is not a permitted value for this field.
+* ``raw-data-url``: Direct link to raw data submitted to other public repositories. A valid, working URL must be provided. ``NA`` is not a permitted value for this field. See information below about formatting and depositing raw data.
* ``human-readable-description``: A description of the mock community dataset. At minimum, should include the number and types (bacterial, eukaryotic, archaeal, etc) of strains included in the mock community; the number of sample replicates; the investigators responsible for creating the mock community; and the main institution where this mock community was generated. Include as much relevant information as possible. If relevant, indicate the features that are common to or different across the samples included in the dataset (for example, if all are replicates of the same sample); the number of unique samples included; whether strains were mixed at even or uneven ratios; and whether the samples in this mock community are included in any other mock communities, and if so, whether those are marker-gene or metagenome mock communities (or another mock community type).
* ``bokulich2013-id``: This only applies to the founder datasets included in mockrobiota, and indicates the mock community ID used in the [original citation](http://www.nature.com/nmeth/journal/v10/n1/abs/nmeth.2276.html). New mock communities should list ``NA`` as the value for this field.
* ``bokulich2015-id``: This only applies to the founder datasets included in mockrobiota, and indicates the mock community ID used in the [original citation](https://dx.doi.org/10.7287/peerj.preprints.934v2). New mock communities should list ``NA`` as the value for this field.
@@ -89,6 +89,21 @@ Contributors may provide database identifiers associated with each member of the

Several files are described as being classic [biom-format](http://www.biom-format.org) tables. In these files, the first line must begin with the text ``#Taxonomy``, followed by a tab-separated list of one or more sample identifiers. All sample identifiers provided here must be present in ``sample-metadata.tsv``. Each subsequent line should begin with the taxonomic name, followed by a tab-separated list of the relative abundances in each sample. The relative abundances must sum to 1.000 (to three decimal places) for each sample.
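
The table rules in the paragraph above can be sketched as a small check. This is only an illustration and not mockrobiota's own integrity test: the function name `check_taxonomy_table` and the parameter `known_sample_ids` (the sample identifiers read from ``sample-metadata.tsv``) are hypothetical.

```python
def check_taxonomy_table(text, known_sample_ids, places=3):
    """Return a list of problems found in a tab-separated taxonomy table.

    Hypothetical helper: `known_sample_ids` is assumed to hold the sample
    identifiers listed in sample-metadata.tsv.
    """
    rows = [line for line in text.splitlines() if line.strip()]
    if not rows:
        return ["table is empty"]

    header = rows[0].split("\t")
    if header[0] != "#Taxonomy":
        return ["first line does not begin with '#Taxonomy'"]
    sample_ids = header[1:]
    if not sample_ids:
        return ["no sample identifiers in header"]

    problems = []
    for sample_id in sample_ids:
        if sample_id not in known_sample_ids:
            problems.append(f"sample {sample_id!r} is missing from sample-metadata.tsv")

    # Accumulate the per-sample totals across all taxa.
    totals = [0.0] * len(sample_ids)
    for row_number, row in enumerate(rows[1:], start=2):
        fields = row.split("\t")
        if len(fields) != len(sample_ids) + 1:
            problems.append(f"row {row_number}: expected {len(sample_ids)} abundances")
            continue
        try:
            values = [float(value) for value in fields[1:]]
        except ValueError:
            problems.append(f"row {row_number}: non-numeric abundance")
            continue
        for index, value in enumerate(values):
            totals[index] += value

    # Each sample must sum to 1.000 when rounded to three decimal places.
    for sample_id, total in zip(sample_ids, totals):
        if round(total, places) != 1.0:
            problems.append(
                f"sample {sample_id!r}: abundances sum to {total:.{places}f}, not 1.000"
            )
    return problems
```

Rounding before comparing keeps ordinary floating-point error from failing a table whose abundances sum to 1.000 as written.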

## Raw Data
Member

@gregcaporaso gregcaporaso Sep 8, 2016

> but does not manage these external resources and will not guarantee

Change "will not" to "can not"?


mockrobiota does not host raw data files (e.g., sequencing files). All sequencing data and other raw data files must be deposited on public, external websites. Stable, public depositories are preferred, but this requirement is not enforced by mockrobiota. mockrobiota ensures that valid, accessible links are provided in the dataset metadata (if not, integrity checks will fail and your dataset will not be accepted), but does not manage these external resources and will not guarantee the validity of raw data that are contributed by outside users. When preparing raw data for linking to mockrobiota datasets, please observe the following regulations:

1. All raw sequence data should be deposited in .fastq format and archived using standard compression formats, e.g., .gz or .zip
Member

Missing period at end of sentence.

2. Mock community datasets that contain multiple samples must be provided in non-demultiplexed files (i.e., one file per read direction per sequencing run, containing multiple uniquely barcoded samples).
3. Index/barcode sequences must be provided as a separate .fastq file. If QUAL scores do not exist for these reads, please note this in the human-readable-description field of dataset-metadata.tsv for that dataset.
4. Reverse sequencing reads are accepted, but not required. Forward and reverse reads should be submitted as separate files, not as joined reads.
5. All raw data must conform to the following naming conventions:
Member

We need a check for this. I created #41.

- mock-forward-read.fastq.gz
- mock-reverse-read.fastq.gz
Member

Add:

(if applicable)

- mock-index-read.fastq.gz

In other words, all datasets **should** be associated with three raw data files, following the naming conventions above.
Member

Drop this line, the above text is really clear so I think this is just confusing.
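
The naming conventions in item 5 of the Raw Data section lend themselves to an automated check. The sketch below is one possible shape for it and is not the check tracked in #41. The function name `check_raw_data_names` is hypothetical. Treating the reverse-read file as optional is an assumption based on item 4 (reverse reads are accepted but not required).

```python
# File names taken from the naming conventions in the Raw Data section.
REQUIRED_NAMES = {"mock-forward-read.fastq.gz", "mock-index-read.fastq.gz"}
# Assumption: reverse reads are optional, following item 4 of the list.
OPTIONAL_NAMES = {"mock-reverse-read.fastq.gz"}


def check_raw_data_names(filenames):
    """Return a list of naming problems for one dataset's raw data files."""
    names = set(filenames)
    problems = [
        f"missing required file: {name}"
        for name in sorted(REQUIRED_NAMES - names)
    ]
    problems += [
        f"unexpected file name: {name}"
        for name in sorted(names - REQUIRED_NAMES - OPTIONAL_NAMES)
    ]
    return problems
```

For example, `check_raw_data_names(["mock-forward-read.fastq.gz", "reads_R2.fastq"])` reports the missing index file and the unexpected name.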


## Submitting to mockrobiota
mockrobiota is hosted on [GitHub](http://www.github.com), and we use GitHub's [Pull Request](https://help.github.com/articles/using-pull-requests) mechanism for reviewing and accepting submissions. On submission of a pull request, a series of tests will be run to confirm the integrity of the submitted data (as well as to re-test the integrity of all existing data). We require these tests to pass for your data set before we will merge it to ensure the overall integrity of the mockrobiota resource.

@@ -134,3 +149,4 @@ Several issues may arise during database annotation that require careful attenti
* Taxonomy strings (or other annotations) for each strain match actual strings contained in the reference database used.
* Reference database names and versions are correct.
* Relative abundances of each strain sum to 1.000 (to three decimal places) for each sample in the mock community. Our automated data integrity checks will also test this.
* dataset-metadata.tsv lists only valid, publicly accessible URLs.
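
As a rough illustration of the last item, the sketch below reads ``dataset-metadata.tsv`` and checks that ``raw-data-url`` is a well-formed URL rather than ``NA``. It only inspects the syntax and does not test whether the link is reachable, which the real integrity checks would also need to do. The function names and the set of allowed schemes are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Assumption: the guidelines do not name the permitted URL schemes.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def read_dataset_metadata(text):
    """Parse 'name<TAB>value' rows into a dict, skipping the header row."""
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        if (name, value) == ("name", "value"):
            continue
        metadata[name] = value.strip()
    return metadata


def check_raw_data_url(metadata):
    """Return an error message, or None if raw-data-url looks acceptable."""
    url = metadata.get("raw-data-url", "")
    if url in ("", "NA"):
        return "raw-data-url must be a URL; NA is not permitted"
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        return f"raw-data-url is not a well-formed URL: {url}"
    return None
```

A syntactic check like this runs without network access, so it can fail fast before any slower reachability test.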
2 changes: 1 addition & 1 deletion README.md
@@ -4,4 +4,4 @@

A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.

-[Searchable Inventory](https://github.com/caporaso-lab/mockrobiota/blob/issue-24/inventory.tsv) of mock communities.
+[Searchable Inventory](./inventory.tsv) of mock communities.
22 changes: 11 additions & 11 deletions data/example-1/dataset-metadata.tsv
@@ -1,13 +1,13 @@
name value
-citation NA
-qiita-id NA
+citation hyperlink to original citation or pubmed listing (recommended), or optionally full citation details (not recommended). If not published, NA.
+qiita-id Study ID if raw data are deposited in qiita. If not, list NA.
raw-data-url http://www.qiime.org
-human-readable-description This is an example data set for software testing and documentation purposes.
-bokulich2013-id NA
-bokulich2015-id NA
-target-gene 16S
-target-subfragment V4
-study-type marker-gene
-sequencing-instrument illumina-miseq
-physical-specimen-available No
-physical-specimen-contact NA
+human-readable-description Describe this dataset in as much detail as possible. Who generated these data, where, and when? How many strains and species are included in the mock community? How many samples? Were species mixed at equal ratios or uneven ratios? More is always better.
+bokulich2013-id NA. This field is only used for the original 11 datasets in mockrobiota, and lists the dataset ID listed in doi:10.1038/nmeth.2276. All new datasets should list NA for this field.
Member

I don't think I like having the descriptions in this file. It's intended to serve as a real example, and the documentation goes in CONTRIBUTING.md. Could we remove this?

Collaborator Author

Got it. I will move these descriptions to CONTRIBUTING.md

+bokulich2015-id NA. This field is only used for the original 11 datasets in mockrobiota, and lists the dataset ID listed in https://dx.doi.org/10.7287/peerj.preprints.934v2. All new datasets should list NA for this field.
+target-gene For marker-gene studies, what gene was amplified? For example, 16S, ITS, or 18S. For other study types, list NA.
+target-subfragment For marker-gene studies, what gene domain was amplified? For example, V4 or ITS1.
+study-type marker-gene or metagenome (other study types may be accepted). If you already have published mock community data of another type and would like to contribute these data to mockrobiota, create an issue using the link above to propose new study types.
+sequencing-instrument What sequencing technology/instrument was used? For example, illumina-miseq.
+physical-specimen-available Yes or No. Are physical aliquots of this mock community available to share with other researchers?
+contact-email email address of person to contact with questions about this dataset.
4 changes: 2 additions & 2 deletions data/mock-7/dataset-metadata.tsv
@@ -1,7 +1,7 @@
name value
citation doi: 10.1016/j.cell.2012.10.052
qiita-id NA
-raw-data-url http://metagenomics.anl.gov/metagenomics.cgi?page=MetagenomeProject&project=1319
+raw-data-url ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh3/
Member

Why does this link have Turnbaugh3 in it? Shouldn't it be mock-X?

Collaborator Author

@gregcaporaso In the microbio.me FTP, we had the original dataset names for all directories and never changed them. Turnbaugh-3 would be consistent with this. If we are moving the raw data elsewhere, we can rename the directories there... until we decide on the final home I will leave this as is. Does that sound ok?

human-readable-description A mock community composed of 67 bacteria represented in equal abundances in three replicate samples. These are the same community members present in mock-8, which consists of the same mock community samples analyzed on different Illumina sequencing runs. This was generated by Dr. Peter Turnbaugh at Harvard in 2012.
bokulich2013-id NA
bokulich2015-id B7
@@ -10,4 +10,4 @@ target-subfragment V4
study-type marker-gene
sequencing-instrument illumina-hiseq
physical-specimen-available No
-physical-specimen-contact NA
+contact-email [email protected]
Member

Did this change go in all of the dataset-metadata.tsv files?
