caporaso-lab · gregcaporaso · Sep 8, 2016 · Sep 6, 2016
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -31,9 +31,9 @@ mockrobiota/
     └── example-1
         ├── dataset-metadata.tsv # dataset metadata
         ├── greengenes # database name
-        │   └── 13_8 # database version
-        │       ├── database-identifiers.tsv # database identifiers associated with each mock community member (optional)
-        │       └── expected-taxonomy.tsv # per-sample taxonomic abundances
+        │   └── 13_8 # database version
+        │       ├── database-identifiers.tsv # database identifiers associated with each mock community member (optional)
+        │       └── expected-taxonomy.tsv # per-sample taxonomic abundances
         ├── sample-metadata.tsv # QIIME-compatible mapping file
         └── source
             └── taxonomy.tsv # per-sample taxonomic abundances
@@ -51,7 +51,7 @@ The required fields are:
 
 * ``citation``: DOI, PubMed Identifier (PMID), or direct link for original citation.
 * ``qiita-id``: Study ID for raw data submitted to [QIITA database](https://qiita.ucsd.edu/).
-* ``raw-data-url``: Direct link to raw data submitted to other public repositories. A valid, working URL must be provided. ``NA`` is not a permitted value for this field.
+* ``raw-data-url``: Direct link to raw data submitted to other public repositories. A valid, working URL must be provided. ``NA`` is not a permitted value for this field. See information below about formatting and depositing raw data.
 * ``human-readable-description``: A description of the mock community dataset. At minimum, should include the number and types (bacterial, eukaryotic, archaeal, etc) of strains included in the mock community; the number of sample replicates; the investigators responsible for creating the mock community; and the main institution where this mock community was generated. Include as much relevant information as possible. If relevant, indicate the features that are common to or different across the samples included in the dataset (for example, if all are replicates of the same sample); the number of unique samples included; whether strains were mixed at even or uneven ratios; and whether the samples in this mock community are included in any other mock communities, and if so, whether those are marker-gene or metagenome mock communities (or another mock community type).
 * ``bokulich2013-id``: This only applies to the founder datasets included in mockrobiota, and indicates the mock community ID used in the [original citation](http://www.nature.com/nmeth/journal/v10/n1/abs/nmeth.2276.html). New mock communities should list ``NA`` as the value for this field.
 * ``bokulich2015-id``: This only applies to the founder datasets included in mockrobiota, and indicates the mock community ID used in the [original citation](https://dx.doi.org/10.7287/peerj.preprints.934v2). New mock communities should list ``NA`` as the value for this field.
@@ -89,6 +89,21 @@ Contributors may provide database identifiers associated with each member of the
 
 Several files are described as being classic [biom-format](http://www.biom-format.org) tables. In these files, the first line must begin with the text ``#Taxonomy``, followed by a tab-separated list of one or more sample identifiers. All sample identifiers provided here must be present in ``sample-metadata.tsv``. Each subsequent line should begin with the taxonomic name, followed by a tab-separated list of the relative abundances in each sample. The relative abundances must sum to 1.000 (to three decimal places) for each sample.
 
+## Raw Data
+
+mockrobiota does not host raw data files (e.g., sequencing files). All sequencing data and other raw data files must be deposited on public, external websites. Stable, public repositories are preferred, but this preference is not enforced by mockrobiota. mockrobiota checks that valid, accessible links are provided in the dataset metadata (if not, integrity checks will fail and your dataset will not be accepted), but it does not manage these external resources and does not guarantee the validity of raw data contributed by outside users. When preparing raw data for linking to mockrobiota datasets, please observe the following requirements:
+
+1. All raw sequence data should be deposited in .fastq format and archived using standard compression formats, e.g., .gz or .zip.
+2. Mock community datasets that contain multiple samples must be provided in non-demultiplexed files (i.e., one file per read direction per sequencing run, containing multiple uniquely barcoded samples).
+3. Index/barcode sequences must be provided as a separate .fastq file. If QUAL scores do not exist for these reads, please note this in the ``human-readable-description`` field of ``dataset-metadata.tsv`` for that dataset.
+4. Reverse sequencing reads are accepted, but not required. Forward and reverse reads should be submitted as separate files, not as joined reads.
+5. All raw data must conform to the following naming conventions:
+    - mock-forward-read.fastq.gz
+    - mock-reverse-read.fastq.gz
+    - mock-index-read.fastq.gz
+
+In other words, each dataset **should** be associated with up to three raw data files (the reverse-read file is optional), following the naming conventions above.
+
 ## Submitting to mockrobiota
 mockrobiota is hosted on [GitHub](http://www.github.com), and we use GitHub's [Pull Request](https://help.github.com/articles/using-pull-requests) mechanism for reviewing and accepting submissions. On submission of a pull request, a series of tests will be run to confirm the integrity of the submitted data (as well as to re-test the integrity of all existing data). We require these tests to pass for your data set before we will merge it to ensure the overall integrity of the mockrobiota resource.
 
@@ -134,3 +149,4 @@ Several issues may arise during database annotation that require careful attenti
 * Taxonomy strings (or other annotations) for each strain match actual strings contained in the reference database used.
 * Reference database names and versions are correct.
 * Relative abundances of each strain sum to 1.000 (to three decimal places) for each sample in the mock community. Our automated data integrity checks will also test this.
+* dataset-metadata.tsv lists only valid, publicly accessible URLs.
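
The abundance rule in the CONTRIBUTING.md changes above (each sample's relative abundances sum to 1.000, to three decimal places, in a table whose first line begins with ``#Taxonomy``) can be sketched as a small check. This is an illustrative sketch only, not the repository's actual integrity test; the function name `check_abundances` and the example table are hypothetical.

```python
def check_abundances(table_text, decimals=3):
    """Return {sample_id: total} for samples whose abundances do not sum to 1.000.

    Hypothetical helper: it assumes the tab-separated layout described in
    CONTRIBUTING.md (header line starting with '#Taxonomy', then one taxon per line).
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    if not header[0].startswith("#Taxonomy"):
        raise ValueError("first line must begin with '#Taxonomy'")
    sample_ids = header[1:]
    totals = dict.fromkeys(sample_ids, 0.0)
    for ln in lines[1:]:
        fields = ln.split("\t")
        # fields[0] is the taxonomic name; the rest are per-sample abundances
        for sample_id, value in zip(sample_ids, fields[1:]):
            totals[sample_id] += float(value)
    # Keep only the samples that violate the sum-to-1.000 rule
    return {s: t for s, t in totals.items() if round(t, decimals) != 1.0}


# Made-up example table: sample S1 sums to 1.0, sample S2 does not
example = (
    "#Taxonomy\tS1\tS2\n"
    "k__Bacteria;p__A\t0.5\t0.5\n"
    "k__Bacteria;p__B\t0.5\t0.4\n"
)
failing = check_abundances(example)
```

A contributor could run a check like this locally before opening a pull request. It does not confirm that the sample identifiers also appear in ``sample-metadata.tsv``.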
diff --git a/README.md b/README.md
@@ -4,4 +4,4 @@
 
 A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.
 
-[Searchable Inventory](https://github.com/caporaso-lab/mockrobiota/blob/issue-24/inventory.tsv) of mock communities.
+[Searchable Inventory](./inventory.tsv) of mock communities.
diff --git a/data/example-1/dataset-metadata.tsv b/data/example-1/dataset-metadata.tsv
@@ -1,13 +1,13 @@
 name	value
-citation	NA
-qiita-id	NA
+citation	hyperlink to original citation or pubmed listing (recommended), or optionally full citation details (not recommended). If not published, NA.
+qiita-id	Study ID if raw data are deposited in qiita. If not, list NA
 raw-data-url	http://www.qiime.org
-human-readable-description	This is an example data set for software testing and documentation purposes.
-bokulich2013-id	NA
-bokulich2015-id	NA
-target-gene	16S
-target-subfragment	V4
-study-type	marker-gene
-sequencing-instrument	illumina-miseq
-physical-specimen-available	No
-physical-specimen-contact	NA
+human-readable-description	Describe this dataset in as much detail as possible. Who generated these data, where, and when? How many strains and species are included in the mock community? How many samples? Were species mixed at equal ratios or uneven ratios? More is always better
+bokulich2013-id	NA. This field is only used for the original 11 datasets in mockrobiota, and lists the dataset ID listed in doi:10.1038/nmeth.2276. All new datasets should list NA for this field.
+bokulich2015-id	NA. This field is only used for the original 11 datasets in mockrobiota, and lists the dataset ID listed in https://dx.doi.org/10.7287/peerj.preprints.934v2. All new datasets should list NA for this field.
+target-gene	For marker-gene studies, what gene was amplified? For example, 16S, ITS, or 18S. For other study types, list NA
+target-subfragment	For marker-gene studies, what gene domain was amplified? For example, V4 or ITS1
+study-type	marker-gene or metagenome. Other study types may be accepted: if you already have published mock community data of another type and would like to contribute these data to mockrobiota, create an issue using the link above to propose new study types.
+sequencing-instrument	What sequencing technology/instrument was used? For example, illumina-miseq
+physical-specimen-available	Yes or No. Are physical aliquots of this mock community available to share with other researchers?
+contact-email	email address of person to contact with questions about this dataset.
diff --git a/data/mock-7/dataset-metadata.tsv b/data/mock-7/dataset-metadata.tsv
@@ -1,7 +1,7 @@
 name	value
 citation	doi: 10.1016/j.cell.2012.10.052
 qiita-id	NA
-raw-data-url	http://metagenomics.anl.gov/metagenomics.cgi?page=MetagenomeProject&project=1319
+raw-data-url	ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh3/
 human-readable-description	A mock community composed of 67 bacteria represented in equal abundances in three replicate samples. These are the same community members present in mock-8, which consists of the same mock community samples analyzed on different Illumina sequencing runs. This was generated by Dr. Peter Turnbaugh at Harvard in 2012.
 bokulich2013-id	NA
 bokulich2015-id	B7
@@ -10,4 +10,4 @@ target-subfragment	V4
 study-type	marker-gene
 sequencing-instrument	illumina-hiseq
 physical-specimen-available	No
-physical-specimen-contact	NA
+contact-email	[email protected]
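
The ``raw-data-url`` rule touched by this change (a URL must be provided and ``NA`` is not permitted) can be illustrated with a syntactic check over a ``name``/``value`` metadata file. This is a hedged sketch rather than the project's real test suite: the function name `check_dataset_metadata` and the accepted URL schemes (`http`, `https`, `ftp`) are assumptions, and the check does not contact the URL, so it cannot confirm that the link actually works.

```python
from urllib.parse import urlparse


def check_dataset_metadata(tsv_text):
    """Return a list of problems found in a name/value dataset-metadata.tsv.

    Hypothetical helper; the accepted URL schemes below are an assumption.
    """
    rows = [ln.split("\t", 1) for ln in tsv_text.splitlines() if ln.strip()]
    if not rows or rows[0] != ["name", "value"]:
        return ["header line must be 'name<TAB>value'"]
    fields = {r[0]: (r[1].strip() if len(r) > 1 else "") for r in rows[1:]}

    problems = []
    url = fields.get("raw-data-url", "")
    parsed = urlparse(url)
    # Syntactic check only: 'NA', an empty value, or a non-URL is rejected
    if url == "NA" or parsed.scheme not in ("http", "https", "ftp") or not parsed.netloc:
        problems.append("raw-data-url must be a URL, not %r" % url)
    # citation and qiita-id are listed as required fields (NA is allowed for them)
    for name in ("citation", "qiita-id"):
        if not fields.get(name):
            problems.append("missing required field: " + name)
    return problems


# Values taken from the mock-7 metadata shown in this diff
good = (
    "name\tvalue\n"
    "citation\tdoi: 10.1016/j.cell.2012.10.052\n"
    "qiita-id\tNA\n"
    "raw-data-url\tftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh3/\n"
)
bad = good.replace(
    "ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh3/", "NA"
)
good_problems = check_dataset_metadata(good)
bad_problems = check_dataset_metadata(bad)
```

Returning a list of problems instead of raising on the first one lets a contributor see every issue in a single pass.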