TG2 Proposal for identifying synthetic data

Paul J. Morris edited this page Aug 28, 2018 · 3 revisions

A proposal for identifying original, modified, and synthetic example biodiversity data for testing/demonstrating data quality control processes:

Tests of the quality of biodiversity data need data sets against which implementations of those tests can themselves be tested. Such test data could be real data, real data modified to deliberately exercise particular tests for particular results, or wholly synthetic data with errors included to deliberately exercise particular tests in particular ways. It is critically important that consumers of biodiversity data be able to identify data that are synthetic or contain synthetic modifications, as such data deliberately do not accurately represent real occurrences. Modified or wholly synthetic data must be identified in the dataset metadata, but because data sets can be combined, dataset-level marking is not sufficient; the marking must also occur at the record level. Record-level identification is the focus of this proposal.

Real data used as examples:

Use the values in the original source, without modification except:

If no dwc:datasetID is provided, add a value for dwc:datasetID, preferably the DOI of the source data set in which the example record was found. Data sets assembled for testing are expected to contain records derived from a variety of original source data sets.
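
The rule above can be sketched in Python. This is a minimal illustration, assuming records are held as plain dictionaries keyed by Darwin Core term names; the function name `ensure_dataset_id` and the DOI string are hypothetical placeholders and are not part of the proposal.

```python
def ensure_dataset_id(record, source_doi):
    """Return a copy of `record`, adding dwc:datasetID only when it is absent or empty.

    All other values are left exactly as they appear in the original source.
    """
    result = dict(record)  # copy, so the original record is not altered
    if not result.get("dwc:datasetID"):
        result["dwc:datasetID"] = source_doi
    return result


# Hypothetical record and placeholder DOI, for illustration only.
example = ensure_dataset_id(
    {"dwc:scientificName": "Quercus alba"},
    "doi:10.0000/placeholder",
)
```

A record that already carries a dwc:datasetID is returned with that value unchanged.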

Real data with synthetic modifications used as examples:

A. Provide values for record-level terms to unambiguously mark the record as modified. RFC 2606 reserves the top-level domain .invalid "for use in online construction of domain names that are sure to be invalid and which it is obvious at a glance are invalid."

dwc:institutionCode = "example.invalid"
dwc:institutionID = "http://example.invalid/"
dwc:collectionCode = "Modified Example"
dwc:collectionID = "urn:uuid:1887c794-7291-4005-8eee-1afbe9d7814e"

B. Provide a new GUID for the modified record.

dwc:occurrenceID = "urn:uuid:" + a random (version 4) UUID.

C. Include resource relationship terms in the modified example pointing at the original source:

dwc:relatedResourceID = the ID (e.g. occurrenceID) for the original source record.
dwc:relationshipOfResource = "source for modified example record"
dwc:relationshipRemarks = structured data, as described below.

dwc:relationshipRemarks: Structured data specifying the original values for institutionID, institutionCode, collectionCode, collectionID, and occurrenceID; the DOI of the data set in which the original example record was found; a list of the modifications made to the original record; and, potentially, a list of standard tests and expected test results that this example illustrates.
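
Steps A through C can be sketched together in Python. This is only an illustration under stated assumptions: records are plain dictionaries keyed by Darwin Core term names, and dwc:relationshipRemarks is serialized as JSON. The proposal does not prescribe a serialization, so the JSON layout, its key names (`original`, `sourceDataset`, `modifications`), and the function name `make_modified_example` are all hypothetical.

```python
import json
import uuid

# Marker values from step A of the proposal.
MODIFIED_MARKERS = {
    "dwc:institutionCode": "example.invalid",
    "dwc:institutionID": "http://example.invalid/",
    "dwc:collectionCode": "Modified Example",
    "dwc:collectionID": "urn:uuid:1887c794-7291-4005-8eee-1afbe9d7814e",
}

# Terms whose original values are preserved in the remarks.
ORIGINAL_TERMS = (
    "dwc:institutionID",
    "dwc:institutionCode",
    "dwc:collectionCode",
    "dwc:collectionID",
    "dwc:occurrenceID",
)


def make_modified_example(original, changes, source_doi):
    """Build a modified example record from a real record.

    `changes` maps term names to the deliberately altered values.
    The JSON structure of the remarks is an assumption of this sketch.
    """
    remarks = {
        "original": {term: original.get(term) for term in ORIGINAL_TERMS},
        "sourceDataset": source_doi,
        "modifications": [
            {"term": term, "from": original.get(term), "to": value}
            for term, value in changes.items()
        ],
    }
    record = dict(original)            # leave the source record untouched
    record.update(changes)             # apply the deliberate modifications
    record.update(MODIFIED_MARKERS)    # step A: mark the record as modified
    record["dwc:occurrenceID"] = "urn:uuid:" + str(uuid.uuid4())  # step B
    # Step C: point back at the original source record.
    record["dwc:relatedResourceID"] = original.get("dwc:occurrenceID")
    record["dwc:relationshipOfResource"] = "source for modified example record"
    record["dwc:relationshipRemarks"] = json.dumps(remarks, sort_keys=True)
    return record
```

A list of standard tests and expected results, which the proposal mentions as optional, could be carried as a further key in the same structure.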

Wholly Synthetic Data (not recommended)

A. Provide values for record-level terms to unambiguously mark the record as synthetic.

dwc:institutionCode = "example.invalid"
dwc:institutionID = "http://example.invalid/"
dwc:collectionCode = "Synthetic Example"
dwc:collectionID = "urn:uuid:0b1b9546-64aa-446b-bd9c-cbb0eacf4332"

B. Provide a new GUID for the synthetic record.

dwc:occurrenceID = "urn:uuid:" + a random (version 4) UUID.
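
Because the marking is carried on each record, a data consumer could screen for example records even after data sets have been combined. The Python sketch below shows one way to create a synthetic record and to classify any record by its markers. It assumes records are plain dictionaries keyed by Darwin Core term names; the function names and the returned labels are hypothetical and are not defined by the proposal.

```python
import uuid

EXAMPLE_INSTITUTION_CODE = "example.invalid"

# collectionID values given in the proposal for each kind of example record.
EXAMPLE_COLLECTION_IDS = {
    "urn:uuid:1887c794-7291-4005-8eee-1afbe9d7814e": "modified",
    "urn:uuid:0b1b9546-64aa-446b-bd9c-cbb0eacf4332": "synthetic",
}


def new_synthetic_record(terms):
    """Return a wholly synthetic record carrying the step A and B markers."""
    record = dict(terms)
    record.update({
        "dwc:institutionCode": EXAMPLE_INSTITUTION_CODE,
        "dwc:institutionID": "http://example.invalid/",
        "dwc:collectionCode": "Synthetic Example",
        "dwc:collectionID": "urn:uuid:0b1b9546-64aa-446b-bd9c-cbb0eacf4332",
    })
    record["dwc:occurrenceID"] = "urn:uuid:" + str(uuid.uuid4())
    return record


def classify_record(record):
    """Return 'modified', 'synthetic', 'unknown example', or 'unmarked'.

    'unmarked' means only that the example markers are absent.
    """
    if record.get("dwc:institutionCode") != EXAMPLE_INSTITUTION_CODE:
        return "unmarked"
    return EXAMPLE_COLLECTION_IDS.get(
        record.get("dwc:collectionID"), "unknown example"
    )
```

For instance, `classify_record(new_synthetic_record({"dwc:scientificName": "Quercus alba"}))` returns `"synthetic"`, while a record lacking the `example.invalid` institution code is reported as `"unmarked"`.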