Report of TDWG DQIG Meeting, São Paulo, Brazil (Mar 2016)
- To further the work of the TDWG/GBIF Data Quality Interest Group and its three Task Groups and to plan future directions
- To test the Framework on Data Quality (TG1) by preparing Profiles based on Use Cases arising out of the two GBIF Fitness for Use Working Groups.
- To begin planning for Data Quality Symposium at TDWG2016.
- DwC: Darwin Core
- FFUWG: (GBIF) Fitness for Use Working Group
- GRBio: Global Registry of Biodiversity Repositories
- MCPD: Multi-Crop Passport Descriptors
- Passport (data): the minimal obligatory set of fields or metadata for germplasm specimens
- SDM: Species Distribution Modeling
- Antonio Mauro Saraiva (Co-convenor TDWG DQIG)
- Arthur Chapman (Co-convenor TDWG DQIG)
- Dmitry Schigel (GBIF)
- Christian Gendreau (GBIF)
- Allan Koch Veiga (USP, Leader TG1 – TDWG DQIG)
- Elizabeth Arneaud (Bioversity International, France – GBIF FFUWG (Agrobiodiversity))
- Nora Patricia Castañeda Álvarez (CIAT – GBIF FFUWG (Agrobiodiversity))
- Enrique Martínez Meyer (UNAM, Mexico – GBIF FFUWG (Species Distribution Modeling))
- Paul Morris (Harvard Univ. – TG1 – TDWG DQIG)
- Lee Belbin (ALA – Leader TG2 – TDWG DQIG) – participated via Skype out of session
- Miles Nicholls (ALA – Leader TG3 – TDWG DQIG) – participated via Skype out of session
- Lourdes Takigami (USP - Secretary)
- Various USP Staff and students attended parts of the workshop
Antonio – introduction and welcome
NB: This workshop followed on from the Biodiversity Data Quality Symposium: Developing a Common Framework to Improve Fitness for Use of Biodiversity Data, held at FAPESP with approximately 200 participants. All the key participants in the workshop presented reports, which were videotaped and will be made available via the USP website (LINK TO BE ADDED). These are not repeated here, but they provide further background for the workshop.
Paul drew the attention of the workshop to the W3C Data Quality Vocabulary currently under discussion and advised that we should make sure our Vocabulary is compatible with it.
Provided background on the Agrobiodiversity WG.
Reported two use cases for Crop Wild Relatives (CWR):
- identify gaps in conservation
- support landscape restoration objectives
Need to integrate Multi-Crop Passport Descriptors (MCPD) into Darwin Core Germplasm Extension terms: taxonomy, status, accession, and country of origin. It should be possible to search GBIF on these MCPD terms.
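To make the intended integration concrete, here is a minimal sketch, assuming MCPD V.2 descriptor names; the Darwin Core targets, and especially the Germplasm Extension term, are illustrative assumptions rather than an agreed mapping.

```python
# Illustrative only: a possible MCPD -> Darwin Core mapping covering the four
# areas above (taxonomy, status, accession, country of origin). The Germplasm
# Extension target below is an assumed term name, not a published one.
MCPD_TO_DWC = {
    "GENUS":    "dwc:genus",                   # taxonomy
    "SPECIES":  "dwc:specificEpithet",         # taxonomy
    "SAMPSTAT": "germplasm:biologicalStatus",  # status (assumed extension term)
    "ACCENUMB": "dwc:catalogNumber",           # accession number
    "ORIGCTY":  "dwc:countryCode",             # country of origin (ISO 3166-1 alpha-3 in MCPD)
}

def mcpd_to_dwc(record: dict) -> dict:
    """Rekey an MCPD-keyed record onto the Darwin Core terms it maps to."""
    return {dwc: record[mcpd] for mcpd, dwc in MCPD_TO_DWC.items() if mcpd in record}

print(mcpd_to_dwc({"ACCENUMB": "PI 123456", "ORIGCTY": "BRA"}))
# -> {'dwc:catalogNumber': 'PI 123456', 'dwc:countryCode': 'BRA'}
```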
Workshop: Agreed that the Profile should cover the existing Darwin Core fields (for which we have tests and assertions/tools) and not the Darwin Core extensions at this time, as we don't yet have the tests etc. for those.
Paul: suggested that there is an analogy with AppleCore, which maps botanical data for herbaria into DwC; basically it is a guide to DwC for herbarium users.
Elizabeth: There are some resources for quality Indicators for Passport Data including:
- Passport Data Completeness Index (PDCI) - EURISCO (on Google Drive)
- GENESYS – Gateway to genetic resources (www.genesys-pgr.org) shows use of the PDCI
Reported that GBIF is the largest source of SDM data. There are three main players:
- Providers
- Aggregators
- Users
There is a need for free flow of data and information between these communities – especially with respect to feedback on quality.
In a survey they conducted, the question was asked: “In your last use of GBIF data, what % did you discard as useless?” The responses indicated that about 50% of records downloaded from GBIF are discarded from use in modeling.
Currently there is no free flow of information from the users back to the providers – e.g. users need to be able to feed a flag back to the original providers.
- Data issues (errors, biases and gaps – taxonomic, geographic, temporal)
  - Users can identify biases and gaps and feed back that information
  - Need information from aggregators on taxonomic, geographic and temporal completeness
  - Data errors – need tools to identify errors and need to be able to flag them
- Accessibility issues
  - Missing information (fields empty etc.)
  - Information exists but cannot be obtained (have locality but no geocode)
  - Protected (sensitive species)
  - Functionality (aggregators – ease of extraction; can't select some specific fields)
- Use issues
  - Careless use of data obtained from databases
  - Poor knowledge of the taxonomic group (the scientific problem to address)
  - Poor understanding of modelling algorithms and approaches
- Georeferences are key – a lot of records don't have georeferences.
- GBIF should promote georeferencing
- Need to educate users – flags usually ignored – is there an aggregator responsibility here?
- Inconsistency analysis (state and sub-country level)
- Individual record identifiers are needed (e.g. a DOI with a unique identifier)
- Keep track of changes, errors
- Keep track of uses
- Allows linkages between data realms (e.g. specimen/genebank/etc.)
- Current IDs are not globally unique. Christian: this is going to happen, but when is uncertain (see the sketch at the end of this list)
- Uniqueness will only work for new/updated sets coming into GBIF
- What about the same specimen in different collections
- Users can help develop tools for identifying gaps and feeding information back
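On the identifier points above, a minimal sketch (assuming nothing about GBIF internals) of why the conventional institution/collection/catalogue "triplet" falls short of a true GUID:

```python
import uuid

def triplet_id(institution: str, collection: str, catalog_number: str) -> str:
    """The conventional Darwin Core 'triplet' identifier. It is only as
    unique as its parts: publishers can reuse codes, so collisions occur."""
    return f"{institution}:{collection}:{catalog_number}"

# A UUID is globally unique, but as noted above it can realistically only be
# minted for new/updated datasets entering the aggregator, and the same
# physical specimen held in two collections still ends up with two IDs.
print(triplet_id("USP", "SPF", "12345"))
print(str(uuid.uuid4()))
```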
Enrique: Need to find a way to handle absence data:
- How to collect it
- How to store it in databases
- How to deal with its quality
Need to learn how to deal with interactions and tie them to geography, modelling, etc.
Paul asked: What is the single greatest data quality issue? Enrique: Georeferencing. Nora: Taxonomic identification and georeferencing.
Allan showed how the Profile and his Profile Builder Tool would work. He used a fake ALA Use Case as an example and worked up fake ALA pages.
Quality is a multidimensional concept
Workshop: Need to examine the Controlled Vocabulary in detail vis-à-vis the W3C and ISO data quality vocabularies/standards – e.g. metadata.
Arthur: Be careful how the term “unique” is used – uniqueness cannot be qualified; something is either unique or it is not.
Arthur/Paul: Be careful not to confuse names with taxa.
Arthur: Suggested that in many cases where “Tools” is used “Test” may be more appropriate.
The Interest Group should concentrate on the Conceptual Level, not the Formalization or Implementation Levels.
Arthur suggested that although the Profile Builder (Framework worksheet) is used to produce a Profile, it may be better and more user-friendly to look at creating a Profile Form with free-text fields and drop-down lists.
Paul drew a brief diagram on the board to illustrate how it may look.
Workshop agreed that this would be worth doing, but should also keep the Profile Builder for now as some people may prefer to work that way.
Dmitry has added a template used for collecting Agrobiodiversity Use Cases that was modified by Donald Hobern – he has added it, along with the old one, to the TG3 area on the Google Drive.
Time was spent discussing problems, and on the two FFU groups pulling together the elements to make a profile on Day 2.
Dmitry, Arthur and Christian reported that they had had a Skype discussion with Lee and Miles in the morning
Based on Dmitry's notes and as modified later by Lee
Lee has had no feedback on whether he is on the right track or not. Knowing this would provide a practical way to create a new TDWG standard comprising a standard test suite with paired assertions. This simple strategy would be understood by the data users (the priority) and the IT people. Everyone needs to see a benefit, and it must also be easy to implement. An overly academic discussion, such as has occurred with EBVs, must be avoided.
The tools are far less important than the tests that are used to examine ‘data quality’. Tests will change less frequently than tools.
The spreadsheet prepared by Lee with 152 tests from a range of Data Publishers is the foundation. The columns need to be rationalised and made completely understandable by all. Allan's additional columns can make it tricky (e.g., technical specification vs simple specification?), but the terminology used must be agreed. The current columns can be classified as follows:
NB I have incorporated our later discussion on these points here for clarity and brevity.
- Number: For unique reference during discussion – would be dropped once an agreed GUID can be generated. Workshop: Agreed to this suggestion.
- ID: Currently only picked up from the ALA, but other Data Publishers probably have their own. We need to implement a GUID here so we can uniquely identify all tests and their assertions. Workshop: Christian: it is preferred to have IDs as somewhat meaningful words, not numbers, similar to Variable (which is to be renamed and is potentially not unique where there is more than one implementation). Variable will need to become TestID, which is then unique; in the process of getting there, Variable can be non-unique. OwnerID and CodeIDs can be moved to the right.
- Specification: This is the most important field, as it needs to be easily understood by the end users who will see the assertions and must understand how they were generated. This field also needs to be understood by developers, who must program the test and report any assertion for it when flagged. Workshop: Agreed it was the key field. Suggest that Specification contain human-friendly (but also developer-useful) sentences (e.g. “Collection date field is missing or null”). Dmitry: it could be split into a technical specification of OBJECT, ACTION and CONDITION (O: collection date, A: flag, C: if missing), but the Workshop prefers to use this logic for constructing the human-friendly sentences, e.g. a fixed word order and only active or passive voice (see the sketch after this list). Navigation from the old catalogue of GBIF tools can possibly be reused. Specifications should be consistent in whether they are positively or negatively oriented.
- Specification (technical): Allan added this, but I don't think it is necessary. Workshop: Agreed that this field could be deleted and the information included in Specification – it is the second part of the Specification (i.e. for developers).
- Data resolution: Record, dataset etc. Controlled vocabulary terms required and Framework can do this. Workshop: Agreed
- Output Type: Framework term and, again, must be a controlled vocabulary. Workshop: Agreed
- Darwin Core Class: Needed for context – and for reference – as I suspect MOST tests will apply to a Darwin Core term. Workshop: Agreed – noted that some may refer to the DwC extensions or to MCPD.
- Darwin Core Field (Element?): As (7). Required if applicable. Workshop: Change “Field” to “Term”; otherwise agreed.
- Severity: Is used by most Data Providers to inform code and user of the nature of the assertion – is it a warning or just an error or … controlled vocabulary terms required. Framework? Workshop: Agreed that this should be removed, as severity is 'use' specific; e.g. a missing date can be a problem for a phenology study but does not matter for a distribution map.
- Owner: Must be whoever implements it currently or who devised it – this is probably the Framework term Mechanism? Workshop: Agreed that it should be kept but renamed to 'Sources', which was thought to be similar to 'Author': a person or organisation responsible for implementation. 'Originator' was discussed and rejected.
- Specification Reference: OK if there is a SPECIFIC paper, code, or link that describes the test/assertion. In most cases, the 'Specification' column and Darwin Core field should suffice. Workshop: Suggested combining several of these fields and renaming to 'Reference' (including URL or DOI for bibliographic citation).
- Mechanism: As (10). Workshop: Said it should stay but be renamed to 'Example implementations'. An explainer is needed that only a name is expected.
- Mechanism reference: I don't think this is needed, at least for now. Workshop: Remove – included under Reference (11).
- Source Code: I don't think this is needed, at least for now. Workshop: Agreed that this should be a link out, not the code itself. Rename to 'Link to source code'. This should be the source code for the test, not for the tool! It was suggested that the source-code link may guide how we rate the human-readable specification; it may also help phrase the human-readable specs. Once we get there we don't necessarily need the source, but we may. The source code does not represent the reference documentation; it's an example.
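To picture how the rationalised columns and Dmitry's OBJECT/ACTION/CONDITION logic could fit together, here is a minimal sketch; the field names and the sentence template are assumptions for illustration, not an agreed schema.

```python
from dataclasses import dataclass

@dataclass
class TestSpec:
    test_id: str     # proposed unique, human-meaningful TestID (words, not a number)
    obj: str         # OBJECT, e.g. "collection date"
    action: str      # ACTION, e.g. "flag"
    condition: str   # CONDITION, e.g. "missing or null"
    dwc_term: str    # Darwin Core Term the test applies to
    resolution: str  # Data resolution: "record" or "dataset" (controlled vocabulary)
    source: str      # who devised/implements it (the renamed 'Owner' column)

    def specification(self) -> str:
        """Build the human-friendly Specification sentence using a fixed
        word order and active voice, so all tests read consistently."""
        return f"{self.action.capitalize()} the {self.resolution} if {self.obj} is {self.condition}."

t = TestSpec("COLLECTION_DATE_MISSING", "collection date", "flag",
             "missing or null", "dwc:eventDate", "record", "ALA")
print(t.specification())  # Flag the record if collection date is missing or null.
```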
Workshop: Agreed that a standardised set of tests is the way to go. Lee: If yes, what is the implementation? A simpler table in both dimensions – rationalise the columns and rows (from 152 to fewer than 100 because of overlap). Workshop: Agreed with this.
Lee seems happy with the resources pages; the difficulty is how to get people to contribute. This is easy on GitHub, but updating is not so easy to set up. As noted, the TESTS/ASSERTIONS are more fundamental than the tool packages that use them.
Lee: The ideal would be to create a set of open, free web services that support the standard set of tests and their assertions. It would then be VERY easy for anyone to implement them (e.g., in workflows) and understand them. This is what the ALA does; for example, the ALA's Sandbox uses web services that define the tests/assertions – http://biocache.ala.org.au/ws/assertions/codes.
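A minimal sketch of a client for such a service follows; the endpoint is the one Lee cited, but the JSON field names ('name', 'description') are assumptions about the response shape, not a documented contract.

```python
import requests

# Fetch the ALA's published list of assertion codes (endpoint cited above).
resp = requests.get("http://biocache.ala.org.au/ws/assertions/codes", timeout=30)
resp.raise_for_status()

# The field names used here are assumed for illustration; inspect the payload.
for assertion in resp.json():
    print(assertion.get("name"), "-", assertion.get("description"))
```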
Lee stated that if the Workshop agreed with his approach (tick), the rows (tests/assertions) should be rationalised into a standard, very well documented, non-overlapping set. For this he would like to see maybe 3–6 people working on it; with ~5 people he figures it would take each person maybe 3–5 hours' work. The main work, other than rationalisation, is ensuring that the 'Specification' column is written so that a developer can program it and an end user can fully understand it.
Paul Morris: suggested that services may not be stable, as they depend upon people to maintain them, and they come and go. If tests rely on web services, they will be dependent on those services being available and continuing to be maintained.
Workshop: There was general consensus on the user-oriented and Data Publisher-oriented column types (really no discussion needed here). Agreed that fewer columns are best.
Paul Morris: Looking at the content, is VertNet included? Suggested that John Wieczorek be contacted, as he would have all the relevant tests and assertions.
- Paul Morris volunteered for Lee's small subgroup.
- It was suggested that John Wieczorek would probably agree to be on the group as well
- Christian Gendreau volunteered to be on the group.
- Later in the meeting Alex Thompson (iDigBio) agreed to be in the subgroup.
- It was later suggested that Rob Guralnick might be a good person to add, as a developer for VertNet and iDigBio
Workshop: Discussed whether the Tools list now on GBIF should be moved to GitHub or not.
Dmitry: stated that Tim Robertson at GBIF was happy for the tools list to be moved but that there needed to be some control/vetting.
Christian: suggested that GitHub had some control by page.
Paul: suggested that people could fork it and create their own pages, which could lead to confusion.
Paul: mentioned that all the tools should be included in the Biodiversity Catalogue (https://www.biodiversitycatalogue.org/) and not GRBio as he originally suggested.
Dmitry: stated that GBIF needs to track the use of tools – this will help the Secretariat determine how many resources they are prepared to commit to the task. Currently Lee is the only person outside GBIF with access. The list of tools is currently very difficult to search and needs a better interface.
Miles: has added a couple of use cases into the Google Docs
Dmitry: presented a Google Form idea; Miles is OK with anything that is easier. Dmitry later made a compromise form between Miles's spreadsheet and the Agrobiodiversity form, which was collectively edited at the workshop, and the link was shared with Lee and Miles. It was also sent to Donald Hobern and Tim Robertson (GBIF) for comment.
An email from Stan Blum informed us that he will give the DQIG Coordinators and Task Group Leaders access to the TDWG GitHub so that we can place all relevant documents etc. there.
NB: TDWG has been asked how they want Interest Groups and Task Groups to carry out discussions. Currently we use the GBIF Community Site, which is not the most suitable.
Paul: mentioned that TDWG would like us to use the TDWG Mailing list rather than Interest Groups maintaining their own. ASK TDWG how they currently support use of email for Interest Groups and Task Groups
Workshop: A lot of discussion ensued on particular parts of the Profile. It was agreed that a joint profile be prepared for the two Task Groups where they coalesce, for example Location, Date and parts of Taxonomy.
Dmitry: Noted that no matter how good the profile is, the user will still need to do some tweaking of the data.
Workshop: Some discussion revolved around Precision in the Controlled Vocabulary – it was agreed that nothing (or very little) can be inferred from coordinate precision with any degree of confidence.
Arthur: The example profile shown by Allan mentioned Precision >3, where the precision shown was to 15 decimal places (approx. 0.0001 mm) – we MUST NOT continue to use coordinate precision as a measure of anything, especially in geocoding!
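The underlying arithmetic, as a quick sketch: one degree of latitude is roughly 111 km, so each extra decimal place shrinks the implied ground distance tenfold, reaching physically meaningless scales well before 15 places.

```python
# Ground distance represented by the d-th decimal place of a latitude degree.
METERS_PER_DEGREE_LAT = 111_320  # approximate; varies slightly with latitude

for d in (3, 6, 10, 15):
    print(f"{d:2d} decimal places ~ {METERS_PER_DEGREE_LAT * 10**-d:.3e} m")
# 3 -> ~1.1e+02 m; 6 -> ~1.1e-01 m; 10 -> ~1.1e-05 m; 15 -> ~1.1e-10 m:
# digits stored in a coordinate say nothing about real measurement precision.
```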
Discussion of the Controlled Vocabulary was moved to Day 3.
Arthur, Dmitry and Christian held another early morning Skype session with Lee Belbin.
Paul Morris gave a presentation on Kurator
- Showed a workflow for Kurator-Akka – see the spreadsheet
- Kurator prepares data quality reports for curators
Christian Gendreau gave a presentation on DwC Archive Validator (DwCA-Validator)
A lot of editing was carried out by the group on the Use Case gathering form developed by Miles and previously modified by Dmitry. This is lodged on the Google Drive and has been locked for editing. Any comments or suggestions should be sent to Miles or Dmitry.
Antonio and Arthur: participated in an Adobe Connect discussion with Mathew Collins and Alex Thompson from iDigBio to discuss a proposed Data Quality Symposium at TDWG2016 and any possible associated training courses.
Alex: mentioned that iDigBio had an interest in getting Unified Flag Definitions.
Antonio mentioned that the easiest way to progress submission of a proposal for a Data Quality Symposium was to modify the submissions made in 2013 and 2014, both of which engendered solid responses.
The hardest part is inviting speakers etc. for the Symposium and reviewing the abstracts. Alex and Mathew are willing to help; they can also record the Symposium for later viewing. Their own contributions would be:
- Data Quality flags
- Standardisation of data quality flags
- Mechanical challenge of dealing with millions of records (maybe in separate symposium)
They were involved in the training course before TDWG 2015 in Kenya. They are hoping to run a course on Data Carpentry (Standards/Metadata/Quality etc.) associated with TDWG2016. If we are to extend that in any way, the following needs to be considered:
- Who would be financing?
- Logistics etc. (contact Eric Mata). Rob Guralnick and Gail Kampmeier would also likely be interested – Rob is Chairman of JRS, and they may have interests as they usually sponsor some of the African participants
- Venue booking – paying for hotel rooms for an extra couple of days, catering, paying participants etc.
They asked if we were interested in preparing a curriculum for our part to fit within their carpentry course. They are keen to be involved in further discussion.
Antonio: noted that the Humboldt Institute has lots of good data quality material prepared – it may be good to involve them as well. He mentioned that TDWG had no funds but may be able to extend bookings for rooms etc.
Mathew mentioned that they will be able to help people who weren't able to come by videoing the Symposium.
- Prepare TDWG Submission and circulate for comment.
- Organise an exchange of emails between Lee and Alex
- Links to Community Group discussions
Returning to the workshop and reporting back: it was suggested that we might try to organise a meeting of the DQIG and Task Groups on the Sunday before TDWG as a warm-up for the main Symposium, and also as an opportunity for a meeting of the Task Groups.
Suggested 2 x 1.5-hour public slots for presentations, and 1 x 1.5-hour slot for discussion and an IG workshop, open to all.
Dmitry: asked to avoid overlap with DNA evidence.
Paul: Suggested that TG2 should possibly have a joint session with the Annotations Interest Group.
Arthur and Antonio will circulate the DQ Symposium draft. A possible iDigBio training course on data carpentry was discussed, with questions on logistics, who pays, etc., and courses beforehand; a Sunday warm-up (small group) is also planned.
GBIF could possibly encourage & support SA and Caribbean countries to attend.
Earlier in the meeting, Christian and Arthur made many recommendations regarding changes to the Controlled Vocabulary in the Google Docs. These were later circulated to Lee, Miles and others for comment before being discussed by the group.
See Google Docs for a modified final document. The document still has to be checked against the W3C Vocabulary and relevant Metadata document standards.
Dmitry has now frozen editing rights to the form and created a spreadsheet from it. Some definitions need to be added; it should then be circulated to the GBIF Community Site (the Agrobiodiversity and Distribution Modelling pages), using iDigBio as a test.
- Dmitry: mentioned that in 2016 GBIF will have a FFU group on Invasive Species (by the end of the month) and will also collect Use Cases from that community.
- Other FFU groups that may come along later include DNA Evidence for Biodiversity (no money for this group). Other suggestions included possibly one on Marine Biodiversity and one on Citizen Science.
- Dmitry: will also try to capture Use Cases for EBVs.
- Explore GitHub
- Ontology Representation of Vocabulary
- Reduce the number of tests
- Explore use of the TDWG GitHub and services for discussions
- Plan for Costa Rica
- Finish the Profiles that we started and make them available for comment (NB Allan may need some help with definitions and understanding of the Taxonomy fields)
- Publish the paper
- Prepare more user-friendly tool to capture profiles (Profile Form)
- Finish off the Vocabulary (Allan)
- Antonio or Arthur to make sure it links to W3C – Work through Bob Morris
Christian
- Report on Data Publishing – issues and feedback (mainly GBIF)