
Summary of DQIG meetings held during TDWG2018, Dunedin, New Zealand (Aug 2018)

Arthur Chapman edited this page Sep 3, 2018 · 8 revisions

TDWG (Biodiversity Information Standards) and SPNHC (Society for the Preservation of Natural History Collections) held joint meetings in Dunedin from 26-31 August 2018. The TDWG Data Quality Interest Group had a very busy and successful series of meetings and symposia - some as TDWG meetings and some as part of the SPNHC Agenda.

The DQIG met on Sunday 26 August. At least 23 people attended various parts of the meeting during the day, and there were 7 apologies. Those present contributed willingly to the deliberations of the Groups.

TG1 – Framework for Data Quality

Report: Allan reported on achievements with the Framework and noted the following points and tasks:

  • the need to update the TG1 GitHub, covering:
    1. How to define a Profile
    2. How to describe DQ Solutions
    3. How to generate DQ Reports
    4. Update documents and references (presentations, documents, abstracts) in the GitHub Repository
  • that there are ongoing projects on implementation with Kurator and RCPol
  • the need to formalize the concepts (e.g. for the negative tests for which Paul Morris has a spec)
  • the need to examine the steps required to prepare a Standard; Paul Morris has an RDF implementation that may be suitable for a Technical Specification
  • work on updating the Glossary/Vocabulary, with the aim of preparing a DQ Vocabulary as a Data Standard
  • preparation of the final report for the TDWG Executive

Schedule

  • 2018 Dec – Include DQ Problems in the framework
  • 2019 Jun – TG1 Website Guideline
  • 2019 Aug – Put together Framework Documents
  • 2019 Sep – Deliver final report

TG2 – Tools, Services and Workflows

Report: Lee reported on the extensive discussions held on the Tests and Assertions during the year, and especially on the intense 4 days of work by a small group of dedicated participants in Gainesville in January.

One ongoing issue reported was that many of the tests (about 29) need Vocabularies. These vocabularies are also needed for the Darwin Core.

Lee reported the following about the tests:

  • tests are named using the pattern TG2_OUTPUT TYPE_TYPE OF TEST_TERM-ACTION (e.g., TG2_VALIDATION_DAY_NOTSTANDARD)
  • all tests are in the GitHub Issues at http://github.com/tdwg/bdq/issues
  • validation for all tests is at the single record level; the results can be accumulated across a data set with a measure of compliant/non-compliant
  • core tests are those of high priority related to space, time, name (and other)
  • supplemental tests, many of which are domain specific, are not being rigorously defined at this point in time
  • the challenge of how to report the results to the user remains
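
As an illustration of single-record validation with results accumulated across a data set, the following is a minimal sketch only. The rule used here (a day value must be a whole number from 1 to 31) is an assumption made for the example and is not the agreed specification of TG2_VALIDATION_DAY_NOTSTANDARD; the function names and the sample records are likewise hypothetical.

```python
import re
from collections import Counter


def validation_day_notstandard(record):
    """Hypothetical single-record check (assumed rule, not the TG2 spec):
    'day' must be one or two ASCII digits with a value from 1 to 31."""
    value = str(record.get("day", "")).strip()
    if re.fullmatch(r"[0-9]{1,2}", value) and 1 <= int(value) <= 31:
        return "compliant"
    return "non-compliant"


def accumulate(records, test):
    """Apply a single-record test to every record and count each outcome."""
    return Counter(test(record) for record in records)


# Made-up records for illustration only.
records = [{"day": "12"}, {"day": "31"}, {"day": "0"}, {"day": "3rd"}, {}]
summary = accumulate(records, validation_day_notstandard)
print(dict(summary))
```

The sketch treats a missing value as non-compliant for simplicity; how empty or unusable values should be reported is one of the questions the tests themselves would need to settle.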

Tests with outstanding issues (about 15) were discussed during the meeting, some of them vigorously. This is work that can only be done face to face, and the importance of the Sunday meetings prior to the TDWG Conferences for making real progress was noted. The value of additional meetings in between TDWG meetings was also noted.

During the meetings at TDWG/SPNHC, all tests were finalized, needing only minor corrections and consistency checks, which continued throughout the week and during the days following.

Tasks still needed

  • Finalise and document all tests

    1. Once finalised, pull tests off GitHub Issues into a csv file
    2. Load up into an html/RDF document in preparation for a standard (Kurator has some scripts - will work with iDigBio and Lee)
  • Extract terms of value from the tests as a first step in preparation of a Vocabulary.

  • Reevaluate criteria for inclusion and exclusion of CORE tests.

  • Coding is well advanced. A call for people to help with coding identified a small group that will continue with this activity. Kurator/iDigBio to coordinate.

  • Prepare test dataset (this will be a real dataset with synthetic modifications). An explanation can be found at https://github.com/tdwg/bdq/wiki/TG2---Proposal-for-identifying-synthetic-data

  • Explore (especially in conjunction with GBIF, iDigBio and ALA) how best to handle Annotations. There was some discussion on using a centralized (in the cloud) model versus a semi-distributed model and the possibility of having a centralized test Sandbox. An issue (#154) has been raised on GitHub.

  • Explore what is needed to advance the Tests toward a TDWG Standard. It was thought that either a Best Current Practice or an Applicability Statement would be the most appropriate standard type. It was noted, however, that the Code needs to be written before development of a Standard can be considered, so that the assumptions made can be exercised in real-world situations.
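
The task listed above of pulling finalised tests off GitHub Issues into a csv file could be sketched roughly as follows. This is not the Kurator or iDigBio script mentioned above; it assumes the issues have already been fetched as a list of dictionaries with the `number`, `title`, `labels` and `body` fields that the GitHub REST API returns for an issue. The function name, the chosen columns and the sample issue (including its number and labels) are hypothetical.

```python
import csv
import io


def issues_to_csv(issues):
    """Write one CSV row per issue: number, title, labels, body."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["number", "title", "labels", "body"])
    for issue in issues:
        # Each label is an object with a 'name' field; join names with '|'.
        labels = "|".join(label["name"] for label in issue.get("labels", []))
        writer.writerow(
            [issue["number"], issue["title"], labels, issue.get("body") or ""]
        )
    return buffer.getvalue()


# Made-up issue for illustration; not a real entry from the repository.
sample_issues = [
    {
        "number": 1,
        "title": "TG2_VALIDATION_DAY_NOTSTANDARD",
        "labels": [{"name": "TIME"}, {"name": "CORE"}],
        "body": "Example body, with a comma",
    },
]
csv_text = issues_to_csv(sample_issues)
print(csv_text)
```

Using the `csv` module rather than joining strings by hand keeps commas and line breaks inside issue bodies safely quoted.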

Work on the tests continued throughout the week, with several lunch and other ad-hoc meetings, which advanced the work on the tests considerably.

TG3 – Use Case libraries

Little progress had been made over the past twelve months. No new use cases were submitted. As mentioned in last year's report, there is still a human bottleneck in transferring the data from the story book to the Spreadsheet.

The Task Group had achieved its objectives, and will be preparing a report for the executive over the next twelve months that will recommend winding up the Task Group. A detailed summary of the project and results will be included in the paper being prepared for BISS (see below).

TG4 – Best Practice for Development of Vocabularies of Value

Status: It was noted that the Task Group was formed earlier in 2018, and that work was progressing on the Scoping Document.

Paula held discussions throughout the week with people from both the SPNHC and TDWG communities and has received feedback on the draft Scoping Document. The Scoping Document is open for discussion until 15 September.

Some groups discussed what a Vocabulary should include and ways to progress the Vocabularies of Values Best Current Practices Document.

The current list of Vocabularies was advertised and people invited to add vocabularies to it. It was noted that a new column needed to be added to the Spreadsheet to record the native language(s) of the vocabularies included.

Steve Baskauf clarified that the TDWG Standards Documentation Standard (SDS; https://www.tdwg.org/standards/sds/) described how to create Data standards (for Vocabularies) as well as Best Current Practices documents. The Vocabularies of Values Best Current Practices Document must conform with that document, just as any vocabularies of values must also conform to the specifications set out in the SDS. The DQIG believes that a Vocabularies of Values Best Current Practices document is needed to provide more specific and common guidance on vocabularies of values construction and maintenance - for example, guidance on the type of vocabulary to use (Thesaurus, Vocabulary, Dictionary, Ontology, etc.), and how to deal with synonymy, multiple languages, etc.

There was some discussion on the idea that Vocabularies that followed the Vocabularies Best Practice could somehow be "TDWG Accredited". This would need much more discussion with the TAG and TDWG Executive further down the line.

Next steps:

  • Finalise the Scoping Document
  • Continue with the other issues listed in the Charter. It was noted that the timeline had slipped for a number of reasons, not the least being the delay in getting the Task Group approved by the Executive.
  • Follow up on the recommendations arising from the Scoping Document.

Survey on Feedback to Museums

Deb Paul reported on a Survey that was conducted under the auspices of SPNHC on how collections institutions handle feedback on Data Quality. She reported that they had received 104 responses and gave a brief summary to the Sunday meeting. This was followed up by a more detailed report and discussion in a workshop later in the week.

A full report from the survey will be published. This has great implications for the DQIG, and especially for Task Group 2, on how the Annotations are prepared and how feedback is given to the data custodians. At several sessions there was vigorous discussion of these issues.

Article in Biodiversity Information Science and Standards (BISS)

The paper “Improving Biodiversity Data Quality through a Fitness for Use Framework” is well behind schedule.

The big hold-up is the Case study using MCZ data and the TIME tests, but this is progressing.

Next Steps

  • Arthur to do a major edit - especially in conjunction with Allan and Antonio to shorten the section on the Framework
  • Lee to revisit the Tests and Assertions portion following meetings at TDWG2018
  • Paul and John to write up the Case Study
  • Paula to revisit the Vocabularies section in light of the Scoping Document
  • Arthur to examine the list of authors with a view to removing those that have not contributed
  • Arthur to liaise with Gail Kampmeier with the aim of moving the paper from the Biodiversity Data Journal to BISS
  • Get someone to do a final edit and submit (preferably by end of year)

Around the Rooms (DQ during the SPNHC and TDWG Meetings)

Throughout the week of Symposia and Workshops, there was barely a session in which Data Quality was not a major topic. Conveners of the Interest Group and Task Group leaders were active in discussions, including on Data Quality, in sessions involving Traits, Machine recorded data, Citizen Science, Invasive Species, Interactions, Feedback to the data custodians and others. In many of these sessions there was vigorous debate, and the DQIG people recorded the issues raised for further discussion and action.

Action Items that arose

  • The need for a test on TaxonRank was raised in one session. We currently don't have a test for this, but may well now include one.
  • It was noted that the Citizen Science IG, in discussions on PPSR-Core, had established a working group to (among other things):
    • "Understand and develop a common vocabulary for discussing the range of data quality practices in Citizen Science"
    • It was agreed that the DQIG should liaise with that group with the aim of preparing one common vocabulary if possible.

Possible Future Standards

Throughout the week, the possible standards arising from the work of the DQIG and the Task Groups were thought to include:

  • Framework - Technical Specification (RDF)
  • Data Quality Vocabulary arising from the Framework - Data Standard
  • Tests and Assertions - Best Current Practices Document OR Applicability Statement to the Framework TS (or maybe to Darwin Core)
  • Vocabulary of Terms used in the Tests and Assertions with links to the Framework Vocabulary - Data Standard
  • Vocabularies of Values - Best Current Practices Document

Arthur D. Chapman (Co-Convenor DQIG, September 2018).
