
Annual Report TDWG Data Quality Interest Group for 2017

Arthur Chapman edited this page Aug 17, 2018 · 6 revisions

Background

The TDWG-GBIF Data Quality Interest Group (DQIG) was formally approved by the TDWG Executive in October 2014. Arthur Chapman and Antonio Saraiva were elected as co-coordinators of the group.

Seminars and Interest Group meetings have since been held at TDWG2014 and TDWG2016, as well as at a number of between-TDWG meetings.

Three Task Groups have been established, viz.

  • TG1 – A Framework for Data Quality – leader: Allan Koch Veiga
  • TG2 – Data Quality Tests and Assertions – leader: Lee Belbin
  • TG3 – Use Case Library – leader: Miles Nicholls

Two more Task Groups are being considered, viz.

  • Proposed TG4 – Vocabularies – Leader Paula Zermoglio
  • Proposed TG5 – Invasive Organism Information – Leader Quentin Groom

Discussions with the broad membership are held on the Data Quality GitHub repository (bdq), which has been established on the TDWG GitHub site. This continues to be a work in progress.

The Interest Group is currently writing a paper for the Biodiversity Data Journal: Improving Biodiversity Data Quality through a Fitness for Use Framework. We hope to have the paper completed and submitted early in 2018.

Standouts

Significant highlights for the year were:

  • The Framework paper that has been in the pipeline for a while has now been published

    Veiga, A.K. et al., A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE 12 (6): https://doi.org/10.1371/journal.pone.0178731

  • Prototype Framework Lite has been built on the GitHub

  • The ALA (and its international deployments), GBIF and iDigBio have fully committed to implementing the core suite of tests-assertions. In addition, this project has instigated an alignment of basic functionality (e.g. standard queries and exports) between these agencies, as it is recognized that differences between agencies supporting the Darwin Core standard are frustrating for users. For example, Darwin Core exports from GBIF and the ALA are radically different, which hampers amalgamation. Merging should not even be necessary, as GBIF should contain all ALA records and the ALA should contain all GBIF records for its region of interest.

  • Funding was obtained from the TDWG Community Support Fund for the participation of Renato de Giovanni at the meeting in Canberra (May 2017). We thank the TDWG Executive for this support and appreciate Renato’s input to the workshop.

Meetings and Liaisons

In the past twelve months, the DQIG has held many on-line meetings using Zoom, although time zones don’t always allow these to be as efficient as they might otherwise be.

Two physical meetings were held during the year.

  • In May 2017, a meeting was held in Canberra, Australia in conjunction with a Symposium “Greater impact through environmental infrastructure” held jointly between the Atlas of Living Australia (ALA) and the National Collaborative Research Infrastructure Strategy (NCRIS).

    This meeting was funded by the Atlas of Living Australia (ALA), GBIF and TERN and was hosted by the Atlas of Living Australia and CSIRO. The TDWG Community Support Fund funded the participation of Renato de Giovanni from Brazil. Thirteen people attended in person with another five participating remotely. A copy of the report can be found on the GitHub.

  • In June 2017, a joint DQIG and Annotation Interest Group meeting was held in conjunction with the SPNHC2017 meeting in Denver, Colorado. Funding for key participants was provided by the Kurator project and by iDigBio. A copy of the report can be found on the Annotations GitHub.

  • Since the last report, a full day meeting was held on the Sunday preceding TDWG2016 in La Fortuna de San Carlos, Costa Rica. A report on that and other meetings held at TDWG2016 can be found on the GitHub at https://github.com/tdwg/bdq/wiki/Summary-of-DQIG-meetings-during-TDWG2016,-Costa-Rica-(Dec-2016).

  • Throughout the year there has been extensive interchange and liaison with key data aggregators and others, including GBIF, ALA, iDigBio, VertNet, OBIS and Kurator.

  • A fourth meeting is planned for the Sunday before TDWG2017 in Ottawa, Canada.

Seminars

  • In conjunction with the meeting in Canberra, contributions were made to the 3-day symposium held at the Australian National Library (Greater impact through environmental infrastructure). As part of the Symposium, a session was held on Data Quality (Developing a Common Framework to Improve Fitness for Use of Biodiversity Data). Lee Belbin, Arthur Chapman and John Wieczorek participated in a panel discussion after John Wieczorek gave a provocative talk to get the ball rolling.
  • A Data Quality Symposium was held at TDWG2016 at which twelve talks were given.
  • Lee Belbin and Arthur Chapman gave a Webinar in September 2017 as part of the Darwin Core Webinar Series - Chapter 8: A Bite from the Core - testing for data quality.

Task Group 1 – Framework for Data Quality

Task Group 1 is led by Allan Koch Veiga from the University of São Paulo, Brazil.

Aims:

To define, organize and provide the necessary components and resources to support DQ Assessment and DQ Management in a consistent and uniform way in the Biodiversity Informatics context.

Achievements 2016-2017:

  • A revised version of the thesis "A conceptual framework on biodiversity data quality", which presents a formalized and detailed description of a conceptual framework for BDQ, used in TG1, was published in March of 2017 (available at http://www.teses.usp.br/teses/disponiveis/3/3141/tde-17032017-085248/pt-br.php)

  • In June 2017, the paper "A Conceptual Framework for Quality Assessment and Management of Biodiversity Data", which presents a higher-level description of the conceptual framework used in TG1, was published in PLOS ONE (available at https://doi.org/10.1371/journal.pone.0178731).

  • In July 2017, a paper "Biodiversity Data Quality Profiling: A practical guideline" was presented at the 8th Workshop of Applied Computing for the Management of the Environment and Natural Resources (WCAMA 2017) in São Paulo, SP, Brazil.

  • In 2017, a draft website for TG1 was developed and published (https://tdwg.github.io/bdq/tg1/site/), containing:

    • A draft page for "Quick Start" with the framework, which presents a simplified way to use the framework from the point of view of four actors: DQ Profiler, Developers, Data Users and Data Holders.
    • A page for discussing a draft "Glossary and Controlled Vocabularies" related to the conceptual framework.
    • A page of "Resources", where presentations and publications are listed.
    • Two pages which still remain empty at this stage: (FAQs) and (Tools).

    The website was built using static HTML+CSS+JS files and includes a feature that enables annotations on all pages to foster collaboration.

  • A prototype of the Fitness for Use Backbone (FFUB) was developed to provide a practical way to use the conceptual framework for describing DQ Needs, DQ Solutions and DQ Reports.

Plans for 2017-2018:

As the next step we plan to advance the FFUB to create a stable REST API for helping the community to create, share and reuse DQ Profiles, DQ Solutions and DQ Reports about the available biodiversity distributed data, using a common format based on the conceptual framework.

We plan to provide reading and interactive materials to support the use of the TG1 conceptual framework and the FFUB web services. This includes improving the current TG1 website by:

  • Completing and improving the "Quick Start" page (lite version of the conceptual framework);
  • Updating the "Resources" page;
  • Updating the "Glossary and Controlled Vocabularies" page;
  • Creating the "FAQs" page;
  • Creating the "Tools" page.

Task Group 2 – Data Quality Tests and Assertions

Task Group 2 is led by Lee Belbin who is the Science Advisor to The Atlas of Living Australia and who works from Tasmania, Australia.

NB. Previously known as “Tools, Services and Workflows” - the name was changed with the approval of the Executive in early 2017.

Achievements 2016-2017:

  • May 19-20: workshop on DQ held at CSIRO, Canberra. Agreement in principle to the core suite of tests and assertions, their parameters and the strategy for a test data suite. Alex Thompson (iDigBio) and Paul Morris (Kurator) have offered to generate the test data suite and the generic code for each test.
  • August 2017: Completed the TG2 components of the manuscript on DQ, although further editing will be required following the discussions at the Ottawa meeting.
  • August 2017: I identified ambiguities in the current test-assertion parameter "Severity" (error or warning). For example, while a value of dwc:day=32 is technically an "error", is it severe enough to flag the record for removal? Severity seems to depend on the user and on 'fitness for use'. Accordingly, I have replaced the "Severity" parameter and the "Output Type" parameter with a "Warning Type" parameter that takes the values
    • Ambiguous
    • Amended
    • Incomplete
    • Inconsistent
    • Invalid
    • Unlikely
  • August 2017: Several new tests have been added by various TG2 members since the May meeting. These appear to be a mixture of 'core' and more esoteric tests when judged against the criteria for the addition of tests, viz.
    • Informative
    • Easy to implement
    • Mandatory for enhancements
    • In use with decent % of hits
  • Accordingly, a posting was sent to the TG2 members asking them to vote on all the tests and to comment on the changes proposed. An evaluation of the scoring and comments is underway, with final recommendations expected for the Ottawa meeting.
  • It is hoped that at least a draft suite of validation data and generic code for each test will be completed by October 30 following the TDWG2017 meeting.
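
As an illustration of the "Warning Type" parameter described above, the following minimal Python sketch shows how a single test might report a warning type in place of a severity. It is not the TG2 reference code: the `Assertion` record, the `check_day` function and the mapping of an empty value to "Incomplete" and an out-of-range value to "Invalid" are assumptions made for this example only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WarningType(Enum):
    """The six values listed above for the "Warning Type" parameter."""
    AMBIGUOUS = "Ambiguous"
    AMENDED = "Amended"
    INCOMPLETE = "Incomplete"
    INCONSISTENT = "Inconsistent"
    INVALID = "Invalid"
    UNLIKELY = "Unlikely"


@dataclass
class Assertion:
    """Hypothetical result record; the field names are illustrative only."""
    term: str
    value: Optional[str]
    warning_type: Optional[WarningType]  # None means no warning was raised


def check_day(value: Optional[str]) -> Assertion:
    """Hypothetical test: is dwc:day present and an integer from 1 to 31?

    The choice of warning type for each outcome is an assumption of this
    sketch, not an agreed TG2 definition.
    """
    term = "dwc:day"
    if value is None or value.strip() == "":
        return Assertion(term, value, WarningType.INCOMPLETE)
    try:
        day = int(value.strip())
    except ValueError:
        return Assertion(term, value, WarningType.INVALID)
    if not 1 <= day <= 31:
        return Assertion(term, value, WarningType.INVALID)
    return Assertion(term, value, None)


if __name__ == "__main__":
    for raw in ["15", "32", "", "abc"]:
        print(repr(raw), check_day(raw).warning_type)
```

In this sketch the test only states what kind of problem was found. Whether a record with dwc:day=32 is kept or removed is left to the data user, in line with the observation that severity depends on fitness for use.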

Plans for 2017-2018:

  • David Martin (ALA) will attend the DQIG/TG meeting on behalf of Lee Belbin (TG2 Chair). We anticipate clearing up any loose ends at the Sunday workshop and submitting the Core Tests-Assertions as a TDWG standard in early 2018. David Martin was the overall system architect of the ALA and is now responsible for the implementation of the ALA platform as GBIF nodes at international sites. This experience will be useful for a pragmatic evaluation and understanding of the core tests-assertions.

Task Group 3 – Use Case Library

Task Group 3 is led by Miles Nicholls from The Atlas of Living Australia, Canberra, Australia.

Achievements 2016-2017:

A project was carried out by Emily Rees to collect, database and analyse data quality use cases:

  • Use case library of 28 use cases assembled
  • Key data quality fields were identified
  • Difficulties in collecting data quality use case data highlighted (e.g. manual process, significant effort for every use case)
  • Results written up as part of the Biodiversity Data Journal paper being written for the IG

Plans for 2017-2018:

  • Design more efficient mechanisms for collecting data quality use case metrics
  • Investigate options for use case collection in infrastructures (e.g. GBIF, ALA, iDigBio)
  • Investigate standard use cases as data quality profiles
  • Trial implementation of data quality profiles in infrastructures

Proposed Task Group 4 – Vocabularies

Proposed Task Group 4 is to be led by Paula Zermoglio from the University of Buenos Aires, Buenos Aires, Argentina.

Aims:

To create a framework within which to build biodiversity data vocabularies, particularly by developing a standard format for building TDWG vocabularies concerning the values used under the Darwin Core terms.

Achievements 2016-2017:

During 2017 discussions concerning this topic were held remotely among members of the DQIG and others from the broader community, as well as in person during the DQIG meeting in Canberra, 19-20 May 2017. As a result, a charter for the creation of a new Task Group to tackle this issue is currently in review by the members of the DQIG.

Plans for 2017-2018:

We plan to submit the charter before the end of the year (2017) and hope to begin the Task Group activities in January 2018. During 2018 our main goals will be to:

  • prepare a scoping document,
  • develop a standard format for building the vocabularies,
  • build an exemplary vocabulary using such format, and
  • develop a common repository for vocabularies.

Proposed Task Group 5 – Invasive Organism Information

Proposed Task Group 5 is to be led by Quentin Groom from the Botanic Garden, Meise, Belgium.

NB: There is ongoing discussion as to where this proposed TG fits – under the DQIG or under the Species Information Interest Group. That should be sorted out at TDWG2017.

Aims:

  • To improve Darwin Core terms for use in invasion biology, particularly by suggesting suitable controlled vocabularies and additional terms.

Achievements 2016-2017:

  • Writing a charter and assembling a team to work on the task group.

Plans for 2017-2018:

  • Write proposals for controlled vocabularies for dwc:occurrenceStatus and dwc:establishmentMeans.
  • Get Task Group approved by TDWG Exec.
  • Present preliminary proposals at TDWG general meeting.

DQIG Plans for 2017-2018

Meetings at TDWG2017 will set the workplan for the next twelve months. We hope to use the Symposium and working meeting at TDWG to identify and recruit new people to help progress the work of the group over the next twelve months. It is planned that the next twelve months will be a major implementation phase, with continual liaison with key data aggregators (ALA, GBIF, iDigBio, SIBBR, OBIS), data custodians and data users. Tentative plans that we will take to the meeting at TDWG2017 include:

  • Complete and submit paper to Biodiversity Data Journal
  • Hold a meeting (ideally face-to-face) of core TG2 members to finalise the Tests and Assertions
  • Test and refine Tests and Assertions
  • Finalise Code and the Test Datasets for the Tests and Assertions
  • Submit Tests and Assertions as a TDWG Standard or equivalent
  • Liaise with Annotations Interest Group on standardising Assertions as Annotations
  • Continue to develop the Fitness for Use Backbone (FFUB)
  • Establish a more mechanical method for incorporating Use Case Studies
  • Increase participation in developing more Use Cases
  • Establish a Task Group and prepare a scoping document on Vocabularies
  • Possibly establish a Task Group on Invasive Species Data
  • Hold working meeting in first half of 2018
  • Continue liaison with ALA, GBIF, iDigBio and others on harmonizing/aligning Data Quality procedures.
  • Encourage uptake of standard tests and assertions
  • Outreach and dissemination of information

Arthur D. Chapman and Antonio Mauro Saraiva (co-convenors)
