Meeting of TDWG GBIF Data Quality Interest Group in conjunction with the GBIF Working Group on Fitness for Use for Invasive Alien Species, Monash, Victoria, Australia (Oct 2016)
- Arthur Chapman (co-Convenor);
- Lee Belbin (Leader DQIG-TG2);
- Miles Nicholls (Leader DQIG-TG3) (13th only);
- Antonio Saraiva (Co-convenor) (on-line 13th part);
- Allan Koch Veiga (Leader DQIG-TG1) (on-line 13th part)
- Melodie McGeoch (Chairman - Australia)
- Gregory Ruiz (USA)
- John Wilson (South Africa)
- Quentin Groom (Belgium)
- Veros Petrosyan (Russia)
- Shyama Pagad (New Zealand)
- Dmitry Schigel
Following several days of meetings of the GBIF WG on Invasive Alien Species, the TDWG Data Quality Interest Group (Arthur and Lee) met with the group on the afternoon of the 12th October. The members of the Working Group presented a brief summary of their work to date and raised a number of issues where they believed input from the Data Quality Interest Group may be of value. These included issues such as:
- Native versus non-native attributes, such as:
- Invasive status
- Vectors
- Status
- Impact
- Traits
- Enhanced Observation Data
- Habitat variables (such as substrate, depth and elevation)
These were expanded upon in extended discussion on day 2.
The morning of the 13th began with an on-line discussion between Arthur, Lee, Miles and Dmitry with Antonio Saraiva and Allan Koch Veiga in Brazil.
- Planning and Agenda for the pre-TDWG meeting in Costa Rica on Sunday, 4th December
It was agreed that one of the key discussion issues at that meeting needed to include planning for the introductory presentation to the open Working Group meeting so as to maximise the usefulness of the short time available.
It was agreed that the presentations at the Symposium sessions (Introductory talk on the Interest Group by Arthur, and talks on the Task Groups 1 by Allan and 2 and 3 by Lee) would take the place of long introductions to the open Working Group session as the majority would have attended that part of the Symposium anyway and repetition needed to be avoided.
- Use cases – Miles reported that there had been slow uptake in reporting Use Cases. He reported that he was hoping to get approval to employ an intern for a couple of months over summer to assist with inputting Use Cases, and to possibly extract a number of Use Cases from published literature (see further discussion below).
- GitHub – there was discussion on establishing a GitHub repository on the TDWG site for the group – for consolidation with TDWG, for ease of use and standardization across portals, and because support for the GBIF Community Site may not continue.
Dmitry sent off an email to the TDWG Executive and this has resulted in Peter Desmet setting up a GitHub under the TDWG site called bdq (https://github.com/tdwg/bdq). (Done)
Arthur will cut and paste relevant documents from the GBIF Community Site over the next week or so.
- Possible DQIG meeting between TDWGs 2016 and 2017
Arthur raised the idea of seeking funding from TDWG's Community Support Fund (http://www.tdwg.org/activities/comm-support-fund/) to have a meeting of the Interest Group between TDWG2016 and TDWG2017. Applications to this fund can be made only once each year – in 2016 the deadline was in early September, which makes it difficult to plan a meeting when the deadline falls so close to the TDWG meetings where the next year's workplans are developed. Arthur and Antonio will approach the TDWG Executive to explore the possibility of opening the funding at other times and of obtaining funds. Funding through TDWG would require co-funding, which would need to be explored.
Dmitry mentioned the possibility of GBIF being a co-sponsor through its Pacific Project, through links to BID (Biodiversity Information for Development, EU-funded) funding for the Pacific. Dmitry will explore capacity building through GBIF. It was suggested that the ALA might also be able to provide some funding to support participation for Miles. A later email from Donald Hobern (GBIF) indicated possible support. For GBIF's involvement, the meeting would need to include training or development of training material and possibly include a two-day training course on Data Quality. Antonio and Dmitry may also explore possibilities at the 23rd GBIF Governing Board meeting in Brasilia (24-27 October).
- Symposium. It was reported that we have a full list of speakers for the Data Quality Symposium at TDWG2016 (12 talks over two 90 minute sessions). Arthur and Antonio have reviewed the submissions and have arranged the order of presentations.
It was suggested that we use the Symposium to identify any new issues and players.
- Publication of Framework Document
Allan reported that the Framework Document has been submitted to PLOS ONE. He has recently received an email from the editors apologising for the delay in publication and indicating that they were pursuing it. It is still hoped that it will be published prior to TDWG2016.
The two groups held individual discussions in the early morning and then joined together for the rest of the day. Because of considerable overlap, the discussions are grouped below under key topics. Arthur presented an overview of the Data Quality Interest Group and gave a broad introduction to data quality issues. Lee presented on the Tests and Assertions Task Group (TG2) and explained how the group is working on standardising the tests and assertions carried out by data publishers and custodians. He stressed that all the tests were based on Darwin Core terms.
Miles presented a brief introduction to the work his Task Group (TG3) is doing on developing a Use Case library.
Tests and Assertions – TG2
Lee presented the philosophy behind the focus on standardizing Tests and Assertions and showed the current spreadsheet. He reported that a near-final draft will be completed prior to TDWG2016, where a number of outstanding issues will need to be addressed. He stressed that all the tests were based on Darwin Core terms and that a set of emerging Principles were being built (available as a worksheet in the above spreadsheet). These include the principle of not testing for empty terms other than for the key taxon, geographic and temporal terms. There are many Darwin Core terms – a majority of which are often empty – and thus it is unproductive to include tests for all of those terms.
Lee explained that the tests had been classed in a number of ways. For example, the Span of the test could relate to a single term (e.g. day), multiple terms (e.g. day, month), multiple records (e.g. outlier tests) or external (e.g. requiring an external product – look-up table, calendar, shape file) or a summary of tests carried out (e.g. number of DwC terms tested and number of annotations generated).
Original tests were obtained from many data publishers (GBIF, ALA, CRIA, iDigBio, VertNet and, more recently, OBIS), and were then rationalised. From an original list of 178, the list is now down to about 78. There are still a few outstanding issues being resolved and the number of tests may change a little over the coming months.
Lee mentioned that all tests, assertions and look-up tables would be linked through GUIDs (Globally Unique Identifiers). This would enable explicit links to each test and associated documentation.
Lee mentioned that the current tests are applied only to the main Darwin Core terms and not to any of the Darwin Core extensions. Also, the current tests are a core set; various domains (marine, agrobiodiversity, etc.) will have additional tests, but these should be formed and documented in the same way for consistency.
It is envisaged that a scientific paper will arise from this work.
Use Cases – TG3
Miles mentioned that he has worked with Allan and that the Use Cases can now easily be used to produce Profiles under the Framework.
Miles highlighted the difficulties of importing data from the information usually supplied. It currently takes him about an hour to input each Use Case. Some discussion was held on ways to reduce the manual effort required. As mentioned earlier, Miles reported that he was hoping to get approval to employ an intern for a couple of months over summer to assist with inputting Use Cases, and to possibly extract a number of Use Cases from published literature.
Melodie suggested that at ESA (Ecological Society of Australia) or other relevant conferences, the ALA/GBIF host workshops at which they invite and assist participants (most likely to be PhD students and postdocs) to complete Use Cases during the workshop.
It was suggested that Allan's "Validation", as shown on the Profiles, was not always a validation and was more often a filtering. For example, the validation on a field may be "Latitude must be between -90 and +90", but a Use Case may only be interested in records between -9 and -44. This is rightly a filter and not a validation. It was suggested that Allan may need to change the terminology here; otherwise there will be confusion.
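The validation/filter distinction can be made concrete with a minimal sketch (function names and the -44/-9 range are taken from the example above; everything else is illustrative):

```python
# Validation vs. filter, per the latitude example above (illustrative code).

def validate_latitude(lat):
    """Validation: the value must be a legal latitude at all."""
    return -90 <= lat <= 90

def filter_latitude(lat, lo=-44, hi=-9):
    """Filter: keep only records in the Use Case's region of interest."""
    return lo <= lat <= hi

records = [{"decimalLatitude": v} for v in (-95.0, -30.0, 10.0)]
valid = [r for r in records if validate_latitude(r["decimalLatitude"])]
wanted = [r for r in valid if filter_latitude(r["decimalLatitude"])]

print(len(valid), len(wanted))  # 2 legal records, only 1 inside the Use Case's range
```

A failed validation flags a data quality problem; a failed filter merely means the record is out of scope for this Use Case.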
Dmitry asked if Google Forms could be used and developed with user-friendly input – i.e. use the Google Form as the front end to a more detailed Google Sheet. It was suggested that the initial form could be quite simple, and that further information could subsequently be solicited. The initial form may be filled in, and then later collaboration (possibly even using community or citizen science) could be used to expand the Use Case. The originator would need to retain veto/editorial control, and collaborators' input would be acknowledged through attribution. Miles to explore.
Lee suggested that the information from Use Cases should/may lead to better collections. For example, if collectors (e.g. citizen scientists – BioCollect, iNaturalist, BowerBird, various BioBlitzes) knew what fields, level of resolution, etc. are required by users, they may be more inclined to collect the relevant data.
Miles reported that the capturing of Use Cases uses filters on the DQ Tests and Assertions and other indexed variables. He would like a web-service linked to the vocabularies used in the tests, which would lead to consistency between the Use Cases and the data quality checks (e.g. basis of record vocabularies). Miles requested that wherever possible look-up tables should include synonyms (e.g. a geographic look-up table should include "NSW", "New South Wales", etc., with links to the preferred term). Lee to explore – probably a TDWG issue to be raised in Costa Rica.
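A synonym-aware look-up table of the kind Miles requested might look like the following minimal sketch (the table contents and function name are illustrative, not an actual TDWG vocabulary):

```python
# Illustrative synonym look-up: variants resolve to one preferred term.
LOOKUP = {
    "nsw": "New South Wales",
    "new south wales": "New South Wales",
    "vic": "Victoria",
    "victoria": "Victoria",
}

def preferred_term(value):
    """Resolve a supplied value to its preferred term, or None if unknown."""
    return LOOKUP.get(value.strip().lower())

print(preferred_term("NSW"))         # New South Wales
print(preferred_term(" Victoria "))  # Victoria
```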
Lee suggested that the forms could be based along the lines of the three domains (taxonomic, geographic, temporal) e.g.
- What taxa (Eucalyptus)
- Geographic area of concern (Victoria)
- Temporal (over what time period)
Melodie suggested that the Use Case forms be made more user-friendly by using more general biological/ecological terms. Miles to pursue.
Lee suggested that it may be possible that some Use Cases could be mined from the ALA (using the new user download tool and the Spatial Portal logs) – needs exploring.
Melodie then presented a summary of the work the Invasive Alien Species group had been doing and presented a slide of data quality related issues she had extracted from their draft report.
There was some discussion on non-institutionalised research data that was often being lost to science. This included data collected by researchers but then largely lost from science following the completion of the project. Questions were asked (without any real answers) on how this data could be better archived and made available via GBIF, etc. This is a universal issue: Locating and digitizing/ingesting data and getting metadata filled in.
Quentin raised the issue of checklists and mentioned that Darwin Core did not adequately deal with these. An example of one type of checklist they were concerned with was GRIIS (the Global Register of Introduced and Invasive Species – http://www.griis.org/).
Checklists are lists of taxa within a described area, possibly over a particular time – for example, species seen on an island or in a National Park. Such checklists may or may not include a set of traits linked to either the taxon or the area – for example, native status, pathways (for invasive species), date of introduction, etc. A number of IAS members stressed that (as far as they knew) Darwin Core terms were inadequate to capture information mandatory for evaluation of invasions, and checklists were one example where it was unclear how Darwin Core terms could be used unambiguously. See section below on Darwin Core.
There was some discussion on how check-list data could undergo more quality assurance – for example, the taxonomy. The question was asked about possible ways to attribute the taxonomic identifications with uncertainty attributes. Arthur mentioned that there were three systems currently available – the TDWG HISPID standard includes one method that was developed for use in herbaria, a second method developed for Australian fish collections and a third method that he and Larry Speers developed, but he is not aware of anyone actually using it as yet. Arthur to forward some relevant references (Done).
There was also some discussion on the recording of attributes – for example, "native/not-native" is usually more relevant to the taxon within the checklist area than to individual observations. It is often derived from other information and can't always be determined at the time of collection. Data quality as it pertains to checklists (i.e. a taxon at a place and time) needs to be explored.
It was agreed that some information is better presented as a shape file rather than an observation. For example, “Native/non-native/unknown” may be generated as a layer or shape file and/or be part of a checklist. Lee: This can be handled by Darwin Core using footprintWKT.
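As a sketch of the footprintWKT approach, a checklist-style record can carry its area as a WKT polygon rather than a point (dwc:footprintWKT and dwc:establishmentMeans are real Darwin Core terms; the record values and the crude parser are invented for illustration):

```python
# Illustrative checklist-style record carrying its area as footprintWKT.
record = {
    "dwc:scientificName": "Eucalyptus globulus",
    "dwc:establishmentMeans": "native",
    "dwc:footprintWKT": "POLYGON ((141 -39, 150 -39, 150 -34, 141 -34, 141 -39))",
}

def wkt_polygon_bbox(wkt):
    """Crude bounding box from a simple single-ring WKT POLYGON string."""
    coords = wkt[wkt.index("((") + 2 : wkt.index("))")]
    pts = [tuple(map(float, p.split())) for p in coords.split(",")]
    xs, ys = zip(*pts)
    return min(xs), min(ys), max(xs), max(ys)

print(wkt_polygon_bbox(record["dwc:footprintWKT"]))  # (141.0, -39.0, 150.0, -34.0)
```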
It was stressed that Documentation and Metadata on how layers and checklists were derived is important.
There was a discussion on how one may best handle absences. Currently they are generally treated as a non-observation (absence as an attribute in an observation), however it was suggested that this was probably not the best way to handle absences. Delimitation or Constraint on what are the “possibilities” in an area is not easy, but there needs to be a constraint to those taxa that could reasonably be regarded as possible, and exclude all the “not-possibles”.
It is important to record and document how the absences were determined – for example, by inclusion of sampling effort estimates. Important to note that some absences are of the form “not known from this area” whereas others are of the type “may occur in area, but not seen at this particular time” – for example migratory species.
It was suggested that Absences may be best treated similarly to checklists – i.e. they are taxon, area and time based. They are the absence of a particular taxon within a defined area at a defined time (i.e. list of: species/space/time). Difficult at this stage to quality check, but needs to be considered.
In most cases, absences are generated after systematic surveys by generating a sites by species matrix from a suite of survey sites based on the same samplingProtocol (which limits the taxa) and over a given time period.
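The sites-by-species approach above can be sketched as follows (species and site names are invented; the assumption, as stated above, is that all sites share the same samplingProtocol so the candidate taxon pool is well defined):

```python
# Illustrative inferred absences from a sites-by-species survey matrix.
surveys = {
    "site-1": {"Sturnus vulgaris", "Passer domesticus"},
    "site-2": {"Passer domesticus"},
    "site-3": {"Sturnus vulgaris"},
}

# Candidates are constrained to taxa detectable by the shared protocol --
# here, simply everything recorded at any site in the survey.
candidates = set().union(*surveys.values())

def inferred_absences(surveys, candidates):
    """For each site, the candidate taxa that were not recorded there."""
    return {site: sorted(candidates - seen) for site, seen in surveys.items()}

print(inferred_absences(surveys, candidates)["site-2"])  # ['Sturnus vulgaris']
```

Constraining the candidate pool is exactly the "exclude the not-possibles" step discussed above; documenting how that pool was built (protocol, effort, time period) is what makes the inferred absences quality-checkable.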
Melodie reported that the IAS group were to report to GBIF by 1 November.
Arthur gave a presentation on the way forward for the Data Quality Interest Group, reporting that the next twelve months will be an implementation phase and that all Stakeholders (data custodians, data publishers and users) need to be part of the process, and that considerable training would be necessary. He mentioned that all data publishers will need to expose their code for the process to work.
Lee presented a brief introduction on how the Atlas of Living Australia handles 'Data Quality' and the resulting assertions. It was mentioned that the ALA is now being cloned by quite a number of countries, with 6-8 already established and more in the process of implementation. Following a question from Melodie, Lee stated that unfortunately the annotations weren't currently being harvested by GBIF, but this is all part of the DQIG process of bringing data publishers together and using a consistent set of standard annotations. The ALA did generate a paper on a standard for handling annotations (http://dx.plos.org/10.1371/journal.pone.0076093) and the associated code is exported to other national nodes. One assumes a standard will emerge, but only with the assistance of TDWG.
Lee and Miles explained how the feedback mechanisms work on the ALA. Also what happens with the feedback and how changes and corrections are maintained so that a user providing feedback can see what response there has been to that feedback.
Quentin raised the issue of eDNA and the potential for huge amounts of data to be generated, mostly without taxonomic names attached. This was recognised as an issue for the future, but nothing can be done from a DQ perspective at this stage. The data will be included in databases such as GenBank (equivalent to observation data at the species level) and thus be picked up by GBIF as observations. eDNA could generate massive data management and data quality issues in the future.
Darwin Core currently doesn’t cover all information needed for IAS or everything about Checklists – e.g. Date of Introduction (to an area), introductory pathways, provenance of origin, etc. Quentin stated that he would raise these issues at TDWG2016 where workshops are being held on Darwin Core. It was mentioned that John Wieczorek will be at TDWG2016, and that he is the best person to discuss these issues with.
Quentin recommended that an "origin" field needed to be added to DwC (e.g. next to establishmentMeans – add a new term for establishmentOrigin, using IUCN terms). Quentin will be attending TDWG2016 and will raise the issue at that meeting.