Data Quality model extension #174
One proposal / idea results from the work in the EU CitiObs project. It is understood that the use of the DQ_Element or the use of the Observation … One possible way to improve filtering performance is to introduce a separate … The other aspect that was identified is that … The following figure illustrates the proposal. The details of the … The diagram and all illustrations above represent the current snapshot of thinking. It will certainly evolve over the coming months…
It's interesting to see DataQuality linked to Sensor; it makes it easier to manage the maintenance of faulty Sensors. Would it not be more appropriate to have the DataQuality linked to Datastream and store the DataQuality attribute in Datastream?
The attributes for the DQ Entity do require more information.
The following proposal to represent quality information in STA 2.0 is focused on the requirements of dedicated tools for anomaly detection, like, for example, saqc. Without going into too much detail here, the basic functionality of such tools can be described as follows.

Software for anomaly detection can usually be understood as a collection of algorithms implemented as functions or methods, each tailored towards a certain kind of anomaly (e.g., algorithms to detect outliers, scatter, constant values, ...). In the following, those functions/methods will be called 'quality measures'. Applied to a given datastream, such tools usually output some kind of quality information, in the following called 'quality features', often also called "flags".

When used, these tools are usually configured to apply certain quality measures to an input datastream. Those measures themselves are usually parameterized towards the characteristics of the input data. In most cases, several quality measures will be used to detect different kinds of anomalies within the same datastream. The output will be another datastream of quality features, usually, but not necessarily, of the same length as the input data.

The exact representation of the quality features is heavily dependent on the use case, and to my knowledge, no widely accepted standards exist (although I'd love to be proven wrong on this). In practice, I have seen quite different quality feature schemes (i.e., definitions of quality features and their interrelation), ranging from a set of concrete labels (like "OK", "SUSPICIOUS", "BAD") over integers within given bounds (1-10, 0-255) to a 'continuous' representation as real numbers, e.g. in the interval [0, 1]. The main point here is that basically everything could be a quality feature and therefore part of a quality feature scheme.

The representation of quality features, especially if this representation should be traceable or even reproducible, needs to fulfill several requirements: …
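A minimal sketch of the workflow just described, assuming a label-based flag scheme ("OK"/"SUSPICIOUS"/"BAD"); the function names and parameters are illustrative, not saqc's actual API:

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

# A 'quality measure' is a parameterized function that maps a datastream
# to a stream of 'quality features' (flags) of the same length.
@dataclass
class QualityMeasure:
    name: str
    func: Callable[[pd.Series], pd.Series]

def flag_range(low: float, high: float) -> Callable[[pd.Series], pd.Series]:
    """Outlier check: values outside [low, high] are flagged 'BAD'."""
    return lambda s: pd.Series(
        ["BAD" if (v < low or v > high) else "OK" for v in s], index=s.index
    )

def flag_constants(window: int) -> Callable[[pd.Series], pd.Series]:
    """Constant-value check: runs of identical values are 'SUSPICIOUS'."""
    def check(s: pd.Series) -> pd.Series:
        const = s.rolling(window).apply(lambda w: float(w.nunique() == 1),
                                        raw=False)
        return const.map({1.0: "SUSPICIOUS"}).fillna("OK")
    return check

# Several measures, each parameterized for the input data, applied to the
# same datastream; the result is one flag stream per measure.
measures = [
    QualityMeasure("range check", flag_range(low=0.0, high=100.0)),
    QualityMeasure("constant check", flag_constants(window=3)),
]
data = pd.Series([1.0, 2.0, 2.0, 2.0, 250.0])
flags = {m.name: m.func(data) for m in measures}
```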
The following diagram shows a possible realization of those requirements as an extension of the STA 1.1 data model. This scheme allows describing different …
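Since the diagram itself is not reproduced here, the following sketch is one possible plain-text reading of such an extension; the entity and attribute names are assumptions based on the terms used above, not the actual diagram:

```python
from dataclasses import dataclass, field

@dataclass
class QualityMeasure:
    """A parameterized anomaly-detection function, e.g. a range check."""
    name: str
    parameters: dict        # e.g. {"min": 0.0, "max": 100.0}

@dataclass
class QualityFeature:
    """One piece of quality information ('flag') for an Observation."""
    result: object          # label, integer, or real, depending on the scheme
    generated_by: QualityMeasure

@dataclass
class Observation:
    result: object
    quality_features: list[QualityFeature] = field(default_factory=list)
```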
For V2 we're renaming the …
A question about the cardinalities: … Also, how many QualityFeatures can an Observation have? The latest diagram has …
We may get questions about the term …
Regarding the term: how about "QualityAnnotation"?
If we allow 'reusing' a …
I changed the cardinality from 0..* to 0..1 because of the introduction of …
A stupid question from my side: with regard to the new entity types that we want to add here, how do they relate to the Data Quality Vocabulary? I'm not really into the topic; I got the hint from a colleague. If this is already part of your ideas, then I'm completely fine with this.
You mean you have a number of Observations in a Datastream that are all linked to the … Both models would work. The question is which one is easier to implement for the client: which, and how many, POST/PATCH requests are needed, and what can be automated. If one changes each Observation to link to a new … If there is no automation in the background, either model would work, but that may be complex for the client...
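For comparison, a minimal sketch of the two client workflows; the endpoint URL, the entity name DataQuality, and the linking syntax are assumptions (only @iot.id follows the usual STA convention):

```python
import requests

BASE = "https://example.org/SensorThingsService/v2.0"  # hypothetical server
dq_id = 7  # id of a quality entity created beforehand (entity name assumed)

# Variant A: re-link every affected Observation individually,
# one PATCH request per Observation.
for obs_id in (101, 102, 103):  # illustrative ids
    requests.patch(f"{BASE}/Observations({obs_id})",
                   json={"DataQuality": {"@iot.id": dq_id}})

# Variant B: set the links from the quality entity's side in one request,
# if the model allows the relation to be updated there.
requests.patch(f"{BASE}/DataQualities({dq_id})",
               json={"Observations": [{"@iot.id": i} for i in (101, 102, 103)]})
```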
Compound data types are harder for the client to deal with. And depending on the combinations that are possible, there may be very many different …
Well, actually not. I am not an expert on the Data Quality Vocabulary (DQV) either, but I read into it a bit and found that we have the following, more or less congruent, definitions: …
Furthermore, DQV defines another entity, the 'Quality Dimension', which acts as an abstract category for a number of 'Quality Metrics' that each cover a particular aspect of what it means for data to be considered "good" or "fit for purpose" (e.g. the 'Quality Dimension' Accuracy could cover 'Quality Metrics' that provide 'Quality Measurements' like error margins, absolute errors, z-scores, ...). I am not sure that we need something like the 'Quality Dimension', and I don't see much benefit in introducing another higher-level abstraction. If we wanted to be compatible with / translatable into DQV, we should likely add the 'Quality Dimension' as well. I like the terminology, however, and suggest the following renaming: …
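For illustration, the layering described above as a rough sketch; DQV is an RDF vocabulary, so this is only a structural analogy, and the class names follow the comment's wording rather than DQV's exact terms:

```python
from dataclasses import dataclass

@dataclass
class QualityDimension:
    """Abstract category, e.g. 'Accuracy'."""
    name: str

@dataclass
class QualityMetric:
    """A concrete aspect within a dimension, e.g. 'z-score'."""
    name: str
    dimension: QualityDimension

@dataclass
class QualityMeasurement:
    """The value a metric yields for a given datastream."""
    metric: QualityMetric
    value: float

accuracy = QualityDimension("Accuracy")
z_score = QualityMetric("z-score", dimension=accuracy)
measurement = QualityMeasurement(metric=z_score, value=1.8)
```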
Then I would suggest using the model that is easier to implement, and I leave the decision to you, @hylkevds.
I guess for the moment we could restrict the available types to scalars and fixed-type arrays. If we stick with the …
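A minimal sketch of that restriction, assuming JSON-style values; the function name is illustrative:

```python
SCALARS = (bool, int, float, str)

def is_valid_quality_result(value) -> bool:
    """Accept scalars and homogeneous (fixed-type) arrays, nothing else."""
    if isinstance(value, SCALARS):
        return True
    if isinstance(value, list) and value:
        first = type(value[0])
        return first in SCALARS and all(type(v) is first for v in value)
    return False

assert is_valid_quality_result(0.93)
assert is_valid_quality_result([1, 2, 3])
assert not is_valid_quality_result([1, "OK"])    # mixed-type array
assert not is_valid_quality_result({"flag": 1})  # compound type
```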
No matter how we decide on compound data types, we should allow multiple …
Since no one really knows what to do with the resultQuality DQ_Element property on Observation, we need something better.
Some proposals are currently being worked on.