[Feature] Process consideration to keeping consistent data formatting #390
Comments
Just for context: for the district and sub-district data, there was a question of creating a consolidated file that lists all of the entries in the demarcation board's standard format. This might be a good place to pick that up. If we can automate the conversion to produce that single file, that would be a good thing. We can then phase out the use of names and use key files to reconcile. Thoughts?
Does anyone have an example of the demarcation file? I agree with the proposals of @chadwpetersen, though we really want to discourage format changes. Sometimes they are forced on us. For example, Ekurhuleni was releasing Ekurhuleni East and North and then split them into East 1 and East 2.
See the example of the Limpopo districts file. @JosephSefara has been using the demarcation names.
Also forgot, @shaze: we have the 2018 Demarcation key in https://github.com/dsfsi/covid19za/blob/master/data/district_data/LM_2018.csv
Great, but this is a coarser level than we are reporting. The main issues have been at the sub-municipality level. I've had a quick look at the Demarcation Board website and can't see a convenient spreadsheet at the lower level, though there are maps.
@shaze Are you referring to wards (a sub-municipality level) or subplaces (like suburbs)?
There are three levels of sub-municipality data that I have seen. Regions are a collection of wards for sure. I think wards are generally collections of suburbs, but looking at the maps I have of my neighbourhood, I'm not 100% sure that this is always followed.
I think the demarcation key will help with standardising the district column names; that is a good idea. It might then be worthwhile to standardise the case details part as well.
@heerden can you help with your inputs?
I am reviewing the current keys and will give my recommendation tomorrow for a flexible system.
My thoughts have settled on all the great suggestions in this thread. I have a few practical steps we can start with that will lead to a governable specification for data consistency, covering existing district data, future modifications and any districts we need to onboard.
The Readme in data/district_data should define everything we agree on here. A combined key file should then reflect the titles and the level of districts. I will submit the first draft soon. While the demarcation keys are a great starting point, I see how they do not always align with the reported media releases for each province. Where a demarcation equivalent is available, the combined key file will then also serve as a conversion from existing titles to that equivalent.
We should also list the data collection leads for every province, to keep everyone in the loop. It might be a worthwhile task to list all our stakeholders as well, so we can contact them directly if there are "breaking changes". Internally, our API and the notebooks that commit calculated data can be seen as stakeholders.
Automated checks (GitHub Actions) can be added to validate submitted data against the combined key file. The province lead can then be notified that there is a breaking change that has not gone through the governance process. The goal is still to prevent any stakeholder's workflow from breaking.
The only issue I see with versioning is that you will need to keep updating two sets of data files for a while until the old version is deprecated. If this is not an issue for the province leads, then we should keep this option open for cases where there is no way to patch the data. Appending data columns at the end of the data file might remain the best option, as we do not know how each stakeholder is reading the data. They might be using the column index number. We should thus encourage them to use keys instead.
I am not going to suggest drastic changes to any existing data, but will need to consider each province case by case.
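As a rough illustration of the automated check described above, here is a minimal Python sketch that a CI job could run. It assumes the combined key file is a CSV with one row per permitted column name in a column called `key`; that layout, the function names and the sample data are all hypothetical and not taken from the repository.

```python
import csv
import io


def load_allowed_columns(key_file, key_column="key"):
    """Read the set of permitted column names from a key file (assumed layout)."""
    reader = csv.DictReader(key_file)
    return {row[key_column].strip() for row in reader if row.get(key_column)}


def unknown_columns(data_file, allowed):
    """Return the data file's header names that are missing from the key file."""
    header = next(csv.reader(data_file), [])
    return [name.strip() for name in header if name.strip() not in allowed]


# Hypothetical sample content, for illustration only.
KEY_CSV = (
    "key,description\n"
    "date,Reporting date\n"
    "Ekurhuleni East,District total\n"
    "Deaths,Cumulative deaths\n"
)
DATA_CSV = "date,Ekurhuleni East,Recoveries\n2020-05-01,10,3\n"

allowed = load_allowed_columns(io.StringIO(KEY_CSV))
problems = unknown_columns(io.StringIO(DATA_CSV), allowed)
if problems:
    # A CI job could exit non-zero here and notify the province lead.
    print("Columns not in key file:", problems)
```

A check like this only flags headers that are absent from the key file; it does not decide whether the change is acceptable, which would stay with the governance process.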
Hey @chadwpetersen and @shaze, please take a look at the pull request.
@shaze you mentioned you have a new "recovery" column for the GP districts. You can add it to the key file, which is a data column key file.
Yes, I will. First, though, I want to add "Deaths", which we have in the data but not in the keys.
Is your feature request related to a problem? Please describe.
CSV file header changes can cause issues for services that rely on the CSV formats provided by this awesome repo. They might break downstream services that expect data in a specific format, i.e. data at a particular column index and maybe even under particular column header names.
To help ensure data guarantees to the community, it would be great to keep these changes to a minimum where possible, and at least manage them with some lean process. :)
Describe the solution you'd like
We could first consider making column changes to a file append-only. So if you want to add something new to the file, consider appending it as the last column; that way it does not break any existing indexes others currently rely on.
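The append-only rule can be checked mechanically. A minimal Python sketch, assuming we have the header row of the previously published file and of the proposed file as lists of column names (the function name and sample headers are hypothetical):

```python
def is_append_only(old_header, new_header):
    """True if every existing column keeps its name and position.

    New columns are only allowed after the last existing column.
    """
    return new_header[: len(old_header)] == old_header


# Hypothetical headers, for illustration only.
old = ["date", "district_a", "district_b"]

appended = ["date", "district_a", "district_b", "deaths"]
reordered = ["date", "district_b", "district_a"]
renamed = ["date", "district_a", "district_b_1"]

print(is_append_only(old, appended))   # True
print(is_append_only(old, reordered))  # False
print(is_append_only(old, renamed))    # False
```

Anything that fails this check would count as a backwards-incompatible change under the proposal.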
We could also consider appending some sort of version to the end of a file name if we want to introduce a backwards-incompatible change, i.e. reordering or renaming columns. The new file might get a _v2.csv suffix added to it. That way people get time to upgrade to the new file until we deprecate the old one.
Describe alternatives you've considered
We could add to the README that these file formats can break at any time, as they are not yet stable.
We could also consider having a means of agreeing on a format before it is used, for example a community vote (but this can be a bit too much, I think).
Additional context
The reason I ask is that these types of changes should be considered backwards incompatible: downstream services that rely on these files can break if they expect the values to be in certain columns with certain headings, which is not a great experience when things like this change.
I have experienced a few breakages related to the district data files and was hoping a simple, lean process could be considered when maintaining these data files so as to give the community some data structure guarantees. :)
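On the consumer side, reading by header name rather than by column index removes one class of breakage (reordering), though renames would still break it. A minimal Python sketch using the standard library's `csv.DictReader`; the helper name and the sample data are hypothetical:

```python
import csv
import io


def read_column(csv_file, column_name):
    """Return all values of one column, looked up by header name."""
    reader = csv.DictReader(csv_file)
    if column_name not in (reader.fieldnames or []):
        # Fail loudly on a rename or removal instead of reading the wrong column.
        raise KeyError(f"column {column_name!r} not found in header")
    return [row[column_name] for row in reader]


# Hypothetical file contents: the second version has its columns reordered.
V1 = "date,total,deaths\n2020-05-01,100,2\n"
V2 = "date,deaths,total\n2020-05-01,2,100\n"

print(read_column(io.StringIO(V1), "total"))
print(read_column(io.StringIO(V2), "total"))
```

Both calls return the same values even though the column moved, whereas an index-based reader would silently pick up the deaths column from the second file.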