Parallelize validate #418
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
+ Coverage   74.13%   74.19%   +0.05%
==========================================
  Files         106      106
  Lines        6979     6998      +19
==========================================
+ Hits         5174     5192      +18
- Misses       1805     1806       +1
Flags with carried forward coverage won't be shown.
Unfortunately I ran into an issue when testing validation on 4000 resources.
- _transitive_load_shape_graph to _transitive_load_resource_graph
- _validate_many in DemoModel
- ShapesGraph implementation of def shapes to return list instead of dict_values(), so that the result is picklable (needed for multiprocessing)
-> Cannot use threads, because pyshacl uses rdflib, rdflib uses pyparsing, and pyparsing is not thread safe
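The picklability point can be checked with the standard library alone. The sketch below is not the project's code: `shape_cache` and `is_picklable` are hypothetical names, and plain strings stand in for shape graphs. It only shows that a `dict_values` view cannot be pickled, while a `list` built from it can.

```python
import pickle

# Hypothetical stand-in for a shape cache keyed by shape identifier.
shape_cache = {"ex:PersonShape": "<graph 1>", "ex:DatasetShape": "<graph 2>"}


def is_picklable(obj) -> bool:
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (TypeError, pickle.PicklingError):
        return False


view = shape_cache.values()            # dict_values view, tied to the dict
as_list = list(shape_cache.values())   # plain list, safe to send to workers
```

Here `is_picklable(view)` is false and `is_picklable(as_list)` is true, which is why a value handed to worker processes has to be a concrete container rather than a dictionary view.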
-> Shape loading occurs before the validation processes are created. It would be possible to load shapes in the individual processes instead, however:
- "caching" of shapes becomes useless, the same shape will be queried multiple times across processes
- Shape loading in individual processes would be best if validating resources of different types/shapes
- The previous point might not hold completely, because shapes re-use each other
- It is for sure better to load the shapes beforehand if we're dealing with a list of resources of the same type/shape.
Given these reasons, I opted for loading all shapes beforehand.
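A minimal sketch of the "load first, then fan out" approach, under stated assumptions: `load_shapes`, `validate_one`, and `validate_in_processes` are hypothetical names, shapes are modelled as plain dicts, and the check is a toy required-key test, not pyshacl validation. It only illustrates shapes being resolved once in the parent process and then shipped to the workers.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def load_shapes(shape_ids):
    """Hypothetical loader: resolve each distinct shape once, in the parent."""
    return [{"id": sid, "required": ["name"]} for sid in sorted(set(shape_ids))]


def validate_one(shapes, resource):
    """Toy check: the resource must carry every key its shape requires."""
    shape = next(s for s in shapes if s["id"] == resource["type"])
    return all(key in resource for key in shape["required"])


def validate_in_processes(resources, processes=2):
    # Shapes are loaded up front, so no worker has to query them again.
    shapes = load_shapes(r["type"] for r in resources)
    check = partial(validate_one, shapes)  # picklable: top-level function + list
    if processes <= 1:
        return [check(r) for r in resources]  # serial fallback, same result
    with ProcessPoolExecutor(max_workers=processes) as pool:
        # The shape list is pickled along with each chunk of work.
        return list(pool.map(check, resources, chunksize=64))
```

When `processes` is greater than 1, call this from under an `if __name__ == "__main__":` guard on platforms that start workers by spawning a new interpreter. The trade-off matches the reasoning above: shapes are loaded once, at the cost of serializing the shape list to the workers.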