Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize validate #418

Merged
merged 10 commits into from
Aug 16, 2024
Merged

Parallelize validate #418

merged 10 commits into from
Aug 16, 2024

Conversation

ssssarah
Copy link
Contributor

@ssssarah ssssarah commented Jul 10, 2024

  • the traverse function being optionally overwritten on Shape was due to a change in pyshacl versions (I believe an old version had this version but a newer one no longer had it. For the version we have fixed, this function does not exist) Install required version of Pyshacl when installing Nexus Forge #43 Fix temporary the Shape.traverse() issue #44
  • renamed usage of _transitive_load_shape_graph to _transitive_load_resource_graph
  • dummy implementation of _validate_many in DemoModel
  • Override pyshacl ShapesGraph implementation of def shapes to return list instead of dict_values() so that it can be picklable (for multiprocessing)
  • Parallelized validate.
    -> cannot use threads because pyshacl uses rdflib, rdflib uses pyparsing and it is not thread safe
    -> shape loading occurs before creating multiple processes for validation. It would be possible to load shapes in individual processes, however:
    - "caching" of shapes becomes useless, the same shape will be queried multiple times across processes
    - Shape loading in individual processes would be best if validating resources of different types/shapes
    - The former point might not be completely true because shapes re-use each other
    - It is for sure better to load the shapes beforehand if we're dealing with a list of resources of the same type/shape.
    Given these reasons, I opted for loading all shapes beforehand.

@codecov-commenter
Copy link

codecov-commenter commented Jul 10, 2024

Codecov Report

Attention: Patch coverage is 95.06173% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.19%. Comparing base (a1ae462) to head (19e3556).

Files Patch % Lines
kgforge/specializations/models/rdf/service.py 95.34% 2 Missing ⚠️
...gforge/specializations/models/rdf/store_service.py 0.00% 1 Missing ⚠️
kgforge/specializations/models/rdf_model.py 97.22% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
+ Coverage   74.13%   74.19%   +0.05%     
==========================================
  Files         106      106              
  Lines        6979     6998      +19     
==========================================
+ Hits         5174     5192      +18     
- Misses       1805     1806       +1     
Flag Coverage Δ
unittests 74.19% <95.06%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ssssarah ssssarah marked this pull request as draft July 18, 2024 10:17
@ssssarah
Copy link
Contributor Author

Unfortunately I ran into an issue when testing validation on 4000 resources.
Long story short, pyshacl uses rdflib, rdflib uses pyparsing and it is not thread safe
RDFLib/rdflib#765 I'll look more into it

@ssssarah ssssarah marked this pull request as ready for review July 19, 2024 13:23
@ssssarah ssssarah self-assigned this Jul 19, 2024
@ssssarah ssssarah added the enhancement New feature or request label Jul 19, 2024
@ssssarah ssssarah merged commit 244e900 into master Aug 16, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants