-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: IDSSE Dataset #147
feat: IDSSE Dataset #147
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, looks great!
just some cosmetics
tests/test_io/test_datasets.py
Outdated
@pytest.mark.unit | ||
def test_dfl_get() -> None: | ||
|
||
dataset = DFLDataset() | ||
event_objects, position_objects = dataset.get() | ||
assert isinstance(event_objects[0]["firstHalf"]["Home"], Events) | ||
assert isinstance(event_objects[0]["secondHalf"]["Away"], Events) | ||
assert isinstance(event_objects[1]["Home"], Teamsheet) | ||
assert isinstance(event_objects[1]["Away"], Teamsheet) | ||
assert isinstance(event_objects[2], Pitch) | ||
|
||
assert isinstance(position_objects[0]["firstHalf"]["Home"], XY) | ||
assert isinstance(position_objects[0]["secondHalf"]["Away"], XY) | ||
assert isinstance(position_objects[1]["secondHalf"], Code) | ||
assert isinstance(position_objects[1]["firstHalf"], Code) | ||
assert isinstance(position_objects[2]["secondHalf"], Code) | ||
assert isinstance(position_objects[2]["firstHalf"], Code) | ||
assert isinstance(position_objects[3]["Home"], Teamsheet) | ||
assert isinstance(position_objects[3]["Away"], Teamsheet) | ||
assert isinstance(position_objects[4], Pitch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appreciate the effort, but how long does this take? feels like it may blow up our unit test suite from 8 seconds capable of running offline to 1 minute requiring internet connection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on the duration part. This test alone takes about 1:15 min on my machine. I will remove the test for now. I think for this dataset, a good test could be checking the respective links for valid responses as internet connection is already required e.g., for the StatsBombOpenDataset
tests.
However, I expect the final link to change (probably remove the private token) once the paper is published. So this could be added in a future fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, let's do it that way! probably make sense to split up our test-suite at some point with fast unit tests and a dedicated test battery for the parsing that might take longer, so they only need to be tested when working on that module or on pull requests. feel free to open another PR or issue on that
one more general comment: I've noticed you have been using the requests module for file downloads in the added parsers/datasets. Is there a reason for it? I know we have it as a transitive dependency, but so far we've been relying exclusively on urllib. I know requests is widely recommend and liked, but all we ever going to do http wise is send get requests, so basic functionality only. And that's always going to be faster with urllib as we don't need to load a gazillion fancy request features. (we might want to update to urllib3, though). Reason I'm asking is because we have dedicated download functions in the io/utils.py submodule, but the new parsers are deviating from this. I'd prefer if we'd think about this and see if we can find a unified way, or at least have good reasons should we end up with multiple download strategies/packages/submodules. Any thoughts? |
Fully agree on that! Simply a misunderstanding from my side on what the download functions in the utils would do. Changed it in the wrapper function. @VellaRo I suggest you do the same in your respective features as this is a pretty easy fix. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #147 +/- ##
===========================================
- Coverage 95.58% 85.89% -9.70%
===========================================
Files 47 48 +1
Lines 3194 3594 +400
===========================================
+ Hits 3053 3087 +34
- Misses 141 507 +366 ☔ View full report in Codecov by Sentry. |
floodlight/io/datasets.py
Outdated
|
||
@staticmethod | ||
def get_pitch() -> Pitch: | ||
"""Returns a Pitch object corresponding to the DFL-data.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDSSE?
floodlight/io/datasets.py
Outdated
References | ||
---------- | ||
.. [1] `Bassek, M., Weber, H., Rein, R., & Memmert,D. (2024). An integrated | ||
dataset of synchronized spatiotemporal andevent data in elite soccer. In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset of synchronized spatiotemporal andevent data in elite soccer. In | |
dataset of synchronized spatiotemporal and event data in elite soccer. In |
tests/test_io/test_datasets.py
Outdated
@pytest.mark.unit | ||
def test_dfl_get() -> None: | ||
|
||
dataset = DFLDataset() | ||
event_objects, position_objects = dataset.get() | ||
assert isinstance(event_objects[0]["firstHalf"]["Home"], Events) | ||
assert isinstance(event_objects[0]["secondHalf"]["Away"], Events) | ||
assert isinstance(event_objects[1]["Home"], Teamsheet) | ||
assert isinstance(event_objects[1]["Away"], Teamsheet) | ||
assert isinstance(event_objects[2], Pitch) | ||
|
||
assert isinstance(position_objects[0]["firstHalf"]["Home"], XY) | ||
assert isinstance(position_objects[0]["secondHalf"]["Away"], XY) | ||
assert isinstance(position_objects[1]["secondHalf"], Code) | ||
assert isinstance(position_objects[1]["firstHalf"], Code) | ||
assert isinstance(position_objects[2]["secondHalf"], Code) | ||
assert isinstance(position_objects[2]["firstHalf"], Code) | ||
assert isinstance(position_objects[3]["Home"], Teamsheet) | ||
assert isinstance(position_objects[3]["Away"], Teamsheet) | ||
assert isinstance(position_objects[4], Pitch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, let's do it that way! probably make sense to split up our test-suite at some point with fast unit tests and a dedicated test battery for the parsing that might take longer, so they only need to be tested when working on that module or on pull requests. feel free to open another PR or issue on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just two typos
This PR adds the initial version of the dataset accompanying the paper "An integrated dataset of synchronized spatiotemporal and event data in elite soccer" which is currently in submission at Nature Scientific Data. The dataset contains seven full matches of position and event data from the German Bundesliga first and second division.
This PR should be fully functional, however documentation, testing and cosmetics will be improved in the next commits.