Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: IDSSE Dataset #147

Merged
merged 14 commits into from
Sep 10, 2024
Merged

Conversation

manuba95
Copy link
Contributor

@manuba95 manuba95 commented Sep 5, 2024

This PR adds the initial version of the dataset accompanying the paper "An integrated dataset of synchronized spatiotemporal and event data in elite soccer" which is currently in submission at Nature Scientific Data. The dataset contains seven full matches of position and event data from the German Bundesliga first and second division.

This PR should be fully functional, however documentation, testing and cosmetics will be improved in the next commits.

Copy link
Contributor

@draabe draabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, looks great!

just some cosmetics

floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
floodlight/io/datasets.py Outdated Show resolved Hide resolved
Comment on lines 155 to 174
@pytest.mark.unit
def test_dfl_get() -> None:

dataset = DFLDataset()
event_objects, position_objects = dataset.get()
assert isinstance(event_objects[0]["firstHalf"]["Home"], Events)
assert isinstance(event_objects[0]["secondHalf"]["Away"], Events)
assert isinstance(event_objects[1]["Home"], Teamsheet)
assert isinstance(event_objects[1]["Away"], Teamsheet)
assert isinstance(event_objects[2], Pitch)

assert isinstance(position_objects[0]["firstHalf"]["Home"], XY)
assert isinstance(position_objects[0]["secondHalf"]["Away"], XY)
assert isinstance(position_objects[1]["secondHalf"], Code)
assert isinstance(position_objects[1]["firstHalf"], Code)
assert isinstance(position_objects[2]["secondHalf"], Code)
assert isinstance(position_objects[2]["firstHalf"], Code)
assert isinstance(position_objects[3]["Home"], Teamsheet)
assert isinstance(position_objects[3]["Away"], Teamsheet)
assert isinstance(position_objects[4], Pitch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appreciate the effort, but how long does this take? feels like it may blow up our unit test suite from 8 seconds capable of running offline to 1 minute requiring internet connection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on the duration part. This test alone takes about 1:15 min on my machine. I will remove the test for now. I think for this dataset, a good test could be checking the respective links for valid responses as internet connection is already required e.g., for the StatsBombOpenDataset tests.
However, I expect the final link to change (probably remove the private token) once the paper is published. So this could be added in a future fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let's do it that way! probably make sense to split up our test-suite at some point with fast unit tests and a dedicated test battery for the parsing that might take longer, so they only need to be tested when working on that module or on pull requests. feel free to open another PR or issue on that

@draabe
Copy link
Contributor

draabe commented Sep 7, 2024

Hi @manuba95 and @VellaRo

one more general comment: I've noticed you have been using the requests module for file downloads in the added parsers/datasets. Is there a reason for it? I know we have it as a transitive dependency, but so far we've been relying exclusively on urllib. I know requests is widely recommend and liked, but all we ever going to do http wise is send get requests, so basic functionality only. And that's always going to be faster with urllib as we don't need to load a gazillion fancy request features. (we might want to update to urllib3, though).

Reason I'm asking is because we have dedicated download functions in the io/utils.py submodule, but the new parsers are deviating from this. I'd prefer if we'd think about this and see if we can find a unified way, or at least have good reasons should we end up with multiple download strategies/packages/submodules.

Any thoughts?

@manuba95
Copy link
Contributor Author

manuba95 commented Sep 7, 2024

Fully agree on that! Simply a misunderstanding from my side on what the download functions in the utils would do. Changed it in the wrapper function. @VellaRo I suggest you do the same in your respective features as this is a pretty easy fix.

Copy link

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 9.09091% with 70 lines in your changes missing coverage. Please review.

Project coverage is 85.89%. Comparing base (22cb3a2) to head (ffa8bbf).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
floodlight/io/datasets.py 9.09% 70 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (22cb3a2) and HEAD (ffa8bbf). Click for more details.

HEAD has 3 uploads less than BASE
Flag BASE (22cb3a2) HEAD (ffa8bbf)
6 3
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #147      +/-   ##
===========================================
- Coverage    95.58%   85.89%   -9.70%     
===========================================
  Files           47       48       +1     
  Lines         3194     3594     +400     
===========================================
+ Hits          3053     3087      +34     
- Misses         141      507     +366     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@staticmethod
def get_pitch() -> Pitch:
"""Returns a Pitch object corresponding to the DFL-data."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDSSE?

References
----------
.. [1] `Bassek, M., Weber, H., Rein, R., & Memmert,D. (2024). An integrated
dataset of synchronized spatiotemporal andevent data in elite soccer. In
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dataset of synchronized spatiotemporal andevent data in elite soccer. In
dataset of synchronized spatiotemporal and event data in elite soccer. In

Comment on lines 155 to 174
@pytest.mark.unit
def test_dfl_get() -> None:

dataset = DFLDataset()
event_objects, position_objects = dataset.get()
assert isinstance(event_objects[0]["firstHalf"]["Home"], Events)
assert isinstance(event_objects[0]["secondHalf"]["Away"], Events)
assert isinstance(event_objects[1]["Home"], Teamsheet)
assert isinstance(event_objects[1]["Away"], Teamsheet)
assert isinstance(event_objects[2], Pitch)

assert isinstance(position_objects[0]["firstHalf"]["Home"], XY)
assert isinstance(position_objects[0]["secondHalf"]["Away"], XY)
assert isinstance(position_objects[1]["secondHalf"], Code)
assert isinstance(position_objects[1]["firstHalf"], Code)
assert isinstance(position_objects[2]["secondHalf"], Code)
assert isinstance(position_objects[2]["firstHalf"], Code)
assert isinstance(position_objects[3]["Home"], Teamsheet)
assert isinstance(position_objects[3]["Away"], Teamsheet)
assert isinstance(position_objects[4], Pitch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let's do it that way! probably make sense to split up our test-suite at some point with fast unit tests and a dedicated test battery for the parsing that might take longer, so they only need to be tested when working on that module or on pull requests. feel free to open another PR or issue on that

Copy link
Contributor

@draabe draabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just two typos

@draabe draabe mentioned this pull request Sep 9, 2024
@draabe draabe merged commit ddb0422 into floodlight-sports:develop Sep 10, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants