Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update model description to include pointers to suitable images in IDC #68

Open
fedorov opened this issue Nov 22, 2023 · 10 comments
Open
Labels
enhancement New feature or request

Comments

@fedorov
Copy link
Member

fedorov commented Nov 22, 2023

As discussed earlier, I think this should be added to model.json and also propagate into the web page of the model. I suggest the most straightforward would be to include SeriesInstanceUID(s) for the suitable DICOM series that were confirmed to produce acceptable results with the model.

@LennyN95
Copy link
Member

I was thinking about two sections here @fedorov.
1 - A list of suitable IDC collections.
2 - Patient pointers to files the model was actually tested on (we will then reuse exactly these pointers to test the model's functionality and to automate* such process).

  • one GitHub worker with GPU is enrolled, this will be super easy :)

@LennyN95 LennyN95 added the enhancement New feature or request label Nov 23, 2023
@fedorov
Copy link
Member Author

fedorov commented Nov 24, 2023

How about we start with SeriesInstanceUIDs for the series known to work? Collections may have mix of suitable and non-suitable images, same about patients.

@LennyN95
Copy link
Member

Oh yes, absolutely. I should've said SeriesInstanceUIDs, not patient. For multi-input models, we need to point to the SeriesInstanceUIDs anyway.

Is it possible to download a dicom series by their SeriesInstanceUIDs from IDC avoiding Big Query?
I ask because I want to fully automate the testing and eventually host it on GitHub as an automation, but afaik, Big Query requires a Google authentication.

@fedorov
Copy link
Member Author

fedorov commented Nov 24, 2023

Yes, download does not require BigQuery. For that we would need to include crdc_series_uuid and bucket location.

How about we include the following attributes:

  • SeriesInstanceUID: not versioned (ie, if the series content changes in another release, this UID will remain the same)
  • IDC data release version: for completeness, since SeriesInstanceUID + IDC version uniquely identifies a DICOM series
  • crdc_series_uuid: versioned identifier of the series
  • series_aws_url: location in an AWS bucket from where the series can be downloaded (the suffix of this URL will be crdc_series_url)

@LennyN95
Copy link
Member

Looks good @fedorov :)

Do you have any example around somewhere of how to ..

  • derive this information from IDC (for our contributors)
  • how to download the information (for us, I guess via s5cmd?)

@fedorov
Copy link
Member Author

fedorov commented Nov 24, 2023

Sure! IDC version is in the name of the dataset of the table, e.g., bigquery-public-data.idc_v16.dicom_all.

crdc_series_uuid is a column in the table.

aws_url is a column in the table that contains bucket URL for individual file/instance. Instances are organized in folders by series, so the series bucket folder can be obtained with like this: CONCAT("cp s3://", SPLIT(aws_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid).

Download is with s5cmd:

s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp <series_aws_url>/* .

I can write this up if you point me to the docs page dedicated to contribution process.

@LennyN95
Copy link
Member

So, if I select a patient via the IDC portal, I get the following s5cmd command:
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp 's3://idc-open-data/852b3536-b956-4a44-971c-1502ec6c4db2/*' .

I see no notion of the IDC-Version nor SeriesInstanceUID, is 852b3536-b956-4a44-971c-1502ec6c4db2 a unique resource uuid and indifferent to the storage provider (aws / gc)? Is this reliable and future proof?

It seems to me that the easiest way to automatically download a specific instance then would be to specify that resource uuid. We can mention the IDC version and SeriesInstanceUID alongside to provide the right context, or is there a method to derive the resource uid based on the information without running a big-query (assuming such a big query would require authorization).

@fedorov
Copy link
Member Author

fedorov commented Nov 28, 2023

I see no notion of the IDC-Version nor SeriesInstanceUID, is 852b3536-b956-4a44-971c-1502ec6c4db2 a unique resource uuid and indifferent to the storage provider (aws / gc)? Is this reliable and future proof?

It is a unique resource uuid. But just knowing uuid you cannot get the bucket URL without querying anything, so we would definitely need the URLs. Otherwise we would need to run a query. Unless the data is retracted (e.g., if there is a PHI breach, or scientific misconduct), the file corresponding to this UUID should remain available.

We can mention the IDC version and SeriesInstanceUID alongside to provide the right context, or is there a method to derive the resource uid based on the information without running a big-query (assuming such a big query would require authorization).

Yes, the information may be redundant (you can get the file from the bucket URL alone), but if the file moves (which is rare, but may happen), you would be able to query to hopefully find the new location. And SeriesInstanceUID is going to be in the DICOM file, while crdc_series_uuid is not a DICOM attribute.

Note that you can now query basic metadata fields without login and without BigQuery - see https://github.com/ImagingDataCommons/idc-index. Note that right now crdc_series_uuid is not one of the columns included in idc-index. We will revise this after IDC data release v17.

@LennyN95
Copy link
Member

Note that you can now query basic metadata fields without login and without BigQuery

That's awesome! So once crdc_series_uuid is available in idc-index, we can get to individual sample data by the idc version and the SeriesInstanceUID alone - cool! Until then, we can use the resource uuid to download the resources. If you agree, then I'll propose some format to extend the meta.json with (that'll work for our semi and fully automated test pipeline in the future).

@fedorov
Copy link
Member Author

fedorov commented Nov 28, 2023

That's awesome! So once crdc_series_uuid is available in idc-index, we can get to individual sample data by the idc version and the SeriesInstanceUID alone - cool! Until then, we can use the resource uuid to download the resources.

At least right now, idc-index handles only the latest version of IDC. We need to think and discuss if and how we would want to support prior versions.

I still think all of the 4 attributes I suggested would be good to have in the model metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants