Update model description to include pointers to suitable images in IDC #68

fedorov · 2023-11-22T14:24:40Z

As discussed earlier, I think this should be added to model.json and also propagate into the web page of the model. I suggest the most straightforward would be to include SeriesInstanceUID(s) for the suitable DICOM series that were confirmed to produce acceptable results with the model.

The text was updated successfully, but these errors were encountered:

LennyN95 · 2023-11-23T09:38:46Z

I was thinking about two sections here @fedorov.
1 - A list of suitable IDC collections.
2 - Patient pointers to files the model was actually tested on (we will then reuse exactly these pointers to test the model's functionality and to automate* such process).

one GitHub worker with GPU is enrolled, this will be super easy :)

fedorov · 2023-11-24T00:10:54Z

How about we start with SeriesInstanceUIDs for the series known to work? Collections may have mix of suitable and non-suitable images, same about patients.

LennyN95 · 2023-11-24T06:07:11Z

Oh yes, absolutely. I should've said SeriesInstanceUIDs, not patient. For multi-input models, we need to point to the SeriesInstanceUIDs anyway.

Is it possible to download a dicom series by their SeriesInstanceUIDs from IDC avoiding Big Query?
I ask because I want to fully automate the testing and eventually host it on GitHub as an automation, but afaik, Big Query requires a Google authentication.

fedorov · 2023-11-24T13:23:29Z

Yes, download does not require BigQuery. For that we would need to include crdc_series_uuid and bucket location.

How about we include the following attributes:

SeriesInstanceUID: not versioned (ie, if the series content changes in another release, this UID will remain the same)
IDC data release version: for completeness, since SeriesInstanceUID + IDC version uniquely identifies a DICOM series
crdc_series_uuid: versioned identifier of the series
series_aws_url: location in an AWS bucket from where the series can be downloaded (the suffix of this URL will be crdc_series_url)

LennyN95 · 2023-11-24T13:37:05Z

Looks good @fedorov :)

Do you have any example around somewhere of how to ..

derive this information from IDC (for our contributors)
how to download the information (for us, I guess via s5cmd?)

fedorov · 2023-11-24T16:02:37Z

Sure! IDC version is in the name of the dataset of the table, e.g., bigquery-public-data.idc_v16.dicom_all.

crdc_series_uuid is a column in the table.

aws_url is a column in the table that contains bucket URL for individual file/instance. Instances are organized in folders by series, so the series bucket folder can be obtained with like this: CONCAT("cp s3://", SPLIT(aws_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid).

Download is with s5cmd:

s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp <series_aws_url>/* .

I can write this up if you point me to the docs page dedicated to contribution process.

LennyN95 · 2023-11-27T10:15:53Z

So, if I select a patient via the IDC portal, I get the following s5cmd command:
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp 's3://idc-open-data/852b3536-b956-4a44-971c-1502ec6c4db2/*' .

I see no notion of the IDC-Version nor SeriesInstanceUID, is 852b3536-b956-4a44-971c-1502ec6c4db2 a unique resource uuid and indifferent to the storage provider (aws / gc)? Is this reliable and future proof?

It seems to me that the easiest way to automatically download a specific instance then would be to specify that resource uuid. We can mention the IDC version and SeriesInstanceUID alongside to provide the right context, or is there a method to derive the resource uid based on the information without running a big-query (assuming such a big query would require authorization).

fedorov · 2023-11-28T14:14:58Z

I see no notion of the IDC-Version nor SeriesInstanceUID, is 852b3536-b956-4a44-971c-1502ec6c4db2 a unique resource uuid and indifferent to the storage provider (aws / gc)? Is this reliable and future proof?

It is a unique resource uuid. But just knowing uuid you cannot get the bucket URL without querying anything, so we would definitely need the URLs. Otherwise we would need to run a query. Unless the data is retracted (e.g., if there is a PHI breach, or scientific misconduct), the file corresponding to this UUID should remain available.

We can mention the IDC version and SeriesInstanceUID alongside to provide the right context, or is there a method to derive the resource uid based on the information without running a big-query (assuming such a big query would require authorization).

Yes, the information may be redundant (you can get the file from the bucket URL alone), but if the file moves (which is rare, but may happen), you would be able to query to hopefully find the new location. And SeriesInstanceUID is going to be in the DICOM file, while crdc_series_uuid is not a DICOM attribute.

Note that you can now query basic metadata fields without login and without BigQuery - see https://github.com/ImagingDataCommons/idc-index. Note that right now crdc_series_uuid is not one of the columns included in idc-index. We will revise this after IDC data release v17.

LennyN95 · 2023-11-28T15:38:27Z

Note that you can now query basic metadata fields without login and without BigQuery

That's awesome! So once crdc_series_uuid is available in idc-index, we can get to individual sample data by the idc version and the SeriesInstanceUID alone - cool! Until then, we can use the resource uuid to download the resources. If you agree, then I'll propose some format to extend the meta.json with (that'll work for our semi and fully automated test pipeline in the future).

fedorov · 2023-11-28T17:33:59Z

That's awesome! So once crdc_series_uuid is available in idc-index, we can get to individual sample data by the idc version and the SeriesInstanceUID alone - cool! Until then, we can use the resource uuid to download the resources.

At least right now, idc-index handles only the latest version of IDC. We need to think and discuss if and how we would want to support prior versions.

I still think all of the 4 attributes I suggested would be good to have in the model metadata.

LennyN95 added the enhancement New feature or request label Nov 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update model description to include pointers to suitable images in IDC #68

Update model description to include pointers to suitable images in IDC #68

fedorov commented Nov 22, 2023

LennyN95 commented Nov 23, 2023

fedorov commented Nov 24, 2023

LennyN95 commented Nov 24, 2023

fedorov commented Nov 24, 2023

LennyN95 commented Nov 24, 2023

fedorov commented Nov 24, 2023

LennyN95 commented Nov 27, 2023

fedorov commented Nov 28, 2023 •

edited

Loading

LennyN95 commented Nov 28, 2023

fedorov commented Nov 28, 2023

Update model description to include pointers to suitable images in IDC #68

Update model description to include pointers to suitable images in IDC #68

Comments

fedorov commented Nov 22, 2023

LennyN95 commented Nov 23, 2023

fedorov commented Nov 24, 2023

LennyN95 commented Nov 24, 2023

fedorov commented Nov 24, 2023

LennyN95 commented Nov 24, 2023

fedorov commented Nov 24, 2023

LennyN95 commented Nov 27, 2023

fedorov commented Nov 28, 2023 • edited Loading

LennyN95 commented Nov 28, 2023

fedorov commented Nov 28, 2023

fedorov commented Nov 28, 2023 •

edited

Loading