Download error for list of WP accessions #372

Open
ericcox1 opened this issue Jun 3, 2024 · 11 comments

Comments

@ericcox1
Collaborator

ericcox1 commented Jun 3, 2024

Hi @olearyna

I updated through conda --update, and the version now showing is 16.18.1. This is my command (I ran it with --debug as suggested):

datasets download gene accession --inputfile ~/Desktop/wp_1_50 --filename wp150 --include gene,protein --debug
The error is:

Error: Download error: http2: server sent GOAWAY and closed the connection; LastDownloading: ncbi_dataset.zip 4.62MB error
Find attached the screen capture with the phid.

[Screen capture showing the phid]

Thanks!

Originally posted by @carolinasisco in #360 (comment)

@ericcox1
Collaborator Author

ericcox1 commented Jun 3, 2024

Hi @carolinasisco,

I'm opening a new issue since this is a separate problem from #360. We are continuing to investigate.

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
NIH/NLM/NCBI

@ericcox1
Collaborator Author

ericcox1 commented Jun 4, 2024

Hi @carolinasisco,

Download failures for WP accessions annotated on large numbers of genomes are a known problem, and we are continuing to research ways to make this work.

For example, I tested individual downloads of each WP accession in the list you provided and I was able to successfully download most of them, while I saw reproducible failures with WP_003084404.1, which is annotated on ~28 k genome assemblies.
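The one-accession-at-a-time test described above can be scripted. This is a minimal sketch, assuming the `datasets` CLI is on the PATH and exits with a non-zero code when a download fails. The function names and the injectable `run` parameter are illustrative and not part of the NCBI tooling.

```python
import subprocess
from typing import Callable, Iterable, List


def download_command(accession: str) -> List[str]:
    """Build the CLI call for one accession, using the flags shown in this thread."""
    return [
        "datasets", "download", "gene", "accession", accession,
        "--include", "gene,protein",
        "--filename", f"{accession}.zip",
    ]


def run_cli(command: List[str]) -> int:
    """Run the command and return its exit code (assumes `datasets` is installed)."""
    return subprocess.run(command).returncode


def find_failing_accessions(
    accessions: Iterable[str],
    run: Callable[[List[str]], int] = run_cli,
) -> List[str]:
    """Return the accessions whose individual download exits with a non-zero code."""
    return [acc for acc in accessions if run(download_command(acc)) != 0]
```

Passing a different `run` callable makes it possible to dry-run the loop without contacting NCBI.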

Let me ask you, for proteins such as WP_003084404.1, are you interested in downloading the genomic sequence from all ~28 k genomes on which this protein is annotated? If the answer is yes, then I can tell you that we will continue looking at ways to make this work. If not, we may be able to point you to easier ways to get a smaller set of genome sequences.

I look forward to hearing from you soon.

Best,
Eric

@carolinasisco

Hi @ericcox1

Thank you so much for your effort. I'm only interested in obtaining the nucleotide and amino acid sequences of these proteins from Pseudomonas aeruginosa PA14, a hypervirulent strain (for more context, I will be working with genome metabolic models).
Carol

@ericcox1
Collaborator Author

ericcox1 commented Jun 4, 2024

That's helpful information. Of the 50 WP accessions in your list, I found 4 WPs that are annotated on PA14 genomes: WP_003088572.1, WP_003101261.1, WP_003109333.1, WP_003138346.1

Normally, I would suggest using this command to specifically request genomic sequence from the Taxid of interest:

datasets download gene accession WP_003088572.1 --include gene,protein --taxon-filter 'Pseudomonas aeruginosa PA14' --filename PA14.zip

However, I found a bug where we are incorrectly reporting an error for this command. While we investigate this bug, here is an alternative approach that uses curl against our API to get the data:

curl -o PA14-proteins.zip "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/protein/accession/WP_003088572.1%2CWP_003101261.1%2CWP_003109333.1%2CWP_003138346.1/download?include_annotation_type=FASTA_GENE&include_annotation_type=FASTA_PROTEIN&taxon=652611"
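For a different set of accessions, the URL can be assembled programmatically instead of edited by hand. The sketch below only builds the string. It uses the same endpoint and query parameters as the curl command above, and the function name is an assumption made for illustration.

```python
from typing import Iterable
from urllib.parse import quote, urlencode

API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"


def protein_download_url(accessions: Iterable[str], taxon: str) -> str:
    """Build a download URL for gene and protein FASTA, limited to one taxon."""
    # Accessions go into the path as one comma-separated value; the comma
    # is percent-encoded as %2C, matching the curl example.
    joined = quote(",".join(accessions), safe="")
    query = urlencode([
        ("include_annotation_type", "FASTA_GENE"),
        ("include_annotation_type", "FASTA_PROTEIN"),
        ("taxon", taxon),
    ])
    return f"{API_BASE}/protein/accession/{joined}/download?{query}"
```

The returned string can be passed to curl, as in `curl -o PA14-proteins.zip "<url>"`.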

Here's a peek at what the FASTA headers will look like for the genomic sequence included in the package:

unzip -cq PA14-proteins.zip ncbi_dataset/data/gene.fna | grep '>' | head -5
>NZ_CP104980.1:c3445703-3444279 TIGR00366 family protein [protein_accession=WP_003088572.1] [organism=Pseudomonas aeruginosa PA14] [name=TIGR00366 family protein]
>NZ_CP104980.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] [organism=Pseudomonas aeruginosa PA14] [name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]
>NZ_CP104980.1:1482995-1483945 accA [protein_accession=WP_003109333.1] [organism=Pseudomonas aeruginosa PA14] [name=acetyl-CoA carboxylase carboxyl transferase subunit alpha] [gene=accA]
>NZ_CP104981.1:c3445703-3444279 TIGR00366 family protein [protein_accession=WP_003088572.1] [organism=Pseudomonas aeruginosa PA14] [name=TIGR00366 family protein]
>NZ_CP104981.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] [organism=Pseudomonas aeruginosa PA14] [name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]
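Headers in this format are straightforward to split into fields. The following is a rough sketch that assumes every header follows the layout shown above: a location token followed by `[key=value]` pairs. The function name is hypothetical.

```python
import re
from typing import Dict

# Matches bracketed pairs such as [protein_accession=WP_003101261.1]
_ATTRIBUTE = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_gene_header(line: str) -> Dict[str, str]:
    """Split one gene.fna header into its location and [key=value] attributes."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    parts = line[1:].split(None, 1)
    # The first whitespace-delimited token is the sequence ID plus coordinates.
    record = {"location": parts[0] if parts else ""}
    record.update(_ATTRIBUTE.findall(line))
    return record
```

Grouping the parsed records by `protein_accession` makes it easy to see that the same WP accession appears on more than one sequence record, for example NZ_CP104980.1 and NZ_CP104981.1 in the output above.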

-Eric

@carolinasisco

Hi @ericcox1

Thank you for your help, I tried with curl and it worked!
Please let us know when datasets download for WP accessions is working again.

Carol

@gabepen

gabepen commented Jul 1, 2024

Checking in on the progress for this. I'm also having issues downloading WP accessions; specifically, many within the Gammaproteobacteria subtree are failing with the zip archive error. I am getting some success on certain sets of accessions (I'm downloading hundreds of different sets), but it's rare. I've attached the debug output of one of the sets I am trying to download within a workflow. I'm not sure if it's a different issue, but when using the datasets Python API I get this error:

gene_ids_for_accessions = [int(gene_rec.gene.gene_id) for gene_rec in gene_reply.genes]
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'gene_id'

This is using the function described here: https://www.ncbi.nlm.nih.gov/datasets/docs/v1/how-tos/genes/download-gene-data-package/
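The traceback indicates that at least one record in `gene_reply.genes` has its `gene` attribute set to None. A defensive version of the comprehension could skip those records. In the sketch below, `SimpleNamespace` objects stand in for the reply; the real response classes of the Python client are not reproduced here, and the helper name is made up for illustration.

```python
from types import SimpleNamespace
from typing import Any, List


def gene_ids_from_reply(gene_reply: Any) -> List[int]:
    """Collect gene IDs, skipping records whose `gene` attribute is None."""
    records = getattr(gene_reply, "genes", None) or []
    return [
        int(rec.gene.gene_id)
        for rec in records
        if getattr(rec, "gene", None) is not None
    ]


# Stand-in data: one resolved record and one that came back without a gene.
example_reply = SimpleNamespace(genes=[
    SimpleNamespace(gene=SimpleNamespace(gene_id="12345")),
    SimpleNamespace(gene=None),
])
print(gene_ids_from_reply(example_reply))
```

This avoids the AttributeError but does not explain why a record comes back empty, so it may be worth logging which accessions were skipped.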

CLI output:
datasets download gene accession WP_042478116.1 WP_035893615.1 WP_010672261.1 WP_054661990.1 WP_024306120.1 WP_006660648.1 WP_003097132.1 WP_043192052.1 WP_000896506.1 WP_111259221.1 WP_067643804.1 WP_075184776.1 WP_009684663.1 WP_003367855.1 --include gene --filename outpath.zip --debug
2024/07/01 12:47:06
POST /datasets/v2alpha/protein/accession/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 296
Accept: text/plain,application/zip
Content-Type: application/json
Ncbi-Phid: 3E7681ACC0FC1E4745A072E1
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_042478116.1 WP_035893615.1 WP_010672261.1 WP_054661990.1 WP_024306120.1 WP_006660648.1 WP_003097132.1 WP_043192052.1 WP_000896506.1 WP_111259221.1 WP_067643804.1 WP_075184776.1 WP_009684663.1 WP_003367855.1 --include gene --filename outpath.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"accessions":["WP_042478116.1","WP_035893615.1","WP_010672261.1","WP_054661990.1","WP_024306120.1","WP_006660648.1","WP_003097132.1","WP_043192052.1","WP_000896506.1","WP_111259221.1","WP_067643804.1","WP_075184776.1","WP_009684663.1","WP_003367855.1"],"include_annotation_type":["FASTA_GENE"]}

2024/07/01 12:47:11
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Mon, 01 Jul 2024 19:47:11 GMT
Grpc-Metadata-Logging-Accessions: WP_000896506.1,WP_003097132.1,WP_003367855.1,WP_006660648.1,WP_009684663.1,WP_010672261.1,WP_024306120.1,WP_035893615.1,WP_042478116.1,WP_043192052.1,WP_054661990.1,WP_067643804.1,WP_075184776.1,WP_111259221.1
Grpc-Metadata-Logging-Accessions_count: 14
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Include_annotation_type: FASTA_GENE
Grpc-Metadata-Logging-Service: prokaryote
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 3E7681ACC0FC1E4745A072E1.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Downloading: outpath.zip 934kB done
Validating package []
Error: Internal error (invalid zip archive). Please try again
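To check the downloaded file independently of the CLI's validation step, the archive can be inspected with Python's standard `zipfile` module. This is only a local diagnostic sketch and the function name is an assumption; it does not explain why the server response was rejected.

```python
import zipfile
from typing import Optional


def zip_problem(path: str) -> Optional[str]:
    """Describe what is wrong with the archive, or return None if it looks intact."""
    if not zipfile.is_zipfile(path):
        return "not a zip archive (possibly truncated or not zip data at all)"
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the first member that fails its CRC check, if any.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        return f"unreadable archive: {exc}"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return None
```

For the run above, that would be `zip_problem("outpath.zip")`.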

@ericcox1
Collaborator Author

ericcox1 commented Jul 2, 2024

Hi @gabepen,

Thanks for your comment.

We haven't had a chance to look into this yet due to other institutional priorities.

Just to confirm, are you interested in downloading all underlying genomic sequences for each protein in your query? For example, for WP_003097132.1, this protein is annotated on close to 10 k genomes. Do you need the genomic sequences from each of the 10 k genomes?

Best,
Eric

@gabepen

gabepen commented Jul 3, 2024

@ericcox1

It depends on each query. It doesn't seem to be a gene dataset size issue, though: I've tested the --taxon-filter option and still get the zip archive error for a single sequence.

I am also noticing that certain tax IDs return this error: no download gene by accession data is currently available for this taxon. Is there a reason for this?

@ericcox1
Collaborator Author

ericcox1 commented Jul 5, 2024

Thanks @gabepen.

I've tested the --taxon-filter option and still get the zip archive error for a single sequence.

Good point. We are going to release a fix for this bug sometime next week.

Fixing the download errors for WP accessions annotated on many thousands of genomes is going to take some more research and we don't have a definite timeline for this yet.

I am also noticing that certain tax IDs return this error: no download gene by accession data is currently available for this taxon. Is there a reason for this?

Could you please share an example for this issue?

Best,
Eric

@gabepen

gabepen commented Jul 5, 2024

@ericcox1

I thought I was getting the error for taxa without a genome labeled as a reference, but I tested it with this taxon and received the same error:


datasets download gene accession 'WP_000818647.1' --include gene --taxon-filter 2774015 --debug
2024/07/05 14:38:23 
POST /datasets/v2alpha/taxonomy/taxon_suggest HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 130
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"exact_match":true,"tax_rank_filter":"higher_taxon","taxon_query":"2774015","taxon_resource_filter":"TAXON_RESOURCE_FILTER_ALL"}

2024/07/05 14:38:27 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:27 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/07/05 14:38:27 
POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"returned_content":"COMPLETE","taxons":["2774015"]}

2024/07/05 14:38:28 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:27 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/07/05 14:38:28 
POST /datasets/v2alpha/taxonomy/taxon_suggest HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 153
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"exact_match":true,"tax_rank_filter":"higher_taxon","taxon_query":"Pectobacterium quasiaquaticum","taxon_resource_filter":"TAXON_RESOURCE_FILTER_GENE"}

2024/07/05 14:38:28 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:28 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


Error: The taxonomy ID '2774015' is valid for 'Pectobacterium quasiaquaticum', but no download gene by accession data is currently available for this taxon.

And then I noticed I get a different error when the WP accessions are passed as a list:


datasets download gene accession ['WP_000818647.1'] --include gene --taxon-filter 2774015 --debug
2024/07/05 14:40:35 
POST /datasets/v2alpha/gene HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 146
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 115B94F10BCFA5CFB1656134
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession [WP_000818647.1] --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"accessions":["[WP_000818647.1]"],"include_tabular_header":"INCLUDE_TABULAR_HEADER_FIRST_PAGE_ONLY","page_size":1,"returned_content":"IDS_ONLY"}

2024/07/05 14:40:39 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:40:39 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 115B94F10BCFA5CFB1656134.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Error: No genes found that match selection

If I pass a large enough list of accessions I will find some gene records for the taxid I tested above, but I'm confident that the single one tested is annotated on the reference genome.
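The request body in the second debug log shows the accession sent as the literal string "[WP_000818647.1]", so the brackets reached the CLI as part of the argument rather than being read as list syntax. If the command line is being assembled from a Python list inside the workflow (an assumption about the setup), each accession should be passed as its own argument. A minimal sketch with a hypothetical helper name:

```python
from typing import List, Sequence


def gene_accession_argv(accessions: Sequence[str], taxon: str) -> List[str]:
    """Build the argument list with one argv entry per accession."""
    if isinstance(accessions, str):
        raise TypeError("pass a sequence of accessions, not a single string")
    return [
        "datasets", "download", "gene", "accession",
        *accessions,  # expanded, so no brackets or quotes end up in the arguments
        "--include", "gene",
        "--taxon-filter", taxon,
    ]
```

The result can be handed to `subprocess.run(...)` directly, which avoids formatting the list into a shell string.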

@ericcox1
Collaborator Author

Hi @carolinasisco and @gabepen,

Thanks for your patience.

Here's an update:

  1. We are continuing to investigate how to better support requests for large numbers of genome sequences (10 k+) for a given WP.
  2. We have fixed the bug where we were incorrectly returning an error for certain WPs.

For example, after updating to 16.23.0, this now works:

datasets download gene accession WP_003088572.1 --include gene,protein --taxon-filter 'Pseudomonas aeruginosa PA14' --filename PA14.zip
Downloading: PA14.zip    4.69kB valid zip structure -- files not checked
Validating package [================================================] 100% 6/6
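In a workflow, it may help to confirm that the installed client is at least 16.23.0 before relying on this. The client version appears in the X-Datasets-Client-Version header of the --debug output earlier in this thread. The sketch below compares version strings numerically. It assumes plain dotted-integer versions, and the helper names are illustrative.

```python
from typing import Tuple

FIXED_IN = "16.23.0"  # version named in this thread as containing the fix


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '16.20.0' into (16, 20, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def has_taxon_filter_fix(client_version: str) -> bool:
    """True if the client version is the fixed release or newer."""
    return version_tuple(client_version) >= version_tuple(FIXED_IN)
```

Comparing tuples of integers rather than raw strings keeps a version such as 16.100.0 from sorting before 16.23.0.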

Best,
Eric

3 participants