Download error for list of WP accessions #372
Comments
Hi @carolinasisco, I'm opening a new issue since this is a separate problem from #360. We are continuing to investigate. Best, Eric Cox, PhD [Contractor] (he/him/his) |
Hi @carolinasisco, Download failures for WP accessions annotated on large numbers of genomes are a known problem, and we are continuing to research ways to make this work. For example, I tested individual downloads of each WP accession in the list you provided and was able to download most of them successfully, but I saw reproducible failures with WP_003084404.1, which is annotated on ~28 k genome assemblies. Let me ask you: for proteins such as WP_003084404.1, are you interested in downloading the genomic sequence from all ~28 k genomes on which this protein is annotated? If the answer is yes, then I can tell you that we will continue looking at ways to make this work. If not, we may be able to point you to easier ways to get a smaller set of genome sequences. I look forward to hearing from you soon. Best, |
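The per-accession test described in the comment above can be sketched in Python. This is a minimal sketch, not the maintainers' procedure: it assumes the `datasets` CLI is on the PATH and exits with a non-zero status when a download fails, and the helper names (`build_download_command`, `run_command`, `find_failing_accessions`) are hypothetical. The `--include` and `--filename` flags are the ones used elsewhere in this thread.

```python
import subprocess
from typing import Callable, Iterable, List


def build_download_command(accession: str, filename: str) -> List[str]:
    """Build a `datasets download gene accession` call for one WP accession."""
    return [
        "datasets", "download", "gene", "accession", accession,
        "--include", "gene,protein",
        "--filename", filename,
    ]


def run_command(command: List[str]) -> bool:
    """Run the command; assume (not verified) that exit status 0 means success."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def find_failing_accessions(
    accessions: Iterable[str],
    runner: Callable[[List[str]], bool] = run_command,
) -> List[str]:
    """Download each accession on its own and collect the ones that fail."""
    failing = []
    for accession in accessions:
        command = build_download_command(accession, f"{accession}.zip")
        if not runner(command):
            failing.append(accession)
    return failing
```

Calling `find_failing_accessions(["WP_003088572.1", "WP_003084404.1"])` would run two separate downloads and return whichever accessions did not complete. The `runner` argument is injectable so the loop can be exercised without network access.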
Hi @ericcox1 Thank you so much for your effort. I'm only interested in obtaining the nucleotide and amino acid sequences of these proteins from Pseudomonas aeruginosa PA14, a hypervirulent strain (for more context, I will work with genome metabolic models). |
That's helpful information. Of the 50 WP accessions in your list, I found 4 that are annotated on PA14 genomes: WP_003088572.1, WP_003101261.1, WP_003109333.1, WP_003138346.1. Normally, I would suggest using this command to specifically request genomic sequence from the Taxid of interest:
Here's a peek at what the FASTA headers will look like for the genomic sequence included in the package:
-Eric |
Hi @ericcox1 Thank you for your help, I tried with curl and it worked! Carol |
Checking in on the progress for this. I'm also having issues downloading WP accessions; specifically, many within the gammaproteobacteria subtree are failing with the zip archive error. I am getting some success on certain sets of accessions (I'm downloading hundreds of different sets), but it's rare. I've attached the debug output of one of the sets I am trying to download within a workflow.
I'm not sure if it's a different issue, but when using the datasets Python API I get this error:
gene_ids_for_accessions = [int(gene_rec.gene.gene_id) for gene_rec in gene_reply.genes]
This is using the function described here: https://www.ncbi.nlm.nih.gov/datasets/docs/v1/how-tos/genes/download-gene-data-package/
CLI output:
{"accessions":["WP_042478116.1","WP_035893615.1","WP_010672261.1","WP_054661990.1","WP_024306120.1","WP_006660648.1","WP_003097132.1","WP_043192052.1","WP_000896506.1","WP_111259221.1","WP_067643804.1","WP_075184776.1","WP_009684663.1","WP_003367855.1"],"include_annotation_type":["FASTA_GENE"]}
2024/07/01 12:47:11 New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets. |
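The traceback for the Python API error is not quoted above, so its cause is unknown. One possibility, and it is only an assumption, is that the reply has no gene list or that some records lack a gene entry. A defensive version of the same list comprehension could look like the sketch below; `extract_gene_ids` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for the real reply objects, mirroring the attribute names in the quoted line.

```python
from types import SimpleNamespace
from typing import Any, List


def extract_gene_ids(gene_reply: Any) -> List[int]:
    """Collect integer gene IDs, skipping records that carry no gene ID."""
    gene_ids = []
    # Treat a missing or None `genes` attribute as an empty list.
    for gene_rec in getattr(gene_reply, "genes", None) or []:
        gene = getattr(gene_rec, "gene", None)
        gene_id = getattr(gene, "gene_id", None)
        if gene_id is None:
            continue  # no gene record for this accession
        gene_ids.append(int(gene_id))
    return gene_ids


# Stand-in reply objects (hypothetical data, not real API output).
example_reply = SimpleNamespace(
    genes=[
        SimpleNamespace(gene=SimpleNamespace(gene_id="12345")),
        SimpleNamespace(gene=None),
    ]
)
example_ids = extract_gene_ids(example_reply)
```

This does not fix a server-side failure; it only keeps the client code from raising while the remaining accessions are processed.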
Hi @gabepen, Thanks for your comment. We haven't had a chance to look into this yet due to other institutional priorities. Just to confirm, are you interested in downloading all underlying genomic sequences for each protein in your query? For example, for WP_003097132.1, this protein is annotated on close to 10 k genomes. Do you need the genomic sequences from each of the 10 k genomes? Best, |
It depends on each query. It doesn't seem to be a gene dataset size issue, though: I've tested the --taxon-filter option and still get the zip archive error for a single sequence. I am also noticing that certain tax IDs return this error: download gene by accession data is currently available for this taxon. Is there a reason for this? |
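When many sets are downloaded inside a workflow, it can help to check each package before later steps try to read it. The exact text of the zip archive error is not quoted in this thread, so the sketch below only assumes that a failed download leaves a missing, truncated, or otherwise unreadable file. `is_usable_zip` is a hypothetical helper built on Python's standard `zipfile` module.

```python
import zipfile
from pathlib import Path
from typing import Union


def is_usable_zip(path: Union[str, Path]) -> bool:
    """Return True only if the file exists, is a zip, and passes a CRC check."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the first corrupt member name, or None if all are fine.
        return archive.testzip() is None
```

A workflow step could call `is_usable_zip("ncbi_dataset.zip")` after each download and retry or log the accession set when it returns False.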
Thanks @gabepen.
Good point. We are going to release a fix for this bug sometime next week. Fixing the download errors for WP accessions annotated on many thousands of genomes is going to take some more research and we don't have a definite timeline for this yet.
Could you please share an example for this issue? Best, |
I thought I was getting the error for taxa without a genome labeled as a reference, but I tested it with this taxon and received the same error:
And then I noticed I get a different error when the WP accessions are passed as a list:
If I pass a large enough list of accessions I will find some gene records for the taxid I tested above, but I'm confident that the single one tested is annotated on the reference genome. |
Hi @carolinasisco and @gabepen, Thanks for your patience. Here's an update:
For example, after updating to
Best, |
Hi @olearyna
I updated through conda --update; the version showing is 16.18.1. This is my code (I ran it with --debug as suggested):
datasets download gene accession --inputfile ~/Desktop/wp_1_50 --filename wp150 --include gene,protein --debug
The error is:
Error: Download error: http2: server sent GOAWAY and closed the connection; LastDownloading: ncbi_dataset.zip 4.62MB error
Find attached the screen capture with the phid.
Thanks!
Originally posted by @carolinasisco in #360 (comment)