Identify the excluded genomes and why they are excluded #1

Open
ccbaumler opened this issue Apr 26, 2023 · 0 comments

Upon comparing the two final manifests:

wc -l manifest/sourmash.manifest.original.csv
65705 manifest/sourmash.manifest.original.csv

wc -l manifest/sourmash.manifest.csv
64375 manifest/sourmash.manifest.csv

comm -23 <(sort manifest/sourmash.manifest.original.csv) <(sort manifest/sourmash.manifest.csv) | wc -l
1330

The 1330 excluded genomes may be exported into a new file excluded.genomes.csv:

head excluded.genomes.csv
signatures/000eca019c6c56e66c37f649faabea61.sig.gz,000eca019c6c56e66c37f649faabea61,000eca01,31,DNA,0,1000,5917,1,"GCF_000745545.1 Caulobacter henricii strain=CF287, ASM74554v1",/dev/fd/63
signatures/0013ee76e5e54ff16572c6019ac87675.sig.gz,0013ee76e5e54ff16572c6019ac87675,0013ee76,31,DNA,0,1000,3113,1,"GCF_004345375.1 Nitrosomonas sp. Nm134 strain=Nm134, ASM434537v1",/dev/fd/63
signatures/004c6c71df933b0727f08d39ac08ede0.sig.gz,004c6c71df933b0727f08d39ac08ede0,004c6c71,31,DNA,0,1000,3976,1,"GCF_011065905.1 Clostridium estertheticum strain=FP4, ASM1106590v1",/dev/fd/63
signatures/006c16914076de98b82c3051fd6d3152.sig.gz,006c16914076de98b82c3051fd6d3152,006c1691,31,DNA,0,1000,4587,1,"GCF_000968535.2 Methylomicrobium alcaliphilum 20Z strain=20Z, ASM96853v1",/dev/fd/63
signatures/007e39a06d05ecbd99f4171d71bcd29f.sig.gz,007e39a06d05ecbd99f4171d71bcd29f,007e39a0,31,DNA,0,1000,4581,1,"GCF_000245055.1 Desulfovibrio sp. U5L strain=U5L, ASM24505v1",/dev/fd/63
signatures/00a238371d14b6c28dec7e031038892d.sig.gz,00a238371d14b6c28dec7e031038892d,00a23837,31,DNA,0,1000,3197,1,"GCF_000211855.2 Lacinutrix sp. 5H-3-7-4 strain=5H-3-7-4, ASM21185v3",/dev/fd/63
signatures/00e81a5cad7337fcb893c47b08b6deb8.sig.gz,00e81a5cad7337fcb893c47b08b6deb8,00e81a5c,31,DNA,0,1000,9994,1,"GCF_003752655.1 Streptomyces griseorubiginosus strain=SAI-142, ASM375265v1",/dev/fd/63
signatures/014909b6d652f382c8204d23cb3f144f.sig.gz,014909b6d652f382c8204d23cb3f144f,014909b6,31,DNA,0,1000,2707,1,"GCA_011333355.1 Deltaproteobacteria bacterium, ASM1133335v1",/dev/fd/63
signatures/017eec1003b2034b03ed00d3f18179a6.sig.gz,017eec1003b2034b03ed00d3f18179a6,017eec10,31,DNA,0,1000,4379,1,"GCF_000243715.2 Leptospira broomii serovar Hurstbridge str. 5399 strain=5399, gls454050v02",/dev/fd/63
signatures/01c84d63434ff050360b40cb49897db9.sig.gz,01c84d63434ff050360b40cb49897db9,01c84d63,31,DNA,0,1000,5848,1,"GCF_000364225.1 Eubacterium plexicaudatum ASF492 strain=ASF492, Euba_plex_ASF492_V1",/dev/fd/63
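
How excluded.genomes.csv was actually written is not shown above. One way to produce it is to redirect the same comm comparison to a file. The sketch below is a minimal version of that idea; the function name export_excluded is hypothetical, and it sorts into temporary files instead of using process substitution so that it also runs under plain sh.

```shell
# Write the rows present in the first CSV but absent from the second to a third file.
# Hypothetical helper, not part of the repository.
export_excluded() {
    original=$1
    current=$2
    out=$3
    tmpdir=$(mktemp -d) || return 1
    # comm requires both inputs to be sorted the same way
    sort "$original" > "$tmpdir/original.sorted"
    sort "$current" > "$tmpdir/current.sorted"
    # -2 drops lines unique to the second file, -3 drops lines common to both
    comm -23 "$tmpdir/original.sorted" "$tmpdir/current.sorted" > "$out"
    rm -rf "$tmpdir"
}
```

Usage would look like: export_excluded manifest/sourmash.manifest.original.csv manifest/sourmash.manifest.csv excluded.genomes.csv. Lines shared by both manifests are dropped, which is consistent with the head output above showing only data rows.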

Of the first five entries in excluded.genomes.csv, three are suppressed, one has been replaced by a newer version, and one looks like it should have worked:

genbank          refseq           name          status
GCA_000745545.1  GCF_000745545.1  ASM74554v1    suppressed
GCA_004345375.1  GCF_004345375.1  ASM434537v1   suppressed
GCA_011065905.1  GCF_011065905.1  ASM1106590v1  replaced v2
GCA_000968535.1  GCF_000968535.2  ASM96853v1    Appears good
GCA_000245055.1  GCF_000245055.1  ASM24505v1    suppressed

I am thinking of adding a script that parses the GenBank/RefSeq accession out of each entry and checks its status on the FTP server.
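
As a rough sketch of the parsing step, the accession can be pulled out of the name column with a regular expression and mapped to a directory on the NCBI genomes server. Both function names are hypothetical. The URL layout (genomes/all/PREFIX/three-digit chunks of the accession number) is an assumption about how the server is organized, and using these URLs as the contents of ftp.list.txt is also an assumption.

```shell
# Print each unique assembly accession (GCA_/GCF_ plus version) found in a CSV file.
# Hypothetical helper.
extract_accessions() {
    grep -oE 'GC[AF]_[0-9]{9}\.[0-9]+' "$1" | sort -u
}

# Map an accession such as GCF_000745545.1 to its assumed directory URL.
# Hypothetical helper; the path layout is an assumption.
accession_to_url() {
    acc=$1
    prefix=${acc%%_*}       # GCF or GCA
    digits=${acc#*_}        # 000745545.1
    digits=${digits%%.*}    # 000745545
    a=$(printf '%s' "$digits" | cut -c1-3)
    b=$(printf '%s' "$digits" | cut -c4-6)
    c=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/\n' "$prefix" "$a" "$b" "$c"
}
```

A list of URLs could then be built with: extract_accessions excluded.genomes.csv | while read -r acc; do accession_to_url "$acc"; done > ftp.list.txt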

xargs <ftp.list.txt curl -I 2>&1 | awk '/HTTP\// {print $2}'

or

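#!/bin/bash
# Report every URL in ftp.list.txt whose response carries an error status.
while IFS= read -r site
do
        # --spider checks the URL without downloading it; -S prints the server response headers
        if wget --spider -S "$site" 2>&1 | grep -qE 'HTTP/[0-9.]+ (403|404|500|502|503)' ; then
                    echo "$site is down"
        fi
done < ftp.list.txt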
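
A possible variant, sketched here under the assumption that ftp.list.txt holds one HTTP(S) URL per line, keeps the status code next to each URL so that suppressed and replaced entries can be told apart later. The function names classify_status and check_urls are hypothetical. The curl options used (-s, -o, -I, -w with %{http_code}) are standard ones.

```shell
# Turn an HTTP status code, as reported by curl's %{http_code}, into a short verdict.
# Hypothetical helper.
classify_status() {
    case "$1" in
        2??|3??) echo "ok" ;;
        000)     echo "unreachable" ;;   # curl prints 000 when it received no HTTP response
        *)       echo "down" ;;
    esac
}

# Print "URL<TAB>code<TAB>verdict" for every non-empty line of the given list file.
# Hypothetical helper.
check_urls() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        # -I sends a HEAD request; -o /dev/null discards the headers; -w prints only the code
        code=$(curl -s -o /dev/null -I -w '%{http_code}' "$url")
        printf '%s\t%s\t%s\n' "$url" "$code" "$(classify_status "$code")"
    done < "$1"
}
```

Running check_urls ftp.list.txt > ftp.status.tsv would give a table that can be joined back to excluded.genomes.csv (the output file name is only an example).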