
Inquiry on Building CCDG Pipeline from Fastq Files for Custom Sequencing Samples #2

Open
TanXinjiang opened this issue Nov 12, 2024 · 2 comments

Comments

@TanXinjiang

Hello,

Thank you for the fantastic preprint on The Great Genotyper. I am very interested in using this tool for my analysis. However, I noticed that the documentation does not provide a detailed workflow for building the CCDG indexes starting from raw sequencing samples, such as a set of FASTQ files from around 10 samples.

Could you please provide some guidance on how to start the CCDG process from FASTQ files? In particular, any recommendations on the steps and resources required to go from raw FASTQ data to the CCDG index would be greatly appreciated.

Thank you very much for your assistance and for developing this valuable tool.

Best regards,
Xinjiang Tan

@shokrof
Collaborator

shokrof commented Nov 21, 2024

Hi Xinjiang,

Thank you for your interest in The Great Genotyper. I hope it proves to be a valuable tool in your research, and I am eager to hear your feedback. For guidance on creating the CCDG, you can visit the following link: The Great Genotyper - Database Builder.

Creating a database with 10 samples is quite straightforward: it should take between 5 and 10 hours and require less than 64 GB of memory. Please note that processing times vary based on how similar the samples are to each other. If you need any further assistance with creating the index, please do not hesitate to reach out.
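For orientation, here is a minimal sketch of the general shape of the build, assuming KMC for per-sample k-mer counting and metagraph for graph construction and annotation. The sample names, k-mer size, and count threshold below are illustrative placeholders, not the pipeline's defaults; the authoritative commands are in the Database Builder workflow linked above.

```python
# Hedged sketch of a CCDG-style build for a small cohort:
# per-sample k-mer counting, one joint de Bruijn graph, then
# per-sample annotation. Names and parameters are assumptions.
import subprocess

SAMPLES = ["sampleA", "sampleB"]  # hypothetical sample names
K = 31                            # assumed k-mer size

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for s in SAMPLES:
    # 1. Count k-mers per sample with KMC; <sample>.lst lists that
    #    sample's FASTQ files, -ci2 drops singleton (likely error) k-mers.
    run(["kmc", f"-k{K}", "-ci2", f"@{s}.lst", s, "tmp/"])

# 2. Build one joint graph over all samples with metagraph, spilling
#    intermediate buffers to disk (--disk-swap) to bound peak memory.
run(["metagraph", "build", "-k", str(K), "--disk-swap", "tmp/",
     "-o", "cohort_graph"] + [f"{s}.kmc_suf" for s in SAMPLES])

# 3. Annotate the graph so each k-mer records which samples contain it.
for s in SAMPLES:
    run(["metagraph", "annotate", "-i", "cohort_graph.dbg",
         "--anno-filename", "-o", f"{s}_anno", f"{s}.kmc_suf"])
```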

Best regards,

Moustafa

@TanXinjiang
Author

Hi Moustafa,

Thank you for your response. I have attempted to use The Great Genotyper in two ways:

  1. I downloaded your prebuilt CHB indexes and tried genotyping with the VCF deconstructed from the HPRC pangenome graph. Memory usage was extremely high: on a server with 500 GB of RAM, the process was killed during the "Determine unique kmers for chromosome: chrX" step. However, when using the chr22 VCF as the reference together with the whole-genome CCDG indexes, peak memory usage was 70 GB, and the results, benchmarked against NA18534's PAV calls, showed an SV precision of 0.65, recall of 0.53, and F1-score of 0.58. With the chr2 VCF as the reference, peak memory usage increased to 227 GB, with SV precision at 0.70, recall at 0.52, and F1-score at 0.60.

    Upon reviewing your code, I noticed that you appear to divide the CCDG into chunks. Would this approach improve the genotyping results? Also, how should the CCDG be mapped to chromosome-specific chunks? For now I have been splitting the input VCF per chromosome myself (see the sketch after this list).

    I also noticed that a significant amount of time and memory goes into reading and processing the pangenome reference VCF. PanGenie v3.0.0 and later encourages using PanGenie-index to preprocess VCF files; do you have any similar development plans for The Great Genotyper?

  2. I attempted to follow the CCDG build process and noticed that the metagraph build step reads and writes many temporary files, requiring high I/O performance. When I removed the --disk-swap option, peak memory usage for a single sample reached 80 GB. Is this performance expected?
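
Regarding the chunking in point 1, this is roughly how I produced the per-chromosome VCFs (a minimal sketch; the input path is a placeholder, and I assume the VCF is (b)gzip-compressed text):

```python
# Split a whole-genome pangenome VCF into per-chromosome VCFs so each
# genotyping run only has to load one chromosome's variants.
import gzip

IN_VCF = "hprc_pangenome.vcf.gz"  # hypothetical path
header, writers = [], {}

with gzip.open(IN_VCF, "rt") as fh:
    for line in fh:
        if line.startswith("#"):
            header.append(line)   # replay the full header in each chunk
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in writers:
            writers[chrom] = open(f"{chrom}.vcf", "w")
            writers[chrom].writelines(header)
        writers[chrom].write(line)

for w in writers.values():
    w.close()
```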

Thank you in advance for your time and insights. I truly appreciate your assistance and look forward to hearing from you.

Best regards,
Xinjiang
