
Inquiry on Building CCDG Pipeline from Fastq Files for Custom Sequencing Samples #2

Open
TanXinjiang opened this issue Nov 12, 2024 · 2 comments

Comments

@TanXinjiang

Hello,

Thank you for the fantastic preprint on The Great Genotyper. I am very interested in using this tool for my analysis. However, I noticed that the documentation does not provide a detailed workflow for building the CCDG indexes starting from raw sequencing samples, such as a set of FASTQ files from around 10 samples.

Could you please provide some guidance on how to start the CCDG process from FASTQ files? In particular, any recommendations on the steps and resources required to go from raw FASTQ data to the CCDG index would be greatly appreciated.

Thank you very much for your assistance and for developing this valuable tool.

Best regards,
Xinjiang Tan

@shokrof
Collaborator

shokrof commented Nov 21, 2024

Hi Xinjiang,

Thank you for your interest in The Great Genotyper. I hope it proves to be a valuable tool in your research, and I am eager to hear your feedback. For guidance on creating the CCDG, you can visit the following link: The Great Genotyper - Database Builder.

Creating a database with 10 samples is quite straightforward: it should take between 5 and 10 hours and require less than 64 GB of memory. Please note that processing times vary based on how similar the samples are to each other. If you need any further assistance with creating the index, please do not hesitate to reach out.
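For orientation, here is a minimal sketch of the general shape of the build, assuming KMC for per-sample k-mer counting and metagraph for graph construction and annotation. The sample names, k-mer size, and count threshold below are illustrative placeholders, not the pipeline's defaults; the authoritative commands are in the Database Builder workflow linked above.

```python
# Hedged sketch of a CCDG-style build for a small cohort:
# per-sample k-mer counting, one joint de Bruijn graph, then
# per-sample annotation. Names and parameters are assumptions.
import subprocess

SAMPLES = ["sampleA", "sampleB"]  # hypothetical sample names
K = 31                            # assumed k-mer size

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for s in SAMPLES:
    # 1. Count k-mers per sample with KMC; <sample>.lst lists that
    #    sample's FASTQ files, -ci2 drops singleton (likely error) k-mers.
    run(["kmc", f"-k{K}", "-ci2", f"@{s}.lst", s, "tmp/"])

# 2. Build one joint graph over all samples with metagraph, spilling
#    intermediate buffers to disk (--disk-swap) to bound peak memory.
run(["metagraph", "build", "-k", str(K), "--disk-swap", "tmp/",
     "-o", "cohort_graph"] + [f"{s}.kmc_suf" for s in SAMPLES])

# 3. Annotate the graph so each k-mer records which samples contain it.
for s in SAMPLES:
    run(["metagraph", "annotate", "-i", "cohort_graph.dbg",
         "--anno-filename", "-o", f"{s}_anno", f"{s}.kmc_suf"])
```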

Best regards,

Moustafa

@TanXinjiang
Author

Hi Moustafa,

Thank you for your response. I have attempted to use The Great Genotyper in two ways:

  1. I downloaded your prebuilt CHB indexes and tried genotyping with the VCF deconstructed from the HPRC pangenome graph. Memory usage was extremely high: on a server with 500 GB of RAM, the process was killed during the "Determine unique kmers for chromosome: chrX" step. However, when using the chr22 VCF as the reference together with the whole-genome CCDG indexes, peak memory usage was 70 GB, and the results, benchmarked against NA18534's PAV calls, showed an SV precision of 0.65, recall of 0.53, and F1-score of 0.58. With the chr2 VCF as the reference, peak memory usage increased to 227 GB, with SV precision at 0.70, recall at 0.52, and F1-score at 0.60.

    Upon reviewing your code, I noticed that you appear to divide the CCDG into chunks. Would this approach improve the genotyping results? Also, how should the CCDG be mapped to chromosome-specific chunks? For now I have been splitting the input VCF per chromosome myself (see the sketch after this list).

    I also noticed that a significant amount of time and memory goes into reading and processing the pangenome reference VCF. PanGenie v3.0.0 and later encourages using PanGenie-index to preprocess VCF files; do you have any similar development plans for The Great Genotyper?

  2. I attempted to follow the CCDG build process and noticed that the metagraph build step reads and writes many temporary files, requiring high I/O performance. When I removed the --disk-swap option, peak memory usage for a single sample reached 80 GB. Is this performance expected?
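
Regarding the chunking in point 1, this is roughly how I produced the per-chromosome VCFs (a minimal sketch; the input path is a placeholder, and I assume the VCF is (b)gzip-compressed text):

```python
# Split a whole-genome pangenome VCF into per-chromosome VCFs so each
# genotyping run only has to load one chromosome's variants.
import gzip

IN_VCF = "hprc_pangenome.vcf.gz"  # hypothetical path
header, writers = [], {}

with gzip.open(IN_VCF, "rt") as fh:
    for line in fh:
        if line.startswith("#"):
            header.append(line)   # replay the full header in each chunk
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in writers:
            writers[chrom] = open(f"{chrom}.vcf", "w")
            writers[chrom].writelines(header)
        writers[chrom].write(line)

for w in writers.values():
    w.close()
```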

Thank you in advance for your time and insights. I truly appreciate your assistance and look forward to hearing from you.

Best regards,
Xinjiang
