Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions related to interpretation results phased assembly. #245

Open
rfinkers opened this issue Apr 5, 2024 · 51 comments
Open

Questions related to interpretation results phased assembly. #245

rfinkers opened this issue Apr 5, 2024 · 51 comments
Assignees

Comments

@rfinkers
Copy link

rfinkers commented Apr 5, 2024

I'm attempting an assembly (relatively heterozygous diploid genome; 5Gb haploid size; with Hifi, ONT & Hi-C).

In the directory 7-consensus, the uniting-popped fast = 7.3Gb; however the uniting.popped.haplotype1, haplotype2 and unsigned is 0 bytes in size.

Thus this imply that there is no contribution of the ONT data to the phasing; though there are reads in the ont_subset.fasta.gz?

The final assembly.fasta is still 7.3Gb; with the haplotype1 and 2 fast both being 1.3G in size. This phasing is the sole contribution from the Hi-C data?

What would be good parameters to re-atempt this assembly; to deal with the higher diversity in this species (compared to human)?

Tnx!

@skoren
Copy link
Member

skoren commented Apr 5, 2024

The initial assembly (the 7.3 Gbp) file is still phased, it's just not split into haplotypes. Each contig is from a single haplotype based on HiFi + ONT data information. In the second case, the assembly.fasta will still be the full assembly (haplotype1 + haplotype2 + unassigned) so it makes sense that it has not changed. It may have linked some sequences together within haplotypes that couldn't be linked without HiC, making them longer which would be reflected in sequence stats.

I suspect the issue is the assembly is almost completely phased already and there isn't anything for HiC to do except assign to a haplotype. This requires homology detection which may not be tolerating the diversity of this species vs human. You can try increasing the --haplo-divergence parameter from the default of 0.05 to 0.10 or 0.2 and see if more of the assembly is assigned. Perhaps @Dmitry-Antipov has other suggestions. If you can share the colors file and the noseq.gfa file here from the 8-hicPipeline folder that should give more info on what is happening with the assembly.

@Dmitry-Antipov
Copy link
Contributor

Without graph and colors it is hard to say anything, but anyway, such large unassigned file is not normal.
Increasing --haplo-divergence is the first idea for me too.

Which species is it? Do you expect to have large rDNA arrays, what is estimated level of heterozygosity?

@rfinkers
Copy link
Author

rfinkers commented Apr 8, 2024

share.zip
Thanks @skoren , and yes, I was mixing-up terminology. See the files attached for feedback; I'l rerun the assembly with different --haplo-divergence settings and assess the impact on the final results.

@Dmitry-Antipov
Copy link
Contributor

Dmitry-Antipov commented Apr 10, 2024

Aha, this is actually another problem. Graph nodes (which are constructed from ONT & HiFi but not HiC) are quite fragmented. This is not normal, those regions should be resolved with ONT reads. Without it there is just not enough HiC signal to assign labels for those short haplotype-specific regions - for some chromosome pairs there are thousands of nodes.
We expect to have larger haploblocks with HiFi & ONT, and then phase/assign labels for them with HiC

The reason seems to be a verkko problem, but on the graph simplification/ONT alignment stages, and not on HiC phasing - I see huge amount of fake bulges with really different coverage between "haplotypes" - @skoren can you have a look?
fake_bulges

--haplo-divergence should not actually change anything here.

@Dmitry-Antipov
Copy link
Contributor

Yet another assumption - is it a cell line or a tissue sample? Possibly those numerous 5x vs 50x bubbles can represent problems with the cell line?

@skoren
Copy link
Member

skoren commented Apr 10, 2024

The genome seems to be a combination of quite diverge but then some extremely homozygous regions (like the one @Dmitry-Antipov posted above). In normal homozygous genomes those low coverage nodes would be simplified and you'd end up with a single large node but here, because it's double the coverage of most of the genome, they aren't. I do think that --haplo-divergence will assign more of the assembly but these very homozygous regions are likely to remain in the unassigned bin and probably belong in both haplotypes.

@Dmitry-Antipov
Copy link
Contributor

yep, Sergey is right, I did not notice that in addition to that underresolved problem there are some unlabeled large nodes - those labels can be fixed with haplo-divergence. But it definitely will not help with the fragmented+unassigned problem i was writing about

@rfinkers
Copy link
Author

Gentleman, thanks so far, there is definitively some food for thought here before providing some feedback (next week). In the meanwhile, I'll rerun the final steps with a larger haplotype divergence setting and take that also along int he feedback.

@skoren
Copy link
Member

skoren commented Apr 11, 2024

When you re-run with the halo-divergence change, I'd suggest also adding the --haploid option. You don't need to re-run the full pipeline from scratch, in fact it's better if you don't so you can keep the read/node correspondences to compare the results. This would mean duplicating your current asm folder and removing all the assembly.* files along with the 5-*, 6-*, 7-*, and 8-* steps. That should better handle these very homozygous regions and pop those bubbles. It will likely still not assign them if the entire component is homozygous but at least the assembly will be less fragmented.

@rfinkers
Copy link
Author

rfinkers commented Jun 6, 2024

Still working on understanding this issue; Besides the example you picked out that part of the genome is relatively homozygous (on one chromosome), on other chromosomes we might have much higher diversity. Could diversity be to high, which results in failure to detect that two homolouges belong to the same chromosome, result in not exporting these sequences to the fased fasta file? As in, this probably would als impact your previous suggestion regarding the assembly as '--haploid'?

@skoren
Copy link
Member

skoren commented Jun 6, 2024

Right, there's two issues. One is there is high diversity in some regions and two is there is some very homozygous regions. The --haplo-divergence would help with the first one. Without proper haplotype phasing the HiC signal won't be there to push the two haplotypes away. The change I suggested (--haploid) would address the second one by more aggressively removing bubbles in this assembly, hopefully extending the sequences for Hi-C phasing.

@rfinkers
Copy link
Author

rfinkers commented Jun 9, 2024

No improvement with --haploid and --haplo-divergence 0.2. What divergence does the value of 0.2 imply?

@Dmitry-Antipov Dmitry-Antipov self-assigned this Jun 10, 2024
@Dmitry-Antipov
Copy link
Contributor

Dmitry-Antipov commented Jun 10, 2024

Could you share graph and colors for the most recent run once again?

What divergence does the value of 0.2 imply?

It should help to detect homologous regions and thus better assign colors(by Hi-C phasing) for them. But corresponding regions (aka nodes in graph) should be long enough to make it work.

@rfinkers
Copy link
Author

share.tar.gz
Tnx, please find the info attached. The species is a tricky diploid one with a haploid genome estimated to be 5Gb. At least one chromosome is estimated to be (mostly) homozygous; estimated about heterozygosity, of the two genomes, in the other chromosomes are estimated to be above 3-5%, but likely varies from chromosome-to-chromosome. But the prediction flows out there have difficulties with these estimates. This relates to my divergence question. Is 0.2 sufficiently high enough (though it is the max) to detect homologous regions in diverse species? Looking forward to your insights.

@rfinkers
Copy link
Author

As an addition to above, Hifi heterozygous/homozygous depth = 40x/80x and ONT heterozygous/homozygous depth = 11x/21x

@Dmitry-Antipov
Copy link
Contributor

I've checked, still see both problems we discussed above.
Can you also share node sequences (8-hicPipeline/unitigs.fasta & 8-hicPipeline/unitigs.hpc.fasta)?

Alternatively to fastas I can look on mashmap's mappings counted on your side - 8-hicPipeline/run_mashmap.sh & 8-hicPipeline/mashmap.out
Last file can be big; you can safely filter out short alignments by
awk '$11 >= 50000' mashmap.out > mashmap50.out
and send only smaller mashmap50.out

@rfinkers
Copy link
Author

rfinkers commented Jun 17, 2024

With respect to sharing the fastas; we'll have to see how to perform this in a secure manner. I'll generate the mashmap50.out file as alternative. But as the mashmap.out file was cleaned, I'll rerun 8-hicPipeline/run_mashmap.sh; which will take some days. Not sure if this helps, but untigs.fasta (7.4G) and unitigs.hpc.fasta (5.2G) raw sizes. Haploid is approx 4.5-5Gb

@Dmitry-Antipov
Copy link
Contributor

yeah, sharing fastas can be sensitive issue. I need them exactly to look on mashmap's results, so running it on your side and sending us resulting mappings is perfectly fine.

@rfinkers
Copy link
Author

mashmap60.out.gz
mashmap50.out would be to large to upload (GitHub limits); so used awk '$11 >= 60000' mashmap.out > mashmap60.out instead. Hope this is fine. Looking forward hearing your insights.

@Dmitry-Antipov
Copy link
Contributor

OK, so mashmap "alignments" are really fragmented, we just do not see enough similarity to phase corresponding nodes.
I.e. utig4-14078 is likely homologous to utig4-18672, but the total length of found homologous stretches reported by mashmap is just 2.5M of >20M length.
So, returning to your question, it looks that 0.2 is really not enough here.
Actually you can try something like 0.25 as --haplo-divergence but mashmap is extremely slow with low sequence identity (verkko's haplo_divergence equals to (100 - mashmap_percent_identity)/100 , so efficiently it is running mashmap with --pi 75) and we never were patient enough to check the results.

We can make some improvements on our side, to test that they are somehow reasonable we'll also need hic.byread.compressed (just pairwise hi-c mapping counts) from 8-hicPipeline

@skoren, do you have any suggestions about improving bulge removal in haploid part of this genome(see attached figure)
erroneous_bulges

@rfinkers
Copy link
Author

xab.gz
xaa.gz
Ok, needed to split the file in two, in order to be able to circumvent GitHub limits.

@Dmitry-Antipov if I read your suggestion correctly, it would be ok te reexecute the run_mashmap.sh script again with the modified --pi settings. Correct? Or would a rerun of the complete pipeline be necessary with the modified haplo-divergence 0.25 setting?

@Dmitry-Antipov
Copy link
Contributor

Dmitry-Antipov commented Jun 20, 2024

you can edit that script and then rerun hic_phasing.sh to see whether the number of non-phased nodes (phased ones are listed in hicverkko.colors.tsv) will be significantly reduced or not.

@skoren
Copy link
Member

skoren commented Jun 30, 2024

On the bubble popping issue, verkko doesn't expect to have such a wide range of heterozygosity in the same genome and we rely on a global coverage estimate from the graph. This is why the haploid option didn't help. Some chromosomes/components are diploid and their average coverage is per haplotype while those with the unpopped bubbles are almost homozygous so the coverage there is per haplotype * 2. These are considered repeats and so aren't processed. I've made a branch to address this (using local component coverage and not global coverage) and the results look much more reasonable on the graph you shared. Once it's tested more and integrated into master you should be able to re-run from the 5-untip step.

@rfinkers
Copy link
Author

rfinkers commented Jul 2, 2024

the size of hicverkko.colors.tsv of the previous run and after rerun hic_phasing.sh is similar. Few changes, but the majority of contigs between both files are in common. Food for thought. Also, I'm still puzzled by the 5x coverage bubbles in the graph above. Haploid HiFi depth is 40x/ homozygous 80x. These bubbles could not come from incorporation of the ONT or HIC data?

@skoren
Copy link
Member

skoren commented Jul 2, 2024

The coverage on those nodes is based solely on HiFi kmers so they couldn't have come from ONT data (which would have 0 HiFi k-mers) and Hi-C data doesn't add any sequence to the graph. I suspect there is either some low-level somatic variation (if this is a cell line or similar) or some kind of systematic error.

@rfinkers
Copy link
Author

rfinkers commented Jul 3, 2024

It's a diploid plant species, haploid genome ~5Gb. Slowly getting some additional information from orthogonal datasets. Some chromosomes are similar / some chromosomes are diverged. With ~7.5Gb, the total size of the assembly Is not that bad (Verkko / p_utg Hifiasm ~8Gb) . I see signals that contigs from some chromosomes are nicely phase separated while in others the hamming rate is extremely high. I can share some more insights, but not via this medium. The update in terms of "using local component coverage and not global coverage" you commented on three days ago sounds like one strategy that would push things forward. I'm happy to give it a try, even if it is still in a test branch, but will wait util you give the go ahead @skoren. @Dmitry-Antipov does it make sense to look further at the mashmap output / hicverkko.colors.tsv? Or is it clear enough that this was not the way forward?

@Dmitry-Antipov
Copy link
Contributor

I have all the data needed for now. believe that it's possible to see at least some improvement on unphased contigs, but just didn't implement corresponding fix yet. Hope to update on this issue at the beginning on next week.

@rfinkers
Copy link
Author

rfinkers commented Jul 4, 2024

I have some other things to do, but can have more focus on this project again in August

@rfinkers
Copy link
Author

@Dmitry-Antipov is there an update (d branch) for me to test? Tnx!

@Dmitry-Antipov
Copy link
Contributor

Dmitry-Antipov commented Aug 10, 2024

Hi,
sorry for the delay, didn't finish experiments before going on vacations.

You can try the branch het_fixes
Constants used there are very aggressive on phasing (attaching the phasing results on the last shared graph), but still some long nodes are not phased - i.e. for utig4-18391 it is really unclear what should be its homologous one based on sequence similarity.
hicverkko.colors10.csv

Anyway, I'd be very accurate with phasing results here, and definitely check them with gene-based verification like compleasm

@rfinkers
Copy link
Author

rfinkers commented Sep 9, 2024

Rerun stuff from scratch; but getting stuck in processing files in 8-hic.
Both mashmap jobs completes (2 results from the first set; 1 result from the second step (paths2ref.mashmap)), but than the flow crashes. A restart of slurm results in restarting a lot of the completed jobs (including the long (approx 1..5 weeks) mashmap jobs).

See the dag of jobs:
job count


HiC_rdnascaff 1
alignBWA 37
hicPhasing 1
indexBWA 1
mergeBWA 1
prepareScaffolding 1
runRukkiHIC 1
scaffoldMergeBWA 1
scaffoldPrefilterBAM 37
transformBWA 1
total 82

Error likely has to do with not being able to find a file, but have not yet figured out which.

Is there a specific batch script left-over by Snakemake (batch-script directory) to restart only the last part manually? To have more fine-grained control on where to start debugging?

@Dmitry-Antipov
Copy link
Contributor

What was the job crashed? tail of log file should help to get what happened.

If you have all mashmap jobs finished, there's quite a few scripts to run: bunch of scaffold_prefilter*.sh scripts: scaffold_mergeBWA.sh and hic_scaff.sh. Possibly they are already created in your 8-hicPipeline folder - do you see them?

My first suggestion is that rukki.paths.tsv & rukki.paths.gaf are the files that cause rerun because of suboptimal snakemake pipeline design. Do you see them in 8-hicPipeline folder?

@rfinkers
Copy link
Author

rukki.paths.tsv & rukki.paths.gaf (they are identical?) are there.

The scaffold_prefilter*.sh, scaffold_mergeBWA.sh and hic_scaff.sh are lacking.

Error executing rule prepareScaffolding on cluster (jobid: 110, external: 109538, jobscript: /media/ont/ASM3.91/.snakemake/tmp.enu3oklq/verkko.prepareScaffolding.110.sh). For error details see the cluster log and the log files of the involved rule(s).

but cannot find more.

@Dmitry-Antipov
Copy link
Contributor

Then it likely means that third mashmap crashed and paths2ref.mashmap is not valid. You can check what is the error in log, 8-hicPipeline/prepare_ref_mashmap.err

@rfinkers
Copy link
Author

---Preparing rukki paths fasta
---Running MashMap
[mashmap] MashMap v3.1.3
[mashmap] Reference = [../8-hicPipeline/paths.hpc.fasta]
[mashmap] Query = [../8-hicPipeline/paths.hpc.fasta]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 690
[mashmap] Segment length = 10000 (read split allowed)
[mashmap] No block length filtering
[mashmap] Chaining gap max = 10000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 80%
[mashmap] Do not skip self mappings
[mashmap] No hypergeometric filter
[mashmap] Mapping output file = /dev/fd/63
[mashmap] Filter mode = 3 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads = 8
[mashmap::skch::Sketch::build] minmer windows picked from reference = 737188003
[mashmap::skch::Sketch::index] unique minmers = 40515556
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 10102799) ... (704532, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 47924 times during lookup.
[mashmap::map] time spent computing the reference index: 417.324 sec
[mashmap::skch::Map::mapQuery] WARNING, no .fai index found for ../8-hicPipeline/paths.hpc.fasta, reading the file to sum sequence length (slow)
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.26e+04 bp/s elapsed: 05:01:41:40 remain: 00:00:00:006:04
[mashmap::skch::Map::mapQuery] count of mapped reads = 8976, reads qualified for mapping = 8978, total input reads = 8978, total input bp = 5522795566
[mashmap::map] time spent mapping the query: 4.38e+05 sec
[mashmap::map] mapping results saved in: /dev/fd/63

@Dmitry-Antipov
Copy link
Contributor

This output looks normal, strange.

Can you rerun verkko with snakemake dry-run option (verkko <> --snakeopts '--dry-run' > dry_rerun.log) and upload the resulting log?

@rfinkers
Copy link
Author

dry_rerun.log
sure, see attachment.

@Dmitry-Antipov
Copy link
Contributor

Dmitry-Antipov commented Sep 19, 2024

Ok, it seems that there's a bug in our pipeline logic, which is present in v2.2 release too and affects runs with --cleanup option you likely used. Thank you for the help with locating it.

Simplest way to use current run is to skip scaffolding step completely. Hi-C phasing is already done since you have rukki.paths.gaf file. We added scaffolding after v2.1 release, so results should be similar you had before with exception of more contigs phased because of my changes for higher heterozygosity in het_fixes branch.
To do so, you should use run verkko --paths <old_assembly>/8-hicPipeline/rukki.paths.gaf --assembly <old_assembly> -d <new_output_dir> --hifi <> --ont <>

If you actually want to run scaffolding and not only phasing I'll create instructions, but since we do not want to rerun mashmap jobs it would be a bit tricky.

However, we had no chance to test scaffolding on such heterozygous data (and we use diploid structure of the genome in the algorithm), so possibly not running scaffolding will just be right.

@rfinkers
Copy link
Author

Ok, thanks. I'll try that over the weekend. I'm interested in the scaffolding as well, but that can be a later experiment. I'll first check the phasing step and validate with orthogonal datasets. Will keep you informed. tnx.

@Dmitry-Antipov
Copy link
Contributor

OK, so for scaffolding:

  1. Switch off cleanup in verkko.yml (keep_intermediate should be set to True)
  2. Run snakemake pipeline until mergeBWA rule, easiest way is to add --until mergeBWA into a copy of snakemake.sh
  3. Run snakemake --touch <snakemake_file_from snakemake.sh> to update timestamps
  4. Run rest of the pipeline with regular parameters.

For all runs I'd run in dry mode first to check that snakemake is not trying to rerun jobs that are already finished.

You can compare prescaf_rukki.paths.tsv and scaff_rukki.paths.tsv to see the effect of scaffolding.

@rfinkers
Copy link
Author

Ok, it seems that there's a bug in our pipeline logic, which is present in v2.2 release too and affects runs with --cleanup option you likely used. Thank you for the help with locating it.

Simplest way to use current run is to skip scaffolding step completely. Hi-C phasing is already done since you have rukki.paths.gaf file. We added scaffolding after v2.1 release, so results should be similar you had before with exception of more contigs phased because of my changes for higher heterozygosity in het_fixes branch. To do so, you should use run verkko --paths <old_assembly>/8-hicPipeline/rukki.paths.gaf --assembly <old_assembly> -d <new_output_dir> --hifi <> --ont <>

If you actually want to run scaffolding and not only phasing I'll create instructions, but since we do not want to rerun mashmap jobs it would be a bit tricky.

However, we had no chance to test scaffolding on such heterozygous data (and we use diploid structure of the genome in the algorithm), so possibly not running scaffolding will just be right.

What is the right file for --assembly to use?

@Dmitry-Antipov
Copy link
Contributor

Dmitry-Antipov commented Sep 23, 2024

What is the right file for --assembly to use?

Not file but path to the assembly dir.

@rfinkers
Copy link
Author

rfinkers commented Sep 25, 2024

Both haplotypes are now 0 bytes. No apparent errors in the logs. I'll dive a bit deeper.

@skoren
Copy link
Member

skoren commented Sep 25, 2024

I'd guess the issue is the original assembly (passed in --assembly) was not fully complete so the consensus (--paths) option didn't know how to partition the new consensus into haplotypes. Do the names in the assembly.fasta file start with haplotype1 or haplotype2 or similar? If so, you can still use the assembly.fasta as your result and just take any sequence with the name haplotype1 as hap1 and haplotype2 as hap2.

@rfinkers
Copy link
Author

No, they were staring with contain-xxxxxx. Could'n find (until now) any other reason.

@skoren
Copy link
Member

skoren commented Oct 16, 2024

Something must have gone wrong with the paths processing, the paths file names contain things like haplotype1/2 correct? What was the exact --paths command?

@rfinkers
Copy link
Author

verkko --paths ASM3.91/8-hicPipeline/rukki.paths.gaf

@skoren
Copy link
Member

skoren commented Oct 31, 2024

If the file is coming from the 8-hicPipeline folder, it should be phased and not be named contig-*. If not then this implies the whole Hi-C pipeline didn't run properly. Can you post a snipped of the file above?

Perhaps given that the original issue you had is fixed in v2.2.1 and that you had an issue with re-generating the consensus, it is easiest to just run a new assembly from scratch with v2.2.1 to make sure we're not chasing restart/mixing problems in the runs.

@rfinkers
Copy link
Author

rfinkers commented Nov 5, 2024

Will do so, at another moment/ticket we'll need to look at the memory allocation to mashmap. this is not sufficient (slurm submission) and I'll need to restart things manually (not a problem for now).

@rfinkers
Copy link
Author

Ok, we are making progress here. The last version does indeed scaffold more into haplotypes (2x). Of the 7.2Gb basal assembly, 2Gb could not be assigned. The other are approx. 50/50 separated over the two haplotypes. I'm getting some orthogonal data as well to confirm some stuf. We can discuss things later in more detail, first need to process this.

@skoren
Copy link
Member

skoren commented Nov 13, 2024

Glad to hear that it is improving. The initial assembly you shared had quite a few homozygous regions w/no bubbles (except sequencing errors) so I suspect that many of the unassigned are from regions w/o any heterozygosity and likely belong to both haplotypes. I'd suggest checking the coverage of the nodes (the noseq.gfa has hifi coverage) to see if they are double the average. Or you can share *noseq.gfa, *scfmap, *tsv with us here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants