Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downsampling and cis counts #458

Open
ddepierre opened this issue Aug 7, 2023 · 2 comments
Open

Downsampling and cis counts #458

ddepierre opened this issue Aug 7, 2023 · 2 comments

Comments

@ddepierre
Copy link

ddepierre commented Aug 7, 2023

Hi,
A part of this bug has already been reported here, I have the same issue on cooltools 0.5.4 version.

Also while writing this issue, I realized I had maybe too many questions for one post, sorry about that.

I have some cool files I want to downsample to a given cis contact count for all my replicates and different conditions to have the same number of cis contacts, so I can compare them (first of all, am I right on the usage of downsampling as a way to normalize sample coverage between conditions?)

1/ get the cis contact count

With cooltools.coverage() or with expected_cis()


cooltools.coverage(some_cool_file, store=True, ignore_diags=0)
some_cool_file.info['cis']
# >>> 309426839

cooltools.coverage(some_cool_file, store=True)
some_cool_file.info['cis']
# >>> 210709980 # because --ignore-diags = 2 in cooler balance 

np.sum(some_cool_file.bins()[:]['cov_cis_raw'])
# >>> 618853678
np.sum(some_cool_file.bins()[:]['cov_cis_raw'])/2
# >>> 309426839.0

expected_temp = cooltools.expected_cis(some_cool_file, clr_weight_name = None, ignore_diags=0, view_df=gnm_arms, chunksize=1000000, nproc = nproc, smooth=False, aggregate_smoothed=False)
np.sum(expected_temp['count.sum'])
# >>> 309426839.0

Counts doubled in cov_cis_raw as already reported here

2 / Downsampling

I have unbalanced cool file as


some_cool_file.bins()[:]
       chrom     start       end  cov_cis_raw  cov_tot_raw
0       chr1         0     10000            0            0
1       chr1     10000     20000            0            0
2       chr1     20000     30000            0            0
3       chr1     30000     40000            0            0
4       chr1     40000     50000            0            0
...      ...       ...       ...          ...          ...
272561  chrY  91720000  91730000            0            0
272562  chrY  91730000  91740000            0            0
272563  chrY  91740000  91744698            0            0
272564  chrM         0     10000          258         3197
272565  chrM     10000     16299          270         3362

[272566 rows x 5 columns]

That I want to downsample:

cooltools.sample(clr = some_cool_file, out_clr_path = dnsmpl_cool_file, cis_count = 250000000)

Traceback (most recent call last):
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/coverage.py", line 117, in coverage
    else clr._load_attrs(clr.root.rstrip("/") + "/bins/weight")["ignore_diags"]
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooler/api.py", line 116, in _load_attrs
    return dict(grp[path].attrs)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'weight' doesn't exist)"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/sample.py", line 102, in sample
    cis_total = clr.info.get("cis", np.sum(coverage(clr)[0] // 2, dtype=int))
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/coverage.py", line 120, in coverage
    raise ValueError(
ValueError: Please, specify ignore_diags and/or IC balance this cooler! Cannot access the value used in IC balancing. 

- Is it possible to sample contact on raw matrices? What is the point of doing downsampling after balancing?

Also

ValueError: Please, specify ignore_diags and/or IC balance this cooler! Cannot access the value used in IC balancing. '

But ignore_diags is not an argument from sample()

 cooltools.sample(clr = cooler_list[ndx], out_clr_path = dnsmpl_cool_file, cis_count = cis_contact_trgt, ignore_diags=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sample() got an unexpected keyword argument 'ignore_diags'


- To bypass the balancing in sample(), can I duplicate 'cov_cis_raw' columns in .bins() and call it 'weight' so it fakes balance cool file and downsample on raw cis contacts? And then I'll need to balance it anyway after the downsampling.

- Usually, is downsampling done on full matrix or do you ignore first(s) diagonal for some reason?

- additional question:

store (bool, optional) – If True, store the results in the input cooler file when finished.
Does it mean that the result is stored in the python variable or directly modified in the cool file? I am not sure to understand whether the matrices are loaded and only loaded version is modified or if the original file it-self is modified.

Thanks for the support,
David


# session_info.show()
# -----
# bbi                 0.3.5
# bioframe            0.4.1
# click               8.1.4
# cooler              0.9.2
# coolpuppy           NA
# cooltools           0.5.4
# cytoolz             0.12.1
# ipywidgets          8.0.7
# matplotlib          3.7.2
# mpl_toolkits        NA
# numpy               1.24.4
# pandas              1.5.3
# pysam               0.21.0
# scipy               1.11.1
# seaborn             0.12.2
# session_info        1.0.0
# -----
# Python 3.9.5 (default, Jun  4 2021, 12:28:51) [GCC 7.5.0]
# Linux-5.3.18-150300.59.60-default-x86_64-with-glibc2.31
# -----
# Session information updated at 2023-07-28 12:37

@ddepierre ddepierre changed the title Downsampling and cis counts - ValueError: The number of contacts in a sample cannot exceed that in the original dataset. Downsampling and cis counts Aug 8, 2023
@ddepierre ddepierre reopened this Aug 8, 2023
@Phlya
Copy link
Member

Phlya commented Aug 9, 2023

These are good questions, although there is a lot to unpack here. I think the most important idea I can give you is to calculate coverage with ignore_diags=0, I think it should solve all your problems...

The values are indeed stored in the file, and .sample() uses the cis count when it is available in the file. When not, it calculates the coverage with default arguments (! perhaps not ideal and should be changed to ignore_diags=0 !) and uses that.

@gfudenberg
Copy link
Member

discussion 2023/10/9:
perhaps we should deprecate the usage of stored cis counts in sample(), as it is not obvious how to link this stored value with the number of diagonals ignored in previous coverage() calculation.

elif frac is None and count is None and cis_count is not None:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants