Skip to content
This repository has been archived by the owner on Sep 12, 2018. It is now read-only.

Allow user to graphically filter edges by multiplicity #232

Closed
fedarko opened this issue Jul 4, 2017 · 7 comments
Closed

Allow user to graphically filter edges by multiplicity #232

fedarko opened this issue Jul 4, 2017 · 7 comments
Assignees

Comments

@fedarko
Copy link
Owner

fedarko commented Jul 4, 2017

Sort of similarly to the charting mechanism for node lengths described in #99, we would use d3.js here (or another JS library) to generate a plot of edge weight against frequency in a given connected component of the graph. (I guess we could just show this in a dialog we'd pop up, similarly to the current "assembly information" dialog.)

Similarly to the "contributions" page for each GitHub repository (example here -- in particular, the green-colored chart near the top of the page), the user should be able to select a region of the chart and only show those edges. Unselected edges would be hidden. (This feature would subsume the "hide edges by weight" feature that I mostly implemented back in March 2017.) I'd imagine the chart-selecting functionality would involve a few main parts, notably:

  • Change the user's cursor to the "crosshair" pointer type when above the chart (this can be done via basic CSS, if I remember correctly)
  • On a dragging motion starting, draw a maximal-height line at the x-coordinate where the user just clicked
  • On a dragging motion ending, draw another maximal-height line at the x-coordinate where the user let go of the mouse button.
    • Then identify the edge multiplicity range selected by the user (since the user could've dragged the mouse in either direction, we'll need to use min and max or something -- not too difficult)
    • Then identify all edges outside of that range and "hide" them
    • Then identify all edges inside of that range and "unhide" them

I remember there were a lot of problems re: making edge weights work ok with collapsing/uncollapsing, and those problems ultimately necessitated temporarily disabling the edge hiding feature while I worked on more important stuff. It might be easiest, then, to not actually delete unselected edges but to just make them styled as invisible. That should work ok with collapsing, I think?

@fedarko fedarko self-assigned this Jul 4, 2017
@fedarko
Copy link
Owner Author

fedarko commented Jul 4, 2017

img_20170704_025722
A good test case for this is the 2nd cc of the Velvet E. coli assembly graph. There are two main "clusters" of edges in this cc, so the frequency/weight graph should have two "peaks" visible (see attached drawing of what this should generally look like).

@fedarko
Copy link
Owner Author

fedarko commented Jul 5, 2017

Some pertinent examples of related functionality in practice:

@fedarko
Copy link
Owner Author

fedarko commented Jul 5, 2017

It looks like the "bar chart" approach with raw data might be suitable for this (where y-axis indicates multiplicity, and each "bar" extending upward from the x-axis indicates an edge with multiplicity indicated via the y-axis). Below is a screenshot of such a chart (for the edge weights of the 2nd cc of the velvet E. coli assembly graph mentioned above), generated using LibreOffice Calc (it's an unpolished version -- we'd add further labels, etc. for the implementation version).

velvet_ecoli_cc2_edge_multiplicity_chart

That being said, I'm considering other approaches to this. One alternative approach is a frequency bar chart (where x-axis indicates multiplicity and y-axis indicates number of edges with that multiplicity), where the x-axis might have "bins" of multiplicity (say, increments of 10?). Here's a rough version of that, generated for the same data as above with "bins" of edge multiplicities of 10.

velvet_ecoli_cc2_edge_multiplicity_freq_chart

Here's the same thing with bins of 100:
velvet_ecoli_cc2_edge_multiplicity_freq_chart_100bins

fedarko added a commit that referenced this issue Jul 13, 2017
Will eventually be included in the functionality described in #232.
fedarko added a commit that referenced this issue Jul 17, 2017
@fedarko
Copy link
Owner Author

fedarko commented Jul 17, 2017

Should consider allowing the user to pick the number of subdivisions of the edges, instead of picking the "bin sizes." For example, instead of creating bins of size 10 (where each bin corresponds to the number of edges in the cc with a weight that falls within that bin's range of edge weights of length 10), we should let the user pick how many groups -- we could call them bins, I guess -- they want the edges in the cc to be split up into for the chart.

So, using the 2nd cc of the Velvet E. coli assembly graph as a benchmark, if the cc contains 32 edges and the user requests 10 bins, then my code should roughly divide the cc's edges into 10 groups, each of about size 3. This might take some work to work well for certain phenomena (for example, what to do when a cc contains a bunch of edges with the exact same weight?).

I'd imagine that we can just characterize each bin as a collection of edges, with the defined properties of min and max "child" edge weight. Since each bin would have roughly the same size (might differ by a few when the number of edges in the cc is not evenly divisible by the requested number of bins), we could plot the y-axis as the average edge weight in each bin.

That being said I'm not entirely sure how to assign edges to bins. Given e edges and b requested bins, a simple approach of taking f = floor(e / b) and creating b - 1 bins of size f and a final bin of size f + (e - (f * b)) works alright in some cases (for example, for the example cc above it works fine). However, it theoretically could result in very large last bin sizes, which would be undesirable. For example, if a cc has 101 edges and the user requests 21 bins, then the naive approach would create 20 bins of size 4 and a 21st bin of size 4 + (101 - (4 * 21)) = 21. My described "simple approach" earlier in this paragraph, then, clearly isn't a good solution at all (since it results in a relatively large final bin). So I should figure something else out.

(We could also just approximate the number of bins -- in the e = 101, b = 21 example, it would probably be ok to just construct 19 bins of size 5 and 1 bin of size 6.)

@fedarko
Copy link
Owner Author

fedarko commented Jul 17, 2017

In any case, once a bin construction algorithm is ready, it should be simple to calculate each bin (and its min, max, and average) and populate the bar chart accordingly. I also think that the min/max of bins can be presented as a tooltip for each bar (see the above d3.js example of using tooltips in bar charts).

fedarko added a commit that referenced this issue Aug 4, 2017
This prevents a bug where a new chart would just get drawn over
the old one.
fedarko added a commit that referenced this issue Aug 8, 2017
fedarko added a commit that referenced this issue Aug 8, 2017
Via changing the domain from [0, 1] to [0, max * 1.1].

Setting [max * 1.1] as the top of the domain also should mean that
all the values in the histogram are contained within the span of the
x-axis (unlike before, where the max was getting pushed to the right
of the x-axis).
fedarko added a commit that referenced this issue Aug 8, 2017
fedarko added a commit that referenced this issue Aug 9, 2017
Other changes in this commit:

-Changed up the JS to not enable/disable the cullEdges stuff (since
 that will only be available if the filter edges stuff is available
 -- therefore, no need to worry about enabling/disabling sub-controls
 within the edge filtering dialog).
-Encapsulated the new bin ct. UI element and the cullEdges UI element
 in Bootstrap cols and a row, making the edge filtering dialog's
 formatting look a lot nicer than before.
-Moved the edge weight histogram drawing stuff into its own
 function to facilitate ease of redrawing the histogram.

Next up for #232 is improving the general UI of this (e.g. using
enter to redraw the histogram), maybe flipping the UI elements
so the buttons are on the right of the text inputs (would match
the search bar), making the chart look prettier (fixing occasional
bar spacing issues, removing floating-point tick values, making
sure that the max edge weight is still included in other bins (?),
etc.), and adding graphical filtering functionality.
fedarko added a commit that referenced this issue Aug 9, 2017
This leaves more space for the y-axis label when the
frequencies of edge multiplicities get rather large.
@fedarko
Copy link
Owner Author

fedarko commented Aug 9, 2017

Something worth considering -- the current method of scaling every edge to within the range [0, max * 1.1] doesn't do much to prevent outliers. Consider, for example, the first cc of the pre-clean SRS049950 graph, where almost every edge falls into the first bin when (# bins = 20).

Not sure how to counteract this; will have to read more about this.

@fedarko
Copy link
Owner Author

fedarko commented Aug 9, 2017

This issue was moved to marbl/MetagenomeScope#61

@fedarko fedarko closed this as completed Aug 9, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant