Ordinal Data Type #360

Open
wants to merge 34 commits into
base: master

Member

@jinimukh jinimukh commented Apr 15, 2021

Overview

This PR addresses #240 by adding support for the ordinal data type. Currently, the only way to set a column's data type to ordinal is df.set_data_type({"col_name": "ordinal"}). Optionally, if the entries do not have a natural ordering (numeric or alphabetical), a custom ordering can be specified with df.set_data_type({"col_name": "ordinal"}, order={"col_name": [ordered_lst]}). Ordinal data types are visualized with boxplots, but because a boxplot is a bivariate visualization, it only shows up when enhancing a selected visualization.
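
To make the ordering behavior above concrete, here is a minimal sketch of how a custom order could be resolved, falling back to the natural sort when no order is given. `resolve_order` is a hypothetical helper written for this illustration; it is not a function from Lux and may not match how the PR implements it.

```python
from typing import Hashable, Iterable, List, Optional, Sequence


def resolve_order(
    values: Iterable[Hashable],
    custom_order: Optional[Sequence[Hashable]] = None,
) -> List[Hashable]:
    """Return the distinct values in ordinal order (hypothetical helper)."""
    distinct = set(values)
    if custom_order is None:
        # No explicit order: rely on the natural (numeric or alphabetical) sort.
        return sorted(distinct)
    missing = distinct - set(custom_order)
    if missing:
        raise ValueError(f"values missing from custom order: {sorted(missing)}")
    # Keep only the levels that actually occur, in the user-specified order.
    return [level for level in custom_order if level in distinct]


sizes = ["50-99", "<10", "10000+", "<10"]
natural = resolve_order(sizes)  # string sort, which is not the intended order
custom = resolve_order(sizes, ["<10", "50-99", "10000+"])
```

With string buckets like these, the natural sort puts "10000+" first, which is the case the optional order argument is meant to handle.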

Changes

  • univariate.py: allow ordinal data types to be treated as nominal data types to create bar graphs in Occurrences tab
  • frame.py: allow the set_data_type function to take in optional order argument to specify orders on ordinal data
  • BoxPlot.py: currently only supports Altair BoxPlots
  • Compiler.py: allow the mark to be box when n_dim == 1 and n_msr == 1 and dimension_type == "ordinal"
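
The Compiler.py condition in the list above can be sketched as a small standalone function. `choose_mark` and the "default" return value are assumptions made for this example; the actual Compiler logic handles many more cases and is not reproduced here.

```python
def choose_mark(n_dim: int, n_msr: int, dimension_type: str) -> str:
    """Hypothetical mark selection mirroring the condition named in the PR."""
    if n_dim == 1 and n_msr == 1 and dimension_type == "ordinal":
        return "box"
    # Assumption for this sketch: every other case keeps whatever mark
    # the compiler would otherwise pick, represented here as "default".
    return "default"


mark_for_ordinal = choose_mark(1, 1, "ordinal")
mark_for_nominal = choose_mark(1, 1, "nominal")
```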

Example Output

Screen Shot 2021-04-14 at 1 51 27 PM

@jinimukh jinimukh marked this pull request as draft April 15, 2021 05:02

codecov bot commented Apr 15, 2021

Codecov Report

Merging #360 (7820f1e) into master (1dbbcb9) will decrease coverage by 0.62%.
The diff coverage is 50.00%.

❗ Current head 7820f1e differs from pull request most recent head 19a14d8. Consider uploading reports for the commit 19a14d8 to get more accurate results

@@            Coverage Diff             @@
##           master     #360      +/-   ##
==========================================
- Coverage   84.46%   83.84%   -0.63%     
==========================================
  Files          51       52       +1     
  Lines        3902     3961      +59     
==========================================
+ Hits         3296     3321      +25     
- Misses        606      640      +34     
Impacted Files                                 Coverage Δ
lux/action/univariate.py                       90.38% <ø> (ø)
lux/core/series.py                             53.84% <ø> (ø)
lux/interestingness/interestingness.py         87.95% <ø> (ø)
lux/vislib/matplotlib/MatplotlibRenderer.py    84.61% <0.00%> (-2.69%) ⬇️
lux/vislib/altair/BoxPlot.py                   21.87% <21.87%> (ø)
lux/vislib/altair/AltairRenderer.py            94.59% <33.33%> (-2.59%) ⬇️
lux/action/enhance.py                          96.87% <66.66%> (-3.13%) ⬇️
lux/vislib/altair/BarChart.py                  82.66% <75.00%> (-2.19%) ⬇️
lux/core/frame.py                              81.75% <81.81%> (+0.02%) ⬆️
lux/executor/Executor.py                       79.48% <100.00%> (ø)
... and 2 more
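
For anyone checking the numbers, the coverage figures in the diff above follow from hits divided by lines. The sketch below recomputes them from the reported counts. How Codecov rounds is an assumption here: the two-decimal percentages are consistent with truncation, and the delta of about 0.627 points reads as 0.62 when truncated and 0.63 when rounded, which would explain the two figures in the report.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


def truncate(value: float, digits: int = 2) -> float:
    """Drop digits beyond the given precision (assumed display rule)."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


base = coverage_pct(3296, 3902)  # master
head = coverage_pct(3321, 3961)  # PR #360
delta = head - base              # negative: coverage decreases

base_misses = 3902 - 3296
head_misses = 3961 - 3321
```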

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1dbbcb9...19a14d8. Read the comment docs.

@jinimukh jinimukh marked this pull request as ready for review April 15, 2021 08:41
@jinimukh jinimukh requested a review from dorisjlee April 15, 2021 08:42
@@ -21,6 +21,7 @@
 from lux.vislib.altair.Histogram import Histogram
 from lux.vislib.altair.Heatmap import Heatmap
 from lux.vislib.altair.Choropleth import Choropleth
+from lux.vislib.altair.BoxPlot import BoxPlot
Member

If the user uses matplotlib for boxplots, could we render the boxplot in Altair and show an info button message letting users know that the matplotlib boxplot is not currently implemented? This is similar to what we did for the geographical maps in matplotlib.

Member Author

I've implemented the Altair fallback as well as the message. However, since I'm not able to set the intent on the dataframe due to the matplotlib bug, I'm not sure whether the message works. Let me know if you'd like me to remove it, since there is no way to verify it!
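
For reference, the fallback discussed in this thread can be sketched roughly as follows. `pick_boxplot_backend`, the backend names, and the message wording are all hypothetical and chosen for this example; they are not the code in this PR.

```python
from typing import Optional, Tuple

# Per the PR description, only Altair box plots are implemented.
SUPPORTED_BOXPLOT_BACKENDS = {"altair"}


def pick_boxplot_backend(requested: str) -> Tuple[str, Optional[str]]:
    """Return (backend to use, optional message for the info button)."""
    backend = requested.lower()
    if backend in SUPPORTED_BOXPLOT_BACKENDS:
        return backend, None
    # Unsupported backend: fall back to Altair and tell the user why.
    message = (
        f"Box plots are not yet implemented for {requested}; "
        "rendering with Altair instead."
    )
    return "altair", message


backend, info_message = pick_boxplot_backend("matplotlib")
```

Keeping the decision in a pure function like this would also make the message testable without setting an intent on a dataframe.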

@@ -63,7 +63,10 @@ def univariate(ldf, *args):
 ignore_rec_flag = True
 elif data_type_constraint == "nominal":
 possible_attributes = [
-c for c in ldf.columns if ldf.data_type[c] == "nominal" and c != "Number of Records"
+c
Member

This line split is pretty weird and hard to read. Can we fix this and add a comment on what this list of possible_attributes is used for?
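
One possible shape for a more readable version, shown as a standalone sketch. The `columns` list and `data_type` dict stand in for `ldf.columns` and `ldf.data_type`, the column names are made up, and including "ordinal" follows the PR description rather than the final diff.

```python
# Stand-ins for ldf.data_type and ldf.columns (illustrative column names).
data_type = {
    "education_level": "ordinal",
    "city": "nominal",
    "salary": "quantitative",
    "Number of Records": "nominal",
}
columns = list(data_type)

# Attributes eligible for the bar charts in the Occurrences tab:
# nominal or ordinal columns, excluding the synthetic record-count column.
BAR_CHART_TYPES = ("nominal", "ordinal")
possible_attributes = [
    c
    for c in columns
    if data_type[c] in BAR_CHART_TYPES and c != "Number of Records"
]
```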

@jinimukh jinimukh requested a review from dorisjlee April 23, 2021 02:35
@jinimukh jinimukh changed the title Features/ordinal 2 Ordinal Data Type Apr 23, 2021
@dorisjlee
Member

dorisjlee commented Apr 26, 2021

Thanks @jinimukh!! Can we file a follow-up issue to delegate boxplot calculations to the Pandas and SQL Executor? This will help with performance by bringing the rendering cost down from that of a scatterplot to that of a boxplot (several summary statistics plus outliers).
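
As a rough idea of what an executor would need to precompute, the sketch below derives the boxplot summary statistics and outliers for one group of values. It assumes the common Tukey convention (whiskers at the most extreme points within 1.5 times the IQR of the quartiles); `boxplot_summary` is a hypothetical name and the quartile method may differ from what the renderer uses.

```python
import statistics
from typing import Dict, Sequence


def boxplot_summary(values: Sequence[float]) -> Dict[str, object]:
    """Summary statistics for one box, assuming 1.5 * IQR whiskers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Points inside the fences bound the whiskers; the rest are outliers.
    inliers = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": inliers[0],
        "upper_whisker": inliers[-1],
        "outliers": outliers,
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Only these few numbers plus the outlier points would need to reach the frontend for each ordinal level, instead of every row.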

@dorisjlee
Member

I'm wondering if ordinal data types have to be a subset of nominal data? Apart from the documentation and the actions logic (enhance and univariate), is there anything in the code that treats ordinal as a subset of nominal? For example, can we capture scenarios where an ordinal data type could be a subset of the temporal data type, such as {Summer, Winter, Fall} or {Q1, Q2, …}? It would be helpful to add an example for this.

@dorisjlee
Member

Here are some examples that I was playing around with:

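import pandas as pd
import lux  # Lux extends pandas DataFrames on import

df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/aug_test.csv")
# Drop rows that are missing either of the two columns used below
df = df.dropna(subset=["education_level", "company_size"])

# Mark education_level as ordinal with an explicit level order
df.set_data_type(
    {"education_level": "ordinal"},
    order={"education_level": ["Primary School", "High School", "Masters", "Graduate", "Phd"]},
)
df["education_level"]

# Mark company_size as ordinal; the size buckets do not sort naturally as strings
df.set_data_type(
    {"company_size": "ordinal"},
    order={"company_size": ["<10", "10/49", "50-99", "100-500", "500-999", "1000-4999", "5000-9999", "10000+"]},
)
df["company_size"]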

I was initially a bit confused about why the boxplot was not shown for the number-of-records case in univariate (until we set the intent); then I realized that a boxplot doesn't make sense for an ordinal column on its own. I wonder if it makes sense to have a bivariate ordinal data type tab, i.e., each ordinal attribute paired with every measure, so that the boxplot could be shown in the initial view.
Otherwise, it would appear that setting the intent doesn't change anything.
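
A quick sketch of what such a tab might enumerate: every ordinal attribute paired with every measure. The `data_type` dict, the column types, and the spec dictionaries are illustrative assumptions for this example, not Lux's internal representation.

```python
from itertools import product

# Illustrative column types; the names and types are assumptions for the sketch.
data_type = {
    "education_level": "ordinal",
    "company_size": "ordinal",
    "training_hours": "quantitative",
    "city": "nominal",
}

ordinal_cols = [c for c, t in data_type.items() if t == "ordinal"]
measure_cols = [c for c, t in data_type.items() if t == "quantitative"]

# One candidate boxplot per (ordinal, measure) pair.
boxplot_candidates = [
    {"x": dim, "y": msr, "mark": "box"}
    for dim, msr in product(ordinal_cols, measure_cols)
]
```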

@jinimukh jinimukh closed this Apr 26, 2021
@jinimukh jinimukh reopened this Apr 26, 2021

2 participants