Ordinal Data Type #360
Codecov Report
@@            Coverage Diff             @@
##           master     #360      +/-   ##
==========================================
- Coverage   84.46%   83.84%   -0.63%
==========================================
  Files          51       52       +1
  Lines        3902     3961      +59
==========================================
+ Hits         3296     3321      +25
- Misses        606      640      +34
@@ -21,6 +21,7 @@
from lux.vislib.altair.Histogram import Histogram
from lux.vislib.altair.Heatmap import Heatmap
from lux.vislib.altair.Choropleth import Choropleth
from lux.vislib.altair.BoxPlot import BoxPlot
If the user uses matplotlib for boxplots, could we render the boxplot in Altair and show an info button message letting users know that the matplotlib boxplot is not currently implemented? This is similar to what we did for the geographical maps in matplotlib.
I've implemented the Altair fallback as well as the message. However, since I'm not able to set the intent on the dataframe because of the matplotlib bug, I'm not sure whether the message works. Let me know if you'd like me to remove it, since there is currently no way to verify it!
@@ -63,7 +63,10 @@ def univariate(ldf, *args):
ignore_rec_flag = True
elif data_type_constraint == "nominal":
possible_attributes = [
c for c in ldf.columns if ldf.data_type[c] == "nominal" and c != "Number of Records"
c
This line split is pretty weird and hard to read. Can we fix this and add a comment on what this list of possible_attributes is used for?
Thanks @jinimukh!! Can we file a follow-up issue to delegate boxplot calculations to the Pandas and SQL Executor? This will help with performance by bringing the rendering cost down from that of a scatterplot to that of a boxplot (several summary statistics plus outliers).
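To make "several summary statistics plus outliers" concrete, here is a minimal sketch of the per-category values a boxplot needs. It is not Lux's executor code: the function names `boxplot_stats` and `grouped_boxplot_stats` are hypothetical, the 1.5 × IQR outlier rule is an assumed (conventional) choice, and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def boxplot_stats(values, whisker=1.5):
    """Summarize one group for a boxplot (hypothetical helper, not a Lux API).

    Needs at least two values, because statistics.quantiles requires that
    on most Python versions.
    """
    data = sorted(values)
    # "inclusive" interpolates between data points, treating the minimum and
    # maximum as the 0th and 100th percentiles.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": outliers,
    }


def grouped_boxplot_stats(rows):
    """Group (category, measure) pairs and summarize each category."""
    groups = defaultdict(list)
    for category, measure in rows:
        groups[category].append(measure)
    return {category: boxplot_stats(vals) for category, vals in groups.items()}


# Made-up sample data: one ordinal dimension against one measure.
rows = [("Masters", v) for v in [1, 2, 3, 4, 5, 6, 7, 8, 100]]
rows += [("Phd", v) for v in [10, 20, 30]]
stats = grouped_boxplot_stats(rows)
```

With this shape, only a handful of numbers per category (plus the outlier points) would need to reach the renderer, rather than every row as in a scatterplot.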
I'm wondering if ordinal data types have to be a subset of nominal data? Apart from the documentation and within the actions logic ( |
Here are some examples that I was playing around with:
df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/aug_test.csv")
df = df.dropna(subset=['education_level', "company_size"])
df.set_data_type({'education_level': "ordinal"},
                 order={'education_level': ['Primary School', 'High School', 'Masters', 'Graduate', 'Phd']})
df["education_level"]
df.set_data_type({'company_size': "ordinal"},
                 order={'company_size': [
                     '<10', '10/49', '50-99', '100-500',
                     '500-999', '1000-4999', '5000-9999', '10000+'
                 ]})
df["company_size"]
I was initially a bit confused about why the boxplot was not shown for the number-of-records case in univariate (until we set the intent); then I realized that the boxplot didn't make sense for the ordinal data type on its own. I wonder if it makes sense to have a bivariate ordinal data type tab, i.e., ordinal with respect to all measure values, so that the boxplot could be shown in the initial view.
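As an illustration of why an explicit `order` matters for a column like `company_size`, here is a minimal sketch comparing a plain string sort with a sort driven by the custom order. The helper `sort_by_custom_order` is hypothetical and says nothing about how Lux applies `order` internally; the `observed` values are a made-up subset of the categories listed above.

```python
def sort_by_custom_order(values, order):
    """Sort values by their position in an explicit category order.

    Hypothetical helper for illustration only, not part of Lux.
    """
    rank = {category: position for position, category in enumerate(order)}
    unknown = sorted({v for v in values if v not in rank})
    if unknown:
        raise ValueError(f"values missing from order: {unknown}")
    return sorted(values, key=rank.__getitem__)


company_size_order = [
    "<10", "10/49", "50-99", "100-500",
    "500-999", "1000-4999", "5000-9999", "10000+",
]
observed = ["1000-4999", "<10", "50-99", "10000+", "10/49"]

# A plain string sort compares characters, not company sizes.
lexicographic = sorted(observed)
# The custom order restores the intended small-to-large sequence.
ordinal = sort_by_custom_order(observed, company_size_order)

print(lexicographic)
print(ordinal)
```

Raising on values missing from `order` is one possible design choice; silently placing unknown categories last would be another.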
Overview
This PR addresses #240 by adding support for the ordinal data type. Currently, the only way to set a column's data type to ordinal is to call `df.set_data_type({"col_name": "ordinal"})`. Optionally, if the entries do not have a natural ordering (such as numerical or alphabetical), a custom ordering can be specified using `df.set_data_type({"col_name": "ordinal"}, order={"col_name": [ordered_lst]})`. To visualize ordinal data types, we use boxplots; because a boxplot is a bivariate visualization (an ordinal dimension against a measure), it only shows up to enhance a selected visualization.
Changes
- `univariate.py`: allow `ordinal` data types to be treated as `nominal` data types to create bar graphs in the `Occurrences` tab
- `frame.py`: allow the `set_data_type` function to take an optional `order` argument that specifies the ordering of ordinal data
- `BoxPlot.py`: currently only supports Altair boxplots
- `Compiler.py`: allow the mark to be `box` when `n_dim == 1 and n_msr == 1 and dimension_type == "ordinal"`
Example Output