Replies: 3 comments 3 replies
-
@jeremyholtzman @wasilaq and I were thinking of model training within the context of a dbt DAG, and struggled for a while to conceptualize how a train-validation-test split might work when decoupled from the actual model fitting step. The challenging constraint is that normally a data splitting step will return multiple artifacts (i.e. @jeremyholtzman's genius solution was to return all of these artifacts stacked on top of one another but with an extra column designating which row belonged to which set, using the --macros/train_test_split.sql
{% macro train_test_split(test_size = 0.25) %}
case
when uniform(0::float, 1::float, random(42)) > {{ test_size }} then 'train'
else 'test'
end
{% endmacro %} --models/model_name.sql
select
*,
{{ train_test_split( 0.33 ) }} as train_test_identifier
from {{ ref('fruit_user_input') }} |
Beta Was this translation helpful? Give feedback.
-
One thing I'm wondering about. From my experience in Industry with DS stuff, I've seen the most popular use cases that actually get deployed into production (as opposed to just EDA or POC) are forecasting and anomaly detection, which tend to lend themselves well to being returned as a dataframe. The only issue I see with this is having all the ML code being in one big function, which is kind of necessary for the dbt model 🤔 |
Beta Was this translation helpful? Give feedback.
-
Hi, there's a decent chance you already have, but I'd check out the Slack community MLOps.Community that has a high density of MLOps professionals: https://join.slack.com/t/mlops-community/shared_invite/zt-1cyb42jf9-PxVwNwJ7pCg0wxA8ztKOZw I don't have much ML experience yet, so this is my attempt to help out :) |
Beta Was this translation helpful? Give feedback.
-
The first use case we see being enabled with the addition of Python models, is embedding a model and it's predictions both within the dbt project as well as within the analytics DAG without having to build maintain, and orchestrate additional infrastructure. Even this capability we consider to be a game changer. But there's still so much more we could do!
In the medium- and long- term, the other aspects of MLOPs ( exploration, training, and model/metrics tracking) will be integrated into this new workflow. Do you have strong opinions on how dbt should tackle the problem of model and metric tracking? Do you have more questions than answers! Contribute your brain power here!
Beta Was this translation helpful? Give feedback.
All reactions