What's the shortest way to demonstrate a full fledged Kedro pipeline? #3775

astrojuanlu · 2024-04-03T15:23:49Z

astrojuanlu
Apr 3, 2024
Maintainer

AS A
person evaluating Kedro
I WOULD LIKE TO
have a quick way of trying it out
SO THAT
I can form an opinion in minutes rather than hours

Inspired by https://www.tryhamilton.dev/

Consider this:

import polars as pl

from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.runner import SequentialRunner


def fun1(df):
    return df.filter(...)

def fun2(df):
    return df.select(...)


# Pipeline.from_dicts and DataCatalog.from_raw_inputs are the APIs proposed in this post
pipeline = Pipeline.from_dicts(
    dict(func=fun1, inputs=["source"], outputs=["filtered"]),
    dict(func=fun2, inputs=["filtered"], outputs=["final"]),
)

catalog = DataCatalog.from_raw_inputs({"source": pl.read_csv("companies.csv")})
outputs = SequentialRunner().run(pipeline, catalog=catalog)

print(outputs["final"].head())

Some points:

- Pipeline.from_dicts proposes a @classmethod rather than the pipeline helper, so it is clearer what object is being returned
- .from_dicts saves the user from importing the Node class or the node helper altogether
- ~~KedroSession.run_pipeline allows the user to run a Pipeline object directly without having to go through the registration process, create_pipeline etc~~ Replaced with runner.run, see below
- DataCatalog.from_raw_inputs allows the user to inject data from memory directly, instead of having to build a dummy dataset on the spot
- Maybe catalog= should be part of something like KedroSession.create_??? instead, at the moment
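To make the shape of the proposal concrete without installing anything, here is a minimal sketch in plain Python. None of these classes are Kedro's: `Node`, `Pipeline`, and `run_in_order` are hypothetical stand-ins written only for illustration, and the runner simply executes nodes in the order given instead of resolving dependencies the way a real runner would.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Node:
    """Stand-in for a pipeline node: a function plus named inputs and outputs."""

    func: Callable[..., Any]
    inputs: list[str]
    outputs: list[str]


class Pipeline:
    """Stand-in pipeline for illustration; NOT kedro.pipeline.Pipeline."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes

    @classmethod
    def from_dicts(cls, *specs: dict[str, Any]) -> "Pipeline":
        # Each dict carries the keyword arguments of one node, so the caller
        # never needs to import a Node class or a node() helper.
        return cls([Node(**spec) for spec in specs])


def run_in_order(pipeline: Pipeline, data: dict[str, Any]) -> dict[str, Any]:
    """Run nodes in the order given and return only the datasets they produced."""
    store = dict(data)  # raw in-memory inputs, keyed by dataset name
    produced: dict[str, Any] = {}
    for n in pipeline.nodes:
        result = n.func(*(store[name] for name in n.inputs))
        # A single declared output receives the return value as-is;
        # several declared outputs are matched positionally.
        results = (result,) if len(n.outputs) == 1 else result
        for name, value in zip(n.outputs, results):
            store[name] = value
            produced[name] = value
    return produced


def keep_even(values):
    return [v for v in values if v % 2 == 0]


def square(values):
    return [v * v for v in values]


demo = Pipeline.from_dicts(
    dict(func=keep_even, inputs=["source"], outputs=["filtered"]),
    dict(func=square, inputs=["filtered"], outputs=["final"]),
)
outputs = run_in_order(demo, {"source": [1, 2, 3, 4]})
print(outputs["final"])  # prints [4, 16]
```

The named constructor makes the return type obvious at the call site, which is the point of preferring a @classmethod over a free helper function.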

The philosophical argument against this is clear: if we offer an easy way, people will misuse it (see https://en.wikipedia.org/wiki/Worse_is_better), so we'd need to build the right guardrails.

The practical argument in favour of this is clear: it lowers the barrier to entry, hence (potentially, ideally, theoretically) could help increase adoption.

Related issues and discussions:

datajoely · 2024-04-03T15:26:47Z

datajoely
Apr 3, 2024
Collaborator

I know this is a super quick example, but we could make the run argument even simpler by accepting Union[DataCatalog, Dict[str, Any]]

i.e.

outputs = session.run(pipeline, catalog={"source": pl.read_csv("companies.csv")})
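One way such a union-typed argument could be handled is to normalise it at the start of the run call. The sketch below is only an illustration of that idea: `InMemoryCatalog` and `as_catalog` are hypothetical names, not part of Kedro, and a real implementation would wrap the values in whatever dataset type the catalog expects.

```python
from typing import Any, Dict, Union


class InMemoryCatalog:
    """Hypothetical stand-in for a data catalog; NOT kedro.io.DataCatalog."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)

    def load(self, name: str) -> Any:
        return self._data[name]


def as_catalog(catalog: Union[InMemoryCatalog, Dict[str, Any]]) -> InMemoryCatalog:
    """Accept either a ready-made catalog or a plain dict of in-memory data."""
    if isinstance(catalog, InMemoryCatalog):
        return catalog  # already a catalog, pass it through untouched
    return InMemoryCatalog(catalog)  # wrap raw data on the caller's behalf


from_dict = as_catalog({"source": [1, 2, 3]})
existing = InMemoryCatalog({"source": [4, 5]})

print(from_dict.load("source"))          # prints [1, 2, 3]
print(as_catalog(existing) is existing)  # prints True
```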

0 replies

astrojuanlu · 2024-04-05T13:03:04Z

astrojuanlu
Apr 5, 2024
Maintainer Author

I was made aware today that runner.run(pipeline, catalog) exists https://docs.kedro.org/en/stable/api/kedro.runner.AbstractRunner.html#kedro.runner.AbstractRunner.run

so maybe we don't even need to expose the KedroSession for something like this.

1 reply

datajoely Apr 8, 2024
Collaborator

So this really comes down to: do we want to expose lifecycle hooks?

It's actually not very nice to do with the Session code API, but it is possible - whereas it cannot be done with the runner alone.

astrojuanlu · 2024-06-27T22:01:11Z

astrojuanlu
Jun 27, 2024
Maintainer Author

This would be useful in environments like Databricks, where everything is centered around a single file. See also https://github.com/ibis-project/kedro-ibis-tutorial/blob/main/03%20-%20First%20Steps%20with%20Kedro.ipynb

0 replies
