The future of I/O managers: an opt-in layer #17595

jamiedemaria (Maintainer) · 2023-11-01T17:49:41Z

Most, if not all, Dagster users have had to contend with the I/O manager system at some point. The I/O manager system has been tightly coupled with the orchestration layer, and every Dagster user was expected to understand the abstraction. For some users, I/O managers are a natural fit for how their data pipelines are designed and how their data is stored. These pipelines typically transform data in memory (for example, with a library like pandas) and store the resulting data assets in the same location (for example, in Snowflake tables).

However, not all data pipelines fit this pattern. Sometimes data assets are too large to hold in memory, or the pipeline calls out to a third-party tool that already handles storage, such as dbt. In these cases, the I/O manager system adds no value and can get in the way of the developer experience.

While opting out of the I/O manager system has been possible since 0.13.11, version 1.5.0 introduced new APIs that simplify working with Dagster without I/O managers. Our goal is for these new APIs to be the default way of working with Dagster. Users will no longer have to map input names and output names to asset keys: there will only be asset keys. I/O managers become an opt-in system for those who want their opinionated structure. Users who do not opt in should not have to think about I/O managers.

These new APIs are:

`deps` parameter on `@asset`

The deps parameter allows users to set the upstream assets an asset depends on, but the I/O manager will not be used to load these assets into memory.

# this code is shared by the examples below, which demonstrate the I/O manager-centric APIs and the new APIs
@asset
def iris_dataset(duckdb: DuckDBResource) -> None:
    iris_df = pd.read_csv(
        "https://docs.dagster.io/assets/iris.csv",
        names=[
            "sepal_length_cm",
            "sepal_width_cm",
            "petal_length_cm",
            "petal_width_cm",
            "species",
        ],
    )

    with duckdb.get_connection() as conn:
        conn.execute("CREATE TABLE iris.iris_dataset AS SELECT * FROM iris_df")

The first block below uses the old API (non_argument_deps); the second uses the improved API (deps).

@asset(non_argument_deps={"iris_dataset"})
def iris_setosa(duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as conn:
        conn.execute(
            "CREATE TABLE iris.iris_setosa AS"
            " SELECT * FROM iris.iris_dataset"
            " WHERE species = 'Iris-setosa'"
        )

@asset(deps=[iris_dataset])
def iris_setosa(duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as conn:
        conn.execute(
            "CREATE TABLE iris.iris_setosa AS"
            " SELECT * FROM iris.iris_dataset"
            " WHERE species = 'Iris-setosa'"
        )

The deps parameter replaces using non_argument_deps, and non_argument_deps has been marked deprecated. non_argument_deps relied on string matching asset names, which often resulted in typos that were only caught at runtime. deps accepts assets, which allows in-editor type checkers to detect errors.

`AssetDep`

AssetDep is a class used for defining a dependency on another asset when additional information, like a PartitionMapping, is needed.

daily_partitions_def = DailyPartitionsDefinition(start_date="2023-08-20")

@asset(
    partitions_def=daily_partitions_def
)
def signups(context: AssetExecutionContext):
    # adds the signup data for date to the signups table
    get_and_store_signup_data(date=context.partition_key)

The first block below uses the I/O manager-centric API; the second uses the new API.

@asset(
    partitions_def=daily_partitions_def,
    ins={
        "signups": AssetIn(
            dagster_type=Nothing,
            partition_mapping=TimeWindowPartitionMapping(start_offset=-10)
        )
    }
)
def ten_day_retention_rate(
    context: AssetExecutionContext,
    snowflake: SnowflakeResource
) -> None:
    time_window = context.asset_partitions_time_window_for_input("signups")
    with snowflake.get_connection() as conn:
        cursor = conn.cursor()
        # fetchone()[0] extracts the scalar COUNT(*) from the result row
        all_signups = cursor.execute(
            "SELECT COUNT(*) from signups"
            f" WHERE signup_date >= '{time_window.start}'"
            f" AND signup_date <= '{time_window.end}'"
        ).fetchone()[0]
        still_active_signups = cursor.execute(
            "SELECT COUNT(*) from signups"
            f" WHERE signup_date >= '{time_window.start}'"
            f" AND signup_date <= '{time_window.end}'"
            " AND status = 'ACTIVE'"
        ).fetchone()[0]

        retention_rate = still_active_signups / all_signups

        cursor.execute(
            "INSERT INTO ten_day_retention_rate"
            f" VALUES ('{context.partition_key}', {retention_rate})"
        )

@asset(
    partitions_def=daily_partitions_def,
    deps=[AssetDep(
        signups,
        partition_mapping=TimeWindowPartitionMapping(start_offset=-10))
    ]
)
def ten_day_retention_rate(
    context: AssetExecutionContext,
    snowflake: SnowflakeResource
) -> None:
    time_window = context.asset_partitions_time_window_for_input("signups")
    with snowflake.get_connection() as conn:
        cursor = conn.cursor()
        # fetchone()[0] extracts the scalar COUNT(*) from the result row
        all_signups = cursor.execute(
            "SELECT COUNT(*) from signups"
            f" WHERE signup_date >= '{time_window.start}'"
            f" AND signup_date <= '{time_window.end}'"
        ).fetchone()[0]
        still_active_signups = cursor.execute(
            "SELECT COUNT(*) from signups"
            f" WHERE signup_date >= '{time_window.start}'"
            f" AND signup_date <= '{time_window.end}'"
            " AND status = 'ACTIVE'"
        ).fetchone()[0]

        retention_rate = still_active_signups / all_signups

        cursor.execute(
            "INSERT INTO ten_day_retention_rate"
            f" VALUES ('{context.partition_key}', {retention_rate})"
        )

With the addition of AssetDep, users can now define complex dependency relationships that were previously only available in the I/O manager system.
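To make the `start_offset=-10` mapping concrete, here is a small standard-library sketch of which upstream daily partitions one downstream partition would cover. `upstream_daily_partitions` is an illustrative helper, not Dagster's implementation; it assumes daily partitions and ignores the start date of the partitions definition.

```python
from datetime import date, timedelta
from typing import List


def upstream_daily_partitions(
    downstream_key: str, start_offset: int = 0, end_offset: int = 0
) -> List[str]:
    """Return the upstream daily partition keys for one downstream partition."""
    day = date.fromisoformat(downstream_key)
    # Offsets shift the first and last upstream day relative to the downstream day.
    first = day + timedelta(days=start_offset)
    last = day + timedelta(days=end_offset)
    return [
        (first + timedelta(days=i)).isoformat()
        for i in range((last - first).days + 1)
    ]


keys = upstream_daily_partitions("2023-09-01", start_offset=-10)
print(keys[0], keys[-1], len(keys))  # 2023-08-22 2023-09-01 11
```

Under these assumptions, the downstream partition covers its own day plus the ten preceding days, which is the window the retention query above filters on.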

`MaterializeResult`

MaterializeResult is a class that can be optionally returned from an @asset to report metadata, code version, and other information about the asset.

The first block below uses the I/O manager-centric API; the second uses the new API.

@asset(non_argument_deps={"iris_dataset"})
def iris_setosa(
    context: AssetExecutionContext,
    duckdb: DuckDBResource
) -> None:
    with duckdb.get_connection() as conn:
        conn.execute(
            "CREATE TABLE iris.iris_setosa AS"
            " SELECT * FROM iris.iris_dataset"
            " WHERE species = 'Iris-setosa'"
        )
        # fetchone()[0] extracts the scalar COUNT(*) from the result row
        num_rows = conn.execute("SELECT COUNT(*) FROM iris.iris_setosa;").fetchone()[0]

    context.add_output_metadata(
        output_name="result",
        metadata={"num_rows": int(num_rows)}
    )

@asset(deps=[iris_dataset])
def iris_setosa(duckdb: DuckDBResource) -> MaterializeResult:
    with duckdb.get_connection() as conn:
        conn.execute(
            "CREATE TABLE iris.iris_setosa AS"
            " SELECT * FROM iris.iris_dataset"
            " WHERE species = 'Iris-setosa'"
        )
        # fetchone()[0] extracts the scalar COUNT(*) from the result row
        num_rows = conn.execute("SELECT COUNT(*) FROM iris.iris_setosa;").fetchone()[0]

    return MaterializeResult(metadata={"num_rows": num_rows})

The MaterializeResult type does not require that an output value be returned, and does not require the user to understand output names or that the implicit output name for assets is "result".

`AssetSpec`

AssetSpec is a class for defining the specification of a data asset (such as its asset key, dependencies, group, and freshness policy) separately from the computation that creates the asset. Currently, AssetSpecs can be used when writing @multi_assets; when they are used, Dagster expects the body of the @multi_asset function to handle storing the assets. @multi_assets that use AssetSpec can return MaterializeResults or None.

@asset 
def all_data():
    fetch_and_store_all_ml_data()

The first block below uses the I/O manager-centric API; the second uses the new API.

@multi_asset(
    non_argument_deps={"all_data"},
    outs={
        "train_data": AssetOut(dagster_type=Nothing),
        "test_data": AssetOut(dagster_type=Nothing)
    }
)
def train_test_split() -> Tuple[Output, Output]:
    # create_train_test_split splits a table called
    # all_data into two datasets and stores them as
    # snowflake tables
    num_train_rows, num_test_rows = create_train_test_split()

    return (
        Output(
            None,
            output_name="train_data",
            metadata={"num_rows": num_train_rows}
        ),
        Output(
            None,
            output_name="test_data",
            metadata={"num_rows": num_test_rows}
        )
    )

train_data = AssetSpec("train_data", deps=[all_data])
test_data = AssetSpec("test_data", deps=[all_data])

@multi_asset(
    specs=[train_data, test_data]
)
def train_test_split() -> Tuple[MaterializeResult, MaterializeResult]:
    # create_train_test_split splits a table called
    # all_data into two datasets and stores them as
    # snowflake tables
    num_train_rows, num_test_rows = create_train_test_split()

    return (
        MaterializeResult(
            asset_key=train_data.key,
            metadata={"num_rows": num_train_rows}
        ),
        MaterializeResult(
            asset_key=test_data.key,
            metadata={"num_rows": num_test_rows}
        )
    )

Authors of asset factory functions may find AssetSpecs particularly useful. AssetSpec bundles several parameters of the @multi_asset decorator, so a factory function can accept AssetSpecs rather than duplicating each parameter.

The first block below shows the old factory pattern; the second shows the improved factory pattern.

def create_etl_asset(
    asset_name: str,
    ins: Mapping[str, AssetIn],
    outs: Mapping[str, AssetOut],
    group_name: str,
):
    @multi_asset(
        name=asset_name,
        ins=ins,
        outs=outs,
        group_name=group_name,
    )
    def etl():
        ...

    # return the definition so the caller can register it
    return etl

def create_etl_asset(
    asset_name: str,
    specs: List[AssetSpec],
):
    @multi_asset(
        name=asset_name,
        specs=specs,
    )
    def etl():
        ...

    # return the definition so the caller can register it
    return etl

What’s next?

Before we can officially declare I/O managers an opt-in system, we still have some work to do. This includes:

- Improving and stabilizing the APIs introduced here, so that they can be marked non-experimental by the end of the year.
- Improving documentation and educational materials so that users can choose when to use I/O managers.
- Allowing users to access simple metadata about upstream assets. For example, the name of a table where data was stored in the upstream asset.
- Simplifying the context so that it is not geared around inputs and outputs.

We are continuing to work on improving these APIs and the experience of using Dagster without the I/O manager system. Your feedback and suggestions are always welcome!

geoHeil · 2023-11-02T11:09:10Z

I would love to see AssetSpecs for dbt-based asset factories, e.g. to automagically define test cases.

j-blackwell · 2023-11-02T12:07:58Z

Is there any plan or way to bring over the ._get_table_slice() functionality to the resource from the IO manager?

There would be a lot of overhead to carry out (for partitioned tables especially) that is already written and implemented here.

At the moment, we would love to connect to the database like this instead of via an IO manager, but without the ability to use table slices, it makes it far too complicated to use.

Something like:

@asset(deps=[iris_dataset])
def iris_setosa(context: AssetExecutionContext, duckdb: DuckDBResource) -> MaterializeResult:
    with duckdb.get_client() as client: # or similar?
        asset_table_slice = client.get_table_slice(context, context)
        dep_table_slice = client.get_table_slice(context, context.upstream_output) # doesn't currently exist in AssetExecutionContext
        dep_table_query = client.get_select_statement(dep_table_slice)
        client.ensure_schema_exists(context, asset_table_slice)
        client.delete_table_slice(context, asset_table_slice)

    with duckdb.get_connection() as conn:
        conn.execute(f"""
            INSERT INTO {asset_table_slice.schema}.{asset_table_slice.table} 
            SELECT * FROM ({dep_table_query})
            WHERE species = 'Iris-setosa'
        """)
        
        num_rows = conn.execute("SELECT COUNT(*) FROM iris.iris_setosa;")

    return MaterializeResult(metadata={"num_rows": num_rows})

3 replies

jamiedemaria Nov 3, 2023
Maintainer Author

We don't have concrete plans to offer functions like this. Part of introducing these new APIs is to allow for more flexibility in determining how assets are stored (how the path to the bucket/table/etc is created). One of the more frequent requests we got with I/O managers was for more customization in how assets were stored, so at least for now, the expectation is that users will write these kinds of utility functions themselves, according to their own storage requirements.
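As a rough illustration of the kind of user-written utility described here, the sketch below replaces one partition's rows in a table inside a single transaction. It uses the standard-library `sqlite3` module as a stand-in for DuckDB or Snowflake. `replace_partition`, the `signups` table, and its columns are hypothetical names, not part of Dagster.

```python
import sqlite3
from typing import Iterable


def replace_partition(
    conn: sqlite3.Connection, partition_key: str, user_ids: Iterable[str]
) -> int:
    """Delete and rewrite the rows of one partition; return its new row count."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM signups WHERE signup_date = ?", (partition_key,))
        conn.executemany(
            "INSERT INTO signups (signup_date, user_id) VALUES (?, ?)",
            [(partition_key, user_id) for user_id in user_ids],
        )
    return conn.execute(
        "SELECT COUNT(*) FROM signups WHERE signup_date = ?", (partition_key,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (signup_date TEXT, user_id TEXT)")
replace_partition(conn, "2023-08-20", ["a", "b", "c"])
# Re-running the same partition overwrites it instead of appending.
rerun_count = replace_partition(conn, "2023-08-20", ["a", "b"])
total = conn.execute("SELECT COUNT(*) FROM signups").fetchone()[0]
print(rerun_count, total)  # 2 2
```

Inside an asset, the partition key would come from `context.partition_key`, and the returned count could be reported through `MaterializeResult` metadata.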

j-blackwell Nov 6, 2023

~~I agree with the aim, and the idea push the compute onto the database where it makes sense is great.~~
~~But the queries required to handle the deletion/insertion of records is non-trivial for partitioned assets and the table slice functionality is brilliant at handling this.~~

~~It is not currently possible to create this functionality without some information (table, schema name, columns, partition_dimensions) from the upstream assets.~~

~~Would this be something that is possible to add to the context, similar to context.asset_partitions_time_window_for_input("signups") above but for the partition_dimensions?~~

~~Or should I just be writing a new type handler for my IO manager where the obj is the query?~~

j-blackwell Nov 6, 2023

Or should I just be writing a new type handler for my IO manager where the obj is the query?

As all of the required context/metadata/etc. exists within the IO managers, it makes sense to me to create a SqlTypeHandler to add to the IO manager.

I have worked on this today and it solves our needs without any changes to the APIs, but I think it is also a really cool bit of functionality. Would you be interested in a PR for it? Or is there a reason you haven't added one before?

ion-elgreco · 2025-01-27T13:27:39Z

@jamiedemaria is this already practically possible? "Allowing users to access simple metadata about upstream assets. For example, the name of a table where data was stored in the upstream asset."
