The future of I/O managers: an opt-in layer #17595
-
I would love to see AssetSpecs for dbt-based asset factories to, e.g., auto-magically define test cases.
-
Is there any plan or way to bring over the There would be a lot of overhead to carry out (for partitioned tables especially) that is already written and implemented here. At the moment, we would love to connect to the database like this instead of via an I/O manager, but without the ability to use table slices, it is far too complicated to use. Something like:
from dagster import AssetExecutionContext, MaterializeResult, asset
from dagster_duckdb import DuckDBResource

@asset(deps=[iris_dataset])
def iris_setosa(context: AssetExecutionContext, duckdb: DuckDBResource) -> MaterializeResult:
    with duckdb.get_client() as client:  # or similar?
        asset_table_slice = client.get_table_slice(context, context)
        dep_table_slice = client.get_table_slice(context, context.upstream_output)  # doesn't currently exist in AssetExecutionContext
        dep_table_query = client.get_select_statement(dep_table_slice)
        client.ensure_schema_exists(context, asset_table_slice)
        client.delete_table_slice(context, asset_table_slice)
    with duckdb.get_connection() as conn:
        conn.execute(f"""
            INSERT INTO {asset_table_slice.schema}.{asset_table_slice.table}
            SELECT * FROM ({dep_table_query})
            WHERE species = 'Iris-setosa'
        """)
        # fetchone() returns a one-element tuple holding the count
        num_rows = conn.execute("SELECT COUNT(*) FROM iris.iris_setosa;").fetchone()[0]
    return MaterializeResult(metadata={"num_rows": num_rows})
-
@jamiedemaria is this already practically possible? "Allowing users to access simple metadata about upstream assets. For example, the name of a table where data was stored in the upstream asset."
-
Most, if not all, Dagster users have had to contend with the I/O manager system at some point. The I/O manager system has been tightly coupled with the orchestration layer, and it was expected that every Dagster user would have to understand the abstraction. For some users, I/O managers are a natural fit for how their data pipelines are designed and how their data is stored. These data pipelines typically transform data in memory, for example with a library like pandas, and store the resulting data assets in the same location, for example in Snowflake tables.
However, not all data pipelines fit this pattern. Sometimes, data assets are too large to hold in memory, or the data pipeline calls out to a third-party tool that already handles storage, for example dbt. In these cases, the I/O manager system doesn’t add value, and sometimes gets in the way of the developer experience.
While opting out of the I/O manager system has been possible since 0.13.11, in version 1.5.0 we introduced new APIs to simplify working with Dagster without I/O managers. Our goal is for these new APIs to be the default way of working with Dagster. Users will no longer have to map input names and output names to asset keys: there will only be asset keys. I/O managers become an opt-in system for those who want its opinionated structure. Users who do not opt in should not have to think about I/O managers.
These new APIs are:
The deps parameter on @asset allows users to set the upstream assets an asset depends on, but the I/O manager will not be used to load these assets into memory.
The deps parameter replaces non_argument_deps, which has been marked deprecated. non_argument_deps relied on string matching asset names, which often resulted in typos that were only caught at runtime. deps accepts assets, which allows in-editor type checkers to detect errors.
AssetDep is a class used for defining a dependency on another asset when additional information, like a PartitionMapping, is needed.
With the addition of AssetDep, users can now define complex dependency relationships that were previously only available in the I/O manager system.
MaterializeResult is a class that can be optionally returned from an @asset to report metadata, code version, and other information about the asset.
The MaterializeResult type does not require that an output value be returned, and does not require the user to understand output names or that the implicit output name for assets is "result".
AssetSpec is a class for defining the specifications of a data asset, like the asset key, dependencies, group, and freshness policies, separate from the computation that creates the asset. Currently, AssetSpecs can be used when writing @multi_assets, and when used, Dagster will expect that storing the assets will be handled in the body of the @multi_asset function. @multi_assets that use AssetSpec can return MaterializeResults or None.
Authors of asset factory functions may find AssetSpecs to be particularly useful. AssetSpec bundles several parameters that are on the @multi_asset decorator together, which allows you to accept AssetSpecs in your factory function rather than duplicating each parameter.

What’s next?
Before we can officially declare I/O managers an opt-in system, we still have some work to do. This includes:
Updating context so that it is not geared around inputs and outputs
We are continuing to work on improving these APIs and the experience of using Dagster without the I/O manager system. Your feedback and suggestions are always welcome!