Extension Types #12644
Comments
I'm not very knowledgeable about DataFusion internals or database theory, so it's hard for me to provide feedback on the proposal, but I'm very excited about the prospect of extension types to enable spatial types (#7859). I've been collaborating on the GeoArrow spec, which defines Arrow extension types for spatial data. It's important to have additional logical types because the same physical layout can be interpreted in multiple logical ways (e.g. an array of [...]
FYI, I touched upon the topic of types at the DataFusion meetup in Belgrade yesterday.
Is your feature request related to a problem or challenge?
Currently, DataFusion provides many built-in types, which are useful when building applications and query engines on top of DataFusion. However, even a plethora of types is not enough. DataFusion lacks types that exist in other systems, which limits its applicability as an "LLVM for query engines".
For example, these types, commonly found in other systems, do not exist today:
[...] DataType, and the closest thing Arrow has is "timestamp(zone)", where each value is in the same zone.
[...] DataType, and the closest thing Arrow has is "timestamp(zone)" with e.g. the UTC zone. However, cast to varchar for "timestamp(UTC)" and for "timestamp with local time zone" should behave differently.
[...] DataType, and the closest thing Arrow has is Utf8, potentially with some metadata information. Utf8 might be a perfect carrier type for JSON data, but "cast(json AS T)" and "cast(utf8 AS T)" are usually pretty different operations.
Describe the solution you'd like
CAST(array<T> AS varchar) needs to know how to do cast(T AS varchar). It cannot delegate this logic fully to Arrow, because Arrow won't have a notion of extension types.
[...] cast(... AS varchar). It cannot use the default cast(struct AS varchar).
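To make the cast requirement above concrete, here is a minimal sketch in plain Rust (standard library only). Everything in it is hypothetical and for illustration: `LogicalType`, `TypeRegistry`, `json_to_varchar`, and the toy rule that casting a JSON string scalar to varchar drops the surrounding quotes are assumptions, not DataFusion or Arrow APIs and not any engine's exact semantics. It only shows that casting array<T> to varchar has to look up the element type's own cast, even when the element is carried as Utf8.

```rust
use std::collections::HashMap;

/// Hypothetical logical type; not a DataFusion or Arrow type.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    /// Plain string type.
    Utf8,
    /// Extension type identified by name, physically carried as Utf8.
    Extension(String),
}

/// Renders one stored (carrier) value as varchar.
type CastToVarchar = fn(&str) -> String;

/// Toy rule (an assumption): a JSON string scalar loses its quotes,
/// any other JSON value is rendered as stored. Escapes are ignored.
fn json_to_varchar(stored: &str) -> String {
    let t = stored.trim();
    if t.len() >= 2 && t.starts_with('"') && t.ends_with('"') {
        t[1..t.len() - 1].to_string()
    } else {
        t.to_string()
    }
}

/// Hypothetical registry mapping extension type names to their casts.
struct TypeRegistry {
    casts: HashMap<String, CastToVarchar>,
}

impl TypeRegistry {
    fn new() -> Self {
        Self { casts: HashMap::new() }
    }

    fn register(&mut self, name: &str, cast: CastToVarchar) {
        self.casts.insert(name.to_string(), cast);
    }

    /// cast(T AS varchar): Utf8 is the identity, extension types use
    /// their registered cast, unknown extension types yield None.
    fn cast_value(&self, ty: &LogicalType, stored: &str) -> Option<String> {
        match ty {
            LogicalType::Utf8 => Some(stored.to_string()),
            LogicalType::Extension(name) => self.casts.get(name).map(|f| f(stored)),
        }
    }

    /// CAST(array<T> AS varchar): delegates to cast(T AS varchar) per element.
    fn cast_list(&self, element: &LogicalType, items: &[&str]) -> Option<String> {
        let rendered: Option<Vec<String>> = items
            .iter()
            .map(|v| self.cast_value(element, v))
            .collect();
        rendered.map(|r| format!("[{}]", r.join(", ")))
    }
}
```

With the same stored values, an array of Utf8 and an array of the hypothetical `json` extension type render differently, which is why the array cast cannot be handled by logic that only sees the physical Utf8 layout.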
Describe alternatives you've considered
Everything is built-in
DataFusion could provide all types needed by applications building on top of DataFusion as built-in DataFusion types.
This would be the easiest to implement, but it could lead to scope creep for the project. It could also lead to conflicts where types look the same but the desired behavior differs between applications building on top of DataFusion. For example, Oracle's and Trino's "timestamp with time zone" can represent political zones, while Snowflake's allows only fixed offsets.
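A small sketch of that conflict, assuming a hypothetical model in which one built-in "timestamp with time zone" type would have to pick a single zone policy. The names `Zone`, `ZoneSupport`, and `validate` are made up for illustration and do not correspond to any engine's API.

```rust
/// Hypothetical zone attached to a "timestamp with time zone" value.
#[derive(Debug, Clone, PartialEq)]
enum Zone {
    /// Fixed offset from UTC in minutes, e.g. 120 for +02:00.
    FixedOffset(i32),
    /// Political (region) zone identified by name, e.g. "Europe/Warsaw".
    Region(String),
}

/// Which zones a given flavor of the type accepts.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ZoneSupport {
    /// Only fixed offsets (the behavior the text attributes to Snowflake).
    FixedOffsetsOnly,
    /// Fixed offsets and political zones (attributed to Oracle and Trino).
    OffsetsAndRegions,
}

/// Accepts or rejects a zone under the given policy.
fn validate(support: ZoneSupport, zone: &Zone) -> Result<(), String> {
    match (support, zone) {
        (ZoneSupport::FixedOffsetsOnly, Zone::Region(name)) => {
            Err(format!("region zone '{}' is not supported by this type", name))
        }
        _ => Ok(()),
    }
}
```

A single built-in type can encode only one of these policies, so an application that needs the other one would be left without a matching type.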
No-op
Do not provide extension types. This would limit DataFusion's applicability.
DataFusion cannot be considered "LLVM for query engines" if it cannot serve as an engine, or potential engine, for existing popular query engines.
Additional context
The need to create extension types was raised in "[Proposal] Decouple logical from physical types".
However, introducing DataFusion's own types does not require introducing extension types.
Extension types are complex enough (especially given their impact on functions) that they deserve their own roadmap issue.
The impact of extension types on functions, function runtime, and function resolution is very clear, so this relates to the Simple Functions initiative:
Having ExtensionType in arrow-rs would make the implementation simpler: