Blog post for release 40.0.0 #6
BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326)
## Parquet indexing / low latency queries
We used the Parquet indexing feature (`ParquetAccessPlan`) to add efficient support for reading from DeltaLake tables and handling deletion vectors. I think it's a cool use case for the indexing feature and perhaps worth calling out as a real-world example.
Co-authored-by: Phillip LeBlanc <[email protected]>
…into alamb/40.0.0_blog
Thanks @phillipleblanc -- I added your suggestions. I hope to clean this up more over the next few days.
I took a pass at the community growth and performance sections and started on features / looking ahead.
I'll try to take another pass over the features section tomorrow.
Is there a way to deploy a rendered preview?
<!-- todo update this intro -->
[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast data centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While [DataFusion’s primary design
goal] is to accelerate creating other data centric systems, it has a
reasonable experience directly out of the box as a [dataframe library] and
[command line SQL tool].
Reminder about the todo (this paragraph isn't to be reviewed yet, right?)
group by columns which resulted in a 40% performance improvement for some
benchmarks.
for some benchmarks
Can this be a little more explicit about what types of workloads are expected to benefit?
## SQL Unparser

DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text.
A side question: is there an assumption that every plan can be converted to SQL?
What happens with operations that result from optimizing the original SQL, like partial aggregations?
No -- I think, as you correctly point out, there are some `LogicalPlan`s that cannot be converted to SQL.
PartialAggregation only shows up in the `ExecutionPlan`s to my knowledge, but you could easily construct logical plans that can't be converted to SQL (by doing multiple logical aggregations, for example).
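To illustrate the point in this exchange, here is a minimal, std-only sketch of an "unparser" that walks an expression tree and either produces SQL text or reports that a node has no SQL spelling. Everything in it (`ToyExpr`, `UnparseError`, `toy_expr_to_sql`) is hypothetical and invented for this illustration; it is not DataFusion's `Expr` type or its unparser API, and the quoting rules are a simplifying assumption.

```rust
/// Toy expression tree. A stand-in for illustration only,
/// NOT DataFusion's `Expr` (all names here are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum ToyExpr {
    Column(String),
    Int(i64),
    Gt(Box<ToyExpr>, Box<ToyExpr>),
    And(Box<ToyExpr>, Box<ToyExpr>),
    /// A node with no direct SQL equivalent (e.g. an engine-internal operator).
    Internal(&'static str),
}

/// Error returned when a node cannot be written as SQL.
#[derive(Debug, Clone, PartialEq)]
pub struct UnparseError(pub String);

/// Convert a toy expression back to SQL text, failing on nodes
/// that have no SQL representation.
pub fn toy_expr_to_sql(expr: &ToyExpr) -> Result<String, UnparseError> {
    match expr {
        // Quote identifiers and escape embedded double quotes (assumed dialect).
        ToyExpr::Column(name) => Ok(format!("\"{}\"", name.replace('"', "\"\""))),
        ToyExpr::Int(v) => Ok(v.to_string()),
        // Parenthesize binary operators so precedence is never ambiguous.
        ToyExpr::Gt(l, r) => Ok(format!(
            "({} > {})",
            toy_expr_to_sql(l)?,
            toy_expr_to_sql(r)?
        )),
        ToyExpr::And(l, r) => Ok(format!(
            "({} AND {})",
            toy_expr_to_sql(l)?,
            toy_expr_to_sql(r)?
        )),
        // Not every node round-trips: surface an error instead of guessing.
        ToyExpr::Internal(what) => Err(UnparseError(format!(
            "cannot convert {} to SQL",
            what
        ))),
    }
}
```

Returning a `Result` rather than a bare `String` is the design choice that matters here: a caller can tell "this plan has no SQL form" apart from a successful conversion, which mirrors the answer above that not every plan is convertible.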
of doing this efficiently is to minimize the number of requests made to an
object store by caching metadata and skipping over parts of the file that are
Do we also join requests for adjacent or nearby pages (e.g. sibling columns a,b, or siblings a,c where b is small)?
The way it works is that the initial pages are fetched in a single object store request with multiple RANGEs
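For readers wondering what "joining" nearby page reads could look like, one common approach is to coalesce byte ranges whose gap is below some threshold before issuing requests. The sketch below is a generic, std-only illustration under that assumption; `merge_nearby_ranges` and `max_gap` are hypothetical names, and this is not taken from DataFusion or the `object_store` crate.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or are separated by at most `max_gap`
/// bytes, so nearby pages can be read together rather than one by one.
/// Hypothetical helper for illustration; not DataFusion's actual code.
pub fn merge_nearby_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    // Sort by start offset so each range only needs comparing to the previous one.
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // starting a new one (reads up to `max_gap` unneeded bytes).
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

The trade-off is between request count and wasted bytes: a larger `max_gap` yields fewer, bigger reads that may include a small unneeded column (the "b is small" case in the question), while `max_gap = 0` merges only ranges that touch or overlap.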
* Faster and easier to use [TreeNode API] for traversing and manipulating plans and expressions.
* All functions now use the same [Scalar User Defined Function API], making it easier to customize
DataFusion's behavior without sacrificing performance. See [ticket] for more details.
* DataFusion can now be compiled to [WASM].
❤️
Thank you everyone for your reviews and comments ❤️ I plan to take one final proofreading / wordsmithing pass tomorrow (2024-07-23) and publish on 2024-07-24 unless anyone else would like time to review.
BTW it would be great if [someone made a demo] showing how easy this is to do 🎣.

[someone made a demo]: https://github.com/apache/datafusion/issues/9326
Thanks @alamb !
One question: would it make sense to link the example https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/function_factory.rs ?
Oh, sweet, yes it would! I think I forgot about that one. Added in 79e8f16
[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/

DataFusion's core thesis is that as a community together, we can build much more
DataFusion's core thesis is that as a community, together we can build much more
in 6c0223f
(A couple of small typos)
It's awesome to see all this progress captured in a single place!
[discussing what we will work on in the next six months]: https://github.com/apache/datafusion/issues/11442
[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227
I don't think this is referenced
Co-authored-by: Phillip LeBlanc <[email protected]> Co-authored-by: Edward Jones <[email protected]>
…into alamb/40.0.0_blog
Thanks everyone for the comments. I plan to publish this tomorrow. Is there any chance one of the committers could approve this PR so I can do so? @andygrove maybe?
Thanks for this post
1. DataFusion became a top level Apache Software Foundation project (read the
[press release] and [blog post]).
2. We added several PMC members and new
committers: [@comphead], [@mustafasrepo], [@ozankabak] joined the PMC,
@waynexia also joined the PMC.
Thank you -- yes, you are correct, and I double-checked that it was only in June: https://lists.apache.org/thread/h58vgtb0nm7kf25o8bhjns47nkdt5nf0
For some reason I thought he had been part of the PMC longer.
The blog post is live: https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/ 🎉 I also made a few PRs with instructions on how to publish and work with this site; they are looking for review (and approval from a committer).
There is also a change here for fixing dates: #4
Closes apache/datafusion#9602