Blog post for release 40.0.0 #6

Merged
merged 32 commits into from
Jul 24, 2024

Conversation

Contributor

@alamb alamb commented Jul 9, 2024

BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 )


## Parquet indexing / low latency queries
Contributor

We used the Parquet indexing feature (`ParquetAccessPlan`) to add efficient support for reading from DeltaLake tables and handling deletion vectors. I think it's a cool use case for the indexing feature and perhaps worth calling out as a real-world example.
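
To make the deletion-vector idea concrete, here is a minimal sketch in plain Rust (standard library only) of how a sorted list of deleted row indices could be turned into alternating select/skip runs, which is the general shape of input a row-level access plan consumes. The `Run` enum and the `runs_from_deleted` function are hypothetical names chosen for illustration; this is not the DataFusion or DeltaLake implementation.

```rust
/// A run of consecutive rows to either read or skip.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Convert sorted deleted row indices into select/skip runs
/// covering rows `0..total_rows`.
fn runs_from_deleted(deleted: &[usize], total_rows: usize) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    let mut next_row = 0; // first row not yet covered by any run
    for &d in deleted {
        if d >= total_rows {
            break; // ignore indices past the end of the file
        }
        if d < next_row {
            continue; // duplicate index, already covered
        }
        if d > next_row {
            // Rows between the previous deletion and this one are kept.
            runs.push(Run::Select(d - next_row));
        }
        // Extend the previous skip run when this deleted row is adjacent to it.
        let extended = match runs.last_mut() {
            Some(Run::Skip(n)) if d == next_row => {
                *n += 1;
                true
            }
            _ => false,
        };
        if !extended {
            runs.push(Run::Skip(1));
        }
        next_row = d + 1;
    }
    if next_row < total_rows {
        // Everything after the last deletion is kept.
        runs.push(Run::Select(total_rows - next_row));
    }
    runs
}
```

For example, `runs_from_deleted(&[2, 3, 7], 10)` produces select 2, skip 2, select 3, skip 1, select 2, which together cover all 10 rows.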

Contributor Author

alamb commented Jul 10, 2024

Thanks @phillipleblanc -- I added your suggestions. I hope to clean this up more over the next few days

Contributor Author

@alamb alamb left a comment

I took a pass at the community growth section and performance and started on features / looking ahead.

I'll try and take another pass over the features section tomorrow

Member

findepi commented Jul 19, 2024

Is there a way to deploy a rendered preview?

Comment on lines 39 to 46
<!-- todo update this intro -->
[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast data centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While [DataFusion’s primary design
goal] is to accelerate creating other data centric systems, it has a
reasonable experience directly out of the box as a [dataframe library] and
[command line SQL tool].
Member

reminder about the todo (this paragraph isn't to be reviewed yet, right?)

Comment on lines 173 to 174
group by columns which resulted in a 40% performance improvement for some
benchmarks.
Member

> for some benchmarks

Can this be a little more explicit about what types of workloads are expected to benefit?

## SQL Unparser

DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text.
Member

A side question: is there an assumption that every plan can be converted to SQL?
What happens with operations that result from optimizing the original SQL, like partial aggregations?

Contributor Author

No -- I think, as you correctly point out, there are some `LogicalPlan`s which cannot be converted to SQL.

PartialAggregation only shows up in the `ExecutionPlan`s to my knowledge, but you could easily construct logical plans which can't be converted to SQL (like doing multiple logical aggregations, for example).
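
A toy sketch of why such a conversion has to be fallible, in plain Rust with only the standard library. The `Expr` enum, its `Opaque` variant, and `to_sql` are all hypothetical stand-ins invented for this illustration; they are not DataFusion's `Expr` or its unparser API. The point is only that a tree-to-text conversion returns a `Result`, so nodes with no SQL equivalent surface as errors rather than as wrong SQL.

```rust
/// Toy expression tree. Hypothetical; not DataFusion's `Expr`.
enum Expr {
    Column(String),
    Literal(i64),
    Binary {
        left: Box<Expr>,
        op: &'static str,
        right: Box<Expr>,
    },
    /// Stand-in for any node that has no SQL equivalent.
    Opaque,
}

/// Render the tree as SQL text, or report why it cannot be rendered.
fn to_sql(expr: &Expr) -> Result<String, String> {
    match expr {
        // Quote identifiers, doubling any embedded double quotes.
        Expr::Column(name) => Ok(format!("\"{}\"", name.replace('"', "\"\""))),
        Expr::Literal(v) => Ok(v.to_string()),
        // Parenthesize so operator precedence never changes the meaning.
        Expr::Binary { left, op, right } => {
            Ok(format!("({} {} {})", to_sql(left)?, op, to_sql(right)?))
        }
        Expr::Opaque => Err("expression has no SQL representation".to_string()),
    }
}
```

With these toy types, a comparison of column `a` against the literal `4` renders as `("a" > 4)`, while any tree containing `Opaque` returns an error.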

Comment on lines 286 to 287
of doing this efficiently is to minimize the number of requests made to an
object store by caching metadata and skipping over parts of the file that are
Member

Do we also join requests for adjacent or nearby pages (e.g. sibling columns a, b, or siblings a, c where b is small)?

Contributor Author

The way it works is that the initial pages are fetched in a single object store request with multiple RANGEs
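
As a rough illustration of the range joining asked about above, the following standard-library-only Rust sketch merges byte ranges that overlap, touch, or are separated by at most a small gap, so that fewer ranges need to be requested. The function name `coalesce` and the `max_gap` parameter are hypothetical, and this is not claimed to be how DataFusion or the `object_store` crate behaves.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap` bytes.
/// Input order does not matter; output is sorted by start offset.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        let mut absorbed = false;
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: grow it instead of
            // issuing a separate range.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                absorbed = true;
            }
        }
        if !absorbed {
            merged.push(r);
        }
    }
    merged
}
```

The trade-off is that a larger `max_gap` reads some unneeded bytes (such as a small column `b` sitting between `a` and `c`) in exchange for fewer ranges.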

* Faster and easier to use [TreeNode API] for traversing and manipulating plans and expressions.
* All functions now use the same [Scalar User Defined Function API], making it easier to customize
DataFusion's behavior without sacrificing performance. See [ticket] for more details.
* DataFusion can now be compiled to [WASM].
Member

❤️

@alamb alamb marked this pull request as ready for review July 22, 2024 17:11
Contributor Author

alamb commented Jul 22, 2024

Thank you everyone for your reviews and comments ❤️

I plan to take one final proofreading / wordsmithing pass tomorrow 2024-07-23 and publish on 2024-07-24 unless anyone else would like time to review


BTW it would be great if [someone made a demo] showing how easy this is to do 🎣.

[someone made a demo]: https://github.com/apache/datafusion/issues/9326

Contributor Author

Oh, sweet, yes it would! I think I forgot about that one. Added in 79e8f16

[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/

DataFusion's core thesis is that as a community together, we can build much more
Contributor

DataFusion's core thesis is that as a community, together we can build much more

Contributor Author

in 6c0223f

Contributor

@Throne3d Throne3d left a comment

(A couple of small typos)

It's awesome to see all this progress captured in a single place!

[discussing what we will work on in the next six months]: https://github.com/apache/datafusion/issues/11442
[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227
Contributor

I don't think this is referenced

Contributor Author

alamb commented Jul 23, 2024

Thanks everyone for the comments. I plan to publish this tomorrow

Is there any chance one of the committers could approve this PR so I can do so? @andygrove maybe?

Contributor

@mustafasrepo mustafasrepo left a comment

Thanks for this post

1. DataFusion became a top level Apache Software Foundation project (read the
[press release] and [blog post]).
2. We added several PMC members and new
committers: [@comphead], [@mustafasrepo], [@ozankabak] joined the PMC,
Member

@waynexia also joined the PMC.

Contributor Author

Thank you -- yes indeed you are correct and I double checked it was only in June https://lists.apache.org/thread/h58vgtb0nm7kf25o8bhjns47nkdt5nf0

For some reason I thought he had been part of the PMC longer

@alamb alamb merged commit ab1d356 into apache:main Jul 24, 2024
@alamb alamb deleted the alamb/40.0.0_blog branch July 24, 2024 10:23
Contributor Author

alamb commented Jul 24, 2024

The blog post is live: https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/ 🎉

I also made a few PRs with instructions on how to publish and work with this site that are looking for review (and approval from a committer)

  1. Minor: add publishing instructions #11
  2. Minor: Update asf-site README to point at main REAME #12
  3. Minor: Add instructions for building with docker #5

Also there is a change here for fixing dates: #4

Successfully merging this pull request may close these issues.

Blog post with DataFusion Jan - June 2024