Comet 0.2.0 blog post (#24)
andygrove authored Aug 29, 2024
1 parent 0b256ca commit e9c31b3
Showing 7 changed files with 249 additions and 95 deletions.
161 changes: 161 additions & 0 deletions 2024/08/28/datafusion-comet-0.2.0/index.html
@@ -0,0 +1,161 @@
<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1"><!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Apache DataFusion Comet 0.2.0 Release | Apache DataFusion Project News &amp; Blog</title>
<meta name="generator" content="Jekyll v4.3.3" />
<meta property="og:title" content="Apache DataFusion Comet 0.2.0 Release" />
<meta name="author" content="pmc" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="&lt;!–" />
<meta property="og:description" content="&lt;!–" />
<link rel="canonical" href="https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0/" />
<meta property="og:url" content="https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0/" />
<meta property="og:site_name" content="Apache DataFusion Project News &amp; Blog" />
<meta property="og:type" content="article" />
<meta property="article:published_time" content="2024-08-28T00:00:00+00:00" />
<meta name="twitter:card" content="summary" />
<meta property="twitter:title" content="Apache DataFusion Comet 0.2.0 Release" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"pmc"},"dateModified":"2024-08-28T00:00:00+00:00","datePublished":"2024-08-28T00:00:00+00:00","description":"&lt;!–","headline":"Apache DataFusion Comet 0.2.0 Release","mainEntityOfPage":{"@type":"WebPage","@id":"https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0/"},"publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://datafusion.apache.org/blog/img/2x_bgwhite_original.png"},"name":"pmc"},"url":"https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0/"}</script>
<!-- End Jekyll SEO tag -->
<link rel="stylesheet" href="/blog/assets/main.css"><link type="application/atom+xml" rel="alternate" href="https://datafusion.apache.org/blog/feed.xml" title="Apache DataFusion Project News &amp; Blog" /></head>
<body><header class="site-header" role="banner">

<div class="wrapper"><a class="site-title" rel="author" href="/blog/">Apache DataFusion Project News &amp; Blog</a><nav class="site-nav">
<input type="checkbox" id="nav-trigger" class="nav-trigger" />
<label for="nav-trigger">
<span class="menu-icon">
<svg viewBox="0 0 18 15" width="18px" height="15px">
<path d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z M18,13.516C18,14.335,17.335,15,16.516,15H1.484 C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/>
</svg>
</span>
</label>

<div class="trigger"><a class="page-link" href="/blog/about/">About</a></div>
</nav></div>
</header>
<main class="page-content" aria-label="Content">
<div class="wrapper">
<article class="post h-entry" itemscope itemtype="http://schema.org/BlogPosting">

<header class="post-header">
<h1 class="post-title p-name" itemprop="name headline">Apache DataFusion Comet 0.2.0 Release</h1>
<p class="post-meta">
<time class="dt-published" datetime="2024-08-28T00:00:00+00:00" itemprop="datePublished">Aug 28, 2024
</time><span itemprop="author" itemscope itemtype="http://schema.org/Person"><span class="p-author h-card" itemprop="name">pmc</span></span></p>
</header>

<div class="post-content e-content" itemprop="articleBody">
<!--
-->

<p>The Apache DataFusion PMC is pleased to announce version 0.2.0 of the <a href="https://datafusion.apache.org/comet/">Comet</a> subproject.</p>

<p>Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.</p>

<p>Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
to the <a href="https://datafusion.apache.org/comet/user-guide/compatibility.html">compatibility guide</a> for more information.</p>

<p>This release covers approximately four weeks of development work and is the result of merging 87 PRs from 14
contributors. See the <a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.2.0.md">change log</a> for more information.</p>

<h2 id="release-highlights">Release Highlights</h2>

<h3 id="docker-images">Docker Images</h3>

<p>Docker images are now available from the <a href="https://github.com/apache/datafusion-comet/pkgs/container/datafusion-comet/265110454?tag=spark-3.4-scala-2.12-0.2.0">GitHub Container Registry</a>.</p>

<h3 id="performance-improvements">Performance Improvements</h3>

<ul>
<li>Native shuffle is now enabled by default</li>
<li>Improved handling of decimal types</li>
<li>Reduced some redundant copying of batches in Filter/Scan operations</li>
<li>Optimized performance of count aggregates</li>
<li>Optimized performance of CASE expressions for specific uses:
<ul>
<li>CASE WHEN expr THEN column ELSE null END</li>
<li>CASE WHEN expr THEN literal ELSE literal END</li>
</ul>
</li>
<li>Optimized performance of IS NOT NULL</li>
</ul>
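
<p>As a rough illustration of the two CASE shapes listed above, the following sketch runs both against an in-memory SQLite database from Python. It only shows what the queries look like and what they return. It does not exercise Comet or Spark, and the <code>orders</code> table and its columns are hypothetical names chosen for this example.</p>

```python
import sqlite3

# Hypothetical table and column names, used only to illustrate the query shapes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", 10), (2, "pending", 20), (3, None, 30)],
)

# Shape 1: CASE WHEN expr THEN column ELSE null END
column_or_null = [
    row[0]
    for row in conn.execute(
        "SELECT CASE WHEN status = 'shipped' THEN amount ELSE NULL END "
        "FROM orders ORDER BY id"
    )
]

# Shape 2: CASE WHEN expr THEN literal ELSE literal END
literal_or_literal = [
    row[0]
    for row in conn.execute(
        "SELECT CASE WHEN status = 'shipped' THEN 1 ELSE 0 END "
        "FROM orders ORDER BY id"
    )
]

print(column_or_null)
print(literal_or_literal)
```

<p>In both shapes a row whose condition evaluates to NULL takes the ELSE branch, which is standard SQL behavior.</p>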

<h3 id="new-features">New Features</h3>

<ul>
<li>Window operations now support count and sum aggregates</li>
<li>Support for the CreateArray expression</li>
<li>Support for the GetStructField expression</li>
<li>Support for nested types in hash join</li>
<li>Basic implementation of RLIKE expression</li>
</ul>
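
<p>For readers unfamiliar with window aggregates, the sketch below shows the kind of query that uses count and sum as window functions. It runs against an in-memory SQLite database from Python purely to illustrate the query shape and assumes a SQLite build with window function support (3.25 or later). It does not run Comet or Spark, and the <code>sales</code> table is a hypothetical example.</p>

```python
import sqlite3

# Hypothetical table, used only to illustrate count and sum as window aggregates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 20), ("west", 5)],
)

# Each row keeps its own values and also receives the per-region count and sum.
window_rows = conn.execute(
    "SELECT region, amount, "
    "COUNT(*) OVER (PARTITION BY region) AS n, "
    "SUM(amount) OVER (PARTITION BY region) AS total "
    "FROM sales ORDER BY region, amount"
).fetchall()

for row in window_rows:
    print(row)
```

<p>Unlike a GROUP BY aggregate, the window form returns one output row per input row.</p>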

<h2 id="current-performance">Current Performance</h2>

<p>We use benchmarks derived from the industry-standard TPC-H and TPC-DS benchmarks to track performance
progress. The following charts show the time it takes to run the queries against 100 GB of data in
Parquet format using a single executor with eight cores. See the <a href="https://datafusion.apache.org/comet/contributor-guide/benchmarking.html">Comet Benchmarking Guide</a>
for details of the environment used for these benchmarks.</p>

<h3 id="benchmark-derived-from-tpc-h">Benchmark derived from TPC-H</h3>

<p>Comet 0.2.0 provides a 62% speedup compared to Spark. This is slightly better than the Comet 0.1.0 release.</p>

<p><img src="/blog/img/comet-0.2.0/tpch_allqueries.png" width="100%" class="img-responsive" alt="Chart showing TPC-H benchmark results for Comet 0.2.0" /></p>

<h3 id="benchmark-derived-from-tpc-ds">Benchmark derived from TPC-DS</h3>

<p>Comet 0.2.0 provides a 21% speedup compared to Spark. This is a significant improvement over
Comet 0.1.0, which did not provide any speedup for this benchmark.</p>

<p><img src="/blog/img/comet-0.2.0/tpcds_allqueries.png" width="100%" class="img-responsive" alt="Chart showing TPC-DS benchmark results for Comet 0.2.0" /></p>

<h2 id="getting-involved">Getting Involved</h2>

<p>The Comet project welcomes new contributors. We use the same <a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord">Slack and Discord</a> channels as the main DataFusion
project.</p>

<p>The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
performance regressions that you find. See the <a href="https://datafusion.apache.org/comet/user-guide/installation.html">Getting Started</a> guide for instructions on downloading and installing
Comet.</p>

<p>There are also many <a href="https://github.com/apache/datafusion-comet/contribute">good first issues</a> waiting for contributions.</p>


</div><a class="u-url" href="/blog/2024/08/28/datafusion-comet-0.2.0/" hidden></a>
</article>

</div>
</main><footer class="site-footer h-card">
<data class="u-url" href="/blog/"></data>

<div class="wrapper">

<h2 class="footer-heading">Apache DataFusion Project News &amp; Blog</h2>

<div class="footer-col-wrapper">
<div class="footer-col footer-col-1">
<ul class="contact-list">
<li class="p-name">Apache DataFusion Project News &amp; Blog</li><li><a class="u-email" href="mailto:[email protected]">[email protected]</a></li></ul>
</div>

<div class="footer-col footer-col-2"><ul class="social-media-list"><li><a href="https://github.com/apache"><svg class="svg-icon"><use xlink:href="/blog/assets/minima-social-icons.svg#github"></use></svg> <span class="username">apache</span></a></li><li><a href="https://www.twitter.com/ApacheDataFusio"><svg class="svg-icon"><use xlink:href="/blog/assets/minima-social-icons.svg#twitter"></use></svg> <span class="username">ApacheDataFusio</span></a></li></ul>
</div>

<div class="footer-col footer-col-3">
<p>Apache DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.</p>
</div>
</div>

</div>

</footer>
</body>

</html>