diff --git a/2024/09/13/string-view-german-style-strings-part-1/index.html b/2024/09/13/string-view-german-style-strings-part-1/index.html new file mode 100644 index 0000000..6f9621a --- /dev/null +++ b/2024/09/13/string-view-german-style-strings-part-1/index.html @@ -0,0 +1,257 @@ + + + + + +Using StringView / German Style Strings to Make Queries Faster: Part 1- Reading Parquet | Apache DataFusion Project News & Blog + + + + + + + + + + + + + + + + + +
+
+
+ +
+

Using StringView / German Style Strings to Make Queries Faster: Part 1- Reading Parquet

+ +
+ +
+ + +

Editor’s Note: This is the first of a two-part blog series that was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao’s summer intern project.

+ +

This blog describes our experience implementing StringView in the Rust implementation of Apache Arrow, and integrating it into Apache DataFusion, significantly accelerating string-intensive queries in the ClickBench benchmark by 20%-200% (Figure 1).

+ +

Getting significant end-to-end performance improvements was non-trivial. Implementing StringView itself was only a fraction of the effort required. Among other things, we had to optimize UTF-8 validation, implement unintuitive compiler optimizations, tune block sizes, and carefully time garbage collection (GC) to realize the FDAP ecosystem’s benefit. Together with other members of the open source community, we were able to overcome performance bottlenecks that could have killed the project. We explain the challenges and solutions in detail so that more of the community can learn from our experience.

+ +

StringView is based on a simple idea: avoid some string copies and accelerate comparisons with inlined prefixes. Like most great ideas, it is “obvious” only after someone describes it clearly. Although the idea is simple, a straightforward implementation actually slows down performance for almost every query. We must, therefore, apply astute observations and diligent engineering to realize the actual benefits of StringView.

+ +

Although this journey was successful, not all research ideas are as lucky. To accelerate the adoption of research into industry, it is valuable to integrate research prototypes with practical systems. Understanding the nuances of real-world systems makes it more likely that research designs will lead to practical system improvements.

+ +

StringView support was released as part of arrow-rs v52.2.0 and DataFusion v41.0.0. You can try it by setting the schema_force_view_types DataFusion configuration option, and we are hard at work with the community to make it the default. We invite everyone to try it out, take advantage of the effort invested so far, and contribute to making it better.

+ +

End to end performance improvements for ClickBench queries

+ +

Figure 1: StringView improves string-intensive ClickBench query performance by 20% - 200%

+ +

What is StringView?

+ +

Diagram of using StringArray and StringViewArray to represent the same string content

+ +

Figure 2: Use StringArray and StringViewArray to represent the same string content.

+ +

The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.

+ +

A StringViewArray consists of three components:

+ +
  1. The view array
  2. The buffers
  3. The buffer pointers (IDs) that map buffer offsets to their physical locations
+ +

Each view is 16 bytes long, and its contents differ based on the string’s length:

+ +
  • string length ≤ 12 bytes: the first 4 bytes store the string length, and the remaining 12 bytes store the inlined string.
  • string length > 12 bytes: the string is stored in a separate buffer. The length is again stored in the first 4 bytes, followed by the prefix (the first 4 bytes of the string), the buffer id (4 bytes), and the offset (4 bytes) into that buffer.
+ +

Figure 2 shows an example of the same logical content (left) using StringArray (middle) and StringViewArray (right):

+ +
  • The first string – "Apache DataFusion" – is 17 bytes long, and both StringArray and StringViewArray store the string’s bytes at the beginning of the buffer. The StringViewArray also inlines the first 4 bytes – "Apac" – in the view.
  • The second string, "InfluxDB", is only 8 bytes long, so StringViewArray completely inlines the string content in the view struct, while StringArray stores the string in the buffer as well.
  • The third string, "Arrow Rust Impl", is 15 bytes long and cannot be fully inlined. StringViewArray stores it in the same form as the first string.
  • The last string, "Apache DataFusion", has the same content as the first string. StringViewArray can avoid this duplication and reuse the bytes by pointing the view at the previous location.
+ +

StringViewArray provides three opportunities for outperforming StringArray:

+ +
  1. Less copying via the offset + buffer format
  2. Faster comparisons using the inlined string prefix
  3. Reusing repeated string values with the flexible view layout
+ +

The rest of this blog post discusses how to apply these opportunities in real query scenarios to improve performance, what challenges we encountered along the way, and how we solved them.

+ +

Faster Parquet Loading

+ +

Apache Parquet is the de facto format for storing large-scale analytical data, commonly in LakeHouse-style systems such as Apache Iceberg and Delta Lake. Efficiently loading data from Parquet is thus critical to query performance in many important real-world workloads.

+ +

Parquet encodes strings (i.e., byte arrays) in a slightly different format than required for the original Arrow StringArray. The string length is encoded inline with the actual string data (as shown in Figure 4, left). As mentioned previously, StringArray requires the data buffer to be contiguous and compact—the strings have to follow one after another. This requirement means that reading Parquet string data into an Arrow StringArray requires copying and consolidating the string bytes into a new buffer and tracking offsets in a separate array. Copying these strings is often wasteful. Typical queries filter out most data immediately after loading, so most of the copied data is quickly discarded.

+ +

On the other hand, reading Parquet data as a StringViewArray can reuse the same data buffer that stores the decoded Parquet pages because StringViewArray does not require strings to be contiguous. For example, in Figure 4, the StringViewArray directly references the buffer with the decoded Parquet page. The string "Arrow Rust Impl" is represented by a view with offset 37 and length 15 into that buffer.

+ +

Diagram showing how StringViewArray can avoid copying by reusing decoded Parquet pages.

+ +

Figure 4: StringViewArray avoids copying by reusing decoded Parquet pages.

+ +

Mini benchmark

+ +

Reusing Parquet buffers is great in theory, but how much does saving a copy actually matter? We can run the following benchmark in arrow-rs to find out:
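The benchmark lives in the parquet crate of arrow-rs and can be invoked with cargo bench; the exact bench target name and filter strings below are assumptions (check the available benches in your checkout), only the BinaryArray vs. BinaryViewArray comparison itself is taken from the text:

```shell
# From an arrow-rs checkout: run the Parquet -> Arrow byte-array reader
# benchmarks twice, once per array type. Names are illustrative; use
# `cargo bench --bench <name> -- --list` to find the exact filters.
cd arrow-rs/parquet
cargo bench --bench arrow_reader -- "BinaryArray"
cargo bench --bench arrow_reader -- "BinaryViewArray"
```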

+ +

Our benchmarking machine shows that loading BinaryViewArray is almost 2x faster than loading BinaryArray (see the next section for why this isn’t StringViewArray).

+ +

You can read more on this arrow-rs issue: https://github.com/apache/arrow-rs/issues/5904

+ +

From Binary to Strings

+ +

You may wonder why we reported performance for BinaryViewArray when this post is about StringViewArray. Surprisingly, our initial implementation of reading StringViewArray from Parquet was much slower than reading StringArray. Why? TL;DR: Although reading StringViewArray copied less data, the initial implementation also spent much more time validating UTF-8 (as shown in Figure 5).

+ +

Strings are stored as byte sequences. When reading data from (potentially untrusted) Parquet files, a Parquet decoder must ensure those byte sequences are valid UTF-8 strings, and most programming languages, including Rust, include highly optimized routines for doing so.

+ +

Figure showing time to load strings from Parquet and the effect of optimized UTF-8 validation.

+ +

Figure 5: Time to load strings from Parquet. The cost of UTF-8 validation initially eliminated the advantage of reduced copying for StringViewArray.

+ +

A StringArray can be validated with a single call to the UTF-8 validation function because it has a contiguous string buffer. As long as the underlying buffer is valid UTF-8, all strings in the array must be valid UTF-8. The Rust Parquet reader makes a single function call to validate the entire buffer.

+ +

However, validating an arbitrary StringViewArray requires validating each string with a separate call to the validation function, as the underlying buffer may also contain non-string data (for example, the lengths in Parquet pages).

+ +

UTF-8 validation in Rust is highly optimized and favors longer strings (as shown in Figure 6), likely because it leverages SIMD instructions to perform parallel validation. The benefit of a single function call to validate UTF-8 over a function call for each string more than eliminates the advantage of avoiding the copy for StringViewArray.

+ +

Figure showing UTF-8 validation throughput vs string length.

+ +

Figure 6: UTF-8 validation throughput vs string length—StringArray’s contiguous buffer can be validated much faster than StringViewArray’s buffer.

+ +

Does this mean we should only use StringArray? No! Thankfully, there’s a clever way out. The key observation is that in many real-world datasets, 99% of strings are shorter than 128 bytes, meaning the encoded length values are smaller than 128, in which case the length itself is also valid UTF-8 (in fact, it is ASCII).

+ +

This observation means we can optimize validating UTF-8 strings in Parquet pages by treating the length bytes as part of a single large string as long as the length value is less than 128. Put another way, prior to this optimization, the length bytes act as string boundaries, which require a UTF-8 validation on each string. After this optimization, only those strings with lengths larger than 128 bytes (less than 1% of the strings in the ClickBench dataset) are string boundaries, significantly increasing the UTF-8 validation chunk size and thus improving performance.
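The trick can be demonstrated in a few lines of standalone Rust (a sketch of the idea, not the arrow-rs implementation): Parquet PLAIN encoding stores each value as a 4-byte little-endian length followed by the bytes, and any length below 128 contributes four bytes that are all valid ASCII.

```rust
fn main() {
    // Build a PLAIN-encoded Parquet-style page: [4-byte LE length][bytes]...
    let values = ["Apache DataFusion", "InfluxDB", "Arrow Rust Impl"];
    let mut page: Vec<u8> = Vec::new();
    for v in values {
        page.extend_from_slice(&(v.len() as u32).to_le_bytes());
        page.extend_from_slice(v.as_bytes());
    }
    // Every length here is < 128, so each 4-byte length encodes to bytes like
    // [17, 0, 0, 0], all below 0x80, i.e. valid ASCII. The *entire* page is
    // therefore valid UTF-8 and can be checked with one validation call
    // instead of one call per string.
    assert!(std::str::from_utf8(&page).is_ok());
    println!("validated {} bytes in a single call", page.len());
}
```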

+ +

The actual implementation is only nine lines of Rust (with 30 lines of comments). You can find more details in the related arrow-rs issue: https://github.com/apache/arrow-rs/issues/5995. As expected, with this optimization, loading StringViewArray is almost 2x faster than loading StringArray.

+ +

Be Careful About Implicit Copies

+ +

After all the work to avoid copying strings when loading from Parquet, performance was still not as good as expected. We tracked the problem to a few implicit data copies that we weren’t aware of, as described in this issue.

+ +

The copies we eventually identified come from the following innocent-looking line of Rust code, where self.buf is a reference counted pointer that should transform without copying into a buffer for use in StringViewArray.

+ +

However, Rust’s trait resolution favored a blanket implementation that did copy data. This implementation is shown in the following code block, where the impl<T: AsRef<[u8]>> accepts any type that implements AsRef<[u8]> and copies the data to create a new buffer. To avoid copying, users need to explicitly call from_vec, which consumes the Vec and transforms it into a buffer.
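The pattern can be reproduced with a toy Buffer type (hypothetical, standing in for the arrow-rs Buffer of the time): the blanket From impl silently copies the bytes, while an explicit from_vec reuses the Vec’s allocation.

```rust
struct Buffer(Vec<u8>);

// Blanket impl: accepts anything byte-slice-like, but always copies.
impl<T: AsRef<[u8]>> From<T> for Buffer {
    fn from(value: T) -> Self {
        Buffer(value.as_ref().to_vec()) // allocates and copies
    }
}

impl Buffer {
    // Explicit constructor that consumes the Vec, reusing its allocation.
    fn from_vec(v: Vec<u8>) -> Self {
        Buffer(v)
    }
}

fn main() {
    let v = vec![1u8, 2, 3];
    let ptr_before = v.as_ptr();
    let moved = Buffer::from_vec(v); // zero-copy: same allocation
    assert_eq!(moved.0.as_ptr(), ptr_before);

    let v2 = vec![4u8, 5, 6];
    let ptr2 = v2.as_ptr();
    let copied = Buffer::from(v2); // blanket impl: fresh allocation
    assert_ne!(copied.0.as_ptr(), ptr2);
}
```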

+ +

Diagnosing this implicit copy was time-consuming as it relied on subtle Rust language semantics. We needed to track every step of the data flow to ensure every copy was necessary. To help other users and prevent future mistakes, we also removed the implicit API from arrow-rs in favor of an explicit API. Using this approach, we found and fixed several other unintentional copies in the code base—hopefully, the change will help other downstream users avoid unnecessary copies.

+ +

Help the Compiler by Giving it More Information

+ +

The Rust compiler’s automatic optimizations mostly work very well for a wide variety of use cases, but sometimes the compiler needs additional hints to generate the most efficient code. When profiling the performance of view construction, we found, counterintuitively, that constructing long strings was 10x faster than constructing short strings, which made short strings slower on StringViewArray than on StringArray!

+ +

As described in the first section, StringViewArray treats long and short strings differently. Short strings (≤12 bytes) are inlined directly into the view struct, while long strings inline only the first 4 bytes. The code to construct a view looks something like this:
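A sketch of such a view constructor (illustrative only; the field order follows the Arrow spec, but this is not the arrow-rs code):

```rust
/// Build a 16-byte view for `data`, which lives at `offset` in buffer `buffer_id`.
fn make_view(data: &[u8], buffer_id: u32, offset: u32) -> [u8; 16] {
    let len = data.len() as u32;
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Short string: inline all bytes after the length.
        // The copy length is only known at runtime.
        view[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long string: 4-byte prefix, then buffer id and offset.
        // The prefix copy has a fixed size known at compile time.
        view[4..8].copy_from_slice(&data[..4]);
        view[8..12].copy_from_slice(&buffer_id.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

fn main() {
    let short = make_view(b"InfluxDB", 0, 0);
    assert_eq!(&short[4..12], b"InfluxDB"); // fully inlined
    let long = make_view(b"Apache DataFusion", 0, 0);
    assert_eq!(&long[4..8], b"Apac"); // only the prefix is inlined
}
```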

+ +

It appears that both branches of the code should be fast: they both involve copying at most 16 bytes of data and some memory shift/store operations. How could the branch for short strings be 10x slower?

+ +

Looking at the assembly code using Compiler Explorer, we (with help from Ao Li) found the compiler used CPU load instructions to copy the fixed-size 4 bytes to the view for long strings, but it calls a function, ptr::copy_nonoverlapping, to copy the inlined bytes to the view for short strings. The difference is that long strings have a prefix size (4 bytes) known at compile time, so the compiler directly uses efficient CPU instructions. But, since the size of a short string is unknown to the compiler, it has to call the general-purpose function ptr::copy_nonoverlapping. Making a function call is significant unnecessary overhead compared to a CPU copy instruction.

+ +

However, we know something the compiler doesn’t: the short string size is not arbitrary—it must be between 0 and 12 bytes, and we can leverage this information to avoid the function call. Our solution generates 13 copies of the function using const generics, one for each of the possible inlined lengths. The code looks as follows, and by checking the assembly code, we confirmed there are no calls to ptr::copy_nonoverlapping; only native CPU instructions are used. For more details, see the ticket.
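The shape of the fix, sketched with const generics (the real code is in the linked arrow-rs ticket; the names here are illustrative): dispatching once on the runtime length gives the compiler 13 monomorphized copies, each with a compile-time-constant copy size.

```rust
#[inline(always)]
fn inlined_view<const LEN: usize>(data: &[u8]) -> [u8; 16] {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    // LEN is a compile-time constant, so this copy lowers to plain load/store
    // instructions instead of a call to ptr::copy_nonoverlapping.
    view[4..4 + LEN].copy_from_slice(&data[..LEN]);
    view
}

/// Build a view for a short (<= 12 byte) string.
fn make_short_view(data: &[u8]) -> [u8; 16] {
    match data.len() {
        0 => inlined_view::<0>(data),
        1 => inlined_view::<1>(data),
        2 => inlined_view::<2>(data),
        3 => inlined_view::<3>(data),
        4 => inlined_view::<4>(data),
        5 => inlined_view::<5>(data),
        6 => inlined_view::<6>(data),
        7 => inlined_view::<7>(data),
        8 => inlined_view::<8>(data),
        9 => inlined_view::<9>(data),
        10 => inlined_view::<10>(data),
        11 => inlined_view::<11>(data),
        12 => inlined_view::<12>(data),
        _ => unreachable!("short strings are at most 12 bytes"),
    }
}

fn main() {
    let view = make_short_view(b"InfluxDB");
    assert_eq!(&view[0..4], &8u32.to_le_bytes());
    assert_eq!(&view[4..12], b"InfluxDB");
}
```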

+ +

End-to-End Query Performance

+ +

In the previous sections, we went out of our way to make sure loading StringViewArray is faster than loading StringArray. Before going further, we wanted to verify that obsessing over reducing copies and function calls actually improves end-to-end performance in real-life queries. To do this, we evaluated ClickBench query Q20 in DataFusion, which counts how many URLs contain the word "google":
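The query is the following (as in the ClickBench query set; identifier quoting may vary by dialect):

```sql
SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
```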

+ +

This is a relatively simple query; most of the time is spent on loading the “URL” column to find matching rows. The query plan looks like this:

+ +

We ran the benchmark in the DataFusion repo like this:
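The invocation looked roughly like this; only the --string-view flag is taken from the text, while the binary name, data path, and other flags are assumptions based on the DataFusion benchmarks directory (check benchmarks/README.md in the repo):

```shell
# From a DataFusion checkout, with the ClickBench hits.parquet downloaded
# under benchmarks/data/. Flags other than --string-view are illustrative.
cargo run --release --bin dfbench -- clickbench \
    --path benchmarks/data/hits.parquet \
    --query 20 --iterations 3 --string-view
```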

+ +

With StringViewArray we saw a 24% end-to-end performance improvement, as shown in Figure 7. With the --string-view argument, the end-to-end query time is 944.3 ms, 869.6 ms, 861.9 ms (three iterations). Without --string-view, the end-to-end query time is 1186.1 ms, 1126.1 ms, 1138.3 ms.

+ +

Figure showing StringView improves end to end performance by 24 percent.

+ +

Figure 7: StringView reduces end-to-end query time by 24% on ClickBench Q20.

+ +

We also double-checked with detailed profiling and verified that the time reduction is indeed due to faster Parquet loading.

+ +

Conclusion

+ +

In this first blog post, we have described what it took to improve the +performance of simply reading strings from Parquet files using StringView. While +this resulted in real end-to-end query performance improvements, in our next +post, we explore additional optimizations enabled by StringView in DataFusion, +along with some of the pitfalls we encountered while implementing them.

+ +

Footnotes

+ +
+
  1. Benchmarked with AMD Ryzen 7600x (12 core, 24 threads, 32 MiB L3), WD Black SN770 NVMe SSD (5150MB/4950MB seq RW bandwidth)
  2. Xiangpeng is a PhD student at the University of Wisconsin-Madison
  3. There is also a corresponding BinaryViewArray, which is similar except that the data is not constrained to be UTF-8 encoded strings.
  4. We also make sure that offsets do not break a UTF-8 code point, which is cheaply validated.
+ +
+
+ +
+
+ + + diff --git a/2024/09/13/string-view-german-style-strings-part-2/index.html b/2024/09/13/string-view-german-style-strings-part-2/index.html new file mode 100644 index 0000000..61cdb15 --- /dev/null +++ b/2024/09/13/string-view-german-style-strings-part-2/index.html @@ -0,0 +1,213 @@ + + + + + +Using StringView / German Style Strings to make Queries Faster: Part 2 - String Operations | Apache DataFusion Project News & Blog + + + + + + + + + + + + + + + + + +
+
+
+ +
+

Using StringView / German Style Strings to make Queries Faster: Part 2 - String Operations

+ +
+ +
+ + +

Editor’s Note: This blog series was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao’s summer intern project.

+ +

In the first post, we discussed the nuances required to accelerate Parquet loading using StringViewArray by reusing buffers and reducing copies. +In this second part of the post, we describe the rest of the journey: implementing additional efficient operations for real query processing.

+ +

Faster String Operations

+ +

Faster comparison

+ +

String comparison is ubiquitous; it is the core of +cmp, +min/max, +and like/ilike kernels. StringViewArray is designed to accelerate such comparisons using the inlined prefix—the key observation is that, in many cases, only the first few bytes of the string determine the string comparison results.

+ +

For example, to compare the strings InfluxDB with Apache DataFusion, we only need to look at the first byte to determine the string ordering or equality. In this case, since A is earlier in the alphabet than I, Apache DataFusion sorts first, and we know the strings are not equal. Despite only needing the first byte, comparing these strings when stored as a StringArray requires two memory accesses: 1) load the string offset and 2) use the offset to locate the string bytes. For low-level operations such as cmp that are invoked millions of times in the very hot paths of queries, avoiding this extra memory access can make a measurable difference in query performance.

+ +

For StringViewArray, typically only one memory access is needed to load the view struct. A second memory access is required only if the result cannot be determined from the prefix. For the example above, there is no need for the second access. This technique is very effective in practice: the second access is never necessary for the more than 60% of real-world strings that are shorter than 12 bytes, as they are stored completely inline in the view.
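The idea can be sketched on the raw 16-byte views (a simplified stand-in for the arrow-rs comparison kernels, using the layout from part one: length, then inlined bytes or prefix + buffer id + offset):

```rust
/// Equality of two views; `buffers` holds the out-of-line string bytes.
fn view_eq(a: &[u8; 16], b: &[u8; 16], buffers: &[Vec<u8>]) -> bool {
    // First access: length and prefix live in the views themselves.
    if a[0..4] != b[0..4] {
        return false; // different lengths: not equal
    }
    let len = u32::from_le_bytes(a[0..4].try_into().unwrap()) as usize;
    if len <= 12 {
        return a[4..4 + len] == b[4..4 + len]; // fully inlined: no 2nd access
    }
    if a[4..8] != b[4..8] {
        return false; // prefixes differ: still no 2nd access
    }
    // Second access: fall back to the full strings in the buffers.
    let get = |v: &[u8; 16]| {
        let buf = u32::from_le_bytes(v[8..12].try_into().unwrap()) as usize;
        let off = u32::from_le_bytes(v[12..16].try_into().unwrap()) as usize;
        buffers[buf][off..off + len].to_vec()
    };
    get(a) == get(b)
}

fn main() {
    // "InfluxDB" (8 bytes) inlined into a view: equality needs no buffer access.
    let mut a = [0u8; 16];
    a[0..4].copy_from_slice(&8u32.to_le_bytes());
    a[4..12].copy_from_slice(b"InfluxDB");
    assert!(view_eq(&a, &a, &[]));

    // Different lengths are decided from the views alone.
    let mut b = [0u8; 16];
    b[0..4].copy_from_slice(&17u32.to_le_bytes());
    assert!(!view_eq(&a, &b, &[]));
}
```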

+ +

However, functions that operate on strings must be specialized to take advantage of the inlined prefix. In addition to low-level comparison kernels, we implemented a wide range of other StringViewArray operations that cover the functions and operations seen in ClickBench queries. Supporting StringViewArray in all string operations takes quite a bit of effort, and thankfully the Arrow and DataFusion communities are already hard at work doing so (see https://github.com/apache/datafusion/issues/11752 if you want to help out).

+ +

Faster take and filter

+ +

After a filter operation such as WHERE url <> '' (to avoid processing empty URLs), DataFusion will often coalesce results to form a new array with only the passing elements. This coalescing ensures the batches are sufficiently sized to benefit from vectorized processing in subsequent steps.

+ +

The coalescing operation is implemented using the take and filter kernels in arrow-rs. For StringArray, these kernels require copying the string contents to a new buffer without “holes” in between. This copy can be expensive especially when the new array is large.

+ +

However, take and filter for StringViewArray can avoid the copy by reusing buffers from the old array. The kernels only need to create a new list of views that point at the same strings within the old buffers. +Figure 1 illustrates the difference between the output of both string representations. StringArray creates two new strings at offsets 0-17 and 17-32, while StringViewArray simply points to the original buffer at offsets 0 and 25.
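A miniature of the zero-copy idea (hypothetical types, not the arrow-rs kernels): a view array is a shared buffer plus (offset, length) views, and filtering builds a new view list while reusing the same buffer.

```rust
use std::sync::Arc;

struct MiniViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>, // (offset, len) into `buffer`
}

/// Keep only the elements where `keep` is true, without copying string bytes.
fn filter(array: &MiniViewArray, keep: &[bool]) -> MiniViewArray {
    let views = array
        .views
        .iter()
        .zip(keep.iter())
        .filter_map(|(v, k)| if *k { Some(*v) } else { None })
        .collect();
    // No string bytes are copied: the new array points at the same buffer.
    MiniViewArray { buffer: Arc::clone(&array.buffer), views }
}

fn main() {
    let buffer = Arc::new(b"Apache DataFusionInfluxDB".to_vec());
    let array = MiniViewArray { buffer, views: vec![(0, 17), (17, 8)] };
    let filtered = filter(&array, &[false, true]);
    assert!(Arc::ptr_eq(&array.buffer, &filtered.buffer)); // zero-copy
    let (off, len) = filtered.views[0];
    assert_eq!(&filtered.buffer[off..off + len], b"InfluxDB");
}
```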

+ +

Diagram showing Zero-copy `take`/`filter` for StringViewArray

+ +

Figure 1: Zero-copy take/filter for StringViewArray

+ +

When to GC?

+ +

Zero-copy take/filter is great for generating large arrays quickly, but it is suboptimal for highly selective filters, where most of the strings are filtered out. When the cardinality drops, StringViewArray buffers become sparse—only a small subset of the bytes in the buffer’s memory is referred to by any view. This leads to excessive memory usage, especially in a filter-then-coalesce scenario. For example, a StringViewArray with 10M strings may refer to only 1M strings after some filter operations; however, due to zero-copy take/filter, the buffers holding all 10M strings cannot be released/reused.

+ +

To release unused memory, we implemented a garbage collection (GC) routine to consolidate the data into a new buffer to release the old sparse buffer(s). As the GC operation copies strings, similarly to StringArray, we must be careful about when to call it. If we call GC too early, we cause unnecessary copying, losing much of the benefit of StringViewArray. If we call GC too late, we hold large buffers for too long, increasing memory use and decreasing cache efficiency. The Polars blog on StringView also refers to the challenge presented by garbage collection timing.

+ +

arrow-rs implements the GC process, but it is up to users to decide when to call it. We leverage the semantics of the query engine: we observed that the CoalesceBatchesExec operator, which merges smaller batches into larger ones, is often used after the record cardinality is expected to shrink, which aligns perfectly with the GC scenario in StringViewArray. We, therefore, implemented the GC procedure inside CoalesceBatchesExec with a heuristic that estimates when the buffers are too sparse.
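Such a sparseness heuristic might look like the following (the actual threshold and accounting in arrow-rs/DataFusion differ; this is only the shape of the idea): trigger GC when the bytes actually referenced by views are a small fraction of the bytes held alive by the buffers.

```rust
/// Decide whether to consolidate (GC) a view array's buffers.
/// `referenced_bytes`: total bytes reachable through the views.
/// `buffer_bytes`: total bytes held alive by the underlying buffers.
fn should_gc(referenced_bytes: usize, buffer_bytes: usize) -> bool {
    // Hypothetical threshold: GC when views use less than half of the memory.
    buffer_bytes > 0 && referenced_bytes * 2 < buffer_bytes
}

fn main() {
    // Highly selective filter: views now reference ~10% of the buffer bytes.
    assert!(should_gc(100, 1000));
    // Dense array: views reference most of the buffer, keep zero-copy sharing.
    assert!(!should_gc(900, 1000));
}
```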

+ +

The art of function inlining: not too much, not too little

+ +

Like string inlining, function inlining is the process of embedding a short function into the caller to avoid the overhead of function calls (caller/callee save). +Usually, the Rust compiler does a good job of deciding when to inline. However, it is possible to override its default using the #[inline(always)] directive. +In performance-critical code, inlined code allows us to organize large functions into smaller ones without paying the runtime cost of function invocation.

+ +

However, function inlining is not always better, as it leads to larger function bodies that are harder for LLVM to optimize (for example, suboptimal register spilling) and risk overflowing the CPU’s instruction cache. We observed several performance regressions where function inlining caused slower performance when implementing the StringViewArray comparison kernels. Careful inspection and tuning of the code was required to aid the compiler in generating efficient code. More details can be found in this PR: https://github.com/apache/arrow-rs/pull/5900.

+ +

Buffer size tuning

+ +

StringViewArray permits multiple buffers, which enables a flexible buffer layout and potentially reduces the need to copy data. However, a large number of buffers slows down the performance of other operations. +For example, get_array_memory_size needs to sum the memory size of each buffer, which takes a long time with thousands of small buffers. +In certain cases, we found that multiple calls to concat_batches lead to arrays with millions of buffers, which was prohibitively expensive.

+ +

For example, consider a StringViewArray with the previous default buffer size of 8 KB. With this configuration, holding 4GB of string data requires almost half a million buffers! Larger buffer sizes are needed for larger arrays, but we cannot arbitrarily increase the default buffer size, as small arrays would consume too much memory (most arrays require at least one buffer). Buffer sizing is especially problematic in query processing, as we often need to construct small batches of string arrays, and the sizes are unknown at planning time.

+ +

To balance the buffer size trade-off, we again leverage query processing (DataFusion) semantics to decide when to use larger buffers. While coalescing batches, we combine multiple small string arrays and set a smaller buffer size to keep total memory consumption low. In string aggregation, we aggregate over an entire DataFusion partition, which can generate a large number of strings, so we set a larger buffer size (2MB).

+ +

To assist situations where the semantics are unknown, we also implemented a classic dynamic exponential buffer size growth strategy, which starts with a small buffer size (8KB) and doubles the size of each new buffer up to 2MB. We implemented this strategy in arrow-rs and enabled it by default so that other users of StringViewArray can also benefit from this optimization. See this issue for more details: https://github.com/apache/arrow-rs/issues/6094.
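The growth strategy described above can be sketched as follows (a standalone illustration of the policy, not the arrow-rs code): start at 8 KiB and double each new buffer, capping at 2 MiB.

```rust
/// Produces the size for each successive buffer allocation.
struct BufferSizer {
    next: usize,
}

impl BufferSizer {
    const MIN: usize = 8 * 1024; // 8 KiB starting size
    const MAX: usize = 2 * 1024 * 1024; // 2 MiB cap

    fn new() -> Self {
        Self { next: Self::MIN }
    }

    /// Size to use for the next buffer; doubles until the cap is reached.
    fn next_size(&mut self) -> usize {
        let size = self.next;
        self.next = (self.next * 2).min(Self::MAX);
        size
    }
}

fn main() {
    let mut sizer = BufferSizer::new();
    let sizes: Vec<usize> = (0..10).map(|_| sizer.next_size()).collect();
    assert_eq!(sizes[0], 8 * 1024); // small arrays stay cheap
    assert_eq!(sizes[1], 16 * 1024);
    assert_eq!(sizes[9], 2 * 1024 * 1024); // large arrays reach the cap
}
```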

+ +

End-to-end query performance

+ +

We have made significant progress in optimizing StringViewArray filtering operations. Now, let’s test it in the real world to see how it works!

+ +

Let’s consider ClickBench query 22, which selects multiple string fields (URL, Title, and SearchPhrase) and applies several filters.
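For reference, ClickBench Q22 looks roughly like this (reproduced from the ClickBench query set; consult the queries file in the DataFusion benchmarks for the exact text and quoting):

```sql
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID")
FROM hits
WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> ''
GROUP BY "SearchPhrase"
ORDER BY c DESC
LIMIT 10;
```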

+ +

We ran the benchmark using the following command in the DataFusion repo. Again, the --string-view option means we use StringViewArray instead of StringArray.
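As in part one, the command looked roughly like this; only the --string-view flag is taken from the text, while the binary name, data path, and other flags are assumptions based on the DataFusion benchmarks directory:

```shell
# From a DataFusion checkout, with the ClickBench hits.parquet downloaded
# under benchmarks/data/. Flags other than --string-view are illustrative.
cargo run --release --bin dfbench -- clickbench \
    --path benchmarks/data/hits.parquet \
    --query 22 --iterations 3 --string-view
```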

+ +

To eliminate the impact of the faster Parquet reading using StringViewArray (see the first part of this blog), Figure 2 plots only the time spent in FilterExec. Without StringViewArray, the filter takes 7.17s; with StringViewArray, the filter only takes 4.86s, a 32% reduction in time. Moreover, we see a 17% improvement in end-to-end query performance.

+ +

Figure showing StringViewArray reduces the filter time by 32% on ClickBench query 22.

+ +

Figure 2: StringViewArray reduces the filter time by 32% on ClickBench query 22.

+ +

Faster String Aggregation

+ +

So far, we have discussed how to exploit two StringViewArray features: reduced copying and faster filtering. This section focuses on a third: reusing string bytes for repeated string values.

+ +

As described in part one of this blog, if two strings have identical values, StringViewArray can use two different views pointing at the same buffer range, thus avoiding repeating the string bytes in the buffer. This makes StringViewArray similar to an Arrow DictionaryArray that stores Strings—both array types work well for strings with only a few distinct values.

+ +

Deduplicating string values can significantly reduce memory consumption in StringViewArray. However, this process is expensive and involves hashing every string and maintaining a hash table, and so it cannot be done by default when creating a StringViewArray. We introduced an opt-in string deduplication mode in arrow-rs for advanced users who know their data has a small number of distinct values, and where the benefits of reduced memory consumption outweigh the additional overhead of array construction.

+ +

Once again, we leverage DataFusion query semantics to identify StringViewArray with duplicate values, such as aggregation queries with multiple group keys. For example, some ClickBench queries group by two columns:

+ +
  • UserID (an integer with close to 1M distinct values)
  • MobilePhoneModel (a string with fewer than a hundred distinct values)
+ +

In this case, the output row count is count(distinct UserID) * count(distinct MobilePhoneModel), which is 100M. Each string value of MobilePhoneModel is repeated 1M times. With StringViewArray, we can save space by pointing the repeating values to the same underlying buffer.
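Opt-in deduplication when building a view array can be sketched like this (hypothetical types; the arrow-rs builder exposes this differently): hash each incoming string and, if it was seen before, emit a view pointing at the existing bytes instead of appending them again.

```rust
use std::collections::HashMap;

/// Builds a (buffer, views) pair, storing each distinct string only once.
struct DedupBuilder {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>, // (offset, len) into `buffer`
    seen: HashMap<Vec<u8>, (usize, usize)>,
}

impl DedupBuilder {
    fn new() -> Self {
        Self { buffer: Vec::new(), views: Vec::new(), seen: HashMap::new() }
    }

    fn append(&mut self, s: &str) {
        let buffer = &mut self.buffer;
        // Reuse the existing bytes for a repeated value; append otherwise.
        let view = *self.seen.entry(s.as_bytes().to_vec()).or_insert_with(|| {
            let offset = buffer.len();
            buffer.extend_from_slice(s.as_bytes());
            (offset, s.len())
        });
        self.views.push(view);
    }
}

fn main() {
    let mut b = DedupBuilder::new();
    for s in ["iPhone", "Pixel", "iPhone", "iPhone"] {
        b.append(s);
    }
    // Only two distinct strings are stored; repeated values share bytes.
    assert_eq!(b.buffer.len(), "iPhone".len() + "Pixel".len());
    assert_eq!(b.views[0], b.views[2]);
}
```

The hashing cost is why this mode is opt-in: it only pays off when the number of distinct values is small, as in the MobilePhoneModel example above.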

+ +

Faster string aggregation with StringView is part of a larger project to improve DataFusion aggregation performance. We have a proof-of-concept implementation with StringView that improves multi-column string aggregation by 20%. We would love your help to get it production-ready!

+ +

StringView Pitfalls

+ +

Most existing blog posts (including this one) focus on the benefits of using StringViewArray over other string representations such as StringArray. As we have discussed, even though it requires a significant engineering investment to realize, StringViewArray is a major improvement over StringArray in many cases.

+ +

However, there are several cases where StringViewArray is slower than StringArray. For completeness, we have listed those instances here:

+ +
  1. Tiny strings (shorter than 8 bytes): every element of a StringViewArray consumes at least 16 bytes of memory—the size of the view struct. For an array of tiny strings, StringViewArray consumes more memory than StringArray and thus can cause slower performance due to additional memory pressure on the CPU cache.
  2. Many repeated short strings: similar to the first point, StringViewArray can be slower and require more memory than a DictionaryArray because 1) it can only reuse the bytes in the buffer when the strings are longer than 12 bytes and 2) 32-bit offsets are always used, even when a smaller size (8-bit or 16-bit) could represent all the distinct values.
  3. Filtering: as mentioned above, StringViewArrays often consume more memory than the corresponding StringArray, and memory bloat quickly dominates performance without GC. However, invoking GC also reduces the benefit of less copying, so it must be carefully tuned.
+ +

Conclusion and Takeaways

+ +

In these two blog posts, we discussed what it takes to implement StringViewArray in arrow-rs and then integrate it into DataFusion. Our evaluations on ClickBench queries show that StringView can improve the performance of string-intensive workloads by up to 2x.

+ +

Given that DataFusion already performs very well on ClickBench, the level of end-to-end performance improvement using StringViewArray shows the power of this technique and, of course, is a win for DataFusion and the systems that build upon it.

+ +

StringView is a big project that has received tremendous community support. Specifically, we would like to thank @tustvold, @ariesdevil, @RinChanNOWWW, @ClSlaid, @2010YOUY01, @chloro-pn, @a10y, @Kev1n8, @Weijun-H, @PsiACE, @tshauck, and @xinlifoobar for their valuable contributions!

+ +

As the introduction states, “German Style Strings” is a relatively straightforward research idea that avoids some string copies and accelerates comparisons. However, applying this (great) idea in practice requires a significant investment in careful software engineering. Again, we encourage the research community to continue helping apply research ideas to industrial systems, such as DataFusion, as doing so provides valuable perspectives when evaluating future research questions for the greatest potential impact.

+ +

Footnotes

+ +
+
  1. There are additional optimizations possible in this operation that the community is working on, such as https://github.com/apache/datafusion/issues/7957
+ +
+
+ +
+
+ + + diff --git a/about/index.html b/about/index.html index 4df8b1e..3721de4 100644 --- a/about/index.html +++ b/about/index.html @@ -43,9 +43,10 @@

About

-

Apache DataFusion is a very fast, extensible query engine for building high-quality +

Apache DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.

+
diff --git a/assets/main.css.map b/assets/main.css.map index 4da063c..3dde519 100644 --- a/assets/main.css.map +++ b/assets/main.css.map @@ -1 +1 @@ diff --git a/feed.xml b/feed.xml index 1e79489..c542e2a 100644 --- a/feed.xml +++ b/feed.xml @@ -1,4 +1,308 @@ -Jekyll2024-08-29T16:32:33+00:00https://datafusion.apache.org/blog/feed.xmlApache DataFusion Project News &amp; BlogApache DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.Apache DataFusion Comet 0.2.0
Release2024-08-28T00:00:00+00:002024-08-28T00:00:00+00:00https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0Jekyll2024-10-01T19:55:17+00:00https://datafusion.apache.org/blog/feed.xmlApache DataFusion Project News &amp; BlogApache DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.Using StringView / German Style Strings to Make Queries Faster: Part 1- Reading Parquet2024-09-13T00:00:00+00:002024-09-13T00:00:00+00:00https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1 + +

Editor’s Note: This is the first of a two part blog series that was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao’s summer intern project

+ +

This blog describes our experience implementing StringView in the Rust implementation of Apache Arrow and integrating it into Apache DataFusion, significantly accelerating string-intensive queries in the ClickBench benchmark by 20%–200% (Figure 1)1.

+ +

Getting significant end-to-end performance improvements was non-trivial. Implementing StringView itself was only a fraction of the effort required. Among other things, we had to optimize UTF-8 validation, implement unintuitive compiler optimizations, tune block sizes, and time GC to realize the FDAP ecosystem’s benefit. With other members of the open source community, we were able to overcome performance bottlenecks that could have killed the project. We would like to contribute by explaining the challenges and solutions in more detail so that more of the community can learn from our experience.

+ +

StringView is based on a simple idea: avoid some string copies and accelerate comparisons with inlined prefixes. Like most great ideas, it is “obvious” only after someone describes it clearly. Although the idea is simple, a straightforward implementation actually slows down performance for almost every query. We therefore had to apply astute observations and diligent engineering to realize the actual benefits from StringView.

+ +

Although this journey was successful, not all research ideas are as lucky. To accelerate the adoption of research into industry, it is valuable to integrate research prototypes with practical systems. Understanding the nuances of real-world systems makes it more likely that research designs2 will lead to practical system improvements.

+ +

StringView support was released as part of arrow-rs v52.2.0 and DataFusion v41.0.0. You can try it by setting the schema_force_view_types DataFusion configuration option, and we are hard at work with the community to make it the default. We invite everyone to try it out, take advantage of the effort invested so far, and contribute to making it better.

+ +

End to end performance improvements for ClickBench queries

+ +

Figure 1: StringView improves string-intensive ClickBench query performance by 20% - 200%

+ +

What is StringView?

+ +

Diagram of using StringArray and StringViewArray to represent the same string content

+ +

Figure 2: Use StringArray and StringViewArray to represent the same string content.

+ +

The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) +has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) +and was introduced to Arrow as a new StringViewArray3 type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. +StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.

+ +

A StringViewArray consists of three components:

+ +
    +
  1. The view array
  +
  2. The buffers
  +
  3. The buffer pointers (IDs) that map buffer offsets to their physical locations
  +
+ +

Each view is 16 bytes long, and its contents differ based on the string’s length:

+ +
    +
  • string length ≤ 12 bytes: the first four bytes store the string length, and the remaining 12 bytes store the inlined string.
  +
  • string length > 12 bytes: the string is stored in a separate buffer. The length is again stored in the first 4 bytes, followed by the prefix (first 4 bytes) of the string, the buffer id (4 bytes), and the buffer offset (4 bytes).
  +
+ +
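The view layout above can be sketched in plain Rust. This is an illustrative encoding (field order follows the Arrow specification; `make_view` is not the arrow-rs API):

```rust
/// Build a 16-byte StringView-style "view" for a string, as a sketch of the
/// layout described above. Names are illustrative, not the arrow-rs API.
fn make_view(s: &[u8], buffer_id: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    // The first 4 bytes always hold the string length.
    view[0..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    if s.len() <= 12 {
        // Short string: the bytes are inlined right after the length.
        view[4..4 + s.len()].copy_from_slice(s);
    } else {
        // Long string: 4-byte prefix, then buffer id and offset into it.
        view[4..8].copy_from_slice(&s[0..4]);
        view[8..12].copy_from_slice(&buffer_id.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

fn main() {
    let short = make_view(b"InfluxDB", 0, 0);
    assert_eq!(&short[4..12], b"InfluxDB"); // fully inlined

    let long = make_view(b"Apache DataFusion", 0, 0);
    assert_eq!(u32::from_le_bytes(long[0..4].try_into().unwrap()), 17);
    assert_eq!(&long[4..8], b"Apac"); // only the 4-byte prefix is inlined
    println!("ok");
}
```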

Figure 2 shows an example of the same logical content (left) using StringArray (middle) and StringViewArray (right):

+ +
    +
  • The first string – "Apache DataFusion" – is 17 bytes long, and both StringArray and StringViewArray store the string’s bytes at the beginning of the buffer. The StringViewArray also inlines the first 4 bytes – "Apac" – in the view.
  +
  • The second string, "InfluxDB", is only 8 bytes long, so StringViewArray completely inlines the string content in the view struct, while StringArray must still store it in the buffer.
  +
  • The third string, "Arrow Rust Impl", is 15 bytes long and cannot be fully inlined. StringViewArray stores this in the same form as the first string.
  +
  • The last string, "Apache DataFusion", has the same content as the first string. It’s possible for StringViewArray to avoid this duplication and reuse the bytes by pointing the view to the previous location.
  +
+ +

StringViewArray provides three opportunities for outperforming StringArray:

+ +
    +
  1. Less copying via the offset + buffer format
  +
  2. Faster comparisons using the inlined string prefix
  +
  3. Reusing repeated string values with the flexible view layout
  +
+ +

The rest of this blog post discusses how to apply these opportunities in real query scenarios to improve performance, what challenges we encountered along the way, and how we solved them.

+ +

Faster Parquet Loading

+ +

Apache Parquet is the de facto format for storing large-scale analytical data, commonly in LakeHouse-style systems such as Apache Iceberg and Delta Lake. Efficiently loading data from Parquet is thus critical to query performance in many important real-world workloads.

+ +

Parquet encodes strings (i.e., byte arrays) in a slightly different format than required for the original Arrow StringArray. The string length is encoded inline with the actual string data (as shown on the left of Figure 4). As mentioned previously, StringArray requires the data buffer to be contiguous and compact—the strings have to follow one after another. This requirement means that reading Parquet string data into an Arrow StringArray requires copying and consolidating the string bytes to a new buffer and tracking offsets in a separate array. Copying these strings is often wasteful. Typical queries filter out most data immediately after loading, so most of the copied data is quickly discarded.

+ +

On the other hand, reading Parquet data as a StringViewArray can reuse the same buffer that stores the decoded Parquet pages, because StringViewArray does not require strings to be contiguous. For example, in Figure 4, the StringViewArray directly references the buffer holding the decoded Parquet page: the string "Arrow Rust Impl" is represented by a view with offset 37 and length 15 into that buffer.

+ +
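The idea can be sketched with toy types (not the arrow-rs API): a PLAIN-encoded Parquet byte-array page stores each value as a 4-byte little-endian length followed by the bytes, and a view array built over it only records an `(offset, len)` into the shared page buffer, copying no string bytes:

```rust
// Build views over a PLAIN-encoded byte-array page without copying strings.
// Toy layout: [len: u32 LE][bytes][len: u32 LE][bytes]...
fn views_over_page(page: &[u8]) -> Vec<(usize, usize)> {
    let mut views = Vec::new();
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let len = u32::from_le_bytes(page[pos..pos + 4].try_into().unwrap()) as usize;
        views.push((pos + 4, len)); // points into `page`; no bytes copied
        pos += 4 + len;
    }
    views
}

fn main() {
    // Encode "Apache DataFusion" (17 bytes) then "Arrow Rust Impl" (15 bytes).
    let mut page = Vec::new();
    for s in ["Apache DataFusion", "Arrow Rust Impl"] {
        page.extend_from_slice(&(s.len() as u32).to_le_bytes());
        page.extend_from_slice(s.as_bytes());
    }
    let views = views_over_page(&page);
    assert_eq!(views, vec![(4, 17), (25, 15)]);
    let (off, len) = views[1];
    assert_eq!(&page[off..off + len], b"Arrow Rust Impl");
    println!("ok");
}
```

(The exact offsets differ from Figure 4 because the toy page here holds only two values.)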

Diagram showing how StringViewArray can avoid copying by reusing decoded Parquet pages.

+ +

Figure 4: StringViewArray avoids copying by reusing decoded Parquet pages.

+ +

Mini benchmark

+ +

Reusing Parquet buffers is great in theory, but how much does saving a copy actually matter? We can run the following benchmark in arrow-rs to find out:

+ +
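A sketch of such an invocation (the benchmark name and filter string are assumptions; check the arrow-rs repository for the actual benchmark targets):

```shell
# Run the Parquet arrow reader benchmarks in the arrow-rs repository and
# compare the BinaryArray vs. BinaryViewArray readers (filter name assumed).
cargo bench -p parquet --bench arrow_reader -- "binary"
```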

Our benchmarking machine shows that loading BinaryViewArray is almost 2x faster than loading BinaryArray (see the next section for why this isn’t StringViewArray).

+ +

You can read more on this arrow-rs issue: https://github.com/apache/arrow-rs/issues/5904

+ +

From Binary to Strings

+ +

You may wonder why we reported performance for BinaryViewArray when this post is about StringViewArray. Surprisingly, our initial implementation of reading StringViewArray from Parquet was much slower than reading StringArray. Why? TLDR: although reading StringViewArray copied less data, the initial implementation also spent much more time validating UTF-8 (as shown in Figure 5).

+ +

Strings are stored as byte sequences. When reading data from (potentially untrusted) Parquet files, a Parquet decoder must ensure those byte sequences are valid UTF-8 strings, and most programming languages, including Rust, include highly optimized routines for doing so.

+ +

Figure showing time to load strings from Parquet and the effect of optimized UTF-8 validation.

+ +

Figure 5: Time to load strings from Parquet. The cost of UTF-8 validation initially eliminated the advantage of reduced copying for StringViewArray.

+ +

A StringArray can be validated with a single call to the UTF-8 validation function because it has a contiguous string buffer. As long as the underlying buffer is valid UTF-84, all strings in the array must be valid UTF-8. The Rust Parquet reader makes a single function call to validate the entire buffer.

+ +

However, validating an arbitrary StringViewArray requires validating each string with a separate call to the validation function, as the underlying buffer may also contain non-string data (for example, the lengths in Parquet pages).

+ +

UTF-8 validation in Rust is highly optimized and favors longer strings (as shown in Figure 6), likely because it leverages SIMD instructions to perform parallel validation. The benefit of a single function call to validate UTF-8 over a function call for each string more than eliminates the advantage of avoiding the copy for StringViewArray.

+ +

Figure showing UTF-8 validation throughput vs string length.

+ +

Figure 6: UTF-8 validation throughput vs string length—StringArray’s contiguous buffer can be validated much faster than StringViewArray’s buffer.

+ +

Does this mean we should only use StringArray? No! Thankfully, there’s a clever way out. The key observation is that in many real-world datasets, 99% of strings are shorter than 128 bytes, meaning the encoded length values are smaller than 128, in which case the length itself is also valid UTF-8 (in fact, it is ASCII).

+ +

This observation means we can optimize validating UTF-8 strings in Parquet pages by treating the length bytes as part of a single large string as long as the length value is less than 128. Put another way, prior to this optimization, the length bytes act as string boundaries, which require a UTF-8 validation on each string. After this optimization, only those strings with lengths larger than 128 bytes (less than 1% of the strings in the ClickBench dataset) are string boundaries, significantly increasing the UTF-8 validation chunk size and thus improving performance.

+ +
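The trick can be sketched as follows. This is an illustrative version, not the nine-line arrow-rs implementation, and it conservatively breaks the chunk for any length ≥ 128 (for a length below 128, the four little-endian length bytes are `[len, 0, 0, 0]`, all ASCII, so they can stay inside the validated chunk):

```rust
// Validate all strings in a toy PLAIN byte-array page with as few UTF-8
// validation calls as possible. Layout: [len: u32 LE][bytes] repeated.
fn validate_page_utf8(page: &[u8]) -> bool {
    let mut chunk_start = 0; // start of the current single-call chunk
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let len = u32::from_le_bytes(page[pos..pos + 4].try_into().unwrap()) as usize;
        if len < 128 {
            // All four length bytes are ASCII: extend the chunk over
            // both the length and the string bytes.
            pos += 4 + len;
        } else {
            // Conservatively end the chunk: validate what came before,
            // then the long string on its own, then start a new chunk.
            if std::str::from_utf8(&page[chunk_start..pos]).is_err() {
                return false;
            }
            if std::str::from_utf8(&page[pos + 4..pos + 4 + len]).is_err() {
                return false;
            }
            pos += 4 + len;
            chunk_start = pos;
        }
    }
    // One call validates the final (usually the only) chunk.
    std::str::from_utf8(&page[chunk_start..pos]).is_ok()
}

fn main() {
    let mut page = Vec::new();
    for s in ["Apache DataFusion", "InfluxDB"] {
        page.extend_from_slice(&(s.len() as u32).to_le_bytes());
        page.extend_from_slice(s.as_bytes());
    }
    assert!(validate_page_utf8(&page)); // one validation call for the whole page

    let mut bad = page.clone();
    bad[4] = 0xFF; // corrupt the first byte of "Apache DataFusion"
    assert!(!validate_page_utf8(&bad));
    println!("ok");
}
```

Note that if a string ends mid-codepoint, the single-call chunk validation still rejects it, because the ASCII length bytes cannot complete a multi-byte sequence.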

The actual implementation is only nine lines of Rust (with 30 lines of comments). You can find more details in the related arrow-rs issue: https://github.com/apache/arrow-rs/issues/5995. As expected, with this optimization, loading StringViewArray is almost 2x faster than loading StringArray.

+ +

Be Careful About Implicit Copies

+ +

After all the work to avoid copying strings when loading from Parquet, performance was still not as good as expected. We tracked the problem to a few implicit data copies that we weren’t aware of, as described in this issue.

+ +

The copies we eventually identified come from the following innocent-looking line of Rust code, where self.buf is a reference counted pointer that should transform without copying into a buffer for use in StringViewArray.

+ +

However, Rust type coercion rules favored a blanket implementation that did copy data. This implementation, shown in the following code block, uses impl<T: AsRef<[u8]>> to accept any type that implements AsRef<[u8]> and copies the data to create a new buffer. To avoid copying, users need to explicitly call from_vec, which consumes the Vec and transforms it into a buffer.

+ +
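The hazard can be illustrated with toy types (these are not the arrow-rs types; the `copied` flag exists only for this demo): a generic byte-accepting constructor silently copies, while an explicit `from_vec` takes ownership of the allocation:

```rust
struct Buffer {
    data: Vec<u8>,
    copied: bool, // recorded only for this demonstration
}

impl Buffer {
    // Like the blanket `impl<T: AsRef<[u8]>>` described above: convenient,
    // accepts anything byte-like, but always copies into a new allocation.
    fn from_bytes<T: AsRef<[u8]>>(value: T) -> Self {
        Buffer { data: value.as_ref().to_vec(), copied: true }
    }

    // The explicit API: consumes the Vec and reuses its allocation.
    fn from_vec(data: Vec<u8>) -> Self {
        Buffer { data, copied: false }
    }
}

fn main() {
    let bytes = vec![1u8, 2, 3];
    let copied = Buffer::from_bytes(&bytes); // innocent-looking, but copies
    let moved = Buffer::from_vec(bytes);     // zero-copy
    assert!(copied.copied);
    assert!(!moved.copied);
    assert_eq!(copied.data, moved.data);
    println!("ok");
}
```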

Diagnosing this implicit copy was time-consuming as it relied on subtle Rust language semantics. We needed to track every step of the data flow to ensure every copy was necessary. To help other users and prevent future mistakes, we also removed the implicit API from arrow-rs in favor of an explicit API. Using this approach, we found and fixed several other unintentional copies in the code base—hopefully, the change will help other downstream users avoid unnecessary copies.

+ +

Help the Compiler by Giving it More Information

+ +

The Rust compiler’s automatic optimizations mostly work very well for a wide variety of use cases, but sometimes, it needs additional hints to generate the most efficient code. When profiling the performance of view construction, we found, counterintuitively, that constructing long strings was 10x faster than constructing short strings, which made short strings slower on StringViewArray than on StringArray!

+ +

As described in the first section, StringViewArray treats long and short strings differently. Short strings (≤ 12 bytes) are inlined directly in the view struct, while long strings inline only the first 4 bytes. The code to construct a view looks something like this:

+ +

It appears that both branches of the code should be fast: they both involve copying at most 16 bytes of data and some memory shift/store operations. How could the branch for short strings be 10x slower?

+ +

Looking at the assembly code using Compiler Explorer, we (with help from Ao Li) found that the compiler used CPU load instructions to copy the fixed-size 4 bytes to the view for long strings, but calls a function, ptr::copy_nonoverlapping, to copy the inlined bytes to the view for short strings. The difference is that long strings have a prefix size (4 bytes) known at compile time, so the compiler uses efficient CPU instructions directly. But since the size of a short string is unknown to the compiler, it has to call the general-purpose function ptr::copy_nonoverlapping. Making a function call is significant unnecessary overhead compared to an inline CPU copy instruction.

+ +

However, we know something the compiler doesn’t: the short string size is not arbitrary, it must be between 0 and 12 bytes, and we can leverage this information to avoid the function call. Our solution generates 13 copies of the function using generics, one for each possible inlined length. The code looks as follows, and checking the assembly code, we confirmed there are no calls to ptr::copy_nonoverlapping; only native CPU instructions are used. For more details, see the ticket.

+ +
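A sketch of the idea (names are illustrative, not the exact arrow-rs code): make the copy length a const generic so each of the 13 monomorphized copies has a compile-time-known size, letting the compiler emit fixed-size load/store instructions instead of calling `ptr::copy_nonoverlapping` with a runtime length:

```rust
#[inline(always)]
fn inline_view<const LEN: usize>(src: &[u8]) -> u128 {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    // LEN is a compile-time constant, so this copy has a fixed size.
    view[4..4 + LEN].copy_from_slice(&src[..LEN]);
    u128::from_le_bytes(view)
}

fn make_inlined_view(s: &[u8]) -> u128 {
    // Dispatch once on the runtime length; each arm is a monomorphized
    // copy with a fixed-size memcpy.
    match s.len() {
        0 => inline_view::<0>(s),
        1 => inline_view::<1>(s),
        2 => inline_view::<2>(s),
        3 => inline_view::<3>(s),
        4 => inline_view::<4>(s),
        5 => inline_view::<5>(s),
        6 => inline_view::<6>(s),
        7 => inline_view::<7>(s),
        8 => inline_view::<8>(s),
        9 => inline_view::<9>(s),
        10 => inline_view::<10>(s),
        11 => inline_view::<11>(s),
        12 => inline_view::<12>(s),
        _ => unreachable!("only strings of at most 12 bytes are inlined"),
    }
}

fn main() {
    let bytes = make_inlined_view(b"InfluxDB").to_le_bytes();
    assert_eq!(u32::from_le_bytes(bytes[0..4].try_into().unwrap()), 8);
    assert_eq!(&bytes[4..12], b"InfluxDB");
    println!("ok");
}
```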

End-to-End Query Performance

+ +

In the previous sections, we went out of our way to make loading StringViewArray faster than StringArray. Before going further, we wanted to verify whether obsessing over reducing copies and function calls actually improves end-to-end performance in real-life queries. To do this, we evaluated a ClickBench query (Q20) in DataFusion that counts how many URLs contain the word "google":

+ +
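The query, as listed in the ClickBench suite, is:

```sql
SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
```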

This is a relatively simple query; most of the time is spent on loading the “URL” column to find matching rows. The query plan looks like this:

+ +

We ran the benchmark in the DataFusion repo like this:

+ +

With StringViewArray we saw a 24% end-to-end performance improvement, as shown in Figure 7. With the --string-view argument, the end-to-end query time is 944.3 ms, 869.6 ms, 861.9 ms (three iterations). Without --string-view, the end-to-end query time is 1186.1 ms, 1126.1 ms, 1138.3 ms.

+ +

Figure showing StringView improves end to end performance by 24 percent.

+ +

Figure 7: StringView reduces end-to-end query time by 24% on ClickBench Q20.

+ +

We also double-checked with detailed profiling and verified that the time reduction is indeed due to faster Parquet loading.

+ +

Conclusion

+ +

In this first blog post, we have described what it took to improve the +performance of simply reading strings from Parquet files using StringView. While +this resulted in real end-to-end query performance improvements, in our next +post, we explore additional optimizations enabled by StringView in DataFusion, +along with some of the pitfalls we encountered while implementing them.

+ +

Footnotes

+ +
+
    +
  1. 

    Benchmarked with AMD Ryzen 7600x (12 core, 24 threads, 32 MiB L3), WD Black SN770 NVMe SSD (5150MB/4950MB sequential read/write bandwidth)

    +
  2. 

    Xiangpeng is a PhD student at the University of Wisconsin-Madison

    +
  3. 

    There is also a corresponding BinaryViewArray, which is similar except that the data is not constrained to be UTF-8 encoded strings.

    +
  4. 

    We also make sure that offsets do not break a UTF-8 code point, which is cheaply validated.

    +
+
]]>
Xiangpeng Hao, Andrew Lamb
Using StringView / German Style Strings to make Queries Faster: Part 2 - String Operations2024-09-13T00:00:00+00:002024-09-13T00:00:00+00:00https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2 + +

Editor’s Note: This blog series was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao’s summer intern project

+ +

In the first post, we discussed the nuances required to accelerate Parquet loading using StringViewArray by reusing buffers and reducing copies. +In this second part of the post, we describe the rest of the journey: implementing additional efficient operations for real query processing.

+ +

Faster String Operations

+ +

Faster comparison

+ +

String comparison is ubiquitous; it is the core of +cmp, +min/max, +and like/ilike kernels. StringViewArray is designed to accelerate such comparisons using the inlined prefix—the key observation is that, in many cases, only the first few bytes of the string determine the string comparison results.

+ +

For example, to compare the strings InfluxDB with Apache DataFusion, we only need to look at the first byte to determine the string ordering or equality. In this case, since A is earlier in the alphabet than I, Apache DataFusion sorts first, and we know the strings are not equal. Despite only needing the first byte, comparing these strings when stored as a StringArray requires two memory accesses: 1) load the string offset and 2) use the offset to locate the string bytes. For low-level operations such as cmp that are invoked millions of times in the very hot paths of queries, avoiding this extra memory access can make a measurable difference in query performance.

+ +

For StringViewArray, typically only one memory access is needed to load the view struct; the second memory access is required only if the result cannot be determined from the prefix. For the example above, there is no need for the second access. This technique is very effective in practice: the second access is never necessary for the more than 60% of real-world strings that are shorter than 12 bytes, as they are stored completely in the view.

+ +
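Prefix-first comparison can be sketched as follows (illustrative, not the arrow-rs kernel): compare the 4-byte zero-padded prefixes first, and fall back to the full strings (the second memory access in the real layout) only when the prefixes tie:

```rust
use std::cmp::Ordering;

// Zero-padded 4-byte prefix, like the prefix inlined in a view.
fn prefix(s: &[u8]) -> [u8; 4] {
    let mut p = [0u8; 4];
    let n = s.len().min(4);
    p[..n].copy_from_slice(&s[..n]);
    p
}

fn cmp_with_prefix(a: &[u8], b: &[u8]) -> Ordering {
    match prefix(a).cmp(&prefix(b)) {
        Ordering::Equal => a.cmp(b), // prefixes tie: load the full strings
        other => other,              // decided without touching the buffers
    }
}

fn main() {
    // Decided from the first byte alone: 'A' < 'I'.
    assert_eq!(cmp_with_prefix(b"Apache DataFusion", b"InfluxDB"), Ordering::Less);
    // Prefixes tie ("Apac"), so the full strings are compared.
    assert_eq!(cmp_with_prefix(b"Apache Arrow", b"Apache DataFusion"), Ordering::Less);
    assert_eq!(cmp_with_prefix(b"InfluxDB", b"InfluxDB"), Ordering::Equal);
    println!("ok");
}
```

If the padded prefixes differ, the ordering they give matches the full string ordering (a padding zero loses to any nonzero byte, which is exactly the shorter-string-sorts-first rule), so the fallback is needed only on ties.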

However, functions that operate on strings must be specialized to take advantage of the inlined prefix. In addition to low-level comparison kernels, we implemented a wide range of other StringViewArray operations that cover the functions and operations seen in ClickBench queries. Supporting StringViewArray in all string operations takes quite a bit of effort, and thankfully the Arrow and DataFusion communities are already hard at work doing so (see https://github.com/apache/datafusion/issues/11752 if you want to help out).

+ +

Faster take and filter

+ +

After a filter operation such as WHERE url <> '' to avoid processing empty urls, DataFusion will often coalesce results to form a new array with only the passing elements. +This coalescing ensures the batches are sufficiently sized to benefit from vectorized processing in subsequent steps.

+ +

The coalescing operation is implemented using the take and filter kernels in arrow-rs. For StringArray, these kernels require copying the string contents to a new buffer without “holes” in between. This copy can be expensive, especially when the new array is large.

+ +

However, take and filter for StringViewArray can avoid the copy by reusing buffers from the old array. The kernels only need to create a new list of views that point at the same strings within the old buffers. +Figure 1 illustrates the difference between the output of both string representations. StringArray creates two new strings at offsets 0-17 and 17-32, while StringViewArray simply points to the original buffer at offsets 0 and 25.

+ +
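A zero-copy filter can be sketched with toy types (not the arrow-rs kernels): the filtered array shares the original buffer through a reference-counted pointer and materializes only a new, smaller list of views:

```rust
use std::rc::Rc;

#[derive(Clone, Copy, PartialEq, Debug)]
struct View { offset: usize, len: usize }

struct ViewArray {
    buffer: Rc<Vec<u8>>, // shared, never copied by filter
    views: Vec<View>,
}

fn filter(array: &ViewArray, keep: &[bool]) -> ViewArray {
    let views = array
        .views
        .iter()
        .zip(keep)
        .filter_map(|(v, &k)| k.then_some(*v))
        .collect();
    // Only the view list is new; the string bytes are shared.
    ViewArray { buffer: Rc::clone(&array.buffer), views }
}

fn main() {
    let buffer = Rc::new(b"Apache DataFusionInfluxDB".to_vec());
    let array = ViewArray {
        buffer: Rc::clone(&buffer),
        views: vec![View { offset: 0, len: 17 }, View { offset: 17, len: 8 }],
    };
    let filtered = filter(&array, &[false, true]);
    assert_eq!(filtered.views.len(), 1);
    assert!(Rc::ptr_eq(&array.buffer, &filtered.buffer)); // no bytes copied
    let v = filtered.views[0];
    assert_eq!(&filtered.buffer[v.offset..v.offset + v.len], b"InfluxDB");
    println!("ok");
}
```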

Diagram showing Zero-copy `take`/`filter` for StringViewArray

+ +

Figure 1: Zero-copy take/filter for StringViewArray

+ +

When to GC?

+ +

Zero-copy take/filter is great for generating large arrays quickly, but it is suboptimal for highly selective filters, where most of the strings are filtered out. When the cardinality drops, StringViewArray buffers become sparse—only a small subset of the bytes in the buffer’s memory is referred to by any view. This leads to excessive memory usage, especially in a filter-then-coalesce scenario. For example, a StringViewArray with 10M strings may only refer to 1M strings after some filter operations; however, due to zero-copy take/filter, the (reused) buffers holding the original 10M strings cannot be released or reused.

+ +

To release unused memory, we implemented a garbage collection (GC) routine to consolidate the data into a new buffer to release the old sparse buffer(s). As the GC operation copies strings, similarly to StringArray, we must be careful about when to call it. If we call GC too early, we cause unnecessary copying, losing much of the benefit of StringViewArray. If we call GC too late, we hold large buffers for too long, increasing memory use and decreasing cache efficiency. The Polars blog on StringView also refers to the challenge presented by garbage collection timing.

+ +

arrow-rs implements the GC process, but it is up to users to decide when to call it. We leverage the semantics of the query engine and observed that the CoalesceBatchesExec operator, which merges smaller batches into a larger batch, is often used after the record cardinality is expected to shrink, which aligns perfectly with the GC scenario in StringViewArray. +We therefore implemented the GC procedure inside CoalesceBatchesExec1 with a heuristic that estimates when the buffers are too sparse.

+ +
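The GC idea can be sketched with toy types (the real heuristic in DataFusion is more elaborate and the threshold here is illustrative): when the views reference only a small fraction of the held buffer bytes, copy the live strings into a compact buffer and drop the sparse one:

```rust
use std::rc::Rc;

#[derive(Clone, Copy)]
struct View { offset: usize, len: usize }

struct ViewArray { buffer: Rc<Vec<u8>>, views: Vec<View> }

fn gc_if_sparse(array: &mut ViewArray) {
    let live: usize = array.views.iter().map(|v| v.len).sum();
    // Illustrative heuristic: compact only when less than half of the
    // held buffer bytes are still referenced by some view.
    if live * 2 >= array.buffer.len() {
        return;
    }
    let mut compact = Vec::with_capacity(live);
    for v in array.views.iter_mut() {
        let start = compact.len();
        compact.extend_from_slice(&array.buffer[v.offset..v.offset + v.len]);
        v.offset = start; // rewrite the view to point into the new buffer
    }
    array.buffer = Rc::new(compact); // the old sparse buffer can now be freed
}

fn main() {
    // After a selective filter, only 10 of 1000 bytes are still referenced.
    let mut array = ViewArray {
        buffer: Rc::new(vec![b'x'; 1000]),
        views: vec![View { offset: 500, len: 10 }],
    };
    gc_if_sparse(&mut array);
    assert_eq!(array.buffer.len(), 10); // compacted
    assert_eq!(array.views[0].offset, 0);
    println!("ok");
}
```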

The art of function inlining: not too much, not too little

+ +

Like string inlining, function inlining is the process of embedding a short function into the caller to avoid the overhead of function calls (caller/callee save). +Usually, the Rust compiler does a good job of deciding when to inline. However, it is possible to override its default using the #[inline(always)] directive. +In performance-critical code, inlined code allows us to organize large functions into smaller ones without paying the runtime cost of function invocation.

+ +

However, function inlining is not always better, as it leads to larger function bodies that are harder for LLVM to optimize (for example, suboptimal register spilling) and risk overflowing the CPU’s instruction cache. We observed several performance regressions where function inlining caused slower performance when implementing the StringViewArray comparison kernels. Careful inspection and tuning of the code was required to aid the compiler in generating efficient code. More details can be found in this PR: https://github.com/apache/arrow-rs/pull/5900.

+ +

Buffer size tuning

+ +

StringViewArray permits multiple buffers, which enables a flexible buffer layout and potentially reduces the need to copy data. However, a large number of buffers slows down the performance of other operations. +For example, get_array_memory_size needs to sum the memory size of each buffer, which takes a long time with thousands of small buffers. +In certain cases, we found that multiple calls to concat_batches lead to arrays with millions of buffers, which was prohibitively expensive.

+ +

For example, consider a StringViewArray with the previous default buffer size of 8 KB. With this configuration, holding 4GB of string data requires almost half a million buffers! Larger buffer sizes are needed for larger arrays, but we cannot arbitrarily increase the default buffer size, as small arrays would consume too much memory (most arrays require at least one buffer). Buffer sizing is especially problematic in query processing, as we often need to construct small batches of string arrays, and the sizes are unknown at planning time.

+ +

To balance the buffer size trade-off, we again leverage query processing (DataFusion) semantics to decide when to use larger buffers. While coalescing batches, we combine multiple small string arrays and set a smaller buffer size to keep total memory consumption low. In string aggregation, we aggregate over an entire DataFusion partition, which can generate a large number of strings, so we set a larger buffer size (2MB).

+ +

To assist situations where the semantics are unknown, we also implemented a classic dynamic exponential buffer size growth strategy, which starts with a small buffer size (8KB) and doubles the size of each new buffer up to 2MB. We implemented this strategy in arrow-rs and enabled it by default so that other users of StringViewArray can also benefit from this optimization. See this issue for more details: https://github.com/apache/arrow-rs/issues/6094.

+ +
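The growth strategy described above can be sketched in a few lines (function name illustrative): the first buffer is 8 KiB, and each subsequent buffer doubles, capped at 2 MiB:

```rust
// Exponential buffer sizing: start small so tiny arrays stay cheap,
// double per buffer so large arrays need few buffers, cap the growth.
fn next_buffer_size(previous: Option<usize>) -> usize {
    const MIN: usize = 8 * 1024;        // 8 KiB starting size
    const MAX: usize = 2 * 1024 * 1024; // 2 MiB cap
    match previous {
        None => MIN,
        Some(size) => (size * 2).min(MAX),
    }
}

fn main() {
    let mut size = next_buffer_size(None);
    assert_eq!(size, 8 * 1024);
    for _ in 0..20 {
        size = next_buffer_size(Some(size));
    }
    assert_eq!(size, 2 * 1024 * 1024); // growth is capped
    println!("ok");
}
```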

End-to-end query performance

+ +

We have made significant progress in optimizing StringViewArray filtering operations. Now, let’s test it in the real world to see how it works!

+ +

Let’s consider ClickBench query 22, which selects multiple string fields (URL, Title, and SearchPhrase) and applies several filters.

+ +
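As defined in the ClickBench query set, the query is:

```sql
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c,
       COUNT(DISTINCT "UserID")
FROM hits
WHERE "Title" LIKE '%Google%'
  AND "URL" NOT LIKE '%.google.%'
  AND "SearchPhrase" <> ''
GROUP BY "SearchPhrase"
ORDER BY c DESC
LIMIT 10;
```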

We ran the benchmark using the following command in the DataFusion repo. Again, the --string-view option means we use StringViewArray instead of StringArray.

+ +

To eliminate the impact of the faster Parquet reading using StringViewArray (see the first part of this blog), Figure 2 plots only the time spent in FilterExec. Without StringViewArray, the filter takes 7.17s; with StringViewArray, the filter only takes 4.86s, a 32% reduction in time. Moreover, we see a 17% improvement in end-to-end query performance.

+ +

Figure showing StringViewArray reduces the filter time by 32% on ClickBench query 22.

+ +

Figure 2: StringViewArray reduces the filter time by 32% on ClickBench query 22.

+ +

Faster String Aggregation

+ +

So far, we have discussed how to exploit two StringViewArray features: reduced copy and faster filtering. This section focuses on reusing string bytes to repeat string values.

+ +

As described in part one of this blog, if two strings have identical values, StringViewArray can use two different views pointing at the same buffer range, thus avoiding repeating the string bytes in the buffer. This makes StringViewArray similar to an Arrow DictionaryArray that stores Strings—both array types work well for strings with only a few distinct values.

+ +

Deduplicating string values can significantly reduce memory consumption in StringViewArray. However, this process is expensive and involves hashing every string and maintaining a hash table, and so it cannot be done by default when creating a StringViewArray. We introduced an opt-in string deduplication mode in arrow-rs for advanced users who know their data has a small number of distinct values, and where the benefits of reduced memory consumption outweigh the additional overhead of array construction.

+ +
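The opt-in deduplication mode can be sketched with a toy builder (illustrative types, not the arrow-rs API; views are simplified to `(offset, len)` pairs): hash each incoming string and, when an identical string was already appended, reuse its view instead of appending the bytes again:

```rust
use std::collections::HashMap;

struct DedupBuilder {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>, // simplified (offset, len) views
    seen: HashMap<Vec<u8>, (usize, usize)>,
}

impl DedupBuilder {
    fn new() -> Self {
        DedupBuilder { buffer: Vec::new(), views: Vec::new(), seen: HashMap::new() }
    }

    fn append(&mut self, s: &[u8]) {
        let buffer = &mut self.buffer;
        // Hashing every string is the extra cost that makes this opt-in.
        let view = *self.seen.entry(s.to_vec()).or_insert_with(|| {
            let offset = buffer.len();
            buffer.extend_from_slice(s); // bytes stored only once
            (offset, s.len())
        });
        self.views.push(view);
    }
}

fn main() {
    let mut b = DedupBuilder::new();
    b.append(b"iPhone");
    b.append(b"Galaxy");
    b.append(b"iPhone"); // duplicate value
    assert_eq!(b.views.len(), 3);
    assert_eq!(b.views[0], b.views[2]); // repeated value reuses the same bytes
    assert_eq!(b.buffer.len(), "iPhone".len() + "Galaxy".len());
    println!("ok");
}
```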

Once again, we leverage DataFusion query semantics to identify StringViewArray with duplicate values, such as aggregation queries with multiple group keys. For example, some ClickBench queries group by two columns:

+ +
    +
  • UserID (an integer with close to 1M distinct values)
  +
  • MobilePhoneModel (a string with less than a hundred distinct values)
  +
+ +

In this case, the output row count is count(distinct UserID) * count(distinct MobilePhoneModel), which is 100M. Each string value of MobilePhoneModel is repeated 1M times. With StringViewArray, we can save space by pointing the repeating values to the same underlying buffer.

+ +

Faster string aggregation with StringView is part of a larger project to improve DataFusion aggregation performance. We have a proof of concept implementation with StringView that can improve the multi-column string aggregation by 20%. We would love your help to get it production ready!

+ +

StringView Pitfalls

+ +

Most existing blog posts (including this one) focus on the benefits of using StringViewArray over other string representations such as StringArray. As we have discussed, even though it requires a significant engineering investment to realize, StringViewArray is a major improvement over StringArray in many cases.

+ +

However, there are several cases where StringViewArray is slower than StringArray. For completeness, we have listed those instances here:

+ +
    +
  1. Tiny strings (when strings are shorter than 8 bytes): every element of the StringViewArray consumes at least 16 bytes of memory—the size of the view struct. For an array of tiny strings, StringViewArray consumes more memory than StringArray and thus can cause slower performance due to additional memory pressure on the CPU cache.
  +
  2. Many repeated short strings: similar to the first point, StringViewArray can be slower and require more memory than a DictionaryArray because 1) it can only reuse the bytes in the buffer when the strings are longer than 12 bytes and 2) 32-bit offsets are always used, even when a smaller size (8 bit or 16 bit) could represent all the distinct values.
  +
  3. Filtering: as we mentioned above, StringViewArrays often consume more memory than the corresponding StringArray, and memory bloat quickly dominates performance without GC. However, invoking GC also reduces the benefit of less copying, so it must be carefully tuned.
  +
+ +

Conclusion and Takeaways

+ +

In these two blog posts, we discussed what it takes to implement StringViewArray in arrow-rs and then integrate it into DataFusion. Our evaluations on ClickBench queries show that StringView can improve the performance of string-intensive workloads by up to 2x.

+ +

Given that DataFusion already performs very well on ClickBench, the level of end-to-end performance improvement using StringViewArray shows the power of this technique and, of course, is a win for DataFusion and the systems that build upon it.

+ +

StringView is a big project that has received tremendous community support. Specifically, we would like to thank @tustvold, @ariesdevil, @RinChanNOWWW, @ClSlaid, @2010YOUY01, @chloro-pn, @a10y, @Kev1n8, @Weijun-H, @PsiACE, @tshauck, and @xinlifoobar for their valuable contributions!

+ +

As the introduction states, “German Style Strings” is a relatively straightforward research idea that avoids some string copies and accelerates comparisons. However, applying this (great) idea in practice requires a significant investment in careful software engineering. Again, we encourage the research community to continue to help apply research ideas to industrial systems, such as DataFusion, as doing so provides valuable perspectives when evaluating future research questions for the greatest potential impact.

+ +

Footnotes

+ +
+
    +
  1. 

    There are additional optimizations possible in this operation that the community is working on, such as https://github.com/apache/datafusion/issues/7957

    +
+
]]>
Xiangpeng Hao, Andrew Lamb
Apache DataFusion Comet 0.2.0 Release2024-08-28T00:00:00+00:002024-08-28T00:00:00+00:00https://datafusion.apache.org/blog/2024/08/28/datafusion-comet-0.2.0 @@ -1468,465 +1772,4 @@ allocation using the arrow Row format

hits_0.parquet, one of the files from the partitioned ClickBench dataset, which has 100,000 rows and is 117 MB in size. The entire dataset has 100,000,000 rows in a single 14 GB Parquet file. The script did not complete on the entire dataset after 40 minutes, and used 212 GB RAM at peak. 

-]]>
alamb, Dandandan, tustvold
Apache Arrow DataFusion 26.0.02023-06-24T00:00:00+00:002023-06-24T00:00:00+00:00https://datafusion.apache.org/blog/2023/06/24/datafusion-25.0.0 - -

It has been a whirlwind 6 months of DataFusion development since our -last update: the community has grown, many features have been added, -performance improved and we are discussing branching out to our own -top level Apache Project.

- -

Background

- -

Apache Arrow DataFusion is an extensible query engine and database toolkit, written in Rust, that uses Apache Arrow as its in-memory format.

- -

DataFusion, along with Apache Calcite, Facebook’s Velox, and similar technologies, is part of the next generation of “Deconstructed Database” architectures, where new systems are built on a foundation of fast, modular components rather than as a single tightly integrated system.

- -

While single tightly integrated systems such as Spark, DuckDB and Pola.rs are great pieces of technology, our community believes that anyone developing new data heavy applications, such as those common in machine learning, will require a high performance, vectorized query engine in the next 5 years to remain relevant. The only practical way to gain access to such technology without investing many millions of dollars to build a new tightly integrated engine is through open source projects like DataFusion and similar enabling technologies such as Apache Arrow and Rust.

- -

DataFusion is targeted primarily at developers creating other data intensive analytics systems, and offers:

- -
  • High performance, native, parallel streaming execution engine
  • Mature SQL support, featuring subqueries, window functions, grouping sets, and more
  • Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
  • Native DataFrame API and python bindings
  • Well documented source code and architecture, designed to be customized to suit downstream project needs
  • High quality, easy to use code released every 2 weeks to crates.io
  • Welcoming, open community, governed by the highly regarded and well understood Apache Software Foundation
- -

The rest of this post highlights some of the improvements we have made to DataFusion over the last 6 months and a preview of where we are heading. You can see a list of all changes in the detailed CHANGELOG.

- -

(Even) Better Performance

- -

Various benchmarks show DataFusion to be quite close to, or even faster than, the state of the art in analytic performance (at the moment this seems to be DuckDB). We continually work on improving performance (see #5546 for a list) and would love additional help in this area.

- -

DataFusion now reads single large Parquet files significantly faster by parallelizing across multiple cores. Native speeds for reading JSON and CSV files are also up to 2.5x faster thanks to improvements upstream in the arrow-rs JSON reader and CSV reader.

- -

Also, we have integrated the arrow-rs Row Format into DataFusion resulting in up to 2-3x faster sorting and merging.

- -

Improved Documentation and Website

- -

Part of growing the DataFusion community is ensuring that DataFusion’s features are understood and that it is easy to contribute and participate. To that end, the website has been cleaned up, the architecture guide expanded, the roadmap updated, and several overview talks created:

- - - -

New Features

- -

More Streaming, Less Memory

- -

We have made significant progress on the streaming execution roadmap, such as unbounded datasources, streaming group by, sophisticated sort and repartitioning improvements in the optimizer, and support for symmetric hash join (read more about that in the great Synnada Blog Post on the topic). Together, these features 1) make it easier to build streaming systems using DataFusion that can incrementally generate output before (or without ever) seeing the end of the input and 2) allow general queries to use less memory and generate their results faster.

- -

We have also improved the runtime memory management system so that DataFusion now stays within its declared memory budget, generating runtime errors rather than exceeding it.

- -

DML Support (INSERT, DELETE, UPDATE, etc)

- -

Part of building high performance data systems includes writing data, and DataFusion supports several features for creating new files:

- - - -

We are working on easier to use COPY INTO syntax, better support for writing Parquet, JSON, and Avro files, and more – see our tracking epic for more details.
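As a sketch of the DML direction (exact syntax may vary by release; the table names and paths below are hypothetical):

```sql
-- Register a file-backed table, then append query results to it
CREATE EXTERNAL TABLE sales STORED AS CSV LOCATION '/tmp/sales/';
INSERT INTO sales SELECT * FROM staging_sales;

-- Planned: write query results directly to files with COPY
COPY (SELECT * FROM sales WHERE amount > 100) TO '/tmp/big_sales.parquet';
```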

- -

Timestamp and Intervals

- -

One mark of the maturity of a SQL engine is how it handles the tricky world of timestamps, dates, times, and interval arithmetic. DataFusion is feature complete in this area and behaves as you would expect, supporting queries such as

- -
SELECT now() + '1 month' FROM my_table;
-
- -

We still have a long tail of date and time improvements, which we are working on as well.
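For example, interval arithmetic composes with timestamp literals and functions as you would expect (the table and column names here are hypothetical):

```sql
SELECT TIMESTAMP '2023-06-01T00:00:00' + INTERVAL '1 month' AS next_month;

SELECT event_id
FROM events
WHERE event_time > now() - INTERVAL '7 days';
```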

- -

Querying Structured Types (List and Structs)

- -

Arrow and Parquet support nested data well, and DataFusion lets you easily query such Struct and List columns. For example, you can use DataFusion to read and query the JSON Datasets for Exploratory OLAP - Mendeley Data like this:

- -
----------
-- Explore structured data using SQL
----------
SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;
+---------------------------------------------------------------------------------------------------------------------------+
| delete                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
+---------------------------------------------------------------------------------------------------------------------------+

----------
-- Select some deeply nested fields
----------
SELECT
  delete['status']['id']['$numberLong'] as delete_id,
  delete['status']['user_id'] as delete_user_id
FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;

+--------------------+----------------+
| delete_id          | delete_user_id |
+--------------------+----------------+
| 135037425050320896 | 334902461      |
| 134703982051463168 | 405383453      |
| 134773741740765184 | 64823441       |
| 132543659655704576 | 45917834       |
| 133786431926697984 | 67229952       |
| 134619093570560002 | 182430773      |
| 134019857527214080 | 257396311      |
| 133931546469076993 | 124539548      |
| 134397743350296576 | 139836391      |
| 127833661767823360 | 244442687      |
+--------------------+----------------+
- -

Subqueries All the Way Down

- -

DataFusion can run many different subqueries by rewriting them to joins. It has been able to run the full suite of TPC-H queries for at least the last year, and recently we have implemented significant improvements to this logic, sufficient to run almost all queries in the TPC-DS benchmark as well.
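As an illustration of the kind of rewrite involved, a correlated scalar subquery such as the following (hypothetical schema) can be decorrelated into an aggregation joined back to the outer table, instead of re-running the inner query once per outer row:

```sql
-- Orders larger than their customer's average order
SELECT o.order_id
FROM orders o
WHERE o.amount > (
    SELECT AVG(i.amount)
    FROM orders i
    WHERE i.customer_id = o.customer_id
);

-- Conceptually rewritten to:
-- SELECT o.order_id FROM orders o
-- JOIN (SELECT customer_id, AVG(amount) AS avg_amount
--       FROM orders GROUP BY customer_id) a
--   ON o.customer_id = a.customer_id
-- WHERE o.amount > a.avg_amount;
```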

- -

Community and Project Growth

- -

The six months since our last update saw significant growth in the DataFusion community. Between versions 17.0.0 and 26.0.0, DataFusion merged 711 PRs from 107 distinct contributors, not including all the work that goes into our core dependencies such as arrow, parquet, and object_store, that much of the same community helps support.

- -

In addition, we have added 7 new committers and 1 new PMC member to the Apache Arrow project, largely focused on DataFusion, and we learned about some of the cool new systems which are using DataFusion. Given the growth of the community and interest in the project, we also clarified the mission statement and are discussing “graduating” DataFusion to a new top level Apache Software Foundation project.

- - - -

How to Get Involved

- -

Kudos to everyone in the community who has contributed ideas, discussions, bug reports, documentation and code. It is exciting to be innovating on the next generation of database architectures together!

- -

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes, or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

- -

Check out our Communication Doc for more ways to engage with the community.

]]>
pmc
Apache Arrow DataFusion 16.0.0 Project Update2023-01-19T00:00:00+00:002023-01-19T00:00:00+00:00https://datafusion.apache.org/blog/2023/01/19/datafusion-16.0.0 - -

Introduction

- -

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.

- -

Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on parquet files rather than first loading into a specialized format. Some recent highlights include clickbench and the Cloudfuse.io standalone query engines page.

- -

DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating, and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is through open source engines such as DataFusion or Velox.

- -

The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.

- -

Community Growth

- -

We again saw significant growth in the DataFusion community since our last update. There are some interesting metrics on OSSRank.

- -

The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store, that much of the same community helps support. Thank you all for your help!

- - -

Several new systems based on DataFusion were recently added:

- - - -

Performance 🚀

- -

Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:

- -
  • Up to 30% Faster Sorting and Merging using the new Row Format
  • Advanced predicate pushdown, directly on parquet, directly from object storage, enabling sub millisecond filtering.
  • 70% faster IN expression evaluation (#4057)
  • Sort and partition aware optimizations (#3969 and #4691)
  • Filter selectivity analysis (#3868)
- -

Runtime Resource Limits

- -

Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.

- -

In version 16.0.0, it is possible to limit DataFusion’s memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.

- -

SQL Window Functions

- -

SQL Window Functions are useful for a variety of analyses, and DataFusion’s implementation support has expanded significantly:

- -
  • Custom window frames such as ... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)
  • Unbounded window frames such as ... OVER (ORDER BY ... RANGE UNBOUNDED PRECEDING)
  • Support for the NTILE window function (#4676)
  • Support for GROUPS mode (#4155)
- -
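Putting these together, a sliding RANGE frame and NTILE can be combined in one query (the table and columns below are hypothetical):

```sql
SELECT
  sensor_id,
  reading,
  AVG(reading) OVER (
    PARTITION BY sensor_id
    ORDER BY reading
    RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING
  ) AS smoothed,
  NTILE(4) OVER (ORDER BY reading) AS quartile
FROM readings;
```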

Improved Joins

- -

Joins are often the most complicated operations to handle well in analytics systems, and DataFusion 16.0.0 offers significant improvements such as:

- -
  • Cost based optimizer (CBO) that automatically reorders join evaluations, selects algorithms (Merge / Hash), and picks the build side based on available statistics and join type (INNER, LEFT, etc) (#4219)
  • Fast non column=column equijoins such as JOIN ON a.x + 5 = b.y
  • Better performance on non-equijoins (#4562)
- -

Streaming Execution

- -

One emerging use case for DataFusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.

- -

With this release, DataFusion now supports the following streaming features:

- -
  • Data ingestion from infinite files such as FIFOs (#4694),
  • Detection of pipeline-breaking queries in streaming use cases (#4694),
  • Automatic input swapping for joins so the probe side is a data stream (#4694),
  • Intelligent elision of pipeline-breaking sort operations whenever possible (#4691),
  • Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).
- -

These are major steps forward, and we plan even more improvements over the next few releases.
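For example, an infinite source such as a FIFO can be declared unbounded so the planner chooses streaming-friendly operators (a syntax sketch; the path and schema are hypothetical):

```sql
CREATE UNBOUNDED EXTERNAL TABLE trades
STORED AS CSV
LOCATION '/tmp/trades.fifo';

-- Runs incrementally, emitting results without ever seeing end-of-input
SELECT symbol, price FROM trades WHERE price > 100;
```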

- -

Better Support for Distributed Catalogs

- -

16.0.0 has enhanced support for asynchronous catalogs (#4607) to better support distributed metadata stores such as Delta.io and Apache Iceberg, which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.

- -

Additional SQL Support

-

SQL support continues to improve, including some of these highlights:

- -
  • Add TPC-DS query planning regression tests #4719
  • Support for PREPARE statement #4490
  • Automatic coercion between Date and Timestamp #4726
  • Support type coercion for timestamp and utf8 #4312
  • Full support for time32 and time64 literal values (ScalarValue) #4156
  • New functions, including uuid() #4041, current_time #4054, current_date #4022
  • Compressed CSV/JSON support #3642
- -
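For instance, the new PREPARE support allows parameterized statements to be planned once and executed with different arguments (placeholder style and schema here are illustrative):

```sql
PREPARE top_pages (INT) AS
  SELECT url, COUNT(*) AS hits
  FROM visits
  GROUP BY url
  ORDER BY hits DESC
  LIMIT $1;

EXECUTE top_pages(10);
```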

The community has also invested in new sqllogic based tests to keep improving DataFusion’s quality with less effort.

- -

Plan Serialization and Substrait

- -

DataFusion now supports serialization of physical plans with a custom protocol buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra.

- -

How to Get Involved

- -

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

- -

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes, or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

- -

Check out our Communication Doc for more ways to engage with the community.

- -

Appendix: Contributor Shoutout

- -

Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0. Thank you all!

- -
   113	Andrew Lamb
-    58	jakevin
-    46	Raphael Taylor-Davies
-    30	Andy Grove
-    19	Batuhan Taskaya
-    19	Remzi Yang
-    17	ygf11
-    16	Burak
-    16	Jeffrey
-    16	Marco Neumann
-    14	Kun Liu
-    12	Yang Jiang
-    10	mingmwang
-     9	Daniël Heres
-     9	Mustafa akur
-     9	comphead
-     9	mvanschellebeeck
-     9	xudong.w
-     7	dependabot[bot]
-     7	yahoNanJing
-     6	Brent Gardner
-     5	AssHero
-     4	Jiayu Liu
-     4	Wei-Ting Kuo
-     4	askoa
-     3	André Calado Coroado
-     3	Jie Han
-     3	Jon Mease
-     3	Metehan Yıldırım
-     3	Nga Tran
-     3	Ruihang Xia
-     3	baishen
-     2	Berkay Şahin
-     2	Dan Harris
-     2	Dongyan Zhou
-     2	Eduard Karacharov
-     2	Kikkon
-     2	Liang-Chi Hsieh
-     2	Marko Milenković
-     2	Martin Grigorov
-     2	Roman Nozdrin
-     2	Tim Van Wassenhove
-     2	r.4ntix
-     2	unconsolable
-     2	unvalley
-     1	Ajaya Agrawal
-     1	Alexander Spies
-     1	ArkashaJavelin
-     1	Artjoms Iskovs
-     1	BoredPerson
-     1	Christian Salvati
-     1	Creampanda
-     1	Data Psycho
-     1	Francis Du
-     1	Francis Le Roy
-     1	LFC
-     1	Marko Grujic
-     1	Matt Willian
-     1	Matthijs Brobbel
-     1	Max Burke
-     1	Mehmet Ozan Kabak
-     1	Rito Takeuchi
-     1	Roman Zeyde
-     1	Vrishabh
-     1	Zhang Li
-     1	ZuoTiJia
-     1	byteink
-     1	cfraz89
-     1	nbr
-     1	xxchan
-     1	yujie.zhang
-     1	zembunia
-     1	哇呜哇呜呀咦耶
-
]]>
pmc
\ No newline at end of file +]]>
alamb, Dandandan, tustvold
\ No newline at end of file diff --git a/img/string-view-1/figure1-performance.png b/img/string-view-1/figure1-performance.png new file mode 100644 index 0000000..628f9aa Binary files /dev/null and b/img/string-view-1/figure1-performance.png differ diff --git a/img/string-view-1/figure2-string-view.png b/img/string-view-1/figure2-string-view.png new file mode 100644 index 0000000..9a2cd63 Binary files /dev/null and b/img/string-view-1/figure2-string-view.png differ diff --git a/img/string-view-1/figure4-copying.png b/img/string-view-1/figure4-copying.png new file mode 100644 index 0000000..cf94219 Binary files /dev/null and b/img/string-view-1/figure4-copying.png differ diff --git a/img/string-view-1/figure5-loading-strings.png b/img/string-view-1/figure5-loading-strings.png new file mode 100644 index 0000000..d287efa Binary files /dev/null and b/img/string-view-1/figure5-loading-strings.png differ diff --git a/img/string-view-1/figure6-utf8-validation.png b/img/string-view-1/figure6-utf8-validation.png new file mode 100644 index 0000000..98185ee Binary files /dev/null and b/img/string-view-1/figure6-utf8-validation.png differ diff --git a/img/string-view-1/figure7-end-to-end.png b/img/string-view-1/figure7-end-to-end.png new file mode 100644 index 0000000..bb5ff40 Binary files /dev/null and b/img/string-view-1/figure7-end-to-end.png differ diff --git a/img/string-view-2/figure1-zero-copy-take.png b/img/string-view-2/figure1-zero-copy-take.png new file mode 100644 index 0000000..44363f9 Binary files /dev/null and b/img/string-view-2/figure1-zero-copy-take.png differ diff --git a/img/string-view-2/figure2-filter-time.png b/img/string-view-2/figure2-filter-time.png new file mode 100644 index 0000000..893fa09 Binary files /dev/null and b/img/string-view-2/figure2-filter-time.png differ diff --git a/index.html b/index.html index f5bc5f6..7b3aeee 100644 --- a/index.html +++ b/index.html @@ -38,7 +38,17 @@

Posts

-