diff --git a/_posts/2025-01-07-arrow-result-transfer.md b/_posts/2025-01-07-arrow-result-transfer.md index 922012c5d3f..1d3403ba754 100644 --- a/_posts/2025-01-07-arrow-result-transfer.md +++ b/_posts/2025-01-07-arrow-result-transfer.md @@ -66,17 +66,21 @@ So it is increasingly common for both the source format and the target format of The Arrow format is a columnar data format. The column-oriented layout of data in the Arrow format is similar—and in some cases identical—to the layout of data in many widely used columnar data source systems and destination systems. -### 2. The Arrow format is schema-aware and type-safe. +### 2. The Arrow format is self-describing and type-safe. -When a data format includes schema information and enforces type consistency, the destination system can safely and efficiently determine the types of the columns in the data and can rule out the possibility of type errors when processing the data. By contrast, when a data format does not include schema information, the destination system must scan the structure and contents of the data to infer the types—a slow and error-prone process—otherwise it must look up the schema information in a separate system or require that the source system provide a separate schema describing the data. Similarly, when a data format does not enforce type consistency, the destination system must check the validity of each individual value in the data—a computationally expensive process—or else handle type errors when processing the data. These steps add up to large deserialization overheads. +In a self-describing data format, the schema (the names and types of the columns) and other metadata that describes the data’s structure are included with the data. A self-describing format provides the receiving system with all the information it needs to safely and efficiently process the data.
By contrast, when a format is not self-describing, the receiving system must scan the data to infer its schema and structure (a slow and error-prone process) or obtain the schema separately. -The Arrow format includes schema information and enforces type consistency. Arrow’s type system is similar—and in some cases identical—to the type systems in many widely used data source systems and destination systems. This includes most columnar systems and many row-oriented systems such as Apache Spark and various relational databases. These systems can quickly and safely convert data values between their native type systems and the Arrow type system. +An important property of some self-describing data formats is the ability to enforce type safety. When a format enforces type safety, the receiving system can rule out the possibility of type errors when processing the data. By contrast, when a format does not enforce type safety, the receiving system must check the validity of each individual value in the data (a computationally expensive process) or else handle type errors when processing the data. + +When reading data from a non-self-describing, type-unsafe format (such as CSV), all this scanning and inferring and checking contributes to large deserialization overheads. Worse, such formats can cause ambiguities, debugging difficulties, maintenance challenges, and security vulnerabilities. + +The Arrow format is self-describing and enforces type safety. Furthermore, Arrow’s type system is similar—and in some cases identical—to the type systems of many widely used data sources and destinations. This includes most columnar data systems and many row-oriented systems such as Apache Spark and various relational databases. These systems can quickly and safely convert data values between their native types and the corresponding Arrow types. ### 3. The Arrow format enables zero-copy. 
A zero-copy operation is one in which data is transferred from one medium to another without creating any intermediate copies. When a data format supports zero-copy operations, this means that its structure in memory is the same as its structure on disk or on the network. So, for example, the data can be read off of the network directly into a usable data structure in memory without performing any intermediate copies or conversions. -The Arrow format supports zero-copy operations. Arrow defines a column-oriented tabular data structure called a [record batch](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc){:target="_blank"} which can be held in memory, sent over a network, or stored on disk. The binary structure of an Arrow record batch is the same regardless of which medium it is on. Also, to hold schema information and other metadata, Arrow uses FlatBuffers, a format created by Google which also has the same binary structure regardless of which medium it is on. +The Arrow format supports zero-copy operations. Arrow defines a column-oriented tabular data structure called a [record batch](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc){:target="_blank"} which can be held in memory, sent over a network, or stored on disk. The binary structure of an Arrow record batch is the same regardless of which medium it is on. Also, to hold schemas and other metadata, Arrow uses FlatBuffers, a format created by Google which also has the same binary structure regardless of which medium it is on. As a result of these design choices, Arrow can serve not only as a transfer format but also as an in-memory format and on-disk format. This is in contrast to text-based formats such as JSON and CSV, which encode data values as plain text strings separated by delimiters and other structural syntax. 
To load data from these formats into a usable in-memory data structure, the data must be parsed and decoded. This is also in contrast to binary formats such as Parquet and ORC, which use encodings and compression to reduce the size of the data on disk. To load data from these formats into a usable in-memory data structure, it must be decompressed and decoded.[^3] @@ -88,7 +92,7 @@ The Arrow format was designed to be highly efficient as an in-memory format for A streamable data format is one that can be processed sequentially, one chunk at a time, without waiting for the full dataset. When data is being transmitted in a streamable format, the receiving system can begin processing it as soon as the first chunk arrives. This can speed up data transfer in several ways: transfer time can overlap with processing time; the receiving system can use memory more efficiently; and multiple streams can be transferred in parallel, speeding up transmission, deserialization, and processing. -CSV is an example of a streamable data format, because the column names (if included) are in a header at the top of the file, and the lines in the file can be processed sequentially. Parquet and ORC are examples of data formats that do not enable streaming, because the schema and other metadata, which is required to interpret the data, is held in a footer at the bottom of the file, making it necessary to download the entire file before processing any data. +CSV is an example of a streamable data format, because the column names (if included) are in a header at the top of the file, and the lines in the file can be processed sequentially. Parquet and ORC are examples of data formats that do not enable streaming, because the schema and other metadata, which are required to process the data, are held in a footer at the bottom of the file, making it necessary to download the entire file before any processing can begin. Arrow is a streamable data format.
A dataset can be represented in Arrow as a sequence of record batches that all have the same schema. Arrow defines a [streaming format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format){:target="_blank"} consisting of the schema followed by one or more record batches. A system receiving an Arrow stream can process the record batches sequentially as they arrive.