
Fix footnote placement
ianmcook committed Jan 7, 2025
1 parent f22647e commit ec9628a
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion _posts/2025-01-07-arrow-result-transfer.md
@@ -41,7 +41,7 @@ Transferring a query result from a source to a destination involves three steps:
2. Transmit the data over the network in the transfer format.[^1]
3. At the destination, deserialize the transfer format into the target format.

-In the era of slower networks, the transmission step was usually the bottleneck, so there was little incentive to speed up the serialization and deserialization steps.[^2] Instead, the emphasis was on making the transferred data smaller, typically using compression, to reduce the transmission time. It was during this era that the most widely used database connectivity APIs (ODBC and JDBC) and database client protocols (such as the MySQL client/server protocol and the PostgreSQL frontend/backend protocol) were designed. But as networks have become faster and transmission times have dropped, the bottleneck has shifted to the serialization and deserialization steps. This is especially true for queries that produce the larger result sizes characteristic of many data engineering and data analytics pipelines.
+In the era of slower networks, the transmission step was usually the bottleneck, so there was little incentive to speed up the serialization and deserialization steps. Instead, the emphasis was on making the transferred data smaller, typically using compression, to reduce the transmission time. It was during this era that the most widely used database connectivity APIs (ODBC and JDBC) and database client protocols (such as the MySQL client/server protocol and the PostgreSQL frontend/backend protocol) were designed. But as networks have become faster and transmission times have dropped, the bottleneck has shifted to the serialization and deserialization steps.[^2] This is especially true for queries that produce the larger result sizes characteristic of many data engineering and data analytics pipelines.

Yet many query results today continue to flow through legacy APIs and protocols that add massive serialization and deserialization (“ser/de”) overheads by forcing data into inefficient transfer formats. In a [2021 paper](https://www.vldb.org/pvldb/vol14/p534-li.pdf), Tianyu Li et al. presented an example using ODBC and the PostgreSQL protocol in which 99.996% of total query time was spent on ser/de. That is arguably an extreme case, but we have observed 90% or higher in many real-world cases. Today, for data engineering and data analytics queries, there is a strong incentive to choose a transfer format that speeds up ser/de.
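The post's argument that ser/de, not transmission, dominates total query time can be illustrated with a rough stdlib-only sketch (this is a toy, not the Arrow format and not code from the post): the same two-column table is serialized and deserialized both as row-wise text, in the spirit of classic text-based wire protocols, and as packed binary columns. The format names and helper functions are hypothetical.

```python
# Toy comparison of two transfer formats for the same table (hypothetical
# helpers, not Arrow): row-wise text vs. packed columnar binary.
import struct
import time

N_ROWS = 100_000
table = {"a": list(range(N_ROWS)), "b": [i * 2 for i in range(N_ROWS)]}

# --- Row-wise text format: one "a,b\n" line per row ---
def text_serialize(tbl):
    return "".join(f"{a},{b}\n" for a, b in zip(tbl["a"], tbl["b"])).encode()

def text_deserialize(buf):
    a, b = [], []
    for line in buf.decode().splitlines():
        x, y = line.split(",")
        a.append(int(x))
        b.append(int(y))
    return {"a": a, "b": b}

# --- Columnar binary format: row count, then each column as 64-bit ints ---
def binary_serialize(tbl):
    parts = [struct.pack("<q", len(tbl["a"]))]
    for col in ("a", "b"):
        parts.append(struct.pack(f"<{len(tbl[col])}q", *tbl[col]))
    return b"".join(parts)

def binary_deserialize(buf):
    (n,) = struct.unpack_from("<q", buf, 0)
    out, offset = {}, 8
    for col in ("a", "b"):
        out[col] = list(struct.unpack_from(f"<{n}q", buf, offset))
        offset += 8 * n
    return out

for name, ser, de in [("row-wise text", text_serialize, text_deserialize),
                      ("columnar binary", binary_serialize, binary_deserialize)]:
    t0 = time.perf_counter()
    buf = ser(table)
    restored = de(buf)
    assert restored == table  # round trip preserves the data
    print(f"{name}: {len(buf)} bytes, {time.perf_counter() - t0:.4f}s ser/de")
```

On a fast network the transmission time for either buffer is small, so the difference that remains is almost entirely the parse/format cost, which is the shift the post describes; the binary columnar path avoids per-value text formatting and parsing entirely.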

