The Deephaven CSV Library is a high-performance, column-oriented, type inferencing CSV parser. It differs from other CSV libraries in that it organizes data into columns rather than rows, which allows for more efficient storage and retrieval. It also can dynamically infer the types of those columns based on the input, so the caller is not required to specify the column types beforehand. Finally it provides a way for the caller to specify the underlying data structures used for columnar storage, This allows the library to store its data directly in the caller's preferred data structure, without the inefficiency of going through intermediate temporary objects.
The Deephaven CSV Library is agnostic about what data sink you use, and it works equally well with Java arrays, your own custom column type, or perhaps even streaming to a file. But along with this flexibility comes extra programming effort on the part of the implementor: instead of telling the library what column data structures to use, the caller provides a "factory" capable of constructing any requested column type, and the library then dynamically decides which ones it needs as it parses the input data. While it is tempting to just use ArrayList or some other catch-all collection, this is not as efficient as type-specific collectors, and makes a large impact on performance as data sizes increase. Instead, it is common practice in high-performance libraries to provide multiple, very similar but distinct implementations, one for each primitive type. For example, your high-performance application might have YourCharColumnType, YourIntColumnType, YourDoubleColumnType, and the like. Unfortunately this translates into a certain amount of tedium for the implementor, who needs to provide implementations for each type and code to move data from the CSV library to them.
With this guide we hope to make it clear what the caller needs to implement, and also to provide a reference implementation for people to use as a starting point.
To help you get started, the library provides a "sink factory" that uses Java arrays for the underlying column representation. This version is best suited for simple examples and for learning how to use the library. Developers of production applications will likely want to define their own column representations and to create the sink factory that supplies them. The documentation in ADVANCED.md describes how to do this. For now, we show how to process data using the builtin sink factory for arrays:
final InputStream inputStream = ...;
final CsvSpecs specs = CsvSpecs.csv();
final CsvReader.Result result = CsvReader.read(specs, inputStream, SinkFactory.arrays());
final long numRows = result.numRows();
for (CsvReader.ResultColumn col : result) {
switch (col.dataType()) {
case BOOLEAN_AS_BYTE: {
byte[] data = (byte[]) col.data();
// Process this boolean-as-byte column.
// Be sure to use numRows rather than data.length, because
// the underlying array might have excess capacity.
process(data, numRows);
break;
}
case SHORT: {
short[] data = (short[]) col.data();
// Process this short column.
process(data, numRows);
break;
}
// etc...
}
}
If your application uses reserved null sentinel values, there is an overload of SinkFactory.arrays() that allows you to specify those values.
This project produces two JARs:
deephaven-csv
: the primary dependency- (optional, but recommended)
deephaven-csv-fast-double-parser
: a fast double parser
To depend on Deephaven CSV from Gradle, add the following dependency(s) to your build.gradle file:
implementation 'io.deephaven:deephaven-csv:0.14.0'
// Optional dependency for faster double parsing
// runtimeOnly 'io.deephaven:deephaven-csv-fast-double-parser:0.14.0'
To depend on Deephaven CSV from Maven, add the following dependency(s) to your pom.xml file:
<dependency>
<groupId>io.deephaven</groupId>
<artifactId>deephaven-csv</artifactId>
<version>0.14.0</version>
</dependency>
<!-- Optional dependency for faster double parsing -->
<!--<dependency>-->
<!-- <groupId>io.deephaven</groupId>-->
<!-- <artifactId>deephaven-csv-fast-double-parser</artifactId>-->
<!-- <version>0.14.0</version>-->
<!-- <scope>runtime</scope>-->
<!--</dependency>-->
To run the main tests:
./gradlew check
./gradlew build
Spotless is used for code formatting.
To auto-format your code, you can run:
./gradlew spotlessApply
If you are doing local development and want to consume deephaven-csv
changes in other components, you can publish to maven local:
./gradlew publishToMavenLocal
To run the all of the JMH benchmarks locally, you can run:
./gradlew jmh
This will produce a textual output to the screen, as well as machine-readable results at build/results/jmh/results.json
.
To run specific JMH benchmarks, you can run:
./gradlew jmh -Pjmh.includes="<regex>"
If you prefer, you can run the benchmarks directly via the JMH jar:
./gradlew jmhJar
java -jar build/libs/deephaven-csv-0.14.0-SNAPSHOT-jmh.jar -prof gc -rf JSON
java -jar build/libs/deephaven-csv-0.14.0-SNAPSHOT-jmh.jar -prof gc -rf JSON <regex>
The JMH jar is the preferred way to run official benchmarks, and provides a common bytecode for sharing the benchmarks among multiple environments.
JMH Visualizer is a convenient tool for visualizing JMH results.
The benchmarks have tests to ensure that the benchmark implementations are producing the correct results. To run the benchmark tests, run:
./gradlew jmhTest