diff --git a/docs/projection/index.html b/docs/projection/index.html new file mode 100644 index 00000000..0477848e --- /dev/null +++ b/docs/projection/index.html @@ -0,0 +1,55 @@ +documentation: Projection

Projection


Schema projection is a way to optimize reads. When you call ParquetReader.as[MyData], Parquet4s reads the whole content of each Parquet record, even if the case class you provide maps only some of the stored columns. The same happens when you read generic records by calling ParquetReader.generic. However, you can explicitly tell Parquet4s to use a different schema. As a result, all columns that do not match your schema are skipped and not read. You can define the projection schema in several ways:

  1. by defining a case class for a typed read using projectedAs,
  2. by defining a generic column projection (which allows references to nested fields and aliases) using projectedGeneric,
  3. by providing your own instance of Parquet’s MessageType for a generic read using projectedGeneric.

import com.github.mjakubowski84.parquet4s.{Col, ParquetIterable, ParquetReader, Path, RowParquetRecord}
import org.apache.parquet.schema.MessageType

// typed read
case class MyData(column1: Int, columnX: String)
val myData: ParquetIterable[MyData] =
  ParquetReader
    .projectedAs[MyData]
    .read(Path("file.parquet"))

// generic read with column projection
val records1: ParquetIterable[RowParquetRecord] =
  ParquetReader
    .projectedGeneric(
      Col("column1").as[Int],
      Col("columnX").as[String].alias("my_column"),
    )
    .read(Path("file.parquet"))

// generic read with own instance of Parquet schema
val schemaOverride: MessageType = ??? // placeholder: supply your own schema here
val records2: ParquetIterable[RowParquetRecord] =
  ParquetReader
    .projectedGeneric(schemaOverride)
    .read(Path("file.parquet"))