# DataSink

## Description

The DataSink framework is a utility framework that helps with configuring and writing DataFrames.

This framework provides for writing to a given path with the specified format, such as avro, parquet, orc, json, csv, jdbc...

The framework is composed of two main traits:

- `DataSink`, which is created based on a `DataSinkConfiguration` class and provides two main functions (a minimal sketch of this contract follows the list):

  ```scala
  def writer(data: DataFrame): Try[DataFrameWriter[Row]]
  def write(data: DataFrame): Try[DataFrame]
  ```

- `DataSinkConfiguration`: a marker trait to define `DataSink` configuration classes
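
To make the contract concrete, here is a minimal sketch of what an implementation of these two functions might look like. The `SimpleParquetSink` and `SimpleParquetSinkConfiguration` names are hypothetical and only illustrate the shape of the functions described above; they do not extend the actual library traits.

```scala
import scala.util.Try
import org.apache.spark.sql.{ DataFrame, DataFrameWriter, Row, SaveMode }

// Hypothetical configuration; a real one would extend the DataSinkConfiguration
// marker trait described above.
case class SimpleParquetSinkConfiguration(path: String)

// Hypothetical sink mirroring the two DataSink functions; the library trait
// itself is not extended here, to keep the sketch self-contained.
class SimpleParquetSink(configuration: SimpleParquetSinkConfiguration) {

  // Configure a writer without triggering the actual write.
  def writer(data: DataFrame): Try[DataFrameWriter[Row]] =
    Try(data.write.format("parquet").mode(SaveMode.Overwrite))

  // Perform the write and return the written DataFrame on success.
  def write(data: DataFrame): Try[DataFrame] =
    writer(data).map { w => w.save(configuration.path); data }
}
```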

## Usage

The framework provides a set of predefined `DataSink` implementations covering the formats mentioned above.

For convenience, the `DataAwareSinkFactory` trait and its default implementation are provided. To create a `DataSink` out of a given `DataSinkConfiguration` instance, one can call

```scala
DataAwareSinkFactory( someDataSinkConfigurationInstance )
```
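
For example, assuming a concrete configuration instance is already at hand (its construction is not shown here) and that, as described above, the factory yields a `DataSink` whose `write` function takes the DataFrame and returns a `Try`, the factory could be used like this:

```scala
import scala.util.{ Failure, Success }
import org.apache.spark.sql.DataFrame
import org.tupol.spark.io._

// Assumed to be provided elsewhere (e.g. built from application settings).
def sinkConfiguration: DataSinkConfiguration = ???
def data: DataFrame = ???

// The factory resolves the DataSink implementation matching the configuration;
// write returns a Try, so failures can be handled explicitly.
DataAwareSinkFactory(sinkConfiguration).write(data) match {
  case Success(_)     => println("Data written successfully")
  case Failure(error) => println(s"Writing failed: $error")
}
```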

Also, in order to easily extract the configuration from a given TypeSafe Config instance, the `FormatAwareDataSinkConfiguration` factory is provided.

```scala
FormatAwareDataSinkConfiguration( someTypesafeConfigurationInstance )
```
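
As a sketch, the factory could be fed a configuration parsed with the TypeSafe Config API; the `format` and `path` keys below are only illustrative, and the actual keys are described in the configuration parameters section further down.

```scala
import com.typesafe.config.ConfigFactory

// Illustrative settings; see the configuration parameters section for the
// actual keys supported by each sink.
val config = ConfigFactory.parseString(
  """
    |format: "parquet"
    |path: "/tmp/output/data"
    |""".stripMargin
)

// Extract a format-aware sink configuration from the TypeSafe Config instance.
val sinkConfiguration = FormatAwareDataSinkConfiguration(config)
```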

There is a convenience implicit decorator for DataFrames, available through the following import statements:

```scala
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._
```

The `org.tupol.spark.io` package contains the implicit factories for data sinks, and the `org.tupol.spark.io.implicits` package contains the actual DataFrame decorator.

This allows us to create a `DataSink` by calling the `sink()` function on a DataFrame, passing a `DataSinkConfiguration` instance.

```scala
import org.apache.spark.sql.DataFrame
import org.tupol.spark.io.{pureconf, _}
import org.tupol.spark.io.implicits._

def dataFrame: DataFrame = ???
def dataSinkConfiguration: DataSinkConfiguration = ???

// Create the sink directly from the DataFrame through the implicit decorator.
val dataSink: DataSink[_, _, _] = dataFrame.sink(dataSinkConfiguration)
```
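
Assuming, as described in the trait overview above, that the resulting sink's `write` function takes the DataFrame and returns a `Try`, the actual write could then be triggered like this:

```scala
// write returns a Try, so foreach only runs when the write succeeded.
dataSink.write(dataFrame).foreach(_ => println("Data written successfully"))
```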

## Configuration Parameters