The `DataSink` framework is a utility framework that helps configure and write `DataFrame`s.
This framework provides for writing to a given path or target with the specified format, like `avro`, `parquet`, `orc`, `json`, `csv`, `jdbc`...
The framework is composed of two main traits:
- `DataSink`, which is created based on a `DataSinkConfiguration` class and provides two main functions, `def writer(data: DataFrame): Try[DataFrameWriter[Row]]` and `def write(data: DataFrame): Try[DataFrame]` (a rough sketch of this contract follows the list)
- `DataSinkConfiguration`: a marker trait used to define `DataSink` configuration classes
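To make the two functions above concrete, here is a minimal sketch of how such a sink could be shaped for a parquet file target. This is only an illustration of the contract, not the library's actual `FileDataSink` code; `ParquetSinkSketch` and its constructor parameter are made up for this example.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, DataFrameWriter, Row}

// Sketch only: build a configured writer, then perform the write and hand the
// data back, with any failure captured in the Try instead of being thrown.
class ParquetSinkSketch(path: String) {

  def writer(data: DataFrame): Try[DataFrameWriter[Row]] =
    Try(data.write.format("parquet"))

  def write(data: DataFrame): Try[DataFrame] =
    writer(data).map { w => w.save(path); data }
}
```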
The framework provides the following predefined `DataSink` implementations:
- `FileDataSink`
- `JdbcDataSink`
- `GenericDataSink`
- `FileStreamDataSink`
- `KafkaStreamDataSink`
- `GenericStreamDataSink`
For convenience, the `DataAwareSinkFactory` trait and a default implementation are provided.
To create a `DataSink` out of a given `DataSinkConfiguration` instance, one can call
`DataAwareSinkFactory( someDataSinkConfigurationInstance )`.
Also, in order to easily extract the configuration from a given Typesafe `Config` instance,
the `FormatAwareDataSinkConfiguration` factory is provided:
`FormatAwareDataSinkConfiguration( someTypesafeConfigurationInstance )`.
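As an illustration of the two factories above, the snippet below first extracts a sink configuration from a Typesafe `Config` instance and then builds a sink from an already available configuration instance, assuming the factory calls are available exactly as shown above. The configuration keys (`format`, `path`) are assumptions for a file sink and may differ for other sink types; the factories' return types are left to inference here.

```scala
import com.typesafe.config.ConfigFactory
import org.tupol.spark.io._

// Illustrative settings for a file sink; the actual keys depend on the sink type.
val rawConfig = ConfigFactory.parseString(
  """
    |format: "parquet"
    |path: "/tmp/output/results"
  """.stripMargin)

// Extract a format-aware sink configuration from the Typesafe Config instance...
val sinkConfiguration = FormatAwareDataSinkConfiguration(rawConfig)

// ...and build a DataSink from a configuration instance through the factory.
def someDataSinkConfiguration: DataSinkConfiguration = ???
val dataSink = DataAwareSinkFactory(someDataSinkConfiguration)
```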
There is a convenience implicit decorator for `DataFrame`s, available through the following import statements:

```scala
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._
```

The `org.tupol.spark.io` package contains the implicit factories for data sinks and `org.tupol.spark.io.implicits` contains the actual `DataFrame` decorator.
This allows us to create a `DataSink` by calling the `sink()` function on a `DataFrame`, passing a `DataSinkConfiguration` instance.
```scala
import org.apache.spark.sql.DataFrame
import org.tupol.spark.io.{pureconf, _}
import org.tupol.spark.io.implicits._

def dataFrame: DataFrame = ???
def dataSinkConfiguration: DataSinkConfiguration = ???

// Create a DataSink for this DataFrame from the given configuration
val dataSink: DataSink[_, _, _] = dataFrame.sink(dataSinkConfiguration)
```