The `GenericStreamDataSink` framework is a utility framework that helps with configuring and writing `DataFrame`s to streams.
The framework is composed of two classes:

- `GenericStreamDataSink`, which is created based on a `GenericStreamDataSinkConfiguration` class and provides two main functions:
  - `def writer(data: DataFrame): Try[DataStreamWriter[Row]]`
  - `def write(implicit spark: SparkSession): Try[StreamingQuery]`
- `GenericStreamDataSinkConfiguration`: the necessary configuration parameters
Sample code

```scala
import org.apache.spark.sql.{ DataFrame, SparkSession }
import org.tupol.spark.io._

implicit val sparkSession: SparkSession = ???
val data: DataFrame = ???
val sinkConfiguration: GenericStreamDataSinkConfiguration = ???
val streamingQuery = GenericStreamDataSink(sinkConfiguration).write(data)
```
Optionally, one can use the implicit decorator for the `DataFrame`, available by importing `org.tupol.spark.io.implicits._`.
Sample code

```scala
import org.apache.spark.sql.DataFrame
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._

val data: DataFrame = ???
val sinkConfiguration: GenericStreamDataSinkConfiguration = ???
val streamingQuery = data.streamingSink(sinkConfiguration).write
```
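Since `write` returns a `Try[StreamingQuery]`, the caller decides how to handle a sink that could not be configured or started. A minimal sketch, assuming the `streamingQuery` value from the samples above:

```scala
import scala.util.{ Failure, Success }

streamingQuery match {
  case Success(query) =>
    // Block until the streaming query stops or fails at runtime
    query.awaitTermination()
  case Failure(error) =>
    // The sink could not be configured or started
    println(s"Failed to start the streaming sink: ${error.getMessage}")
}
```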
The `GenericStreamDataSinkConfiguration` is described by the following parameters:

| Parameter | Required / Optional | Description |
| --------- | ------------------- | ----------- |
| `format` | Required | The format of the output stream and the corresponding sink; possible values are `kafka` and the file formats `xml`, `csv`, `json`, `parquet`, `avro`, `orc` and `text` |
| `trigger.type` | Optional | Possible values: `continuous`, `once`, `available-now`, `processing-time` |
| `trigger.interval` | Optional | Mandatory for the `continuous` and `processing-time` trigger types |
| `queryName` | Optional | |
| `partition.columns` | Optional | |
| `outputMode` | Optional | |
| `checkpointLocation` | Optional | |
| `options` | Required | Format-specific options for the selected sink |
| `options.path` | Required for the file formats | For more details check the File Data Sink |
| `options.kafkaBootstrapServers` | Required for the `kafka` format | |
| `options.topic` | Required for the `kafka` format | |
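For illustration, here is a hypothetical sketch of how these parameters might be laid out in a Typesafe Config (HOCON) file for a Kafka sink. The key names are taken directly from the table above, but the exact nesting and value formats are assumptions and should be checked against the library version in use:

```hocon
# Hypothetical sink configuration sketch; keys follow the parameter table above
format: "kafka"
queryName: "sample-output-query"           # optional
outputMode: "append"                       # optional
checkpointLocation: "/tmp/checkpoints"     # optional
trigger: {
  type: "processing-time"
  interval: "10 seconds"                   # mandatory for continuous / processing-time triggers
}
options: {
  kafkaBootstrapServers: "localhost:9092"
  topic: "output-topic"
}
```

For the file formats, the Kafka-specific options would be replaced by `options.path`, pointing to the output location, as covered by the File Data Sink documentation referenced above.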