# GenericStreamDataSink

## Description

The `GenericStreamDataSink` framework is a utility framework that helps with configuring and writing DataFrames to streams.

The framework is composed of two classes:

- `GenericStreamDataSink`, which is created based on a `GenericStreamDataSinkConfiguration` class and provides two main functions:
  - `def writer(data: DataFrame): Try[DataStreamWriter[Row]]`
  - `def write(implicit spark: SparkSession): Try[StreamingQuery]`
- `GenericStreamDataSinkConfiguration`: the necessary configuration parameters

## Sample code

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.tupol.spark.io._

implicit val sparkSession: SparkSession = ???
val sinkConfiguration: GenericStreamDataSinkConfiguration = ???
val data: DataFrame = ???
val streamingQuery = GenericStreamDataSink(sinkConfiguration).write(data)
```

Optionally, one can use the implicit decorator for the `DataFrame`, available by importing `org.tupol.spark.io.implicits._`.

## Sample code

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._

implicit val sparkSession: SparkSession = ???
val sinkConfiguration: GenericStreamDataSinkConfiguration = ???
val data: DataFrame = ???
val streamingQuery = data.streamingSink(sinkConfiguration).write
```
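Since `write` returns a `Try[StreamingQuery]`, the caller still decides what to do with the started query. The following is a minimal sketch of one common pattern, reusing the names from the sample above: block until the query terminates, and rethrow the error if the sink could not be configured or started. This is only one possible choice, not the framework's prescribed usage.

```scala
import scala.util.{Failure, Success}
import org.apache.spark.sql.streaming.StreamingQuery

// The sink returns a Try[StreamingQuery]; one simple option is to block on
// the query until it finishes and rethrow any configuration/startup failure.
data.streamingSink(sinkConfiguration).write match {
  case Success(query: StreamingQuery) => query.awaitTermination()
  case Failure(error)                 => throw error
}
```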

## Configuration Parameters

### Common Parameters

- `format` **Required**
  - the format of the output and the corresponding sink / writer
  - possible values are:
    - `kafka`
    - file formats: `xml`, `csv`, `json`, `parquet`, `avro`, `orc` and `text`
- `trigger` **Optional**
  - `type`: possible values: `"continuous"`, `"once"`, `"available-now"`, `"processing-time"`
  - `interval`: mandatory for the `"continuous"` and `"processing-time"` trigger types
- `queryName` **Optional**
- `partition.columns` **Optional**
- `outputMode` **Optional**
- `checkpointLocation` **Optional**

### File Parameters

- `options` **Required** (see the sketch below)
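To make the parameters concrete, here is a minimal sketch of what a file sink configuration could look like, written as a Typesafe Config (HOCON) fragment built from Scala. The key names mirror the Common Parameters above; the exact nesting, the `path` option and how the framework extracts a `GenericStreamDataSinkConfiguration` from such a `Config` are assumptions and may differ between versions.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical HOCON fragment for a JSON file sink.
// Key names follow the "Common Parameters" list above; the exact nesting
// and the "path" option are assumptions for illustration only.
val fileSinkConfig: Config = ConfigFactory.parseString(
  """
    |format: "json"
    |queryName: "json-output-stream"
    |outputMode: "append"
    |checkpointLocation: "/tmp/checkpoints/json-output-stream"
    |trigger: {
    |  type: "processing-time"
    |  interval: "10 seconds"
    |}
    |partition.columns: ["year", "month", "day"]
    |options: {
    |  path: "/tmp/output/json"
    |}
  """.stripMargin)
```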

### Kafka Parameters

- `options` **Required** (see the sketch below)
  - `kafkaBootstrapServers` **Required**
  - `topic` **Required**
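Similarly, a hedged sketch of a Kafka sink configuration, using the option names listed above; the exact option spelling expected at the Spark level (for example `kafka.bootstrap.servers`) is an assumption and should be checked against the version in use.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical HOCON fragment for a Kafka sink.
// Option names follow the "Kafka Parameters" list above and are assumptions
// for illustration only.
val kafkaSinkConfig: Config = ConfigFactory.parseString(
  """
    |format: "kafka"
    |outputMode: "append"
    |checkpointLocation: "/tmp/checkpoints/kafka-output-stream"
    |options: {
    |  kafkaBootstrapServers: "localhost:9092"
    |  topic: "output-topic"
    |}
  """.stripMargin)
```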

## References