Important: These instructions assume you have access to StreamSets Data Collector (v3.15+) and have performed all the prerequisites for Oracle and Databricks Delta Lake:
- For help installing StreamSets Data Collector, see StreamSets Data Collector Installation.
- For help with Oracle CDC Client prerequisites, see Oracle CDC Client (a brief sketch of one common prerequisite follows this list).
- For help with Databricks Delta Lake prerequisites, see Databricks Delta Lake.
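As a quick illustration of one common Oracle CDC Client prerequisite, the sketch below enables database-level supplemental logging, which LogMiner-based CDC depends on. The connection details are hypothetical, the statement must be run by a suitably privileged DBA account, and the full prerequisite list (LogMiner configuration, user privileges, and so on) is in the Oracle CDC Client documentation linked above.

```python
# A minimal sketch, assuming the python-oracledb driver and a privileged DBA
# account; every connection detail below is a placeholder.
import oracledb

conn = oracledb.connect(
    user="dba_user",                               # hypothetical administrative account
    password="********",
    dsn="oracle-host.example.com:1521/ORCLCDB",
)
with conn.cursor() as cur:
    # Minimal (primary key) supplemental logging; some environments enable ALL columns instead.
    cur.execute("ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS")
conn.close()
```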
For more information, see Loading Data into Databricks Delta Lake in StreamSets Data Collector documentation.
Here is a link to a short video on using this pipeline template: Video Link
This pipeline demonstrates how to read change data capture (CDC) data from an Oracle database and replicate the changes to Databricks Delta Lake.
Disclaimer: This pipeline is meant to serve as a template for performing Oracle CDC to Databricks Delta Lake. Some of the parameters, tables and fields may be different for your environment and may need additional customizations. Please consult the StreamSets documentation (linked below) for full information on configuration of each stage used below. For example, this pipeline has a single table defined with keys. If you want to handle multiple tables, you will need to add them to the Delta Lake destination --> Data tab.
NOTE: Templates are supported in StreamSets Control Hub. If you do not have Control Hub, you can still import the template pipeline into Data Collector, but you will need to re-import it each time you want to use the template.
Click Here to download the pipeline and save it to your drive.
Click the down arrow next to "Create New Pipeline" and select "Import Pipeline From Archive".
Click "Browse" and locate the pipeline file you just downloaded, click "OK", then click "Import".
Click the pipeline you just imported to open it, then click the "Parameters" tab and fill in the appropriate information for your environment.
Important: The pipeline template uses the most common default settings for things like the region, staging location, and so on. All of these are configurable; if you need to change them, you can opt not to use the built-in parameters and choose the appropriate settings yourself. Please refer to the documentation listed in this document for all the available options.
The following parameters are set up for this pipeline (hypothetical example values are sketched after the table):

| Parameter | Description |
| --- | --- |
| oracle_schema | Schema to use. You can enter a schema name or use SQL LIKE syntax to specify a set of schemas. The origin submits the schema name in all caps by default. To use a lower- or mixed-case name, select the Case-Sensitive Names property. |
| oracle_table_name_pattern | Table name pattern that specifies the tables to track. You can enter a table name or use SQL LIKE syntax to specify a set of tables. The origin submits table names in all caps by default. To use lower- or mixed-case names, select the Case-Sensitive Names property. |
| oracle_CDC_connection_string | Connection string used to connect to the database. Note: If you include the JDBC credentials in the connection string, use the user account created for the origin. Common user accounts for Oracle 12c, 18c, or 19c multitenant databases start with c##. |
| oracle_username | User name for the JDBC connection. Use the user account created for the origin. Common user accounts for Oracle 12c, 18c, or 19c multitenant databases start with c##. For more information, see Task 3. Create a User Account. |
| oracle_password | Password for the account. Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. |
| deltalake_jdbc_url | JDBC URL used to connect to the Databricks cluster. Tip: In Databricks, you can locate the JDBC URL for your cluster on the JDBC/ODBC tab in the cluster configuration details. As a best practice, remove the PWD parameter from the URL (see the sketch after this table) and then enter the personal access token value in the Token property (the deltalake_token parameter) below. |
| deltalake_token | Personal access token used to connect to the Databricks cluster. Tip: To secure sensitive information such as tokens, you can use runtime resources or credential stores. |
| deltalake_directory_for_table_location | Directory for the Delta table location, specified as a path on Databricks File System (DBFS). The destination adds the specified Table Name value as a subdirectory to create the final table location. For example, if you enter /mnt/deltalake as the directory for the table location, the final table location is /mnt/deltalake/ plus the configured table name. When you specify a location, the destination creates an unmanaged Delta table. When you do not specify a location, the destination creates a managed Delta table. For more information, see the Delta Lake documentation. Available when data drift and automatic table creation are enabled. |
| deltalake_S3_bucket | Note: This template uses Amazon S3 as the staging location. If you want to use ADLS, you will need to change it on the Delta Lake destination --> Staging tab and provide all necessary configuration. Bucket name or path to the existing Amazon S3 location to write the staged files. Enter the bucket name or the full bucket path. Available when using the Amazon S3 staging location. |
| deltalake_S3_access_key | AWS access key ID. Required when not using IAM roles with IAM instance profile credentials. Available when using the Amazon S3 staging location. |
| deltalake_S3_secret_key | AWS secret access key. Required when not using IAM roles with IAM instance profile credentials. Available when using the Amazon S3 staging location. Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores. |
| deltalake_database_name | The Delta Lake database name. |
| deltalake_table | The Delta Lake table name. Note: The template can handle multiple different tables, but you need to configure all the tables and key columns on the Delta Lake destination --> Data tab. See Appendix 1 below for details. |
| deltalake_key_column | The Delta Lake key column. Note: The template can handle multiple different tables, but you need to configure all the tables and key columns on the Delta Lake destination --> Data tab. See Appendix 1 below for details. |
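To make the parameters concrete, here is a rough sketch of a filled-in set of runtime parameter values. Every host name, path, bucket, token, and ID below is a hypothetical placeholder, and the exact Databricks JDBC URL shape varies by cluster and driver version, so always copy the URL from your cluster's JDBC/ODBC tab rather than hand-assembling it.

```python
# Hypothetical runtime parameter values for this template. Every value is a
# placeholder; replace each one with settings from your own environment.
example_parameters = {
    # Oracle CDC Client origin
    "oracle_schema": "SALES",                                  # or a SQL LIKE pattern
    "oracle_table_name_pattern": "ORDERS%",                    # matches ORDERS, ORDERS_HISTORY, ...
    "oracle_CDC_connection_string": "jdbc:oracle:thin:@oracle-host.example.com:1521/ORCLPDB1",
    "oracle_username": "c##sdc_user",                          # common user for multitenant databases
    "oracle_password": "********",                             # prefer a credential store

    # Databricks Delta Lake destination (URL copied from the JDBC/ODBC tab, PWD removed)
    "deltalake_jdbc_url": (
        "jdbc:spark://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default;"
        "transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0/0123-456789-abcde123;"
        "AuthMech=3;UID=token"
    ),
    "deltalake_token": "dapiXXXXXXXXXXXXXXXXXXXXXXXX",
    "deltalake_directory_for_table_location": "/mnt/deltalake",  # table name is appended as a subdirectory
    "deltalake_S3_bucket": "my-staging-bucket/cdc-staging",      # bucket name or bucket/prefix path
    "deltalake_S3_access_key": "AKIAXXXXXXXXXXXXXXXX",
    "deltalake_S3_secret_key": "********",
    "deltalake_database_name": "sales_db",
    "deltalake_table": "orders",
    "deltalake_key_column": "ORDER_ID",
}
```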
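Because the deltalake_jdbc_url guidance above recommends removing the PWD parameter from the copied URL, here is a minimal sketch of that cleanup, assuming the URL uses the usual semicolon-separated key=value properties; double-check the result against the URL shown in Databricks.

```python
def strip_pwd(jdbc_url: str) -> str:
    """Drop any PWD=... property from a semicolon-delimited JDBC URL."""
    parts = [p for p in jdbc_url.split(";") if not p.upper().startswith("PWD=")]
    return ";".join(parts)

# Hypothetical URL copied from the JDBC/ODBC tab:
url = "jdbc:spark://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default;AuthMech=3;UID=token;PWD=dapi123"
print(strip_pwd(url))
# jdbc:spark://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default;AuthMech=3;UID=token
```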
Click the "START" button to run the pipeline.