Working with StreamSets Data Collector and Microsoft Azure

These tutorials explain how to use StreamSets Data Collector to integrate Azure Blob Storage, Apache Kafka on HDInsight, Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage:

Ingesting Data from Blob Storage into Apache Kafka on HDInsight
Ingesting Data from Apache Kafka on HDInsight into Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage

Pre-requisites

To work through these tutorials, make sure the following setup is in place. If it is not, follow the instructions for each step below:

Log in to the Azure Portal
Create a Virtual Network (vNet): This vNet lets the clusters created in the next steps communicate privately with each other.
Data storage: Create a Blob Container
Install StreamSets Data Collector for HDInsight. Also check the StreamSets documentation for more details on the installation process.
Create an Apache Kafka on HDInsight cluster in the same vNet as above. Use Azure Storage for the Kafka cluster.
Configure StreamSets Data Collector to connect to the HDInsight cluster

Configure the connection to the HDInsight cluster by creating symlinks to its configuration files.
- SSH into the SDC node using the SSH endpoint of your cluster:
```
 ssh sshuser@<ssh endpoint>
```
- Navigate to the StreamSets resources directory and create a directory to hold the cluster configuration symlinks:
```
 cd /var/lib/sdc-resources
 sudo mkdir hadoop-conf
 cd hadoop-conf
```
- Symlink all *.xml files from /etc/hadoop/conf and hive-site.xml from /etc/hive/conf:
```
 sudo ln -s /etc/hadoop/conf/*.xml .
 sudo ln -s /etc/hive/conf/hive-site.xml .
```
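The directory and symlink steps above can also be wrapped in one re-runnable function. This is a minimal sketch, not part of StreamSets: `link_cluster_conf` is a hypothetical helper name, its defaults mirror the paths used above, and it assumes the caller already has write permission on the target directory (for example, from a root shell), since it does not call `sudo` itself.

```shell
# Hypothetical helper (not shipped with StreamSets): links the cluster
# configuration files into the SDC resources directory.
# Arguments (all optional, defaults mirror the paths used above):
#   $1 Hadoop conf dir, $2 Hive conf dir, $3 target dir for the symlinks
link_cluster_conf() {
  hadoop_conf=${1:-/etc/hadoop/conf}
  hive_conf=${2:-/etc/hive/conf}
  target=${3:-/var/lib/sdc-resources/hadoop-conf}

  # Create the target directory if it does not exist yet.
  mkdir -p "$target" || return 1

  # Link every *.xml file from the Hadoop conf dir.
  # The -e check skips the unexpanded pattern when nothing matches.
  for f in "$hadoop_conf"/*.xml; do
    [ -e "$f" ] || continue
    # -f replaces an existing link, so the function can be re-run safely.
    ln -sfn "$f" "$target/$(basename "$f")"
  done

  # Link hive-site.xml, warning instead of creating a dangling link.
  if [ -e "$hive_conf/hive-site.xml" ]; then
    ln -sfn "$hive_conf/hive-site.xml" "$target/hive-site.xml"
  else
    echo "warning: $hive_conf/hive-site.xml not found" >&2
  fi
}
```

Calling `link_cluster_conf` with no arguments uses the default paths. Running `ls -l /var/lib/sdc-resources/hadoop-conf` afterwards should list one symlink per configuration file.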
Download and install the SQL Server JDBC Driver for StreamSets Data Collector
- Download: https://docs.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server?view=sql-server-2017
- Install: https://streamsets.com/documentation/datacollector/latest/help/index.html#datacollector/UserGuide/Configuration/ExternalLibs.html#concept_amy_pzs_gz
Note down the Kafka broker URI and ZooKeeper configuration from Ambari.
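As an alternative to reading these values from the Ambari web UI, the host lists can be pulled from the Ambari REST API. The sketch below is an assumption-laden illustration, not a step from this tutorial: `hosts_from_ambari_json` is a hypothetical helper, `CLUSTERNAME` and the `admin` login are placeholders, and the ports (9092 for Kafka brokers, 2181 for ZooKeeper) are the usual defaults and should be checked against your cluster.

```shell
# Hypothetical helper: reads an Ambari component JSON document on stdin and
# prints the hosts it finds as "host:port,host:port,...".
# $1 is the port to append (assumed: 9092 for Kafka, 2181 for ZooKeeper).
hosts_from_ambari_json() {
  port=$1
  # Pull out every "host_name": "..." pair, keep only the value,
  # de-duplicate, append the port, and join the lines with commas.
  grep -o '"host_name"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/' |
    sort -u |
    sed "s/\$/:$port/" |
    paste -sd, -
}

# Example calls (assumed endpoint layout; curl prompts for the password):
# curl -sS -u admin -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/KAFKA/components/KAFKA_BROKER" | hosts_from_ambari_json 9092
# curl -sS -u admin -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/ZOOKEEPER/components/ZOOKEEPER_SERVER" | hosts_from_ambari_json 2181
```

The helper uses only `grep`, `sed`, `sort` and `paste`, so it does not need a JSON tool on the SDC node. It is a plain text match on `host_name` fields, so compare its output with what Ambari shows before using it in a pipeline.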
