These tutorials explain how to use StreamSets Data Collector to integrate Azure Blob Storage, Apache Kafka on HDInsight, Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage:
- Ingesting Data from Blob Storage into Apache Kafka on HDInsight
- Ingesting Data from Apache Kafka on HDInsight into Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage
To work through these tutorials, make sure the following prerequisites are in place. If any are missing, follow the instructions below:
- Log in to the Azure Portal
- Create a Virtual Network (vNet): This vNet lets the clusters created in the next steps communicate privately with each other.
- Data storage: Create a Blob Container
- Install StreamSets Data Collector for HDInsight. Also check the StreamSets documentation for more details on the installation process.
- Create an Apache Kafka on HDInsight cluster in the same vNet as above. Use Azure Storage for the Kafka cluster.
- Configure StreamSets Data Collector to connect to the HDInsight cluster
  Configure the connection to the HDInsight cluster by creating symlinks to its configuration files.
  - SSH into the SDC node using the SSH Endpoint of your cluster:
    ssh sshuser@<ssh endpoint>
  - Navigate to the StreamSets Resources Directory and create a directory to hold cluster configuration symlinks:
    cd /var/lib/sdc-resources
    sudo mkdir hadoop-conf
    cd hadoop-conf
  - Symlink all *.xml files from /etc/hadoop/conf and hive-site.xml from /etc/hive/conf:
    sudo ln -s /etc/hadoop/conf/*.xml .
    sudo ln -s /etc/hive/conf/hive-site.xml .
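The symlink steps above could also be wrapped in a small script that is safe to re-run. The sketch below is only an illustration: `link_cluster_conf` is a hypothetical helper name, not part of StreamSets or HDInsight, and it assumes absolute paths are passed and that a dangling link should count as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of StreamSets or HDInsight).
# Usage: link_cluster_conf HADOOP_CONF_DIR HIVE_CONF_DIR TARGET_DIR
# Pass absolute paths: ln -s stores the source path exactly as given.
link_cluster_conf() {
  local hadoop_conf="$1" hive_conf="$2" target="$3"
  local f

  mkdir -p "$target" || return 1

  # Link each Hadoop XML file; skip the unexpanded pattern if nothing matches.
  for f in "$hadoop_conf"/*.xml; do
    [ -e "$f" ] || continue
    ln -sfn "$f" "$target/$(basename "$f")" || return 1
  done

  # Link hive-site.xml when it exists; otherwise warn and carry on.
  if [ -e "$hive_conf/hive-site.xml" ]; then
    ln -sfn "$hive_conf/hive-site.xml" "$target/hive-site.xml" || return 1
  else
    echo "warning: $hive_conf/hive-site.xml not found" >&2
  fi

  # Fail if the target holds no XML links or if any link is dangling.
  for f in "$target"/*.xml; do
    [ -e "$f" ] || { echo "broken or missing link: $f" >&2; return 1; }
  done
}
```

On the SDC node, assuming a root shell, it might be called as `link_cluster_conf /etc/hadoop/conf /etc/hive/conf /var/lib/sdc-resources/hadoop-conf`. Using `ln -sfn` instead of plain `ln -s` means a second run replaces the existing links rather than stopping with a "File exists" error.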
- Download and install the SQL Server JDBC Driver for StreamSets Data Collector
- Note down the Kafka broker URI and ZooKeeper configuration from Ambari: