# Working with StreamSets Data Collector and Microsoft Azure

These tutorials explain how to use StreamSets Data Collector to integrate with Azure Blob Storage, Apache Kafka on HDInsight, Azure SQL Data Warehouse, and Apache Hive backed by Azure Blob Storage.

## Prerequisites

To work through these tutorials, make sure the following setup is already in place; if not, follow the steps below:

  1. Log in to the Azure Portal

  2. Create a Virtual Network (vNet): this vNet lets the clusters created in the next steps communicate privately with each other.

  3. Data storage: Create a Blob Container

  4. Install StreamSets Data Collector for HDInsight. See the StreamSets documentation for more details on the installation process.

  5. Create an Apache Kafka on HDInsight cluster in the same vNet as above. Use Azure Storage for the Kafka cluster.
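     Once the cluster is up, it is worth confirming that the SDC node can actually reach a broker over the vNet before building pipelines. A minimal sketch, assuming Bash on the SDC node; the hostname below is a placeholder for one of your Kafka worker nodes, and 9092 is Kafka's default plaintext listener port:

     ```shell
     # Placeholder broker host -- substitute a Kafka worker node from your cluster.
     BROKER_HOST=${BROKER_HOST:-wn0-kafka}
     BROKER_PORT=${BROKER_PORT:-9092}   # Kafka's default plaintext listener

     # Bash's /dev/tcp pseudo-device opens a TCP connection without extra tools:
     if timeout 5 bash -c "exec 3<>/dev/tcp/$BROKER_HOST/$BROKER_PORT" 2>/dev/null; then
       echo "reachable: $BROKER_HOST:$BROKER_PORT"
     else
       echo "unreachable: $BROKER_HOST:$BROKER_PORT"
     fi
     ```

     If this reports unreachable, check that the SDC node and the Kafka cluster really share the same vNet and that the network security group rules allow the traffic.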

  6. Configure StreamSets Data Collector to connect to the HDInsight cluster

    Configure the connection by creating symlinks to the cluster's configuration files.

    • ssh into the SDC node using the SSH Endpoint of your cluster:

       ssh sshuser@<ssh endpoint>
      
    • Navigate to the StreamSets resources directory and create a directory to hold the cluster configuration symlinks:

       cd /var/lib/sdc-resources
       sudo mkdir hadoop-conf
       cd hadoop-conf
      
    • Symlink all *.xml files from /etc/hadoop/conf and hive-site.xml from /etc/hive/conf:

       sudo ln -s /etc/hadoop/conf/*.xml .
       sudo ln -s /etc/hive/conf/hive-site.xml .
      

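    A dangling symlink in this directory can surface later as a confusing stage startup failure in SDC, so it is worth verifying that every link resolves. The layout from step 6 can also be rehearsed in a throwaway directory first; all paths in this sketch are stand-ins for `/etc/hadoop/conf`, `/etc/hive/conf`, and `/var/lib/sdc-resources`:

    ```shell
    # Rehearse the hadoop-conf symlink layout in a throwaway directory.
    tmp=$(mktemp -d)
    mkdir -p "$tmp/etc/hadoop/conf" "$tmp/etc/hive/conf" "$tmp/sdc-resources/hadoop-conf"
    touch "$tmp/etc/hadoop/conf/core-site.xml" "$tmp/etc/hadoop/conf/hdfs-site.xml"
    touch "$tmp/etc/hive/conf/hive-site.xml"

    cd "$tmp/sdc-resources/hadoop-conf"
    ln -s "$tmp"/etc/hadoop/conf/*.xml .
    ln -s "$tmp/etc/hive/conf/hive-site.xml" .

    # Verify every link resolves -- a dangling symlink here would only be
    # noticed later, when SDC tries to read the cluster configuration:
    for f in *.xml; do
      [ -e "$f" ] || echo "dangling symlink: $f"
    done
    ```

    Run the same `for` loop in the real `/var/lib/sdc-resources/hadoop-conf` after creating the symlinks to confirm nothing is broken.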

  7. Download and install the Microsoft SQL Server JDBC driver for StreamSets Data Collector

  8. Note the Kafka broker URIs and the ZooKeeper configuration from Ambari.

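    The broker list can also be pulled from Ambari's REST API instead of the UI. A sketch under stated assumptions: the cluster name and Ambari admin credentials below are placeholders, and 9092 is assumed as Kafka's default plaintext port; the `/api/v1/clusters/<name>/services/KAFKA/components/KAFKA_BROKER` endpoint is Ambari's standard component listing.

    ```shell
    # Turn Ambari's KAFKA_BROKER component JSON into a comma-separated
    # broker URI list (appending Kafka's default plaintext port, 9092).
    broker_uris() {
      grep -o '"host_name" *: *"[^"]*"' \
        | sed -e 's/.*: *"//' -e 's/"$//' -e 's/$/:9092/' \
        | paste -sd, -
    }

    # Hypothetical cluster name -- substitute your own, and enter the Ambari
    # admin password when prompted:
    # CLUSTER=mykafkacluster
    # curl -sS -u admin \
    #   "https://$CLUSTER.azurehdinsight.net/api/v1/clusters/$CLUSTER/services/KAFKA/components/KAFKA_BROKER" \
    #   | broker_uris
    ```

    The resulting comma-separated list (e.g. `wn0-kafka...:9092,wn1-kafka...:9092`) is the form the Kafka stages in Data Collector expect for the broker URI setting.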