This covers the steps you need to complete before starting on any of the other StreamSets Control Hub job-related tutorials in this set.
- Python 3.4+ and pip3 installed
- StreamSets SDK for Python installed and activated (see the install sketch after this list)
- Access to StreamSets Control Hub with a user account in your organization
- At least one StreamSets Data Collector instance registered with the above StreamSets Control Hub instance
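If the StreamSets SDK for Python is not installed yet, here is a minimal sketch of the install, assuming the SDK is distributed as the streamsets package on PyPI (activation is a separate step; see the SDK documentation):
$ pip3 install streamsets==3.8.0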
Note: Make sure that the user account has the permissions needed for the tasks this blog post covers. The easiest way to check is to perform those tasks in the StreamSets Control Hub web UI first and fix any access problems before embarking on the path below.
While creating this tutorial, the following versions were used:
- Python 3.6
- StreamSets SDK for Python 3.8.0
- StreamSets Data Collector 3.17.0 (all instances)
In this preparation, two jobs are created with the following names:
- Job for Kirti-HelloWorld
- Job for Kirti-DevRawDataSource
This page details how to create them using the SDK for Python. Optionally, you can create them in the browser UI instead; just make sure the jobs match all the details given below.
In a terminal, type the following command to open a Python 3 interpreter.
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Let’s assume the StreamSets Control Hub is running at http://sch.streamsets.com. Create an object called control_hub that is connected to it.
from streamsets.sdk import ControlHub
# Replace the argument values according to your setup
control_hub = ControlHub(server_url='http://sch.streamsets.com',
                         username='user@organization1',
                         password='password')
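Before moving on, you can optionally confirm the connection and that at least one Data Collector is registered. A minimal sketch, assuming the data_collectors collection exposed by SDK 3.x:
# List the Data Collectors registered with this Control Hub;
# the prerequisites above require at least one.
for data_collector in control_hub.data_collectors:
    print(data_collector.url)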
Create a job using either the UI or the SDK for Python.
Here is a sample job created using the SDK for Python. For the purposes of this tutorial, create the job with:
- tags, e.g. tags=['kirti-job-dev-tag']
- Data Collector labels, e.g. data_collector_labels=['kirti-dev']
- time series analysis enabled
# Create a pipeline
builder = control_hub.get_pipeline_builder()
dev_data_generator = builder.add_stage('Dev Data Generator')
trash = builder.add_stage('Trash')
dev_data_generator >> trash  # connect the Dev Data Generator origin to the Trash destination.
pipeline = builder.build('Kirti-HelloWorld')
control_hub.publish_pipeline(pipeline)
# Create a job for the above
job_builder = control_hub.get_job_builder()
job = job_builder.build('Job for Kirti-HelloWorld', pipeline=pipeline, tags=['kirti-job-dev-tag'])
job.data_collector_labels = ['kirti-dev']
job.enable_time_series_analysis = True
control_hub.add_job(job)
After the above code is executed, the job can be seen in the UI as follows. Note the Data Collector label here.
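If you prefer to verify from the SDK rather than the UI, here is a minimal sketch, assuming jobs.get accepts a job_name filter as in SDK 3.x:
# Fetch the job back from Control Hub and inspect its Data Collector labels
job = control_hub.jobs.get(job_name='Job for Kirti-HelloWorld')
print(job.data_collector_labels)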
Create another job using either the UI or the SDK for Python.
Here is a sample job created using the SDK for Python. For the purposes of this tutorial, create the job with:
- tags, e.g. tags=['kirti-job-dev-RawDS-tag']
- Data Collector labels, e.g. data_collector_labels=['kirti-dev']
- time series analysis enabled
# Create second pipeline
builder = control_hub.get_pipeline_builder()
dev_raw_data_source = builder.add_stage('Dev Raw Data Source')
trash = builder.add_stage('Trash')
dev_raw_data_source >> trash # connect the Dev Raw Data Source origin to the Trash destination.
pipeline = builder.build('Kirti-DevRawDataSource')
control_hub.publish_pipeline(pipeline)
# Create a job for the above
job_builder = control_hub.get_job_builder()
job = job_builder.build('Job for Kirti-DevRawDataSource', pipeline=pipeline, tags=['kirti-job-dev-RawDS-tag'])
job.data_collector_labels = ['kirti-dev']
job.enable_time_series_analysis = True
control_hub.add_job(job)
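As a final sanity check, you can confirm that both jobs now exist in Control Hub. A minimal sketch that iterates over control_hub.jobs, assuming each job exposes a job_name attribute:
# Confirm both tutorial jobs are present
job_names = [job.job_name for job in control_hub.jobs]
assert 'Job for Kirti-HelloWorld' in job_names
assert 'Job for Kirti-DevRawDataSource' in job_names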
With this preparation complete, you are ready to start on the other tutorials in this set. For more details about the SDK for Python, check the SDK documentation.
If you encounter any problems with this tutorial, please file an issue in the tutorials project.