How to connect to IAE via HDFS Toolkit
This document describes, step by step, how to connect to the Hadoop file system on IBM Cloud with the streamsx.hdfs toolkit.
Create an IBM Analytics Engine service on IBM cloud.
Select AE1.0 Spark and Hadoop for HDP2.x, or AE1.2 for HDP3.1, as the software package:
https://console.bluemix.net/catalog/?search=Analytics%20Engine
For more information about IBM Analytics Engine, check this link:
https://console.bluemix.net/docs/services/AnalyticsEngine/index.html#introduction
Create a service credential for Analytics Engine service on IBM cloud.
The following sample shows the service credentials of an IBM Analytics Engine instance:
{
  "apikey": "xyxyxyxyxyxyxyxyxyxyxyxyxyx",
  "cluster": {
    "cluster_id": "20180404-125209-123-VNhbnQRZ",
    "service_endpoints": {
      "ambari_console": "https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:9443",
      "hive_hdfs": "hdfs:hive2://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive",
      "livy": "https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/livy/v1/batches",
      "notebook_gateway": "https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/jkg/",
      "notebook_gateway_websocket": "wss://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/jkgws/",
      "oozie_rest": "https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/oozie",
      "phoenix_hdfs": "hdfs:phoenix:thin:url=https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/avatica;authentication=BASIC;serialization=PROTOBUF",
      "spark_history_server": "https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/sparkhistory",
      "spark_sql": "hdfs:hive2://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/;ssl=true;transportMode=http;httpPath=gateway/default/spark",
      "ssh": "ssh [email protected]",
      "webhdfs": "https://chs-nxr-123-mn001.bi.services.us-south.bluemix.net:8443/gateway/default/webhdfs/v1/"
    },
    "service_endpoints_ip": {
      "ssh": "ssh [email protected]",
      "webhdfs": ""
    }
  }
}
The user name and the password are no longer part of the credentials.
The user is clsadmin, and you need to reset the password manually.
For a webhdfs connection to the Hadoop file system, three parameters are needed, all derived from the Analytics Engine service credentials: hdfsUri, hdfsUser and hdfsPassword.
The hdfsUri has to be set to "webhdfs://<IAE-server>:<port>".
You can find the address and the port of the HDFS server in the following line of the service credentials:
"webhdfs": ""
For example, in this case you have to set the value of hdfsUri to:
hdfsUri : "webhdfs://";
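As a rough illustration of the step above, the host and port can be cut out of the webhdfs endpoint with plain shell parameter expansion. The endpoint value used here is a made-up placeholder, not a real cluster address, and the variable names are only for this sketch.

```shell
# Hypothetical endpoint; substitute the "webhdfs" entry from your own service credentials.
webhdfs_endpoint="https://iae-host.example.com:8443/gateway/default/webhdfs/v1/"

# Drop the scheme, then drop everything from the first "/" onward, leaving host:port.
host_port="${webhdfs_endpoint#https://}"
host_port="${host_port%%/*}"

# The toolkit expects the webhdfs:// scheme in front of host:port.
hdfs_uri="webhdfs://${host_port}"
echo "${hdfs_uri}"
```

With the placeholder endpoint this prints webhdfs://iae-host.example.com:8443, which is the form expected by the hdfsUri parameter.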
The value of hdfsUser has to be set to the IAE user name.
You can find the user name in the following line of the service credentials:
"ssh": "ssh [email protected]"
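The value of hdfsPassword has to be set to the IAE password.
Reset the password as described in:
https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-retrieve-cluster-credentials
Then copy the newly created password.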
For example:
$hdfsUri = "webhdfs://";
$hdfsUser = "clsadmin";
$hdfsPassword = "IAEPASSWORD";
Replace the values of the $hdfsUri and $hdfsPassword parameters in the following SPL file with the values from your IBM Analytics Engine service credentials.
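Before putting the values into the SPL file, it can help to confirm that the user and password are accepted by the cluster. The sketch below builds a WebHDFS LISTSTATUS request against the gateway path shown in the sample credentials. The host name is a placeholder, and list_home_directory is a helper name invented for this example.

```shell
# Hypothetical values; substitute the host, port and user of your own cluster.
iae_host_port="iae-host.example.com:8443"
hdfs_user="clsadmin"

# WebHDFS REST call that lists the user's home directory through the gateway.
list_url="https://${iae_host_port}/gateway/default/webhdfs/v1/user/${hdfs_user}?op=LISTSTATUS"

# Wrapped in a function so that nothing is sent until you call it.
# curl prompts for the password because -u is given only the user name.
list_home_directory() {
    curl -sS -u "${hdfs_user}" "${list_url}"
}

echo "${list_url}"
```

Calling list_home_directory from a machine that can reach the cluster should return a JSON directory listing if the credentials are valid, and an authentication error otherwise.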
If you get the following SSL connection error:
javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
In that case, you have to set the TLS version on your Java client.
A Java application can use two system properties to specify the TLS version of the SSL handshake: jdk.tls.client.protocols and https.protocols.
The Streams Java operators accept the vmArg parameter, which passes additional arguments to the JVM.
Add the following parameter to every HDFS operator:
vmArg :"-Dhttps.protocols=TLSv1.2";
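To see whether the gateway accepts TLS 1.2 at all, independent of the Java client, a handshake can be attempted with the openssl command line tool. This is only a sketch: the host name is a placeholder, check_tls12 is a helper name made up for this example, and it assumes that openssl is installed on the machine you test from.

```shell
# Hypothetical host:port; substitute the values of your own cluster.
iae_host_port="iae-host.example.com:8443"

# Attempt a handshake restricted to TLS 1.2 and discard the session output.
# The exit status is expected to be 0 when the handshake completes.
check_tls12() {
    openssl s_client -connect "$1" -tls1_2 </dev/null >/dev/null 2>&1
}

# Usage (needs network access to the cluster):
#   check_tls12 "${iae_host_port}" && echo "TLS 1.2 handshake succeeded"
```

If this handshake succeeds while the Streams application still fails, the JVM of the operators is the likely place to look, which is what the vmArg setting addresses.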
Create a new project named CloudSample in your workspace.
You can find the SPL file and the Makefile in:
The CloudSample runs with HDFS Toolkit version 4.0.0 or higher.
/**
 * This SPL application shows how to connect to a Hadoop instance running on IBM Cloud via webhdfs.
 * Specify the name of a directory to read HDFS files as a submission time parameter.
 * Additional required parameters are the hdfsUri of the HDFS server, and the user name and password for authentication.
 * To get these credentials:
 *   Create an Analytics Engine service on IBM Cloud.
 *   https://console.bluemix.net/catalog/?search=Analytics%20Engine
 *   IBM Analytics Engine documentation:
 *   https://console.bluemix.net/docs/services/AnalyticsEngine/index.html#introduction
 *   Create a service credential for the Analytics Engine service on IBM Cloud.
 *   Replace the values of hdfsUser and hdfsPassword in this SPL file with the user and password of your IAE cluster.
 *   The value of $hdfsUri is webhdfs://<host>:<port>
 *   Reset the password as described in:
 *   https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-retrieve-cluster-credentials
 *   and copy the created password.
 * It is also possible to set the values of these parameters at submission time.
 * If you want to run this sample on an on-premises Streams server, you have to unset
 * the HADOOP_HOME environment variable.
 * The CloudSample composite first creates some files in the directory via HDFS2FileSink.
 * Then HDFS2DirectoryScan reads the names of the files located in the test directory.
 * Finally, HDFS2FileSource reads the lines from the files located in the test directory.
 * See the toolkit's documentation for compile and run instructions.
 * @param hdfsUri HDFS URI to connect to, of the form webhdfs://<host>:<port>
 * @param hdfsUser User to connect to HDFS.
 * @param hdfsPassword Password to connect to HDFS.
 * @param directory Directory to read and write files.
 */
use com.ibm.streamsx.hdfs::* ;

composite CloudSample
{
    param
        expression<rstring> $hdfsUri : getSubmissionTimeValue("hdfsUri", "webhdfs://") ;
        expression<rstring> $hdfsUser : getSubmissionTimeValue("hdfsUser", "clsadmin") ;
        expression<rstring> $hdfsPassword : getSubmissionTimeValue("hdfsPassword", "IAE-Password") ;
        expression<rstring> $directory : getSubmissionTimeValue("directory", "testDirectory") ;

    graph
        // The pulse is a Beacon operator that generates a counter.
        stream<int32 counter> pulse = Beacon()
        {
            logic
                state : mutable int32 i = 0 ;
            param
                initDelay : 1.0 ;
                iterations : 25u ;
            output
                pulse : counter = i ++ ;
        }

        // Creates lines and file names for HDFS2FileSink.
        stream<rstring line, rstring filename> CreateLinesFileNames = Custom(pulse)
        {
            logic
                state :
                {
                    mutable int32 count = 0 ;
                    mutable timestamp ts = getTimestamp() ;
                    mutable rstring strTimestamp = "" ;
                }

                onTuple pulse :
                {
                    // Start a new file every 5 lines.
                    if((counter % 5) == 0)
                    {
                        ts = getTimestamp() ;
                    }

                    // Create the date and time in yyyymmdd-hhMMss format.
                    // month() is zero-based, hence the "< 9u" test and the "+ 1u".
                    strTimestamp = (rstring) year(ts) +((month(ts) < 9u) ? "0" : "")
                        + (rstring)(month(ts) + 1u) +((day(ts) < 10u) ? "0" : "")
                        + (rstring) day(ts) + "-" +((hour(ts) < 10u) ? "0" : "")
                        + (rstring) hour(ts) +((minute(ts) < 10u) ? "0" : "")
                        + (rstring) minute(ts) +((second(ts) < 10u) ? "0" : "")
                        + (rstring) second(ts) ;
                    submit({ line = "HDFS 4.0 and Streams test with webhdfs " + strTimestamp + " " +(rstring) counter,
                        filename = "/user/" + $hdfsUser + "/" + $directory + "/" + strTimestamp + "-hdfs.out" }, CreateLinesFileNames) ;
                }
        }

        // Writes the tuples that arrive on the input port from CreateLinesFileNames to the output file.
        // The file names are also supplied by CreateLinesFileNames on the input port.
        stream<rstring out_file_name, uint64 size> HdfsFileSink = HDFS2FileSink(CreateLinesFileNames)
        {
            logic
                onTuple CreateLinesFileNames :
                {
                    printStringLn("HDFS2FileSink message : " + line) ;
                }
            param
                hdfsUri : $hdfsUri ;
                hdfsUser : $hdfsUser ;
                hdfsPassword : $hdfsPassword ;
                fileAttributeName : "filename" ;
                vmArg : "-Dhttps.protocols=TLSv1.2" ;
        }

        // Prints the file name and the size of each file written.
        () as PrintHdfsSink = Custom(HdfsFileSink)
        {
            logic
                onTuple HdfsFileSink :
                {
                    printStringLn("HDFS2FileSink Wrote " +(rstring) size + " bytes to file " + out_file_name) ;
                }
        }

        // Scans the given HDFS directory, which is resolved relative to the user's home directory.
        stream<rstring fileNames> HdfsDirectoryScan = HDFS2DirectoryScan()
        {
            param
                initDelay : 10.0 ;
                directory : $directory ;
                hdfsUri : $hdfsUri ;
                hdfsUser : $hdfsUser ;
                hdfsPassword : $hdfsPassword ;
                vmArg : "-Dhttps.protocols=TLSv1.2" ;
        }

        // Prints the name of each file found in the directory.
        () as PrintDirectoryScan = Custom(HdfsDirectoryScan)
        {
            logic
                onTuple HdfsDirectoryScan :
                {
                    printStringLn("HDFS2DirectoryScan found file in directory: " + fileNames) ;
                }
        }

        // Uses the file names from the directory scan to read the files.
        stream<rstring lines> HdfsFileSource = HDFS2FileSource(HdfsDirectoryScan)
        {
            param
                hdfsUri : $hdfsUri ;
                hdfsUser : $hdfsUser ;
                hdfsPassword : $hdfsPassword ;
                vmArg : "-Dhttps.protocols=TLSv1.2" ;
        }

        // Prints the lines read from the files found in the directory.
        () as PrintHdfsFileSource = Custom(HdfsFileSource)
        {
            logic
                onTuple HdfsFileSource :
                {
                    printStringLn("HdfsFileSource line : " + lines) ;
                }
        }
}
Building this SPL application requires version 4.0.0 or higher of the HDFS toolkit.
Download the latest version of streamsx.hdfs and copy it into your workspace.
The Makefile also builds the toolkit.
# Copyright (C)2014, 2018 International Business Machines Corporation and
# others. All Rights Reserved.
.PHONY: all distributed clean

#HDFS_TOOLKIT_INSTALL = $(STREAMS_INSTALL)/toolkits/com.ibm.streamsx.hdfs
HDFS_TOOLKIT_INSTALL = ../../com.ibm.streamsx.hdfs

all: distributed

distributed:
	cd ../../com.ibm.streamsx.hdfs; ant;

clean:
	rm -rf output
Be aware that the command lines in a Makefile must be indented with tabs, not spaces.
Download the HDFS Toolkit version 4.0.0 from:
Extract it into your workspace directory.
Replace the HDFS credentials in the samples/CloudSample/CloudSample.spl file with your IBM Analytics Engine credentials and run:
$> cd samples/CloudSample
$> make
If you want to run this sample on an on-premises Streams server, you have to unset the HADOOP_HOME environment variable.
Start the application with:
$> output/bin/standalone
Or submit the job to your local Streams server with:
$> streamtool submitjob output/CloudSample.sab
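Since the sample reads its settings with getSubmissionTimeValue, the values can also be supplied at submission instead of being edited into the SPL file. The sketch below only assembles and prints such a command using the -P option of streamtool submitjob. The host name is a placeholder and the shell variable names are chosen for this example.

```shell
# Hypothetical values; substitute the ones of your own cluster.
hdfs_uri="webhdfs://iae-host.example.com:8443"
hdfs_user="clsadmin"
directory="testDirectory"

# Each -P name=value pair overrides the default given in getSubmissionTimeValue.
submit_cmd="streamtool submitjob -P hdfsUri=${hdfs_uri} -P hdfsUser=${hdfs_user} -P directory=${directory} output/CloudSample.sab"

# Printed rather than executed, so the values can be reviewed first.
echo "${submit_cmd}"
```

The password can be passed the same way with an additional -P hdfsPassword=... pair. It is left out of the printed command here so that it does not end up in the terminal history.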
Create a Streaming Analytics service on IBM Cloud.
Start the service.
Launch the application.
It starts the IBM Streams console.
Now it is possible to submit a SAB file as a Streams job with the IBM Streams console.
The SAB file is located in your project output directory:
The SAB file includes all Hadoop client JAR libraries.
To check the result, log in to the IAE server and check the contents of the Hadoop file system.
$> ssh clsadmin@<your-IAE-ip-address>
$> hadoop fs -ls /user/clsadmin/testDirectory
[clsadmin@chs-nxr-593-mn003 ~]$ hadoop fs -ls /user/clsadmin/testDirectory
Found 4 items
-rw-r--r-- 3 clsadmin biusers 285 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142316-hdfs.out
-rw-r--r-- 3 clsadmin biusers 285 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142317-hdfs.out
-rw-r--r-- 3 clsadmin biusers 290 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142319-hdfs.out
-rw-r--r-- 3 clsadmin biusers 290 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142321-hdfs.out
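For a quick sanity check, the listing can be summarized with grep and awk. The sketch below works on a copy of the sample output shown above, so it runs without a cluster. On the IAE server you would pipe the output of the hadoop fs -ls command into the same filters instead.

```shell
# Sample listing copied from the output above.
listing='Found 4 items
-rw-r--r-- 3 clsadmin biusers 285 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142316-hdfs.out
-rw-r--r-- 3 clsadmin biusers 285 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142317-hdfs.out
-rw-r--r-- 3 clsadmin biusers 290 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142319-hdfs.out
-rw-r--r-- 3 clsadmin biusers 290 2018-04-19 12:23 /user/clsadmin/testDirectory/20180419-142321-hdfs.out'

# Count the files written by the sample (paths ending in -hdfs.out).
file_count=$(printf '%s\n' "${listing}" | grep -c -- '-hdfs\.out$')

# Sum the size column (5th field) of those files.
total_bytes=$(printf '%s\n' "${listing}" | awk '$NF ~ /-hdfs\.out$/ { sum += $5 } END { print sum }')

echo "${file_count} files, ${total_bytes} bytes"
```

For the sample listing this prints "4 files, 1150 bytes".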
For more details about the hadoop commands check this link: