5. Apache Hadoop

Back

Note: These commands run in /home/{YOURUSERNAME} (the current working directory; to check it, run the pwd command in the terminal) by default. If you want to change the Hadoop installation path, move to the directory you want with the cd command and use that path in place of /home/{YOURUSERNAME} below.

  1. Required additional steps before the Hadoop installation:

    • Install Java (minimum version 8), OpenJDK or Oracle JDK
    • Set the JAVA_HOME environment variable
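
    For example, on Debian/Ubuntu with OpenJDK 8 (the package name and JAVA_HOME path below are typical for that distribution and are assumptions; adjust them for your system):

    $ sudo apt install openjdk-8-jdk
    $ echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
    $ source ~/.bashrc
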
  2. Download Hadoop

    $ wget https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
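
    Note: if this mirror no longer hosts 3.2.1, older releases are available from the Apache archive at https://archive.apache.org/dist/hadoop/common/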
    
  3. Extract the Hadoop archive

    $ tar -xzf hadoop-3.2.1.tar.gz
    
  4. Rename the Hadoop directory name

    $ mv hadoop-3.2.1 hadoop
    
  5. (optional) Remove the downloaded archive to save space

    $ rm -f hadoop-3.2.1.tar.gz
    
  6. Add Hadoop to the user PATH. Edit /home/{YOURUSERNAME}/.bashrc with any text editor (e.g. nano .bashrc) and add this line at the bottom:

    export PATH=$PATH:/home/{YOURUSERNAME}/hadoop/bin:/home/{YOURUSERNAME}/hadoop/sbin
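
    Here bin holds the client commands (hadoop, hdfs) and sbin holds the daemon control scripts (start-dfs.sh, start-yarn.sh, etc.) used later in this guide.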
    
  7. Load the new environment:

    $ source ~/.bashrc
    
  8. Check the Hadoop version using this command:

    $ hadoop version
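
    If the PATH is set correctly, the first line of the output should be:

    Hadoop 3.2.1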
    

Single Node Cluster Installation

Note: Follow this installation if you only want to install Hadoop on a single machine.

  1. Make sure you can connect to localhost using ssh without a password. If ssh still asks for a password, run these commands and then check again:

    $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 0600 ~/.ssh/authorized_keys
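
    Afterwards, ssh localhost should log you in without asking for a password (type exit to close the test session):

    $ ssh localhost
    $ exit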
    
  2. Configure Hadoop. Edit the /home/{YOURUSERNAME}/hadoop/etc/hadoop/core-site.xml file and replace its content with this:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
    

    Note: if you have Portainer installed on port 9000, change the Hadoop port to something other than 9000 (e.g. hdfs://localhost:9001, which must then also be used in the hdfs:// paths below) or change the Portainer port. Don't forget to update the KaspaCoreSystem application.conf too.

  3. Edit the /home/{YOURUSERNAME}/hadoop/etc/hadoop/hdfs-site.xml file and replace its content with this:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/{YOURUSERNAME}/hadoop/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/{YOURUSERNAME}/hadoop/dfs/data</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-bind-host</name>
            <value>0.0.0.0</value>
        </property>
    </configuration>
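
    Note: dfs.replication is set to 1 because a single-node cluster has only one DataNode to store each block. The dfs/name and dfs/data directories are created automatically when HDFS is formatted and started.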
    
  4. Edit the /home/{YOURUSERNAME}/hadoop/etc/hadoop/mapred-site.xml file and replace its content with this:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>localhost:8032</value>
        </property>
        <property>
            <name>mapreduce.application.classpath</name>
            <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
        </property>
    </configuration>
    
  5. Edit the /home/{YOURUSERNAME}/hadoop/etc/hadoop/yarn-site.xml file and replace its content with this:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
    </configuration>
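
    The yarn.nodemanager.env-whitelist entry lets the NodeManager pass these environment variables (including the HADOOP_MAPRED_HOME referenced in mapred-site.xml above) through to launched containers.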
    
  6. Format the Hadoop filesystem (HDFS)

    $ hdfs namenode -format
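
    If formatting succeeds, the output should contain a line similar to "Storage directory /home/{YOURUSERNAME}/hadoop/dfs/name has been successfully formatted."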
    
  7. Start HDFS and YARN

    # To start:
    $ start-dfs.sh
    $ start-yarn.sh
    
    # To stop them, run:
    $ stop-dfs.sh
    $ stop-yarn.sh
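
    To verify that everything is running after the start scripts, the jps command (bundled with the JDK) should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

    $ jps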
    
  8. Create the required directories

    $ hdfs dfs -mkdir -p hdfs://localhost:9000/user/{YOURUSERNAME}/job
    $ hdfs dfs -mkdir -p hdfs://localhost:9000/user/{YOURUSERNAME}/kaspa
    $ hdfs dfs -mkdir -p hdfs://localhost:9000/user/{YOURUSERNAME}/kafka-checkpoint
    $ hdfs dfs -mkdir -p hdfs://localhost:9000/user/{YOURUSERNAME}/kaspa-checkpoint
    $ hdfs dfs -mkdir -p hdfs://localhost:9000/user/{YOURUSERNAME}/schema/raw_kaspa
    $ hdfs dfs -mkdir -p hdfs://localhost:9000/user/{YOURUSERNAME}/file/maxmind
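
    To verify, list the created directories:

    $ hdfs dfs -ls -R /user/{YOURUSERNAME}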
    
  9. Download the MaxMind database (GeoLite2-City) from here: https://www.maxmind.com/en/accounts/current/geoip/downloads
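
    Note: MaxMind requires a (free) account to download the GeoLite2 databases.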

  10. Put the GeoLite2-City.mmdb file into HDFS

    $ hdfs dfs -put /path/to/GeoLite2-City.mmdb /user/{YOURUSERNAME}/file/maxmind/
    

Back