Hadoop 3.2.1 on Ubuntu 18.04 (Fully-Distributed)

R1CH4RD5
May 26, 2021

In this small article we will configure a fully-distributed Hadoop 3.2.1 cluster on three Ubuntu 18.04 machines using Oracle VirtualBox.

[ Getting Started ]

Keep a close eye on the pictures in this article, as some steps appear only in them (e.g. the questions asked when a specific command is executed, etc.).

[ Network ]

VirtualBox Network Adapter 1

VirtualBox Network Adapter 2
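The adapter settings themselves are in the pictures above. As a rough command-line sketch, assuming Adapter 1 is NAT (for internet access), Adapter 2 is a Host-Only adapter on vboxnet0 (which matches the 192.168.56.x addresses used later), and the VM is named hadoop-primary in VirtualBox:

VBoxManage modifyvm "hadoop-primary" --nic1 nat
VBoxManage modifyvm "hadoop-primary" --nic2 hostonly --hostonlyadapter2 vboxnet0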

[ Installing SSH ]

Type Yes, and hit ENTER when asked.
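The install command itself is shown in the picture; on Ubuntu 18.04 it is typically:

sudo apt install openssh-server openssh-client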

[ Installing PDSH ]

Type Yes, and hit ENTER when asked.
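As before, the install command is in the picture; on Ubuntu 18.04 it is typically:

sudo apt install pdsh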

[ BASHRC File ]

At the end of the ~/.bashrc file, add the following line and save it:

export PDSH_RCMD_TYPE=ssh
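To apply the change to the current shell, reload the file:

source ~/.bashrc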

[ New Key ]

Execute the following command to create a new key:

ssh-keygen -t rsa -P ""

Copy the public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Check the SSH settings by connecting to localhost. If asked whether to continue connecting, type yes.
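The connection test is simply:

ssh localhost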

[ Install JAVA 8 ]

It needs to be Java 8, as Hadoop 3.2.x only supports that version.

Check the Java installation.
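A sketch of the typical commands, assuming the OpenJDK 8 package from the Ubuntu repositories (the exact commands are shown in the pictures):

sudo apt install openjdk-8-jdk
java -version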

[ Download Hadoop 3.2.1 ]

sudo wget -P ~ https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

Wait until the download finishes.

Check that the downloaded file exists, extract it, and rename the new directory (hadoop-3.2.1) to hadoop. Keep checking as you go through these steps to confirm each one completed correctly.
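A sketch of the typical commands, assuming everything stays in the home directory for now (the exact commands are shown in the pictures):

ls ~
tar -xzf ~/hadoop-3.2.1.tar.gz -C ~
mv ~/hadoop-3.2.1 ~/hadoop
ls ~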

[ Check Java's Path ]

ls /usr/lib/jvm/java-8-openjdk-amd64/

[ Editing hadoop-env.sh ]

Path:

nano ~/hadoop/etc/hadoop/hadoop-env.sh

Add the following line in the Java implementation section:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

Then move the hadoop directory to /usr/local/hadoop.
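Assuming the directory was extracted and renamed in the home directory as above, a minimal sketch of the move:

sudo mv ~/hadoop /usr/local/hadoop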

[ Editing Environment ]

Path:

sudo nano /etc/environment

In the PATH variable, add:

:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

Below the PATH variable, add a new variable (JAVA_HOME):

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
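For reference, a sketch of how these two lines end up looking in /etc/environment (keep whatever paths are already inside the quotes; <existing paths> below is just a placeholder):

PATH="<existing paths>:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"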

[ Create a New User ]
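The user creation itself is shown in the pictures; a sketch of the usual command, assuming the username hadoopuser used throughout this article:

sudo adduser hadoopuser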

To manage the new user's group membership we'll use the usermod command. To add hadoopuser to the hadoopuser group, type:

sudo usermod -aG hadoopuser hadoopuser

Change the ownership of the hadoop directory:

sudo chown hadoopuser:root -R /usr/local/hadoop/

Change the directory permissions:

sudo chmod g+rwx -R /usr/local/hadoop/

Add the new user hadoopuser to the Sudoers:

sudo adduser hadoopuser sudo

[ Checking the Machine IP ]

ip addr

As shown in the picture below, my machine's IP is 192.168.56.101. Keep in mind that yours could be different.

[ Configure Hostnames for the Machines ]

sudo nano /etc/hosts

Map the IP shown earlier to the hostname hadoop-primary. As we create the new machines in the next steps, their IPs will be assigned as they join the network, incrementing from hadoop-primary's address, i.e. 192.168.56.102 and 192.168.56.103.
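A sketch of the resulting /etc/hosts entries, assuming the addresses increment as described:

192.168.56.101 hadoop-primary
192.168.56.102 hadoop-secondary-1
192.168.56.103 hadoop-secondary-2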

If you have followed the steps up to here, log out of the localhost SSH session before finishing the configuration on this machine (hadoop-primary).

[ Cloning ]

[ Secondary 1 Cloning ]

In Expert Mode:

[ Secondary 2 Cloning ]

In Expert Mode:
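For reference, roughly equivalent clones can be created from the command line with VBoxManage (a sketch only; the Expert Mode dialogs in the pictures achieve the same result, and the VM name hadoop-primary is an assumption that should match your VirtualBox machine name):

VBoxManage clonevm "hadoop-primary" --name "hadoop-secondary-1" --register
VBoxManage clonevm "hadoop-primary" --name "hadoop-secondary-2" --register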

All three machines: Primary, Secondary 1 and Secondary 2.

Before proceeding to the next stage, let’s check the rcmd on each machine:

On a terminal:

pdsh -q -w localhost

Check the value attributed to rcmd. Each machine needs to have ssh as the rcmd value; if any of them has rsh instead, change it with the following command:

echo "ssh" | sudo tee /etc/pdsh/rcmd_default

[ Setting the Hostnames ]

[ Primary ]

Log in to the Primary machine as usual and set the hostname to hadoop-primary:

sudo nano /etc/hostname

Then restart the machine for the change to take effect.

[ Secondary 1 ]

Log in to the Secondary 1 machine as usual and set the hostname to hadoop-secondary-1:

sudo nano /etc/hostname

Then restart the machine for the change to take effect.

[ Secondary 2 ]

Log in to the Secondary 2 machine as usual and set the hostname to hadoop-secondary-2:

sudo nano /etc/hostname

Then restart the machine for the change to take effect.

[ Copying the SSH Keys ]

On the Primary machine, switch to the hadoopuser account.
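For example, assuming you are currently logged in as another user:

su - hadoopuser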

Create a new key.

ssh-keygen -t rsa

Copy the key to all the machines:

ssh-copy-id hadoopuser@hadoop-primary
ssh-copy-id hadoopuser@hadoop-secondary-1
ssh-copy-id hadoopuser@hadoop-secondary-2
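You can verify that passwordless login now works, for example:

ssh hadoop-secondary-1
exit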

[ Editing core-site.xml ]

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following configuration:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-primary:9000</value>
</property>

[ Editing hdfs-site.xml ]

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following configurations:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>

[ Adding the Workers ]

sudo nano /usr/local/hadoop/etc/hadoop/workers

Add the workers' hostnames:

hadoop-secondary-1
hadoop-secondary-2

Now, copy the files configured earlier from the Primary machine to the Secondary ones.

[ Secondary 1 ]

scp /usr/local/hadoop/etc/hadoop/* hadoop-secondary-1:/usr/local/hadoop/etc/hadoop/

[ Secondary 2 ]

scp /usr/local/hadoop/etc/hadoop/* hadoop-secondary-2:/usr/local/hadoop/etc/hadoop/

To load the environment variables into the current shell, execute:

source /etc/environment

Then format the Hadoop Distributed File System (HDFS):

hdfs namenode -format

Start the DFS service:

start-dfs.sh

Note: If you get a permission denied error, check the rcmd value on each machine again, as done earlier in this article, and change it from rsh to ssh if needed.

Check the running JVMs with jps.

If everything was done correctly, you can run jps on the Secondary machines and see the DataNode service on each one:
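Roughly what to expect (process IDs will differ):

hadoop-primary:      NameNode, SecondaryNameNode, Jps
hadoop-secondary-*:  DataNode, Jps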

On the Primary machine you can check the Distributed File System web page at hadoop-primary:9870:

[ Configuring the Yarn ]

[ Primary ]

On the Primary machine we will configure YARN by executing the following commands:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

[ Secondaries ]

On the Secondary machines we will add some configuration to the following file:

[ yarn-site.xml ]

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following configuration on both machines:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-primary</value>
</property>

Now, on the Primary machine, start the service:

start-yarn.sh

You can check the node list:

yarn node -list

On the Primary machine, you can check the Hadoop cluster page at hadoop-primary:8088/cluster.

[ Checking Nodes ]

[ Executing a Job ]

As a Hadoop MapReduce example, we will run a job that computes an estimated value of pi. The command's arguments are the number of map tasks (16) and the number of samples per map (1000):

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 16 1000

As we can see in the results, the job took 7.097 seconds to finish.

When you are done, you can stop the services before turning off the machines:

To stop the YARN service:

stop-yarn.sh

To stop the DFS service:

stop-dfs.sh

I hope you liked this small article. Best regards, Ricardo Costa (Richards).

This article was created in the context of the Distributed Systems Class 2020–21, ESTG-IPG.
