If you do not have Java installed, run the following commands to install Java on your system.
sudo yum update
sudo yum install java-1.8.0-openjdk
These commands will update the package information on your VPS and then install Java.
Verify that Java has been installed on your system:
java -version
Next, create a dedicated hadoop group and user, and give that user sudo privileges (on CentOS, membership of the wheel group provides sudo access):

sudo groupadd hadoop
sudo useradd -g hadoop hadoop
sudo usermod -aG wheel hadoop
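Before generating the SSH keys in the next step, it helps to switch to the new hadoop user so that the keys land in that user's home directory. A quick check and switch, assuming the commands above succeeded:

id hadoop
su - hadoop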
Hadoop uses SSH to access its nodes, which would normally require the user to enter a password. This requirement can be eliminated by creating and setting up SSH keys with the following commands:
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
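As a quick check (not part of the original steps), tighten the permissions on authorized_keys and try logging in to localhost; it should not prompt for a password:

chmod 0600 ~/.ssh/authorized_keys
ssh localhost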
On the Apache website you will get a mirror link to download the Hadoop 3.x tarball. Download the tarball from that link, untar it, and then move Hadoop to a location of your choice. I am providing a mirror link in this article; if you are not able to download the Hadoop 3 tarball from it, you can get a fresh link from the Apache website. The version of Hadoop used here is 3.2.2.
wget https://mirrors.estointernet.in/apache/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
tar -zxvf hadoop-3.2.2.tar.gz
sudo mv hadoop-3.2.2 /usr/local/hadoop
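Mirrors rotate older releases out, so if the link above no longer works, the release should still be available from the Apache archive (URL given here as an example; confirm it on the Apache website):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz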
Hadoop and Java both run on Linux, so the system needs to know where each of them is installed. We can do that by adding the lines below to the .bashrc file and then reloading it.
vim ~/.bashrc

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END

source ~/.bashrc
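As a quick sanity check (assuming the variables above were added and the shell reloaded), the hadoop binary should now be on the PATH:

echo $HADOOP_HOME
hadoop version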
There are 5 configuration files that we need to edit to set up Hadoop on a CentOS system. These are hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml.
You will find these configuration files in the /usr/local/hadoop/etc/hadoop directory.
As we know, Hadoop runs on top of Java, so Hadoop needs to know where Java is installed, and this file is where we tell it. In hadoop-env.sh, just paste the line mentioned below.
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre
In core-site.xml, we set the default file system and the port on which the HDFS NameNode listens. Here we assign the machine's hostname (use localhost for a purely local setup) as the default fs and 9000 as the port for accessing HDFS. Note that this and the following XML snippets go between the <configuration> and </configuration> tags of their respective files.
vim /usr/local/hadoop/etc/hadoop/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<hostname>:9000</value>
</property>
<property>
  <name>hadoop.proxyuser.username.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.username.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp/hadoop-${user.name}</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
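Once the file is saved, you can confirm that Hadoop picks up the setting with the standard getconf tool (a quick check; it should echo back the value you configured):

hdfs getconf -confKey fs.defaultFS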
In hdfs-site.xml, we will set the replication value to 1 because we are setting up a Hadoop cluster on only one node. By default the replication factor is 3, as Hadoop keeps 3 copies of each block across the cluster. We will also set the paths for the NameNode and DataNode directories.
vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.client.datanode-restart.timeout</name>
  <value>30</value>
</property>
Since we have specified the NameNode and DataNode directory paths above, we will now create those directories and change their ownership.
sudo mkdir -p /usr/local/hadoop/data/nameNode
sudo mkdir -p /usr/local/hadoop/data/dataNode
sudo chown -R hadoop /usr/local/hadoop
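A quick check that the directories exist and are owned by the hadoop user (assuming the paths configured above):

ls -ld /usr/local/hadoop/data/nameNode /usr/local/hadoop/data/dataNode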
Next, in yarn-site.xml we configure YARN: log aggregation, the CPU and memory the NodeManager can offer, the scheduler's allocation limits, the ResourceManager address, and the MapReduce shuffle service.

vim /usr/local/hadoop/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nodemanager/local</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.4</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>3072</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx2457m</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value><hostname></value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value><hostname>:8088</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
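The memory and vcore values above should match what your machine actually has; assuming a typical single-node VPS, you can check the available resources before committing to these numbers:

nproc
free -m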
Paste the following lines in the mapred-site.xml file.
vim /usr/local/hadoop/etc/hadoop/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx983m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx983m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx983m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/usr/local/hadoop/</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/usr/local/hadoop/</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>500</value>
</property>
Before starting the Hadoop Distributed File System, we need to format the NameNode once.
hdfs namenode -format
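After formatting, the NameNode directory configured earlier should contain the new metadata (a quick check, assuming the dfs.namenode.name.dir path used above):

ls /usr/local/hadoop/data/nameNode/current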
You can start the NameNode, Secondary NameNode and DataNode with start-dfs.sh, and the ResourceManager and NodeManager with start-yarn.sh.
start-dfs.sh
start-yarn.sh
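To verify that everything came up, jps should list the NameNode, SecondaryNameNode, DataNode, ResourceManager and NodeManager processes. With the Hadoop 3.x defaults, the NameNode web UI is served on port 9870, and as configured above, the ResourceManager UI on port 8088.

jps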