MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. This blog explains the nature of this programming model and how it can be used to write programs that run in the Hadoop environment.
This blog is intended to give budding MapReduce developers a head start in developing and running a Hadoop-based application, along with a few tricks of Hadoop MapReduce programming. The code samples provided here have been tested in a Hadoop environment, but do write to me if you find anything that does not work.
Word count is the typical example with which Hadoop MapReduce developers start their hands-on work. This sample MapReduce job counts the number of occurrences of each word in the provided input files.
What are the minimum requirements?
1. Input text files – any text file
2. A VM
3. The mapper, reducer and driver classes to process the input files
How it works
The word count operation takes place in two stages: a mapper phase and a reducer phase. In the mapper phase the text is first tokenized into words; we then form a key-value pair from each word, where the key is the word itself and the value is ‘1’. For example, consider the sentence
“tring tring the phone rings”
In map phase the sentence would be split as words and form the initial key value pair as
<tring,1>
<tring,1>
<the,1>
<phone,1>
<rings,1>
In the reduce phase the keys are grouped together and the values for identical keys are added. Before the reducer runs, the framework groups all the values for each key, so ‘tring’ arrives at the reducer as <tring, [1,1]>. Here ‘tring’ is the only key with more than one value, so after adding the values the output key-value pairs would be
<tring,2>
<the,1>
<phone,1>
<rings,1>
This gives the number of occurrences of each word in the input. Thus reduce forms an aggregation phase over keys.
The point to be noted here is that the mapper first executes completely on the entire data set, splitting the words and forming the initial key-value pairs; only after this entire process is completed does the reducer start. Say we have a total of 10 lines in our input files combined: first those 10 lines are tokenized and key-value pairs are formed in parallel, and only then does the aggregation/reducer phase begin its operation.
Let us define some goals for this blog.
Goals for this Module:
- Installing and accessing Hadoop
- Java programs for word count
- Compilation and execution of Java programs
Installing Hadoop
The Hadoop version used in this example is the Hortonworks Sandbox with HDP 2.3. It can be mounted using Oracle VM VirtualBox. Just to let you know, you need at least 6 GB of RAM for it to run decently. Once you start the virtual machine, it should take about 2-3 minutes for the machine to be ready for use.
Accessing Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. Once at this page you need to access the Secure Shell client at 127.0.0.1:4200.
The screenshot below will give you an idea of whether you are on the right path.
After logging into the SSH client you will need to give the login details.
You will then need to make a directory WCclasses in the root directory with the command
mkdir WCclasses
Then use the vi editor to create the 3 Java programs: WordMapper.java, SumReducer.java and WordCount.java.
Below are the screenshots of the 3 programs to help you further.
Java Programs
WordMapper.java
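In case the screenshot is hard to read, here is a minimal sketch of what WordMapper.java typically looks like for this example. It assumes the org.apache.hadoop.mapreduce API and simple whitespace tokenization, so the version in the screenshot may differ slightly.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the incoming line into words and emit <word, 1> for each one
        String line = value.toString();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token.toLowerCase());
                context.write(word, one);
            }
        }
    }
}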
WordCount.java
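Again, a minimal sketch of the driver class WordCount.java under the same assumptions; it simply wires the mapper and reducer together and takes the input and output paths as command-line arguments.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }

        // Configure the job: jar, input/output paths, mapper, reducer and output types
        Job job = Job.getInstance();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}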
SumReducer.java
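And a minimal sketch of SumReducer.java, again assuming the same API; it adds up the 1s emitted by the mapper for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted for this word and write <word, total>
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}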
After all 3 programs are in your root directory you need to compile them with the javac command. The compiled class files will go into the WCclasses directory that was made earlier.
Compilation and Execution
The commands to compile the Java programs are
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordCount.java
This will compile all 3 Java files and place the compiled class files in the WCclasses directory.
The command below packages the class files from WCclasses into a JAR file.
Creation of JAR File
jar -cvf WordCount.jar -C WCclasses/ .
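If you want to double-check what went into the archive before running the job, you can list its contents:

jar -tvf WordCount.jar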
Once the JAR file is created you need to create the input directory in the HDFS file system using the commands below.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading files into HUE
Now you need to access HUE at 127.0.0.1:8000 to upload the input text files into the wc-inp directory. Thankfully, this is drag and drop and does not involve writing any commands.
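If you prefer the shell instead, the same upload can be done with hdfs dfs -put; the local file name input.txt below is just a placeholder for your own text file.

hdfs dfs -put input.txt /user/ru1/wc-inp/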
Once the files are uploaded, you use the command below to run the final Hadoop job.
Final Execution
hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out41
Notice that we have not created wc-out41. Hadoop will create the output directory by itself once the command is run.
You can track your job at the address 127.0.0.1:8088, which lists the log and status of all jobs.
Once the job shows as Finished & Succeeded, we are on our way.
Directory for Checking final Output
We go to the directory that was created during execution of the program, /user/ru1/wc-out41. The file to look at is part-r-00000, which contains the output.
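You can also print this file straight from the shell with the command below (the path assumes the output directory used above).

hdfs dfs -cat /user/ru1/wc-out41/part-r-00000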
OUTPUT
And that’s it, you can have a look at the output below. The file can be accessed from the wc-out41 directory in HUE.
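If you cannot view the screenshot, the output for the example sentence used earlier in this blog would look something like this: one word and its count per line, separated by a tab.

phone	1
rings	1
the	1
tring	2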
We have successfully counted the words in a text file. I hope this gives you a head start with Hadoop. I will come up with new posts in upcoming blogs. All the best on your Hadoop journey.