Saturday 21 November 2015

Hadoop - Word counting using the MapReduce technique


MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. This blog explains the nature of this programming model and how it can be used to write programs that run in the Hadoop environment.

This blog is intended to give budding MapReduce developers a head start in developing and running a Hadoop-based application, and covers a few tricks of Hadoop MapReduce programming. The code samples provided here have been tested in a Hadoop environment, but do post a comment if you find anything that does not work.

Word count is the typical example with which Hadoop MapReduce developers get their hands dirty. This sample MapReduce job counts the number of occurrences of each word in the provided input files.

What are the minimum requirements?

1. Input text files – any text files will do
2. A VM running Hadoop (the HortonWorks Sandbox is used here)
3. The mapper, reducer and driver classes to process the input files

 How it works

The word count operation takes place in two stages: a mapper phase and a reducer phase. In the mapper phase the text is first tokenized into words; we then form a key-value pair for each word, with the key being the word itself and the value ‘1’. For example, consider the sentence
“tring tring the phone rings”
In the map phase the sentence is split into words, forming the initial key-value pairs:
<tring,1>
<tring,1>
<the,1>
<phone,1>
<rings,1>

In the reduce phase the keys are grouped together and the values for identical keys are added. Here there is only one repeated key, ‘tring’; its values are added, so the output key-value pairs would be
<tring,2>
<the,1>
<phone,1>
<rings,1>
This gives the number of occurrences of each word in the input. The reduce step thus forms an aggregation phase over keys.

The point to note here is that the mapper first executes completely over the entire data set, splitting the words and forming the initial key-value pairs. Only after this entire process is completed does the reducer start. Say we have a total of 10 lines in our input files combined: first the 10 lines are tokenized and key-value pairs are formed in parallel, and only after this does the reducer begin its aggregation.

Let us define some goals for this blog.

Goals for this Module:

- Installing and accessing Hadoop
- Java programs for word count
- Compilation and execution of Java programs



Installing Hadoop
The Hadoop version used in this example is the HortonWorks Sandbox with HDP 2.3. It can be run using Oracle VM VirtualBox. Just to let you know, you need at least 6 GB of RAM for it to run decently. Once you start the virtual machine it should take about 2-3 minutes for the machine to be ready to be used.
Accessing Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. Once at this page you need to access the Secure Shell client at 127.0.0.1:4200. The screenshot below will give you an idea of whether you are on the right path.

After logging into the SSH client you will need to give your login details.
You will need to make a directory WCclasses with the command
mkdir WCclasses
in the root directory. Then you need to use the vi editor to create the three Java programs: WordMapper.java, SumReducer.java and WordCount.java. The listings below, sketched along the lines of the standard Hadoop word-count code, should help you further.

Java Programs

WordMapper.java
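Here is a minimal sketch of the mapper, written against the new (org.apache.hadoop.mapreduce) API: it tokenizes each input line and emits a <word, 1> pair for every token. Treat it as a reference for the pattern described above rather than the exact code from the screenshot.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit <word, 1> for each one
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}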



WordCount.java
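A sketch of the driver class: it wires the mapper and reducer together and reads the input and output paths from the command line, matching the hadoop jar invocation used later in this post. Registering SumReducer as a combiner is an optional local-aggregation step, not a requirement.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class); // optional: pre-aggregates map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/ru1/wc-inp
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/ru1/wc-out41
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}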


SumReducer.java
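A sketch of the reducer: for each word it simply adds up all the 1s emitted by the mapper, producing the <word, count> pairs shown in the example above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}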




After all 3 programs are in your root directory you need to compile them using the javac command. The class files will go into the WCclasses directory which was made earlier.
Compilation and Execution

The commands to compile the Java programs are
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordCount.java
This will compile all 3 java files and keep the compiled class files in the directory WCclasses.
Creation of the JAR File

The command below packages the compiled class files from WCclasses into a JAR file named WordCount.jar in the current directory:

jar -cvf WordCount.jar -C WCclasses/ .
Once the jar file is created you need to create the input directory in the HDFS file system (and verify it) using the commands below.

hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp


Loading files into HUE

Now you need to access HUE at 127.0.0.1:8000 to upload the txt files to be read. Thankfully, this is drag and drop and does not involve writing any commands.
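If you prefer to stay in the shell, the same upload can be done with hdfs dfs -put; here input.txt is just a placeholder for whatever your text file is called:

hdfs dfs -put input.txt /user/ru1/wc-inp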




Once the files are uploaded, run the job with the final hadoop command below.
Final Execution

hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out41
Notice that we have not made wc-out41 ourselves. Hadoop will create the output directory by itself once the command is run (in fact, the job fails if the output directory already exists).

You can track your job at the address 127.0.0.1:8088, which lists all jobs along with their logs and status.
Once the job shows as Finished & Succeeded we are on our way.


Directory for Checking the Final Output

We go to the directory that was created during execution of the program, /user/ru1/wc-out41.
The file to look at is part-r-00000, which contains the output.
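If you would rather not click through HUE, the same file can be printed straight from the SSH session:

hdfs dfs -cat /user/ru1/wc-out41/part-r-00000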

OUTPUT

And that’s it, you can have a look at the output below. The file can be accessed from the wc-out41 directory in HUE.







We have successfully counted the words in a text file. I hope this gives you a head start with Hadoop. I will come up with more in upcoming posts. All the best on your Hadoop journey.
