Hadoop is a parallel processing platform for big data. It runs on commodity servers and scales out simply by adding machines. Hadoop consists of two parts: HDFS for storage and MapReduce for processing.
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. As shown below, HDFS is primarily made up of two kinds of nodes (machines):
- Name Node
- Data Node
The Name Node is where the record of where all the data sits is kept. This record is often called metadata (data about data).
The Data Node is where your data actually sits. Each block of data is copied across several nodes (the replication factor is configurable), so that if one data node fails, another data node holding a copy of the required data can serve it.
The Name Node is also made redundant: a backup Name Node can kick in and take over whenever the main Name Node fails.
HDFS is built for large sequential reads and writes. It's not a good candidate for low-latency random access.
This helper is called the Secondary Name Node (strictly speaking, it checkpoints the Name Node's metadata rather than acting as a hot standby). The blocks on the Data Nodes are replicated as well. This is how HDFS stays fault tolerant and can work on massive amounts of data in parallel. The Name Node holds the complete map of where every block is stored. Basically, incoming data is split into blocks and spread across different data nodes to be processed further. Once the data is processed, the results are sent back to the client.
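To make that bookkeeping concrete, here is a toy sketch in plain Java. This is not the real HDFS API; the class name `ToyNameNode` and the round-robin replica placement are made up purely for illustration of the idea: the name node only tracks *where* blocks live, while the data nodes hold the blocks themselves.

```java
import java.util.*;

// Toy model of HDFS bookkeeping -- not the real Hadoop API.
// A file is split into fixed-size blocks; each block is copied to
// several data nodes, and the name node records where each copy lives.
public class ToyNameNode {
    static final int BLOCK_SIZE = 4;      // real HDFS default is 128 MB
    static final int REPLICATION = 3;     // default replication factor

    // Metadata: block id -> the data nodes holding a replica of it
    private final Map<String, List<String>> blockLocations = new HashMap<>();
    private final List<String> dataNodes;

    ToyNameNode(List<String> dataNodes) { this.dataNodes = dataNodes; }

    // Split the incoming data into blocks and assign replicas round-robin.
    void store(String fileName, String data) {
        int blockCount = (data.length() + BLOCK_SIZE - 1) / BLOCK_SIZE;
        for (int b = 0; b < blockCount; b++) {
            String blockId = fileName + "-block" + b;
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(dataNodes.get((b + r) % dataNodes.size()));
            }
            blockLocations.put(blockId, replicas);
        }
    }

    // If one replica's node fails, the metadata still points at the others.
    List<String> locate(String blockId) { return blockLocations.get(blockId); }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode(List.of("dn1", "dn2", "dn3", "dn4"));
        nn.store("log.txt", "abcdefghij");               // 10 chars -> 3 blocks
        System.out.println(nn.locate("log.txt-block0")); // [dn1, dn2, dn3]
    }
}
```

Because every block has three locations on record, losing any single data node never loses data: the client is simply pointed at one of the surviving replicas.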
This leads us to the processing algorithm. It's called MapReduce. There are two essential parts, as shown below:
- Map
- Reduce
The Map function takes the input and transforms it into key-value pairs (key : value), then hands them to a sort/shuffle step which sorts the pairs and groups them by key.
The Reduce function takes the grouped data from the sort/shuffle step and processes each group to produce the eventual output; a minimal sketch of the whole flow appears after the list below. The Map and Reduce functions run on two kinds of trackers (nodes):
- Job Tracker
- Task Tracker
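Here is a minimal word-count sketch of the map, sort/shuffle, and reduce steps. Note this is a plain-Java simulation of the idea (the class name `ToyWordCount` is made up), not Hadoop's actual Mapper/Reducer API:

```java
import java.util.*;
import java.util.AbstractMap.SimpleEntry;

// Toy word count illustrating the map -> sort/shuffle -> reduce flow.
// A plain-Java simulation, not the Hadoop MapReduce API.
public class ToyWordCount {
    // MAP: turn each line of input into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // REDUCE: collapse all the values for one key into a single result.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat", "the cat ran");

        // SORT/SHUFFLE: group every (word, 1) pair by its key -- the
        // step the framework performs for you between map and reduce.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> p : map(line)) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                       .add(p.getValue());
            }
        }

        // Run reduce on each group. Prints: cat = 2, ran = 1, sat = 1, the = 2
        grouped.forEach((word, counts) ->
            System.out.println(word + " = " + reduce(word, counts)));
    }
}
```

The key insight is that you only write the two small functions at the top; the grouping in the middle is what the framework handles for you at scale.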
The Job Tracker supplies the Task Trackers with the map and reduce functions to apply to the data sitting on their data nodes, then collects the reduced output and gives it back to the client.
Note that the Task Tracker is the individual node that actually runs the map and reduce tasks. The Job Tracker schedules those tasks and collates the results from the various Task Trackers into a single output.
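To picture that division of labor, here is another hedged plain-Java sketch (the class name `ToyJobTracker` is invented; real Task Trackers are separate machines, not threads) where a coordinator hands one map task per input split to a pool of workers and collates their partial results:

```java
import java.util.*;
import java.util.concurrent.*;

// Toy coordinator/worker split: the "job tracker" farms out one task
// per input split and collates results; the "task trackers" (threads
// here, separate machines in real Hadoop) each process their own slice.
public class ToyJobTracker {
    public static void main(String[] args) throws Exception {
        List<String> splits = List.of("the cat sat", "the cat ran", "cats nap");
        ExecutorService taskTrackers = Executors.newFixedThreadPool(3);

        // Submit one map task per split; each returns a partial word count.
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (String split : splits) {
            futures.add(taskTrackers.submit(() -> {
                Map<String, Integer> partial = new HashMap<>();
                for (String w : split.split("\\s+")) partial.merge(w, 1, Integer::sum);
                return partial;
            }));
        }

        // Collate: merge the partial results into the final answer.
        Map<String, Integer> total = new TreeMap<>();
        for (Future<Map<String, Integer>> f : futures) {
            f.get().forEach((w, c) -> total.merge(w, c, Integer::sum));
        }
        taskTrackers.shutdown();
        System.out.println(total); // {cat=2, cats=1, nap=1, ran=1, sat=1, the=2}
    }
}
```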
Here's how to remember it:

| Component | First level | Second level |
|-----------|-------------|--------------|
| HDFS      | Name Node   | Data Node    |
| MapReduce | Job Tracker | Task Tracker |
Together, HDFS and MapReduce make a very good combination that can churn through huge volumes of data in parallel and produce results quickly.
There are many vendors offering Hadoop distributions, and they are worth checking out.
Hadoop also pairs well with the cloud; hosting a Hadoop cluster on AWS is an attractive proposition.
Note that earlier systems moved the data to the CPU; Hadoop instead ships the individual map and reduce programs to where the data already sits. This is a 180-degree flip.
Many people now use other processing engines, such as Apache Spark, in place of MapReduce, and more developments are coming in this area. Stay tuned.