Introduction To Hadoop & MapReduce For Beginners

The objective here is to offer a 10,000-foot view of Hadoop for those who know next to nothing about it, so that you can learn Hadoop step by step. This post is not designed to get you ready for Hadoop development, but to give you a sound understanding from which to take the next steps in mastering the technology.

Let's get down to it:

Hadoop is an Apache Software Foundation project that essentially provides two things:

A distributed file system known as HDFS (Hadoop Distributed File System)

A framework and API for building and running MapReduce jobs


HDFS is structured similarly to a regular Unix file system, except that data storage is distributed across several machines. It is not intended as a replacement for a regular file system, but rather as a file system-like layer for large distributed systems to use. It has built-in mechanisms to handle machine failures, and is optimized for throughput rather than latency.

There are two and a half types of machine in an HDFS cluster:

Datanode – where HDFS actually stores the data; there are usually quite a few of these.

Namenode – the 'master' machine. It manages all the metadata for the cluster, e.g. which blocks make up a file, and which datanodes those blocks are stored on.

Secondary Namenode – this is NOT a backup namenode, but a separate service that keeps a copy of both the edit logs and the filesystem image, merging them periodically to keep the size reasonable.

This is soon being deprecated in favor of the backup node and the checkpoint node, but the functionality remains similar (if not the same).
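To make the namenode's bookkeeping concrete, here is a minimal sketch in plain Python. It is not HDFS code: the file path, block IDs, datanode names and the locate helper are all made up for illustration. It only shows the two lookups described above, file to blocks and block to datanodes.

```python
# Toy model of namenode metadata. Every name below is hypothetical;
# the real namenode keeps far richer state than two dictionaries.

# Which blocks make up each file, in order.
file_to_blocks = {
    "/logs/day1.txt": ["blk_1", "blk_2"],
}

# Which datanodes hold a copy of each block.
block_to_datanodes = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}


def locate(path):
    """Return (block, datanodes) pairs for a file, in block order."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]


for block, nodes in locate("/logs/day1.txt"):
    print(block, "->", ", ".join(nodes))
```

A client would ask the namenode something like this first, then read the actual bytes from the datanodes it was pointed to.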

Data can be accessed using either the Java API or the Hadoop command line client. Many operations are similar to their Unix counterparts. Check out the documentation page for the full list, but here are some simple examples:

list files in the root directory


hadoop fs -ls /

list files in my home directory


hadoop fs -ls ./

cat a file (decompressing if needed)


hadoop fs -text ./file.txt.gz

upload and retrieve a file

hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt

hadoop fs -get /home/matthew/remotefile.txt ./localfile.txt

Note that HDFS is optimized differently than a regular file system. It is designed for non-realtime applications demanding high throughput rather than online applications demanding low latency. For example, files cannot be modified once written, and the latency of reads/writes is really bad by filesystem standards. On the flip side, throughput scales fairly linearly with the number of datanodes in a cluster, so it can handle workloads no single machine would ever be able to.

HDFS also has a number of features that make it well suited to distributed systems:

    Failure tolerant – data can be duplicated across several datanodes to protect against machine failures. The industry standard seems to be a replication factor of 3 (everything is stored on three machines).

    Scalability – data transfers happen directly with the datanodes, so your read/write capacity scales fairly well with the number of datanodes

    Space – need more hard drive space? Just add more datanodes and re-balance

    Industry standard – Lots of other distributed applications build on top of HDFS (HBase, Map-Reduce)

    Pairs well with MapReduce
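As a rough illustration of what a replication factor of 3 means in practice, here is a small Python sketch. The function names, node names and disk sizes are assumptions for the example, and the round-robin placement is a deliberate simplification, not the policy HDFS really uses.

```python
def usable_capacity_tb(datanodes, disk_per_node_tb, replication=3):
    """Raw cluster capacity divided by the number of copies kept of each block."""
    return datanodes * disk_per_node_tb / replication


def place_replicas(block_index, datanodes, replication=3):
    """Pick `replication` distinct datanodes round-robin (simplified placement)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return [datanodes[(block_index + i) % len(datanodes)]
            for i in range(replication)]


# Hypothetical cluster: 12 datanodes with 4 TB of disk each.
print(usable_capacity_tb(12, 4))   # 48 TB raw / 3 copies = 16.0

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, nodes))    # ['dn1', 'dn2', 'dn3']
print(place_replicas(3, nodes))    # ['dn4', 'dn1', 'dn2']
```

The trade-off is visible in the first number: every extra copy protects against another machine failure, but costs a proportional share of raw disk.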


The second fundamental part of Hadoop is the MapReduce layer. This is made up of two sub components:

An API for writing MapReduce workflows in Java.

A set of services for managing the execution of these workflows.

The Map and Reduce APIs

The basic premise is this:

    Map tasks perform a transformation.

    Reduce tasks perform an aggregation.
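The classic way to see this split is a word count. The sketch below is plain Python running in a single process, not the Hadoop Java API, and the function names are chosen for illustration only. The grouping step in the middle stands in for the work the framework does between the two phases, collecting all values that share a key.

```python
from collections import defaultdict


def map_task(line):
    """Transformation: turn one line of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def group_by_key(pairs):
    """Collect all values emitted for the same key (done by the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Aggregation: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(k, v) for k, v in group_by_key(mapped).items())


counts = word_count(["the quick brown fox", "the lazy dog"])
print(counts["the"])   # 2
```

On a real cluster many map tasks would run in parallel on different chunks of the input, and each reduce task would receive the grouped values for a subset of the keys.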