Sunday, 27 December 2015

FAQ Hadoop - 1

1) List the various Hadoop daemons and their roles in a Hadoop cluster.
Namenode: It is the Master node which is responsible for storing the metadata for all the files and directories. It knows which blocks make up each file and where those blocks are located in the cluster.
Datanode: It is the Slave node that contains the actual data. It periodically reports the blocks it holds to the NameNode.
Secondary Namenode: It periodically merges the NameNode's file system image (fsimage) with the edit log so that the edit log doesn’t grow too large in size. It also keeps a copy of the merged image, which can be used in case of failure of the NameNode.
JobTracker: This is a Hadoop 1.x daemon that runs on a master node (often the same machine as the NameNode in small clusters). It accepts and tracks MapReduce jobs and assigns their tasks to the different TaskTrackers.
TaskTracker: This is a daemon that runs on Datanodes. Task Trackers manage the execution of individual tasks on the slave node.
ResourceManager (Hadoop 2.x): It is the central authority that manages resources and schedules applications running on top of YARN.
NodeManager (Hadoop 2.x): It runs on slave machines, and is responsible for launching the application’s containers, monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
JobHistoryServer (Hadoop 2.x): It maintains information about MapReduce jobs after the Application Master terminates.
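The Secondary Namenode's merge step can be pictured as replaying an edit log over a snapshot of the namespace. The following is only a toy sketch in Python, not Hadoop code: the `checkpoint` function, the operation tuples and the dictionary used as an "image" are all hypothetical names chosen for illustration.

```python
def checkpoint(image, edit_log):
    """Return a new image with every logged edit applied, plus an empty log."""
    merged = dict(image)  # copy, so the previous image stays intact
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged, []


# Hypothetical namespace: file path -> list of block IDs
image = {"/data/a.txt": ["blk_1"]}
edits = [
    ("create", "/data/b.txt", ["blk_2", "blk_3"]),
    ("delete", "/data/a.txt", None),
]
new_image, new_log = checkpoint(image, edits)
print(new_image)
```

After the merge the log is empty again, which is why periodic checkpoints keep the edit log from growing without bound.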
2) Name some Hadoop tools that are required to work on Big Data.
“Hive”, “HBase”, “Ambari” and many more. There are many Hadoop tools for Big Data.
3) List the difference between Hadoop 1 and Hadoop 2.
In Hadoop 1.x, “Namenode” is the single point of failure. In Hadoop 2.x, we have Active and Passive “Namenodes”. If the active “Namenode” fails, the passive “Namenode” takes charge. Because of this, high availability can be achieved in Hadoop 2.x.
Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common pool of cluster resources. MR2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was not possible in Hadoop 1.x, where MapReduce was the only processing framework.
To learn more about the advantages of Hadoop 2.x, read this blog: http://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/
4) What are active and passive “Namenodes”?
In Hadoop-2.x, we have two Namenodes – Active “Namenode” and Passive “Namenode”. Active “Namenode” is the “Namenode” which works and runs in the cluster. Passive “Namenode” is a standby “Namenode”, which is kept in sync with the metadata of the active “Namenode”. When the active “Namenode” fails, the passive “Namenode” replaces the active “Namenode” in the cluster. Hence, the cluster is not left without a “Namenode”, and the “Namenode” is no longer a single point of failure.
5) How does one remove or add nodes in a Hadoop cluster?
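One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is the ease of scaling in line with rapid growth in data volume. For these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) “DataNodes” in a Hadoop cluster. To decommission a node, list its hostname in the exclude file named by the dfs.hosts.exclude property and run hdfs dfsadmin -refreshNodes. To commission a node, add it to the include file named by dfs.hosts (if one is used), refresh the node list the same way, and start the DataNode daemon on the new machine.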

6) What happens when two clients try to access the same file on the HDFS?
HDFS supports exclusive writes only.
When the first client contacts the “Namenode” to open the file for writing, the “Namenode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “Namenode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.
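The lease check described above can be sketched as a small single-writer lock table. This is a simplified illustration in Python, not the HDFS implementation: the `LeaseManager` class and its methods are hypothetical, and the sketch ignores lease renewal and expiry.

```python
class LeaseManager:
    """Toy model of single-writer leases keyed by file path."""

    def __init__(self):
        self._holders = {}  # file path -> name of the client holding the lease

    def open_for_write(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            # Another client already holds the lease: reject the open request.
            raise PermissionError(f"{path} is already leased to {holder}")
        self._holders[path] = client
        return True

    def close(self, path, client):
        # Only the lease holder can release the lease.
        if self._holders.get(path) == client:
            del self._holders[path]


leases = LeaseManager()
leases.open_for_write("/logs/app.log", "client-1")
try:
    leases.open_for_write("/logs/app.log", "client-2")
    second_open = "granted"
except PermissionError:
    second_open = "rejected"
print(second_open)
```

Once the first client closes the file, the lease is released and another client can open the file for writing.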
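7) Why do we sometimes get a “file could only be replicated to 0 nodes, instead of 1” error?
This happens because the “Namenode” does not have any available DataNodes to write the block to, for example because the DataNodes are down, out of disk space, or unable to reach the “Namenode”.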
8) How does one switch off the “SAFEMODE” in HDFS?
You use the command: hadoop dfsadmin -safemode leave (in Hadoop 2.x, the equivalent is hdfs dfsadmin -safemode leave).
9) How do you define “block” in HDFS? What is the block size in Hadoop 1 and in Hadoop 2? Can it be changed?
A “block” is the minimum amount of data that can be read or written. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
Yes, the block size can be configured. The dfs.block.size parameter (named dfs.blocksize in Hadoop 2.x, where the old name is deprecated) can be set in the hdfs-site.xml file to change the size of a block in a Hadoop environment.
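To see what the block size means for a single file, here is a small arithmetic sketch in Python. The `block_layout` helper and the 300 MB example file are made up for illustration. It relies on the fact that the last block of a file only takes up as much space as the remaining data.

```python
MB = 1024 * 1024


def block_layout(file_size, block_size):
    """Return (number of blocks, size of the last block), both in bytes."""
    if file_size == 0:
        return 0, 0
    count = (file_size + block_size - 1) // block_size  # ceiling division
    last = file_size - (count - 1) * block_size
    return count, last


for block_mb in (64, 128):
    count, last = block_layout(300 * MB, block_mb * MB)
    print(f"{block_mb} MB blocks: {count} blocks, last block {last // MB} MB")
```

With these inputs, a 300 MB file needs 5 blocks at 64 MB and 3 blocks at 128 MB, and in both cases the last block holds 44 MB.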
10) How do you define “rack awareness” in Hadoop?
It is the manner in which the “Namenode” decides how blocks are placed, based on rack definitions, so as to minimize network traffic between “DataNodes” in different racks while still protecting the data against the loss of a whole rack. With the default replication factor of 3, the policy is that “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
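The placement rule can be sketched as follows. This is a simplified model in Python, not the actual HDFS block placement code: the `place_replicas` function and the rack and node names are hypothetical, and the sketch assumes the writing client runs on a DataNode, so the first replica stays on that node while the other two go to two different nodes of a single other rack.

```python
import random


def place_replicas(topology, writer, rng=random):
    """Pick three DataNodes: the writer's node, then two nodes on one other rack."""
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


# Hypothetical cluster: rack name -> DataNodes on that rack
topology = {
    "/rack1": ["dn1", "dn2"],
    "/rack2": ["dn3", "dn4"],
    "/rack3": ["dn5", "dn6"],
}
replicas = place_replicas(topology, "dn1")
print(replicas)
```

Whichever remote rack is chosen, the three replicas end up on exactly two racks, so the data survives the loss of either rack and only one copy has to cross the rack boundary during the write.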

