Sunday, 27 December 2015

FAQ - Hadoop - 3

21) What are the features of a “Fully Distributed” mode?
“Fully Distributed” mode is used in production, where ‘n’ machines form a Hadoop cluster and the Hadoop daemons run across those machines. The “NameNode” runs on one host, the “DataNodes” run on other hosts, and the “TaskTracker/NodeManager” daemons run on the worker machines (typically the same machines that run the “DataNodes”). Masters and slaves are separate machines in this kind of deployment.
22) Name the three modes in which Hadoop can be run.
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo distributed mode
3. Fully distributed mode
23) What is the role of “ZooKeeper” in a Hadoop cluster?
The purpose of “ZooKeeper” is cluster management. “ZooKeeper” will help you achieve coordination between Hadoop nodes. “ZooKeeper” also helps to:
  • Manage configuration across nodes
  • Implement reliable messaging
  • Implement redundant services
  • Synchronize process execution
Questions around MapReduce
24) What is “MapReduce”?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming.
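As a rough illustration of the programming model, the sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog", "the end"]))
```

On a real cluster the map calls run in parallel on different machines and the framework performs the shuffle over the network, but the shape of the computation is the same.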
25) What is the syntax to run a “MapReduce” program?
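hadoop jar file.jar /input_path /output_path
If the jar’s manifest does not name a main class, pass the class name right after the jar:
hadoop jar file.jar MainClass /input_path /output_path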
26) How would you debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
  • Using Counters.
  • Using the web interface provided by the Hadoop framework.
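To make the counter idea concrete, here is a minimal sketch in plain Python rather than the Hadoop API. The map-style function counts how many records it reads and how many it has to skip as malformed. The counter names (RECORDS_READ, MALFORMED_RECORDS), the record layout ("name,amount"), and the function name are assumptions made for this example.

```python
from collections import Counter


def parse_records(lines, counters):
    """Emit (name, amount) pairs and count the records that cannot be parsed."""
    for line in lines:
        counters["RECORDS_READ"] += 1
        parts = line.split(",")
        # A valid record has exactly two fields and a numeric second field.
        if len(parts) != 2 or not parts[1].strip().isdigit():
            counters["MALFORMED_RECORDS"] += 1
            continue
        yield parts[0].strip(), int(parts[1])


if __name__ == "__main__":
    counters = Counter()
    sample = ["ann,10", "bob,oops", "broken line", "cy,7"]
    records = list(parse_records(sample, counters))
    print(records)
    print(dict(counters))
```

A malformed-record count that is unexpectedly high points at a parsing bug or bad input without having to read through task logs.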
27) What are the main configuration parameters in a “MapReduce” program?
Users of the “MapReduce” framework need to specify these parameters:
  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format
  • Output format
  • Class containing the “map” function
  • Class containing the “reduce” function
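The sketch below shows how these six parameters fit together, using a toy in-memory runner written in plain Python. It is not the Hadoop job API: JobConfig, run_job, the dictionary standing in for the distributed file system, and the "year,temperature" sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobConfig:
    input_path: str          # job's input location
    output_path: str         # job's output location
    input_format: Callable   # raw text -> iterable of (key, value) records
    output_format: Callable  # (key, value) -> one output line
    mapper: Callable         # (key, value) -> iterable of (key, value)
    reducer: Callable        # (key, [values]) -> (key, value)


def run_job(conf, fs):
    """Run a job against `fs`, a dict that stands in for the file system."""
    groups = {}
    for key, value in conf.input_format(fs[conf.input_path]):
        for out_key, out_value in conf.mapper(key, value):
            groups.setdefault(out_key, []).append(out_value)
    lines = []
    for key in sorted(groups):
        out_key, out_value = conf.reducer(key, groups[key])
        lines.append(conf.output_format(out_key, out_value))
    fs[conf.output_path] = "\n".join(lines)


def text_input(text):
    """Toy input format: key is the line number, value is the line."""
    return enumerate(text.splitlines())


def max_temp_mapper(_key, line):
    year, temp = line.split(",")
    yield year, int(temp)


def max_temp_reducer(year, temps):
    return year, max(temps)


def tab_output(key, value):
    return f"{key}\t{value}"


if __name__ == "__main__":
    fs = {"/input_path": "1949,111\n1950,22\n1949,78\n1950,0"}
    conf = JobConfig("/input_path", "/output_path", text_input,
                     tab_output, max_temp_mapper, max_temp_reducer)
    run_job(conf, fs)
    print(fs["/output_path"])
```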
28) What is the default input type/format in “MapReduce”?
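By default, the input format in MapReduce is “text” (TextInputFormat). Each line of the input becomes one record: the key is the byte offset of the line within the file, and the value is the content of the line.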
29) State the reason why we can’t perform “aggregation” (addition) in a mapper? Why do we need the “reducer” for this?
We cannot perform the complete “aggregation” (addition) in a mapper because each “mapper” sees only its own “input split”, and the records for a given key may be spread across many splits. Grouping by key (the shuffle and sort) happens between the map and reduce phases, so only the “reducer” receives all the values for a key together. The “map” function is called once per record and, on its own, keeps no track of the previous record’s value. A mapper (or a combiner) can compute a partial result for its own split, but the partial results still have to be merged in the “reducer” to get the final answer.
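A small sketch in plain Python (no Hadoop involved) shows why the final sum needs a reduce step. Each simulated mapper can only add up the values in its own split, so the per-mapper sums are partial, and the totals appear only once the values for each key are brought together. The function names and the sample splits are made up for this example.

```python
def mapper_partial_sums(split):
    """One mapper: it sees only its own split, so its sums are partial."""
    sums = {}
    for key, amount in split:
        sums[key] = sums.get(key, 0) + amount
    return sums


def reducer_total(key, partial_values):
    """One reducer call: it receives every value for `key` from all mappers."""
    return sum(partial_values)


def aggregate(splits):
    """Run one mapper per split, group by key, then reduce each key."""
    partials = [mapper_partial_sums(split) for split in splits]
    grouped = {}
    for partial in partials:
        for key, value in partial.items():
            grouped.setdefault(key, []).append(value)
    totals = {key: reducer_total(key, values) for key, values in grouped.items()}
    return partials, totals


if __name__ == "__main__":
    splits = [[("a", 1), ("b", 2), ("a", 3)], [("a", 10), ("b", 20)]]
    partials, totals = aggregate(splits)
    print(partials)  # no single mapper holds the full sum for "a" or "b"
    print(totals)
```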
30) What is the purpose of “RecordReader” in Hadoop?
The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper”. The “RecordReader” instance is created by the “InputFormat”.
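The sketch below imitates what a line-oriented record reader does, in plain Python. A split is just a byte range (start, length) over the data, and the reader turns that range into (byte offset, line) pairs. The boundary rule used here is an assumption chosen for the sketch, not a copy of Hadoop’s implementation: a reader that does not start at offset 0 skips its first line, and every reader finishes the line that crosses its end. With this rule each line is read exactly once even when a split boundary falls in the middle of a line. The function name and sample data are hypothetical.

```python
def read_split(data, start, length):
    """Yield (byte_offset, line) records for one split of newline-delimited bytes."""
    end = start + length
    pos = start
    if start != 0:
        # The first (possibly partial) line is left to the previous split's
        # reader, so skip ahead to just past the next newline.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    # Read every line that starts at or before `end`, finishing a line
    # that runs past the split boundary.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


if __name__ == "__main__":
    data = b"alpha\nbeta\ngamma\ndelta\n"
    # Three splits whose boundaries do not line up with line breaks.
    for start, length in [(0, 8), (8, 8), (16, 7)]:
        print((start, length), list(read_split(data, start, length)))
```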

