FAQ Hadoop

FAQ Hadoop - 4

Posted by: Sushanth, Sunday, 27 December 2015

31) Explain “Distributed Cache” in a “MapReduce Framework”

“Distributed Cache” is an important feature provided by the “MapReduce Framework”. It is used when you want to share files across the nodes of a Hadoop cluster: the framework copies each cached file to the worker nodes before the tasks of the job start, so every task can read it locally. The files can be executable “jar” files, archives, or simple “properties” files.

32) What mechanism does the “Hadoop Framework” provide to synchronize changes made in the “Distributed Cache” during runtime of the application?

This is a tricky question. There is no such mechanism. “Distributed Cache” by design is “read only” during the time of job execution.

33) How do “reducers” communicate with each other?

This is another tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.

Explain the role of “Reduce Side Join” in “MapReduce”.

34) What does a “MapReduce Partitioner” do?

A “MapReduce Partitioner” makes sure that all the values of a single key go to the same “reducer”. It directs the “mapper” output to the “reducers” by determining which “reducer” is responsible for a particular key. The default partitioner does this by hashing the key, which also tends to spread the keys evenly over the “reducers”.
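The routing described above can be sketched in a few lines of plain Python. This is only a model of hash partitioning, not Hadoop code: the function names `partition` and `shuffle` and the sample data are made up for illustration, and `zlib.crc32` stands in for the key's hash code.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key, in the style of hash partitioning."""
    # zlib.crc32 is a stand-in for the key's hash code (an assumption of this
    # sketch); the mask keeps the value non-negative before the modulo.
    key_hash = zlib.crc32(key.encode("utf-8"))
    return (key_hash & 0x7FFFFFFF) % num_reducers


def shuffle(map_output, num_reducers):
    """Route each (key, value) pair to the reducer chosen by partition()."""
    reducers = defaultdict(list)
    for key, value in map_output:
        reducers[partition(key, num_reducers)].append((key, value))
    return dict(reducers)


# Hypothetical mapper output: the same key appears more than once.
map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("plum", 1), ("pear", 1)]
routed = shuffle(map_output, num_reducers=3)
for reducer_id in sorted(routed):
    print(reducer_id, routed[reducer_id])
```

Because the reducer index depends only on the key, every pair with the key “apple” lands on the same reducer. How evenly the keys spread depends on the hash function and on how the keys are distributed.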

35) What is a “Combiner”?

A “Combiner” is a mini “reducer” that performs a local “reduce” task. It receives the output of the “mapper” on a particular “node” and sends its own output to the “reducer”. “Combiners” improve the efficiency of “MapReduce” by reducing the amount of data that has to be sent to the “reducers”.
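To see the saving, here is a small word-count model in plain Python. It is a sketch, not Hadoop code: the function names and the two input splits are hypothetical, and each string plays the role of the data handled by one mapper.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word on the mapper's own node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reducer: sum the counts per word across all mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical input: each string is the split handled by one mapper.
splits = ["to be or not to be", "to do or not to do"]
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

# Number of records that would travel to the reducers in each case.
records_without_combiner = sum(len(p) for p in raw)
records_with_combiner = sum(len(p) for p in combined)

result_plain = reduce_counts([pair for p in raw for pair in p])
result_combined = reduce_counts([pair for p in combined for pair in p])
print(records_without_combiner, records_with_combiner)
print(result_plain == result_combined)
```

The final counts are the same with and without the combiner because addition is associative and commutative. A combiner is only safe for operations with that property.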

36) What do you know about “SequenceFileInputFormat”?

“SequenceFileInputFormat” is an input format for reading sequence files. The key and value types are user defined. A sequence file is a binary file format of key-value pairs, with optional compression, which is optimized for passing data from the output of one “MapReduce” job to the input of another “MapReduce” job.

Questions around Pig

37) What is a “Bag”?

A “Bag” is one of the data types present in “Pig”. It is an unordered collection of tuples with possible duplicates. “Bags” are used to store collections while grouping. A “Bag” can grow as large as the local disk allows, which means that the size of the “Bag” is limited. When a “Bag” no longer fits in memory, “Pig” spills it to the local disk and keeps only some parts of the “Bag” in memory, so the complete “Bag” does not have to fit into memory. We represent a “Bag” with “{}”.

38) What does “FOREACH” do?

“FOREACH” is used to apply transformations to the data and to generate new data items. The name itself indicates that for each element of a data “Bag”, the respective action will be performed.

Syntax: FOREACH bagname GENERATE expression1, expression2, ...

The meaning of this statement is that the expressions mentioned after “GENERATE” will be applied to the current record of the data “Bag”.
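The per-record behaviour can be modelled in plain Python as follows. This is a sketch of the semantics only, not how “Pig” is implemented: the helper `foreach_generate`, the relation name `orders`, and its fields are hypothetical. It corresponds roughly to a statement such as FOREACH orders GENERATE item, price * quantity.

```python
# Hypothetical relation "orders": a bag of (item, price, quantity) tuples.
# A bag may contain duplicates, so the repeated "pen" record is kept.
orders = [("pen", 2.0, 10), ("book", 12.5, 2), ("pen", 2.0, 10)]


def foreach_generate(bag, *expressions):
    """Apply every expression to each record and emit one new tuple per record."""
    return [tuple(expr(record) for expr in expressions) for record in bag]


totals = foreach_generate(
    orders,
    lambda r: r[0],         # expression1: the item field
    lambda r: r[1] * r[2],  # expression2: price * quantity
)
print(totals)
```

Each input record produces exactly one output record here, and the output has one field per expression listed after “GENERATE”.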

39) Why do we need “MapReduce” during “Pig” programming?

“Pig” is a high-level platform that makes data analysis on Hadoop easier to write and run. The language we use for this platform is “Pig Latin”. A program written in “Pig Latin” is like a query written in SQL: it needs an execution engine to run it. So, when a program is written in “Pig Latin”, the “Pig compiler” converts the program into “MapReduce” jobs. Here, “MapReduce” acts as the execution engine.

40) What is the role of a “co-group” in “Pig”?

“Co-group” (the COGROUP operator) groups two data sets by a common field in a single step. It collects the elements that share a value of that field and returns one record per value, containing the value and two separate “bags”. The first “bag” holds the records from the first data set that have that value, while the second “bag” holds the records from the second data set that have it.
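A minimal plain-Python model of this grouping is shown below. It is a sketch rather than Pig code: the function `cogroup` and the data sets `pets` and `phones` are hypothetical, and the keys are sorted only to make the printed order predictable.

```python
def cogroup(first, second, key_index=0):
    """Group two relations by a shared field; return (key, bag1, bag2) records."""
    # Sorting is for readable output only; it is not part of the operation.
    keys = sorted({r[key_index] for r in first} | {r[key_index] for r in second})
    return [
        (
            key,
            [r for r in first if r[key_index] == key],   # bag from the first data set
            [r for r in second if r[key_index] == key],  # bag from the second data set
        )
        for key in keys
    ]


# Hypothetical data sets that share an "owner" field in position 0.
pets = [("alice", "cat"), ("bob", "dog"), ("alice", "fish")]
phones = [("alice", "555-0100"), ("carol", "555-0199")]

grouped = cogroup(pets, phones)
for record in grouped:
    print(record)
```

A key that appears in only one data set still produces a record, and the “bag” for the other data set is simply empty.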