FAQ Hadoop - 2
Posted by: Sushanth
Sunday, 27 December 2015
11) Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
HDFS is better suited to a large amount of data stored in a small number of large files than to the same data spread across many small files. This is because the “Namenode” keeps the metadata for every file and block in its memory, and it is a very expensive, high-performance system, so it is not prudent to fill it with the unnecessary metadata that a huge number of small files would generate. When the same amount of data sits in a single large file, the “Namenode” needs far less space for metadata. Hence, for optimized performance, HDFS favours large data sets over multiple small files.
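As a rough back-of-the-envelope illustration (the ~150 bytes per metadata object is a commonly quoted estimate, not an exact figure), compare the NameNode memory needed for one large file versus the same data stored as many small files:

```java
public class NamenodeMemoryEstimate {
    // Rough, commonly quoted estimate: each file/directory/block object
    // kept in the NameNode's heap costs on the order of 150 bytes.
    static final long BYTES_PER_OBJECT = 150;
    static final long BLOCK_SIZE = 128L * 1024 * 1024;  // default HDFS block size (128 MB)

    public static void main(String[] args) {
        long totalData = 1024L * 1024 * 1024;            // 1 GB of data in total

        // Case 1: one 1 GB file -> 1 file object + 8 block objects
        long largeFileObjects = 1 + totalData / BLOCK_SIZE;

        // Case 2: the same 1 GB as 1 KB files -> a file object and a block object per file
        long smallFiles = totalData / 1024;
        long smallFileObjects = smallFiles * 2;

        System.out.printf("One large file  : ~%d bytes of NameNode heap%n",
                largeFileObjects * BYTES_PER_OBJECT);
        System.out.printf("Many small files: ~%d bytes of NameNode heap%n",
                smallFileObjects * BYTES_PER_OBJECT);
    }
}
```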
12) What is the basic difference between traditional RDBMS and Hadoop?
Traditional RDBMS is used for transactional systems, to report on and archive data, whereas Hadoop is an approach to storing and processing huge amounts of data in a distributed file system. RDBMS will be useful when you want to look up a single record in Big Data, whereas Hadoop will be useful when you want to ingest Big Data in one shot and perform analysis on it later.
13) Explain the indexing process in HDFS.
Hadoop indexes data in its own way. HDFS stores a file as blocks of a configured size (128 MB by default); the last block simply holds whatever data is left over, and the block metadata kept by the “Namenode” tells you where the next part of the data is located. In other words, HDFS does not index the contents of files; it only keeps track of which blocks make up a file and where they are stored.
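A minimal sketch of how a client can ask the NameNode where the blocks of a file live, using the standard FileSystem API (the path /data/large-file.txt is just a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/large-file.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: its offset within the file, its length
        // (the last block may be shorter), and the DataNodes holding replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```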
14) What is “speculative execution” in Hadoop?
If a node appears to be running a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other one is killed. This process is called “speculative execution”.
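Speculative execution is controlled per job through configuration properties; a minimal sketch of toggling it in a MapReduce driver (the property names are the standard Hadoop 2.x ones):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Speculative execution is on by default; these properties enable or
        // disable it for map and reduce tasks independently.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculative-execution-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```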
15) How do you achieve HA (High Availability) in a Hadoop cluster?
You can set up HA in two different ways: using the Quorum Journal Manager (QJM), or using NFS for the shared storage. To understand this in detail, read this Edureka blog: http://www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
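For reference, the key hdfs-site.xml properties of a QJM-based setup can also be expressed programmatically; this is a sketch assuming a hypothetical nameservice named mycluster, NameNodes nn1/nn2 and three JournalNodes (all host names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;

public class QjmHaConfigSketch {
    public static Configuration haConfiguration() {
        Configuration conf = new Configuration();

        // Logical name of the HA nameservice and its two NameNodes (hypothetical hosts).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Quorum Journal Manager: the shared edit log is written to a quorum of JournalNodes.
        conf.set("dfs.namenode.shared.edits.dir",
                "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");

        // Clients use this proxy provider to fail over between the two NameNodes.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```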
16) Why is it that in HDFS, “Reading” is performed in parallel but “Writing” is not?
When reading, a MapReduce program can process a file in parallel because its blocks are independent and can be read from different DataNodes at the same time. Writing cannot be parallelised in the same way: HDFS allows only one writer per file, and blocks are written one after another through a replication pipeline, because the incoming values are not yet known to the system and the file must be assembled in order.
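To illustrate why reads parallelise naturally, here is a sketch in which each “task” seeks to its own block offset and reads that region independently (the file path is a placeholder; real map tasks get their offset from the input split instead of a command-line argument):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadOneBlock {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/large-file.txt");          // placeholder path
        long blockSize = fs.getFileStatus(file).getBlockSize();
        int blockIndex = Integer.parseInt(args[0]);            // which block this reader handles

        byte[] buffer = new byte[8192];
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(blockIndex * blockSize);    // jump straight to "our" block
            int read = in.read(buffer);         // independent of every other reader
            System.out.printf("Read %d bytes starting at offset %d%n",
                    read, blockIndex * blockSize);
        }
    }
}
```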
17) How can I restart “Namenode”?
- Run stop-all.sh and then start-all.sh to stop and restart all the Hadoop daemons, OR
- Switch to the hdfs user (sudo su - hdfs), then run /etc/init.d/ha followed by /etc/init.d/hadoop-x.x-namenode start to start the “Namenode” daemon on its own.
18) What happens if you get a “Connection refused” Java exception when you try to access HDFS or its corresponding files?
It could mean that the “Namenode” is not running on your VM. The “Namenode” may be in “Safemode”, or the IP address of the “Namenode” may have changed.
19) What is “commodity hardware”? Does it include RAM?
“Commodity hardware” is inexpensive hardware that does not offer premium quality or high availability. Hadoop can be installed on any average commodity hardware; we don’t need supercomputers or high-end hardware to work with Hadoop. Yes, commodity hardware includes RAM, because some services will still be running in RAM.
20) What is the difference between an “HDFS Block” and an “Input Split”?
An “HDFS Block” is the physical division of the data on disk, while an “Input Split” is the logical division of the data that MapReduce uses to assign work to map tasks. By default one split corresponds to one block, but the split size can be configured independently of the block size.
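A minimal sketch showing that the logical split size is a job-time setting, independent of the physical block size that was fixed when the file was written (the input path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));     // placeholder path

        // The HDFS block size (physical division) was fixed when the file was written.
        // The input split size (logical division) is computed at job-submission time
        // and can be tuned independently of the block size:
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
    }
}
```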