Implementing the tool interface
Posted by: Sushanth
Thursday, 24 December 2015
Implementing the Tool interface for a MapReduce driver:
The downside of using a static main method for a MapReduce driver is that the configuration properties are hardcoded. To modify a configuration property such as the number of reducers, the code needs to be modified, the jar file rebuilt, and the application redeployed. This can be avoided by implementing the Tool interface in the MapReduce driver code.
Hadoop Configuration:
By implementing the Tool interface and extending the Configured class, the Hadoop Configuration object can be set via the GenericOptionsParser, and thus through the command line interface. This makes the code more portable (and slightly cleaner), as it no longer needs to be hardcoded to any specific configuration.

Without Tool interface:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ToolMapReduce {

    public static void main(String[] args) throws Exception {
        // Create configuration
        Configuration conf = new Configuration();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Set only 1 reduce task
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}

The above MapReduce job can be executed by passing two arguments, inputPath and outputPath, located at index [0] and [1] respectively of your main method's String array.
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce /input/path /output/path

In that case, the number of reducers (1) is hardcoded by the job.setNumReduceTasks(1) call and therefore cannot be modified on demand.
With Tool interface:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolMapReduce extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        // When implementing Tool, get the Configuration prepared by ToolRunner
        Configuration conf = this.getConf();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        // Do not specify the number of reducers
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job and return status
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

ToolRunner executes the MapReduce job through its static run method.
In this example we no longer need to hardcode the number of reducers, as it can be specified directly from the CLI (using the -D option).
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce -D mapred.reduce.tasks=1 /input/path /output/path

Note that the inputPath and outputPath arguments still need to be supplied. GenericOptionsParser separates the generic tool options from the actual job arguments. However many generic options you supply, the inputPath and outputPath values will still be located at index [0] and [1], but in your run method's String array (not in your main method's).
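To make this separation concrete, here is a minimal sketch (the ArgsEcho class and its printouts are illustrative additions, not part of the original post) that simply echoes whatever arguments remain after ToolRunner has consumed the generic options:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ArgsEcho extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Invoked as: hadoop jar myjar.jar ArgsEcho -D mapred.reduce.tasks=5 /in /out
        // this prints [/in, /out]: the -D option has already been stripped...
        System.out.println("Job arguments: " + Arrays.toString(args));

        // ...and applied to the Configuration held by Configured:
        System.out.println("Reducers: " + getConf().get("mapred.reduce.tasks"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ArgsEcho(), args));
    }
}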
This -D option can be used for any "official" or custom property value.
conf.set("my.dummy.configuration", "foobar");

becomes now...
-D my.dummy.configuration=foobar
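On the reading side, the property is retrieved from the job Configuration regardless of how it was set. Here is a minimal sketch (the DummyMapper class is a hypothetical illustration, not from the original post) of a mapper reading it at runtime:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DummyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private String dummy;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Works whether the value came from conf.set(...) in the driver
        // or from "-D my.dummy.configuration=foobar" on the CLI;
        // "unset" is a fallback default.
        dummy = context.getConfiguration().get("my.dummy.configuration", "unset");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Prefix each output value with the configured property
        context.write(key, new Text(dummy + ":" + value));
    }
}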
HDFS and JobTracker properties

When I need to submit a jar file remotely to a distant Hadoop server, I need to specify the below properties in my driver code:

Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "myserver.com:8021");
conf.set("fs.default.name", "hdfs://myserver.com:8020");

With the Tool implementation, the same can be supplied from the CLI:

hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs hdfs://myserver.com:8020 -jt myserver.com:8021

The jar file is now 100% portable, and can be executed both locally and remotely without having to hardcode any specific value.
Generic options supported
Some additional useful options can be supplied from the CLI:
- -conf : specify an application configuration file
- -D : use value for given property
- -fs : specify a namenode
- -jt : specify a job tracker
- -files : specify comma separated files to be copied to the MapReduce cluster
- -libjars : specify comma separated jar files to include in the classpath
- -archives : specify comma separated archives to be unarchived on the compute machines
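For illustration, several of these generic options can be combined in a single invocation (the configuration file and file names below are hypothetical); note that the generic options must appear before the job's own arguments:

hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce \
  -conf my-app-site.xml \
  -D mapred.reduce.tasks=5 \
  -files lookup.txt \
  -libjars extra-lib.jar \
  /input/path /output/path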
