- Implementing the tool interface
Posted by : Sushanth
Thursday, 24 December 2015
Implementing the Tool interface for a MapReduce driver:
The downside of using a static main method for a MapReduce driver is that the configuration properties are hardcoded. To modify a property such as the number of reducers, the code has to be modified, the jar file rebuilt, and the application redeployed. This can be avoided by implementing the Tool interface in the MapReduce driver code.
Hadoop Configuration:
By implementing the Tool interface and extending the Configured class, the Hadoop Configuration object can be set via GenericOptionsParser, and thus through the command-line interface. This makes the code more portable (and slightly cleaner), as it no longer needs to be hardcoded to any specific configuration.
Without the Tool interface:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ToolMapReduce {

    public static void main(String[] args) throws Exception {
        // Create configuration
        Configuration conf = new Configuration();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Set only 1 reduce task
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}
The above MapReduce job can be executed by passing two arguments, inputPath and outputPath, located at index [0] and [1] respectively of the main method's String array.
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce /input/path /output/path
In that case, the number of reducers (1) is hardcoded via job.setNumReduceTasks(1) and therefore cannot be modified on demand.
With Tool interface:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolMapReduce extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        // When implementing Tool, get the Configuration from Configured
        Configuration conf = this.getConf();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        // Do not specify the number of reducers here
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job and return status
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
ToolRunner executes the MapReduce job through its static run method.
In this example we do not need to hardcode the number of reducers anymore as it can be specified directly from the CLI (using the “-D” option).
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce -D mapred.reduce.tasks=1 /input/path /output/path
Note that the inputPath and outputPath arguments still need to be supplied. GenericOptionsParser separates the generic Tool options from the actual job arguments. However many generic options you supply, the inputPath and outputPath values will still be located at index [0] and [1], but in the run method's String array (not the main method's).
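To make that separation concrete, here is a minimal JDK-only sketch of the kind of splitting GenericOptionsParser performs. This toy version only handles -D key=value pairs, and the class and method names are invented for the example; it is an illustration of the idea, not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of how generic options (-D key=value) get separated
// from the job's own positional arguments. NOT the real Hadoop parser.
public class ArgSplitSketch {

    // Stand-in for the Hadoop Configuration the parser would populate
    static Map<String, String> props = new LinkedHashMap<>();

    static String[] split(String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                props.put(kv[0], kv[1]);   // would go into the Configuration
            } else {
                remaining.add(args[i]);    // positional args passed to run()
            }
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] cli = { "-D", "mapred.reduce.tasks=4", "/input/path", "/output/path" };
        String[] jobArgs = split(cli);
        // After splitting, the paths sit at index 0 and 1 again,
        // no matter how many -D options preceded them.
        System.out.println(props.get("mapred.reduce.tasks")); // 4
        System.out.println(jobArgs[0]); // /input/path
        System.out.println(jobArgs[1]); // /output/path
    }
}
```

This is why the run method can keep reading args[0] and args[1] regardless of how many generic options precede them on the command line.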
This -D option can be used for any “official” or custom property values.
conf.set("my.dummy.configuration","foobar");
becomes
-D my.dummy.configuration=foobar
HDFS and JobTracker properties
When submitting a jar file remotely to a distant Hadoop server, the following properties would otherwise have to be set in the driver code:

Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "myserver.com:8021");
conf.set("fs.default.name", "hdfs://myserver.com:8020");

With the Tool interface, the same can be specified from the command line using the -fs and -jt generic options:

hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs hdfs://myserver.com:8020 -jt myserver.com:8021
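As a simplified illustration of what those two flags expand to, the JDK-only sketch below (invented class name, not Hadoop code) maps -fs and -jt to the classic property names used in this post:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration: -fs and -jt are shorthand for setting
// fs.default.name and mapred.job.tracker in the Configuration.
// This mimics the idea behind GenericOptionsParser; not the real code.
public class GenericFlagSketch {

    static Map<String, String> toConf(String[] args) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i++) {
            if ("-fs".equals(args[i])) conf.put("fs.default.name", args[i + 1]);
            if ("-jt".equals(args[i])) conf.put("mapred.job.tracker", args[i + 1]);
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = toConf(new String[]{
            "-fs", "hdfs://myserver.com:8020", "-jt", "myserver.com:8021"});
        System.out.println(conf.get("fs.default.name"));    // hdfs://myserver.com:8020
        System.out.println(conf.get("mapred.job.tracker")); // myserver.com:8021
    }
}
```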
Using the Tool implementation, the jar file is now fully portable, and can be executed either locally or remotely without having to hardcode any environment-specific values.
Generic options supported
Some additional useful options can be supplied from the CLI:

- -conf: specify an application configuration file
- -D: use the given value for the given property
- -fs: specify a namenode
- -jt: specify a job tracker
- -files: specify comma-separated files to be copied to the MapReduce cluster
- -libjars: specify comma-separated jar files to include in the classpath
- -archives: specify comma-separated archives to be unarchived on the compute machines