Windows + Eclipse
(1) Create a Map/Reduce project
Open Eclipse and click File --> New --> Other --> Map/Reduce Project, then follow the wizard to create a Map/Reduce project. Unlike an ordinary project, once the Map/Reduce project is created, the required Hadoop dependency jars are added automatically from the Hadoop installation directory. As shown:
(2) Take developing a wordcount program as an example. As shown:
TokenizerMapper:
package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reusable writables: Hadoop serializes values when they are written,
    // so a single instance can safely be reused for every output record.
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit <word, 1> for each token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
IntSumReducer:
package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts for this word and emit <word, total>.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Main program WordCountMain:
package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCountMain {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (-D, -fs, ...) and keep the path arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountMain.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner: summing is associative and commutative.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // All arguments except the last are input paths; the last is the output directory.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
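As a quick sanity check of the expected behavior: given an input data.txt containing

hello world
hello hadoop

the job writes one tab-separated pair per line, sorted by key:

hadoop	1
hello	2
world	1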
(3) Set the main arguments
Click Run --> Run Configurations and select the Arguments tab. If you use a local input directory and output directory, fill them in as shown:
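For example, a local run needs just two program arguments, an input path followed by an output path (hypothetical Windows paths):

E:\input\data.txt E:\output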
If you use HDFS, fill in:
hdfs://bigdata111:9000/input/data.txt hdfs://bigdata111:9000/output
Note that the output directory must not exist beforehand (HDFS does not allow overwriting it). Click Run, and the console turns red with the following error:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode(NativeIO.java:524)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:478)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:531)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:305)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:133)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:144)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at wordcount.WordCountMain.main(WordCountMain.java:37)
Another pitfall. The error occurs in a Native Method, and native methods are tightly coupled to the operating system: Hadoop supports Linux well out of the box, while on Windows problems like this are common. For development, a Linux environment is recommended, because when something goes wrong on Windows it can be hard to tell whether it is a Windows issue or a bug in the program itself. For Windows-related problems, the checklist is: has the bin directory been replaced with the bin directory from a Windows build of Hadoop, has hadoop.dll been copied into C:\Windows\System32, and are the Hadoop environment variables configured correctly? A programmatic workaround is also sketched below.
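If you prefer not to rely on environment variables, a common workaround is to point Hadoop at the Windows binaries in code, before the Configuration is created. The path below is a placeholder; it must be a Hadoop directory whose bin folder contains winutils.exe and hadoop.dll:

// Hypothetical install path; its bin folder must contain winutils.exe and hadoop.dll.
System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.3");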
Sometimes, clicking the arrow in front of the little elephant (the DFS Locations node), as shown, reports the following error:
An internal error occurred during: "Map/Reduce location status updater".
org/apache/hadoop/mapred/JobConf
This problem is not serious; the cause is probably that Eclipse is not ready yet or is still running a MapReduce job. If, on the other hand, the program just waits indefinitely with no response when you run it, as shown:
the cause is probably a failed connection to the cluster; change Host from the hostname used earlier to the IP address, as shown:
Windows + IDEA
Eclipse provides a basic MapReduce development environment for us: it connects to the cluster automatically, imports the Hadoop dependency jars, and lets you browse and delete HDFS files through DFS Locations. In IDEA, development works just like an ordinary project; to connect to the cluster, for example, you need to add the following code in WordCountMain:
conf.set("fs.defaultFS", "hdfs://192.168.128.111:9000");
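For context, a minimal sketch of where this line goes in main(), reusing the conf variable from the listing above:

Configuration conf = new Configuration();
// Point the client at the remote NameNode so that paths like /input/data.txt
// resolve against the cluster's HDFS instead of the local file system.
conf.set("fs.defaultFS", "hdfs://192.168.128.111:9000");
Job job = Job.getInstance(conf, "word count");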
Linux + Eclipse
The MapReduce development flow on Linux is the same as on Windows. As shown:
The whole development flow goes very smoothly.
Debugging
Take Windows + Eclipse as an example.
(1) Set breakpoints
(2) Debug As
(3) Inspect variable values
Hover the mouse over a variable to see its value. As shown:
Deployment
Package the program as an executable jar and submit it to the cluster; for example, package the program above as wc.jar. As shown:
Run the following command:
hadoop jar wc.jar wordcount.WordCountMain /input/data.txt /output
Note that the class name must be fully qualified with its package, while the input and output paths no longer need the explicit HDFS address (e.g. hdfs://192.168.128.111:9000), since the cluster's own configuration supplies it.
Although deployment is simple, small issues come up frequently; for example, if the main class was already set when the jar was exported (taken from Eclipse's Run Configurations), wordcount.WordCountMain can be omitted from the command. A complete run is sketched below.
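Putting it all together, a typical deployment run looks like this (using the paths from above; part-r-00000 is the default name of the first reducer's output file):

hdfs dfs -rm -r /output          # the output directory must not already exist
hadoop jar wc.jar wordcount.WordCountMain /input/data.txt /output
hdfs dfs -cat /output/part-r-00000   # inspect the result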