1. Configure the Hadoop environment variables locally
Add a system variable named HADOOP_HOME whose value is the directory where the hadoop-2.6.0.rar archive was extracted.
Append %HADOOP_HOME%\bin to the PATH system variable.
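To verify that a newly started JVM actually sees these variables (restart IDEA after changing them), a minimal check along the following lines can be run; the class name EnvCheck is just an illustration:
public class EnvCheck {
    public static void main(String[] args) {
        // HADOOP_HOME should point at the extracted Hadoop directory
        System.out.println("HADOOP_HOME = " + System.getenv("HADOOP_HOME"));
        // PATH should contain %HADOOP_HOME%\bin so that winutils.exe can be found
        System.out.println("PATH = " + System.getenv("PATH"));
    }
}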
2. Create a new Maven project
Open IDEA and click "File" → "New" → "Project". Select Maven on the left, check "Create from archetype" at the top, choose org.apache.maven.archetypes:maven-archetype-quickstart from the list below, and click "Next". Once the project is created, create a resources directory under src/main in the Project panel.
3. Copy core-site.xml and hdfs-site.xml from the hadoop/hadoop-2.7.7/etc/hadoop directory of the remote cluster's Hadoop installation, using an SFTP client such as Xftp, into the src/main/resources directory created above (drag and drop works). Then move the downloaded log4j.properties file into src/main/resources as well, so that log output is not suppressed.
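These files need to be on the classpath because new Configuration() loads core-site.xml and hdfs-site.xml from there automatically; that is how the client code learns the address of the remote NameNode. A quick sanity check (ConfCheck is an illustrative name; the printed value depends on your cluster):
import org.apache.hadoop.conf.Configuration;

public class ConfCheck {
    public static void main(String[] args) {
        // core-site.xml and hdfs-site.xml are picked up from the classpath automatically
        Configuration conf = new Configuration();
        // fs.defaultFS should show the remote NameNode address, e.g. hdfs://<host>:9000
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}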
4. Set up the pom file
Replace the project's own pom.xml with the pom.xml below (drag and drop works). Change the version numbers in it (JDK, Hadoop, and so on) to the versions installed on your machine.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.neu</groupId>
  <artifactId>MapreduceTrain</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.7.3</hadoop.version>
    <jdkLevel>1.8</jdkLevel>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>cn.hutool</groupId>
      <artifactId>hutool-all</artifactId>
      <version>4.1.7</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.2</version>
        <configuration>
          <source>${jdkLevel}</source>
          <target>${jdkLevel}</target>
          <showDeprecation>true</showDeprecation>
          <showWarnings>true</showWarnings>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <!-- Record the main class in the jar manifest; this must match
                     the package of the WordCount class below (org.example) -->
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>org.example.WordCount</mainClass>
                </transformer>
              </transformers>
              <filters>
                <!-- Strip signature files from dependencies; otherwise the
                     shaded jar can fail signature verification -->
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
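With the shade plugin bound to the package phase as above, running mvn package should produce a self-contained jar under target/ with the mainClass recorded in its manifest, so the same project can also be submitted on the cluster via hadoop jar target/MapreduceTrain-0.0.1-SNAPSHOT.jar <input> <output> (the jar name follows from the artifactId and version in this pom).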
5. Write the WordCount program
The WordCount program is shown below.
package org.example;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer st = new StringTokenizer(value.toString());
            while (st.hasMoreTokens()) {
                word.set(st.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // The HDFS user to act as on the remote cluster; change to your own user
        System.setProperty("HADOOP_USER_NAME", "CNDCJEKINS01");
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: WordCount <in> <out>");
            System.exit(2);
        }
        // Delete the output directory if it already exists; otherwise the job
        // fails because MapReduce refuses to overwrite existing output
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path(otherArgs[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
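As a quick sanity check of the data flow (hypothetical input): for an input line "hello world hello", Map emits (hello, 1), (world, 1), (hello, 1); after the shuffle, Reduce receives hello → [1, 1] and world → [1] and writes hello 2 and world 1. Reduce can double as the combiner (setCombinerClass above) because the per-word sums are associative and commutative, so pre-aggregating on the map side does not change the final result.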
6. Configure the parameters (input and output)
The WordCount.java code reads two parameter values, so they must be supplied at run time:
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
Click "Edit Configurations" in the upper-right corner of IDEA. In "Main class", enter the fully qualified name of the WordCount class; in "Program arguments", enter the program's two arguments, i.e. the input path and the output path.
Taking my Program arguments as an example: /user/hadoop/input_wordcount /user/hadoop/output/temp. The temp directory must not already exist under my output directory, since MapReduce will not write into an existing output directory (the fs.delete call in main guards against this).
7. Run the program
Click Run. If the error org.apache.hadoop.security.AccessControlException: Permission denied: user=... appears, add the line System.setProperty("HADOOP_USER_NAME", "root"); as the first statement of the main method, where root is the name of the user that Hadoop runs as on the remote virtual machine; fill in your own value as appropriate. A successful run ends with the normal job-completion log.
8. View the output files
Connect to the virtual machine hosting the Hadoop cluster with a terminal emulator such as XShell and check the program's results.
1) View the input files
hadoop fs -ls /user/hadoop/input_wordcount
2) View the generated files
hadoop fs -ls /user/hadoop/output/temp
hadoop fs -cat /user/hadoop/output/temp/part-r-00000
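The -cat command prints one word per line followed by its count, tab-separated. With the hypothetical input line from step 5 ("hello world hello"), it would print:
hello	2
world	1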
9. Problems encountered along the way
Running the program raised the following error:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
1) The fix in the IntelliJ editor: recreate the NativeIO class locally, change the return value of one method, and let the local copy shadow the NativeIO class in the Hadoop jar, as shown below. First press Shift twice, search for NativeIO, and choose "Download Sources" to fetch the source of the class.
2) Then create a new NativeIO class under the project's java directory, in the same package as the original (org.apache.hadoop.io.nativeio), select the entire NativeIO source with Ctrl+A, and paste it into the newly created class. Because project sources take precedence over the jar on the classpath, this local copy overrides the shipped one.
3) In the local NativeIO class, find the method that returns the result of access0 and change it to return true.
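In Hadoop 2.7.x the method in question lives in the Windows inner class and looks roughly like the sketch below (a minimal illustration of the change, not the full source file; it only compiles inside the pasted NativeIO copy, where AccessRight is defined):
    // Inside org.apache.hadoop.io.nativeio.NativeIO.Windows in the local copy:
    public static boolean access(String path, AccessRight desiredAccess)
            throws IOException {
        return true; // originally: return access0(path, desiredAccess.accessRight());
    }
Skipping the native access0 call avoids loading hadoop.dll, which is exactly what fails on a local Windows machine without the native libraries.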
With this change in place, running the WordCount program again succeeds and the error is gone.