1. Configure the Hadoop environment variables locally

Add a system variable named HADOOP_HOME whose value is the directory where the hadoop-2.6.0.rar archive was extracted.

Append %HADOOP_HOME%\bin to the system variable named PATH.
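
To check that the variables took effect, open a new command prompt and run something like the following (a sketch assuming a Windows setup; hadoop version additionally needs JAVA_HOME to point at a JDK):

echo %HADOOP_HOME%
hadoop version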

2. Create a new Maven project

Open IDEA and click "File" → "New" → "Project". Select Maven on the left, check "Create from archetype" at the top, choose org.apache.maven.archetypes:maven-archetype-quickstart from the list below, and click "Next". Once the project is created, create a new directory named resources under src/main in the Project panel.
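
After these steps, the project layout should look roughly like this (the files listed under resources are added in step 3 below):

MapreduceTrain/
├── pom.xml
└── src/
    └── main/
        ├── java/
        └── resources/    <- core-site.xml, hdfs-site.xml, log4j.properties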

3. From the remote cluster's Hadoop installation, copy the core-site.xml and hdfs-site.xml files in the hadoop/hadoop-2.7.7/etc/hadoop directory using an SFTP client such as Xftp, and move them into the src/main/resources directory created above (drag and drop works). Then move the downloaded log4j.properties file into src/main/resources as well (without it, no log output is produced).
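
If you do not have a log4j.properties file at hand, a minimal console configuration such as the following also works (a generic log4j 1.x setup, not the contents of any particular download):

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n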

4. Set up the pom.xml file
Overwrite the project's own pom.xml with the pom.xml below (drag and drop works). Change the version numbers in the file (JDK, Hadoop, etc.) to match the versions installed on your own machine.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.neu</groupId>
  <artifactId>MapreduceTrain</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.7.3</hadoop.version>
    <jdkLevel>1.8</jdkLevel>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>cn.hutool</groupId>
      <artifactId>hutool-all</artifactId>
      <version>4.1.7</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.2</version>
        <configuration>
          <source>${jdkLevel}</source>
          <target>${jdkLevel}</target>
          <showDeprecation>true</showDeprecation>
          <showWarnings>true</showWarnings>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <transformer
                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <!-- must match the package and class name of your main class -->
                  <mainClass>org.example.WordCount</mainClass>
                </transformer>
              </transformers>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
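
With this pom in place, the shade plugin bundles all dependencies into a runnable jar during the package phase, so a plain Maven build is enough to produce a jar that can be submitted to the cluster (the jar name follows the artifactId and version above):

mvn clean package
# produces target/MapreduceTrain-0.0.1-SNAPSHOT.jar with dependencies bundled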

5. Write the WordCount program

Below is the WordCount program:

package org.example;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;


public class WordCount {
	// Mapper: splits each input line into tokens and emits a (word, 1) pair per token.
	public static class Map extends Mapper<Object, Text, Text, IntWritable> {
		private static final IntWritable one = new IntWritable(1);
		private final Text word = new Text();

		@Override
		public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			StringTokenizer st = new StringTokenizer(value.toString());
			while (st.hasMoreTokens()) {
				word.set(st.nextToken());
				context.write(word, one);
			}
		}
	}

	// Reducer (also used as the combiner): sums the counts for each word.
	public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
		private static final IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String[] args) throws Exception {
		// User name used for HDFS access; replace with your own (see step 7).
		System.setProperty("HADOOP_USER_NAME", "CNDCJEKINS01");
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (otherArgs.length != 2) {
			System.err.println("Usage: WordCount <in> <out>");
			System.exit(2);
		}
		// Delete the output directory if it already exists, so the job can be rerun.
		Path outPath = new Path(otherArgs[1]);
		if (fs.exists(outPath)) {
			fs.delete(outPath, true);
		}
		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(Map.class);
		job.setCombinerClass(Reduce.class);
		job.setReducerClass(Reduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
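
Besides running it from IDEA, the packaged jar from step 4 can also be submitted on the cluster itself; a sketch, assuming the shaded jar built by mvn package and the paths used in step 6:

hadoop jar MapreduceTrain-0.0.1-SNAPSHOT.jar /user/hadoop/input_wordcount /user/hadoop/output/temp

Because the shade plugin writes the main class into the jar's manifest, no class name needs to be given on the command line.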

6. Configure the program arguments (input and output)
The WordCount.java code reads two argument values, so they need to be configured:

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

Click "Edit Configurations" in the upper-right corner of IDEA.

In the Main class field, enter the fully qualified name of the WordCount class; in Program arguments, enter the program's two arguments, i.e. the input path and the output path.

Taking my Program arguments as an example: "/user/hadoop/input_wordcount" "/user/hadoop/output/temp". The temp directory must not already exist under my output directory (the sample code above also deletes it if present).

7. Run the program

Click Run. If the error org.apache.hadoop.security.AccessControlException: Permission denied: user=… appears, add System.setProperty("HADOOP_USER_NAME", "root"); as the first line of the main method, where root is the user name under which Hadoop runs on the remote virtual machine; fill in your own value accordingly. A successful run prints the job log to the console.

8. Check the output files

Connect to the virtual machine hosting the Hadoop cluster with a terminal emulator such as XShell and inspect the program's results.

1) View the input files

hadoop fs -ls /user/hadoop/input_wordcount

2) View the generated files (part-r-00000 below is the output file written by the single reducer)

hadoop fs -ls /user/hadoop/output/temp

hadoop fs -cat /user/hadoop/output/temp/part-r-00000

9. Problems encountered along the way

Running the program threw the following error:

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

1) Fix inside the IntelliJ editor: recreate the NativeIO class locally, change one method's return value, and let the new class override the one shipped with Hadoop, as shown below. First press Shift twice, search for NativeIO, and choose "Download Sources" to fetch the source code.

2) Then create a new NativeIO class under the project's java directory to override that source file; it must live in the same package, org.apache.hadoop.io.nativeio, or it will not shadow the class in the Hadoop jar. Press Ctrl+A in the NativeIO source to select it all and paste it into the newly created class, i.e. copy the whole source into this new class.

3) In the copied NativeIO class, find the method whose return value comes from access0 and change it to return true.
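
For reference, the edit ends up looking roughly like this inside the Windows inner class of the copied file (a sketch based on the Hadoop 2.7.x sources; signatures may differ slightly in other versions):

// In the copied org.apache.hadoop.io.nativeio.NativeIO.Windows class:
// the original body delegates to the native access0 method, which fails
// with UnsatisfiedLinkError when hadoop.dll is not available on Windows.
public static boolean access(String path, AccessRight desiredAccess)
        throws IOException {
    return true;  // skip the native permission check
    // was: return access0(path, desiredAccess.accessRight());
}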

At this point, rerun the WordCount program and the error is gone!