This article looks at how a MapReduce job is submitted to the cluster and what operations are carried out along the way. It uses a WordCount program as the example and traces the execution by reading the source code. The article consists of Java source code and annotations.

Overall steps:
step 1. Write class WordcountMapper and override the map method
step 2. Write class WordcountReducer and override the reduce method
step 3. Write class WordcountDriver

step1:

package com.atguigu.mapreduce.wordcountDemo.map;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//Map phase
//KEYIN: the type of the input key. The official example uses Object; here it is the byte offset of the line, so LongWritable
//VALUEIN: the type of the input value, one line of text, so Text
//KEYOUT: the type of the output key, a word, so Text
//VALUEOUT: the type of the output value, the count for a word, so IntWritable
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	Text k = new Text();
	IntWritable v = new IntWritable(1);
	@Override
	protected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
		// 1. Get one line
		String line = value.toString();//one line of input per call
		System.out.println("key: "+key+"\t\t\t"+"value: "+line);
 
		// 2. Split the line into words (on one or more spaces)
		String[] words = line.split(" +");
		// 3. Write the output
		for (String word : words) {
			k.set(word);
			context.write(k, v);//written to the in-memory buffer for the shuffle
		}
	}
}

step2:

package com.atguigu.mapreduce.wordcountDemo.reduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	int sum;
	IntWritable v = new IntWritable();
	/**
	 * All values with the same key are grouped into a single Iterable (values)
	 */
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
		// 1. Accumulate the sum
		sum = 0;
		for (IntWritable count : values) {
			sum += count.get();//System.out.println("count.get(): "+count.get());
		}
		// 2. Write the output
		v.set(sum);//System.out.println("output: "+key+" "+sum);
		context.write(key,v);
	}
}

step3:
args[0] is the input file path, e.g. e:\input.txt
args[1] is the output directory path, e.g. e:\output

package com.atguigu.mapreduce.wordcountDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.atguigu.mapreduce.wordcountDemo.map.WordcountMapper;
import com.atguigu.mapreduce.wordcountDemo.reduce.WordcountReducer;
public class WordcountDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Set the input and output paths according to the actual paths on your machine (this overrides any command-line arguments)
		args = new String[] { "F:/Test/input.txt", "F:/Test/output" };
		
		
		// 1. Get the configuration information and create the job
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);//get a Job instance
 
		// 2. Set the jar load path
		job.setJarByClass(WordcountDriver.class);//locates the jar via reflection; can be omitted when running locally
 
		// 3. Set the Mapper and Reducer classes
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
 
		// 4. Set the map output key/value types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
 
		// 5. Set the final output key/value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		//---------------------------------------------------
		// If no InputFormat is set, TextInputFormat.class is used by default
		//job.setInputFormatClass(CombineTextInputFormat.class);
		// set the maximum virtual-storage split size to 4 MB
		//CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);//4 MB
		//---------------------------------------------------
		
		// 6. Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
 
		// 7. Submit
		boolean result = job.waitForCompletion(true);
		//job.submit() only submits and returns immediately; waitForCompletion() submits and then waits for the job to finish
		
		System.exit(result ? 0 : 1);//exit code 0 on success, 1 on failure; this line is optional
	}
}


The args[0] input file might look like this:

wahaha wahaha
meinv meinv
shuaige
shitou
daxue

The result written under args[1] (the keys come out sorted) is as follows:

daxue 1
meinv 2
shitou 1
shuaige 1
wahaha 2

After the args[0] file passes through the map method, it becomes lines of "key value" pairs, where the key is of type Text and the value of type IntWritable, e.g. wahaha 1. The reduce stage writes its result under args[1]. If the map phase produces two "wahaha 1" pairs, they are merged in the reduce stage, which finally yields wahaha 2.

If the map phase has multiple MapTasks, they run in parallel; the same is true for the reduce phase. A quick way to see this is to ask for more than one ReduceTask, as sketched below.
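The following is a hypothetical addition to WordcountDriver (not part of the original example); the value 2 is arbitrary. The number of MapTasks, by contrast, is derived from the input splits rather than set directly.

// Hypothetical: place before job.waitForCompletion(true) in WordcountDriver
job.setNumReduceTasks(2);   // run two ReduceTasks in parallel
// each reducer then writes its own result file under args[1]:
// part-r-00000 and part-r-00001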


Starting from the job submission, boolean result = job.waitForCompletion(true), we step into this method. The source code is as follows:

/**
   * Submit the job to the cluster and wait for it to finish.
   * @param verbose print the progress to the user
   * @return true if the job succeeded
   * @throws IOException thrown if the communication with the 
   *         <code>JobTracker</code> is lost
   */
  public boolean waitForCompletion(boolean verbose ) throws IOException, InterruptedException, ClassNotFoundException {
    if (state == JobState.DEFINE) {//is the job still in the DEFINE state?
      submit();//submit; step in
    }//after submitting, print some progress/completion information
    if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }

Notes:

  1. JobState.DEFINE means the job has only been defined and not yet submitted, so submission may proceed. Later, when the job is submitted, the state is set to JobState.RUNNING.
  2. After submit(), the client prints some progress information: monitorAndPrintJob() when verbose is true, otherwise it just polls isComplete() until the job finishes. A minimal sketch of doing this by hand is shown below.
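For comparison, here is a hypothetical driver fragment that calls job.submit() directly and does the polling itself, which is essentially what the non-verbose branch above does (the 5-second interval is made up; the real one comes from Job.getCompletionPollInterval()):

job.submit();                              // returns immediately after submission
while (!job.isComplete()) {                // poll the cluster for completion
    Thread.sleep(5000);                    // hypothetical 5-second poll interval
}
System.exit(job.isSuccessful() ? 0 : 1);   // same exit convention as WordcountDriver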

Stepping into the submit() method, the source code is as follows:

/**
   * Submit the job to the cluster and return immediately.
   * @throws IOException
   */
  public void submit() throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);//check the state
    setUseNewAPI();//default to the new API
    connect();//establish the connection (local or YARN)
    final JobSubmitter submitter =  getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);//submits the detailed job information (set a breakpoint here)
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

Stepping into the ensureState() method, the source code is as follows:

private void ensureState(JobState state) throws IllegalStateException {
    if (state != this.state) {
      throw new IllegalStateException("Job in state "+ this.state + 
                                      " instead of " + state);
    }
  }

Note: ensureState() checks whether the job is in the expected JobState; if not, it throws an IllegalStateException and the method exits. For example, submitting the same Job object twice trips this check, as sketched below.
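A hypothetical illustration of this check: the second submission fails because the first call has already moved the state to RUNNING.

job.waitForCompletion(true);   // first call: state goes DEFINE -> RUNNING
job.submit();                  // second call: ensureState(JobState.DEFINE) throws
                               // IllegalStateException: Job in state RUNNING instead of DEFINE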

Stepping into the setUseNewAPI() method, the source code is as follows:

//Handle the old API
  //"translate" the old API settings into the new API for compatibility
  /**
   * Default to the new APIs unless they are explicitly set or the old mapper or
   * reduce attributes are used.
   * @throws IOException if the configuration is inconsistant
   */
  private void setUseNewAPI() throws IOException {
    int numReduces = conf.getNumReduceTasks();
    String oldMapperClass = "mapred.mapper.class";
    String oldReduceClass = "mapred.reducer.class";
    conf.setBooleanIfUnset("mapred.mapper.new-api",
                           conf.get(oldMapperClass) == null);
    if (conf.getUseNewMapper()) {
      String mode = "new map API";
      ensureNotSet("mapred.input.format.class", mode);
      ensureNotSet(oldMapperClass, mode);
      if (numReduces != 0) {
        ensureNotSet("mapred.partitioner.class", mode);
       } else {
        ensureNotSet("mapred.output.format.class", mode);
      }      
    } else {
      String mode = "map compatability";
      ensureNotSet(INPUT_FORMAT_CLASS_ATTR, mode);
      ensureNotSet(MAP_CLASS_ATTR, mode);
      if (numReduces != 0) {
        ensureNotSet(PARTITIONER_CLASS_ATTR, mode);
       } else {
        ensureNotSet(OUTPUT_FORMAT_CLASS_ATTR, mode);
      }
    }
    if (numReduces != 0) {
      conf.setBooleanIfUnset("mapred.reducer.new-api",
                             conf.get(oldReduceClass) == null);
      if (conf.getUseNewReducer()) {
        String mode = "new reduce API";
        ensureNotSet("mapred.output.format.class", mode);
        ensureNotSet(oldReduceClass, mode);   
      } else {
        String mode = "reduce compatability";
        ensureNotSet(OUTPUT_FORMAT_CLASS_ATTR, mode);
        ensureNotSet(REDUCE_CLASS_ATTR, mode);   
      }
    }   
  }

Note: the main purpose of setUseNewAPI() is compatibility with older versions: it maps the old API settings onto the new API and makes sure the two APIs are not mixed for the same job. A sketch of what the mixing check rejects is shown below.
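As a hypothetical example of what this guards against (the old-API class name is made up): if the old-API mapper key is present in the configuration, setUseNewAPI() falls into the "map compatability" branch, and ensureNotSet() then rejects the job because the new-API mapper class is also configured.

Configuration conf = new Configuration();
conf.set("mapred.mapper.class", "com.example.OldApiMapper");   // old-API setting (hypothetical class)
Job job = Job.getInstance(conf);
job.setMapperClass(WordcountMapper.class);                     // new-API setting
// On submit(), setUseNewAPI() sees the old key, keeps mapred.mapper.new-api = false,
// and ensureNotSet("mapreduce.job.map.class", "map compatability") throws an IOException.
job.waitForCompletion(true);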

Stepping into the connect() method, the source code is as follows:

//Establish the connection
  	//Depending on the runtime environment a different object is created: a LocalJobRunner locally, a YARNRunner on a cluster
    private synchronized void connect() throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) {//if the cluster handle has not been created yet, create it
      cluster = ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                   public Cluster run() throws IOException, InterruptedException,  ClassNotFoundException {
                     return new Cluster(getConfiguration());
                   }
                 });
    }
  }

Note: connect() establishes the connection. Depending on the runtime environment it ends up with a different object: a LocalJobRunner when running locally, a YARNRunner when running on a cluster. It first checks whether the cluster handle already exists; if cluster is null, it creates one inside ugi.doAs(new PrivilegedExceptionAction<Cluster>() {...}).

Stepping into the Cluster constructor, the source code is as follows:

public Cluster(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException {  // (set a breakpoint here)
    this.conf = conf;
    this.ugi = UserGroupInformation.getCurrentUser();
    initialize(jobTrackAddr, conf);//conf holds the settings from the files under Hadoop's etc/hadoop/ (e.g. yarn-site.xml)
  }

Note: conf represents the various configuration files under etc/hadoop; a small sketch of how they are picked up follows.
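A minimal sketch of how the client sees those files (the printed values are just examples): new Configuration() loads the *-default.xml files bundled with Hadoop plus any *-site.xml files found on the classpath.

Configuration conf = new Configuration();                             // loads *-default.xml plus *-site.xml from the classpath
System.out.println(conf.get("fs.defaultFS"));                         // e.g. hdfs://namenode:8020 on a cluster, file:/// locally
System.out.println(conf.get("mapreduce.framework.name", "local"));    // "yarn" on a cluster, "local" otherwise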

Stepping into the initialize() method, the source code is as follows:

//The main purpose of this method is to determine whether you are connecting to YARN or running locally; the object created differs with the environment
    private void initialize(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException {
    synchronized (frameworkLoader) { //lock (set a breakpoint here)
      for (ClientProtocolProvider provider : frameworkLoader) {
        LOG.debug("Trying ClientProtocolProvider : "+ provider.getClass().getName());
        ClientProtocol clientProtocol = null; //the client protocol is still null at this point
        try {
          if (jobTrackAddr == null) {
            clientProtocol = provider.create(conf);//create a client protocol; which one depends on where the job runs: a LocalJobRunner locally, the YARN one on a Hadoop cluster
          } else {
            clientProtocol = provider.create(jobTrackAddr, conf);
          }
 
          if (clientProtocol != null) {
            clientProtocolProvider = provider;
            client = clientProtocol;
            LOG.debug("Picked " + provider.getClass().getName() + " as the ClientProtocolProvider");
            break;
          }
          else {
            LOG.debug("Cannot pick " + provider.getClass().getName() + " as the ClientProtocolProvider - returned null protocol");
          }
        } 
        catch (Exception e) {
          LOG.info("Failed to use " + provider.getClass().getName() + " due to error: ", e);
        }
      }
    }
    if (null == clientProtocolProvider || null == client) { 
    throw new IOException( "Cannot initialize Cluster. Please check your configuration for " + MRConfig.FRAMEWORK_NAME  + " and the correspond server addresses.");
    }
  }

Notes:

  1. synchronized (frameworkLoader) is a lock; only one thread may enter at a time.
  2. clientProtocol is the client protocol being created. Which one is created depends on where the job runs: a LocalJobRunner when running locally, the YARN one when running on a Hadoop cluster. The selection is driven by the configuration, as sketched below.
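The provider that wins is selected by mapreduce.framework.name (MRConfig.FRAMEWORK_NAME). A minimal, hypothetical way to force each path from the driver:

Configuration conf = new Configuration();

// Local execution: the LocalClientProtocolProvider supplies a LocalJobRunner
conf.set("mapreduce.framework.name", "local");

// Cluster execution: the YarnClientProtocolProvider supplies a YARNRunner instead
// (this additionally requires the resource manager address from yarn-site.xml)
//conf.set("mapreduce.framework.name", "yarn");

Job job = Job.getInstance(conf);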

Back in submit(), we step into submitter.submitJobInternal(Job.this, cluster). The source code is as follows:

/**
   * Internal method for submitting jobs to the system.
   * 
   * <p>The job submission process involves:
   * <ol>
   *   <li>
   *   Checking the input and output specifications of the job.
   *   </li>
   *   <li>
   *   Computing the {@link InputSplit}s for the job.
   *   </li>
   *   <li>
   *   Setup the requisite accounting information for the 
   *   {@link DistributedCache} of the job, if necessary.
   *   </li>
   *   <li>
   *   Copying the job's jar and configuration to the map-reduce system
   *   directory on the distributed file-system. 
   *   </li>
   *   <li>
   *   Submitting the job to the <code>JobTracker</code> and optionally
   *   monitoring it's status.
   *   </li>
   * </ol></p>
   * @param job the configuration to submit
   * @param cluster the handle to the Cluster
   * @throws ClassNotFoundException
   * @throws InterruptedException
   * @throws IOException
   */
  JobStatus submitJobInternal(Job job, Cluster cluster) 
  throws ClassNotFoundException, InterruptedException, IOException {
 
    //validate the jobs output specs  //the output path is validated very early, before anything else happens
    checkSpecs(job);//step in
 
    Configuration conf = job.getConfiguration(); //configuration information (the eight .xml files)
    addMRFrameworkToDistributedCache(conf);//handle the distributed cache
 
    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);//each submission creates a temporary staging path whose data is deleted once the submission finishes; you can find it by searching tmp on your machine
    //configure the command line options correctly on the submitting dfs
    InetAddress ip = InetAddress.getLocalHost();
    if (ip != null) {
      submitHostAddress = ip.getHostAddress();//the submitting host's IP
      submitHostName = ip.getHostName();
      conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
      conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
    }
    JobID jobId = submitClient.getNewJobID();//the jobId for this submission; every job is assigned an id when it is submitted
    job.setJobID(jobId);
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());//the submission path
    JobStatus status = null;
    try {
      conf.set(MRJobConfig.USER_NAME,
          UserGroupInformation.getCurrentUser().getShortUserName());
      conf.set("hadoop.http.filter.initializers", 
          "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
      conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
      LOG.debug("Configuring job " + jobId + " with " + submitJobDir 
          + " as the submit dir");
      // get delegation token for the dir
      TokenCache.obtainTokensForNamenodes(job.getCredentials(),
          new Path[] { submitJobDir }, conf);
      
      populateTokenCache(conf, job.getCredentials());
 
      // generate a secret to authenticate shuffle transfers
      if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
        KeyGenerator keyGen;
        try {
          keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
          keyGen.init(SHUFFLE_KEY_LENGTH);
        } catch (NoSuchAlgorithmException e) {
          throw new IOException("Error generating shuffle secret key", e);
        }
        SecretKey shuffleKey = keyGen.generateKey();
        TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
            job.getCredentials());
      }
      if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
        conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
        LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
                "data spill is enabled");
      }
 
      copyAndConfigureFiles(job, submitJobDir);//copy the job's files to the submit dir; step in (set a breakpoint in advance)
 
      Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
      
      // Create the splits for the job
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
      int maps = writeSplits(job, submitJobDir);//all the split information is written to the submitJobDir path
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);
 
      // write "queue admins of the queue to which job is being submitted"
      // to job file.
      String queue = conf.get(MRJobConfig.QUEUE_NAME,
          JobConf.DEFAULT_QUEUE_NAME);
      AccessControlList acl = submitClient.getQueueAdmins(queue);
      conf.set(toFullPropertyName(queue,
          QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
 
      // removing jobtoken referrals before copying the jobconf to HDFS
      // as the tasks don't need this setting, actually they may break
      // because of it if present as the referral will point to a
      // different job.
      TokenCache.cleanUpTokenReferral(conf);
 
      if (conf.getBoolean(
          MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
          MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
        // Add HDFS tracking ids
        ArrayList<String> trackingIds = new ArrayList<String>();
        for (Token<? extends TokenIdentifier> t :
            job.getCredentials().getAllTokens()) {
          trackingIds.add(t.decodeIdentifier().getTrackingId());
        }
        conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
            trackingIds.toArray(new String[trackingIds.size()]));
      }
 
      // Set reservation info if it exists
      ReservationId reservationId = job.getReservationId();
      if (reservationId != null) {
        conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
      }
 
      // Write job file to submit dir
      writeConf(conf, submitJobFile);//write job.xml to your submit directory; step in
      
      //
      // Now, actually submit the job (using the submit name)
      //
      printTokens(jobId, job.getCredentials());
      status = submitClient.submitJob(
          jobId, submitJobDir.toString(), job.getCredentials());
      if (status != null) {
        return status;
      } else {
        throw new IOException("Could not launch job");
      }
    } finally {
      if (status == null) {
        LOG.info("Cleaning up the staging area " + submitJobDir);
        if (jtFs != null && submitJobDir != null)
          jtFs.delete(submitJobDir, true);
 
      }
    }
  }

Notes:

  1. addMRFrameworkToDistributedCache(conf) handles the distributed cache.
  2. Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf): each submission creates a temporary staging path, and its data is deleted once the submission is finished. Locally it can be found under tmp/hadoop-administrator/mapred/staging.
  3. submitHostAddress = ip.getHostAddress(): when running locally, this is the local IP.
  4. JobID jobId = submitClient.getNewJobID(): the job id for this submission; every job is assigned an id when it is submitted.
  5. int maps = writeSplits(job, submitJobDir): all the split information is written to the submitJobDir path. If you open submitJobDir you will find the split information stored as files (see the sketch after this list).
  6. writeConf(conf, submitJobFile) uses an I/O stream to write the xml configuration of the job you are submitting.
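As a hypothetical way to inspect that directory while the job is paused at a breakpoint (assumes the usual org.apache.hadoop.fs imports; the staging path shown is made up, so substitute the jobId and user name from your own run):

FileSystem fs = FileSystem.get(new Configuration());
Path submitJobDir = new Path("/tmp/hadoop-administrator/mapred/staging/administrator/.staging/job_1234567890123_0001");   // hypothetical path
for (FileStatus f : fs.listStatus(submitJobDir)) {
    System.out.println(f.getPath().getName());   // typically job.split, job.splitmetainfo, job.xml (and job.jar when submitting to a cluster)
}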

Stepping into the checkSpecs(job) method, the source code is as follows:

//validate the output path
    private void checkSpecs(Job job) throws ClassNotFoundException,InterruptedException, IOException { //step in
    JobConf jConf = (JobConf)job.getConfiguration();
    // Check the output specification
    if (jConf.getNumReduceTasks() == 0 ? 
        jConf.getUseNewMapper() : jConf.getUseNewReducer()) {
      org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
        ReflectionUtils.newInstance(job.getOutputFormatClass(),job.getConfiguration());
      output.checkOutputSpecs(job);//ensures the output path is set and does not already exist, otherwise an exception is thrown; step in (set a breakpoint in advance)
    } else {
      jConf.getOutputFormat().checkOutputSpecs(jtFs, jConf);
    }
  }

Note: output.checkOutputSpecs(job) makes sure the output path has been set; if the output path already exists, an exception is thrown. A common way to avoid this during repeated test runs is sketched below.
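As a hypothetical convenience for repeated local test runs (not part of the walkthrough; it assumes an org.apache.hadoop.fs.FileSystem import in the driver), WordcountDriver can delete a stale output directory before submitting so this check passes:

// Hypothetical addition to WordcountDriver, before FileOutputFormat.setOutputPath(job, new Path(args[1])):
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(configuration);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true);   // recursively remove the old output directory
}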