This article looks at how a MapReduce program is submitted to the cluster as a job and what operations are performed along the way. It uses a WordCount program as the example and walks through the source code to see how the program executes. The article consists of Java source code and annotations.
Overall steps:
step 1. Write class WordcountMapper and override the map method
step 2. Write class WordcountReducer and override the reduce method
step 3. Write class WordcountDriver
step1:
package com.atguigu.mapreduce.wordcountDemo.map;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//Map phase
//KEYIN: the type of the input key; the official example uses Object, here it is the byte offset of the line, so LongWritable
//VALUEIN: the type of the input value (one line of text), so Text
//KEYOUT: the type of the output key, Text
//VALUEOUT: the type of the output value, so IntWritable
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
Text k = new Text();
IntWritable v = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
// 1 Get one line
String line = value.toString();// one line is read per call to map()
System.out.println("key: "+key+"\t\t\t"+"value: "+line);
// 2 Split the line into words
String[] words = line.split(" +");
// 3 Emit each word with a count of 1
for (String word : words) {
k.set(word);
context.write(k, v);// written to the in-memory buffer
}
}
}
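To see what this map() does in isolation, here is a minimal, Hadoop-free sketch (illustrative only, not part of the original program) that applies the same split-and-emit logic to one sample line:

// Minimal sketch of what map() does to one input line (no Hadoop dependencies).
public class MapSketch {
    public static void main(String[] args) {
        String line = "wahaha wahaha";           // one line of input; the offset key is omitted here
        for (String word : line.split(" +")) {   // same regex as WordcountMapper
            System.out.println(word + "\t" + 1); // each word is emitted with a count of 1
        }
    }
}

Running it prints "wahaha 1" twice, which is exactly the pair of (key, value) records the real mapper writes for that line.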
step2:
package com.atguigu.mapreduce.wordcountDemo.reduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
int sum;
IntWritable v = new IntWritable();
/**
* The framework groups all values that share the same key into a single Iterable (values)
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
// 1 Accumulate the sum
sum = 0;
for (IntWritable count : values) {
sum += count.get();//System.out.println("count.get(): "+count.get());
}
// 2 Emit the total for this key
v.set(sum);//System.out.println("output: "+key+" "+sum);
context.write(key,v);
}
}
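Likewise, a minimal plain-Java sketch (hypothetical, for illustration) of what reduce() sees for a single key after the framework has grouped the values:

import java.util.Arrays;
import java.util.List;

// Minimal sketch of reduce() for one key: sum the grouped counts.
public class ReduceSketch {
    public static void main(String[] args) {
        String key = "wahaha";
        List<Integer> values = Arrays.asList(1, 1); // the framework groups (wahaha,1),(wahaha,1) under one key
        int sum = 0;
        for (int count : values) {
            sum += count;
        }
        System.out.println(key + "\t" + sum); // wahaha 2
    }
}

It prints "wahaha 2", matching what WordcountReducer writes for that key.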
step3:
args[0] is the input file path, e.g. e:\input.txt
args[1] is the output path (a directory that must not already exist), e.g. e:\output
package com.atguigu.mapreduce.wordcountDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Job.JobState;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.atguigu.mapreduce.wordcountDemo.map.WordcountMapper;
import com.atguigu.mapreduce.wordcountDemo.reduce.WordcountReducer;
public class WordcountDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// Set the input and output paths according to the actual paths on your own machine
args = new String[] { "F:/Test/input.txt", "F:/Test/output" };
// 1 Get the configuration and create the job
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);// get a Job instance
// 2 Set the jar location
job.setJarByClass(WordcountDriver.class);// finds where the jar lives via reflection; not required for local runs
// 3 Set the Mapper and Reducer classes
job.setMapperClass(WordcountMapper.class);
job.setReducerClass(WordcountReducer.class);
// 4 Set the map output key/value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5 Set the final output key/value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//---------------------------------------------------
// If no InputFormat is set, TextInputFormat.class is used by default
//job.setInputFormatClass(CombineTextInputFormat.class);
// set the maximum virtual-storage split size to 4 MB
//CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);//4M
//---------------------------------------------------
// 6 Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 7 Submit
boolean result = job.waitForCompletion(true);
//job.submit() only submits and returns immediately; waitForCompletion(false) also waits for completion, just without printing progress
System.exit(result ? 0 : 1);// exit code 0 on success, 1 on failure; this line is optional
}
}
The args[0] input file might look like this:
wahaha wahaha
meinv meinv
shuaige
shitou
daxue
The result written to the args[1] output directory (reduce output keys come out sorted) is:
daxue 1
meinv 2
shitou 1
shuaige 1
wahaha 2
After the map phase, each line of the args[0] file has been turned into "key value" pairs, where key is a Text and value is an IntWritable, e.g. wahaha 1. The args[1] result is produced by the reduce phase: if map emitted two wahaha 1 pairs, reduce merges them and the final output is wahaha 2 (a framework-free sketch of this grouping follows below).
If there are multiple MapTasks, they run in parallel; the same holds for the reduce phase.
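The grouping between map and reduce is done by the shuffle. A rough, framework-free sketch (purely illustrative; the real shuffle also partitions, spills and sorts on disk) of that group-by-key-then-sum behavior:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of the group-by-key step between map and reduce (no Hadoop).
public class ShuffleSketch {
    public static void main(String[] args) {
        String[][] mapOutput = { {"wahaha", "1"}, {"meinv", "1"}, {"wahaha", "1"}, {"meinv", "1"} };
        // Group values by key, sorted by key, as the shuffle does.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(Integer.parseInt(kv[1]));
        }
        // Each entry corresponds to one reduce() call.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}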
Starting from the job submission, boolean result = job.waitForCompletion(true): stepping into this method, the source is as follows:
/**
* Submit the job to the cluster and wait for it to finish.
* @param verbose print the progress to the user
* @return true if the job succeeded
* @throws IOException thrown if the communication with the
* <code>JobTracker</code> is lost
*/
public boolean waitForCompletion(boolean verbose ) throws IOException, InterruptedException, ClassNotFoundException {
if (state == JobState.DEFINE) {// is the job still in the DEFINE state?
submit();// submit; step in
}// after submission, progress information may be printed
if (verbose) {
monitorAndPrintJob();
} else {
// get the completion poll interval from the client.
int completionPollIntervalMillis =
Job.getCompletionPollInterval(cluster.getConf());
while (!isComplete()) {
try {
Thread.sleep(completionPollIntervalMillis);
} catch (InterruptedException ie) {
}
}
}
return isSuccessful();
}
Notes:
- JobState.DEFINE means the job is still only defined and has not been submitted yet; when the job is submitted, the state is set to JobState.RUNNING.
- After submit() returns, progress information is printed when verbose is true (see the polling sketch below for the verbose == false path).
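For the verbose == false path, a driver could equally call submit() itself and poll the job; a small sketch using the public Job API, roughly what waitForCompletion(false) does (the 1-second poll interval here is arbitrary, the real code reads it from the configuration):

import org.apache.hadoop.mapreduce.Job;

// Sketch: submit, then poll for completion, instead of waitForCompletion(true).
public class SubmitAndWait {
    public static boolean run(Job job) throws Exception {
        job.submit();                    // the state goes from DEFINE to RUNNING here
        while (!job.isComplete()) {      // same check waitForCompletion() loops on
            Thread.sleep(1000);          // arbitrary poll interval for this sketch
        }
        return job.isSuccessful();
    }
}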
Stepping into the submit() method, the source is as follows:
/**
* Submit the job to the cluster and return immediately.
* @throws IOException
*/
public void submit() throws IOException, InterruptedException, ClassNotFoundException {
ensureState(JobState.DEFINE);// check the state
setUseNewAPI();// default to the new API
connect();// establish the connection (local or YARN)
final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
public JobStatus run() throws IOException, InterruptedException, ClassNotFoundException {
return submitter.submitJobInternal(Job.this, cluster);// submits the job details (set a breakpoint here)
}
});
state = JobState.RUNNING;
LOG.info("The url to track the job: " + getTrackingURL());
}
Stepping into the ensureState() method, the source is as follows:
private void ensureState(JobState state) throws IllegalStateException {
if (state != this.state) {
throw new IllegalStateException("Job in state "+ this.state +
" instead of " + state);
}
}
Note: ensureState() simply checks that the current JobState matches the expected one; if it does not, an IllegalStateException is thrown and the method exits (see the sketch below).
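This is also why the same Job object cannot be submitted twice: the second call sees JobState.RUNNING instead of DEFINE. A sketch of the effect (assuming job is a configured Job):

import org.apache.hadoop.mapreduce.Job;

// Sketch: submitting the same Job twice trips ensureState(JobState.DEFINE).
public class DoubleSubmit {
    public static void demo(Job job) throws Exception {
        job.submit();      // ok: state is DEFINE, becomes RUNNING
        job.submit();      // throws IllegalStateException: "Job in state RUNNING instead of DEFINE"
    }
}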
Stepping into the setUseNewAPI() method, the source is as follows:
//handles the old API
//"translates" the old API settings into the new API for backward compatibility
/**
* Default to the new APIs unless they are explicitly set or the old mapper or
* reduce attributes are used.
* @throws IOException if the configuration is inconsistant
*/
private void setUseNewAPI() throws IOException {
int numReduces = conf.getNumReduceTasks();
String oldMapperClass = "mapred.mapper.class";
String oldReduceClass = "mapred.reducer.class";
conf.setBooleanIfUnset("mapred.mapper.new-api",
conf.get(oldMapperClass) == null);
if (conf.getUseNewMapper()) {
String mode = "new map API";
ensureNotSet("mapred.input.format.class", mode);
ensureNotSet(oldMapperClass, mode);
if (numReduces != 0) {
ensureNotSet("mapred.partitioner.class", mode);
} else {
ensureNotSet("mapred.output.format.class", mode);
}
} else {
String mode = "map compatability";
ensureNotSet(INPUT_FORMAT_CLASS_ATTR, mode);
ensureNotSet(MAP_CLASS_ATTR, mode);
if (numReduces != 0) {
ensureNotSet(PARTITIONER_CLASS_ATTR, mode);
} else {
ensureNotSet(OUTPUT_FORMAT_CLASS_ATTR, mode);
}
}
if (numReduces != 0) {
conf.setBooleanIfUnset("mapred.reducer.new-api",
conf.get(oldReduceClass) == null);
if (conf.getUseNewReducer()) {
String mode = "new reduce API";
ensureNotSet("mapred.output.format.class", mode);
ensureNotSet(oldReduceClass, mode);
} else {
String mode = "reduce compatability";
ensureNotSet(OUTPUT_FORMAT_CLASS_ATTR, mode);
ensureNotSet(REDUCE_CLASS_ATTR, mode);
}
}
}
Note: the main purpose of setUseNewAPI() is compatibility with older versions: it "translates" the old API configuration keys into the new API and wraps them up (a small sketch of the underlying keys follows below).
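The decision ultimately comes down to a handful of configuration keys. A small sketch of the same pattern with Configuration (the key names are the ones that appear in the source above; this is an illustration, not the Hadoop implementation itself):

import org.apache.hadoop.conf.Configuration;

// Sketch of the choice setUseNewAPI() makes: default to the new API unless the old
// mapred.mapper.class / mapred.reducer.class attributes were set explicitly.
public class ApiChoiceSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // true when the old mapper attribute is absent, i.e. default to the new API
        conf.setBooleanIfUnset("mapred.mapper.new-api", conf.get("mapred.mapper.class") == null);
        conf.setBooleanIfUnset("mapred.reducer.new-api", conf.get("mapred.reducer.class") == null);
        System.out.println("new map API: " + conf.getBoolean("mapred.mapper.new-api", true));
        System.out.println("new reduce API: " + conf.getBoolean("mapred.reducer.new-api", true));
    }
}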
Stepping into the connect() method, the source is as follows:
//establishes the connection
//creates a different object depending on the runtime environment: LocalJobRunner locally, YARNRunner on a cluster
private synchronized void connect() throws IOException, InterruptedException, ClassNotFoundException {
if (cluster == null) {// the cluster handle is null, so create it
cluster = ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
public Cluster run() throws IOException, InterruptedException, ClassNotFoundException {
return new Cluster(getConfiguration());
}
});
}
}
Note: connect() establishes the connection, creating a different client depending on the runtime environment: LocalJobRunner when running locally, YARNRunner when running on a cluster. connect() first checks whether the cluster handle has already been created; if it is null, it creates one inside ugi.doAs(new PrivilegedExceptionAction<Cluster>() {...}) (the configuration key that drives this choice is sketched below).
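Which provider wins is driven by the mapreduce.framework.name property (MRConfig.FRAMEWORK_NAME): "local" selects the LocalJobRunner, "yarn" selects the YARN runner. A small sketch for checking what your configuration resolves it to (illustrative only):

import org.apache.hadoop.conf.Configuration;

// Sketch: mapreduce.framework.name decides which ClientProtocolProvider is used.
// "local" -> LocalJobRunner, "yarn" -> the YARN runner (YARNRunner).
public class FrameworkNameSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();       // picks up mapred-site.xml if present
        String framework = conf.get("mapreduce.framework.name", "local"); // "local" is the default
        System.out.println("framework: " + framework);
    }
}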
Stepping into the Cluster constructor, the source is as follows:
public Cluster(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException { // (set a breakpoint here)
this.conf = conf;
this.ugi = UserGroupInformation.getCurrentUser();
initialize(jobTrackAddr, conf);// conf holds the configuration files from Hadoop's etc/hadoop/ directory (e.g. yarn-site.xml)
}
Note: conf represents the various configuration files under etc/hadoop.
Stepping into the initialize() method, the source is as follows:
//the main job of this method is to determine whether you are connecting to YARN or running locally, and to create the matching client
private void initialize(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException {
synchronized (frameworkLoader) { // lock; only one thread may enter at a time (set a breakpoint here)
for (ClientProtocolProvider provider : frameworkLoader) {
LOG.debug("Trying ClientProtocolProvider : "+ provider.getClass().getName());
ClientProtocol clientProtocol = null; // the client protocol is still null at this point
try {
if (jobTrackAddr == null) {
clientProtocol = provider.create(conf);// create a client protocol; which one depends on where the job runs: LocalJobRunner for local runs, the YARN one on a Hadoop cluster
} else {
clientProtocol = provider.create(jobTrackAddr, conf);
}
if (clientProtocol != null) {
clientProtocolProvider = provider;
client = clientProtocol;
LOG.debug("Picked " + provider.getClass().getName() + " as the ClientProtocolProvider");
break;
}
else {
LOG.debug("Cannot pick " + provider.getClass().getName() + " as the ClientProtocolProvider - returned null protocol");
}
}
catch (Exception e) {
LOG.info("Failed to use " + provider.getClass().getName() + " due to error: ", e);
}
}
}
if (null == clientProtocolProvider || null == client) {
throw new IOException( "Cannot initialize Cluster. Please check your configuration for " + MRConfig.FRAMEWORK_NAME + " and the correspond server addresses.");
}
}
Notes:
- synchronized (frameworkLoader) is a lock: only one thread may enter this block at a time.
- clientProtocol is the client protocol being created. Which one is created depends on where the job runs: LocalJobRunner for local runs, the YARN protocol when running on a Hadoop cluster. frameworkLoader iterates over the available ClientProtocolProvider implementations (the ServiceLoader pattern is sketched below).
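frameworkLoader is a java.util.ServiceLoader over ClientProtocolProvider, so the loop above is the standard ServiceLoader pattern. A generic sketch of that pattern (it uses java.sql.Driver only because it is a convenient JDK service interface to demonstrate with; Hadoop does the same thing with ClientProtocolProvider):

import java.sql.Driver;
import java.util.ServiceLoader;

// Sketch of the ServiceLoader pattern: iterate over registered providers
// and pick the first one that works.
public class ProviderLoopSketch {
    public static void main(String[] args) {
        ServiceLoader<Driver> loader = ServiceLoader.load(Driver.class);
        for (Driver provider : loader) {
            System.out.println("found provider: " + provider.getClass().getName());
            // Hadoop's loop calls provider.create(conf) here and breaks on the first
            // provider that returns a non-null ClientProtocol.
        }
    }
}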
Back in submit(), step into the submitter.submitJobInternal(Job.this, cluster) method; the source is as follows:
/**
* Internal method for submitting jobs to the system.
*
* <p>The job submission process involves:
* <ol>
* <li>
* Checking the input and output specifications of the job.
* </li>
* <li>
* Computing the {@link InputSplit}s for the job.
* </li>
* <li>
* Setup the requisite accounting information for the
* {@link DistributedCache} of the job, if necessary.
* </li>
* <li>
* Copying the job's jar and configuration to the map-reduce system
* directory on the distributed file-system.
* </li>
* <li>
* Submitting the job to the <code>JobTracker</code> and optionally
* monitoring it's status.
* </li>
* </ol></p>
* @param job the configuration to submit
* @param cluster the handle to the Cluster
* @throws ClassNotFoundException
* @throws InterruptedException
* @throws IOException
*/
JobStatus submitJobInternal(Job job, Cluster cluster)
throws ClassNotFoundException, InterruptedException, IOException {
//validate the jobs output specs // the output path is validated very early, before anything else happens
checkSpecs(job);// step in
Configuration conf = job.getConfiguration(); // configuration information: the eight .xml files (*-default.xml and *-site.xml)
addMRFrameworkToDistributedCache(conf);// handles the distributed cache
Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);// every submission creates a temporary staging path; once the submission finishes its data is deleted; locally it can be found under the tmp directory
//configure the command line options correctly on the submitting dfs
InetAddress ip = InetAddress.getLocalHost();
if (ip != null) {
submitHostAddress = ip.getHostAddress();// the submitting host's IP
submitHostName = ip.getHostName();
conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
}
JobID jobId = submitClient.getNewJobID();// the job id for this submission; every job is assigned an id when it is submitted
job.setJobID(jobId);
Path submitJobDir = new Path(jobStagingArea, jobId.toString());// the submit path
JobStatus status = null;
try {
conf.set(MRJobConfig.USER_NAME,
UserGroupInformation.getCurrentUser().getShortUserName());
conf.set("hadoop.http.filter.initializers",
"org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
LOG.debug("Configuring job " + jobId + " with " + submitJobDir
+ " as the submit dir");
// get delegation token for the dir
TokenCache.obtainTokensForNamenodes(job.getCredentials(),
new Path[] { submitJobDir }, conf);
populateTokenCache(conf, job.getCredentials());
// generate a secret to authenticate shuffle transfers
if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
KeyGenerator keyGen;
try {
keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
keyGen.init(SHUFFLE_KEY_LENGTH);
} catch (NoSuchAlgorithmException e) {
throw new IOException("Error generating shuffle secret key", e);
}
SecretKey shuffleKey = keyGen.generateKey();
TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
job.getCredentials());
}
if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
"data spill is enabled");
}
copyAndConfigureFiles(job, submitJobDir);// copies the job's files to the submit dir; step in (set a breakpoint in advance)
Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
// Create the splits for the job
LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
int maps = writeSplits(job, submitJobDir);// all split information is written to the submitJobDir path
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);
// write "queue admins of the queue to which job is being submitted"
// to job file.
String queue = conf.get(MRJobConfig.QUEUE_NAME,
JobConf.DEFAULT_QUEUE_NAME);
AccessControlList acl = submitClient.getQueueAdmins(queue);
conf.set(toFullPropertyName(queue,
QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
// removing jobtoken referrals before copying the jobconf to HDFS
// as the tasks don't need this setting, actually they may break
// because of it if present as the referral will point to a
// different job.
TokenCache.cleanUpTokenReferral(conf);
if (conf.getBoolean(
MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
// Add HDFS tracking ids
ArrayList<String> trackingIds = new ArrayList<String>();
for (Token<? extends TokenIdentifier> t :
job.getCredentials().getAllTokens()) {
trackingIds.add(t.decodeIdentifier().getTrackingId());
}
conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
trackingIds.toArray(new String[trackingIds.size()]));
}
// Set reservation info if it exists
ReservationId reservationId = job.getReservationId();
if (reservationId != null) {
conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
}
// Write job file to submit dir
writeConf(conf, submitJobFile);// writes the job configuration to your submit directory; step in
//
// Now, actually submit the job (using the submit name)
//
printTokens(jobId, job.getCredentials());
status = submitClient.submitJob(
jobId, submitJobDir.toString(), job.getCredentials());
if (status != null) {
return status;
} else {
throw new IOException("Could not launch job");
}
} finally {
if (status == null) {
LOG.info("Cleaning up the staging area " + submitJobDir);
if (jtFs != null && submitJobDir != null)
jtFs.delete(submitJobDir, true);
}
}
}
Notes:
- addMRFrameworkToDistributedCache(conf) handles the distributed cache.
- Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf): every submission creates a temporary staging path, and once the submission finishes its data is deleted. Locally it can be found under tmp/hadoop-administrator/mapred/staging.
- submitHostAddress = ip.getHostAddress(): when running locally, this is the local IP.
- JobID jobId = submitClient.getNewJobID(): the job id for this submission; every job is assigned an id when it is submitted.
- int maps = writeSplits(job, submitJobDir): all split information is written to the submitJobDir path. If you open submitJobDir, you will find the split information stored as files (a small sketch for inspecting that directory follows below).
- writeConf(conf, submitJobFile) uses an I/O stream to write the job's xml configuration to the submit directory.
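While paused at a breakpoint before submitClient.submitJob(...) runs, you can look inside submitJobDir; it typically holds the merged job.xml, the split files written by writeSplits, and (on a cluster) the job jar. A tiny sketch for listing that directory; the path below is a placeholder, substitute the submit dir printed in your own logs:

import java.io.File;

// Sketch: list the job submit directory while paused at a breakpoint.
// The path is a placeholder; use the submitJobDir from your own run.
public class ListSubmitDir {
    public static void main(String[] args) {
        File submitDir = new File("C:/tmp/hadoop-administrator/mapred/staging/job_local_0001");
        File[] files = submitDir.listFiles();
        if (files != null) {
            for (File f : files) {
                System.out.println(f.getName() + "\t" + f.length() + " bytes");
            }
        }
    }
}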
Stepping into the checkSpecs(job) method, the source is as follows:
//validates the output path
private void checkSpecs(Job job) throws ClassNotFoundException,InterruptedException, IOException { // step in
JobConf jConf = (JobConf)job.getConfiguration();
// Check the output specification
if (jConf.getNumReduceTasks() == 0 ?
jConf.getUseNewMapper() : jConf.getUseNewReducer()) {
org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
ReflectionUtils.newInstance(job.getOutputFormatClass(),job.getConfiguration());
output.checkOutputSpecs(job);// ensures the output path is set and does not already exist, otherwise an exception is thrown; step in (set a breakpoint in advance)
} else {
jConf.getOutputFormat().checkOutputSpecs(jtFs, jConf);
}
}
Note: output.checkOutputSpecs(job) makes sure the output path has been configured; if the output path already exists, an exception is thrown.
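In practice this shows up as the familiar "Output directory ... already exists" error when the output directory from a previous run is still there. A small helper sketch (illustrative, not part of the original driver) that removes the old output directory before re-submitting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: delete a leftover output directory so checkOutputSpecs() does not fail.
public class OutputCleaner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("F:/Test/output");     // same output path as WordcountDriver
        FileSystem fs = output.getFileSystem(conf);   // local file system here, HDFS on a cluster
        if (fs.exists(output)) {
            fs.delete(output, true);                  // recursive delete
        }
    }
}

On a cluster the same code would delete the directory on HDFS, because getFileSystem(conf) resolves the file system from the path and configuration.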