Hadoop MapReduce reading large numbers of small files: how Hadoop reads its input

Reposted

mob64ca14133dc6 2024-06-05 15:41:22


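What is Hadoop?

Put simply, Hadoop addresses the problem of storing, analyzing, and computing over massive data sets in the big-data era.

Hadoop does not refer to one specific framework or component. It is an open-source distributed computing platform, developed in Java under the Apache Software Foundation. It performs distributed computation over massive data on clusters made up of many machines and is well suited to distributed storage and processing of big data, which makes up for the shortcomings of traditional databases at that scale.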

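What is MapReduce?

MapReduce follows a divide-and-conquer approach, as its name, built from the two verbs Map and Reduce, suggests. "Map" breaks one large task into many small tasks; "Reduce" gathers the results of those small tasks and produces the final result.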
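To make the split-then-merge idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `MiniMapReduce` and its methods are hypothetical names chosen for illustration; a real job would implement Hadoop's Mapper and Reducer classes, and the framework would run the map calls in parallel.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the map and reduce phases (word count).
final class MiniMapReduce {

    // "Map": turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Reduce": sum the counts emitted for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs map over every line, then reduces all emitted pairs.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line)); // in Hadoop, different tasks would handle different lines
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

The sketch runs sequentially; it only shows how the work is divided into independent map calls whose output is merged afterwards.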

The overall MapReduce flow is roughly as follows:

[Figure: overall MapReduce flow]

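The framework prepares the data and passes it to the map method one line at a time, so developers only need to write business logic in the map method to handle the incoming data. This hides the details, is very convenient, and lets developers focus on the business layer. It is still worth understanding what happens internally. In summary: the framework logically divides each physical file into several chunks (splits; each split records the start and end position of one region of the file) and assigns one MapTask to each split. Each task then reads the file line by line, starting from the position given by its split, and passes the content to the map method.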

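What is LineRecordReader?

A subclass of RecordReader. It reads a file line by line, using each line's byte offset as the key and the line's content as the value, and passes that key and value to the map method as arguments. It is the bridge between the file and the mapper: it determines how the file's content is handed to the map method.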


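This article uses reading a text file as its example. Below, we look at the source code to see what happens before data reaches the map method.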

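1. Splitting: cut the physical file into several logical splits according to the configuration

// Compute the split size; by default it equals blockSize
// minSize:   smallest allowed split; set via mapreduce.input.fileinputformat.split.minsize; defaults to 1 if unset
// maxSize:   largest allowed split; set via mapreduce.input.fileinputformat.split.maxsize; defaults to Long.MAX_VALUE if unset
// blockSize: block size of the file as stored in HDFS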
  protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }



  public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);

    boolean ignoreDirs = !getInputDirRecursive(job)
      && job.getConfiguration().getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);
    for (FileStatus file: files) {
      if (ignoreDirs && file.isDirectory()) {
        continue;
      }
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(job, path)) {
          long blockSize = file.getBlockSize();
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;
          // Create splits in units of splitSize.
          // Each split records: the file, the offset of the split's first byte in the file, and the split's length (here equal to splitSize)
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }
          // Handle the remaining tail; it may be larger than splitSize (up to SPLIT_SLOP = 1.1 times splitSize)
          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          if (LOG.isDebugEnabled()) {
            // Log only if the file is big enough to be splitted
            if (length > Math.min(file.getBlockSize(), minSize)) {
              LOG.debug("File is not splittable so no parallelization "
                  + "is possible: " + file.getPath());
            }
          }
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
  }
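The offset arithmetic in getSplits can be tried out without a cluster. The sketch below is a simplified, Hadoop-free reimplementation, not the library code itself: the class `SplitSketch` and the method `splits` are hypothetical names, block locations and hosts are left out, and the byte sizes in `main` are made-up values. `computeSplitSize` and the `SPLIT_SLOP` value of 1.1 follow the listing above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free sketch of the offset arithmetic in FileInputFormat.getSplits.
final class SplitSketch {
    static final double SPLIT_SLOP = 1.1; // same value as in FileInputFormat

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {start, length} pairs for a splittable, non-empty file.
    // (Zero-length and non-splittable files are handled separately in the real code.)
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        long remaining = fileLength;
        // Emit full-size splits while more than 1.1 split sizes remain.
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // The tail becomes the last split; it can be up to 1.1 * splitSize long.
        if (remaining != 0) {
            result.add(new long[] {fileLength - remaining, remaining});
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sizes: 128-byte block, maxSize 32, so splitSize is 32.
        long splitSize = computeSplitSize(128, 1, 32);
        // A 100-byte file yields (0,32) (32,32) (64,32) (96,4).
        for (long[] s : splits(100, splitSize)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
        // A 35-byte file yields the single split (0,35), because 35/32 is below 1.1.
        for (long[] s : splits(35, splitSize)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

The second call shows why SPLIT_SLOP exists: a 3-byte tail is folded into the preceding split instead of getting its own map task.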

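2. Create a MapTaskRunnable instance for each split, hand the instances to the execution units, and run them (this walkthrough uses the local execution mode, LocalJobRunner, which submits the tasks to a thread pool)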

org.apache.hadoop.mapred.LocalJobRunner.java


    // Create one MapTaskRunnable instance per split
    protected List<RunnableWithThrowable> getMapTaskRunnables(
        TaskSplitMetaInfo [] taskInfo, JobID jobId,
        Map<TaskAttemptID, MapOutputFile> mapOutputFiles) {

      int numTasks = 0;
      ArrayList<RunnableWithThrowable> list =
          new ArrayList<RunnableWithThrowable>();
      for (TaskSplitMetaInfo task : taskInfo) {
        list.add(new MapTaskRunnable(task, numTasks++, jobId,
            mapOutputFiles));
      }

      return list;
    }


    
    private void runTasks(List<RunnableWithThrowable> runnables,
        ExecutorService service, String taskType) throws Exception {
      // Start populating the executor with work units.
      // They may begin running immediately (in other threads).
      for (Runnable r : runnables) {
        service.submit(r);
      }

      try {
        service.shutdown(); // Instructs queue to drain.

        // Wait for tasks to finish; do not use a time-based timeout.
        // (See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6179024)
        LOG.info("Waiting for " + taskType + " tasks");
        service.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
      } catch (InterruptedException ie) {
        // Cancel all threads.
        service.shutdownNow();
        throw ie;
      }

      LOG.info(taskType + " task executor complete.");

      // After waiting for the tasks to complete, if any of these
      // have thrown an exception, rethrow it now in the main thread context.
      for (RunnableWithThrowable r : runnables) {
        if (r.storedException != null) {
          throw new Exception(r.storedException);
        }
      }
    }
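The pattern in runTasks (submit everything, shut the pool down, wait, then rethrow any stored failure in the calling thread) can be sketched on its own. In the example below, `RunTasksSketch` and `StoringRunnable` are hypothetical stand-ins for LocalJobRunner and its RunnableWithThrowable, and the pool is created inside the method for brevity, whereas the real code receives an ExecutorService as a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the submit / shutdown / await / rethrow pattern.
final class RunTasksSketch {

    // Stand-in for RunnableWithThrowable: remembers a failure instead of losing it.
    static final class StoringRunnable implements Runnable {
        private final Callable<?> body;
        volatile Throwable storedException;

        StoringRunnable(Callable<?> body) {
            this.body = body;
        }

        @Override
        public void run() {
            try {
                body.call();
            } catch (Throwable t) {
                storedException = t; // inspected later by the submitting thread
            }
        }
    }

    static void runTasks(List<StoringRunnable> tasks, int threads) throws Exception {
        ExecutorService service = Executors.newFixedThreadPool(threads);
        for (Runnable r : tasks) {
            service.submit(r); // tasks may start running immediately
        }
        try {
            service.shutdown(); // accept no new work; queued tasks still run
            service.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException ie) {
            service.shutdownNow(); // cancel whatever is still running
            throw ie;
        }
        // Surface the first stored failure in the caller's thread.
        for (StoringRunnable t : tasks) {
            if (t.storedException != null) {
                throw new Exception(t.storedException);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger finished = new AtomicInteger();
        List<StoringRunnable> tasks = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            tasks.add(new StoringRunnable(() -> finished.incrementAndGet()));
        }
        runTasks(tasks, 2);
        System.out.println("finished tasks: " + finished.get()); // prints "finished tasks: 4"
    }
}
```

Storing the exception matters because an exception thrown inside a pooled thread would otherwise never reach the thread that is waiting for the job to finish.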

3. When a MapTaskRunnable runs, it creates a MapTask and calls its run method, which in turn calls runNewMapper

org.apache.hadoop.mapred.LocalJobRunner.java

public void run() {
        try {
          TaskAttemptID mapId = new TaskAttemptID(new TaskID(
              jobId, TaskType.MAP, taskId), 0);
          LOG.info("Starting task: " + mapId);
          mapIds.add(mapId);
          // Create the MapTask instance
          MapTask map = new MapTask(systemJobFile.toString(), mapId, taskId,
            info.getSplitIndex(), 1);
          // ... code omitted ...
          try {
            map_tasks.getAndIncrement();
            myMetrics.launchMap(mapId);
            // Call the MapTask's run method
            map.run(localConf, Job.this);
            myMetrics.completeMap(mapId);
          } finally {
            map_tasks.getAndDecrement();
          }

          LOG.info("Finishing task: " + mapId);
        } catch (Throwable e) {
          this.storedException = e;
        }
      }

org.apache.hadoop.mapred.MapTask.java

  @Override
  public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
    this.umbilical = umbilical;

    // ... code omitted ...

    if (useNewApi) {
      runNewMapper(job, splitMetaInfo, umbilical, reporter);
    } else {
      runOldMapper(job, splitMetaInfo, umbilical, reporter);
    }
    done(umbilical, reporter);
  }



  @SuppressWarnings("unchecked")
  private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // ... code omitted ...

    try {
      // For text input, this ends up calling LineRecordReader.initialize
      input.initialize(split, mapperContext);
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

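4. Call initialize to prepare for reading the file

Key point: if this is not the first split, discard the split's first line, because nextKeyValue reads one extra line past the end of each split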

org.apache.hadoop.mapreduce.lib.input.LineRecordReader.java

public void initialize(InputSplit genericSplit,
					   TaskAttemptContext context) throws IOException {
	FileSplit split = (FileSplit) genericSplit;
	Configuration job = context.getConfiguration();
	this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
	start = split.getStart();
	end = start + split.getLength();
	final Path file = split.getPath();

	// open the file and seek to the start of the split
	final FileSystem fs = file.getFileSystem(job);
	fileIn = fs.open(file);

	// Look up the compression codec for the file, based on its file-name extension
	CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
	if (null!=codec) {
		// compression-handling code omitted
	} else {
		// Seek the opened stream to the split's start offset so that reading begins at this split's content
		fileIn.seek(start);
		in = new UncompressedSplitLineReader(
				fileIn, job, this.recordDelimiterBytes, split.getLength());
		filePosition = fileIn;
	}
	// If this is not the first split, we always throw away first record
	// because we always (except the last split) read one extra line in
	// next() method.
	// The discarded line is not lost: the reader of the preceding split reads it as its extra line.
	if (start != 0) {
		start += in.readLine(new Text(), 0, maxBytesToConsume(start));
	}
	this.pos = start;
}

5. The file content is read iteratively and passed to the map method; context.nextKeyValue() delegates, layer by layer, to LineRecordReader's nextKeyValue method

/**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

The core lies in LineRecordReader's initialize and nextKeyValue methods. The code for initialize is shown above; the implementation of nextKeyValue follows.

Key point: the loop condition while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {

org.apache.hadoop.mapreduce.lib.input.LineRecordReader.java

    /**
     * Reads the next line and sets key and value.
     *
     * @return true if a record was read, false otherwise
     * @throws IOException
     */
    public boolean nextKeyValue() throws IOException {
        if (key == null) {
            key = new LongWritable();
        }
        key.set(pos);
        if (value == null) {
            value = new Text();
        }
        int newSize = 0;
        // We always read one extra line, which lies outside the upper
        // split limit i.e. (end - 1)
        // At the end of the split, read one extra line; note that the comparison is <=
        while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
            if (pos == 0) {
                newSize = skipUtfByteOrderMark();
            } else {
                newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
                pos += newSize;
            }

            if ((newSize == 0) || (newSize < maxLineLength)) {
                break;
            }

            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos " +
                    (pos - newSize));
        }
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }


    org.apache.hadoop.util.LineReader.java

    /**
     * Read a line terminated by one of CR, LF, or CRLF.
     */
    private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
            throws IOException {
        str.clear();
        int txtLength = 0; //tracks str.getLength(), as an optimization
        int newlineLength = 0; //length of terminating newline
        boolean prevCharCR = false; //true of prev char was CR
        long bytesConsumed = 0;
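        // Loop (do-while), reading from the file into the buffer, until one line has been read.
        // newlineLength == 0 && bytesConsumed < maxBytesToConsume: no line terminator found yet and the bytes consumed are still under the limit, so keep looping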
        do {
            int startPosn = bufferPosn; //starting from where we left off the last time
            // When the read position bufferPosn reaches the amount of data actually in the buffer, the buffer is exhausted and must be refilled from the file
            if (bufferPosn >= bufferLength) {
                startPosn = bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed; //account for CR from previous read
                }
                // Read from the file into the buffer
                bufferLength = fillBuffer(in, buffer, prevCharCR);
                if (bufferLength <= 0) {
                    break; // EOF
                }
            }
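            // Search the buffered bytes for a line terminator. This may be the Nth pass of the do-while loop, which happens when a line is longer than the buffer (LineReader's built-in default is 64 KB).
            // CR, LF, and CRLF are each handled:
            // 1. CR: C sets prevCharCR to true; on the next iteration of the for loop, B is reached and breaks out of the loop. The terminator length (newlineLength) is 1.
            // 2. LF: prevCharCR was never set to true at C, so the if at A is entered with prevCharCR false, and newlineLength is 1.
            // 3. CRLF: C sets prevCharCR to true for the CR; the LF then enters the if at A with prevCharCR true, so newlineLength is 2.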
            for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
                // A
                if (buffer[bufferPosn] == LF) {
                    newlineLength = (prevCharCR) ? 2 : 1;
                    ++bufferPosn; // at next invocation proceed from following byte
                    break;
                }
                // B
                if (prevCharCR) { //CR + notLF, we are at notLF
                    newlineLength = 1;
                    break;
                }
                // C
                prevCharCR = (buffer[bufferPosn] == CR);
            }

            // Number of bytes scanned by the for loop above
            int readLength = bufferPosn - startPosn;
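            // Handles a CR that is the last byte in the buffer: subtract 1 from the scanned length (dropping the CR, which is counted after the next refill instead)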
            if (prevCharCR && newlineLength == 0) {
                --readLength; //CR at the end of the buffer
            }
            // bytesConsumed tracks how many bytes have been consumed so far in order to read this line
            bytesConsumed += readLength;
            // Subtract the bytes taken by the line terminator
            int appendLength = readLength - newlineLength;

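            // maxLineLength: maximum length of a single line; set via mapreduce.input.linerecordreader.line.maxlength
            // txtLength: length of the data read so far for this line
            // appendLength: length of the data read in this pass of the do-while loop
            // If the content exceeds the single-line limit, truncate it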
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }
            // If this pass read any data, append it to str
            if (appendLength > 0) {
                str.append(buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
        } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

        if (bytesConsumed > Integer.MAX_VALUE) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        }
        return (int) bytesConsumed;
    }
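The terminator rules in readDefaultLine are easier to follow without the buffer-refill bookkeeping. The sketch below is a deliberately simplified version that scans a complete in-memory byte array, so the "CR at the end of the buffer" case never arises. The class `LineTerminatorSketch` and its `readLine` signature are hypothetical; like the original, the method returns the number of bytes consumed, terminator included.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, unbuffered sketch of the CR / LF / CRLF handling in LineReader.
final class LineTerminatorSketch {
    private static final byte CR = '\r';
    private static final byte LF = '\n';

    // Reads one line starting at pos into 'line' (terminator excluded)
    // and returns the bytes consumed (terminator included); 0 means EOF.
    static int readLine(byte[] data, int pos, StringBuilder line) {
        line.setLength(0);
        int i = pos;
        int newlineLength = 0;
        while (i < data.length && newlineLength == 0) {
            byte b = data[i++];
            if (b == LF) {
                newlineLength = 1; // bare LF
            } else if (b == CR) {
                newlineLength = 1; // bare CR, unless an LF follows
                if (i < data.length && data[i] == LF) {
                    i++;
                    newlineLength = 2; // CRLF counts as one terminator of two bytes
                }
            }
        }
        int consumed = i - pos;
        line.append(new String(data, pos, consumed - newlineLength, StandardCharsets.UTF_8));
        return consumed;
    }

    public static void main(String[] args) {
        byte[] data = "ab\r\ncd\re\nf".getBytes(StandardCharsets.UTF_8);
        StringBuilder line = new StringBuilder();
        int pos = 0;
        int consumed;
        // Prints: 4 "ab", 3 "cd", 2 "e", 1 "f" (bytes consumed, then the line text).
        while ((consumed = readLine(data, pos, line)) > 0) {
            System.out.println(consumed + " \"" + line + "\"");
            pos += consumed;
        }
    }
}
```

The returned byte count is what LineRecordReader adds to pos, which is why the terminator bytes have to be included even though they are not part of the value.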

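That covers the relevant source code. Next, the principle is explained with figures and a demo.

HDFS physically divides a file into blocks, and splitting divides it logically, but code that reads the file does not need to care about that underlying layout: at the API level it is simply reading one file. By default splitSize equals blockSize, so a split does not span two blocks. If splitSize is set to twice blockSize, one split does span two blocks. Either way, readDefaultLine reads a fixed number of bytes into a buffer each time (the buffer size can be set via io.file.buffer.size) and scans the buffer for a line terminator. If none is found, it reads another batch from the file into the buffer, and repeats until a line terminator is found. The line's content is then stored in LineRecordReader.value (the value passed to the map method).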
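The two rules seen in the source, "skip the first line unless the split starts at offset 0" and "keep reading while pos <= end", together ensure that every line is emitted exactly once even when a line straddles a split boundary. The sketch below simulates this on an in-memory byte array. It is a simplification under stated assumptions: `SplitReadSketch` and its methods are hypothetical names, only LF terminators are handled, and compression, BOM skipping, and maxLineLength are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of LineRecordReader's split-boundary rules (LF-terminated text only).
final class SplitReadSketch {

    // Bytes from pos up to and including the next LF, or up to EOF; 0 means EOF.
    static int lineLength(byte[] file, int pos) {
        int i = pos;
        while (i < file.length) {
            if (file[i++] == '\n') {
                break;
            }
        }
        return i - pos;
    }

    // Lines that the reader assigned to [start, start + length) would emit.
    static List<String> readSplit(byte[] file, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            pos += lineLength(file, pos); // initialize(): discard the first (possibly partial) line
        }
        List<String> records = new ArrayList<>();
        while (pos <= end) { // nextKeyValue(): "<=" lets the reader take one line past the split end
            int consumed = lineLength(file, pos);
            if (consumed == 0) {
                break; // EOF
            }
            int textLength = file[pos + consumed - 1] == '\n' ? consumed - 1 : consumed;
            records.add(new String(file, pos, textLength, StandardCharsets.UTF_8));
            pos += consumed;
        }
        return records;
    }

    // Reads the whole file split by split and concatenates the emitted lines.
    static List<String> readAllSplits(byte[] file, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < file.length; start += splitSize) {
            all.addAll(readSplit(file, start, Math.min(splitSize, file.length - start)));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] file = "aaa\nbbbb\ncc\ndddd\n".getBytes(StandardCharsets.UTF_8); // 17 bytes
        System.out.println(readSplit(file, 0, 8));  // prints [aaa, bbbb]
        System.out.println(readSplit(file, 8, 8));  // prints [cc, dddd]
        System.out.println(readSplit(file, 16, 1)); // prints []
    }
}
```

In the first split, "bbbb" starts at offset 4 and ends at offset 8, the split boundary, yet it is read whole by the first reader; the second reader starts at offset 8, discards the single terminator byte it finds there, and begins with "cc".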

Below is a plain-text demo file, opened in Notepad++ with the line terminators (CRLF) displayed.

We set the following Hadoop parameters:

bufferSize (io.file.buffer.size) = 8 bytes

splitSize (mapreduce.input.fileinputformat.split.maxsize) = 32 bytes

Sample file

[Figure: sample file]