Reading Notes on "Hadoop: The Definitive Guide", Part 6: Chapter 6


Author: 说文科技, 2021-07-07 15:33:59


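© Copyright belongs to the author. This is an original work by 说文科技 on the 51CTO blog; contact the author for permission before reposting, otherwise legal action may be taken.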


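1. Reading `xml` configuration files

1.1 Steps

01. Put the .xml file in the resources folder so that it is on the classpath.
02. Load it with the addResource() method of the Configuration class.
03. Read the properties defined in the .xml file, for example with get() or getInt().
04. A property value in an .xml file can be defined through variable expansion, that is, in terms of other properties. Does the order of definition matter? It does not: a referenced property may be defined before or after the property that uses it. As long as the property name is unique across the loaded resources, its value is found.
05. If a property is marked final in an .xml file, it cannot be overridden in other .xml files loaded later.

1.2 Example code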

import org.apache.hadoop.conf.Configuration;

public class PrintConfiguration {
    public static void main(String[] args) {
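        Configuration conf = new Configuration();
        // Resources are loaded in order; a later resource overrides an earlier one,
        // except for properties that were marked final.
        conf.addResource("configuration-1.xml");
        conf.addResource("configuration-2.xml");
        System.out.println(conf.get("color"));

        // getInt: returns the value of the named property as an int,
        // or defaultValue if the property is not set.
        System.out.println(conf.getInt("size", 0));

        System.out.println("weight: " + conf.get("weight"));

        // Variable expansion: ${size} and ${weight} are resolved when the value is read.
        System.out.println("size-weight: " + conf.get("size-weight"));

        // A system property is not visible through the Configuration API unless a
        // configuration property is defined in terms of it, so this prints null.
        System.setProperty("length", "2");
        System.out.println("length: " + conf.get("length"));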
    }
}

The output is as follows:
[Screenshot: console output of PrintConfiguration]

configuration-1.xml

<?xml version="1.0"?>
<configuration>
    <property>
        <name>size-weight</name>
        <value>${size},${weight}</value>
        <description>Size and weight</description>
    </property>
    <property>
        <name>color</name>
        <value>yellow</value>
        <description>Color</description>
    </property>
    <property>
        <name>size</name>
        <value>10</value>
        <description>Size</description>
    </property>
    <property>
        <name>weight</name>
        <value>heavy</value>
        <final>true</final>
        <description>Weight</description>
    </property>

</configuration>

configuration-2.xml

<?xml version="1.0"?>
<configuration>
    <property>
        <name>color</name>
        <value>blue</value>
        <description>Color</description>
    </property>
    <property>
        <name>size</name>
        <value>20</value>
        <description>Size</description>
    </property>

    <property>
        <name>weight</name>
        <value>light</value>
        <description>Weight</description>
    </property>
</configuration>

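1.3 Notes

configuration-2.xml redefines the weight property from configuration-1.xml. Because weight is marked final in configuration-1.xml, and the code loads configuration-1.xml before configuration-2.xml, the override is ignored (weight stays heavy) and a warning about the attempted override is logged at run time.
Note that although configuration properties can be defined in terms of system properties, unless system properties are redefined using configuration properties, they are not accessible through the configuration API.

In other words, a configuration property can refer to a system property in its value (a value containing ${length} would expand using the system property length), but the system property itself cannot be read through the configuration API unless it is redefined as a configuration property. For example, conf.get("length") in the code above returns null.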
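The rules in this section (later resources override earlier ones, final properties cannot be overridden, variables are expanded when a value is read, and system properties are used for expansion but are not readable directly) can be sketched with a toy model in plain Java. This is a minimal sketch, not Hadoop's implementation: ToyConfiguration and loadExample() are hypothetical names, expansion is done in a single pass, and the sketch silently ignores an override of a final property where Hadoop would log a warning.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for org.apache.hadoop.conf.Configuration (hypothetical, not Hadoop code).
class ToyConfiguration {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");
    private final Map<String, String> props = new LinkedHashMap<>();
    private final Set<String> finals = new HashSet<>();

    // Loads one property; a later value wins unless the name was marked final earlier.
    void set(String name, String value, boolean isFinal) {
        if (finals.contains(name)) {
            return; // assumption: ignore silently (Hadoop logs a warning instead)
        }
        props.put(name, value);
        if (isFinal) {
            finals.add(name);
        }
    }

    // Returns the expanded value, or null if the property was never loaded.
    String get(String name) {
        String raw = props.get(name);
        return raw == null ? null : expand(raw);
    }

    // Replaces each ${var}: system properties are consulted first, then loaded properties.
    private String expand(String value) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String var = m.group(1);
            String resolved = System.getProperty(var);
            if (resolved == null) {
                resolved = props.get(var);
            }
            // A variable that cannot be resolved is left exactly as written.
            String replacement = resolved != null ? resolved : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Builds the state after loading configuration-1.xml and then configuration-2.xml.
    static ToyConfiguration loadExample() {
        ToyConfiguration conf = new ToyConfiguration();
        // configuration-1.xml
        conf.set("size-weight", "${size},${weight}", false);
        conf.set("color", "yellow", false);
        conf.set("size", "10", false);
        conf.set("weight", "heavy", true);
        // configuration-2.xml
        conf.set("color", "blue", false);
        conf.set("size", "20", false);
        conf.set("weight", "light", false); // ignored because weight is final
        return conf;
    }

    public static void main(String[] args) {
        ToyConfiguration conf = loadExample();
        System.out.println(conf.get("color"));
        System.out.println(conf.get("weight"));
        System.out.println(conf.get("size-weight"));
        System.setProperty("length", "2");
        System.out.println(conf.get("length")); // still null: not a loaded property
    }
}
```

Note that size-weight is declared before size and weight, which shows why definition order does not matter: nothing is resolved until get() is called.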

2. The Tool interface in detail

Before looking at Tool, first look at the GenericOptionsParser class.

2.1 `GenericOptionsParser`

GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.

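In practice you rarely need to use GenericOptionsParser directly. It is usually more convenient to implement the Tool interface and run the application with ToolRunner, which uses GenericOptionsParser internally. Tool itself extends the Configurable interface:

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}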

2.2 `Tool`

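Every Tool implementation must also implement Configurable, because Tool extends it. Configured is the simplest implementation of Configurable, so subclassing Configured is often the easiest way to satisfy this requirement.

2.3 Example code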

package hadoopDefinitiveGuide.chapter_6;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.util.Map;

public class ConfigurationPrinter extends Configured implements Tool {

    static
    {
        Configuration.addDefaultResource("hdfs-default.xml");
    }
    @Override
    public int run(String[] args) throws Exception {
        // getConf() can be called directly here because it is inherited from Configured,
        // which implements the getConf() method declared by Configurable.
        Configuration conf = getConf();
        for (Map.Entry<String, String> entry : conf) {
            System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
        System.exit(exitCode);
    }
}

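[Screenshot: output of ConfigurationPrinter]

Part of the output is omitted above.
The code also uses the ToolRunner class. What does it do? The source of ToolRunner.run(Tool, String[]) is:
[Screenshot: source of ToolRunner.run(Tool, String[])]

This run() delegates to an overloaded run() method and passes tool.getConf() as an extra first argument. getConf() returns the tool's Configuration instance, which is handed to the overloaded method. That method, run(Configuration, Tool, String[]), is:
[Screenshot: source of ToolRunner.run(Configuration, Tool, String[])]

It works with three important objects: conf, parser, and tool. It builds a GenericOptionsParser from conf and the command-line arguments, sets conf back on the tool, and then calls the tool's run() method with the arguments that remain after the generic options are removed.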
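The flow described above can be sketched in plain Java. This is a simplified model under stated assumptions, not the Hadoop source: MiniToolRunner, MiniTool, and parseGenericOptions are hypothetical names, a Map stands in for Configuration, only the -D property=value option is handled, and the configuration is passed to the tool as a parameter instead of through setConf().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the ToolRunner flow (hypothetical names, not Hadoop code).
class MiniToolRunner {
    // Stand-in for Tool: receives the configuration and the non-generic arguments.
    interface MiniTool {
        int run(Map<String, String> conf, String[] args);
    }

    // Plays the role of GenericOptionsParser: moves every "-D key=value" pair
    // into conf and returns the arguments that are left over.
    static String[] parseGenericOptions(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            boolean isPair = args[i].equals("-D")
                    && i + 1 < args.length
                    && args[i + 1].contains("=");
            if (isPair) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv[1]);
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining.toArray(new String[0]);
    }

    // Mirrors the three objects named in the text: conf, parser, and tool.
    static int run(Map<String, String> conf, MiniTool tool, String[] args) {
        if (conf == null) {
            conf = new LinkedHashMap<>();
        }
        String[] toolArgs = parseGenericOptions(conf, args);
        return tool.run(conf, toolArgs);
    }

    public static void main(String[] args) {
        String[] demo = {"-D", "color=yellow", "input", "output"};
        int exitCode = run(new LinkedHashMap<>(), (conf, rest) -> {
            System.out.println("color=" + conf.get("color"));
            System.out.println("remaining=" + String.join(",", rest));
            return 0;
        }, demo);
        System.out.println("exit=" + exitCode);
    }
}
```

The point of the split is that the tool never sees the generic options: by the time its run() method is called, they have already been applied to the configuration.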

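3. A closer look at the following interface and class

3.1 What is `Configurable`?

3.2 What is `Configured`?

3.3 What is the difference between them?

Configurable is an interface: it declares setConf() and getConf().
Configured is a class: it implements Configurable by keeping the Configuration in a field.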
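The relationship between the two types can be illustrated with a small plain-Java sketch. The nested Configurable and Configured types below only mirror the shape of the Hadoop types; a Map stands in for Configuration, and ConfigurableSketch and Printer are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates the interface/base-class split (hypothetical sketch, not Hadoop code).
class ConfigurableSketch {
    // Mirrors the shape of Configurable: a contract with no state.
    interface Configurable {
        void setConf(Map<String, String> conf);
        Map<String, String> getConf();
    }

    // Mirrors the shape of Configured: the simplest implementation,
    // which just stores the configuration in a field.
    static class Configured implements Configurable {
        private Map<String, String> conf;

        @Override
        public void setConf(Map<String, String> conf) {
            this.conf = conf;
        }

        @Override
        public Map<String, String> getConf() {
            return conf;
        }
    }

    // A tool extends Configured and inherits getConf() and setConf().
    static class Printer extends Configured {
        String describe() {
            return "color=" + getConf().get("color");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("color", "yellow");
        Printer printer = new Printer();
        printer.setConf(conf);
        System.out.println(printer.describe());
    }
}
```

This is why ConfigurationPrinter in section 2.3 can call getConf() without declaring it: the method comes from the base class, while the interface is what ToolRunner relies on.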

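4. Property configuration

Not every property can be set on the client. For example, yarn.nodemanager.resource.memory-mb must be set in the yarn-site.xml read by the node manager; setting it on the client has no effect.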

5. Configuration naming rules since Hadoop 2

01. HDFS properties that concern the namenode now use the dfs.namenode prefix.
02. MapReduce properties now use the mapreduce prefix instead of the older mapred prefix, for example mapreduce.job.name.

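6. Setting `Hadoop` properties versus `JVM` properties

01. Setting a Hadoop property through GenericOptionsParser uses the syntax -D property=value. Do not confuse this with the JVM option -Dproperty=value: the JVM syntax allows no whitespace between -D and the property name, whereas GenericOptionsParser allows a space there.
02. JVM system properties are read from the java.lang.System class; Hadoop properties are read only from a Configuration object.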

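7. Why are the `<KEYIN, VALUEIN>` types of a `Mapper` usually `LongWritable` and `Text`?

Most Mapper tasks read files from HDFS, and most of those files are plain text that can be read line by line.
The default input format is TextInputFormat, which presents each line as one record: the key is the byte offset of the line within the file (a LongWritable) and the value is the content of the line (a Text). Hence KEYIN = LongWritable and VALUEIN = Text.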
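To make the key and value concrete, the following plain-Java sketch splits a piece of text into the kind of (byte offset, line) pairs a Mapper would receive. It only imitates the record shape and is not how TextInputFormat is implemented; LineRecords and records() are hypothetical names, and the sketch assumes UTF-8 text with \n line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Imitates the (offset, line) records a Mapper sees with line-oriented text input.
class LineRecords {
    // Returns a map from the byte offset of each line to the line's content.
    static Map<Long, String> records(String text) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length(); // last line without a trailing newline
            }
            String line = text.substring(start, end);
            out.put(offset, line);
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond\nthird";
        for (Map.Entry<Long, String> record : records(text).entrySet()) {
            // The key plays the role of LongWritable, the value the role of Text.
            System.out.println(record.getKey() + "\t" + record.getValue());
        }
    }
}
```

The key is an offset and not a line number, because a task that starts in the middle of a file can compute the offset without reading everything before it.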

8. Running a `MapReduce` job on different platforms

The local job runner uses a single JVM to run a job, so as long as all the classes that your job needs are on its classpath, then things will just work.

That is, the local job runner runs the whole job in a single JVM, so the job works as long as every class it needs is on the classpath.
Running a job on a cluster is somewhat different. To run the finished code on a cluster, the steps are:

step 1: Package the program into a job JAR file to send to the cluster.
step 2: Hadoop will find the job JAR automatically by searching for the JAR on the driver’s classpath that contains the class set in the setJarByClass() method (on JobConf or Job).
step 3: Any dependent JAR files can be packaged in a lib subdirectory in the job JAR file, although there are other ways to include dependencies, discussed later.
step 4: Similarly, resource files can be packaged in a classes subdirectory.

On a cluster (including pseudo-distributed mode), map and reduce tasks run in separate JVMs, and their classpath is not controlled by HADOOP_CLASSPATH.
HADOOP_CLASSPATH is a client-side setting and only affects the classpath of the driver JVM that submits the job. Instead, the user's task classpath consists of the following:

The job JAR file
Any JAR files contained in the lib directory of the job JAR file, and the classes directory (if present)
Any files added to the distributed cache using the -libjars option, or the addFileToClassPath() method on DistributedCache (old API), or Job (new API)

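10. `User JAR` files are added to the end of both the client classpath and the task classpath, so they may conflict with the JARs bundled with Hadoop

Common ways to resolve such conflicts:

Change the order of the task classpath so that your classes are picked up first (set mapreduce.job.user.classpath.first to true).
On the client, set the HADOOP_USER_CLASSPATH_FIRST environment variable to true.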

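11. How `MapReduce job IDs` are generated

11.1 Generation rules

The application prefix
This prefix is fixed and marks the ID as belonging to an application (here, a MapReduce application).
The timestamp of the YARN application ID
This part is generated by the YARN resource manager. It is a timestamp: the time at which the resource manager was started.
An incrementing counter
This counter is also maintained by the YARN resource manager and counts the applications that the resource manager has handled.

11.2 Example: `id = application_1410450250506_0003`

It is an application.
The resource manager that ran it was started at timestamp = 1410450250506.
It is the third application run by that resource manager (application IDs start at 1).
The leading zeros make IDs sort and display nicely. If the counter grows to five digits, it keeps counting and is not reset to 0; the ID simply becomes longer.

The job ID is formed the same way, with the application prefix replaced by job: job_1410450250506_0003. Tasks belong to a job, and a task ID replaces the job prefix with task and appends the task type and a task number. For example, task_1410450250506_0003_m_000003 is the fourth map task (000003, counting from 0) of job_1410450250506_0003.

attempt_1410450250506_0003_m_000003_0 identifies one attempt to run that task.
Because a task may be executed more than once (for example after a failure or under speculative execution), each attempt gets a number appended. The trailing 0 here marks the first attempt (attempts are counted from 0).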
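The naming scheme above can be checked by taking an attempt ID apart. The following is a minimal plain-Java sketch that assumes the underscore-separated layout shown in this section; MapReduceIds and its methods are hypothetical helpers, not part of the Hadoop API (Hadoop has its own ID classes).

```java
// Splits an attempt ID into the IDs it is built from (hypothetical helper, not Hadoop code).
class MapReduceIds {
    // Validates and splits an ID such as attempt_1410450250506_0003_m_000003_0.
    static String[] parts(String attemptId) {
        String[] p = attemptId.split("_");
        if (p.length != 6 || !p[0].equals("attempt")) {
            throw new IllegalArgumentException("not an attempt ID: " + attemptId);
        }
        return p;
    }

    // application_<timestamp>_<counter>
    static String applicationId(String attemptId) {
        String[] p = parts(attemptId);
        return "application_" + p[1] + "_" + p[2];
    }

    // job_<timestamp>_<counter>
    static String jobId(String attemptId) {
        String[] p = parts(attemptId);
        return "job_" + p[1] + "_" + p[2];
    }

    // task_<timestamp>_<counter>_<type>_<task number>
    static String taskId(String attemptId) {
        String[] p = parts(attemptId);
        return "task_" + p[1] + "_" + p[2] + "_" + p[3] + "_" + p[4];
    }

    // Zero-based task number within the job.
    static int taskNumber(String attemptId) {
        return Integer.parseInt(parts(attemptId)[4]);
    }

    // Zero-based attempt number for the task.
    static int attemptNumber(String attemptId) {
        return Integer.parseInt(parts(attemptId)[5]);
    }

    public static void main(String[] args) {
        String id = "attempt_1410450250506_0003_m_000003_0";
        System.out.println(applicationId(id));
        System.out.println(jobId(id));
        System.out.println(taskId(id));
        System.out.println("task number " + taskNumber(id) + ", attempt " + attemptNumber(id));
    }
}
```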

12. `Job history`

Job history refers to the events and configuration for a completed MapReduce job. It is retained regardless of whether the job was successful.
Job history files are stored in HDFS by the MapReduce application master, in a directory set by the mapreduce.jobhistory.done-dir property.
Job history files are kept for one week before being deleted by the system.
The history log includes job, task, and attempt events, all of which are stored in a file in JSON format.

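13. Job tuning

13.1 Before tuning

After a job is working, the question developers usually ask is: "Can I make it run faster?" Hadoop has a few "usual suspects" that are worth checking to see whether they are responsible for a performance problem. Go through the following checklist before you start profiling or optimizing at the task level.
[Figure: tuning checklist]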

13.2 `Profiling Tasks`

Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls down the profile information to your machine for later analysis with standard profiling tools.

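In other words, only a sample of the tasks in a job is profiled, and as each of those tasks completes its profile information is fetched so that it can be analyzed later with standard profiling tools.

Profiling an application on a distributed cluster comes with many challenges.

Optimize for the actual bottleneck [if a task is I/O-bound, there is no point in optimizing CPU usage].
Do not rely on comparing run times alone [the resources available on a cluster vary from one moment to the next].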

13.3 Enabling the `profile` feature

Enabling profiling is as simple as setting the property mapreduce.task.profile to true

For example, pass the option -D mapreduce.task.profile=true when running the job:
hadoop jar hadoop-examples.jar -D mapreduce.task.profile=true ....

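13.4 Notes

Avoid profiling every task
Profiling every running task is usually pointless, because a handful of tasks is enough to show where the problem lies. By default, only the tasks with IDs 0, 1, and 2 are profiled. Set mapreduce.task.profile.maps and mapreduce.task.profile.reduces to specify the ranges of task IDs to profile.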
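To illustrate what a task ID range means here, the following plain-Java sketch decides whether a task number falls inside a range specification such as the default 0-2. It is only an illustration: ProfileRange and isProfiled() are hypothetical names, and the comma-separated form (for example 0-2,5) is an assumption about the accepted syntax, not something this article confirms.

```java
// Checks a task number against a range specification (hypothetical sketch, not Hadoop code).
class ProfileRange {
    // Returns true if taskNumber falls inside a spec such as "0-2" or "0-2,5".
    // Assumption: ranges are inclusive and separated by commas.
    static boolean isProfiled(String ranges, int taskNumber) {
        for (String part : ranges.split(",")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty entries
            }
            String[] bounds = trimmed.split("-");
            int low = Integer.parseInt(bounds[0].trim());
            // A single number such as "5" is treated as the range 5-5.
            int high = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : low;
            if (taskNumber >= low && taskNumber <= high) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String defaultRange = "0-2";
        for (int task = 0; task < 5; task++) {
            System.out.println("task " + task + " profiled: " + isProfiled(defaultRange, task));
        }
    }
}
```

With the default range, only the first three tasks of each type are profiled, which keeps the overhead small on jobs with thousands of tasks.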
