Hadoop Distributed Big Data & Python Visualization Analysis

I. MapReduce

MapReduce is Hadoop's dedicated computation component. Although it falls somewhat short compared with Spark, its tight integration with native Hadoop is still a considerable advantage.

Distributed storage is handled by the HDFS component, and the MapReduce computation component is nowadays regarded as distributed, concurrent computation, because the YARN resource-management component takes care of resource scheduling and allocation.

This note records my first small MapReduce experiment: the introductory word-counting job, wordcount (similar in spirit to word segmentation with Python's jieba library, except that the data being processed comes from Hadoop's HDFS).

II. Initial Setup

  • Set up a Maven project
  • Modify the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <groupId>com.mrtest</groupId>
    <artifactId>mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-jobclient -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.3</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.4</version>
        </dependency>


        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>


    <build>
        <plugins>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>com.mrtest.hadoop.MainJob</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>2.10</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

When IDEA prompts you to import the Maven changes, click it so the dependencies are downloaded.


  • To download the packages declared in pom.xml faster, modify the project's settings.xml file
    and add a mirror:
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
    <mirrors>
        <!-- mirror
         | Specifies a repository mirror site to use instead of a given repository. The repository that
         | this mirror serves has an ID that matches the mirrorOf element of this mirror. IDs are used
         | for inheritance and direct lookup purposes, and must be unique across the set of mirrors.
         |
        <mirror>
          <id>mirrorId</id>
          <mirrorOf>repositoryId</mirrorOf>
          <name>Human Readable Name for this Mirror.</name>
          <url>http://my.repository.com/repo/path</url>
        </mirror>
         -->

        <mirror>
            <id>alimaven</id>
            <name>aliyun maven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>

        <mirror>
            <id>uk</id>
            <mirrorOf>central</mirrorOf>
            <name>Human Readable Name for this Mirror.</name>
            <url>http://uk.maven.org/maven2/</url>
        </mirror>

        <mirror>
            <id>CN</id>
            <name>OSChina Central</name>
            <url>http://maven.oschina.net/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>

        <mirror>
            <id>nexus</id>
            <name>internal nexus repository</name>
            <!-- <url>http://192.168.1.100:8081/nexus/content/groups/public/</url>-->
            <url>http://repo.maven.apache.org/maven2</url>
            <mirrorOf>central</mirrorOf>
        </mirror>

    </mirrors>

</settings>

Once the dependency download is finished, we can start coding.

III. MapReduce Word Count (wordcount) Programming

This is the first computation program that actually relies on "big data"; in essence it just counts the number of occurrences of each word.

  1. Mapper
    The first class to write is the Mapper. YARN distributes the Mapper to the datanodes holding the data (the work is first split up, then gathered back together), so the distributed file can be processed where it is stored.

The code:

package com.mrtest.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Created by lee
 *
 * Hadoop's mapreduce of mapper
 *
 * Determine the data types of the four generic parameters. Apache Hadoop's own writable
 * types are used here: LongWritable, Text, Text, IntWritable below are all Hadoop data objects.
 *
 * The important point about the mapper is that it is deployed to every data block.
 *
 * keyin: LongWritable
 * valuein: Text
 * This input <k, v> pair corresponds to: k (the line's byte offset), v (the content of the
 * line, which is what this example cares about)
 *
 * keyout: Text
 * valueout: IntWritable
 */

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * This overridden method is the business logic that implements the counting.
     *
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // First convert Hadoop's own data type into a Java String we can work with
        String line = value.toString();

        // Business logic: split the line into words
        String[] words = line.split(" ");

        // Iterate over the array and write out each word
        for (String word : words) {

            // Emit the data processed in the Mapper stage, i.e. the keyout and valueout;
            // the context then serves as the input of the reduce nodes
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
  2. Reducer
    The Reducer aggregates the results computed by the Mappers. If you have seen the movie The Great Wall, think of the Tao Tie queen that everything flows back to: the Reducer is similar, gathering all the intermediate results and computing the final totals. For example, if the Mappers emit (hello, 1) three times in total, the reduce call for the key hello receives the values [1, 1, 1] and writes out (hello, 3). (A toy single-process Python sketch of the whole map -> shuffle -> reduce flow follows the driver class below.)

The code:

package com.mrtest.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * create by lee
 *
 * Hadoop's mapreduce of reducer
 *
 * The input handled by the reducer here is exactly the mapper's output.
 *
 * Same way of thinking: determine the data types of the four generic parameters.
 * keyin: the mapper's output key, a word, so Text
 * valuein: the number of occurrences, a number, so IntWritable
 * keyout: the word being counted, Text
 * valueout: the word's total count, IntWritable
 */

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The reducer's business logic; it receives the mapper's output data.
     *
     * The reducer first receives the mapper's data -> sorts it by key ->
     * groups identical keys together and calls reduce() once per group.
     *
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // Counter for this key
        int count = 0;

        // Iterate over the values and add them up
        for (IntWritable value : values){

            count += value.get();
        }

        // Emit (word, total count)
        context.write(key, new IntWritable(count));
    }
}
  3. The main class that runs the job
package com.mrtest.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * create by lee
 *
 * The driver class: it mainly handles the job configuration and file paths,
 * which is how the mapreduce job gets set up and submitted.
 *
 * The input data and output data locations must be specified here.
 */
public class MainJob {

    public static void main(String[] args) throws Exception{

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        // Specify the main class of the job (Hadoop uses it to locate the jar)
        job.setJarByClass(MainJob.class);

        // Specify the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Specify the Mapper's output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Specify the job's final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Specify the input/output paths: these are paths on the HDFS file system;
        // to point at the local file system you would need a local Hadoop configuration
        FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

        // Submit the job and wait for it to complete
        boolean b = job.waitForCompletion(true);

        // Exit with the job's status code
        System.exit(b ? 0 : 1);

    }
}
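
Before moving on to the cluster, here is a toy, single-process Python sketch of the same map -> shuffle/sort -> reduce data flow (purely an illustration of the model, not of how Hadoop actually executes it; the two input lines are made up):

from collections import defaultdict

# Toy illustration of the wordcount data flow (not how Hadoop runs it)
lines = ["hello world", "hello hadoop"]  # stand-ins for lines read from the HDFS input splits

# "Map" phase: emit (word, 1) for every word, like WordCountMapper does
mapped = []
for line in lines:
    for word in line.split(" "):
        mapped.append((word, 1))

# "Shuffle/sort" phase: sort by key and group the values of identical keys,
# which is what the framework does between map and reduce
grouped = defaultdict(list)
for word, one in sorted(mapped):
    grouped[word].append(one)

# "Reduce" phase: sum the values of each group, like WordCountReducer does
for word, ones in grouped.items():
    print(word, sum(ones))  # prints: hadoop 1, hello 2, world 1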

IV. Running on the Cluster

To run it on the cluster, first package the program we just wrote into a .jar file (for example by running mvn clean package, or via IDEA's Maven panel; with the configuration below the jar is produced under target/).

If you have not configured the packaging plugins yet, add the following section to pom.xml (the pom.xml I showed above already contains it):

<build>
        <plugins>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <!-- note: this must be the fully qualified name of your main class -->
                            <mainClass>com.mrtest.hadoop.MainJob</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>2.10</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


Now on to the cluster:

  1. Start Hadoop on the cluster
  2. Upload the packaged .jar file
[root@Hadoop-1 hadoop-2.4.1]# rz
rz waiting to receive.
Starting zmodem transfer.  Press Ctrl+C to cancel.
Transferring MRTest.jar...
  100%      10 KB      10 KB/sec    00:00:01       0 Errors  

[root@Hadoop-1 hadoop-2.4.1]# ls
bin  file:    lib      logs        sbin   test01.txt
etc  include  libexec  MRTest.jar  share  test02.txt
  3. Remember to prepare two or three test data files, such as the test01.txt and test02.txt shown above, and upload them to the wordcount directory on HDFS (it must match the input path used in the code)

If the wordcount directory does not exist yet, create it: hadoop fs -mkdir -p /wordcount/input

Upload the test data files: hadoop fs -put test01.txt test02.txt /wordcount/input

  4. Submit the jar to YARN
    Command: hadoop jar MRTest.jar. The output of the run is as follows:
[root@Hadoop-1 hadoop-2.4.1]# hadoop jar MRTest.jar
20/07/14 05:20:10 INFO client.RMProxy: Connecting to ResourceManager at Hadoop-1/192.168.2.128:8032
20/07/14 05:20:11 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/07/14 05:20:12 INFO input.FileInputFormat: Total input paths to process : 2
20/07/14 05:20:12 INFO mapreduce.JobSubmitter: number of splits:2
20/07/14 05:20:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594721479652_0002
20/07/14 05:20:14 INFO impl.YarnClientImpl: Submitted application application_1594721479652_0002
20/07/14 05:20:14 INFO mapreduce.Job: The url to track the job: http://Hadoop-1:8088/proxy/application_1594721479652_0002/
20/07/14 05:20:14 INFO mapreduce.Job: Running job: job_1594721479652_0002
20/07/14 05:21:01 INFO mapreduce.Job: Job job_1594721479652_0002 running in uber mode : false
20/07/14 05:21:02 INFO mapreduce.Job:  map 0% reduce 0%
20/07/14 05:24:49 INFO mapreduce.Job:  map 100% reduce 0%
20/07/14 05:25:32 INFO mapreduce.Job:  map 100% reduce 100%
20/07/14 05:25:37 INFO mapreduce.Job: Job job_1594721479652_0002 completed successfully
20/07/14 05:25:37 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=352
                FILE: Number of bytes written=280492
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=378
                HDFS: Number of bytes written=100
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=469224
                Total time spent by all reduces in occupied slots (ms)=37803
                Total time spent by all map tasks (ms)=469224
                Total time spent by all reduce tasks (ms)=37803
                Total vcore-seconds taken by all map tasks=469224
                Total vcore-seconds taken by all reduce tasks=37803
                Total megabyte-seconds taken by all map tasks=480485376
                Total megabyte-seconds taken by all reduce tasks=38710272
        Map-Reduce Framework
                Map input records=10
                Map output records=32
                Map output bytes=282
                Map output materialized bytes=358
                Input split bytes=224
                Combine input records=0
                Combine output records=0
                Reduce input groups=15
                Reduce shuffle bytes=358
                Reduce input records=32
                Reduce output records=15
                Spilled Records=64
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=2145
                CPU time spent (ms)=29790
                Physical memory (bytes) snapshot=457777152
                Virtual memory (bytes) snapshot=6240309248
                Total committed heap usage (bytes)=258678784
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=154
        File Output Format Counters 
                Bytes Written=100

V. Viewing the Results

View the results in the web UI:

http://192.168.2.128:50070/explorer.html#/wordcount/output

As you can see, the result looks good. The same output can also be checked from the shell with hadoop fs -cat /wordcount/output/part-r-00000 (part-r-00000 is the reducer's default output file name).

Check the YARN resource manager:
http://192.168.2.128:8088/cluster


The job's logs can also be viewed from there.


VI. Python Data Visualization

(Screenshot: bar chart of the word counts, produced by the script below.)


P.S. Python really is elegant and simple.

The code:

import matplotlib.pyplot as plt
import pandas as pd

# CSV with one row per word and the columns "word" and "num"
file = pd.read_csv(r"D:\桌面\data.CSV")

x = file.word
y = file.num

plt.title('word and num Relationship')
plt.xlabel("word")
plt.ylabel("num")

# Bar chart of word counts
plt.bar(x, y)
plt.show()
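
The data.CSV above presumably holds the wordcount results exported by hand. If you would rather skip that manual step, a minimal sketch (assuming the result file was first downloaded locally, e.g. with hadoop fs -get /wordcount/output/part-r-00000) is to read the tab-separated wordcount output directly with pandas:

import matplotlib.pyplot as plt
import pandas as pd

# The wordcount output consists of tab-separated "word<TAB>count" lines;
# the local file name assumes it was fetched with hadoop fs -get
df = pd.read_csv("part-r-00000", sep="\t", names=["word", "num"])

plt.title('word and num Relationship')
plt.xlabel("word")
plt.ylabel("num")
plt.bar(df.word, df.num)
plt.show()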

Afterthought: if the data being processed were several terabytes, wouldn't my own computer simply blow up?
First impressions of Hadoop: it feels pretty good, but the environment and configuration really are complicated and fragile.