Hadoop Distributed Big Data & Python Visualization Analysis
1. MapReduce
MapReduce is Hadoop's dedicated computation component. Although it falls somewhat short of Spark in places, its tight integration with the rest of the native Hadoop stack is still a real strength.
Distributed storage is the job of the HDFS component, and today's MapReduce is likewise regarded as distributed, concurrent computation, because the YARN resource-management component handles resource scheduling and allocation cleanly.
This note records my first small MapReduce experiment: the introductory word-count job (similar in spirit to Python's jieba library, except that the data being processed comes from HDFS).
2. Initial Setup
- Set up the Maven project skeleton
- Modify the pom.xml file as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <groupId>com.mrtest</groupId>
    <artifactId>mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-jobclient -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.4</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>com.mrtest.hadoop.MainJob</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>2.10</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Then click the import prompt in IDEA:
- To download the packages declared in pom.xml faster, edit Maven's settings.xml file and add a mirror:
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
    <mirrors>
        <!-- mirror
         | Specifies a repository mirror site to use instead of a given repository. The repository that
         | this mirror serves has an ID that matches the mirrorOf element of this mirror. IDs are used
         | for inheritance and direct lookup purposes, and must be unique across the set of mirrors.
         |
        <mirror>
            <id>mirrorId</id>
            <mirrorOf>repositoryId</mirrorOf>
            <name>Human Readable Name for this Mirror.</name>
            <url>http://my.repository.com/repo/path</url>
        </mirror>
        -->
        <mirror>
            <id>alimaven</id>
            <name>aliyun maven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>
        <mirror>
            <id>uk</id>
            <mirrorOf>central</mirrorOf>
            <name>Human Readable Name for this Mirror.</name>
            <url>http://uk.maven.org/maven2/</url>
        </mirror>
        <mirror>
            <id>CN</id>
            <name>OSChina Central</name>
            <url>http://maven.oschina.net/content/groups/public/</url>
            <mirrorOf>central</mirrorOf>
        </mirror>
        <mirror>
            <id>nexus</id>
            <name>internal nexus repository</name>
            <!-- <url>http://192.168.1.100:8081/nexus/content/groups/public/</url>-->
            <url>http://repo.maven.apache.org/maven2</url>
            <mirrorOf>central</mirrorOf>
        </mirror>
    </mirrors>
</settings>
Once the dependencies have been downloaded, you can start coding.
3. Programming the MapReduce Word Count
This is the first compute program that actually sits on top of "big data"; in essence it counts how often each word appears. For instance, given an input line such as "hello world hello", the job would output "hello 2" and "world 1".
- Mapper
The first class to write is the Mapper. YARN distributes the Mapper tasks across the DataNodes (the work is first split up, then gathered back together), so the distributed file is processed where it is stored.
The code:
package com.mrtest.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Created by lee
 *
 * Hadoop MapReduce mapper.
 *
 * First work out the four type parameters. They are Hadoop's own writable types:
 * LongWritable, Text, Text, IntWritable below are all Hadoop data objects.
 *
 * The key point is that the mapper is deployed to every input split.
 *
 * keyin:   LongWritable
 * valuein: Text
 * The input <k, v> pair corresponds to: k (the byte offset of a line) and
 * v (the content of that line, which is what this example cares about).
 *
 * keyout:   Text
 * valueout: IntWritable
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * This overridden method is the business logic that does the counting.
     *
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // First convert Hadoop's writable type into a plain Java String
        String line = value.toString();
        // Business logic: split the line into words
        String[] words = line.split(" ");
        // Iterate over the words and emit each one
        for (String word : words) {
            // Emit the mapper's output, i.e. keyout / valueout;
            // the context becomes the input of the reduce stage
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
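As an aside (my own sketch, not part of the class above), the reference WordCount that ships with Hadoop usually reuses its output writables instead of allocating a new Text and IntWritable for every word, which avoids a lot of object churn on large inputs. The map method could then look like this, everything else unchanged:

    // Hypothetical variant of map() with reused output objects
    private final Text outKey = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            outKey.set(word);            // reuse the same Text instance
            context.write(outKey, ONE);  // the framework serializes the pair right away, so reuse is safe
        }
    }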
- Reducer
The Reducer aggregates the results produced by the Mappers. If you have seen the movie The Great Wall, think of the Taotie queen: everything eventually flows back to one place, and in the same way the Reducer gathers all of the results and computes the final totals.
The code:
package com.mrtest.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Created by lee
 *
 * Hadoop MapReduce reducer.
 *
 * The reducer's input is exactly the mapper's output.
 *
 * Same thought process: work out the four type parameters.
 * keyin:    the mapper emits words, so Text
 * valuein:  the per-word count, a number, so IntWritable
 * keyout:   the word being counted, Text
 * valueout: the word's total count, IntWritable
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The reducer's business logic; it receives the mapper's output.
     *
     * The framework collects the mapper output -> sorts it by key ->
     * groups identical keys together and calls reduce() once per group.
     *
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Counter for this key
        int count = 0;
        // Sum up all of the 1s emitted for this word
        for (IntWritable value : values) {
            count += value.get();
        }
        // Emit the final <word, total> pair
        context.write(key, new IntWritable(count));
    }
}
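Because this reduce step is just an associative sum, the same class can also be registered as a combiner in the driver shown below, so counts are pre-aggregated on the map side and much less data is shuffled over the network. A one-line sketch (my own addition, not in the original driver):

    // Optional: pre-aggregate on the map side; safe here because summing counts
    // is associative and commutative
    job.setCombinerClass(WordCountReducer.class);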
- The main driver class
package com.mrtest.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Created by lee
 *
 * The driver: it mainly handles the file paths and job wiring so the
 * MapReduce job can run; the input and output locations are set here.
 */
public class MainJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // The main class of the job
        job.setJarByClass(MainJob.class);
        // The mapper and reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // The mapper's output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // The job's final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths. These are HDFS paths; to point at the local
        // file system you would need a local Hadoop configuration.
        FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        // Exit with the job's status code
        System.exit(b ? 0 : 1);
    }
}
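One pitfall worth noting: FileOutputFormat refuses to start if the output directory already exists, so re-running the job against /wordcount/output fails with a FileAlreadyExistsException. Below is a small sketch of a guard that could go into main() before submission (my own addition; it assumes the same hard-coded paths as above and needs an extra import of org.apache.hadoop.fs.FileSystem):

    // Delete a leftover output directory so the job can be re-run
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("/wordcount/output");
    if (fs.exists(output)) {
        fs.delete(output, true);  // true = recursive delete
    }
    FileOutputFormat.setOutputPath(job, output);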
4. Running on the Cluster
To run on the cluster, first package the program into a .jar file (for example with mvn clean package; the jar ends up under target/).
If the packaging plugins are not configured yet, add the following block to pom.xml (the pom.xml shown above already contains it):
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <!-- Note: this must be the fully qualified name of your main class -->
                        <mainClass>com.mrtest.hadoop.MainJob</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>2.10</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/lib</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Now run it on the cluster:
- Start Hadoop on the cluster
- Upload the packaged .jar file
[root@Hadoop-1 hadoop-2.4.1]# rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring MRTest.jar...
100% 10 KB 10 KB/sec 00:00:01 0 Errors
[root@Hadoop-1 hadoop-2.4.1]# ls
bin file: lib logs sbin test01.txt
etc include libexec MRTest.jar share test02.txt
- Remember to prepare two or three sample input files, such as the test01.txt and test02.txt shown in the listing above, and upload them to the wordcount directory in HDFS (it must match the input path used in the code).
If the wordcount directory does not exist yet, create it: hadoop fs -mkdir -p /wordcount/input
Upload the test data files: hadoop fs -put test01.txt test02.txt /wordcount/input
- Submit the jar to YARN
Command: hadoop jar MRTest.jar (the main class is already declared in the jar's manifest, so it does not need to be passed on the command line)
The run looks like this:
[root@Hadoop-1 hadoop-2.4.1]# hadoop jar MRTest.jar
20/07/14 05:20:10 INFO client.RMProxy: Connecting to ResourceManager at Hadoop-1/192.168.2.128:8032
20/07/14 05:20:11 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/07/14 05:20:12 INFO input.FileInputFormat: Total input paths to process : 2
20/07/14 05:20:12 INFO mapreduce.JobSubmitter: number of splits:2
20/07/14 05:20:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594721479652_0002
20/07/14 05:20:14 INFO impl.YarnClientImpl: Submitted application application_1594721479652_0002
20/07/14 05:20:14 INFO mapreduce.Job: The url to track the job: http://Hadoop-1:8088/proxy/application_1594721479652_0002/
20/07/14 05:20:14 INFO mapreduce.Job: Running job: job_1594721479652_0002
20/07/14 05:21:01 INFO mapreduce.Job: Job job_1594721479652_0002 running in uber mode : false
20/07/14 05:21:02 INFO mapreduce.Job: map 0% reduce 0%
20/07/14 05:24:49 INFO mapreduce.Job: map 100% reduce 0%
20/07/14 05:25:32 INFO mapreduce.Job: map 100% reduce 100%
20/07/14 05:25:37 INFO mapreduce.Job: Job job_1594721479652_0002 completed successfully
20/07/14 05:25:37 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=352
FILE: Number of bytes written=280492
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=378
HDFS: Number of bytes written=100
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=469224
Total time spent by all reduces in occupied slots (ms)=37803
Total time spent by all map tasks (ms)=469224
Total time spent by all reduce tasks (ms)=37803
Total vcore-seconds taken by all map tasks=469224
Total vcore-seconds taken by all reduce tasks=37803
Total megabyte-seconds taken by all map tasks=480485376
Total megabyte-seconds taken by all reduce tasks=38710272
Map-Reduce Framework
Map input records=10
Map output records=32
Map output bytes=282
Map output materialized bytes=358
Input split bytes=224
Combine input records=0
Combine output records=0
Reduce input groups=15
Reduce shuffle bytes=358
Reduce input records=32
Reduce output records=15
Spilled Records=64
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=2145
CPU time spent (ms)=29790
Physical memory (bytes) snapshot=457777152
Virtual memory (bytes) snapshot=6240309248
Total committed heap usage (bytes)=258678784
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=154
File Output Format Counters
Bytes Written=100
5. Viewing the Results
View the results in the web UI:
http://192.168.2.128:50070/explorer.html#/wordcount/output
The results look pretty good.
Check YARN's resource manager:
http://192.168.2.128:8088/cluster
And view the logs there.
6. Data Visualization with Python
P.S. Python really is elegant and simple.
The code reads a local CSV (data.CSV) with a word column and a num column holding the counts:
import matplotlib.pyplot as plt
import pandas as pd

# Load the word counts saved locally as a CSV with columns "word" and "num"
file = pd.read_csv(r"D:\桌面\data.CSV")
x = file.word
y = file.num

# Bar chart of word frequencies
plt.title('word and num Relationship')
plt.xlabel("word")
plt.ylabel("num")
plt.bar(x, y)
plt.show()
Afterthought: if the data being processed were several terabytes, wouldn't my machine simply blow up?
First experience with Hadoop: it feels good overall, but the environment and configuration are genuinely complex and fragile.