Distributed computing frameworks (Spark, PyTorch): a distributed computing framework course

Reposted

mob64ca13f446df 2023-09-22 18:57:54

Tags: MapReduce, WordCount, mr, hdfs, Text. Categories: Spark, big data

1. Understanding the MapReduce idea

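The idea behind MapReduce shows up everywhere in daily life, and most people have met it in one form or another. Its core is "divide and conquer", which suits workloads made up of a large number of complex tasks (large-scale data processing). Even Google, which published the paper describing its distributed computing implementation, only implemented this idea; it did not invent it.

Map handles the "divide" part: it breaks a complex task into a number of "simple tasks" that are processed in parallel. Splitting is only possible when these small tasks can be computed in parallel and have almost no dependencies on one another.

Reduce handles the "merge" part: it performs a global aggregation over the results of the map phase.

Together, these two phases make up the MapReduce idea.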
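The two phases can be sketched in plain Java without any Hadoop dependency. The example below is only an illustration under assumed names (DivideAndConquerSketch, split, mapPhase and reducePhase are hypothetical and not part of any framework): it splits a list of numbers into chunks, sums every chunk independently, then merges the partial sums into one result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Hadoop.
final class DivideAndConquerSketch {

    // Split the input into chunks of at most chunkSize elements.
    static List<List<Integer>> split(List<Integer> input, int chunkSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < input.size(); i += chunkSize) {
            chunks.add(input.subList(i, Math.min(i + chunkSize, input.size())));
        }
        return chunks;
    }

    // "Map": every chunk is summed on its own. The chunks do not depend on
    // each other, so a parallel stream may process them concurrently.
    static List<Long> mapPhase(List<List<Integer>> chunks) {
        return chunks.parallelStream()
                .map(chunk -> chunk.stream().mapToLong(Integer::longValue).sum())
                .collect(Collectors.toList());
    }

    // "Reduce": merge the partial results into one global result.
    static long reducePhase(List<Long> partialSums) {
        long total = 0;
        for (long partial : partialSums) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Long> partials = mapPhase(split(numbers, 3));
        System.out.println("partial sums: " + partials);
        System.out.println("total: " + reducePhase(partials));
    }
}
```

The split works here because addition does not care how the numbers are grouped; a task whose pieces depend on each other's results could not be divided this way.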

2. MapReduce programming model and steps

2.1 The eight steps

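A MapReduce job has eight steps in total.

Map phase, two steps:
Step 1: read the input files and parse them into key-value pairs (k1, v1).
Step 2: apply the custom map logic, which takes (k1, v1) and emits new pairs (k2, v2).
Shuffle phase, four steps:
Step 3: partitioning. Records with the same key are sent to the same reducer.
Step 4: sorting. Records are sorted by key.
Step 5: combining. An optional local aggregation of the map output that reduces the amount of data sent over the network.
Step 6: grouping. Records with the same key are merged, and their values form one collection.
Reduce phase, two steps:
Step 7: apply the custom reduce logic, which takes k2 with its collection of v2 values and emits new pairs (k3, v3).
Step 8: write (k3, v3) to the destination.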
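The four shuffle steps are the least visible ones, because the framework performs them for you. The following plain-Java sketch imitates them in memory. The partition formula is the one used by Hadoop's default HashPartitioner; everything else (the ShuffleSketch class, its method names, the use of TreeMap for sorting) is a simplified assumption for illustration and not how Hadoop is implemented internally.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of steps 3 to 6; not Hadoop code.
final class ShuffleSketch {

    // Step 3: choose a reducer for a key. Same formula as Hadoop's default
    // HashPartitioner: clear the sign bit, then take the remainder.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Step 3 applied to every pair: one bucket per reducer.
    static List<List<Map.Entry<String, Long>>> partitionAll(
            List<Map.Entry<String, Long>> pairs, int numReduceTasks) {
        List<List<Map.Entry<String, Long>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Map.Entry<String, Long> pair : pairs) {
            buckets.get(partition(pair.getKey(), numReduceTasks)).add(pair);
        }
        return buckets;
    }

    // Steps 4 and 5: sum the values of each key locally. The TreeMap keeps
    // the keys sorted, so the result is sorted and pre-aggregated.
    static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        List<Map.Entry<String, Long>> combined = new ArrayList<>();
        for (Map.Entry<String, Long> e : sums.entrySet()) {
            combined.add(new SimpleEntry<>(e.getKey(), e.getValue()));
        }
        return combined;
    }

    // Steps 4 and 6: sort by key and gather all values of a key into one list.
    static Map<String, List<Long>> sortAndGroup(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        for (String word : new String[] {"hello", "world", "hadoop", "hello"}) {
            mapOutput.add(new SimpleEntry<>(word, 1L));
        }
        List<List<Map.Entry<String, Long>>> buckets = partitionAll(mapOutput, 2);
        for (int i = 0; i < buckets.size(); i++) {
            Map<String, List<Long>> input = sortAndGroup(combine(buckets.get(i)));
            System.out.println("reducer " + i + " receives " + input);
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a word lands in the same bucket, which is what allows a single reducer to produce the final count for that word.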

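3. Writing the WordCount example

3.1 Requirement

Given a set of text files, count and output the total number of occurrences of each word.

3.2 Preparing the input data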

Create the input file:

cd /export/servers
vim wordcount.txt

Contents of wordcount.txt:

hello,world,hadoop
hive,sqoop,flume,hello
kitty,tom,jerry,world
hadoop

Upload the file to HDFS:

hdfs dfs -mkdir /wordcount/
hdfs dfs -put wordcount.txt /wordcount/
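Before running anything on a cluster, it helps to know what result to expect. The sketch below counts the same four sample lines in memory with plain Java. LocalWordCount is a hypothetical helper written for this check; it does not use Hadoop and does not read from HDFS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local check; it does not use Hadoop or read from HDFS.
final class LocalWordCount {

    // Count comma-separated words across all lines, sorted by word.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(",")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello,world,hadoop",
                "hive,sqoop,flume,hello",
                "kitty,tom,jerry,world",
                "hadoop");
        // Print one "word<TAB>count" line per word.
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

For the sample file, hello, world and hadoop each occur twice and the remaining six words once, so the MapReduce job should report nine distinct words with the same counts.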

3.3 Writing the code

3.3.1 The mapper class

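import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // k1 is the byte offset of the line within the file, v1 is the line itself
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // The sample data separates words with commas
        String[] split = line.split(",");
        for (String word : split) {
            if (word.isEmpty()) {
                continue; // a blank line would otherwise be counted as an empty word
            }
            // Emit (word, 1) as (k2, v2)
            context.write(new Text(word), new LongWritable(1));
        }
    }
}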
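One detail of the mapper worth checking in isolation is how String.split treats empty input: splitting an empty string on "," returns an array with one empty string, and consecutive commas produce empty tokens in the middle. The helper below is a small sketch of a tokenizer that drops such tokens. LineTokenizer and tokens are hypothetical names, and trimming whitespace around each word is an extra assumption that the original mapper does not make.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; shows how split(",") behaves on edge cases.
final class LineTokenizer {

    // Split a line on commas, trim each token and drop the empty ones.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        for (String raw : line.split(",")) {
            String word = raw.trim(); // assumption: surrounding spaces are not part of a word
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("tokens from an empty line via split: " + "".split(",").length);
        System.out.println(tokens(""));
        System.out.println(tokens("hello,,world, hadoop"));
    }
}
```

A mapper could call such a helper instead of using split directly, so that blank lines and stray commas do not end up as an empty key in the output.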

3.3.2 The reducer class

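import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    /**
     * Custom reduce logic.
     * Each key is a word; its values are the counts emitted for that word.
     * @param key the word
     * @param values the counts collected for the word
     * @param context used to emit the result
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all counts for this word
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        // Emit (word, total) as (k3, v3)
        context.write(key, new LongWritable(count));
    }
}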

3.3.3 The main class, which describes and submits the job

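import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), JobMain.class.getSimpleName());
        // Required when the job runs on the cluster from a packaged JAR:
        // Hadoop finds the JAR to ship by locating the one that contains this class
        job.setJarByClass(JobMain.class);
        // Step 1: read the input files and parse them into key-value pairs
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://192.168.52.100:8020/wordcount"));

        // Step 2: set the mapper class
        job.setMapperClass(WordCountMapper.class);
        // Output types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Steps 3 to 6 (partition, sort, combine, group) are left at their defaults
        // Step 7: set the reducer class
        job.setReducerClass(WordCountReducer.class);
        // Output types of the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Step 8: set the output format and the output path
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("hdfs://192.168.52.100:8020/wordcount_out"));
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    /**
     * Entry point of the program.
     * @param args command-line arguments
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Tool tool = new JobMain();
        int run = ToolRunner.run(configuration, tool, args);
        System.exit(run);
    }
}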

3.3.4 Troubleshooting

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=admin, access=WRITE, inode="/":root:supergroup:drwxr-xr-x

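Fix
The quickest way is to turn off permission checking in hdfs-site.xml. This disables permission checks for the whole cluster, so it is only appropriate for a test environment. Hadoop 2.x and later name this property dfs.permissions.enabled and treat dfs.permissions as a deprecated alias.

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

Then restart the HDFS cluster.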

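4. MapReduce run modes

4.1 Local mode

The MapReduce program is submitted to LocalJobRunner and runs locally as a single process.
The input data and the output can live on the local file system or on HDFS.
How do you run locally? Write the program without putting the cluster configuration files on its classpath.
What actually decides the mode is whether the program's conf contains mapreduce.framework.name=local (the default when no cluster configuration is loaded) and yarn.resourcemanager.hostname=local.
Local mode makes it very easy to debug the business logic: just set breakpoints in Eclipse.
Local mode settings in code: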

// In main(), before ToolRunner.run(...)
configuration.set("mapreduce.framework.name","local");
configuration.set("yarn.resourcemanager.hostname","local");
// In run(), replacing the HDFS input and output paths
TextInputFormat.addInputPath(job,new Path("file:///F:\\wordcount\\input"));
TextOutputFormat.setOutputPath(job,new Path("file:///F:\\wordcount\\output"));

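4.2 Cluster mode

The MapReduce program is submitted to the YARN cluster, which distributes it to many nodes for concurrent execution.
The input data and the output should be on HDFS.
Steps to submit to the cluster:

Code setting:

// Required when the job runs on the cluster from a packaged JAR:
// Hadoop finds the JAR to ship by locating the one that contains this class
job.setJarByClass(JobMain.class);

Package the program as a JAR, then launch it from any node of the cluster with the Hadoop command-line tools: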

yarn jar hadoop_hdfs_operate-1.0-SNAPSHOT.jar cn.itcast.hdfs.demo1.JobMain
