MapReduce Code Example

Welcome to this MapReduce algorithm example. Before writing MapReduce programs in a Cloudera environment, we will first discuss how the MapReduce algorithm works in theory, using a simple example. In my next posts, we will discuss how to develop a MapReduce program that performs word counting, along with some other simple and useful examples.

(MapReduce Algorithm)

MapReduce is a distributed data processing algorithm, introduced by Google in its MapReduce technical paper.

The MapReduce algorithm is mainly inspired by the functional programming model. (Please read the post “Functional Programming Basics” for an overview of functional programming, how it works, and its major advantages.)
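To make the functional programming connection concrete, here is a minimal sketch using Python's built-in `map` and `functools.reduce`. It is only an illustration of the two ideas the algorithm borrows, not a MapReduce implementation, and the sample numbers are made up for this example.

```python
from functools import reduce

# Hypothetical sample data, chosen only for illustration.
numbers = [1, 2, 3, 4]

# "map": apply a pure function to every element independently.
squares = list(map(lambda n: n * n, numbers))

# "reduce": fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because the mapped function has no side effects, each element could in principle be processed on a different machine, which is the property MapReduce builds on.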

The MapReduce algorithm is mainly useful for processing huge amounts of data in a parallel, reliable, and efficient way in cluster environments.

I hope everyone is familiar with the “Divide and Conquer” technique. MapReduce uses it to process large amounts of data.

It divides the input task into smaller, manageable sub-tasks (which must be executable independently) so that they can run in parallel.
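The divide-and-conquer idea can be sketched on a single machine as follows. This is a simplified illustration, assuming a thread pool as a stand-in for the nodes of a cluster; the helper names `split` and `count_words` and the input lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, size):
    """Divide the input into independent chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_words(chunk):
    """Sub-task: count the words in one chunk, using no shared state."""
    return sum(len(line.split()) for line in chunk)

# Hypothetical input lines, not taken from the original post.
lines = ["deer bear river", "car car river", "deer car bear"]

chunks = split(lines, 2)
with ThreadPoolExecutor() as pool:
    # Each chunk is handled independently; pool.map keeps the input order.
    partial_counts = list(pool.map(count_words, chunks))

total_words = sum(partial_counts)
print(partial_counts)  # [6, 3]
print(total_words)     # 9
```

The sub-tasks never read or write each other's data, so they can run in any order or at the same time.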

(MapReduce Algorithm Steps)

The MapReduce algorithm uses the following three main steps:

  1. Map Function
  2. Shuffle Function
  3. Reduce Function

Here we discuss the role and responsibility of each function in the MapReduce algorithm. If this section is not fully clear at first, don’t panic. Please read the next section, which explains the steps in detail with a simple word counting example, and then come back and re-read this one. I bet you will understand these three steps (or functions) very well.

(Map Function)

The Map Function is the first step in the MapReduce algorithm. It takes input tasks (say, DataSets; I have given only one DataSet in the diagram below) and divides them into smaller sub-tasks. It then performs the required computation on each sub-task in parallel.

This step performs the following two sub-steps:

  1. Splitting
  2. Mapping
  • The Splitting step takes the input DataSet from the source and divides it into smaller Sub-DataSets.
  • The Mapping step takes those smaller Sub-DataSets and performs the required action or computation on each Sub-DataSet.

The output of the Map Function is a set of key-value pairs of the form <Key, Value>, as shown in the diagram below.
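The Splitting and Mapping sub-steps can be sketched as below for the word counting case. This is a minimal illustration, assuming that each line of the input is one Sub-DataSet and that each word is emitted with the value 1; the function names and the input text are hypothetical.

```python
def split_input(text):
    """Splitting: break the input DataSet into smaller Sub-DataSets (lines)."""
    return text.splitlines()

def map_line(line):
    """Mapping: emit a (word, 1) pair for every word in one Sub-DataSet."""
    return [(word, 1) for word in line.split()]

# Hypothetical input DataSet, not taken from the original post.
dataset = "deer bear river\ncar car river\ndeer car bear"

mapped = [map_line(line) for line in split_input(dataset)]
print(mapped[0])  # [('deer', 1), ('bear', 1), ('river', 1)]
print(mapped[1])  # [('car', 1), ('car', 1), ('river', 1)]
```

Note that the Map step does not add anything up yet: a word that occurs twice simply produces two separate pairs.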

MapReduce First Step Output:

(Shuffle Function)

It is the second step in the MapReduce algorithm. The Shuffle Function is also known as the “Combine Function” (not to be confused with Hadoop's optional Combiner, which pre-aggregates map output locally before it is shuffled).

It performs the following two sub-steps:

  1. Merging
  2. Sorting

It takes the list of outputs coming from the Map Function and performs these two sub-steps on every key-value pair.

  • The Merging step combines all key-value pairs that have the same key (that is, it groups key-value pairs by comparing the “Key”). This step returns <Key, List<Value>>.
  • The Sorting step takes its input from the Merging step and sorts all key-value pairs by key. This step also returns <Key, List<Value>> output, but with the pairs sorted.

Finally, the Shuffle Function returns a list of sorted <Key, List<Value>> pairs to the next step.
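The Merging and Sorting sub-steps can be sketched as follows. This is a single-machine illustration, assuming the Map output has already been collected into one flat list of pairs; the function name `shuffle` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def shuffle(pairs):
    """Merge pairs by key, then sort by key: returns [(key, [values...])]."""
    grouped = defaultdict(list)
    for key, value in pairs:        # Merging: collect values under their key
        grouped[key].append(value)
    return sorted(grouped.items())  # Sorting: order the groups by key

# Hypothetical Map output for two short lines of text.
map_output = [("deer", 1), ("bear", 1), ("river", 1),
              ("car", 1), ("car", 1), ("river", 1)]

shuffled = shuffle(map_output)
print(shuffled)
# [('bear', [1]), ('car', [1, 1]), ('deer', [1]), ('river', [1, 1])]
```

In a real cluster this step also moves data between machines so that all values for one key end up in the same place; the sketch only shows the grouping and ordering.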

MapReduce Second Step Output:

(Reduce Function)

It is the final step in the MapReduce algorithm. It performs only one step: the Reduce step.

It takes the list of sorted <Key, List<Value>> pairs from the Shuffle Function and performs the reduce operation, as shown below.
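For word counting, the reduce operation is a sum over each key's list of values. Here is a minimal sketch, assuming the input is already grouped and sorted; the function name `reduce_pairs`, its `combine` parameter, and the sample input are hypothetical.

```python
def reduce_pairs(shuffled, combine=sum):
    """Reduce: collapse each key's list of values into a single value."""
    return [(key, combine(values)) for key, values in shuffled]

# Hypothetical Shuffle output: sorted <Key, List<Value>> pairs.
shuffled = [("bear", [1, 1]), ("car", [1, 1, 1]),
            ("deer", [1, 1]), ("river", [1, 1])]

result = reduce_pairs(shuffled)
print(result)
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Passing a different `combine` function (for example `max`) would give a different reduction over the same grouped input.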

MapReduce Final Step Output:

The final step output looks like the first step output. However, the final step <Key, Value> pairs are different from the first step <Key, Value> pairs: the final step pairs are computed and sorted.

The difference between the first step output and the final step output is easiest to see with a simple example. We will walk through the same steps with one in the next section.

That covers all three steps of the MapReduce algorithm.

(MapReduce Example – Word Count)

In this section, we discuss theoretically how the MapReduce algorithm solves the WordCount problem. We will implement a Hadoop MapReduce program and test it in my coming post.

Problem Statement: Count the number of occurrences of each word in a DataSet.

Input DataSet: Please find our example input DataSet file in the diagram below. For simplicity, we use a small DataSet. Real-world applications, however, process very large amounts of data.

Client Required Final Result

MapReduce – Map Function (Split Step)

MapReduce – Map Function (Mapping Step)

MapReduce – Shuffle Function (Merge Step)

MapReduce – Shuffle Function (Sorting Step)

MapReduce – Reduce Function (Reduce Step)

MapReduce 3 Step Process With WordCount Example
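The whole three-step process can also be sketched end to end in a few lines. This is a single-process illustration, not the Hadoop program promised for the next post. The input text is hypothetical, and lower-casing the words is an assumption made here so that “Car” and “car” count as the same word.

```python
from collections import defaultdict
from itertools import chain

def map_step(dataset):
    """Split the DataSet into lines, then map every word to (word, 1)."""
    splits = dataset.splitlines()
    # Assumption: words are lower-cased so that case differences are ignored.
    return [[(word.lower(), 1) for word in line.split()] for line in splits]

def shuffle_step(mapped):
    """Merge pairs from all splits by key, then sort the groups by key."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_step(shuffled):
    """Sum each key's list of values to get the final count per word."""
    return [(key, sum(values)) for key, values in shuffled]

def word_count(dataset):
    return reduce_step(shuffle_step(map_step(dataset)))

# Hypothetical input DataSet, not taken from the original post.
dataset = "Deer Bear River\nCar Car River\nDeer Car Bear"
print(word_count(dataset))
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Each function only depends on the output of the previous one, which mirrors how the three steps hand <Key, Value> data to each other.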

That’s all about the MapReduce algorithm and the step-by-step MapReduce example. It’s time to start developing and testing MapReduce programs. First, we are going to develop the same word counting example in my coming post.

Please drop me a comment if you like my post or have any issues or suggestions.