Deploying MapReduce v2 (YARN) on a Cluster

Original post by wbj0110, 2023-07-24 18:02:57. Blog category: Hadoop.

Tags: Hadoop, xml, HDFS, mapreduce

© Copyright belongs to the author. This is an original work by 51CTO blog author wbj0110; contact the author for permission before reproducing it.

This section describes configuration tasks for YARN clusters only, and is specifically tailored for administrators who have installed YARN from packages.

Important:

Do the following tasks only after you have configured and deployed HDFS.

Note: Running Services

When starting, stopping, and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. The service command sets the current working directory to / and removes most environment variables (passing only LANG and TERM), which creates a predictable environment in which to administer the service. If you run the scripts in /etc/init.d directly, any environment variables you have set remain in force and could produce unpredictable results. (If you install CDH from packages, service is installed as part of the Linux Standard Base (LSB).)

About MapReduce v2 (YARN)

The default installation in CDH 5 is MapReduce 2.x (MRv2), built on the YARN framework. In this document we usually refer to this new version as YARN. The fundamental idea of MRv2's YARN architecture is to split the two primary responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager (RM) and per-application ApplicationMasters (AM). With MRv2, the ResourceManager and the per-node NodeManagers (NM) form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on slave nodes instead of TaskTracker daemons.

The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).

Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade performance and may result in an unstable cluster deployment.

If you have installed YARN from packages, follow the instructions below to deploy it. (To deploy MRv1 instead, see Deploying MapReduce v1 (MRv1) on a Cluster.)
If you have installed CDH 5 from tarballs, the default deployment is YARN. Keep in mind that the instructions on this page are tailored for a deployment following installation from packages.

Step 1: Configure Properties for YARN Clusters

Note:

Edit these files in the custom directory you created when you copied the Hadoop configuration. When you have finished, you will push this configuration to all the nodes in the cluster; see Step 5.

Property	Configuration File	Description
mapreduce.framework.name	mapred-site.xml	If you plan on running YARN, you must set this property to the value of yarn.

Sample Configuration:

mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Step 2: Configure YARN daemons

Configure the following services: ResourceManager (on a dedicated host) and NodeManager (on every host where you plan to run MapReduce v2 jobs).

The following table shows the most important properties that you must configure for your cluster in yarn-site.xml:

Property	Recommended value	Description
yarn.nodemanager.aux-services	mapreduce_shuffle	Shuffle service that must be set for MapReduce applications.
yarn.resourcemanager.hostname	resourcemanager.company.com	Host name of the ResourceManager. The following properties are set to their default ports on this host: yarn.resourcemanager.address, yarn.resourcemanager.admin.address, yarn.resourcemanager.scheduler.address, yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.webapp.address
yarn.application.classpath	$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*, $HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*, $HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*, $HADOOP_MAPRED_HOME/lib/*, $HADOOP_YARN_HOME/*, $HADOOP_YARN_HOME/lib/*	Classpath for typical applications.
yarn.log-aggregation-enable	true	Enables log aggregation, which collects container logs into the directory set by yarn.nodemanager.remote-app-log-dir.

Next, you need to specify, create, and assign the correct permissions to the local directories where you want the YARN daemons to store data.

You specify the directories by configuring the following properties in yarn-site.xml:

Property	Description
yarn.nodemanager.local-dirs	Specifies the URIs of the directories where the NodeManager stores its localized files. All of the files required for running a particular YARN application are put here for the duration of the application run. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/local through file:///data/N/yarn/local.
yarn.nodemanager.log-dirs	Specifies the URIs of the directories where the NodeManager stores container log files. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/logs through file:///data/N/yarn/logs.
yarn.nodemanager.remote-app-log-dir	Specifies the URI of the directory where logs are aggregated. Set the value to the HDFS directory /var/log/hadoop-yarn/apps, for example hdfs://<namenode-host>/var/log/hadoop-yarn/apps. Note that a URI written as hdfs://var/log/hadoop-yarn/apps would treat var as the NameNode host name. See also Step 9.
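
With many mount points, the comma-separated values for the two NodeManager directory properties become long and easy to mistype. As a minimal sketch, assuming the mount points follow the /data/1 through /data/N naming used in the examples, a small shell helper can generate them. The name yarn_dir_uris is hypothetical and is not part of Hadoop or CDH.

```shell
#!/bin/sh
# Print a comma-separated list of file:// URIs, one per JBOD mount point.
# Assumption: mount points are named /data/1 ... /data/N, as in the examples.
# yarn_dir_uris is a hypothetical helper, not a Hadoop command.
yarn_dir_uris() {
    count=$1      # number of mount points (N)
    subdir=$2     # "local" or "logs"
    i=1
    uris=""
    while [ "$i" -le "$count" ]; do
        # Add a comma only when the list already has an entry.
        uris="${uris}${uris:+,}file:///data/${i}/yarn/${subdir}"
        i=$((i + 1))
    done
    printf '%s\n' "$uris"
}

# Value for yarn.nodemanager.local-dirs on a host with three mount points:
yarn_dir_uris 3 local
# Value for yarn.nodemanager.log-dirs on the same host:
yarn_dir_uris 3 logs
```

Each printed line can be pasted into the value element of the matching property in yarn-site.xml.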

Here is an example configuration:

yarn-site.xml:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.company.com</value>
  </property>
  <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
        $HADOOP_CONF_DIR,
        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
        $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
        $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
    </value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- The property name must contain the aux-service name set above. -->
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <!-- namenode.company.com is a placeholder for your NameNode host. -->
    <value>hdfs://namenode.company.com/var/log/hadoop-yarn/apps</value>
  </property>

After specifying these directories in yarn-site.xml, create them and assign the correct permissions on each node in your cluster.

In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration.

To configure local storage directories for use by YARN:

Create the yarn.nodemanager.local-dirs local directories:
$ sudo mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
Create the yarn.nodemanager.log-dirs local directories:
$ sudo mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
Configure the owner of the yarn.nodemanager.local-dirs directory to be the yarn user:
$ sudo chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
Configure the owner of the yarn.nodemanager.log-dirs directory to be the yarn user:
$ sudo chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs

Here is a summary of the correct owner and permissions of the local directories:

Directory	Owner	Permissions
yarn.nodemanager.local-dirs	yarn:yarn	drwxr-xr-x
yarn.nodemanager.log-dirs	yarn:yarn	drwxr-xr-x
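
The directory setup has to be repeated on every NodeManager host, so it may be convenient to script it. The following is a minimal sketch, assuming the /data/N mount-point naming from the examples. The names prepare_yarn_dirs, run, and DRY_RUN are hypothetical, and the chmod 755 step is an addition that simply enforces the drwxr-xr-x mode from the summary table. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default here) or execute the commands that prepare
# the NodeManager local and log directories on every mount point.
# prepare_yarn_dirs, run, and DRY_RUN are hypothetical names for illustration.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"      # show the command instead of running it
    else
        "$@"
    fi
}

prepare_yarn_dirs() {
    count=$1    # number of /data/N mount points on this host
    i=1
    while [ "$i" -le "$count" ]; do
        for sub in local logs; do
            dir="/data/${i}/yarn/${sub}"
            run sudo mkdir -p "$dir"
            run sudo chown -R yarn:yarn "$dir"
            run sudo chmod 755 "$dir"   # drwxr-xr-x, as in the summary table
        done
        i=$((i + 1))
    done
}

prepare_yarn_dirs 4
```

Review the printed commands first, then set DRY_RUN=0 in the environment to execute them.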

Step 3: Configure the History Server

Configure the following JobHistory Server properties in mapred-site.xml:

Property	Recommended value	Description
mapreduce.jobhistory.address	historyserver.company.com:10020	The address of the JobHistory Server host:port
mapreduce.jobhistory.webapp.address	historyserver.company.com:19888	The address of the JobHistory Server web application host:port

In addition, make sure proxying is enabled for the mapred user; configure the following properties in core-site.xml:

Property	Recommended value	Description
hadoop.proxyuser.mapred.groups	*	Allows the mapred user to move files belonging to users in these groups
hadoop.proxyuser.mapred.hosts	*	Allows the mapred user to move files when it connects from these hosts

Step 4: Configure the Staging Directory

YARN requires a staging directory for temporary files created by running jobs. By default it creates /tmp/hadoop-yarn/staging with restrictive permissions that may prevent your users from running jobs. To forestall this, you should configure and create the staging directory yourself; in the example that follows we use /user:

Configure yarn.app.mapreduce.am.staging-dir in mapred-site.xml:
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
Once HDFS is up and running, you will create this directory and a history subdirectory under it (see Step 8).

Alternatively, you can do the following:

Configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in mapred-site.xml.
Create these two directories.
Set permissions on mapreduce.jobhistory.intermediate-done-dir
Set permissions on mapreduce.jobhistory.done-dir

If you configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir as above, you can skip Step 8.

Step 5: If Necessary, Deploy your Custom Configuration to your Entire Cluster

Deploy the configuration if you have not already done so.

Step 6: If Necessary, Start HDFS on Every Node in the Cluster

Start HDFS if you have not already done so.

Step 7: If Necessary, Create the HDFS /tmp Directory

Create the /tmp directory if you have not already done so.

Important:

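If you do not create /tmp properly, with the right permissions (drwxrwxrwt, as in the listing in Step 10), you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that prevent your other applications from using it.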

Step 8: Create the history Directory and Set Permissions and Owner

This is a subdirectory of the staging directory you configured in Step 4. In this example we're using /user/history. Create it and set permissions as follows:

$ sudo -u hdfs hadoop fs -mkdir -p /user/history
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history

Step 9: Create Log Directories

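Create the /var/log/hadoop-yarn directory in HDFS and make it owned by yarn:mapred, as the listing in Step 10 shows. It is the parent of the log aggregation directory /var/log/hadoop-yarn/apps configured in Step 2.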
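
The Step 10 listing shows /var/log/hadoop-yarn in HDFS owned by yarn with group mapred. As a minimal sketch, following the command pattern of Step 8, the helper below prints commands that would produce such an entry. The name log_dir_cmds is hypothetical, and the exact commands are an assumption inferred from that listing rather than taken from this page.

```shell
#!/bin/sh
# Print the HDFS commands for creating the parent log directory.
# log_dir_cmds is a hypothetical helper; the default path and the yarn:mapred
# owner are taken from the Step 10 listing.
log_dir_cmds() {
    dir=${1:-/var/log/hadoop-yarn}
    printf 'sudo -u hdfs hadoop fs -mkdir -p %s\n' "$dir"
    printf 'sudo -u hdfs hadoop fs -chown yarn:mapred %s\n' "$dir"
}

log_dir_cmds
```

Run the printed commands on a host where the hdfs user can reach the cluster, once HDFS is up.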

Step 10: Verify the HDFS File Structure:

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt - hdfs supergroup 0 2012-04-19 14:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 10:26 /user
drwxrwxrwt - yarn supergroup 0 2012-04-19 14:31 /user/history
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
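
Checking a long recursive listing by eye is error-prone. The following is a minimal sketch of an automated check, assuming the usual hadoop fs -ls column order (permissions, replication, owner, group, size, date, time, path) and paths without spaces. The name has_entry is hypothetical and is not a Hadoop command.

```shell
#!/bin/sh
# has_entry PERMS OWNER PATH
# Succeeds if the listing on stdin contains a line for PATH with exactly
# these permissions and this owner.
# Assumed column layout: permissions replication owner group size date time path
# has_entry is a hypothetical helper, not a Hadoop command.
has_entry() {
    awk -v perms="$1" -v owner="$2" -v path="$3" '
        $NF == path && $1 == perms && $3 == owner { found = 1 }
        END { exit !found }
    '
}

# Sample lines copied from the expected output.
listing='drwxrwxrwt - hdfs supergroup 0 2012-04-19 14:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 10:26 /user'

printf '%s\n' "$listing" | has_entry drwxrwxrwt hdfs /tmp && echo "/tmp looks right"
```

On a live cluster the same check could read the real listing, for example: sudo -u hdfs hadoop fs -ls -R / | has_entry drwxrwxrwt hdfs /tmp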

Step 11: Start YARN and the MapReduce JobHistory Server

To start YARN, start the ResourceManager and NodeManager services:

Note:

Make sure you always start ResourceManager before starting NodeManager services.

On the ResourceManager system:

$ sudo service hadoop-yarn-resourcemanager start

On each NodeManager system (typically the same ones where DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start

To start the MapReduce JobHistory Server:

On the MapReduce JobHistory Server system:

$ sudo service hadoop-mapreduce-historyserver start

Step 12: Create a Home Directory for each MapReduce User

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
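
When there are many users, the two commands can be generated from a list of usernames. This is a minimal sketch. The name create_home_cmds is hypothetical, and alice and bob are placeholder usernames. The helper only prints the commands so that they can be reviewed before anything is changed in HDFS.

```shell
#!/bin/sh
# Print the HDFS commands that create and hand over a home directory for
# each username given as an argument.
# create_home_cmds is a hypothetical helper; alice and bob are placeholders.
create_home_cmds() {
    for user in "$@"; do
        printf 'sudo -u hdfs hadoop fs -mkdir /user/%s\n' "$user"
        printf 'sudo -u hdfs hadoop fs -chown %s /user/%s\n' "$user" "$user"
    done
}

create_home_cmds alice bob
```

Piping the output to sh on a host where the hdfs user can reach the cluster would execute the commands.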

Step 13: Configure the Hadoop Daemons to Start at Boot Time

See Configuring the Hadoop Daemons to Start at Boot Time.

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_yarn_cluster_deploy.html#topic_11_4
