Setting Up an HDFS Environment: HDFS Environment Variable Configuration

Reposted

mob6454cc798a0c 2024-04-19 16:34:55


NameNode web UI: http://namenode:50070

JobTracker web UI: http://jobtracker:50030

Hadoop daemon log directory: set with the environment variable ${HADOOP_LOG_DIR}; the default is ${HADOOP_HOME}/logs.

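1. Configuring environment variables per node type

When configuring a cluster, environment variables for each type of daemon can be set in conf/hadoop-env.sh: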

Daemon	Configure Options
NameNode	HADOOP_NAMENODE_OPTS
DataNode	HADOOP_DATANODE_OPTS
SecondaryNamenode	HADOOP_SECONDARYNAMENODE_OPTS
JobTracker	HADOOP_JOBTRACKER_OPTS
TaskTracker	HADOOP_TASKTRACKER_OPTS

For example, adding the following line to hadoop-env.sh makes the NameNode use ParallelGC:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

2. Configuring the Hadoop daemons
conf/core-site.xml: sets the root of the default file system, i.e. hdfs://namenode

Parameter	Value	Notes
fs.default.name	URI of NameNode.	hdfs://hostname/
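
As a minimal sketch of the table above, a conf/core-site.xml could look like the following. The host name namenode is only a placeholder taken from the URLs earlier in this article, and the file assumes a Hadoop 1.x layout where configuration lives under conf/.

```xml
<?xml version="1.0"?>
<!-- conf/core-site.xml (illustrative sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- "namenode" is a placeholder host name; substitute your own NameNode -->
    <value>hdfs://namenode/</value>
  </property>
</configuration>
```

Every node in the cluster, as well as client machines, would normally carry the same value so that relative paths resolve against the same file system.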

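conf/hdfs-site.xml: sets the directories where the NameNode and DataNodes store their data files

Parameter	Value	Notes
dfs.name.dir	Local directory on the NameNode that holds the namespace and transaction logs, i.e. the fsimage and edits files.	If this is a comma-separated list of directories, the NameNode stores a redundant copy in each of them.
dfs.data.dir	Directory on a DataNode where blocks are stored.	If this is a comma-separated list of directories, all of them are used to store data.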
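
The two parameters above might be set as in the sketch below. All directory paths are hypothetical and only illustrate the comma-separated form; in practice each path would usually sit on a different physical disk.

```xml
<?xml version="1.0"?>
<!-- conf/hdfs-site.xml (illustrative sketch, paths are hypothetical) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Each listed directory receives a redundant copy of fsimage and edits -->
    <value>/data/1/dfs/name,/data/2/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- Blocks are spread across all listed directories -->
    <value>/data/1/dfs/data,/data/2/dfs/data</value>
  </property>
</configuration>
```

Note the different semantics: multiple dfs.name.dir entries provide redundancy, while multiple dfs.data.dir entries add capacity.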

conf/mapred-site.xml: configures the MapReduce framework

Parameter	Value	Notes
mapred.job.tracker	Host (or IP address) and port of the JobTracker.	host:port pair.
mapred.system.dir	Path on the HDFS where the MapReduce framework stores system files, e.g. /hadoop/mapred/system/.	This is in the default filesystem (HDFS) and must be accessible from both the server and client machines.
mapred.local.dir	Comma-separated list of paths on the local filesystem where temporary MapReduce data is written.	Multiple paths help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum	The maximum number of map tasks and of reduce tasks, set individually, that run simultaneously on a given TaskTracker.	Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware.
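dfs.hosts/dfs.hosts.exclude	List of permitted/excluded DataNodes.	If necessary, use these files to control the list of allowable DataNodes. These are HDFS properties and are normally set in conf/hdfs-site.xml rather than conf/mapred-site.xml.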
mapred.hosts/mapred.hosts.exclude	List of permitted/excluded TaskTrackers.	If necessary, use these files to control the list of allowable TaskTrackers.
mapred.queue.names	Comma-separated list of queues to which jobs can be submitted.	The MapReduce system always supports at least one queue, named default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for details.
mapred.acls.enabled	Boolean specifying whether queue ACLs and job ACLs are checked when authorizing users for queue operations and job operations.	If true, queue ACLs are checked while submitting and administering jobs, and job ACLs are checked for authorizing view and modification of jobs. Queue ACLs are specified using configuration parameters of the form mapred.queue.queue-name.acl-name, defined in mapred-queue-acls.xml. Job ACLs are described under Job Authorization.
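mapred.task.timeout	How long a map or reduce task may go without reporting back before it is considered failed.	Typically 10 minutes (the value is given in milliseconds, i.e. 600000).
mapred.map.max.attempts	Maximum number of times a failed map task is attempted before the framework gives up on it.	Typically 4.
mapred.reduce.max.attempts	Maximum number of times a failed reduce task is attempted before the framework gives up on it.	Typically 4.
mapred.max.map.failures.percent	Percentage of map tasks in a job that may fail without the job itself being failed.	Sometimes a job's output is still useful even if some of its tasks fail; this parameter sets the maximum share of map task failures a job tolerates.
mapred.max.reduce.failures.percent	Same as above, for reduce tasks.	Same as above, for reduce tasks.
mapred.tasktracker.expiry.interval	How long a TaskTracker may go without sending a heartbeat to the JobTracker before it is considered lost.	Defaults to 10 minutes.
mapred.user.jobconf.limit	Maximum allowed size of a user's job configuration (jobconf), not a limit on the number of jobs.	Defaults to 5 MB.
mapred.tasktracker.map.tasks.maximum	Number of map tasks a single TaskTracker can run simultaneously.	Defaults to 2.
mapred.tasktracker.reduce.tasks.maximum	Number of reduce tasks a single TaskTracker can run simultaneously.	Defaults to 2.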
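
To make the table concrete, here is a hedged sketch of a conf/mapred-site.xml that sets a handful of these parameters. The host name jobtracker is a placeholder, port 9001 and the local directory paths are assumptions chosen for illustration, and the numeric values simply restate the defaults listed above. The property name mapred.map.max.attempts is the spelling used by Hadoop 1.x.

```xml
<?xml version="1.0"?>
<!-- conf/mapred-site.xml (illustrative sketch) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- host:port pair; host name and port are placeholders -->
    <value>jobtracker:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <!-- Path on HDFS, reachable from servers and clients -->
    <value>/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <!-- Hypothetical local paths; several disks spread temporary I/O -->
    <value>/data/1/mapred/local,/data/2/mapred/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <!-- Milliseconds: 600000 ms = 10 minutes -->
    <value>600000</value>
  </property>
  <property>
    <name>mapred.map.max.attempts</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.queue.names</name>
    <!-- Must always contain "default" -->
    <value>default</value>
  </property>
</configuration>
```

The two tasks.maximum values are the ones most worth tuning per machine, since they determine how many task JVMs compete for CPU and memory on each TaskTracker.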

Below are the cluster settings the Hadoop project used for some of its own benchmark runs; they can serve as a reference in practical work.

This section lists some non-default configuration parameters which have been used to run the sort benchmark on very large clusters.

• Some non-default configuration values used to run sort900, that is, 9TB of data sorted on a cluster with 900 nodes:

Configuration File	Parameter	Value	Notes
conf/hdfs-site.xml	dfs.block.size	134217728	HDFS blocksize of 128MB for large file-systems.
conf/hdfs-site.xml	dfs.namenode.handler.count	40	More NameNode server threads to handle RPCs from large number of DataNodes.
conf/mapred-site.xml	mapred.reduce.parallel.copies	20	Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.
conf/mapred-site.xml	mapred.map.child.java.opts	-Xmx512M	Larger heap-size for child jvms of maps.
conf/mapred-site.xml	mapred.reduce.child.java.opts	-Xmx512M	Larger heap-size for child jvms of reduces.
conf/core-site.xml	fs.inmemory.size.mb	200	Larger amount of memory allocated for the in-memory file-system used to merge map-outputs at the reduces.
conf/core-site.xml	io.sort.factor	100	More streams merged at once while sorting files.
conf/core-site.xml	io.sort.mb	200	Higher memory-limit while sorting data.
conf/core-site.xml	io.file.buffer.size	131072	Size of read/write buffer used in SequenceFiles.
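
As an illustration of how rows of this table translate into a configuration file, the sketch below shows only the two conf/hdfs-site.xml entries for sort900. The remaining rows would go into conf/mapred-site.xml and conf/core-site.xml in the same property format.

```xml
<?xml version="1.0"?>
<!-- conf/hdfs-site.xml, sort900 values only (illustrative sketch) -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <!-- Bytes: 128 * 1024 * 1024 = 134217728, i.e. 128MB -->
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <!-- Number of NameNode server threads handling RPCs -->
    <value>40</value>
  </property>
</configuration>
```

The block size is given in bytes, so it helps to record the unit conversion in a comment next to the value.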

• Updates to some configuration values to run sort1400 and sort2000, that is, 14TB of data sorted on 1400 nodes and 20TB of data sorted on 2000 nodes:

Configuration File	Parameter	Value	Notes
conf/mapred-site.xml	mapred.job.tracker.handler.count	60	More JobTracker server threads to handle RPCs from large number of TaskTrackers.
conf/mapred-site.xml	mapred.reduce.parallel.copies	50
conf/mapred-site.xml	tasktracker.http.threads	50	More worker threads for the TaskTracker's http server. The http server is used by reduces to fetch intermediate map-outputs.
conf/mapred-site.xml	mapred.map.child.java.opts	-Xmx512M	Larger heap-size for child jvms of maps.
conf/mapred-site.xml	mapred.reduce.child.java.opts	-Xmx1024M	Larger heap-size for child jvms of reduces.
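
Since every row of this last table belongs to conf/mapred-site.xml, the updates can be sketched as a single file. The values are copied from the table; treat them as benchmark-specific starting points rather than general recommendations.

```xml
<?xml version="1.0"?>
<!-- conf/mapred-site.xml, sort1400/sort2000 updates (illustrative sketch) -->
<configuration>
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <!-- JobTracker server threads handling RPCs from TaskTrackers -->
    <value>60</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>50</value>
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <!-- Worker threads serving intermediate map outputs to reduces -->
    <value>50</value>
  </property>
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
</configuration>
```

Compared with sort900, the notable changes are the larger reduce heap (1024M instead of 512M) and the higher number of parallel copies (50 instead of 20).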