Hadoop job status web UI
Both the JobTracker (MapReduce master) and the NameNode (HDFS master) provide web-based status pages that show the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.

Hadoop monitoring
Use Nagios for alerting and Ganglia for monitoring graphs.

Exit status 255 error
Error:
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

Possible cause:
Try setting mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to higher values; both default to 24 hours. These settings might be the reason for the failure, though I am not sure.

split size
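FileInputFormat input splits (see "Hadoop: The Definitive Guide", p. 190):
mapred.min.split.size: default = 1, the smallest valid size in bytes for a file split.
mapred.max.split.size: default = Long.MAX_VALUE, the largest valid size in bytes for a file split.
dfs.block.size: default = 64 MB; set to 128 MB on our cluster.
If minimum split size > block size, each split spans more than one block, which increases the number of blocks that are not local to the map task. (My guess: the extra blocks are fetched from other nodes and merged into the split.)
If maximum split size < block size, each block is divided further into several splits.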

split size = max(minimumSize, min(maximumSize, blockSize));
By default minimumSize < blockSize < maximumSize, so the split size equals blockSize.
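
The formula above can be tried out with a minimal sketch in plain Java. The class name SplitSizeSketch and the sample sizes are made up for illustration, and nothing here calls Hadoop itself; it only evaluates the same max/min expression for the three cases discussed in these notes.

```java
public class SplitSizeSketch {
    // Same expression as in the notes: max(minimumSize, min(maximumSize, blockSize)).
    public static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // block size used on the cluster in these notes

        // Defaults (min = 1, max = Long.MAX_VALUE): the split equals one block.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize) / mb + " MB");

        // Hypothetical min of 256 MB > block size: the split grows past one block.
        System.out.println(computeSplitSize(256 * mb, Long.MAX_VALUE, blockSize) / mb + " MB");

        // Hypothetical max of 32 MB < block size: each block is cut into smaller splits.
        System.out.println(computeSplitSize(1L, 32 * mb, blockSize) / mb + " MB");
    }
}
```

With these sample values the three cases give 128 MB, 256 MB and 32 MB respectively.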

sort by value
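Hadoop does not provide a direct way to sort by value, because doing so would hurt MapReduce performance.
It can be done with a composite key, though; see "Hadoop: The Definitive Guide", p. 250.
Basic idea:
1. Combine the key and value into a new composite key;
2. Override the partitioner so that it partitions by the old key only;
conf.setPartitionerClass(FirstPartitioner.class);
3. Define a custom key comparator that sorts by the old key first, then by the old value;
conf.setOutputKeyComparatorClass(KeyComparator.class);
4. Override the grouping comparator so that it also groups by the old key only;
conf.setOutputValueGroupingComparator(GroupComparator.class);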
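
The four steps above can be illustrated with a small plain-Java simulation. This is only a sketch of the idea, not Hadoop code: the class SecondarySortSketch, the Pair type and the reduceView helper are hypothetical names, and the sorting and grouping that the framework would do during the shuffle are imitated with an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // Step 1: composite key that carries both the old key and the old value.
    public static final class Pair {
        final String oldKey;
        final int oldValue;
        public Pair(String oldKey, int oldValue) { this.oldKey = oldKey; this.oldValue = oldValue; }
    }

    // Step 2: partition on the old key only, so equal old keys land in the same partition.
    public static int partition(Pair p, int numPartitions) {
        return (p.oldKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Step 3: sort comparator, old key first and then old value.
    public static final Comparator<Pair> KEY_COMPARATOR = new Comparator<Pair>() {
        public int compare(Pair a, Pair b) {
            int c = a.oldKey.compareTo(b.oldKey);
            return c != 0 ? c : Integer.compare(a.oldValue, b.oldValue);
        }
    };

    // Step 4: grouping comparator, old key only.
    public static final Comparator<Pair> GROUP_COMPARATOR = new Comparator<Pair>() {
        public int compare(Pair a, Pair b) { return a.oldKey.compareTo(b.oldKey); }
    };

    // Imitates what one reducer would see: records sorted, then grouped by old key.
    public static List<String> reduceView(List<Pair> records) {
        List<Pair> sorted = new ArrayList<Pair>(records);
        Collections.sort(sorted, KEY_COMPARATOR);
        List<String> groups = new ArrayList<String>();
        StringBuilder current = null;
        Pair prev = null;
        for (Pair p : sorted) {
            if (prev == null || GROUP_COMPARATOR.compare(prev, p) != 0) {
                if (current != null) groups.add(current.toString());
                current = new StringBuilder(p.oldKey + ":"); // start a new group
            } else {
                current.append(",");
            }
            current.append(p.oldValue);
            prev = p;
        }
        if (current != null) groups.add(current.toString());
        return groups;
    }

    public static void main(String[] args) {
        List<Pair> input = Arrays.asList(
            new Pair("b", 2), new Pair("a", 3), new Pair("a", 1), new Pair("b", 1));
        for (String group : reduceView(input)) {
            System.out.println(group); // one line per old key, values in ascending order
        }
    }
}
```

In a real job the three comparator and partitioner roles are played by the classes registered with setPartitionerClass, setOutputKeyComparatorClass and setOutputValueGroupingComparator; the simulation only shows why each reduce group ends up with its values already in order.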

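Handling small input files
Using a large number of small files as input reduces Hadoop's efficiency.
There are three ways to combine small files for processing:
1. Merge the small files into a single SequenceFile to speed up MapReduce.
See WholeFileInputFormat and SmallFilesToSequenceFileConverter, "Hadoop: The Definitive Guide", p. 194
2. Use CombineFileInputFormat, which extends FileInputFormat (I have not tried this myself);
3. Use Hadoop archives (similar to packing files together) to reduce the NameNode memory consumed by small-file metadata. (This approach may not always work, so it is not recommended.)
   How to:
   Archive the /my/files directory and its subdirectories into files.har and place it under /my:
   bin/hadoop archive -archiveName files.har /my/files /my
   
   List the files in the archive:
   bin/hadoop fs -lsr har:///my/files.har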

skip bad records
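// Imports needed by this fragment (old org.apache.hadoop.mapred API):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.*;

// ProductMR, Product, Map and Reduce are this job's own classes.
JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));
// Accept any number of skipped records around each bad record.
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
// Enter skipping mode right away instead of after several failed attempts.
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
// Directory where the skipped records are written.
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));
String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);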

To let the job continue when some map tasks fail, try mapred.max.map.failures.percent, the percentage of map tasks that may fail without the job being aborted.

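Restarting a single datanode
If a datanode had a problem and, once it is fixed, needs to rejoin the cluster without restarting the whole cluster, run the following on that node:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker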