Problem description: a small table with about 1,000 rows joined against a large table with about 600,000 rows.
Scheme 1:
At run time Hive automatically converted the join into a map join.
I expected this to be fast, but only a single mapper was launched, so one task had to evaluate the join condition roughly 600 million times (1,000 × 600,000). The map stage sat at 0% for a long time and the job eventually took about 1,800 s.
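For context, the automatic conversion is governed by a handful of Hive parameters; the sketch below lists the usual knobs with their common Hive defaults (not values read from this cluster):
set hive.auto.convert.join=true;                              -- allow Hive to turn a common join into a map join
set hive.mapjoin.smalltable.filesize=25000000;                -- a table under ~25 MB counts as the small side
set hive.auto.convert.join.noconditionaltask=true;            -- convert directly, without a conditional backup task
set hive.auto.convert.join.noconditionaltask.size=10000000;   -- total small-table size allowed on that direct path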
Scheme 2:
Disable the map join:
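The exact switch is not shown in the original post; presumably it was the standard one that turns the automatic conversion off:
set hive.auto.convert.join=false;   -- fall back to a common (shuffle) join with mappers and reducers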
It is still slow. Why? Because the parallelism is still far too low: the job gets only 2 mappers and a single reducer, so the join work piles up in one reduce task, as the log below shows.
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1492598920618_36034, Tracking URL = http://qing-hadoop-master-srv1:8088/proxy/application_1492598920618_36034/
Kill Command = /data/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/bin/hadoop job -kill job_1492598920618_36034
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-06-23 13:30:34,807 Stage-1 map = 0%, reduce = 0%
2017-06-23 13:30:38,949 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.3 sec
2017-06-23 13:30:39,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.45 sec
2017-06-23 13:30:50,349 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 15.63 sec
2017-06-23 13:31:32,703 Stage-1 map = 100%, reduce = 68%, Cumulative CPU 62.93 sec
2017-06-23 13:32:33,304 Stage-1 map = 100%, reduce = 68%, Cumulative CPU 125.97 sec
2017-06-23 13:32:47,645 Stage-1 map = 100%, reduce = 69%, Cumulative CPU 141.24 sec
2017-06-23 13:33:48,111 Stage-1 map = 100%, reduce = 69%, Cumulative CPU 204.34 sec
2017-06-23 13:33:57,326 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 213.84 sec
2017-06-23 13:34:57,940 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 276.55 sec
2017-06-23 13:35:04,081 Stage-1 map = 100%, reduce = 71%, Cumulative CPU 282.85 sec
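As a side note, the log above already names the other knob on this path: forcing more reducers for the common join. A sketch, with purely illustrative values:
set mapreduce.job.reduces=8;                         -- fix the reducer count explicitly
set hive.exec.reducers.bytes.per.reducer=67108864;   -- or lower the per-reducer data target so Hive derives more reducers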
Scheme 3:
Keep the map join, but raise the map-side parallelism:
The settings below enable the map join and shrink the split size so that a suitable number of mappers is launched.
set mapred.max.split.size=1000;
set mapred.min.split.size.per.node=1000;
set mapred.min.split.size.per.rack=1000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.ignore.mapjoin.hint=false;
set hive.auto.convert.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask.size=100000000;
drop table if exists tmp_table.table20170622_2;
create table tmp_table.table20170622_2 as
select a.address,
       -- count hotels within 2 km; blank coordinates fall back to 0
       sum(case when distance_lat_lng(if(a.lat <> ' ', a.lat, 0),
                                      if(a.lng <> ' ', a.lng, 0),
                                      b.lat, b.lng) < 2
                then 1 else 0 end) as cnt
from (
      select address,
             split(lnglat, '\\|')[1] as lat,
             split(lnglat, '\\|')[0] as lng
      from tmp_table.address_sample_latlng
     ) a
left join tmp_table.hotel_location b   -- no ON clause: every address row is paired with every hotel row
where b.lat > 10 and b.lng > 10
group by a.address;
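distance_lat_lng is a custom UDF that the post does not show; presumably it returns the distance in kilometres between two lat/lng points. A rough stand-in built only from long-standing Hive built-ins (Haversine formula, with the degree-to-radian factor written out as the literal pi/180) might look like:
-- hypothetical replacement for distance_lat_lng: distance in km between (31.22, 121.47) and (31.23, 121.48)
select 2 * 6371 * asin(sqrt(
         pow(sin((31.23 - 31.22) * 0.017453292519943295 / 2), 2)
         + cos(31.22 * 0.017453292519943295) * cos(31.23 * 0.017453292519943295)
           * pow(sin((121.48 - 121.47) * 0.017453292519943295 / 2), 2)
       )) as distance_km;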
Run log:
2017-06-23 11:30:45 Starting to launch local task to process map join; maximum memory = 2022178816
2017-06-23 11:30:47 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/hdfs/4c51d439-1dac-4b0d-9476-b03afba927f1/hive_2017-06-23_11-30-41_940_1742320364386420907-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile31--.hashtable
2017-06-23 11:30:47 Uploaded 1 File to: file:/tmp/hdfs/4c51d439-1dac-4b0d-9476-b03afba927f1/hive_2017-06-23_11-30-41_940_1742320364386420907-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile31--.hashtable (16285494 bytes)
2017-06-23 11:30:47 End of local task; Time Taken: 1.882 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1492598920618_35989, Tracking URL = http://qing-hadoop-master-srv1:8088/proxy/application_1492598920618_35989/
Kill Command = /data/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/bin/hadoop job -kill job_1492598920618_35989
Hadoop job information for Stage-5: number of mappers: 64; number of reducers: 0
Log analysis:
Hadoop job information for Stage-5: number of mappers: 64; number of reducers: 0
Checking the files on HDFS shows that the small table is only about 64 KB:
hive> dfs -du -s -h /user/hive/warehouse/tmp_table.db/address_sample_latlng ;
63.4 K 190.1 K /user/hive/warehouse/tmp_table.db/address_sample_latlng
Since the maximum map split size was set to 1000 bytes (1 KB) above, the ~64 KB file is cut into roughly 64 KB / 1 KB ≈ 64 splits, hence 64 mappers.
Note that the map join is still triggered and the large table is not split at all; the effective optimization lies in splitting the small table's file, so that each mapper joins its slice of addresses against the hash table built by the local task.
After this tuning, 64 mappers were launched and the job finished in 105 s.
I did not go on to test other values for the maximum split size to tune further, but compared with the first join (about 1,800 s) this is already a big improvement.
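If one did want to push further along that line, the next experiment would simply vary the split size and re-measure; the values below are illustrative only and were not tested here:
-- e.g. ~4 KB splits over the ~64 KB file would yield roughly 16 mappers instead of 64
set mapred.max.split.size=4000;
set mapred.min.split.size.per.node=4000;
set mapred.min.split.size.per.rack=4000;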