Too many small files in Spark SQL: Spark and the HDFS small-file problem

Reposted

mob64ca13ed93fa 2023-08-29 13:54:28


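Contents

1.1 Why can't HDFS hold too many small files?

1.1.1 Concept
1.1.2 Symptoms
1.1.3 Hadoop's default memory size and the estimated number of files it can store
1.1.4 Changing the NameNode and DataNode memory

1.2 How do Flume, Hive, Tez, HBase, Spark, and Flink each avoid small files when writing to HDFS?

1.2.1 Flume
1.2.2 Hive
1.2.3 Tez
1.2.4 HBase

1.2.4.1 Reducing the number of small HFiles in HBase
1.2.4.2 Extension 1: how do too many HFiles affect HBase performance?
1.2.4.3 Extension 2: what is compaction in HBase for, when is it triggered, what are the two kinds, and how do they differ?

1.2.5 Spark

1.2.5.1 Avoiding small files when Spark writes to Hive
1.2.5.2 Extension: the difference between coalesce and repartition

1.2.6 Flink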

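1.1 Why can't HDFS hold too many small files?

1.1.1 Concept

A small file is a file whose size is smaller than the HDFS block size. Large numbers of such files cause serious scalability and performance problems for Hadoop.

The small-file problem in Hadoop mainly affects NameNode memory management and MapReduce performance.

First, in HDFS every block, file, and directory is held in NameNode memory as an object of roughly 150 bytes. With 10,000,000 small files, each occupying one block, there are about 20 million objects (one file object plus one block object per file), so the NameNode needs roughly 3 GB of memory. Storing 100 million such files takes roughly 30 GB.

Second, reading a large number of small files is far slower than reading a few large files. HDFS was originally designed for streaming access to large files; reading many small files means hopping constantly from one DataNode to another, which badly hurts performance.

Finally, processing many small files is far slower than processing the same volume of data stored in large files. Each small file occupies a task slot, and starting and tearing down tasks can consume much, or even most, of the job's time.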

1.1.2 Symptoms

If a Java heap error appears while Hadoop is running, it indicates that the NameNode does not have enough heap memory.

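1.1.3 Hadoop's default memory size and the estimated number of files it can store

The default NameNode heap in Hadoop is 1000 MB. In HDFS, every block, file, and directory is held in memory as an object of roughly 150 bytes. Put simply, one file costs the NameNode about 150 B, plus about another 150 B for each of its blocks, which gives a rough rule of thumb:

===> 1 GB of NameNode heap ≈ a few million files (1 GB / 150 B ≈ 7 million objects, or about 3.5 million single-block files)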
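
The estimate above can be written out as a small calculation. This is only a sketch of the 150-byte rule of thumb, assuming every file contributes one file object plus one object per block; the function names are made up for illustration, and real NameNode usage also depends on directories, replication bookkeeping, and JVM overhead.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb size of one file, directory, or block object


def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate the heap needed for num_files: one file object plus its block objects."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


def max_files_for_heap(heap_bytes: int, blocks_per_file: int = 1) -> int:
    """Invert the estimate: how many such files fit into a given heap."""
    return heap_bytes // (BYTES_PER_OBJECT * (1 + blocks_per_file))


if __name__ == "__main__":
    # 10 million single-block files, in GB
    print(namenode_heap_bytes(10_000_000) / 1e9)
    # Single-block files that fit in the default 1000 MB heap
    print(max_files_for_heap(1000 * 1024 * 1024))
```

Because the whole heap is never available for namespace objects alone, the second figure is best read as an upper bound.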

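1.1.4 Changing the NameNode and DataNode memory

a. Parameter reference
# The maximum amount of heap to use, in MB. Default is 1000.
# Maximum heap for every HDFS role. The default is 1000 MB, which is therefore the default heap size of every HDFS daemon. (If nothing else is overridden, changing this parameter affects both the NameNode and the DataNode.)
export HADOOP_HEAPSIZE=

# Initial heap size of the NameNode; the default is also 1000 MB
export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# JVM options specific to the NameNode; by default only the hadoop.security.logger and hdfs.audit.logger log-level settings are set
export HADOOP_NAMENODE_OPTS=""

# JVM options specific to the DataNode; by default only the hadoop.security.logger log-level setting is set
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"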

b. The NameNode heap can be configured in two ways:
## Option 1
export HADOOP_NAMENODE_INIT_HEAPSIZE="20480M"
 
## Option 2
export HADOOP_NAMENODE_OPTS="-Xms20480M -Xmx20480M -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"

c. The DataNode heap can be configured in two ways:
## Option 1 (the value is in MB, as noted above, so it takes no unit suffix)
export HADOOP_HEAPSIZE=2048
 
## Option 2, which overrides option 1 above
export HADOOP_DATANODE_OPTS="-Xms2048M -Xmx2048M -Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

d. The client heap can be configured as follows:
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"

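1.2 How do Flume, Hive, Tez, HBase, Spark, and Flink each avoid small files when writing to HDFS?

1.2.1 Flume

## Roll log files on a schedule so that large numbers of small files are not produced
        # Roll the file by time, every 5 min (300 s)
        a5.sinks.k1.hdfs.rollInterval= 300
        # Roll the file once the temporary file reaches 96 MB (100663296 bytes); 0 disables size-based rolling. This is the size before compression: with snappy compression, the resulting file on HDFS is about 35 MB
        a5.sinks.k1.hdfs.rollSize= 100663296
        # Do not roll based on the number of events
        a5.sinks.k1.hdfs.rollCount= 0
        # Close the file if no data has been written to it for 5 min (300 s); 0 disables closing idle files
        a5.sinks.k1.hdfs.idleTimeout = 300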
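
How the four settings combine can be sketched as a simple decision function. This is a simplified model of the semantics described in the comments above, not Flume's actual implementation; `RollPolicy` and `should_roll` are hypothetical names, and the defaults mirror the values in the configuration.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    roll_interval_s: int = 300        # hdfs.rollInterval, 0 disables
    roll_size_bytes: int = 100663296  # hdfs.rollSize, 0 disables
    roll_count: int = 0               # hdfs.rollCount, 0 disables
    idle_timeout_s: int = 300         # hdfs.idleTimeout, 0 disables


def should_roll(policy: RollPolicy, age_s: int, size_bytes: int,
                events: int, idle_s: int) -> bool:
    """Return True as soon as any enabled trigger has been reached."""
    if policy.roll_interval_s > 0 and age_s >= policy.roll_interval_s:
        return True
    if policy.roll_size_bytes > 0 and size_bytes >= policy.roll_size_bytes:
        return True
    if policy.roll_count > 0 and events >= policy.roll_count:
        return True
    if policy.idle_timeout_s > 0 and idle_s >= policy.idle_timeout_s:
        return True
    return False
```

With these values a file is closed after 5 minutes, at 96 MB of uncompressed data, or after 5 idle minutes, whichever comes first; the event count never triggers a roll because it is set to 0.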

1.2.2 Hive

<property>
            <name>hive.merge.size.per.task</name>
            <value>134217728</value>
            <description>Target size of merged files, here 128 MB. Size of merged files at the end of the job. Default: 256000000</description>
        </property>
        
        <property>
            <name>hive.merge.mapfiles</name>
            <value>true</value>
            <description>Merge small files produced by map tasks. Merge small files at the end of a map-only job. Default: true</description>
        </property>
        
        <property>
            <name>hive.merge.mapredfiles</name>
            <value>true</value>
            <description>Merge small files produced by reduce tasks. Merge small files at the end of a map-reduce job. Default: false</description>
        </property>

For more detail, see the article on merging small files in Hive.
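
As a rough illustration of what these settings imply, the sketch below estimates whether a merge step would run and how many files it would leave. It assumes the merge is triggered when the average output file size falls below `hive.merge.smallfiles.avgsize` (a Hive setting not covered in this article; 16000000 is used here as its default) and that merged files are packed up to `hive.merge.size.per.task`. The function `plan_merge` is hypothetical, and the real file count depends on Hive's split logic.

```python
import math
from typing import List, Optional


def plan_merge(file_sizes: List[int],
               avg_size_threshold: int = 16_000_000,
               size_per_task: int = 134_217_728) -> Optional[int]:
    """Return the approximate file count after merging, or None if no merge is triggered."""
    if not file_sizes:
        return None
    total = sum(file_sizes)
    average = total / len(file_sizes)
    if average >= avg_size_threshold:
        return None  # output files are already large enough on average
    return max(1, math.ceil(total / size_per_task))
```

For example, 500 files of 1 MB each average far below the threshold, so they would be merged into roughly 4 files of up to 128 MB.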

1.2.3 Tez

After a Tez job finishes, a separate job is launched to merge the small files on HDFS.
         <property>
              <name>hive.merge.tezfiles</name>
              <value>true</value>
            <description>
            Merge small files at the end of a Tez DAG. Default: false
            </description>
         </property>

1.2.4 HBase

1.2.4.1 Reducing the number of small HFiles in HBase

<!-- Number of HFiles allowed in one store. Each flush of the MemStore of a column family in a region writes one HFile; by default, once there are more than 3 HFiles they are merged and rewritten into a single new file. A larger value makes compactions less frequent, but each compaction takes longer. -->  
        <property>  
            <name>hbase.hstore.compactionThreshold</name>  
            <value>3</value>  
            <description>  
            If more than this number of HStoreFiles in any one HStore  
            (one HStoreFile is written per flush of memstore) then a compaction  
            is run to rewrite all HStoreFiles files as one. Larger numbers  
            put off compaction but when it runs, it takes longer to complete.  
            </description>  
        </property>  
        
        <!-- Upper limit on the number of HFiles included in one minor compaction -->
        <property>  
            <name>hbase.hstore.compaction.max</name>  
            <value>10</value>  
            <description>Max number of HStoreFiles to compact per 'minor'  
            compaction.</description>  
        </property> 
        
        
        <!-- Sleep interval of service threads, in milliseconds; used by service threads such as the log roller. -->
        <property>
              <name>hbase.server.thread.wakefrequency</name>
              <value>10000</value>  
              <description>Default 10000 ms: service threads wake every 10 s, and the periodic check for minor compactions is driven by this interval. Time to sleep in between searches for work (in milliseconds). Used as sleep interval by service threads such as log roller.</description>
        </property>
        
         <property>
              <name>hbase.hregion.majorcompaction</name>
              <value>86400000</value>
              <description>Interval between major compactions of all HStoreFiles in a region. The value 86400000 ms is 1 day, the default in older HBase releases (newer releases default to 7 days). Set to 0 to disable periodic major compactions.</description>
         </property>

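1.2.4.2 Extension 1: how do too many HFiles affect HBase performance?

Client reads from HBase time out

Read performance suffers. If every region has a backlog of 200 to 500 StoreFiles, the application side sees a large number of read timeouts.

HBase is naturally suited to writes

HBase's architecture makes it naturally suited to writes, while reads are laborious:
1. A single range scan may touch several regions, several caches, and even several HFiles.
2. Updates and deletes are simple to implement. An update does not modify the original data in place;
it adds a new version distinguished by timestamp. A delete does not remove the original data either;
it inserts a record carrying a delete marker. The data is physically removed only when HBase runs a
major compaction (comparable to a full GC). This design
greatly simplifies updates and deletes, but makes reads laborious,
because every read must filter by version and skip data that has been marked as deleted.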
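
The read-side filtering described in point 2 can be illustrated with a toy model. This is a deliberately simplified sketch, not HBase's read path: cells are plain tuples, a delete marker is assumed to mask every put with an equal or older timestamp, and `read_latest` is a hypothetical helper.

```python
def read_latest(cells, row, column):
    """Return the newest visible value for (row, column), or None.

    cells: iterable of (row, column, timestamp, kind, value) tuples,
    where kind is either "put" or "delete".
    """
    candidates = [c for c in cells if c[0] == row and c[1] == column]
    # The newest delete marker hides all puts that are not newer than it.
    newest_delete = max((c[2] for c in candidates if c[3] == "delete"), default=None)
    visible = [c for c in candidates
               if c[3] == "put" and (newest_delete is None or c[2] > newest_delete)]
    if not visible:
        return None
    return max(visible, key=lambda c: c[2])[4]


cells = [
    ("r1", "cf:a", 1, "put", "v1"),
    ("r1", "cf:a", 2, "put", "v2"),     # an update is just a newer version
    ("r1", "cf:b", 1, "put", "x"),
    ("r1", "cf:b", 2, "delete", None),  # a delete is just a marker
]
```

Every read has to do this filtering across all files that may hold the row, which is why a large backlog of StoreFiles hurts reads far more than writes.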

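1.2.4.3 Extension 2: what is compaction in HBase for, when is it triggered, what are the two kinds, and how do they differ?

In HBase, each time a MemStore is flushed to disk it produces a StoreFile. As the number of StoreFiles grows, read performance degrades badly. Compaction is HBase's self-regulating mechanism for this: it merges files according to a policy, which greatly improves read performance.
        Its main purposes are:
        |   Merging files
        |   Purging deleted data, expired data, and surplus versions
        |   Improving read and write efficiency
         
        HBase implements two kinds of compaction: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent StoreFiles and rewrite them as one. Minors do not drop deletes or expired cells, only major compactions do this. Sometimes a minor compaction will pick up all the StoreFiles in the Store and in this case it actually promotes itself to being a major compaction.
         
        The two kinds differ as follows:
        |  A minor compaction merges only some of the files and cleans up TTL-expired versions (for stores with minVersion=0 and a TTL set); it does no other cleanup of deleted or multi-version data. By default a minor compaction merges at most 10 files.
        |  A major compaction merges all StoreFiles of an HStore in a region, and the end result is a single file.
        A major compaction merges every file (even if there are 100 of them), and it is I/O-heavy enough to noticeably slow client requests while it runs.
        A common practice is therefore to disable automatic major compaction and trigger it manually (through the API).
        Compaction is triggered:
        |   After a MemStore flush, which checks whether a compaction is needed
        |   By the CompactionChecker thread, which polls periodically
        
        Take deletes as an example: a minor (lightweight) compaction does not remove the data (it remains only marked as deleted), whereas a major (heavyweight) compaction actually removes it. 
        This is analogous to minor GC and major GC in the JVM.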
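
The difference between the two kinds can be made concrete with a toy model. The sketch below is an illustration under simplifying assumptions, not HBase's selection policy: a minor compaction simply takes the first `max_files` files, TTL and version pruning are ignored, and `compact` and `drop_deleted` are hypothetical names.

```python
def drop_deleted(cells):
    """Remove delete markers and every put that is masked by a marker."""
    newest_delete = {}
    for key, ts, kind, _ in cells:
        if kind == "delete":
            newest_delete[key] = max(ts, newest_delete.get(key, ts))
    # Timestamps are assumed to be non-negative, so -1 means "no marker".
    return [c for c in cells
            if c[2] == "put" and c[1] > newest_delete.get(c[0], -1)]


def compact(files, major, max_files=10):
    """files: list of store files, each a list of (key, timestamp, kind, value) cells."""
    selected = files if major else files[:max_files]
    untouched = [] if major else files[max_files:]
    # Merge the selected files into one, sorted by key and newest version first.
    merged = sorted((cell for f in selected for cell in f),
                    key=lambda c: (c[0], -c[1]))
    if major:
        merged = drop_deleted(merged)  # only a major compaction purges deletes
    return [merged] + untouched


files = [
    [("k1", 1, "put", "a")],
    [("k1", 2, "delete", None)],
    [("k2", 1, "put", "b")],
]
```

In this model `compact(files, major=False, max_files=2)` leaves two files and still carries the delete marker for `k1`, while `compact(files, major=True)` leaves a single file containing only the live `k2` cell.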

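1.2.5 Spark

1.2.5.1 Avoiding small files when Spark writes to Hive

For example, when writing to Hive, set the number of partitions of the result RDD:
        rdd.coalesce(1)  --> calls coalesce(numPartitions, shuffle = false)
    or  rdd.repartition(1) --> calls coalesce(numPartitions, shuffle = true)
        
        Each partition is written out as one file, so choose 1 or a larger partition count according to the actual data volume. If it is 1, coalesce is usually the better choice because it avoids a shuffle and so reduces disk and network I/O, at the cost of running the upstream computation in a single task.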
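
One way to pick the partition count is to divide the expected output size by a target file size. The sketch below assumes a 128 MB target, a common HDFS block size but purely an assumption here, and the helper `target_partitions` is hypothetical. The on-disk size after compression or columnar encoding will differ from the in-memory estimate.

```python
import math


def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output partitions (and hence files) for a given data volume."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


if __name__ == "__main__":
    # About 10 GiB of output at 128 MiB per file
    print(target_partitions(10 * 1024 ** 3))
```

The result would then be passed to `rdd.coalesce(n)` or `rdd.repartition(n)` before the write.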

1.2.5.2 Extension: the difference between coalesce and repartition

a. Note:
        repartition is implemented by calling coalesce(numPartitions, shuffle = true),
        while coalesce by default uses shuffle = false, that is, coalesce(numPartitions, shuffle = false)
        
        // DStream.repartition delegates to RDD.repartition
        def repartition(numPartitions: Int): DStream[T] = ssc.withScope {
          this.transform(_.repartition(numPartitions))
        }

        // RDD.repartition calls coalesce with shuffle = true
        def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
          coalesce(numPartitions, shuffle = true)
        }

        // RDD.coalesce defaults to shuffle = false
        def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T] = withScope {
          ...
        }
        
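        b. What the source says about coalesce(numPartitions, shuffle = false)
        
        coalesce(numPartitions, shuffle = false) does not shuffle; it simply merges existing partitions. Going from 1000 partitions to 100 can leave the data unevenly distributed.
        
        coalesce(numPartitions, shuffle = true) shuffles and redistributes the data with a hash partitioner, so the resulting partitions are even when the partitioning keys are evenly distributed.
        
       /**
       The source comment in full: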
       * Return a new RDD that is reduced into `numPartitions` partitions.
       *
       * This results in a narrow dependency, e.g. if you go from 1000 partitions
       * to 100 partitions, there will not be a shuffle, instead each of the 100
       * new partitions will claim 10 of the current partitions. If a larger number
       * of partitions is requested, it will stay at the current number of partitions.
       *
       * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
       * this may result in your computation taking place on fewer nodes than
       * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
       * you can pass shuffle = true. This will add a shuffle step, but means the
       * current upstream partitions will be executed in parallel (per whatever
       * the current partitioning is).
       *
       * @note With shuffle = true, you can actually coalesce to a larger number
       * of partitions. This is useful if you have a small number of partitions,
       * say 100, potentially with a few partitions being abnormally large. Calling
       * coalesce(1000, shuffle = true) will result in 1000 partitions with the
       * data distributed using a hash partitioner. The optional partition coalescer
       * passed in must be serializable.
       **/
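
The effect on data skew can be seen in a toy model with plain lists standing in for partitions. This is not Spark's algorithm: the no-shuffle variant just groups adjacent partitions, and the shuffle variant deals records round-robin as a stand-in for the hash-partitioned redistribution. Both function names are hypothetical.

```python
def coalesce_no_shuffle(partitions, n):
    """Group adjacent input partitions into n outputs without moving single records."""
    n = min(n, len(partitions))  # cannot grow the partition count without a shuffle
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out


def coalesce_with_shuffle(partitions, n):
    """Redistribute every record across n outputs, evening out skewed inputs."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


# One abnormally large partition and three tiny ones
skewed = [list(range(100)), [1], [2], [3]]
```

Coalescing `skewed` to 2 partitions without a shuffle keeps the skew (101 and 2 records), while the shuffled variant yields 52 and 51 records at the cost of moving every record.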

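1.2.6 Flink

The idea is the same as with Spark: when writing to Hive, set the parallelism of the sink. Writes usually go through a tumbling window function as well; set the parallelism to 1 or higher depending on the data volume.
    DataStreamSink.setParallelism(1)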
