Data background

The test inserts 2000+ rows into a table with 120+ columns (a business requirement).
Execution-time statistics logging is enabled so the effect can be compared before and after batching.
The workload includes the business queries plus the data processing (1 delete, 4 complex SQL queries, and 2000+ inserted rows in total).
The database is Oracle.

At the very end there is a saveAll pitfall that gets copied all over the internet without anyone ever explaining it clearly.

Below are the batching configuration and the execution-time statistics configuration:

spring:
  jpa:
    properties:
      hibernate:
        # print execution-time statistics
        generate_statistics: true
        jdbc:
          # enables batch inserts for Spring Data JPA's saveAll; tune the size as needed
          batch_size: 500
          batch_versioned_data: true
        order_inserts: true
        # batch updates are not used here, so this setting is optional
        order_updates: true

Without batching

First, before batch inserts were configured, the system did no batch processing at all, and there was only a single flush:

4792100 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    84075100 nanoseconds spent preparing 2203 JDBC statements;
    10701684100 nanoseconds spent executing 2203 JDBC statements;
    0 nanoseconds spent executing 0 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    12018229400 nanoseconds spent executing 1 flushes (flushing a total of 2202 entities and 0 collections);
    595400 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)

With batching

You can see in the statistics below that the batch count increased, but the batches were not split as 2000/500. I puzzled over this for a long time, convinced it was a database parameter problem.
For MySQL you can append rewriteBatchedStatements=true to the JDBC URL, but Oracle does not support that parameter and reports an error.
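
For reference, this is roughly how the MySQL-only flag is set (hypothetical URL; the Oracle JDBC driver has no equivalent):

spring:
  datasource:
    # MySQL Connector/J only: rewrites a batch of INSERTs into a single
    # multi-row INSERT; the Oracle driver does not recognize this flag
    url: jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true

The statistics with batching enabled: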

3592000 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    42867300 nanoseconds spent preparing 646 JDBC statements;
    36625900 nanoseconds spent executing 8 JDBC statements;
    3085078300 nanoseconds spent executing 638 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    6765787300 nanoseconds spent executing 1 flushes (flushing a total of 2202 entities and 0 collections);
    6153600 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)

It turned out the culprit was the @DynamicInsert annotation on the entity. @DynamicInsert builds the insert SQL dynamically, omitting null fields, so the insert statements differ from row to row. Hibernate groups consecutive identical statements into one batch; when consecutive statements differ, it splits them into separate batches. (I use Hibernate's uuid generator for the primary key rather than a sequence, so key generation never touches the database.)
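
For illustration, the entity looked roughly like this (the business field names are placeholders):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.annotations.DynamicInsert;
import org.hibernate.annotations.GenericGenerator;

// @DynamicInsert makes Hibernate build the INSERT column list per row,
// omitting null fields; two rows with different null fields therefore
// produce different SQL strings and cannot share one JDBC batch.
@Entity
@DynamicInsert // commenting this out restores identical INSERT statements
public class XXXEntity {

    @Id
    @GenericGenerator(name = "system-uuid", strategy = "uuid")
    @GeneratedValue(generator = "system-uuid")
    private String xxxid;

    // 120+ business columns omitted; with @DynamicInsert, any of them
    // being null changes the generated column list
    private String colA;
    private String colB;

    // getters and setters omitted
}

After removing the annotation, the statistics looked like this: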

4480700 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    1386400 nanoseconds spent preparing 9 JDBC statements;
    36561200 nanoseconds spent executing 8 JDBC statements;
    413603400 nanoseconds spent executing 5 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    19469112800 nanoseconds spent executing 1 flushes (flushing a total of 2202 entities and 0 collections);
    7884900 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)

With @DynamicInsert commented out, the batch count drops to 5, which is as expected. The flush time, however, increased.

Flushing in multiple passes

The next step was to try flushing in multiple passes.

Here the EntityManager is injected via the @PersistenceContext annotation, since EntityManager itself is not thread-safe (Spring injects a shared proxy that delegates to the transaction-bound EntityManager).

@PersistenceContext
private EntityManager entityManager;

The saveAll code change

entityManager.persist transitions each new entity into the managed (persistent) state. The detailed source analysis comes later.

public void save(List<XXXEntity> result) {
    batchInsert(result);
//  if (!CollectionUtils.isEmpty(result)) {
//      XXXRepository.saveAll(result);
//  }
}

/**
 * Batch insert.
 *
 * @param list entity list
 * @param <T>  entity type mapped to the table
 */
public <T> void batchInsert(List<T> list) {
    if (!CollectionUtils.isEmpty(list)) {
        for (int i = 0; i < list.size(); i++) {
            entityManager.persist(list.get(i));
            // also tested with 100; note this condition fires at i == 0 too,
            // which adds one extra flush compared to flushing every full 500
            if (i % 500 == 0) {
                entityManager.flush();  // push the pending INSERTs to the database
                entityManager.clear();  // detach entities to keep the persistence context small
            }
        }
        entityManager.flush();
        entityManager.clear();
    }
}
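
Worth noting: with a transaction-scoped EntityManager, flush() throws a TransactionRequiredException outside a transaction, so the entry point calling batchInsert must be transactional. A minimal sketch (the service class name is hypothetical):

@Service
public class XXXBatchService {

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional // required: flush()/clear() need an active transaction
    public void save(List<XXXEntity> result) {
        batchInsert(result); // the batchInsert method shown above
    }
}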

With i % 100:

3744200 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    3345300 nanoseconds spent preparing 31 JDBC statements;
    48246500 nanoseconds spent executing 8 JDBC statements;
    485294000 nanoseconds spent executing 23 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    17969493100 nanoseconds spent executing 23 flushes (flushing a total of 2202 entities and 0 collections);
    813000 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)

With i % 500:

4044700 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    1293800 nanoseconds spent preparing 14 JDBC statements;
    32300100 nanoseconds spent executing 8 JDBC statements;
    329406500 nanoseconds spent executing 6 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    2756675900 nanoseconds spent executing 6 flushes (flushing a total of 2202 entities and 0 collections);
    692600 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)

To sum up: the more flushes, the more time spent in flushes; the more batches, the more time spent executing batches. It's a trade-off.
I also came across a claim that flush has some upper limit; that still needs verification.

saveAll source analysis

Since this project uses Spring Data JPA, the source to read is naturally JPA's. It lives in org.springframework.data.jpa.repository.support.SimpleJpaRepository:

/*
 * (non-Javadoc)
 * @see org.springframework.data.jpa.repository.JpaRepository#save(java.lang.Iterable)
 */
@Transactional
public <S extends T> List<S> saveAll(Iterable<S> entities) {

    Assert.notNull(entities, "The given Iterable of entities not be null!");

    List<S> result = new ArrayList<S>();

    for (S entity : entities) {
        result.add(save(entity));
    }

    return result;
}

/*
 * (non-Javadoc)
 * @see org.springframework.data.repository.CrudRepository#save(java.lang.Object)
 */
@Transactional
public <S extends T> S save(S entity) {

    if (entityInformation.isNew(entity)) {
        em.persist(entity);
        return entity;
    } else {
        return em.merge(entity);
    }
}

As you can see, saveAll is just a for loop calling entityInformation.isNew(entity) on each element.
Tracing into isNew leads to org.springframework.data.jpa.repository.support.JpaMetamodelEntityInformation:

/*
 * (non-Javadoc)
 * @see org.springframework.data.repository.core.support.AbstractEntityInformation#isNew(java.lang.Object)
 */
@Override
public boolean isNew(T entity) {

    if (!versionAttribute.isPresent()
            || versionAttribute.map(Attribute::getJavaType).map(Class::isPrimitive).orElse(false)) {
        return super.isNew(entity);
    }

    BeanWrapper wrapper = new DirectFieldAccessFallbackBeanWrapper(entity);

    return versionAttribute.map(it -> wrapper.getPropertyValue(it.getName()) == null).orElse(true);
}

And, in turn, org.springframework.data.repository.core.support.AbstractEntityInformation:

/*
 * (non-Javadoc)
 * @see org.springframework.data.repository.core.EntityInformation#isNew(java.lang.Object)
 */
public boolean isNew(T entity) {

    ID id = getId(entity);
    Class<ID> idType = getIdType();

    if (!idType.isPrimitive()) {
        return id == null;
    }

    if (id instanceof Number) {
        return ((Number) id).longValue() == 0L;
    }

    throw new IllegalArgumentException(String.format("Unsupported primitive id type %s!", idType));
}

It looks at the configured ID: if the ID type is not a primitive (boolean, char, byte, short, int, long, float, double), the entity counts as new when that field is null. The id is only assigned after em.persist(entity).
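
For a String id like the uuid here, the check boils down to a null test. A small timing sketch (assuming the XXXEntity from earlier, with a getter for xxxid):

XXXEntity e = new XXXEntity();
// e.getXxxid() == null here, so AbstractEntityInformation.isNew() returns true
entityManager.persist(e);
// Hibernate's uuid generator has now assigned the id in memory,
// without a database round-trip (unlike a sequence): e.getXxxid() != null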

PS: what follows is where the material I found online misled me the worst.

Everything I found online says saveAll runs a select before every insert, yet my logs never showed any select SQL. For a long time I blamed a version difference, until I happened to assign the primary key manually and suddenly select statements appeared before the inserts.

First, how to enable Hibernate's SQL output in the yml:

spring:
  jpa:
    properties:
      hibernate: 
        generate_statistics: true
        show_sql: true

I let Hibernate generate a uuid primary key:

@Id
@GenericGenerator(name = "system-uuid", strategy = "uuid")
@GeneratedValue(generator = "system-uuid")
private String xxxid;

The posts I found all claim that SimpleJpaRepository's saveAll runs a select for every record, and that you have to write your own method to make new records skip the select step.

// SimpleJpaRepository
/*
 * (non-Javadoc)
 * @see org.springframework.data.repository.CrudRepository#save(java.lang.Object)
 */
@Transactional
public <S extends T> S save(S entity) {

    if (entityInformation.isNew(entity)) {
        em.persist(entity);
        return entity;
    } else {
        return em.merge(entity);
    }
}

What actually happens is this: once the primary key has already been assigned, isNew returns false, save takes the else branch, and em.merge issues a select to check whether the record already exists. That is why even brand-new records end up triggering a query.
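
A minimal sketch of the two paths (assuming the XXXEntity from earlier with a setter for xxxid, plus a hypothetical xxxRepository):

// Path 1: id left null -> isNew() is true -> em.persist(), no SELECT
XXXEntity fresh = new XXXEntity();

// Path 2: id pre-assigned -> isNew() is false -> em.merge(),
// which first issues a SELECT to check whether the row exists
XXXEntity preAssigned = new XXXEntity();
preAssigned.setXxxid(UUID.randomUUID().toString());

xxxRepository.saveAll(Arrays.asList(fresh, preAssigned));
// -> one INSERT for "fresh"; one SELECT plus one INSERT for "preAssigned"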

Below is a test I ran with a small amount of data.
10 records with the primary key id assigned:

5309100 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    2280400 nanoseconds spent preparing 20 JDBC statements;
    60949500 nanoseconds spent executing 10 JDBC statements; // <-- here
    38818200 nanoseconds spent executing 10 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    54961800 nanoseconds spent executing 1 flushes (flushing a total of 10 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)

10 records without assigning the primary key id:

4243600 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    653700 nanoseconds spent preparing 10 JDBC statements;
    0 nanoseconds spent executing 0 JDBC statements; // <-- here
    37879700 nanoseconds spent executing 10 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    48910200 nanoseconds spent executing 1 flushes (flushing a total of 10 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)

The extra "60949500 nanoseconds spent executing 10 JDBC statements" is exactly those additional select statements.

So: either call entityManager.persist(entity) for each record, or clear the primary key first. Either way, new data no longer triggers the redundant select.
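
If the incoming entities arrive with keys already set and you still want saveAll, one option (a sketch; the uuid generator will simply assign fresh ids on persist) is to null the ids first:

// Strip pre-assigned ids so isNew() returns true and saveAll()
// takes the persist branch, issuing no SELECT per row
for (XXXEntity e : result) {
    e.setXxxid(null); // Hibernate's uuid generator assigns a new id on persist
}
xxxRepository.saveAll(result);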