1. Data

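The earlier posts on flushing to disk covered the logs; today we look at how the data itself reaches disk. It is more tedious, but the principle is much the same. From the earlier analysis we already know that in MySQL every kind of data first goes into a cache and is only then persisted to disk. And what matters most in a database? The data, of course. Whatever the mechanism (2PC, caches, threads, and so on), the end goal is to keep the data safe and usable. Put plainly: serve every kind of SQL statement, support backup and recovery, and support database migration. As Jack Ma put it, the future is the DT (data technology) era, where data is king, which is why China's database industry has been booming in recent years.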

2. Writing the data

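The data being written here is the set of data pages cached in the buffer pool, and they are written only after the corresponding log has been written: the redo log always reaches disk before the page it describes (write-ahead logging). So what keeps the data safe if something unexpected, such as a power failure, happens before the pages are written? The redo log discussed earlier. A transaction commit triggers a redo log flush (with the default innodb_flush_log_at_trx_commit=1), and if the binlog is enabled, the binlog is flushed as well.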
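The ordering rule above can be illustrated with a toy model. This is only a sketch: ToyLog, ToyPage and flush_page are made-up names, not InnoDB types, and LSNs are plain counters.

```cpp
#include <cassert>
#include <cstdint>

using Lsn = std::uint64_t;

// Toy redo log: tracks how far the log has been written and flushed.
struct ToyLog {
  Lsn written_lsn = 0;  // end of the log in memory
  Lsn flushed_lsn = 0;  // part of the log that is durable on disk
  int flush_calls = 0;  // how many times a flush was actually needed

  void append(Lsn bytes) { written_lsn += bytes; }

  void flush_up_to(Lsn lsn) {
    assert(lsn <= written_lsn);  // cannot flush what was never written
    if (lsn > flushed_lsn) {
      flushed_lsn = lsn;
      ++flush_calls;
    }
  }
};

// Toy page: remembers the LSN of its most recent modification.
struct ToyPage {
  Lsn newest_lsn = 0;
  bool on_disk = false;
};

// Write-ahead rule: make the log durable up to the page's newest LSN
// before the page itself is allowed to go to disk.
inline void flush_page(ToyLog &log, ToyPage &page) {
  if (log.flushed_lsn < page.newest_lsn) {
    log.flush_up_to(page.newest_lsn);
  }
  assert(log.flushed_lsn >= page.newest_lsn);
  page.on_disk = true;
}
```

Flushing the same page a second time does not touch the log again, because the check runs before the flush call. The real code later in this post does the same comparison against flushed_to_disk_lsn.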
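Once the transaction is done (that is, both logs have been flushed successfully), the dirty pages in the Buffer Pool still do not go straight to the data files. MySQL has a doublewrite mechanism that first writes them to the doublewrite buffer on disk. Historically this area lived in the shared (system) tablespace; from MySQL 8.0.20 on it is stored in separate doublewrite files. At this point data is already reaching disk, but it is not yet the real write to the data files. The purpose is again safety: to survive a dirty page that is only partially written (a torn page). Together with the redo log, this makes the data safely recoverable. Redo records are applied on top of an intact page, so they cannot repair a half-written one; the doublewrite copy supplies the intact page.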
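To make the torn-page idea concrete, here is a minimal sketch under heavy assumptions: a "page" is 8 bytes plus a byte-sum checksum, and the "disk" is two in-memory slots. None of these names come from InnoDB.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

constexpr std::size_t kPayload = 8;

struct Page {
  std::array<std::uint8_t, kPayload> data{};
  std::uint32_t checksum = 0;
};

// Hypothetical checksum: a plain byte sum, chosen only to keep the toy short.
inline std::uint32_t page_checksum(const Page &p) {
  return std::accumulate(p.data.begin(), p.data.end(), std::uint32_t{0});
}

inline bool page_is_intact(const Page &p) {
  return p.checksum == page_checksum(p);
}

struct ToyDisk {
  Page doublewrite_slot;  // stands in for the doublewrite area
  Page data_file_page;    // stands in for the page's home in the data file
};

// Step 1: the full page goes to the doublewrite area first.
inline void write_doublewrite(ToyDisk &disk, const Page &p) {
  disk.doublewrite_slot = p;
}

// Step 2: the page goes to the data file. Passing bytes_written < kPayload
// simulates a crash mid-write: the checksum and the first bytes_written
// payload bytes are new, the rest of the payload is still old.
inline void write_data_file(ToyDisk &disk, const Page &p,
                            std::size_t bytes_written) {
  for (std::size_t i = 0; i < bytes_written && i < kPayload; ++i) {
    disk.data_file_page.data[i] = p.data[i];
  }
  disk.data_file_page.checksum = p.checksum;
}

// Recovery: a torn data-file page is replaced by the doublewrite copy,
// provided that copy is itself intact. Returns false if nothing can be done.
inline bool recover(ToyDisk &disk) {
  if (page_is_intact(disk.data_file_page)) return true;
  if (!page_is_intact(disk.doublewrite_slot)) return false;
  disk.data_file_page = disk.doublewrite_slot;
  return true;
}
```

If the crash hits during step 1 instead, the data-file page is still the old intact version and the redo log can be applied to it, which is why the two writes are never done in the other order.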
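In addition, table indexes fall into two broad classes: clustered and non-clustered (secondary). The clustered index is the easier case, because its inserts are usually laid out sequentially. Secondary indexes are not: their changes land on scattered pages, and updating those pages on every change would be very inefficient. Anyone with a little experience will have guessed the fix by now: cache the changes and write them together later. MySQL calls this the Insert Buffer (later generalized into the Change Buffer).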
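The "cache it, then write it in one go" idea can be sketched as follows. This is a toy, not the InnoDB Insert Buffer: ToyInsertBuffer, PageNo and Key are invented names, and a secondary-index page is just a vector of keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using PageNo = std::uint32_t;
using Key = std::int64_t;

class ToyInsertBuffer {
 public:
  // Remember an insert for a secondary-index page instead of touching
  // the page right away.
  void buffer_insert(PageNo page, Key key) { pending_[page].push_back(key); }

  // Apply everything that was buffered, one target page at a time.
  // Returns the number of pages that had to be written.
  std::size_t merge(std::map<PageNo, std::vector<Key>> &index_pages) {
    std::size_t page_writes = 0;
    for (const auto &entry : pending_) {
      std::vector<Key> &dest = index_pages[entry.first];
      dest.insert(dest.end(), entry.second.begin(), entry.second.end());
      ++page_writes;
    }
    pending_.clear();
    return page_writes;
  }

  std::size_t pending_pages() const { return pending_.size(); }

 private:
  std::map<PageNo, std::vector<Key>> pending_;
};
```

Five buffered inserts that hit only two distinct pages cost two page writes at merge time instead of five, which is the whole point of the buffer.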

3. Source code analysis

Let's start with the source that exports the InnoDB server status:

/** Function to pass InnoDB status variables to MySQL */
void srv_export_innodb_status(void) {
  buf_pool_stat_t stat;
  buf_pools_list_size_t buf_pools_list_size;
  ulint LRU_len;
  ulint free_len;
  ulint flush_list_len;

  buf_get_total_stat(&stat);
  buf_get_total_list_len(&LRU_len, &free_len, &flush_list_len);
  buf_get_total_list_size_in_bytes(&buf_pools_list_size);

  mutex_enter(&srv_innodb_monitor_mutex);

  export_vars.innodb_data_pending_reads = os_n_pending_reads;

  export_vars.innodb_data_pending_writes = os_n_pending_writes;

  export_vars.innodb_data_pending_fsyncs =
      fil_n_pending_log_flushes + fil_n_pending_tablespace_flushes;

  export_vars.innodb_data_fsyncs = os_n_fsyncs;

  export_vars.innodb_data_read = srv_stats.data_read;

  export_vars.innodb_data_reads = os_n_file_reads;

  export_vars.innodb_data_writes = os_n_file_writes;

  export_vars.innodb_data_written = srv_stats.data_written;

  export_vars.innodb_buffer_pool_read_requests =
      Counter::total(stat.m_n_page_gets);

  export_vars.innodb_buffer_pool_write_requests =
      srv_stats.buf_pool_write_requests;

  export_vars.innodb_buffer_pool_wait_free = srv_stats.buf_pool_wait_free;

  export_vars.innodb_buffer_pool_pages_flushed = srv_stats.buf_pool_flushed;

  export_vars.innodb_buffer_pool_reads = srv_stats.buf_pool_reads;

  export_vars.innodb_buffer_pool_read_ahead_rnd = stat.n_ra_pages_read_rnd;

  export_vars.innodb_buffer_pool_read_ahead = stat.n_ra_pages_read;

  export_vars.innodb_buffer_pool_read_ahead_evicted = stat.n_ra_pages_evicted;

  export_vars.innodb_buffer_pool_pages_data = LRU_len;

  export_vars.innodb_buffer_pool_bytes_data =
      buf_pools_list_size.LRU_bytes + buf_pools_list_size.unzip_LRU_bytes;

  export_vars.innodb_buffer_pool_pages_dirty = flush_list_len;

  export_vars.innodb_buffer_pool_bytes_dirty =
      buf_pools_list_size.flush_list_bytes;

  export_vars.innodb_buffer_pool_pages_free = free_len;

#ifdef UNIV_DEBUG
  export_vars.innodb_buffer_pool_pages_latched = buf_get_latched_pages_number();
#endif /* UNIV_DEBUG */
  export_vars.innodb_buffer_pool_pages_total = buf_pool_get_n_pages();

  export_vars.innodb_buffer_pool_pages_misc =
      buf_pool_get_n_pages() - LRU_len - free_len;

  export_vars.innodb_page_size = UNIV_PAGE_SIZE;

  export_vars.innodb_log_waits = srv_stats.log_waits;

  export_vars.innodb_os_log_written = srv_stats.os_log_written;

  export_vars.innodb_os_log_fsyncs = fil_n_log_flushes;

  export_vars.innodb_os_log_pending_fsyncs = fil_n_pending_log_flushes;

  export_vars.innodb_os_log_pending_writes = srv_stats.os_log_pending_writes;

  export_vars.innodb_log_write_requests = srv_stats.log_write_requests;

  export_vars.innodb_log_writes = srv_stats.log_writes;
  // Pay particular attention to the next two variables.
  // Number of pages written through doublewrite
  export_vars.innodb_dblwr_pages_written = srv_stats.dblwr_pages_written;
  // Number of doublewrite write operations
  export_vars.innodb_dblwr_writes = srv_stats.dblwr_writes;

  export_vars.innodb_pages_created = stat.n_pages_created;

  export_vars.innodb_pages_read = stat.n_pages_read;

  export_vars.innodb_pages_written = stat.n_pages_written;

  export_vars.innodb_redo_log_enabled = srv_redo_log;

  export_vars.innodb_row_lock_waits = srv_stats.n_lock_wait_count;

  export_vars.innodb_row_lock_current_waits =
      srv_stats.n_lock_wait_current_count;

  export_vars.innodb_row_lock_time = srv_stats.n_lock_wait_time / 1000;

  ......

  mutex_exit(&srv_innodb_monitor_mutex);
}
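The two doublewrite counters are exposed as the status variables Innodb_dblwr_pages_written and Innodb_dblwr_writes. A common way to read them is as a ratio: how many pages each doublewrite operation carried on average. A small sketch, where dblwr_pages_per_write is a hypothetical helper and not a MySQL function:

```cpp
#include <cassert>
#include <cstdint>

// Average number of pages per doublewrite operation.
// Returns 0.0 when no doublewrite has happened yet, to avoid dividing by zero.
inline double dblwr_pages_per_write(std::uint64_t pages_written,
                                    std::uint64_t writes) {
  if (writes == 0) {
    return 0.0;
  }
  return static_cast<double>(pages_written) / static_cast<double>(writes);
}
```

For example, 640 pages over 10 writes gives 64 pages per write. A value close to 1 suggests the pages are mostly going out one at a time rather than in batches.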

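Next, the doublewrite flow itself. Several functions can start a write, namely buf_flush_page_try, buf_flush_try_neighbors and buf_flush_single_page_from_LRU (some declarations also carry the MY_ATTRIBUTE annotation macro, which is not itself a function). They all call the same function: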

ibool buf_flush_page(buf_pool_t *buf_pool, buf_page_t *bpage,
                     buf_flush_t flush_type, bool sync) {
  BPageMutex *block_mutex;

  ut_ad(flush_type < BUF_FLUSH_N_TYPES);
  /* Hold the LRU list mutex iff called for a single page LRU
  flush. A single page LRU flush is already non-performant, and holding
  the LRU list mutex allows us to avoid having to store the previous LRU
  list page or to restart the LRU scan in
  buf_flush_single_page_from_LRU(). */
  ut_ad(flush_type == BUF_FLUSH_SINGLE_PAGE ||
        !mutex_own(&buf_pool->LRU_list_mutex));
  ut_ad(flush_type != BUF_FLUSH_SINGLE_PAGE ||
        mutex_own(&buf_pool->LRU_list_mutex));
  ut_ad(buf_page_in_file(bpage));
  ut_ad(!sync || flush_type == BUF_FLUSH_SINGLE_PAGE);

  block_mutex = buf_page_get_mutex(bpage);
  ut_ad(mutex_own(block_mutex));

  ut_ad(buf_flush_ready_for_flush(bpage, flush_type));

  bool is_uncompressed;

  is_uncompressed = (buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
  ut_ad(is_uncompressed == (block_mutex != &buf_pool->zip_mutex));

  ibool flush;
  rw_lock_t *rw_lock = nullptr;
  bool no_fix_count = bpage->buf_fix_count == 0;

  if (!is_uncompressed) {
    flush = TRUE;
    rw_lock = nullptr;
  } else if (!(no_fix_count || flush_type == BUF_FLUSH_LIST) ||
             (!no_fix_count &&
              srv_shutdown_state.load() < SRV_SHUTDOWN_FLUSH_PHASE &&
              fsp_is_system_temporary(bpage->id.space()))) {
    /* This is a heuristic, to avoid expensive SX attempts. */
    /* For table residing in temporary tablespace sync is done
    using IO_FIX and so before scheduling for flush ensure that
    page is not fixed. */
    flush = FALSE;
  } else {
    rw_lock = &reinterpret_cast<buf_block_t *>(bpage)->lock;
    if (flush_type != BUF_FLUSH_LIST) {
      flush = rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE);
    } else {
      /* Will SX lock later */
      flush = TRUE;
    }
  }

  if (flush) {
    /* We are committed to flushing by the time we get here */

    mutex_enter(&buf_pool->flush_state_mutex);

    buf_page_set_io_fix(bpage, BUF_IO_WRITE);

    buf_page_set_flush_type(bpage, flush_type);

    if (buf_pool->n_flush[flush_type] == 0) {
      os_event_reset(buf_pool->no_flush[flush_type]);
    }

    ++buf_pool->n_flush[flush_type];

    if (bpage->get_oldest_lsn() > buf_pool->max_lsn_io) {
      buf_pool->max_lsn_io = bpage->get_oldest_lsn();
    }

    if (!fsp_is_system_temporary(bpage->id.space()) &&
        buf_pool->track_page_lsn != LSN_MAX) {
      auto frame = bpage->zip.data;

      if (frame == nullptr) {
        frame = ((buf_block_t *)bpage)->frame;
      }
      lsn_t frame_lsn = mach_read_from_8(frame + FIL_PAGE_LSN);

      arch_page_sys->track_page(bpage, buf_pool->track_page_lsn, frame_lsn,
                                false);
    }

    mutex_exit(&buf_pool->flush_state_mutex);

    mutex_exit(block_mutex);

    if (flush_type == BUF_FLUSH_SINGLE_PAGE) {
      mutex_exit(&buf_pool->LRU_list_mutex);
    }

    if (flush_type == BUF_FLUSH_LIST && is_uncompressed &&
        !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {
      if (!fsp_is_system_temporary(bpage->id.space()) && dblwr::enabled) {
        dblwr::force_flush(flush_type, buf_pool_index(buf_pool));
      } else {
        buf_flush_sync_datafiles();
      }

      rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);
    }

    /* If there is an observer that wants to know if the
    asynchronous flushing was sent then notify it.
    Note: we set flush observer to a page with x-latch, so we can
    guarantee that notify_flush and notify_remove are called in pair
    with s-latch on a uncompressed page. */
    if (bpage->get_flush_observer() != nullptr) {
      bpage->get_flush_observer()->notify_flush(buf_pool, bpage);
    }

    /* Even though bpage is not protected by any mutex at this
    point, it is safe to access bpage, because it is io_fixed and
    oldest_modification != 0.  Thus, it cannot be relocated in the
    buffer pool or removed from flush_list or LRU_list. */

    buf_flush_write_block_low(bpage, flush_type, sync);
  }

  return (flush);
}
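The if / else-if / else chain that computes flush is easy to misread, so here it is restated as a pure function. This is a paraphrase for reading purposes only: PageState, FlushKind, FlushDecision and decide_flush are invented names, and the booleans stand in for the real checks (buf_fix_count == 0, the shutdown-phase comparison, fsp_is_system_temporary).

```cpp
#include <cassert>

enum class FlushKind { LRU, List, SinglePage };

enum class FlushDecision {
  FlushWithoutLatch,   // compressed-only page: no block latch involved
  Skip,                // flush = FALSE
  TrySxLatchNow,       // result depends on a non-blocking SX latch attempt
  FlushAndLatchLater   // flush list: commit now, take the SX latch later
};

struct PageState {
  bool is_uncompressed;
  bool no_fix_count;                 // nobody has the page buffer-fixed
  bool before_shutdown_flush_phase;  // server is not yet in the flush phase
  bool in_system_temporary;          // page belongs to the temp tablespace
};

inline FlushDecision decide_flush(const PageState &p, FlushKind kind) {
  if (!p.is_uncompressed) {
    return FlushDecision::FlushWithoutLatch;
  }
  // A fixed page is skipped unless this is a flush-list flush.
  const bool fixed_and_not_list =
      !(p.no_fix_count || kind == FlushKind::List);
  // A fixed temporary-tablespace page is skipped before the shutdown flush.
  const bool fixed_temp_page = !p.no_fix_count &&
                               p.before_shutdown_flush_phase &&
                               p.in_system_temporary;
  if (fixed_and_not_list || fixed_temp_page) {
    return FlushDecision::Skip;
  }
  if (kind != FlushKind::List) {
    return FlushDecision::TrySxLatchNow;
  }
  return FlushDecision::FlushAndLatchLater;
}
```

Read this way, the flush-list case stands out: it is the only one that commits to flushing a fixed page and defers the latch, which is why the original function has the later rw_lock_sx_lock_nowait / force_flush block.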

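Several functions take part in the flush here: dblwr::force_flush, buf_flush_sync_datafiles and buf_flush_write_block_low. In addition, if an observer is waiting for this flush, bpage->get_flush_observer()->notify_flush(buf_pool, bpage) is called to notify it.
Here is the related source, in part: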

void force_flush(buf_flush_t flush_type) noexcept {
  for (;;) {
    mutex_enter(&m_mutex);
    if (!m_buf_pages.empty() && !flush_to_disk(flush_type)) {
      ut_ad(!mutex_own(&m_mutex));
      continue;
    }
    break;
  }
  mutex_exit(&m_mutex);
}
bool flush_to_disk(buf_flush_t flush_type) noexcept {
  ut_ad(mutex_own(&m_mutex));

  /* Wait for any batch writes that are in progress. */
  if (wait_for_pending_batch()) {
    ut_ad(!mutex_own(&m_mutex));
    return false;
  }

  MONITOR_INC(MONITOR_DBLWR_FLUSH_REQUESTS);

  /* Write the pages to disk and free up the buffer. */
  write_pages(flush_type);

  ut_a(m_buffer.empty());
  ut_a(m_buf_pages.empty());

  return true;
}

void Double_write::write_pages(buf_flush_t flush_type) noexcept {
  ut_ad(mutex_own(&m_mutex));
  ut_a(!m_buffer.empty());

  Batch_segment *batch_segment{};

  auto segments = flush_type == BUF_FLUSH_LRU ? s_LRU_batch_segments
                                              : s_flush_list_batch_segments;

  while (!segments->dequeue(batch_segment)) {
    std::this_thread::yield();
  }

  batch_segment->start(this);

  // Write the buffered pages to the file
  batch_segment->write(m_buffer);

  m_buffer.clear();

#ifndef _WIN32
  if (is_fsync_required()) {
    batch_segment->flush();
  }
#endif /* !_WIN32 */

  batch_segment->set_batch_size(m_buf_pages.size());

  for (uint32_t i = 0; i < m_buf_pages.size(); ++i) {
    const auto bpage = std::get<0>(m_buf_pages.m_pages[i]);

    ut_d(auto page_id = bpage->id);

    bpage->set_dblwr_batch_id(batch_segment->id());

    ut_d(bpage->take_io_responsibility());
    auto err =
        write_to_datafile(bpage, false, std::get<1>(m_buf_pages.m_pages[i]),
                          std::get<2>(m_buf_pages.m_pages[i]));

    if (err == DB_PAGE_IS_STALE || err == DB_TABLESPACE_DELETED) {
      write_complete(bpage, flush_type);
      buf_page_free_stale_during_write(
          bpage, buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);

      const file::Block *block = std::get<1>(m_buf_pages.m_pages[i]);
      if (block != nullptr) {
        os_free_block(const_cast<file::Block *>(block));
      }
    } else {
      ut_a(err == DB_SUCCESS);
    }
    /* We don't hold io_responsibility here no matter which path through ifs and
    elses we've got here, but we can't assert:
      ut_ad(!bpage->current_thread_has_io_responsibility());
    because bpage could be freed by the time we got here. */

#ifdef UNIV_DEBUG
    if (dblwr::Force_crash == page_id) {
      DBUG_SUICIDE();
    }
#endif /* UNIV_DEBUG */
  }

  srv_stats.dblwr_writes.inc();

  m_buf_pages.clear();

  os_aio_simulated_wake_handler_threads();
}

dberr_t os_file_write_retry(IORequest &type, const char *name,
                            pfs_os_file_t file, const void *buf,
                            os_offset_t offset, ulint n) {
  dberr_t err;
  for (;;) {
    err = os_file_write(type, name, file, buf, offset, n);

    if (err == DB_SUCCESS || err == DB_TABLESPACE_DELETED) {
      break;
    } else if (err == DB_IO_ERROR) {
      ib::error(ER_INNODB_IO_WRITE_ERROR_RETRYING, name);
      std::chrono::seconds ten(10);
      std::this_thread::sleep_for(ten);
      continue;
    } else {
      ib::fatal(ER_INNODB_IO_WRITE_FAILED, name);
    }
  }
  return err;
}
dberr_t os_file_write_func(IORequest &type, const char *name, os_file_t file,
                           const void *buf, os_offset_t offset, ulint n) {
  ut_ad(type.validate());
  ut_ad(type.is_write());

  /* We never compress the first page.
  Note: This assumes we always do block IO. */
  if (offset == 0) {
    type.clear_compressed();
  }

  const byte *ptr = reinterpret_cast<const byte *>(buf);

  return os_file_write_page(type, name, file, ptr, offset, n,
                            type.get_encrypted_block());
}

Two macros are worth noting here:

#define os_file_write(type, name, file, buf, offset, n) \
  os_file_write_pfs(type, name, file, buf, offset, n)
#define os_file_write_pfs(type, name, file, buf, offset, n) \
  os_file_write_func(type, name, file, buf, offset, n)

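So os_file_write ends up calling os_file_write_func, which in turn calls os_file_write_page to write to disk. Other details, such as the tablespace handling described earlier, can be followed through the other write functions.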
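Why two macros instead of a direct call? As far as I can tell, the middle layer exists so that a build with Performance Schema I/O instrumentation can route the call through a wrapper, while other builds go straight to the plain function. That is an assumption about the real headers, and the sketch below uses made-up names (toy_write, toy_write_pfs, TOY_INSTRUMENT_IO) purely to show the layering.

```cpp
#include <cassert>
#include <cstddef>

struct ToyStats {
  int instrumented_calls;
  std::size_t bytes_written;
};

// Single shared counter object for the toy.
inline ToyStats &toy_stats() {
  static ToyStats stats{0, 0};
  return stats;
}

// The plain write: here it only counts bytes.
inline int toy_write_func(const void *, std::size_t n) {
  toy_stats().bytes_written += n;
  return 0;
}

// The instrumented variant: records the call, then does the plain write.
inline int toy_write_instrumented(const void *buf, std::size_t n) {
  ++toy_stats().instrumented_calls;
  return toy_write_func(buf, n);
}

// Middle layer: picked at compile time.
#ifdef TOY_INSTRUMENT_IO
#define toy_write_pfs(buf, n) toy_write_instrumented(buf, n)
#else
#define toy_write_pfs(buf, n) toy_write_func(buf, n)
#endif

// Name used by callers; it never changes.
#define toy_write(buf, n) toy_write_pfs(buf, n)
```

Callers only ever spell toy_write; whether the call is instrumented is decided in one place.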
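Of course, in some cases a page can be flushed directly without doublewrite. One is when the option is turned off; the other is operations that simply do not need it, such as DROP TABLE.
Next, the write of the data itself: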

// write_pages above calls write_to_datafile
/** Writes a page that has already been written to the
doublewrite buffer to the data file. It is the job of the
caller to sync the datafile.
@param[in]  in_bpage          Page to write.
@param[in]  sync              true if it's a synchronous write.
@param[in]  e_block           block containing encrypted data frame.
@param[in]  e_len             encrypted data length.
@return DB_SUCCESS or error code */
static dberr_t write_to_datafile(const buf_page_t *in_bpage, bool sync,
    const file::Block* e_block, uint32_t e_len)
    noexcept MY_ATTRIBUTE((warn_unused_result));
    dberr_t Double_write::write_to_datafile(const buf_page_t *in_bpage, bool sync,
                                            const file::Block *e_block,
                                            uint32_t e_len) noexcept {
      ut_ad(buf_page_in_file(in_bpage));
      ut_ad(in_bpage->current_thread_has_io_responsibility());
      ut_ad(in_bpage->is_io_fix_write());
      uint32_t len;
      void *frame{};

      if (e_block == nullptr) {
        Double_write::prepare(in_bpage, &frame, &len);
      } else {
        frame = os_block_get_frame(e_block);
        len = e_len;
      }

      /* Our IO API is common for both reads and writes and is
      therefore geared towards a non-const parameter. */
      auto bpage = const_cast<buf_page_t *>(in_bpage);

      uint32_t type = IORequest::WRITE;

      if (sync) {
        type |= IORequest::DO_NOT_WAKE;
      }

      IORequest io_request(type);
      io_request.set_encrypted_block(e_block);

    #ifdef UNIV_DEBUG
      {
        byte *page = static_cast<byte *>(frame);
        ut_ad(mach_read_from_4(page + FIL_PAGE_OFFSET) == bpage->page_no());
        ut_ad(mach_read_from_4(page + FIL_PAGE_SPACE_ID) == bpage->space());
      }
    #endif /* UNIV_DEBUG */

      auto err =
          fil_io(io_request, sync, bpage->id, bpage->size, 0, len, frame, bpage);

      /* When a tablespace is deleted with BUF_REMOVE_NONE, fil_io() might
      return DB_PAGE_IS_STALE or DB_TABLESPACE_DELETED. */
      ut_a(err == DB_SUCCESS || err == DB_TABLESPACE_DELETED ||
           err == DB_PAGE_IS_STALE);

      return err;
    }
    dberr_t fil_io(const IORequest &type, bool sync, const page_id_t &page_id,
                   const page_size_t &page_size, ulint byte_offset, ulint len,
                   void *buf, void *message) {
      auto shard = fil_system->shard_by_id(page_id.space());
    #ifdef UNIV_DEBUG
      if (!sync) {
        /* In case of async io we transfer the io responsibility to the thread which
        will perform the io completion routine. */
        static_cast<buf_page_t *>(message)->release_io_responsibility();
      }
    #endif

      auto const err = shard->do_io(type, sync, page_id, page_size, byte_offset,
                                    len, buf, message);
    #ifdef UNIV_DEBUG
      /* If the error prevented async io, then we haven't actually transfered the
      io responsibility at all, so we revert the debug io responsibility info. */
      if (err != DB_SUCCESS && !sync) {
        static_cast<buf_page_t *>(message)->take_io_responsibility();
      }
    #endif
      return err;
    }

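Finally, the function called when doublewrite is not in play. In buf_flush_page above, buf_flush_sync_datafiles is the branch taken when doublewrite is disabled or the page belongs to the system temporary tablespace.

Running without doublewrite means the data is written straight to disk. It is usually considered in a few situations:
1. The DML data volume is huge
2. The workload is not sensitive to fine-grained (page-level) corruption
3. The write load is too heavy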

/** Flush a batch of writes to the datafiles that have already been
written to the dblwr buffer on disk. */
static void buf_flush_sync_datafiles() {
  /* Wake possible simulated AIO thread to actually post the
  writes to the operating system */
  os_aio_simulated_wake_handler_threads();

  /* Wait that all async writes to tablespaces have been posted to
  the OS */
  os_aio_wait_until_no_pending_writes();

  /* Now we flush the data to disk (for example, with fsync) */
  fil_flush_file_spaces(FIL_TYPE_TABLESPACE);
}
/** Flush to disk the writes in file spaces of the given type
possibly cached by the OS.
@param[in]	purpose		FIL_TYPE_TABLESPACE or FIL_TYPE_LOG, can be
ORred. */
void fil_flush_file_spaces(uint8_t purpose) {
  fil_system->flush_file_spaces(purpose);
}

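The code above is actually very simple: it wakes the I/O handler threads, waits until no writes are pending, and then flushes the files to disk. Finally, the asynchronous write path: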

/** Does an asynchronous write of a buffer page.
@param[in]	bpage		buffer block to write
@param[in]	flush_type	type of flush
@param[in]	sync		true if sync IO request */
static void buf_flush_write_block_low(buf_page_t *bpage, buf_flush_t flush_type,
                                      bool sync) {
  page_t *frame = nullptr;

#ifdef UNIV_DEBUG
  buf_pool_t *buf_pool = buf_pool_from_bpage(bpage);
  ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
#endif /* UNIV_DEBUG */

  DBUG_PRINT("ib_buf", ("flush %s %u page " UINT32PF ":" UINT32PF,
                        sync ? "sync" : "async", (unsigned)flush_type,
                        bpage->id.space(), bpage->id.page_no()));

  ut_ad(buf_page_in_file(bpage));

  /* We are not holding block_mutex here. Nevertheless, it is safe to
  access bpage, because it is io_fixed and oldest_modification != 0.
  Thus, it cannot be relocated in the buffer pool or removed from
  flush_list or LRU_list. */
  ut_ad(!buf_flush_list_mutex_own(buf_pool));
  ut_ad(!buf_page_get_mutex(bpage)->is_owned());
  ut_ad(bpage->is_io_fix_write());
  ut_ad(bpage->is_dirty());

#ifdef UNIV_IBUF_COUNT_DEBUG
  ut_a(ibuf_count_get(bpage->id) == 0);
#endif /* UNIV_IBUF_COUNT_DEBUG */

  ut_ad(recv_recovery_is_on() || bpage->get_newest_lsn() != 0);

  /* Force the log to the disk before writing the modified block */
  if (!srv_read_only_mode) {
    const lsn_t flush_to_lsn = bpage->get_newest_lsn();

    /* Do the check before calling log_write_up_to() because in most
    cases it would allow to avoid call, and because of that we don't
    want those calls because they would have bad impact on the counter
    of calls, which is monitored to save CPU on spinning in log threads. */

    if (log_sys->flushed_to_disk_lsn.load() < flush_to_lsn) {
      Wait_stats wait_stats;

      wait_stats = log_write_up_to(*log_sys, flush_to_lsn, true);

      MONITOR_INC_WAIT_STATS_EX(MONITOR_ON_LOG_, _PAGE_WRITTEN, wait_stats);
    }
  }

  DBUG_EXECUTE_IF("log_first_rec_group_test", {
    recv_no_ibuf_operations = false;
    const lsn_t end_lsn = mtr_commit_mlog_test(*log_sys);
    log_write_up_to(*log_sys, end_lsn, true);
    DBUG_SUICIDE();
  });

  switch (buf_page_get_state(bpage)) {
    case BUF_BLOCK_POOL_WATCH:
    case BUF_BLOCK_ZIP_PAGE: /* The page should be dirty. */
    case BUF_BLOCK_NOT_USED:
    case BUF_BLOCK_READY_FOR_USE:
    case BUF_BLOCK_MEMORY:
    case BUF_BLOCK_REMOVE_HASH:
      ut_error;
      break;
    case BUF_BLOCK_ZIP_DIRTY: {
      frame = bpage->zip.data;
      BlockReporter reporter =
          BlockReporter(false, frame, bpage->size,
                        fsp_is_checksum_disabled(bpage->id.space()));

      mach_write_to_8(frame + FIL_PAGE_LSN, bpage->get_newest_lsn());

      ut_a(reporter.verify_zip_checksum());
      break;
    }
    case BUF_BLOCK_FILE_PAGE:
      frame = bpage->zip.data;
      if (!frame) {
        frame = ((buf_block_t *)bpage)->frame;
      }

      buf_flush_init_for_writing(
          reinterpret_cast<const buf_block_t *>(bpage),
          reinterpret_cast<const buf_block_t *>(bpage)->frame,
          bpage->zip.data ? &bpage->zip : nullptr, bpage->get_newest_lsn(),
          fsp_is_checksum_disabled(bpage->id.space()),
          false /* do not skip lsn check */);
      break;
  }

  dberr_t err = dblwr::write(flush_type, bpage, sync);

  ut_a(err == DB_SUCCESS || err == DB_TABLESPACE_DELETED);

  /* Increment the counter of I/O operations used
  for selecting LRU policy. */
  buf_LRU_stat_inc_io();
}
dberr_t dblwr::write(buf_flush_t flush_type, buf_page_t *bpage,
                     bool sync) noexcept {
  dberr_t err;
  const space_id_t space_id = bpage->id.space();

  ut_ad(bpage->current_thread_has_io_responsibility());
  /* This is not required for correctness, but it aborts the processing early.
   */
  if (bpage->was_stale()) {
    /* Disable batch completion in write_complete(). */
    bpage->set_dblwr_batch_id(std::numeric_limits<uint16_t>::max());
    buf_page_free_stale_during_write(
        bpage, buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
    /* We don't hold io_responsibility here no matter which path through ifs and
    elses we've got here, but we can't assert:
      ut_ad(!bpage->current_thread_has_io_responsibility());
    because bpage could be freed by the time we got here. */
    return DB_SUCCESS;
  }

  if (srv_read_only_mode || fsp_is_system_temporary(space_id) ||
      !dblwr::enabled || Double_write::s_instances == nullptr ||
      mtr_t::s_logging.dblwr_disabled()) {
    /* Skip the double-write buffer since it is not needed. Temporary
    tablespaces are never recovered, therefore we don't care about
    torn writes. */
    bpage->set_dblwr_batch_id(std::numeric_limits<uint16_t>::max());
    err = Double_write::write_to_datafile(bpage, sync, nullptr, 0);
    if (err == DB_PAGE_IS_STALE || err == DB_TABLESPACE_DELETED) {
      buf_page_free_stale_during_write(
          bpage, buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
      err = DB_SUCCESS;
    } else if (sync) {
      ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_SINGLE_PAGE);

      if (err == DB_SUCCESS) {
        fil_flush(space_id);
      }
      /* true means we want to evict this page from the LRU list as well. */
      buf_page_io_complete(bpage, true);
    }

  } else {
    ut_d(auto page_id = bpage->id);

    /* Encrypt the page here, so that the same encrypted contents are written
    to the dblwr file and the data file. */
    uint32_t e_len{};
    file::Block *e_block = dblwr::get_encrypted_frame(bpage, e_len);

    if (!sync && flush_type != BUF_FLUSH_SINGLE_PAGE) {
      MONITOR_INC(MONITOR_DBLWR_ASYNC_REQUESTS);

      ut_d(bpage->release_io_responsibility());
      Double_write::submit(flush_type, bpage, e_block, e_len);
      err = DB_SUCCESS;
#ifdef UNIV_DEBUG
      if (dblwr::Force_crash == page_id) {
        force_flush(flush_type, buf_pool_index(buf_pool_from_bpage(bpage)));
      }
#endif /* UNIV_DEBUG */
    } else {
      MONITOR_INC(MONITOR_DBLWR_SYNC_REQUESTS);
      /* Disable batch completion in write_complete(). */
      bpage->set_dblwr_batch_id(std::numeric_limits<uint16_t>::max());
      err = Double_write::sync_page_flush(bpage, e_block, e_len);
    }
  }
  /* We don't hold io_responsibility here no matter which path through ifs and
  elses we've got here, but we can't assert:
    ut_ad(!bpage->current_thread_has_io_responsibility());
  because bpage could be freed by the time we got here. */
  return err;
}

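write thus routes the page to one of three places: straight to write_to_datafile when doublewrite is skipped, to Double_write::submit for asynchronous batches, or to Double_write::sync_page_flush for synchronous writes. In the code shown, write calls force_flush directly only in the debug-only Force_crash branch. Batched pages are otherwise written out through the flush_to_disk and write_pages path seen earlier, so the flush logic stays consistent.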
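The branching in dblwr::write can be summarized as a small routing function. This is a reading aid with invented names (WriteRequest, ToyFlush, WriteRoute, route_write); each boolean stands in for one condition from the source above.

```cpp
#include <cassert>

enum class ToyFlush { LRU, List, SinglePage };

enum class WriteRoute {
  FreeStalePage,     // page was stale: free it, nothing is written
  DirectToDatafile,  // doublewrite skipped: write_to_datafile directly
  AsyncBatchSubmit,  // Double_write::submit
  SyncPageFlush      // Double_write::sync_page_flush
};

struct WriteRequest {
  bool page_was_stale;
  bool read_only_mode;
  bool system_temporary;
  bool dblwr_enabled;
  bool dblwr_instances_exist;
  bool mtr_dblwr_disabled;
  bool sync;
  ToyFlush flush;
};

inline WriteRoute route_write(const WriteRequest &r) {
  if (r.page_was_stale) {
    return WriteRoute::FreeStalePage;
  }
  // Any one of these conditions is enough to bypass doublewrite.
  const bool skip_dblwr = r.read_only_mode || r.system_temporary ||
                          !r.dblwr_enabled || !r.dblwr_instances_exist ||
                          r.mtr_dblwr_disabled;
  if (skip_dblwr) {
    return WriteRoute::DirectToDatafile;
  }
  // Only asynchronous, non-single-page flushes are batched.
  if (!r.sync && r.flush != ToyFlush::SinglePage) {
    return WriteRoute::AsyncBatchSubmit;
  }
  return WriteRoute::SyncPageFlush;
}
```

Note that a single-page flush never takes the batch route, even when sync is false.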

4. Summary

The code is messy and made my head spin more than once, but at least I can basically follow it now.

Keep at it, returning youth!