This article grew out of a bug description I wrote some time ago. It was originally intended as a bug report for MySQL, which is why it was written in English; I later decided to expand it into a full article, and that is how this post came about. The bug report I filed for the issue described here is at: https://bugs.mysql.com/?id=87389

This year I've been working on TDSQL XA and have spent a lot of time fixing MySQL XA bugs. One of them is that when a slave executes an XA_PREPARE_LOG_EVENT, it does not update mysql.slave_worker_info or mysql.slave_relay_log_info correctly, the way it does when executing an XID_LOG_EVENT. This caused a lot of inconsistency issues in our XA robustness tests, in which we frequently kill the mysqld processes of the master and/or slaves while the master-slave cluster is executing many parallel XA transaction branches.

When executing an XID_LOG_EVENT, the code in Xid_apply_log_event::do_apply_event_worker stores up-to-date replication position info (i.e. the master binlog name/pos and the slave relay log name/pos of the committing transaction) into the current slave worker's row in the mysql.slave_worker_info table; and if the event is executed by the coordinator thread, Xid_apply_log_event::do_apply_event performs similar work to store the position info into mysql.slave_relay_log_info. In both cases the update is done within the same user transaction that is being committed, which guarantees replication position consistency perfectly.
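
To make that guarantee concrete, here is a minimal standalone C++ model of the idea, not MySQL source code: the Store, Txn, and apply_xid_event names are invented purely for illustration. The point is simply that when the replicated user change and the position metadata are written inside one atomic transaction, no crash can leave them out of sync.

```cpp
// A minimal standalone model (not MySQL source) of why the position update
// must live in the same transaction as the applied user changes.
#include <cassert>
#include <iostream>
#include <map>
#include <string>

struct Store {                       // models durable state on the slave
  std::map<std::string, std::string> data;
};

struct Txn {                         // models one atomic storage transaction
  Store &store;
  std::map<std::string, std::string> pending;
  explicit Txn(Store &s) : store(s) {}
  void write(const std::string &k, const std::string &v) { pending[k] = v; }
  void commit() {                    // all-or-nothing: both writes become
    for (auto &kv : pending) store.data[kv.first] = kv.second;
    pending.clear();                 // durable together, or neither does
  }
};

// Models XID_LOG_EVENT handling for an ordinary transaction: the user row
// change and the worker position row are committed in ONE transaction.
void apply_xid_event(Store &s, const std::string &row, const std::string &pos) {
  Txn txn(s);
  txn.write("user_row", row);              // the replicated user change
  txn.write("slave_worker_info", pos);     // replication position metadata
  txn.commit();                            // atomic: positions always match data
}

int main() {
  Store slave;
  apply_xid_event(slave, "v1", "master-bin.000001:4096");
  // After any crash, either both keys reflect the transaction or neither does.
  assert(slave.data["user_row"] == "v1");
  assert(slave.data["slave_worker_info"] == "master-bin.000001:4096");
  std::cout << "positions consistent with applied data\n";
}
```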

The released code in these two functions cannot handle an XA transaction correctly (an XA_PREPARE_LOG_EVENT goes through the same functions above): if an XA_PREPARE_LOG_EVENT is executed by a slave worker, the mysql.slave_worker_info table is updated in Xid_apply_log_event::do_apply_event_worker only AFTER the XA transaction has been prepared, and in a separate transaction; and if an XA_PREPARE_LOG_EVENT is executed by the slave SQL thread, the mysql.slave_relay_log_info table is not updated in Xid_apply_log_event::do_apply_event at all.

So in the slave worker case, if the slave mysqld is killed right after an XA transaction gets prepared and before the mysql.slave_worker_info table is updated, the metadata in that table is no longer consistent with the slave's actual execution of the relay logs, and various problems follow. A similar issue can also happen to the mysql.slave_relay_log_info table, since it is updated in a separate transaction from the prepared XA transaction branch; the chance is very remote, though, because it is updated right after the XA_PREPARE_LOG_EVENT is executed, leaving only a very small window of inconsistency.
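
The following tiny standalone simulation, not MySQL code and with invented names, illustrates that crash window: the XA branch is durably prepared in one storage transaction, the position is persisted in a second one, and a kill in between leaves the recorded position behind the slave's real state.

```cpp
// A tiny standalone simulation of the crash window: the branch is prepared in
// one transaction, the position row is advanced in another, and a kill between
// the two leaves mysql.slave_worker_info pointing before the prepared branch.
#include <iostream>
#include <string>

struct SlaveState {
  bool xa_branch_prepared = false;                    // durable effect of XA PREPARE
  std::string worker_info_pos = "relay.000001:100";   // last persisted position
};

// 'crash_between' models "kill -9 mysqld" after the prepare transaction
// commits but before the metadata transaction runs.
SlaveState apply_xa_prepare(bool crash_between) {
  SlaveState s;
  s.xa_branch_prepared = true;              // transaction 1: prepare the branch
  if (crash_between) return s;              // process killed right here
  s.worker_info_pos = "relay.000001:500";   // transaction 2: advance position
  return s;
}

int main() {
  SlaveState after_crash = apply_xa_prepare(/*crash_between=*/true);
  // The branch is durably prepared, but the recorded position still claims the
  // XA_PREPARE_LOG_EVENT was never applied, so recovery works from stale info.
  std::cout << "prepared=" << after_crash.xa_branch_prepared
            << " position=" << after_crash.worker_info_pos << "\n";
}
```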

I tried to fix this small window of inconsistency by updating the mysql.slave_relay_log_info and mysql.slave_worker_info tables right *before* executing the XA_PREPARE_LOG_EVENT. That turned out to be wrong, because after the 'XA END' event is executed, no DML is allowed in the transaction anymore. One could probably tweak the code to skip this constraint internally, but I didn't try that, and I don't think breaking such constraints is a good idea. So I decided instead to do the update while executing the Query_log_event("XA END") event, which requires computing the length of the following XA_PREPARE_LOG_EVENT and adding it to the master_group_log_pos and group_relay_log_pos fields of the tables above (see the position arithmetic sketched below). This fix worked in a simple test, with all positions accurate, but serious issues arose quickly, as detailed below.
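
Here is a minimal sketch of that position arithmetic, assuming the end positions of the XA END event and the serialized length of the following XA_PREPARE_LOG_EVENT are available; the struct and function names are invented for illustration, and only master_group_log_pos / group_relay_log_pos mirror the actual table columns mentioned above.

```cpp
// A minimal sketch of the position adjustment: when persisting positions while
// applying the "XA END" query event, the stored group positions must already
// point past the XA_PREPARE_LOG_EVENT that follows it in the same group.
#include <cstdint>
#include <iostream>

struct GroupPositions {
  uint64_t master_group_log_pos;   // position in the master's binlog
  uint64_t group_relay_log_pos;    // position in the slave's relay log
};

// End positions of the XA END event plus the serialized length of the
// XA_PREPARE_LOG_EVENT that is known to follow it.
GroupPositions positions_after_xa_prepare(uint64_t xa_end_master_end_pos,
                                           uint64_t xa_end_relay_end_pos,
                                           uint64_t xa_prepare_event_len) {
  return {xa_end_master_end_pos + xa_prepare_event_len,
          xa_end_relay_end_pos + xa_prepare_event_len};
}

int main() {
  // Hypothetical numbers: XA END ends at 1200/3400, XA PREPARE event is 76 bytes.
  GroupPositions p = positions_after_xa_prepare(1200, 3400, 76);
  std::cout << p.master_group_log_pos << " " << p.group_relay_log_pos << "\n";
}
```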

I'm not going to list the details of how I made that fix, but the key requirement for making the two metadata tables completely crash safe is to update the replication metadata inside the very XA transaction that gets prepared when executing the XA_PREPARE_LOG_EVENT, just as we do for ordinary transaction commits. This very requirement, however, severely harms the throughput/concurrency (TPS) of slave replication, and replication can even fail to proceed; in our tests such problems really did occur. The cause is that when a row in mysql.slave_worker_info is updated while executing an XA_PREPARE_LOG_EVENT, the InnoDB row lock is not released until the 'XA COMMIT' for that prepared transaction is executed. In a high-throughput distributed database like TDSQL XA, the master's binlog can contain a LOT of transactions between the two parts of the same XA transaction's binlog, which means a slave worker thread cannot prepare the next XA transaction while the earlier XA transaction whose XA_PREPARE_LOG_EVENT it executed is still uncommitted: the worker thread simply blocks waiting for the InnoDB row lock on its row in mysql.slave_worker_info and keeps retrying, up to slave_transaction_retries times. The worst case is that all slave worker threads are occupied retrying XA_PREPARE_LOG_EVENT execution and failing every time, because their rows in mysql.slave_worker_info are still locked by the previous batch of prepared transactions, whose XA COMMIT events never even get a chance to be executed, and the slave's replication simply stops.
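
A back-of-the-envelope model of that stall, again standalone C++ with invented names and numbers rather than MySQL code: each worker's slave_worker_info row stays locked by its prepared XA branch, so the next batch of XA_PREPARE_LOG_EVENTs can only spin until slave_transaction_retries is exhausted.

```cpp
// A standalone model of the stall: if the worker's mysql.slave_worker_info row
// is updated inside the prepared XA transaction, the row stays locked until the
// matching XA COMMIT is applied, so the worker's next XA prepare only retries.
#include <iostream>
#include <vector>

struct Worker {
  bool row_locked_by_prepared_xa = false;  // row lock held until XA COMMIT
  int retries = 0;
};

int main() {
  const int slave_transaction_retries = 10;   // mirrors the server variable
  std::vector<Worker> workers(4);

  // Batch 1: every worker prepares an XA branch and, inside that prepared
  // transaction, updates (and therefore locks) its own slave_worker_info row.
  for (Worker &w : workers) w.row_locked_by_prepared_xa = true;

  // Batch 2: the next XA_PREPARE_LOG_EVENTs arrive, but every worker blocks on
  // its own locked row. The XA COMMITs for batch 1 are queued behind these
  // events, so no lock is ever released and no retry can ever succeed.
  bool progress = false;
  for (Worker &w : workers) {
    while (w.row_locked_by_prepared_xa && w.retries < slave_transaction_retries)
      ++w.retries;                            // lock wait + retry, never succeeds
    if (!w.row_locked_by_prepared_xa) progress = true;
  }

  std::cout << (progress ? "replication advanced\n"
                         : "all workers exhausted retries; replication stalled\n");
}
```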

To avoid this bottleneck, another approach is to insert new rows into mysql.slave_worker_info instead of updating the same row again and again. This results in many rows for the same channel and slave worker, some belonging to prepared XA transactions and the rest to committed ordinary or XA transactions. The challenge is then to find the correct row during crash recovery: if the latest row (using a version number to distinguish row versions) was generated by a prepared XA transaction, we cannot know whether it is usable, because we don't know whether that XA transaction will eventually commit or roll back. So this approach doesn't work either.
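
The recovery ambiguity can be seen in this small standalone sketch (invented names, not MySQL code): picking the newest versioned row fails exactly when that row was written by a branch that was still only prepared at crash time.

```cpp
// Why the "insert a new versioned row per transaction" idea fails at recovery:
// if the newest row was written by a transaction that was only prepared at
// crash time, its position may or may not become valid later.
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct PositionRow {
  int version;              // monotonically increasing row version
  std::string relay_pos;    // recorded relay log position
  bool from_prepared_xa;    // true if the writer was still XA-prepared at crash
};

// Returns the position recovery should trust, or nullopt if it is ambiguous.
std::optional<std::string> pick_recovery_position(const std::vector<PositionRow> &rows) {
  const PositionRow *latest = nullptr;
  for (const PositionRow &r : rows)
    if (!latest || r.version > latest->version) latest = &r;
  if (!latest) return std::nullopt;
  if (latest->from_prepared_xa) return std::nullopt;   // fate of the branch unknown
  return latest->relay_pos;
}

int main() {
  std::vector<PositionRow> rows = {
      {1, "relay.000001:100", false},
      {2, "relay.000001:500", true},   // written by a prepared, uncommitted branch
  };
  auto pos = pick_recovery_position(rows);
  std::cout << (pos ? *pos : std::string("ambiguous: cannot pick a safe position"))
            << "\n";
}
```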

So the conclusion is that the two metadata tables cannot be made completely crash safe when a slave executes an XA_PREPARE_LOG_EVENT the way they are when it executes the XID_LOG_EVENT of an ordinary transaction. They have to be updated in a separate transaction right after the XA_PREPARE_LOG_EVENT is executed, and if the slave crashes right in between, the slave's replication position info can still be inconsistent with its real position; so far this is unavoidable. The fix I made is to update the mysql.slave_worker_info table in Xid_apply_log_event::do_apply_event for XA transactions with the correct replication positions.