A guess about the cause of a crash case

Source: 这里教程网 | Time: 2026-03-01 18:29:08

In production, replicas sometimes crashed while we were resizing the InnoDB buffer pool in bulk:

2024-12-27T02:52:17.267661Z 0 [Note] InnoDB: Completed to resize buffer pool from 15032385536 to 30064771072.
2024-12-27T02:52:17.267691Z 0 [Note] InnoDB: Completed resizing buffer pool at 241227 10:52:17.
02:52:17 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona Server better by reporting any
bugs at https://bugs.percona.com/
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=178
max_threads=100008
thread_count=106
connection_count=102
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 38479399 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f94a4e004c0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f94b2f6b158 thread_stack 0x40000
/data/mysql_base/bin/mysqld(my_print_stacktrace+0x2c)[0x564d61d4f4ec]
/data/mysql_base/bin/mysqld(handle_fatal_signal+0x489)[0x564d61b84ef9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f98d03c9420]
/data/mysql_base/bin/mysqld(_ZNK14Relay_log_info22cannot_safely_rollbackEv+0xc5)[0x564d61d138d5]
/data/mysql_base/bin/mysqld(+0xcf97e8)[0x564d61cf97e8]
/data/mysql_base/bin/mysqld(_Z22mts_checkpoint_routineP14Relay_log_infoybb+0xd1)[0x564d61d0b271]
/data/mysql_base/bin/mysqld(_Z18slave_stop_workersP14Relay_log_infoPb+0x830)[0x564d61d0bfa0]
/data/mysql_base/bin/mysqld(handle_slave_sql+0x2ed)[0x564d61d0c30d]
/data/mysql_base/bin/mysqld(pfs_spawn_thread+0x1b4)[0x564d61d69c04]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f98d03bd609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f98cfba5353]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 10
Status: NOT_KILLED
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
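
To make the backtrace above easier to read, here is a minimal sketch that extracts the symbol of each frame and prints the frames in call order (outermost caller first). The frame lines are copied from the log above; the regular expression and the helper names `parse_frames` and `call_order` are my own and only assume the `binary(symbol+offset)[address]` layout seen in this log.

```python
import re

# Frames copied verbatim from the crash log above (innermost frame first).
BACKTRACE = """\
/data/mysql_base/bin/mysqld(_ZNK14Relay_log_info22cannot_safely_rollbackEv+0xc5)[0x564d61d138d5]
/data/mysql_base/bin/mysqld(+0xcf97e8)[0x564d61cf97e8]
/data/mysql_base/bin/mysqld(_Z22mts_checkpoint_routineP14Relay_log_infoybb+0xd1)[0x564d61d0b271]
/data/mysql_base/bin/mysqld(_Z18slave_stop_workersP14Relay_log_infoPb+0x830)[0x564d61d0bfa0]
/data/mysql_base/bin/mysqld(handle_slave_sql+0x2ed)[0x564d61d0c30d]
"""

# Assumed layout of one frame: binary(symbol+0xOFFSET)[0xADDRESS]
FRAME = re.compile(
    r"^(?P<binary>[^(]+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_frames(text):
    """Return the symbol of every frame, in the order the log prints them."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            # Frames without symbol information show up as "(+0x...)".
            frames.append(match.group("symbol") or "<no symbol>")
    return frames


def call_order(frames):
    """Reverse the log order so the outermost caller comes first."""
    return list(reversed(frames))


for depth, symbol in enumerate(call_order(parse_frames(BACKTRACE))):
    print("  " * depth + symbol)
```

The symbols are still mangled C++ names; piping them through `c++filt` turns them into readable signatures. Read in call order, the SQL thread (`handle_slave_sql`) calls `slave_stop_workers`, which calls `mts_checkpoint_routine`, and the crash happens in `Relay_log_info::cannot_safely_rollback`.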

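Judging from this error log, a null pointer dereference was triggered while stopping the workers, and that caused the crash. Percona has already fixed the null pointer problem (https://perconadev.atlassian.net/browse/PS-8030). But what triggered the worker stop that ultimately led to the crash?

I read through the code and made a guess. A replication thread is stopped in one of two situations: it is stopped deliberately, or executing an event fails. The logs show no event execution errors, and nobody stopped replication manually. That made me think of the thread pool. In 5.7 the IO and SQL threads are also managed by the thread pool rather than being split out separately, so it is possible for the thread pool to close them. My suspicion is therefore that the replica connection was idle for too long and got closed.

In the thread pool, the idle timeout for the replica connection is decided by whether fetching an event from epoll has timed out. It does not check whether the thread is currently processing an event. So if the thread receives no network data for a long time, it will be closed. Putting this together with the other logs from the master and the slave: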

master:
2024-12-27T02:52:17.371038Z 3475276 [Note] While initializing dump thread for slave with 
UUID <b74d5018-7733-11ef-a717-0a82daabbcb2>, found a zombie dump thread with the same UUID. 
Master is killing the zombie dump thread(36).
slave:
2024-12-27T02:52:17.325341Z 3178352 [Warning] Storing MySQL user name or password 
information in the master info repository is not secure and is therefore not recommended. 
Please consider using the USER and PASSWORD connection options for START SLAVE; 
see the 'START SLAVE Syntax' in the MySQL Manual for more information.
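
To see how closely these messages follow the buffer pool resize, here is a small sketch that orders the timestamps copied from the logs above and prints each offset from the first one. The event labels are my own shorthand, and the comparison assumes that both hosts' clocks are in sync and that all three lines belong to the same incident.

```python
from datetime import datetime, timedelta

# (label, timestamp) pairs; timestamps copied from the log lines above.
EVENTS = [
    ("replica: buffer pool resize completed", "2024-12-27T02:52:17.267691Z"),
    ("master: zombie dump thread killed", "2024-12-27T02:52:17.371038Z"),
    ("replica: master info repository warning", "2024-12-27T02:52:17.325341Z"),
]


def parse_ts(ts):
    """Parse the MySQL error log timestamp format (UTC, microseconds)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def timeline(events):
    """Sort events by time; return (offset in microseconds, label) pairs."""
    ordered = sorted((parse_ts(ts), label) for label, ts in events)
    start = ordered[0][0]
    return [((t - start) // timedelta(microseconds=1), label) for t, label in ordered]


for offset_us, label in timeline(EVENTS):
    print(f"+{offset_us / 1000:8.3f} ms  {label}")
```

Under those assumptions, the replica-side warning appears about 58 ms after the resize completed, and the master kills the old dump thread about 103 ms after it.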

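So my suspicion is that increasing thread_pool_idle_timeout helps and lowers the probability of hitting the crash. I have not been able to reproduce the problem, so this remains only a guess.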
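
The suspected behavior can be illustrated with a toy model. This is not the Percona Server thread pool code; the `Conn` class, both check functions, and the numbers are hypothetical and only encode my guess: an idle check that looks solely at the time of the last network event will pick a connection whose thread is still busy, while a check that also considers the busy state will not.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    name: str
    last_net_event: float  # last time epoll returned an event for this connection
    busy: bool = False     # True while a thread is still doing work for it


def network_only_idle_check(conns, now, idle_timeout):
    """Hypothesized behavior: only the time since the last network event counts."""
    return [c.name for c in conns if now - c.last_net_event > idle_timeout]


def busy_aware_idle_check(conns, now, idle_timeout):
    """Alternative for comparison: a busy connection is never treated as idle."""
    return [
        c.name
        for c in conns
        if not c.busy and now - c.last_net_event > idle_timeout
    ]


# Illustrative values: the replication connection has seen no network
# traffic for 90 seconds but is still marked busy.
conns = [
    Conn("client", last_net_event=95.0),
    Conn("slave_sql", last_net_event=10.0, busy=True),
]

print(network_only_idle_check(conns, now=100.0, idle_timeout=60.0))   # ['slave_sql']
print(busy_aware_idle_check(conns, now=100.0, idle_timeout=60.0))     # []
print(network_only_idle_check(conns, now=100.0, idle_timeout=120.0))  # []
```

In this model, raising the timeout from 60 to 120 keeps the busy replication connection from being selected, which is the effect I expect from a larger thread_pool_idle_timeout if the guess is right.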