MySQL Case Study: mysqld got signal 11

2024-07-24 12:35:33


Contributor: netizen

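　　Background:
　　MySQL 5.7.12 on a Debian virtual machine with 8 cores and 16G of memory. The business team reported that at a certain point in time the database returned a large number of errors and then went back to normal.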

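　　Scenario:
　　After checking the logs, the developers found that at that moment the application dropped every connection to the database and recovered a few seconds later.
　　At the same time, the monitoring system showed an abnormal, sudden drop in memory usage.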

　　Three minutes later the alerting system reported memory usage at 82%.

　　Analysis:
　　My first guess was that a network problem had broken the application-layer connections, but a network problem would not make memory usage drop sharply like this.
　　The MySQL log showed that the instance had crashed, with the following error output:


　　07:42:44 UTC - mysqld got signal 11 ;
　　This could be because you hit a bug. It is also possible that this binary
　　or one of the libraries it was linked against is corrupt, improperly built,
　　or misconfigured. This error can also be caused by malfunctioning hardware.
　　Attempting to collect some information that could help diagnose the problem.
　　As this is a crash and something is definitely wrong, the information
　　collection process might fail.
　　key_buffer_size=8388608
　　read_buffer_size=16777216
　　max_used_connections=29
　　max_threads=5000
　　thread_count=32
　　connection_count=22
　　It is possible that mysqld could use up to
　　key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 245834871 K bytes of memory
　　Hope that is ok; if not, decrease some variables in the equation.
　　Thread pointer: 0x7f607c0072c0
　　Attempting backtrace. You can use the following information to find out
　　where mysqld died. If you see no messages after this, something went
　　terribly wrong...
　　stack_bottom = 7f6141b36e80 thread_stack 0x40000
　　/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xe77fec]
　　/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7a7019]
　　/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f643257a8d0]
　　/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]
　　/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]
　　/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]
　　/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]
　　/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
　　/usr/sbin/mysqld(_Z12handle_queryP3THDP3LEXP12Query_resultyy+0x238)[0xc8e408]
　　/usr/sbin/mysqld[0x770ccf]
　　/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x3403)[0xc51103]
　　/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xc531bd]
　　/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x817)[0xc53a47]
　　/usr/sbin/mysqld(_Z10do_commandP3THD+0x18f)[0xc54faf]
　　/usr/sbin/mysqld(handle_connection+0x278)[0xd108d8]
　　/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xe90784]
　　/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f64325730a4]
　　/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6430e1b87d]
　　Trying to get some variables.
　　Some pointers may be invalid and cause the dump to abort.
　　Query (7f607c015ad0): select * from test where time>='2016-07-29 00:00:00' and time<='2016-07-29 23:59:59' and tag in (2,3,6) order by id desc limit 2000
　　Connection ID (thread ID): 138760
　　Status: NOT_KILLED
　　The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
　　information that should help you find out what is causing the crash.
　　2016-07-29T07:42:45.661724Z mysqld_safe Number of processes running now: 0
　　2016-07-29T07:42:45.664516Z mysqld_safe mysqld restarted
　　2016-07-29T15:42:45.991109+08:00 0 [Note] /usr/sbin/mysqld (mysqld 5.7.12-log) starting as process 8367 ...

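　　Start with the first part of the message:

　　mysqld got signal 11 ;
　　Feeding the number to the perror command (thanks to @楊奇龍 for the remote assist..._(:з」∠)_...) gave this text for the error code:

　　Resource temporarily unavailable
　　At the time I read this as MySQL finding some resource temporarily unavailable, probably because it was exhausted or an allocation had failed. Note that perror decodes OS error numbers, so this text belongs to errno 11 (EAGAIN); signal 11 is SIGSEGV, a segmentation fault, which is a different thing and says nothing about resources.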
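
　　Since signal numbers and errno values are both small integers, it is easy to decode one as the other. The following is a minimal sketch, not part of the original investigation, that looks up the same number both ways with the Python standard library. The helper name describe_code is made up for illustration, and the mapping is platform dependent (on a typical Linux system 11 is SIGSEGV as a signal and EAGAIN as an errno).

```python
import errno
import os
import signal


def describe_code(number):
    """Return (signal name, errno name, errno message) for a small integer."""
    try:
        signal_name = signal.Signals(number).name
    except ValueError:
        signal_name = None  # no signal with this number on this platform
    errno_name = errno.errorcode.get(number)  # None if no errno has this value
    return signal_name, errno_name, os.strerror(number)


signal_name, errno_name, message = describe_code(11)
print("as a signal:", signal_name)
print("as an errno:", errno_name, "-", message)
```

　　Only the first reading applies to "mysqld got signal 11"; the second is what perror prints.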

　　Then the second part:

　　It is possible that mysqld could use up to
　　key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 245834871 K bytes of memory
　　This is the maximum amount of memory the current configuration could require; converted, it comes to roughly 234G.
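
　　The conversion can be checked with a few lines of arithmetic. This sketch uses only numbers printed in the crash report above; it also solves the printed formula for sort_buffer_size, which the report does not list. The helper names are illustrative.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values copied from the crash report
reported_kbytes = 245834871
key_buffer_size = 8388608
read_buffer_size = 16777216
max_threads = 5000


def kbytes_to_gib(kbytes):
    """Convert a size in K bytes to GiB."""
    return kbytes * KIB / GIB


def implied_sort_buffer(total_kbytes):
    """Solve the report's formula for sort_buffer_size, in bytes."""
    per_thread = (total_kbytes * KIB - key_buffer_size) / max_threads
    return per_thread - read_buffer_size


total_gib = kbytes_to_gib(reported_kbytes)
sort_mib = implied_sort_buffer(reported_kbytes) / MIB
print(f"worst case: {total_gib:.1f} GiB")
print(f"implied sort_buffer_size: {sort_mib:.1f} MiB")
```

　　On a 16G machine the worst case is far out of reach, and almost all of it comes from the per-thread buffers multiplied by max_threads.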

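　　That is obviously far too high. Together with the 82% memory usage alert, it suggested that memory exhaustion was the cause.

　　Estimating from max_used_connections instead, the per-connection buffers come to roughly 1.5G.

　　Add the 9.6G buffer pool (BP) and that is already about 11G; with the memory MySQL itself needs on top, usage really was fairly high.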
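
　　A rough version of this estimate, as a sketch: it counts only the two per-thread buffers that appear in the crash report's formula, and takes the 32M sort buffer and the 9.6G buffer pool from the figures given elsewhere in this post. Because other per-connection buffers are left out, it lands slightly below the 1.5G quoted above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

read_buffer_size = 16777216   # from the crash report
sort_buffer_size = 32 * MIB   # value mentioned in the PSS of this post
max_used_connections = 29     # from the crash report
buffer_pool_gib = 9.6         # BP size given in this post


def session_buffers_gib(connections, *buffer_sizes):
    """Memory used if every connection allocates each buffer in full, in GiB."""
    return connections * sum(buffer_sizes) / GIB


session_gib = session_buffers_gib(
    max_used_connections, read_buffer_size, sort_buffer_size
)
total_gib = session_gib + buffer_pool_gib
print(f"per-connection buffers: {session_gib:.2f} GiB")
print(f"with buffer pool: {total_gib:.2f} GiB")
```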

　　However, neither the kernel log nor messages contained any OOM entries, so mysqld was probably not killed by the system.
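
　　A check like this can be scripted. The sketch below scans log lines for the phrases the Linux kernel usually writes when the OOM killer runs. The sample lines are synthetic, written only to exercise the function (they are not from this incident), and the exact wording varies between kernel versions, so the pattern is an assumption to adapt.

```python
import re

# Phrases commonly written by the Linux OOM killer; wording varies by kernel version.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE
)


def find_oom_lines(lines):
    """Return the log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Synthetic sample lines, for illustration only.
sample_log = [
    "kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0\n",
    "kernel: Out of memory: Kill process 1234 (mysqld) score 900 or sacrifice child\n",
    "mysqld_safe: mysqld restarted\n",
]

for hit in find_oom_lines(sample_log):
    print(hit)
```

　　In practice the input would be an open log file, for example /var/log/kern.log or /var/log/messages on a Debian system with the default syslog setup. An empty result, as in this case, points away from the OOM killer.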

　　Next, the stack trace, which records what MySQL was doing when it crashed:

　　stack_bottom = 7f6141b36e80 thread_stack 0x40000
　　/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xe77fec]
　　/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7a7019]
　　/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f643257a8d0]
　　/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]
　　/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]
　　/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]
　　/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]
　　/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
　　/usr/sbin/mysqld(_Z12handle_queryP3THDP3LEXP12Query_resultyy+0x238)[0xc8e408]
　　/usr/sbin/mysqld[0x770ccf]
　　/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x3403)[0xc51103]
　　/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xc531bd]
　　/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x817)[0xc53a47]
　　/usr/sbin/mysqld(_Z10do_commandP3THD+0x18f)[0xc54faf]
　　/usr/sbin/mysqld(handle_connection+0x278)[0xd108d8]
　　/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xe90784]
　　/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f64325730a4]
　　/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6430e1b87d]
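　　From the highlighted frames I inferred that MySQL was executing queries at the time, that those queries involved a join and a subquery, and that a partitioned table was being read. (Strictly speaking, JOIN::exec and sub_select are on the execution path of every SELECT in 5.7, so they do not prove a join or a subquery; the Partition_helper frame is the one that is specific to this query.)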
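
　　The frame names are easier to read once the C++ mangling is peeled off. Below is a minimal sketch that extracts the symbol from each frame and decodes only the length-prefixed name parts. It is not a full demangler (argument types are ignored, and a tool such as c++filt does this properly); the function names are illustrative.

```python
import re

# Matches frames such as /usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]
FRAME_RE = re.compile(r"\((?P<symbol>[A-Za-z_]\w*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]")


def readable_name(symbol):
    """Decode the (possibly nested) name of a simple mangled symbol.

    Symbols that do not start with _Z are returned unchanged.
    """
    if not symbol.startswith("_Z"):
        return symbol
    nested = symbol.startswith("_ZN")
    pos = 3 if nested else 2
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])  # each name is prefixed by its length
        parts.append(symbol[end:end + length])
        pos = end + length
        if not nested:
            break  # a non-nested symbol has a single name
    return "::".join(parts) if parts else symbol


def frame_names(trace_lines):
    """Return readable names for every frame that carries a symbol."""
    names = []
    for line in trace_lines:
        match = FRAME_RE.search(line)
        if match:
            names.append(readable_name(match.group("symbol")))
    return names


trace = [
    "/usr/sbin/mysqld(_ZN16Partition_helper25handle_ordered_index_scanEPh+0x5c)[0xbbabec]",
    "/usr/sbin/mysqld(_ZN7handler13ha_index_lastEPh+0x1b0)[0x7f4410]",
    "/usr/sbin/mysqld(_Z14join_read_lastP7QEP_TAB+0x65)[0xc1f605]",
    "/usr/sbin/mysqld(_Z10sub_selectP4JOINP7QEP_TABb+0x11b)[0xc25e4b]",
    "/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3b8)[0xc1ea78]",
]
for name in frame_names(trace):
    print(name)
```

　　Read from the top, the crash happened inside Partition_helper::handle_ordered_index_scan while the server was reading the last index entry for the order by ... desc.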

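　　I expected that, at the moment of the crash, MySQL needed to allocate join buffers to run these statements and that the subqueries would create temporary tables, both of which take memory.

　　The partitioned table mattered as well. I checked its size at the time and found that the data for that day was larger than the BP, that the query on the partitioned table did a full table scan, and that it had an order by, which uses the sort buffer; with that much data being scanned, the buffer may well have been allocated in full.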

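　　Putting all this together, I was fairly sure that memory exhaustion had crashed MySQL.

　　Based on the stack trace, I tried to reconstruct the situation at crash time:

　　With memory usage already very high, a MySQL thread running a query on a large table could no longer allocate enough memory (sort buffer, join buffer and so on), and so the crash occurred.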

　　Resolution:
　　In the end the BP was reduced from 9.6G to 9G, the sort, read and similar buffers were set to 4M, and the other buffers were lowered as well.
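
　　To see what the change does to the upper bound printed in the crash report, the same formula can be evaluated before and after. This is a sketch under two assumptions that the post does not confirm: key_buffer_size and max_threads stay at the values shown in the crash report, and the old sort buffer is the 32M mentioned in the PSS.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def worst_case_bytes(key_buffer_size, read_buffer_size, sort_buffer_size, max_threads):
    """Upper bound from the crash report's formula, in bytes."""
    return key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads


# Assumed unchanged by the tuning (values from the crash report)
key_buffer_size = 8388608
max_threads = 5000

before_gib = worst_case_bytes(key_buffer_size, 16 * MIB, 32 * MIB, max_threads) / GIB
after_gib = worst_case_bytes(key_buffer_size, 4 * MIB, 4 * MIB, max_threads) / GIB
print(f"before: {before_gib:.1f} GiB")
print(f"after:  {after_gib:.1f} GiB")
```

　　The bound is only reached if every one of the max_threads threads allocates its buffers in full, so with a few dozen connections the real footprint of these buffers is far smaller.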

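　　PS: Call it an oversight. I was told this configuration had been running in production for a long time without problems, so I never went through the settings in the config file carefully.
　　PSS: What was the sort buffer before? 32M... and the sort buffer is per thread... my bad..._(:з」∠)_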

　　-------------------------------------------------------------------------------------------------后續(xù)---------------------------------------------------------------------------------------------------------------

　　A twist..... With the buffers reduced, running out of memory should no longer have been possible, and yet the crash came back.

　　This time the primary and the replica both crashed within a very short time of each other.

　　Apart from the pointer values, the error output was identical, stack trace included.

　　So was the statement.
　　The first incident had recorded this statement:

　　select * from test where time>='2016-07-29 00:00:00' and time<='2016-07-29 23:59:59' and tag in (2,3,6) order by id desc limit 2000
　　In the two later crashes the recorded statement was the same (and the stack output was identical too); only the date differed:

　　select * from test where time>='2016-08-09 00:00:00' and time<='2016-08-09 23:59:59' and tag in (2,3,6) order by id desc limit 2000
　　If the crash had been caused by memory or some other system resource being unavailable, such a coincidence would be impossible: it was this statement every time.
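
　　Comparing statements across crash reports is easier once the literals that change from day to day are masked. A minimal sketch, using the two statements recorded above; the function name query_shape is illustrative:

```python
import re

# Datetime literals of the form 'YYYY-MM-DD HH:MM:SS'
DATETIME_LITERAL = re.compile(r"'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'")


def query_shape(sql):
    """Collapse whitespace and mask datetime literals so statements can be compared."""
    return DATETIME_LITERAL.sub("'?'", " ".join(sql.split()))


first_crash = (
    "select * from test where time>='2016-07-29 00:00:00' "
    "and time<='2016-07-29 23:59:59' and tag in (2,3,6) order by id desc limit 2000"
)
later_crash = (
    "select * from test where time>='2016-08-09 00:00:00' "
    "and time<='2016-08-09 23:59:59' and tag in (2,3,6) order by id desc limit 2000"
)

print(query_shape(first_crash))
print(query_shape(first_crash) == query_shape(later_crash))
```

　　The same statement shape in every crash report points at the query, not at resource pressure.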

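　　So I ran the statement on the replica after it had been brought back up, and MySQL crashed immediately...
　
　　Clearly the statement itself was the problem. order by desc + limit looks harmless enough, so I looked at the EXPLAIN result.
　
　　It has been a while since I did development work, but rows being only 1 while filtered is 100% is rather odd: the whole table holds a single row, yet there is a query asking for a full day of data.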

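　　Next, the table structure.
　　Part of the production information is hidden; only the key parts are kept.

　　The partitions of the partitioned table are wrong....

　　After asking the business team, it turned out that this feature was not finished yet, so the operations that maintain the table had not been running regularly.
　　The page for the feature had not been disabled, though, so the corresponding statement could still be triggered.

　　Since that statement reproduces the crash every time, with stack output identical to before,
　　the conclusion is that, with the table's partitions missing, running that query is what crashed MySQL.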
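
　　The table definition is not reproduced here, so the following is only a sketch of how a missing-partition check might look, assuming RANGE partitioning by date with exclusive upper bounds (VALUES LESS THAN) and no MAXVALUE partition. The partition names and bounds are entirely hypothetical and are not the production schema.

```python
from bisect import bisect_right
from datetime import date


def covering_partition(partitions, day):
    """Return the name of the RANGE partition holding `day`, or None.

    `partitions` is a list of (name, exclusive upper bound) sorted by bound.
    """
    bounds = [upper for _, upper in partitions]
    index = bisect_right(bounds, day)  # first partition whose bound is > day
    return partitions[index][0] if index < len(partitions) else None


# Hypothetical partition layout, for illustration only.
partitions = [
    ("p_a", date(2016, 6, 1)),
    ("p_b", date(2016, 7, 1)),
]

for day in (date(2016, 6, 15), date(2016, 7, 29), date(2016, 8, 9)):
    print(day, covering_partition(partitions, day))
```

　　A check of this kind, run against the real partition bounds before each day's queries, would flag days for which no partition exists.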

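　　Resolution:
　　The problem was found in the end, but at the start I was misled by the buffer settings and the memory usage pattern, too young......
　　PS: I used to think that partitioned tables, after the improvements in 5.7, would be quite usable..... well, now I reserve judgment....._(:з」∠)_
　　PPS: There should not be any more follow-ups, mm-hm....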

(Editor: 武林網)
