**故事背景:**
一线客户偶发性收到PG内存告警短信,平均2h小时左右发生一次。
初步判断是内存泄漏导致。
**1.信息收集:**
1)TOP 进程
2)数据库连接数
3)根据TOP pid查询TOP 会话
```sql SELECT s.procpid, s.start, now() - s.start AS elapsed_time, a.state, a.wait_event, s.current_query FROM (SELECT backendid, pg_stat_get_backend_pid(S.backendid) AS procpid, pg_stat_get_backend_activity_start(S.backendid) AS start, pg_stat_get_backend_activity(S.backendid) AS current_query FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS S ) AS s, pg_stat_activity a WHERE s.procpid=a.pid AND a.pid in( 29710, 29711, 29712, 29713, 29714, 29715, 29716, 29717, 29718, 29719 ); ```
4)根据top cpu/mem动态查找pg会话
```sql
while true; do for i in `ps -aux --sort -pcpu,-pmem | head -n 20| awk '{print $2}'| grep -v "PID" `; do psql -h 127.0.0.1 -c "SELECT
s.procpid,
s.start,
now() - s.start AS elapsed_time,
a.state,
a.wait_event,
s.current_query
FROM
(SELECT
backendid,
pg_stat_get_backend_pid(S.backendid) AS procpid,
pg_stat_get_backend_activity_start(S.backendid) AS start,
pg_stat_get_backend_activity(S.backendid) AS current_query
FROM
(SELECT pg_stat_get_backend_idset() AS backendid) AS S
) AS s,
pg_stat_activity a
WHERE
s.current_query <> ''
AND
s.procpid=a.pid
AND
s.procpid=${i}
AND
now() - s.start > interval '1 seconds'
AND
--exclude current pid
s.procpid <> pg_backend_pid()
ORDER BY
now() - s.start DESC;"; done; sleep 2 ; echo $(date); done
```
5)查看事务相关超时配置
6)将idle事务超时设置为60s,会话超时设置为15min
```sql
ALTER SYSTEM SET statement_timeout = 900000;
ALTER SYSTEM SET idle_in_transaction_session_timeout = 60000;
select pg_reload_conf();
```
7)重启数据库实例(业务低峰期)观察会话积压消失
8)初步解决问题,后续持续观察。
