运行在容器中Postgres数据库数据损坏后如何恢复?

运行,容器,postgres,数据库,数据,损坏,如何,恢复 · 浏览次数 : 0

小编点评

## postgres 数据库 WAL 损坏后恢复步骤 **问题:** postgres 数据库 WAL 损坏，导致 CrashBackoffLoop，无法正常启动。 **解决方案:** 1. **尝试启动 Pod** - 通过 `kubectl exec -it` 命令进入 `database-postgres-56cff865bb-92pcx` 容器中。 - 使用 `pg_reset_wal` 命令恢复 WAL。 2. **恢复 WAL** **步骤:** **a. 检查 WAL 恢复结果** - 使用 `pg_resetwal --dry-run` 命令检查 WAL 恢复结果。 - 如果结果符合预期，则使用 `pg_resetwal` 命令恢复 WAL。 **b. 恢复 WAL** - 使用 `pg_resetwal` 命令恢复 WAL。 - 注意： - `pg_resetwal` 命令仅适用于 WAL 文件已损坏的情况。 - 恢复 WAL 后，需要 restart postgres 服务。 **注意:** - 恢复 WAL 后，需要确保 `postgres` 服务能够正常启动。 - 使用 `pg_resetwal` 命令之前，请确保确保 WAL 文件的完整性。 **其他提示:** - 在执行 `pg_resetwal` 命令之前，请确保确保已备份重要数据。 - 恢复 WAL 后，请仔细检查数据库的性能，并根据需要进行优化。

正文

前言

在使用 K8S 部署 RSS 全套自托管解决方案- RssHub + Tiny Tiny Rss, 我介绍了将 RssHub + Tiny Tiny RSS 部署到 K8s 集群中的方案. 其中 TTRSS 会用到 Postgres 存储数据, 也一并部署到 K8s 容器中.

但是最近, 由于一次错误操作, 导致 Postgres 数据库的 WAL 损坏, Postgres 的 Pod 频繁 CrashBackoffLoop. 具体报错如下:

Postgres shutdown exit code 1:

2023-09-27 02:32:17.127 UTC [1] LOG:  received fast shutdown request
2023-09-27 02:32:17.181 UTC [1] LOG:  aborting any active transactions
2023-09-27 02:32:17.434 UTC [1] LOG:  background worker "logical replication launcher" (PID 26) exited with exit code 1
2023-09-27 02:32:17.481 UTC [21] LOG:  shutting down
2023-09-27 02:32:17.880 UTC [1] LOG:  database system is shut down
复制

Postgres "invalid resource manager ID in primary checkpoint record" and "could not locate a valid checkpoint record"

2023-09-27 02:33:23.189 UTC [1] LOG:  starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 02:33:23.190 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 02:33:23.190 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-09-27 02:33:23.199 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 02:33:23.210 UTC [21] LOG:  database system was shut down at 2023-09-27 02:32:22 UTC
2023-09-27 02:33:23.210 UTC [21] LOG:  invalid resource manager ID in primary checkpoint record
2023-09-27 02:33:23.210 UTC [21] PANIC:  could not locate a valid checkpoint record
2023-09-27 02:33:24.657 UTC [1] LOG:  startup process (PID 21) was terminated by signal 6: Aborted
2023-09-27 02:33:24.657 UTC [1] LOG:  aborting startup due to startup process failure
2023-09-27 02:33:24.659 UTC [1] LOG:  database system is shut down
复制

如上, WAL文件已损坏, 应该如何恢复?

恢复步骤

🐾Warning:

目的是启动 Postgres 恢复应用的正常运行. 数据可能存在丢失.

这是一个 TTRSS feed 应用, 只供我自己使用, 只要能启动起来, 丢失一点数据无所谓.

首先, Postgres Pod 在 CrashBackoffLoop, 无法进行任何操作, 首要任务是使 Pod 启动起来, 不要关闭. 这里通过在 Deployment 添加一些命令来实现. 如下:

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    spec:
      containers:
      - image: postgres:13-alpine
        imagePullPolicy: IfNotPresent
        name: postgres
        command: ["sh"]
        args: ["-c", "tail -f /dev/null"]
...
复制

如上, 通过 sh -c tail -f /dev/null 实现 Pod 运行. 也可以通过类似 while true; do sleep 30; done; 等类似命令来实现.

Pod 稳定运行后, 通过 kubectl exec -it 进入该Pod:

k3s kubectl exec -it database-postgres-56cff865bb-92pcx -n rsshub -- /bin/sh
复制

并切换到 postgres 用户:

su - postgres
复制

🐾Warning:

切换到 postgres 用户方可执行下面命令.

接下来就顺利了, 使用 pg_reset_wal 恢复 WAL:

先用 --dry-run 看看运行结果:

pg_resetwal --dry-run /var/lib/postgresql/data/
复制

如果结果符合预期, 再运行:

pg_resetwal /var/lib/postgresql/data/
Write-ahead log reset
复制

成功后, 退出 Pod. 并移除 Deploy 的 command 和 args 后, postgres 即可正常启动. 如下:

2023-09-27 04:03:25.172 UTC [1] LOG:  starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 04:03:25.173 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 04:03:25.173 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-09-27 04:03:25.179 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 04:03:25.187 UTC [20] LOG:  database system was shut down at 2023-09-27 04:02:42 UTC
2023-09-27 04:03:25.210 UTC [1] LOG:  database system is ready to accept connections
复制

完成🎉🎉🎉

三人行, 必有我师; 知识共享, 天下为公. 本文由东风微鸣技术博客 EWhisper.cn 编写.