
barman streaming replication stops working after switchover - normal wal archival continues to work #1035

Open

thoro opened this issue Nov 29, 2024 · 6 comments

@thoro (Contributor) commented Nov 29, 2024

Hi,

I'm using barman in combination with zalando/postgres-operator on kubernetes.

When the postgres instance does a failover, somehow the barman streaming replication connection can't recover. Sadly, I don't have an explicit log of that occurring right now.

But I also use normal WAL archival, and I think the streaming replication should try to restart based on the previously archived WALs. Currently it's just stuck at some old timeline point and keeps retaining WALs on the postgres server.

Configuration:

cat /etc/barman.d/service-db.conf
[service-db]
conninfo=host=service-db.group-service.svc.cluster.local port=5432 user=barman dbname=postgres
retention_policy=RECOVERY WINDOW OF 24 MONTHS
retention_policy_mode=auto
description=service-db Database
streaming_conninfo=host=service-db.group-service.svc.cluster.local port=5432 user=barman
streaming_wals_directory=/data/backups/service-db/streaming_wals
errors_directory=/data/backups/service-db/wal_errors
incoming_wals_directory=/data/backups/service-db/incoming
backup_directory=/data/backups/service-db/backups
network_compression=False
minimum_redundancy=4
wals_directory=/data/backups/service-db/wals
backup_method=postgres
streaming_archiver=on
slot_name=barman
path_prefix=/usr/pgsql-15/bin/
archiver=on
post_backup_retry_script=/data/backups/barman-scripts/post_backup.sh
post_archive_retry_script=/data/backups/barman-scripts/post_wal.sh
compression=custom
custom_compression_filter=zstd
custom_decompression_filter=zstdcat
custom_compression_magic = 0x28b52ffd

Barman status:

barman status service-db
Server service-db:
	Description: service-db Database
	Active: True
	Disabled: False
	PostgreSQL version: 15.2
	Cluster state: in production
	Current data size: 4.5 GiB
	PostgreSQL Data directory: /home/postgres/pgdata/pgroot/data
	Current WAL segment: 000000160000000F00000020
	PostgreSQL 'archive_command' setting: barman-wal-archive barman.group-service.svc.cluster.local service-db %p
	Last archived WAL: 000000160000000F00000020, at Fri Nov 29 06:49:06 2024
	Failures of WAL archiver: 0
	Server WAL archiving rate: 1.28/hour
	Passive node: False
	Retention policies: enforced (mode: auto, retention: RECOVERY WINDOW OF 24 MONTHS, WAL retention: MAIN)
	No. of available backups: 15
	First available backup: 20240524T084621
	Last available backup: 20241013T183002
	Minimum redundancy requirements: satisfied (15/4)
	

Status of the replication slots after failovers (output of `pg_replication_slots`):

  slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
--------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------+-------------+---------------------+------------+---------------+-----------
 service_db_1 |        | physical  |        |          | f         | t      |     154064 | 21130 |              | F/21000000  |                     | reserved   |               | f
 barman       |        | physical  |        |          | f         | f      |            |       |              | C/1F000058  |                     | extended   |               | f
(2 rows)
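The barman slot in this output already shows the symptom: it is inactive (active = f) yet its wal_status is extended, so the server keeps retaining WAL for a consumer that never reconnects. A minimal sketch of that check (a hypothetical helper, not part of Barman) over rows like the ones above:

```python
# Hypothetical helper (not part of Barman): flag replication slots that are
# inactive but still holding back WAL, like the 'barman' slot above.
def slot_is_stalled(active: bool, wal_status: str) -> bool:
    # An inactive slot whose WAL status is no longer 'reserved' is
    # retaining WAL without any consumer attached.
    return not active and wal_status in ("extended", "unreserved")

# (slot_name, active, wal_status) taken from the pg_replication_slots output above.
rows = [
    ("service_db_1", True, "reserved"),
    ("barman", False, "extended"),
]
stalled = [name for name, active, status in rows if slot_is_stalled(active, status)]
print(stalled)  # → ['barman']
```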

Errors produced by barman receive-wal:

[root@barman-0 /]# barman receive-wal service-db
2024-11-29 11:05:39,961 [40030] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
Starting receive-wal for server service-db
2024-11-29 11:05:40,036 [40030] barman.server INFO: Starting receive-wal for server service-db
2024-11-29 11:05:40,037 [40030] barman.wal_archiver INFO: Activating WAL archiving through streaming protocol
service-db: pg_receivewal: starting log streaming at A/98000000 (timeline 12)
2024-11-29 11:05:40,055 [40030] barman.command_wrappers INFO: service-db: pg_receivewal: starting log streaming at A/98000000 (timeline 12)
service-db: pg_receivewal: error: could not send replication command "START_REPLICATION":
2024-11-29 11:05:40,055 [40030] barman.command_wrappers INFO: service-db: pg_receivewal: error: could not send replication command "START_REPLICATION":
service-db: pg_receivewal: error: disconnected
2024-11-29 11:05:40,056 [40030] barman.command_wrappers INFO: service-db: pg_receivewal: error: disconnected
ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1
2024-11-29 11:05:40,056 [40030] barman.server ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1

Commands necessary to restart wal archiving:

barman receive-wal service-db --reset
rm -rf /data/backups/service-db/streaming_wals/*
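For what it's worth, the two manual steps can be wrapped in a small sketch (hypothetical, assuming the streaming_wals_directory from the config above and that no receive-wal process is running; the dry-run mode only reports what would be done):

```python
import glob
import os
import subprocess

def reset_streaming(server: str, streaming_wals_dir: str, dry_run: bool = True):
    # Mirrors the two manual commands above: reset barman's streaming
    # position, then clear stale (partial) segments from streaming_wals.
    cmds = [["barman", "receive-wal", server, "--reset"]]
    stale = sorted(glob.glob(os.path.join(streaming_wals_dir, "*")))
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
        for path in stale:
            os.remove(path)
    return cmds, stale

# Dry run: report what would be executed/removed without touching anything.
cmds, stale = reset_streaming("service-db", "/data/backups/service-db/streaming_wals")
print(cmds)
```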
@martinmarques (Contributor) commented:

We need more information here. For example, a step-by-step of the actions taken and the results after each action.

Based on what I see, I believe that this would be solved by using models and switching models on switchover/failover (this is what we recommend when using Barman with Patroni).

@martinmarques (Contributor) commented:

Also, check the postgres logs. Barman is only starting pg_receivewal, and that process is exiting with code 1. There should be an error in the postgres logs explaining why the connection was closed.

@thoro (Contributor, Author) commented Nov 29, 2024

@martinmarques The main issue seems to be that the replication slot doesn't continue with a continuous WAL archive: if you check my output, I have a restart_lsn of C/1F000058, but barman tries to start at A/98000000 on timeline 12.

I haven't fixed that replication in some time, and currently it's on timeline 22.

I'll try to reproduce so you can have some more info on that.
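To make the gap between the two positions concrete: PostgreSQL LSNs in X/Y notation can be compared by converting them to 64-bit integers (a generic sketch, not Barman code):

```python
def lsn_to_int(lsn: str) -> int:
    # PostgreSQL LSN 'X/Y': X is the high 32 bits, Y the low 32 bits, both hex.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

slot_restart = lsn_to_int("C/1F000058")   # the slot's restart_lsn above
stream_start = lsn_to_int("A/98000000")   # where pg_receivewal tried to start
# pg_receivewal tries to resume well behind the slot position:
print(hex(slot_restart - stream_start))   # → 0x187000058 (~6.1 GiB of WAL)
```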

@thoro (Contributor, Author) commented Nov 29, 2024

> Based on what I see, I believe that this would be solved by using models and switching models on switchover/failover (this is what we recommend when using Barman with Patroni).

I think it's not necessary to run with models, since none of my configs change (i.e. all URLs stay exactly the same). Does barman do specific tasks (e.g. reset the streaming WAL position) when calling barman config-switch?

@thoro (Contributor, Author) commented Nov 29, 2024

When testing the failover manually everything looked very good:

[root@barman-0 barman]# barman receive-wal service-db
Starting receive-wal for server service-db
service-db: pg_receivewal: starting log streaming at F/21000000 (timeline 22)


service-db: pg_receivewal: finished segment at F/22000000 (timeline 22)
service-db: pg_receivewal: not renaming "000000160000000F00000022.partial", segment is not complete
service-db: pg_receivewal: error: replication stream was terminated before stop point
service-db: pg_receivewal: error: disconnected
ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1
[root@barman-0 barman]#
[root@barman-0 barman]#
[root@barman-0 barman]# barman receive-wal service-db
Starting receive-wal for server service-db
service-db: pg_receivewal: starting log streaming at F/22000000 (timeline 22)
service-db: pg_receivewal: not renaming "000000160000000F00000022.partial", segment is not complete
service-db: pg_receivewal: switched to timeline 23 at F/220000A0

So I'm not sure how I'm constantly getting into that error state.

@martinmarques (Contributor) commented:

> I think it's not necessary to run with models, since none of my configs change (i.e. all urls stay exactly the same). Does barman do specific tasks (i.e. reset streaming wal position) on calling barman config-switch?

No, it just switches from one configuration to another. Of course, it does some checks and tasks on the Barman side so that the change applies cleanly. I hope that answers your question.

I'll wait for more info.
