
barman streaming replication stops working after switchover - normal wal archival continues to work #1035

Open

thoro opened this issue Nov 29, 2024 · 6 comments

@thoro (Contributor) commented Nov 29, 2024

Hi,

I'm using barman in combination with zalando/postgres-operator on kubernetes.

When the postgres instance does a failover, somehow the barman streaming replication connection can't recover. Sadly, I don't have an explicit log of that occurring right now.

But I also use normal WAL archival, and I think the streaming replication should try to restart based on the previously archived WALs. Currently it's just stuck at some old timeline point and keeps retaining WALs on the postgres server.

Configuration:

cat /etc/barman.d/service-db.conf
[service-db]
conninfo=host=service-db.group-service.svc.cluster.local port=5432 user=barman dbname=postgres
retention_policy=RECOVERY WINDOW OF 24 MONTHS
retention_policy_mode=auto
description=service-db Database
streaming_conninfo=host=service-db.group-service.svc.cluster.local port=5432 user=barman
streaming_wals_directory=/data/backups/service-db/streaming_wals
errors_directory=/data/backups/service-db/wal_errors
incoming_wals_directory=/data/backups/service-db/incoming
backup_directory=/data/backups/service-db/backups
network_compression=False
minimum_redundancy=4
wals_directory=/data/backups/service-db/wals
backup_method=postgres
streaming_archiver=on
slot_name=barman
path_prefix=/usr/pgsql-15/bin/
archiver=on
post_backup_retry_script=/data/backups/barman-scripts/post_backup.sh
post_archive_retry_script=/data/backups/barman-scripts/post_wal.sh
compression=custom
custom_compression_filter=zstd
custom_decompression_filter=zstdcat
custom_compression_magic = 0x28b52ffd

Barman status:

barman status service-db
Server service-db:
	Description: service-db Database
	Active: True
	Disabled: False
	PostgreSQL version: 15.2
	Cluster state: in production
	Current data size: 4.5 GiB
	PostgreSQL Data directory: /home/postgres/pgdata/pgroot/data
	Current WAL segment: 000000160000000F00000020
	PostgreSQL 'archive_command' setting: barman-wal-archive barman.group-service.svc.cluster.local service-db %p
	Last archived WAL: 000000160000000F00000020, at Fri Nov 29 06:49:06 2024
	Failures of WAL archiver: 0
	Server WAL archiving rate: 1.28/hour
	Passive node: False
	Retention policies: enforced (mode: auto, retention: RECOVERY WINDOW OF 24 MONTHS, WAL retention: MAIN)
	No. of available backups: 15
	First available backup: 20240524T084621
	Last available backup: 20241013T183002
	Minimum redundancy requirements: satisfied (15/4)
	

Status of the replication slots after failovers (output of `pg_replication_slots`):

  slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
--------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------+-------------+---------------------+------------+---------------+-----------
 service_db_1 |        | physical  |        |          | f         | t      |     154064 | 21130 |              | F/21000000  |                     | reserved   |               | f
 barman       |        | physical  |        |          | f         | f      |            |       |              | C/1F000058  |                     | extended   |               | f
(2 rows)
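The barman slot in this output already shows the symptom: it is inactive (active = f) yet its wal_status is extended, so the server keeps retaining WAL for a consumer that never reconnects. A minimal sketch of that check (a hypothetical helper, not part of Barman) over rows like the ones above:

```python
# Hypothetical helper (not part of Barman): flag replication slots that are
# inactive but still holding back WAL, like the 'barman' slot above.
def slot_is_stalled(active: bool, wal_status: str) -> bool:
    # An inactive slot whose WAL status is no longer 'reserved' is
    # retaining WAL without any consumer attached.
    return not active and wal_status in ("extended", "unreserved")

# (slot_name, active, wal_status) taken from the pg_replication_slots output above.
rows = [
    ("service_db_1", True, "reserved"),
    ("barman", False, "extended"),
]
stalled = [name for name, active, status in rows if slot_is_stalled(active, status)]
print(stalled)  # → ['barman']
```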

Errors produced by barman receive-wal:

[root@barman-0 /]# barman receive-wal service-db
2024-11-29 11:05:39,961 [40030] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
Starting receive-wal for server service-db
2024-11-29 11:05:40,036 [40030] barman.server INFO: Starting receive-wal for server service-db
2024-11-29 11:05:40,037 [40030] barman.wal_archiver INFO: Activating WAL archiving through streaming protocol
service-db: pg_receivewal: starting log streaming at A/98000000 (timeline 12)
2024-11-29 11:05:40,055 [40030] barman.command_wrappers INFO: service-db: pg_receivewal: starting log streaming at A/98000000 (timeline 12)
service-db: pg_receivewal: error: could not send replication command "START_REPLICATION":
2024-11-29 11:05:40,055 [40030] barman.command_wrappers INFO: service-db: pg_receivewal: error: could not send replication command "START_REPLICATION":
service-db: pg_receivewal: error: disconnected
2024-11-29 11:05:40,056 [40030] barman.command_wrappers INFO: service-db: pg_receivewal: error: disconnected
ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1
2024-11-29 11:05:40,056 [40030] barman.server ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1

Commands necessary to restart wal archiving:

barman receive-wal service-db --reset
rm -rf /data/backups/service-db/streaming_wals/*
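For what it's worth, the two manual steps can be wrapped in a small sketch (hypothetical, assuming the streaming_wals_directory from the config above and that no receive-wal process is running; the dry-run mode only reports what would be done):

```python
import glob
import os
import subprocess

def reset_streaming(server: str, streaming_wals_dir: str, dry_run: bool = True):
    # Mirrors the two manual commands above: reset barman's streaming
    # position, then clear stale (partial) segments from streaming_wals.
    cmds = [["barman", "receive-wal", server, "--reset"]]
    stale = sorted(glob.glob(os.path.join(streaming_wals_dir, "*")))
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
        for path in stale:
            os.remove(path)
    return cmds, stale

# Dry run: report what would be executed/removed without touching anything.
cmds, stale = reset_streaming("service-db", "/data/backups/service-db/streaming_wals")
print(cmds)
```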
@martinmarques (Contributor) commented:

We need more information here. For example, a step-by-step of the actions taken and the results after each action.

Based on what I see, I believe that this would be solved by using models and switching models on switchover/failover (this is what we recommend when using Barman with Patroni).

@martinmarques (Contributor) commented:

Also, check the postgres logs. Barman is only starting pg_receivewal, and that process is exiting with code 1. There should be an error in the postgres logs explaining why the connection was closed.

@thoro (Contributor, Author) commented Nov 29, 2024

@martinmarques The main issue seems to be that the replication slot doesn't continue with a continuous WAL archive: if you check my output, I have a restart_lsn of C/1F000058, but barman tries to start at A/98000000 on timeline 12.

I haven't fixed that replication in some time, and currently it's on timeline 22.

I'll try to reproduce so you can have some more info on that.
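To make the gap between the two positions concrete: PostgreSQL LSNs in X/Y notation can be compared by converting them to 64-bit integers (a generic sketch, not Barman code):

```python
def lsn_to_int(lsn: str) -> int:
    # PostgreSQL LSN 'X/Y': X is the high 32 bits, Y the low 32 bits, both hex.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

slot_restart = lsn_to_int("C/1F000058")   # the slot's restart_lsn above
stream_start = lsn_to_int("A/98000000")   # where pg_receivewal tried to start
# pg_receivewal tries to resume well behind the slot position:
print(hex(slot_restart - stream_start))   # → 0x187000058 (~6.1 GiB of WAL)
```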

@thoro (Contributor, Author) commented Nov 29, 2024

> Based on what I see, I believe that this would be solved by using models and switching models on switchover/failover (this is what we recommend when using Barman with Patroni).

I think it's not necessary to run with models, since none of my configs change (i.e. all URLs stay exactly the same). Does barman do specific tasks (e.g. reset the streaming WAL position) when calling barman config-switch?

@thoro (Contributor, Author) commented Nov 29, 2024

When testing the failover manually everything looked very good:

[root@barman-0 barman]# barman receive-wal service-db
Starting receive-wal for server service-db
service-db: pg_receivewal: starting log streaming at F/21000000 (timeline 22)


service-db: pg_receivewal: finished segment at F/22000000 (timeline 22)
service-db: pg_receivewal: not renaming "000000160000000F00000022.partial", segment is not complete
service-db: pg_receivewal: error: replication stream was terminated before stop point
service-db: pg_receivewal: error: disconnected
ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1
[root@barman-0 barman]#
[root@barman-0 barman]#
[root@barman-0 barman]# barman receive-wal service-db
Starting receive-wal for server service-db
service-db: pg_receivewal: starting log streaming at F/22000000 (timeline 22)
service-db: pg_receivewal: not renaming "000000160000000F00000022.partial", segment is not complete
service-db: pg_receivewal: switched to timeline 23 at F/220000A0

So I'm not sure how I'm constantly getting into that error state.

@martinmarques (Contributor) commented:

> I think it's not necessary to run with models, since none of my configs change (i.e. all urls stay exactly the same). Does barman do specific tasks (i.e. reset streaming wal position) on calling barman config-switch?

No, it just switches from one configuration to another. Of course, it does some checks and tasks on the Barman side so that the change applies cleanly. I hope that answers your question.

I'll wait for more info.
