SOLVED: PostgreSQL WAL problems

Status
Not open for further replies.
Hi Guys,

Not sure if this is the right place for this question.

I am running FusionPBX 4.4 with PostgreSQL 9.4 and BDR. I am running into a problem where my secondary server is running out of disk space; the biggest consumers are the PostgreSQL WAL files.

On my main server the oldest WAL file is dated April of this month, while my secondary server's oldest files date back to January. I stand to be corrected, but it seems that autovacuum isn't doing what it needs to. I have confirmed that it is on on both servers and that the process is running:

fusionpbx=# select name, setting from pg_settings where name = 'autovacuum' ;
name | setting
------------+---------
autovacuum | on
(1 row)

fusionpbx=#

root@c3po:/etc/postgresql/9.4/main# ps -aef | grep autovacuum
postgres 105175 105170 0 08:01 ? 00:00:00 postgres: autovacuum launcher process
postgres 105427 105170 1 08:21 ? 00:00:00 postgres: autovacuum worker process template1

From what I have read about WAL files and autovacuum, this should be sufficient. In the postgresql.conf file those options are all commented out, meaning it is running the default configuration. I have the same config on my main server, and that is working as expected.
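For anyone checking the same thing, the WAL and checkpoint settings in effect can be read the same way as the autovacuum setting above (9.4 parameter names; the `source` column shows whether the commented-out defaults apply):

```sql
-- WAL/checkpoint settings that govern how many segments are kept around
-- (PostgreSQL 9.4 names; source = 'default' means postgresql.conf left it alone)
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('checkpoint_segments', 'checkpoint_completion_target',
               'wal_keep_segments', 'max_replication_slots');
```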

Now, I only know enough PostgreSQL to get by.

Any help would be greatly appreciated.

Thanks
Kobus
 
To add, I found something on my secondary DB:

fusionpbx=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
fusionpbx-# FROM pg_replication_slots
fusionpbx-# WHERE plugin = 'bdr';
slot_name | database | active | retained_bytes
-----------------------------------------+------------+--------+----------------
bdr_16385_6563362034024864690_1_16385__ | freeswitch | f | 2066244639560
bdr_16384_6563362034024864690_1_16384__ | fusionpbx | f | 2066244639560
(2 rows)

fusionpbx=#
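Those retained_bytes figures (roughly 2 TB here) are easier to read with pg_size_pretty; this is the same query, just formatted (on 9.4, pg_size_pretty accepts the numeric result of pg_xlog_location_diff):

```sql
-- Same query, with the WAL retained by each slot shown human-readably
SELECT slot_name, database, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(),
                                            restart_lsn)) AS retained
FROM pg_replication_slots
WHERE plugin = 'bdr';
```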

The main one is showing as active:

fusionpbx=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
fusionpbx-# FROM pg_replication_slots
fusionpbx-# WHERE plugin = 'bdr';
slot_name | database | active | retained_bytes
-----------------------------------------+------------+--------+----------------
bdr_16385_6563361912103695863_1_16385__ | freeswitch | t | 33676602936
bdr_16384_6563361912103695863_1_16384__ | fusionpbx | t | 320944
(2 rows)

fusionpbx=#

Ideally I don't want to recreate the BDR setup; I had a hard enough time getting it to work in the first place.
 
This is what I have at the moment:

fusionpbx=# select * from pg_replication_slots
fusionpbx-# ;
slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+------------+--------+------+--------------+--------------
bdr_16385_6563361912103695863_1_16385__ | bdr | logical | 16385 | freeswitch | f | | 401618435 | 1BE/2516E688
bdr_16384_6563361912103695863_1_16384__ | bdr | logical | 16384 | fusionpbx | f | | 416965666 | 1C6/50824D68
(2 rows)

fusionpbx=#

Unfortunately my second node is down for the time being, but the replication slots look similar on the secondary server.

Thanks
 

DigitalDaz

Administrator
Staff member
You don't need to redo the BDR; you are probably just going to have to restart it from a base backup. The likely reason is that the longer the secondary is off, the more the backlog grows, and in the end it starts to overwrite itself, so you no longer have the initial info needed to catch up. You may still be good, but you really need to get the secondary back up as soon as possible.
 
Morning DigitalDaz,

So I moved my WAL files to a bigger drive and got Postgres to start again. The problem isn't sorted yet, though.

This is what I have for the replication slots for both machines:

Main:

postgres@r2d2:~$ psql fusionpbx
psql (9.4.17)
Type "help" for help.

fusionpbx=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+------------+--------+------+--------------+--------------
bdr_16385_6563361912103695863_1_16385__ | bdr | logical | 16385 | freeswitch | t | | 401727687 | 1BE/32CA5FA8
bdr_16384_6563361912103695863_1_16384__ | bdr | logical | 16384 | fusionpbx | t | | 424006492 | 1CA/862DA3F0
(2 rows)

fusionpbx=#

Secondary:

fusionpbx=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+------------+--------+------+--------------+-------------
bdr_16385_6563362034024864690_1_16385__ | bdr | logical | 16385 | freeswitch | f | | 136940311 | C1/2E57E90
bdr_16384_6563362034024864690_1_16384__ | bdr | logical | 16384 | fusionpbx | f | | 136940311 | C1/2E57E90
(2 rows)

fusionpbx=#

So on the main server active = t, but not on the secondary.

Thanks
Kobus
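As a side note, each active slot on the sending side should have a matching walsender session; a quick cross-check on either node uses the standard 9.4 view (nothing BDR-specific):

```sql
-- Each active replication slot should have a walsender listed here;
-- an empty result means no peer is currently connected and streaming
SELECT pid, application_name, state, sent_location, replay_location
FROM pg_stat_replication;
```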
 
The main issue I am dealing with is that, although new WAL is being written on the replica DB, it still seems to have a lot of old WAL files that are not present on the master DB.

Would creating a new base backup fix that as well?
 
OK, so I did the following:

pg_controldata /var/lib/postgresql/9.4/main/
- to get the latest checkpoint's REDO WAL file

root@c3po:/usr/lib/postgresql/9.4/bin# ./pg_controldata /var/lib/postgresql/9.4/main/
pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6563361912103695863
Database cluster state: in production
pg_control last modified: Thu 16 Apr 2020 12:44:21 SAST
Latest checkpoint location: 2C7/BB55CAB8
Prior checkpoint location: 2C7/B835BB48
Latest checkpoint's REDO location: 2C7/BA008F78
Latest checkpoint's REDO WAL file: 00000001000002C7000000BA

pg_archivecleanup -n /var/lib/postgresql/9.4/main/pg_xlog/ 00000001000002C7000000BA
- With -n this did a dry run and listed a lot of my old files as candidates for deletion. I then changed the flag to -d and ran it again to actually delete them.

It removed ALL the WAL files except that one file. The database started again without any hassles... it's busy populating normally now.
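For anyone repeating this, a useful cross-check before deleting anything is to compare the pg_controldata output against what the server reports from SQL (9.4 function names; nothing newer than the current segment should ever be removed):

```sql
-- Current WAL insert position and the segment file it lives in (9.4 naming);
-- pg_archivecleanup must only ever remove files older than the REDO segment
SELECT pg_current_xlog_insert_location(),
       pg_xlogfile_name(pg_current_xlog_insert_location());
```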

I do, however, think I might still have underlying issues. I am still seeing a lot of this in the logs (they were there before as well):

2020-04-16 12:42:36 SAST [15219-199] LOG: checkpoints are occurring too frequently (7 seconds apart)
2020-04-16 12:42:36 SAST [15219-200] HINT: Consider increasing the configuration parameter "checkpoint_segments".
2020-04-16 12:42:41 SAST [15436-1] WARNING: oldest xmin is far in the past
2020-04-16 12:42:41 SAST [15436-2] HINT: Close open transactions soon to avoid wraparound problems.
2020-04-16 12:42:41 SAST [15436-3] WARNING: oldest xmin is far in the past

Not sure how to go about fixing this.

Any advice?
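On the "oldest xmin" warning: besides long-running transactions, the catalog_xmin of a replication slot can also pin the horizon, so it is worth checking both (a hedged sketch using standard 9.4 catalog views):

```sql
-- Oldest open transactions that could be holding back xmin...
SELECT pid, state, xact_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;

-- ...and the xmin horizons pinned by replication slots
SELECT slot_name, xmin, catalog_xmin
FROM pg_replication_slots;
```

The checkpoint warning in the same log snippet can be addressed as the HINT says, by raising checkpoint_segments (the 9.4 default is 3) in postgresql.conf.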
 
I have a stupid question: do the BDR node slots have to be active on both master and replica?

Master:

fusionpbx=# select * from bdr.bdr_node_slots;
node_name | slot_name | slot_restart_lsn | slot_confirmed_lsn | walsender_active | walsender_pid | sent_location | write_location | flush_location | replay_location
----------------+-----------------------------------------+------------------+--------------------+------------------+---------------+---------------+----------------+----------------+-----------------
172.16.152.245 | bdr_16384_6563361912103695863_1_16384__ | 1CE/15E87458 | 1CE/15E9A0F0 | t | 143216 | 1CE/15E9A0F0 | 1CE/15E9A0F0 | 1CE/15E9A0F0 | 1CE/15E9A0F0
(1 row)

fusionpbx=#

Slave:

fusionpbx=# select * from bdr.bdr_node_slots;
node_name | slot_name | slot_restart_lsn | slot_confirmed_lsn | walsender_active | walsender_pid | sent_location | write_location | flush_location | replay_location
----------------+-----------------------------------------+------------------+--------------------+------------------+---------------+---------------+----------------+----------------+-----------------
172.16.152.244 | bdr_16384_6563362034024864690_1_16384__ | C1/2E57E90 | C1/2E57E90 | f | | | | |
(1 row)

fusionpbx=#

There aren't any BDR errors in the logs:

root@c3po:/etc/postgresql/9.4/main# ps auxww | grep bdr
postgres 26111 0.0 0.0 236736 8088 ? Ss 10:54 0:00 postgres: bgworker: bdr supervisor
postgres 26112 0.0 0.0 242088 19184 ? Ss 10:54 0:00 postgres: bgworker: bdr db: freeswitch
postgres 26113 0.0 0.0 242096 27508 ? Ss 10:54 0:00 postgres: bgworker: bdr db: fusionpbx
postgres 26114 7.6 0.4 244240 152456 ? Ds 10:54 1:18 postgres: bgworker: bdr (6563361912103695863,1,16385,)->bdr (6563362034024864690,1,
postgres 26115 0.6 0.1 244112 43328 ? Ss 10:54 0:06 postgres: bgworker: bdr (6563361912103695863,1,16384,)->bdr (6563362034024864690,1,
root 26277 0.0 0.0 12744 2156 pts/2 R+ 11:11 0:00 grep bdr
root@c3po:/etc/postgresql/9.4/main#

On Postgres startup:

root@c3po:/etc/postgresql/9.4/main# tail -f /var/log/postgresql/postgresql-9.4-main.log
2020-04-17 10:54:21 SAST [26107-1] [unknown]@[unknown] LOG: incomplete startup packet
2020-04-17 10:54:21 SAST [26099-5] LOG: starting background worker process "bdr supervisor"
2020-04-17 10:54:21 SAST [26099-6] LOG: registering background worker "bdr db: fusionpbx"
2020-04-17 10:54:21 SAST [26099-7] LOG: registering background worker "bdr db: freeswitch"
2020-04-17 10:54:21 SAST [26099-8] LOG: starting background worker process "bdr db: freeswitch"
2020-04-17 10:54:21 SAST [26099-9] LOG: starting background worker process "bdr db: fusionpbx"
2020-04-17 10:54:21 SAST [26099-10] LOG: registering background worker "bdr (6563361912103695863,1,16385,)->bdr (6563362034024864690,1,"
2020-04-17 10:54:21 SAST [26099-11] LOG: starting background worker process "bdr (6563361912103695863,1,16385,)->bdr (6563362034024864690,1,"
2020-04-17 10:54:22 SAST [26099-12] LOG: registering background worker "bdr (6563361912103695863,1,16384,)->bdr (6563362034024864690,1,"
2020-04-17 10:54:22 SAST [26099-13] LOG: starting background worker process "bdr (6563361912103695863,1,16384,)->bdr (6563362034024864690,1,"

Thanks for all the advice so far.
 
Hi All,

Just to give an update.

I removed my replication slots for now, which has resolved most of the issues. Obviously replication is a bit one-sided at the moment, which I still need to fix.

For the most part, my WAL issues are sorted, so just to acknowledge the help DigitalDaz provided: I removed the inactive (but supposed-to-be-active) replication slots, which stopped the WAL directory from bursting at the seams. For now, WAL usage is within limits and no longer growing out of control.
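For anyone in the same position, an inactive slot can be dropped with the standard function (the slot name below is one from the earlier output); note that dropping a slot means the peer can no longer catch up through it and will need to be re-seeded or re-joined:

```sql
-- Drop an inactive logical slot; the WAL it pinned becomes removable
-- at the next checkpoint. Only do this if the peer will be re-seeded.
SELECT pg_drop_replication_slot('bdr_16385_6563362034024864690_1_16385__');
```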

Thanks for all the help
 