SOLVED: PostgreSQL WAL problems

Status
Not open for further replies.
Hi Guys,

Not sure if this is the right place for this question.

I am running FusionPBX 4.4 with PostgreSQL 9.4 and BDR. I am running into a problem where my secondary server is running out of disk space; the biggest consumers are the PostgreSQL WAL files.

On my main server the oldest WAL file is dated April of this month, while my secondary server's oldest files date back to January. I stand to be corrected, but it seems that autovacuum isn't doing what it needs to. I have confirmed that it is on on both servers and that the process is running:

fusionpbx=# select name, setting from pg_settings where name = 'autovacuum' ;
name | setting
------------+---------
autovacuum | on
(1 row)

fusionpbx=#

root@c3po:/etc/postgresql/9.4/main# ps -aef | grep autovacuum
postgres 105175 105170 0 08:01 ? 00:00:00 postgres: autovacuum launcher process
postgres 105427 105170 1 08:21 ? 00:00:00 postgres: autovacuum worker process template1

From what I have read about WAL files and autovacuum, this should be sufficient. In the postgresql.conf file those options are all commented out, meaning it is running the default configuration. I have the same config on my main server, and that is working as expected.
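For anyone checking the same thing, the WAL and checkpoint settings in effect can be read the same way as the autovacuum setting above (9.4 parameter names; the `source` column shows whether the commented-out defaults apply):

```sql
-- WAL/checkpoint settings that govern how many segments are kept around
-- (PostgreSQL 9.4 names; source = 'default' means postgresql.conf left it alone)
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('checkpoint_segments', 'checkpoint_completion_target',
               'wal_keep_segments', 'max_replication_slots');
```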

Now, I only know enough PostgreSQL to get by.

Any help would be greatly appreciated.

Thanks
Kobus
 
To add, I found something on my secondary DB:

fusionpbx=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
fusionpbx-# FROM pg_replication_slots
fusionpbx-# WHERE plugin = 'bdr';
slot_name | database | active | retained_bytes
-----------------------------------------+------------+--------+----------------
bdr_16385_6563362034024864690_1_16385__ | freeswitch | f | 2066244639560
bdr_16384_6563362034024864690_1_16384__ | fusionpbx | f | 2066244639560
(2 rows)

fusionpbx=#
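Those retained_bytes figures (roughly 2 TB here) are easier to read with pg_size_pretty; this is the same query, just formatted (on 9.4, pg_size_pretty accepts the numeric result of pg_xlog_location_diff):

```sql
-- Same query, with the WAL retained by each slot shown human-readably
SELECT slot_name, database, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(),
                                            restart_lsn)) AS retained
FROM pg_replication_slots
WHERE plugin = 'bdr';
```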

The main one is showing as active:

fusionpbx=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
fusionpbx-# FROM pg_replication_slots
fusionpbx-# WHERE plugin = 'bdr';
slot_name | database | active | retained_bytes
-----------------------------------------+------------+--------+----------------
bdr_16385_6563361912103695863_1_16385__ | freeswitch | t | 33676602936
bdr_16384_6563361912103695863_1_16384__ | fusionpbx | t | 320944
(2 rows)

fusionpbx=#

Ideally I don't want to recreate the BDR setup; I had a hard enough time getting it to work in the first place.
 
This is what I have at the moment:

fusionpbx=# select * from pg_replication_slots
fusionpbx-# ;
slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+------------+--------+------+--------------+--------------
bdr_16385_6563361912103695863_1_16385__ | bdr | logical | 16385 | freeswitch | f | | 401618435 | 1BE/2516E688
bdr_16384_6563361912103695863_1_16384__ | bdr | logical | 16384 | fusionpbx | f | | 416965666 | 1C6/50824D68
(2 rows)

fusionpbx=#

Unfortunately my second node is down for the time being, but the replication slots look similar on the secondary server.

Thanks
 

DigitalDaz

Administrator
Staff member
You don't need to redo the BDR; you are probably just going to have to restart it from a base backup. The likely reason is that the longer the secondary is off, the more the backlog grows, and in the end it starts to overwrite itself, so you no longer have the initial info needed to catch up. You may still be good, but you really need to get the secondary back up as soon as possible.
 
Morning DigitalDaz,

So I moved my WAL files to a bigger drive and got Postgres to start again. The problem isn't sorted yet, though.

This is what I have for the replication slots for both machines:

Main:

postgres@r2d2:~$ psql fusionpbx
psql (9.4.17)
Type "help" for help.

fusionpbx=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+------------+--------+------+--------------+--------------
bdr_16385_6563361912103695863_1_16385__ | bdr | logical | 16385 | freeswitch | t | | 401727687 | 1BE/32CA5FA8
bdr_16384_6563361912103695863_1_16384__ | bdr | logical | 16384 | fusionpbx | t | | 424006492 | 1CA/862DA3F0
(2 rows)

fusionpbx=#

Secondary:

fusionpbx=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
-----------------------------------------+--------+-----------+--------+------------+--------+------+--------------+-------------
bdr_16385_6563362034024864690_1_16385__ | bdr | logical | 16385 | freeswitch | f | | 136940311 | C1/2E57E90
bdr_16384_6563362034024864690_1_16384__ | bdr | logical | 16384 | fusionpbx | f | | 136940311 | C1/2E57E90
(2 rows)

fusionpbx=#

So on the main server active = t, but not on the secondary.

Thanks
Kobus
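As a side note, each active slot on the sending side should have a matching walsender session; a quick cross-check on either node uses the standard 9.4 view (nothing BDR-specific):

```sql
-- Each active replication slot should have a walsender listed here;
-- an empty result means no peer is currently connected and streaming
SELECT pid, application_name, state, sent_location, replay_location
FROM pg_stat_replication;
```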
 
The main issue I am dealing with is that, although new WAL is being written on the replica DB, it still seems to have a lot of old WAL files that are not present on the master DB.

Would creating a new base backup fix that as well?
 
OK, so I did the following:

pg_controldata /var/lib/postgresql/9.4/main/
- to get the latest checkpoint's REDO WAL file

root@c3po:/usr/lib/postgresql/9.4/bin# ./pg_controldata /var/lib/postgresql/9.4/main/
pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6563361912103695863
Database cluster state: in production
pg_control last modified: Thu 16 Apr 2020 12:44:21 SAST
Latest checkpoint location: 2C7/BB55CAB8
Prior checkpoint location: 2C7/B835BB48
Latest checkpoint's REDO location: 2C7/BA008F78
Latest checkpoint's REDO WAL file: 00000001000002C7000000BA

pg_archivecleanup -n /var/lib/postgresql/9.4/main/pg_xlog/ 00000001000002C7000000BA
- With -n this did a dry run and listed a lot of my old files as candidates for deletion. I then changed the flag to -d and ran it again to actually delete them.

It removed ALL the WAL files except that one file. The database started again without any hassles... it's busy populating normally now.
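For anyone repeating this, a useful cross-check before deleting anything is to compare the pg_controldata output against what the server reports from SQL (9.4 function names; nothing newer than the current segment should ever be removed):

```sql
-- Current WAL insert position and the segment file it lives in (9.4 naming);
-- pg_archivecleanup must only ever remove files older than the REDO segment
SELECT pg_current_xlog_insert_location(),
       pg_xlogfile_name(pg_current_xlog_insert_location());
```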

I do, however, think I might still have underlying issues. I am still seeing a lot of this in the logs (they were there before as well):

2020-04-16 12:42:36 SAST [15219-199] LOG: checkpoints are occurring too frequently (7 seconds apart)
2020-04-16 12:42:36 SAST [15219-200] HINT: Consider increasing the configuration parameter "checkpoint_segments".
2020-04-16 12:42:41 SAST [15436-1] WARNING: oldest xmin is far in the past
2020-04-16 12:42:41 SAST [15436-2] HINT: Close open transactions soon to avoid wraparound problems.
2020-04-16 12:42:41 SAST [15436-3] WARNING: oldest xmin is far in the past

Not sure how to go about fixing this.

Any advice?
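On the "oldest xmin" warning: besides long-running transactions, the catalog_xmin of a replication slot can also pin the horizon, so it is worth checking both (a hedged sketch using standard 9.4 catalog views):

```sql
-- Oldest open transactions that could be holding back xmin...
SELECT pid, state, xact_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;

-- ...and the xmin horizons pinned by replication slots
SELECT slot_name, xmin, catalog_xmin
FROM pg_replication_slots;
```

The checkpoint warning in the same log snippet can be addressed as the HINT says, by raising checkpoint_segments (the 9.4 default is 3) in postgresql.conf.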
 
I have a stupid question: do the BDR node slots have to be active on both master and replica?

Master:

fusionpbx=# select * from bdr.bdr_node_slots;
node_name | slot_name | slot_restart_lsn | slot_confirmed_lsn | walsender_active | walsender_pid | sent_location | write_location | flush_location | replay_location
----------------+-----------------------------------------+------------------+--------------------+------------------+---------------+---------------+----------------+----------------+-----------------
172.16.152.245 | bdr_16384_6563361912103695863_1_16384__ | 1CE/15E87458 | 1CE/15E9A0F0 | t | 143216 | 1CE/15E9A0F0 | 1CE/15E9A0F0 | 1CE/15E9A0F0 | 1CE/15E9A0F0
(1 row)

fusionpbx=#

Slave:

fusionpbx=# select * from bdr.bdr_node_slots;
node_name | slot_name | slot_restart_lsn | slot_confirmed_lsn | walsender_active | walsender_pid | sent_location | write_location | flush_location | replay_location
----------------+-----------------------------------------+------------------+--------------------+------------------+---------------+---------------+----------------+----------------+-----------------
172.16.152.244 | bdr_16384_6563362034024864690_1_16384__ | C1/2E57E90 | C1/2E57E90 | f | | | | |
(1 row)

fusionpbx=#

There aren't any BDR errors in the logs:

root@c3po:/etc/postgresql/9.4/main# ps auxww | grep bdr
postgres 26111 0.0 0.0 236736 8088 ? Ss 10:54 0:00 postgres: bgworker: bdr supervisor
postgres 26112 0.0 0.0 242088 19184 ? Ss 10:54 0:00 postgres: bgworker: bdr db: freeswitch
postgres 26113 0.0 0.0 242096 27508 ? Ss 10:54 0:00 postgres: bgworker: bdr db: fusionpbx
postgres 26114 7.6 0.4 244240 152456 ? Ds 10:54 1:18 postgres: bgworker: bdr (6563361912103695863,1,16385,)->bdr (6563362034024864690,1,
postgres 26115 0.6 0.1 244112 43328 ? Ss 10:54 0:06 postgres: bgworker: bdr (6563361912103695863,1,16384,)->bdr (6563362034024864690,1,
root 26277 0.0 0.0 12744 2156 pts/2 R+ 11:11 0:00 grep bdr
root@c3po:/etc/postgresql/9.4/main#

On Postgres startup:

root@c3po:/etc/postgresql/9.4/main# tail -f /var/log/postgresql/postgresql-9.4-main.log
2020-04-17 10:54:21 SAST [26107-1] [unknown]@[unknown] LOG: incomplete startup packet
2020-04-17 10:54:21 SAST [26099-5] LOG: starting background worker process "bdr supervisor"
2020-04-17 10:54:21 SAST [26099-6] LOG: registering background worker "bdr db: fusionpbx"
2020-04-17 10:54:21 SAST [26099-7] LOG: registering background worker "bdr db: freeswitch"
2020-04-17 10:54:21 SAST [26099-8] LOG: starting background worker process "bdr db: freeswitch"
2020-04-17 10:54:21 SAST [26099-9] LOG: starting background worker process "bdr db: fusionpbx"
2020-04-17 10:54:21 SAST [26099-10] LOG: registering background worker "bdr (6563361912103695863,1,16385,)->bdr (6563362034024864690,1,"
2020-04-17 10:54:21 SAST [26099-11] LOG: starting background worker process "bdr (6563361912103695863,1,16385,)->bdr (6563362034024864690,1,"
2020-04-17 10:54:22 SAST [26099-12] LOG: registering background worker "bdr (6563361912103695863,1,16384,)->bdr (6563362034024864690,1,"
2020-04-17 10:54:22 SAST [26099-13] LOG: starting background worker process "bdr (6563361912103695863,1,16384,)->bdr (6563362034024864690,1,"

Thanks for all the advice so far.
 
Hi All,

Just to give an update.

I removed my replication slots for now, which has resolved most of the issues. Obviously replication is a bit one-sided at the moment, which I still need to fix.

For the most part, my WAL issues are sorted, so just to acknowledge the help DigitalDaz provided: I removed the inactive (but supposed-to-be-active) replication slots, which stopped the WAL directory from bursting at the seams. For now, WAL usage is within limits and no longer growing out of control.
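For anyone in the same position, an inactive slot can be dropped with the standard function (the slot name below is one from the earlier output); note that dropping a slot means the peer can no longer catch up through it and will need to be re-seeded or re-joined:

```sql
-- Drop an inactive logical slot; the WAL it pinned becomes removable
-- at the next checkpoint. Only do this if the peer will be re-seeded.
SELECT pg_drop_replication_slot('bdr_16385_6563362034024864690_1_16385__');
```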

Thanks for all the help
 