[Tutorial] Creating a two node FusionPBX cluster the easy way.

Roget Hoffman · Nov 24, 2017

DigitalDaz said:
@Roget Hoffman My advice would be to give up trying with it. I always found it very flaky; combine that with the additional resources it's taking and it's just not worth it. On any volume at all you are going to get a huge surge of activity at failover that causes even more problems. Most people will just look at the phone, curse it and redial!

Yes, I think I agree with you. In my testing, however, when the failover happens the phones never lose registration, but if I redial, the call doesn't complete (unavailable). It seems to me that once the phones re-register they begin to work correctly, which takes a few minutes. Is there something I can do so that when the BACKUP takes over, phone calls go through faster? It seems to me the BACKUP has trouble completing calls for about 5 minutes.

Thanks for your response.

DigitalDaz · Nov 24, 2017

In this sort of scenario, just keep the registration timers low, around 120 seconds. With Yealinks, for example, that usually means they re-register about once a minute.
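As a rough illustration of why a low timer shortens the recovery window, here is a minimal sketch. It assumes, as the Yealink example suggests, that a phone refreshes its registration at about half the expiry, so the longest a phone waits before re-registering against the backup is roughly one refresh interval. The function name and the 0.5 ratio are assumptions for illustration, not values taken from any phone's documentation.

```python
def worst_case_reregister_wait(expiry_seconds, refresh_ratio=0.5):
    """Upper bound, in seconds, until a phone next refreshes its registration.

    refresh_ratio is an assumption: the fraction of the expiry at which the
    phone re-registers (0.5 matches "120 secs means once a minute").
    """
    if expiry_seconds <= 0:
        raise ValueError("expiry_seconds must be positive")
    if not 0 < refresh_ratio <= 1:
        raise ValueError("refresh_ratio must be in (0, 1]")
    return expiry_seconds * refresh_ratio


# Compare a low timer with two longer, purely illustrative values.
for expiry in (120, 600, 3600):
    wait = worst_case_reregister_wait(expiry)
    print(f"expiry {expiry}s: up to {wait:.0f}s before the phone re-registers")
```

With a 120 second expiry the bound is about a minute, which lines up with the behaviour described above; longer expiries stretch the window during which redialled calls may fail.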

I used to get all hot and bothered about all this myself, but since it's only happened once in five years, and by the time I had been alerted and got logged on everything was sailing smoothly by itself, I don't get too stressed about it now.

In production cross-datacenter failover I don't even bother replicating the freeswitch db, just the fusionpbx one. This makes it quite lightweight, and it works great in testing over wide geographic areas.

Roget Hoffman · Nov 24, 2017

So do you do this via DNS failover, or with a VRRP implementation like keepalived?

Thanks again for your input.

Andrew Pickett · Nov 24, 2017

Hi Digi - I ran your script and I too get the same BDR issue on the slave. I have 2 boxes on DigitalOcean and tried using both public and private addresses, but still get this on the slave. Any ideas?

Thanks in advance.

Andrew Pickett · Nov 24, 2017

Scrap the above; reading the instructions properly helps. I used Debian 8.9 on DO and it worked first time.

DigitalDaz · Nov 25, 2017

Thanks, I'm gonna edit the original post. I think a lot of people are using Debian 9 now, which ships a different version of PostgreSQL, and we need 9.4 for BDR.
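For anyone wanting to confirm the version before running the install, here is a minimal sketch that checks a `psql --version` style string against the 9.4 requirement mentioned above. The helper names and the sample strings are illustrative assumptions; they are not part of the install scripts.

```python
import re

REQUIRED = (9, 4)  # per the thread: BDR needs PostgreSQL 9.4


def parse_pg_version(version_output):
    """Extract (major, minor) from text such as 'psql (PostgreSQL) 9.4.15'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def is_bdr_compatible(version_output):
    """True when the reported version is in the 9.4 series."""
    return parse_pg_version(version_output) == REQUIRED


# Sample strings for illustration only, not captured from a real node.
for sample in ("psql (PostgreSQL) 9.4.15", "psql (PostgreSQL) 9.6.6"):
    print(sample, "->", "OK" if is_bdr_compatible(sample) else "not 9.4")
```

On a real node the string would come from running `psql --version` and passing its output to `is_bdr_compatible`.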

yaboc · Nov 26, 2017

Assuming there's some sort of firewall on both nodes, should one just allow any <-> any between the two nodes? Are there specific ports that need to be accessible between the two in a more restrictive environment? Also, will an FQDN work in place of an IP during the install?

Will there be a script for Debian 9, or an easy way to upgrade the underlying OS when it's approaching EOL?

Miguel SanMiguel · Dec 5, 2017

DigitalDaz said:
Yes, it will work with servers in existing time zones and no it will not work with an existing server.

I want to set up a failover system for a live server. What would you recommend?

My skills are not that advanced. I currently see these options:

1. Modify your scripts to adapt them for my case: a huge undertaking.
2. Set up a client with your script and modify my server by copy-pasting commands from your script: too error-prone.
3. Follow your instructions to create a master-slave pair and afterwards import the configurations (SQL & XML) from my live system into them at a suitable time (I know that there is no traffic around midnight).

Do you see the third option as feasible? Any recommendations?

Thanks for sharing all this, it is very valuable!

DigitalDaz · Dec 5, 2017

The best option I would see for you is to contact mcrane in freenode irc #fusionpbx and pay him to do it.

Miguel SanMiguel · Dec 6, 2017

That one I already know. I was asking about DIY; I would like to explore that route.

What drawbacks do you see in the third proposed strategy?

yaboc · Dec 7, 2017

MrTrueman said:
Hi All,

I have tried to set this up but I am seeing the following error on both servers:

FATAL: could not connect to the server in non-replication mode: timeout expired
DETAIL: dsn was: connect_timeout=30 keepalives=1 keepalives_idle=20 keepalives_interval=20 keepalives_count=5 host=xx.xx.xx.xx port=5432 dbname=fusionpbx fallback_application_name='bdr (6488268677696983006,1,16385,):bdrnodeinfo'
CONTEXT: SQL statement "SELECT * FROM bdr_get_remote_nodeinfo(node_local_dsn)"
PL/pgSQL function internal_begin_join(text,text,text,text) line 42 at SQL statement
SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
SQL statement "SELECT bdr.bdr_group_join(
local_node_name := local_node_name,
node_external_dsn := node_external_dsn,
join_using_dsn := null,
node_local_dsn := node_local_dsn,
apply_delay := apply_delay,
replication_sets := replication_sets)"
PL/pgSQL function bdr_group_create(text,text,text,integer,text[]) line 84 at PERFORM
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

I then just see "Waiting for BDR sync" on the slave.

Any ideas?

Thanks in advance!

MrT
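Since the first error in the quoted post is a connection timeout, one thing worth ruling out is plain TCP reachability of the PostgreSQL port shown in the dsn (port=5432). The following is a minimal sketch of such a check; the function name is an assumption for illustration, and a successful connection only shows the port is reachable, not that BDR itself is healthy.

```python
import socket


def can_connect(host, port=5432, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    5432 is the port that appears in the dsn quoted above. This does not
    authenticate or speak the PostgreSQL protocol; it only tests reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, run from each node against the other node's address
# (the address is elided in the thread, so substitute your own):
# print(can_connect("xx.xx.xx.xx"))
```

If this returns False in either direction, a firewall rule or the PostgreSQL listen configuration is a more likely suspect than BDR itself.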

I followed the tutorial with two publicly available nodes. The firewall is any-to-any between the two.
On Debian 8 with PostgreSQL 9.4 the master script completed successfully, but the slave gives me the "Waiting for BDR sync" error like above.

ERROR: No peer nodes or peer node count unknown, cannot acquire global lock
HINT: BDR is probably still starting up, wait a while
ERROR: No peer nodes or peer node count unknown, cannot acquire global lock
HINT: BDR is probably still starting up, wait a while
ERROR: This node is already a member of a BDR group
HINT: Connect to the node you wish to add and run bdr_group_join from it instead
CONTEXT: SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
ERROR: This node is already a member of a BDR group
HINT: Connect to the node you wish to add and run bdr_group_join from it instead
CONTEXT: SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
Waiting for BDR sync.
Waiting for BDR sync.
Waiting for BDR sync.

The PostgreSQL log shows:

2017-12-07 18:11:31 GMT [1248-367] LOG: worker process: bdr db: freeswitch (PID 2286) exited with exit code 1
2017-12-07 18:11:36 GMT [1248-368] LOG: starting background worker process "bdr db: fusionpbx"
2017-12-07 18:11:36 GMT [1248-369] LOG: starting background worker process "bdr db: freeswitch"
2017-12-07 18:11:37 GMT [1248-370] LOG: registering background worker "bdr: catchup apply to 0/308EEC0"
2017-12-07 18:11:37 GMT [1248-371] LOG: starting background worker process "bdr: catchup apply to 0/308EEC0"
2017-12-07 18:11:38 GMT [1248-372] LOG: registering background worker "bdr: catchup apply to 0/3090D48"
2017-12-07 18:11:38 GMT [1248-373] LOG: starting background worker process "bdr: catchup apply to 0/3090D48"
2017-12-07 18:11:38 GMT [2314-1] [unknown]@fusionpbx ERROR: data stream ended
2017-12-07 18:11:38 GMT [1248-374] LOG: worker process: bdr: catchup apply to 0/308EEC0 (PID 2314) exited with exit code 1
2017-12-07 18:11:38 GMT [1248-375] LOG: unregistering background worker "bdr: catchup apply to 0/308EEC0"
2017-12-07 18:11:38 GMT [2315-1] [unknown]@freeswitch ERROR: data stream ended
2017-12-07 18:11:38 GMT [1248-376] LOG: worker process: bdr: catchup apply to 0/3090D48 (PID 2315) exited with exit code 1
2017-12-07 18:11:38 GMT [1248-377] LOG: unregistering background worker "bdr: catchup apply to 0/3090D48"
2017-12-07 18:11:38 GMT [2311-1] [unknown]@fusionpbx ERROR: catchup worker exited before catching up to target LSN 0/308EEC0
2017-12-07 18:11:38 GMT [1248-378] LOG: worker process: bdr db: fusionpbx (PID 2311) exited with exit code 1
2017-12-07 18:11:39 GMT [2312-1] [unknown]@freeswitch ERROR: catchup worker exited before catching up to target LSN 0/3090D48
2017-12-07 18:11:39 GMT [1248-379] LOG: worker process: bdr db: freeswitch (PID 2312) exited with exit code 1

DigitalDaz · Dec 8, 2017

This is OK now; it was a locale issue. So, note to everybody: make sure the locale is the same on both nodes.

Alessandro · Dec 11, 2017

Hello @DigitalDaz thank you for your guide.
I installed two nodes and all seems fine.
Is it possible to add another slave node?

DigitalDaz · Dec 11, 2017

Why would you want another slave node?

You do know this is failover, not load balancing?

DudeInMyrtleBeach · Dec 12, 2017

Hi! Newbie here. I wanted to say hello and give a really BIG Thank You to DigitalDaz for creating this. I've installed many * systems, and this is my first dive into FusionPBX.

I just deployed my first cluster! I do have one question -- Does this script set the hostname on the slave to the same hostname on the master or did I fat-finger something?

DigitalDaz · Dec 12, 2017

You must have fat-fingered something. Just change it in /etc/hosts and /etc/hostname, then reboot; you should be OK.

Alessandro · Dec 13, 2017

Hello @DigitalDaz, I installed two nodes with your procedure and created a domain on node1.

When I try to access the node2 web interface I get an "invalid username and password" error.

How does the failover work? I thought that node1 and node2 had the same admin password. And how do I test failover? Can we set 2 DNS records on our DNS server with the same entry? And in FusionPBX I suppose we have to use domains?

Thank you

DigitalDaz · Dec 13, 2017

If you had read the instructions, they would have shown you that to log into server two you use admin@firstserverip as the username.

Failover will not work at all on its own; it is for you to decide on a method and implement it.

DudeInMyrtleBeach · Dec 14, 2017

DigitalDaz said:
You must have fat-fingered something. Just change it in /etc/hosts and /etc/hostname, then reboot; you should be OK.

I did indeed. Fixed -- Thank you!

A note to anyone deploying this -- It's in your best interest to read the entire thread before attempting to do this. I ran into the different locale situation, and on node2 saw an endless 'Waiting for BDR sync.'. I deployed one node at Digital Ocean, and the second at Vultr. To avoid this, I ended up recreating both nodes. Then -- before updating/upgrading or starting the process in this thread, I changed /etc/environment to read:

LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

and rebooted both nodes.

The install went without a hitch after that.

EDIT: This is for US English of course. YMMV.
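To check that both nodes really do match before starting, the output of the `locale` command on each node can be compared. Below is a minimal sketch of that comparison; the function names are assumptions for illustration, and the two strings are expected to be pasted in from each node by hand.

```python
def parse_locale_output(text):
    """Parse `locale` command output (KEY=value lines) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # `locale` quotes some values, e.g. LC_CTYPE="en_US.UTF-8"
        settings[key] = value.strip('"')
    return settings


def locale_differences(node1_output, node2_output):
    """Return {key: (node1_value, node2_value)} for every setting that differs."""
    a = parse_locale_output(node1_output)
    b = parse_locale_output(node2_output)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Illustrative inputs only; paste the real `locale` output from each node.
node1 = 'LANG=en_US.UTF-8\nLC_ALL=en_US.UTF-8\n'
node2 = 'LANG=en_GB.UTF-8\nLC_ALL=\n'
print(locale_differences(node1, node2))
```

An empty result means the two nodes report the same locale settings; anything else is worth fixing before running the install.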

DigitalDaz · Dec 14, 2017

I'll add this to the initial post.
