[Tutorial] Creating a two node FusionPBX cluster the easy way.

Status
Not open for further replies.

Roget Hoffman

New Member
Nov 24, 2017
17
0
1
72
@Roget Hoffman My advice would be to give up trying with it. I always found it very flaky; combine that with the additional resources it's taking and it's just not worth it. On any volume at all you are going to get a huge surge of activity at failover that causes even more problems. Most people will just look at the phone, curse it and redial!

Yes, I think I agree with you. In my testing, however, when the failover happens the phones never lose registration, but if I redial, the call doesn't complete (unavailable). It seems to me that once the phones re-register they begin to work correctly, which takes a few minutes. Is there something I can do so that when the BACKUP takes over, phone calls go through faster? It seems to me the BACKUP has trouble completing calls for about 5 minutes.

Thanks for your response.
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
In this sort of scenario, just keep the registration timers low, around 120 seconds. With Yealinks, for example, that usually means they re-register about once a minute.
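As a rough illustration of why a low timer helps after a failover, here is a minimal sketch of the arithmetic. It assumes a phone refreshes its registration at a fixed fraction of the expiry; the 0.5 ratio is only an assumption chosen to match the "120 secs, about once a minute" behaviour mentioned above, and the function name is hypothetical. Real phones vary by vendor and firmware.

```python
def reregister_interval(expiry_secs: float, refresh_ratio: float = 0.5) -> float:
    """Approximate seconds between REGISTER refreshes for a given expiry.

    refresh_ratio is an assumption (0.5 matches the "120 secs, about once a
    minute" behaviour described in the thread); real phones differ.
    """
    if expiry_secs <= 0:
        raise ValueError("expiry_secs must be positive")
    if not 0 < refresh_ratio <= 1:
        raise ValueError("refresh_ratio must be in (0, 1]")
    return expiry_secs * refresh_ratio


if __name__ == "__main__":
    # Compare a short expiry with a long one (3600 is just an example value).
    for expiry in (120, 3600):
        wait = reregister_interval(expiry)
        print(f"expiry {expiry}s: re-register roughly every {wait:.0f}s")
```

Under that assumption, the longest a phone would go before registering against the backup is about one refresh interval, so a shorter expiry shrinks the window in which calls fail.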

I used to get all hot and bothered about all this myself, but since it's only happened once in five years, and by the time I had been alerted and got logged on everything was sailing smoothly by itself, I don't get too stressed about it now.

In production cross-datacenter failover I don't even bother replicating the freeswitch db, just the fusionpbx one. This makes it quite lightweight, and it works great in testing over wide geographic areas.
 
May 16, 2017
103
7
18
38
Hi Digi - I ran your script and I too get the same BDR issue on the slave. I've 2 boxes on DigitalOcean and tried using both public and private IPs, but I get this on the slave. Any ideas?

Thanks in advance.

[Attached screenshot: upload_2017-11-24_23-3-43.png]
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
Thanks, I'm gonna edit the original post. I think a lot of people are using Debian 9 now, which ships a different version of PostgreSQL, and we need 9.4 for the BDR.
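A quick way to catch this before running the scripts is to look at the PostgreSQL version first. The sketch below only parses a version string such as the one printed by `psql --version`; the helper names are hypothetical, and the sample strings are illustrative rather than taken from anyone's server in this thread.

```python
import re
from typing import Tuple

REQUIRED = (9, 4)  # the major version this thread says BDR needs


def parse_pg_version(version_text: str) -> Tuple[int, int]:
    """Extract (major, minor) from text like 'psql (PostgreSQL) 9.4.15'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def is_bdr_ready(version_text: str) -> bool:
    """True only when the reported version is in the 9.4 series."""
    return parse_pg_version(version_text) == REQUIRED


if __name__ == "__main__":
    # Illustrative strings; paste the real output of `psql --version` instead.
    for sample in ("psql (PostgreSQL) 9.4.15", "psql (PostgreSQL) 9.6.6"):
        print(sample, "->", "OK for BDR" if is_bdr_ready(sample) else "wrong version")
```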
 

yaboc

New Member
Nov 23, 2017
10
2
3
33
Assuming there's some sort of firewall on both nodes, should one just allow any <-> any between the two nodes? Are there specific ports that need to be accessible between the two in a more restrictive environment? Also, will an FQDN work in place of an IP during the install?

Will there be a script for Debian 9, or an easy way to upgrade the underlying OS when it's approaching EOL?
 

Miguel SanMiguel

New Member
Dec 5, 2017
2
0
1
Leipzig, Germany
Yes, it will work with servers in existing time zones and no it will not work with an existing server.
I want to set up a failover system for a live server. What would you recommend?

My skills are not that great. I currently see these options:
  1. Modify your scripts to adapt them for my case: a huge undertaking.
  2. Set up a client with your script and modify my server by copy-pasting commands from your script: too error-prone.
  3. Follow your instructions to create a master-slave pair and afterwards import the configurations (SQL & XML) from my live system into them at a suitable time (I know that there is no traffic at midnight).
Do you see the third option as feasible? Any recommendations?

Thanks for sharing all this, it is very valuable!
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
The best option I would see for you is to contact mcrane in freenode irc #fusionpbx and pay him to do it.
 

yaboc

New Member
Nov 23, 2017
10
2
3
33
Hi All,

I have tried to set this up but I am seeing the following error on both servers:

FATAL: could not connect to the server in non-replication mode: timeout expired
DETAIL: dsn was: connect_timeout=30 keepalives=1 keepalives_idle=20 keepalives_interval=20 keepalives_count=5 host=xx.xx.xx.xx port=5432 dbname=fusionpbx fallback_application_name='bdr (6488268677696983006,1,16385,):bdrnodeinfo'
CONTEXT: SQL statement "SELECT * FROM bdr_get_remote_nodeinfo(node_local_dsn)"
PL/pgSQL function internal_begin_join(text,text,text,text) line 42 at SQL statement
SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
SQL statement "SELECT bdr.bdr_group_join(
local_node_name := local_node_name,
node_external_dsn := node_external_dsn,
join_using_dsn := null,
node_local_dsn := node_local_dsn,
apply_delay := apply_delay,
replication_sets := replication_sets)"
PL/pgSQL function bdr_group_create(text,text,text,integer,text[]) line 84 at PERFORM
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

I then just see "Waiting for BDR sync" on the slave.

Any ideas?

Thanks in advance!
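Since the error above is a connection timeout against port 5432 (the port shown in the DSN), one thing worth ruling out is whether that port is reachable from the other node at all. The following is a minimal sketch of such a check; `port_open` is a hypothetical helper, and the demo uses a throwaway local listener so it runs anywhere. In practice you would pass the peer node's address and 5432.

```python
import socket


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Demo against a temporary listener on localhost; replace with the
    # other node's address and port 5432 to test the real path.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    demo_port = listener.getsockname()[1]
    print("listener up:", port_open("127.0.0.1", demo_port))
    listener.close()
    print("listener closed:", port_open("127.0.0.1", demo_port))
```

A successful TCP connect only shows the firewall path is open; it says nothing about PostgreSQL authentication or the BDR configuration itself.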

MrT

I followed the tutorial with two publicly available nodes. The firewall is any to any between the two.
Debian 8, PostgreSQL 9.4: the master script completed successfully, but the slave gives me the "Waiting for BDR sync" error like above.

ERROR: No peer nodes or peer node count unknown, cannot acquire global lock
HINT: BDR is probably still starting up, wait a while
ERROR: No peer nodes or peer node count unknown, cannot acquire global lock
HINT: BDR is probably still starting up, wait a while
ERROR: This node is already a member of a BDR group
HINT: Connect to the node you wish to add and run bdr_group_join from it instead
CONTEXT: SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
ERROR: This node is already a member of a BDR group
HINT: Connect to the node you wish to add and run bdr_group_join from it instead
CONTEXT: SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
Waiting for BDR sync.
Waiting for BDR sync.
Waiting for BDR sync.

The PostgreSQL log shows:

2017-12-07 18:11:31 GMT [1248-367] LOG: worker process: bdr db: freeswitch (PID 2286) exited with exit code 1
2017-12-07 18:11:36 GMT [1248-368] LOG: starting background worker process "bdr db: fusionpbx"
2017-12-07 18:11:36 GMT [1248-369] LOG: starting background worker process "bdr db: freeswitch"
2017-12-07 18:11:37 GMT [1248-370] LOG: registering background worker "bdr: catchup apply to 0/308EEC0"
2017-12-07 18:11:37 GMT [1248-371] LOG: starting background worker process "bdr: catchup apply to 0/308EEC0"
2017-12-07 18:11:38 GMT [1248-372] LOG: registering background worker "bdr: catchup apply to 0/3090D48"
2017-12-07 18:11:38 GMT [1248-373] LOG: starting background worker process "bdr: catchup apply to 0/3090D48"
2017-12-07 18:11:38 GMT [2314-1] [unknown]@fusionpbx ERROR: data stream ended
2017-12-07 18:11:38 GMT [1248-374] LOG: worker process: bdr: catchup apply to 0/308EEC0 (PID 2314) exited with exit code 1
2017-12-07 18:11:38 GMT [1248-375] LOG: unregistering background worker "bdr: catchup apply to 0/308EEC0"
2017-12-07 18:11:38 GMT [2315-1] [unknown]@freeswitch ERROR: data stream ended
2017-12-07 18:11:38 GMT [1248-376] LOG: worker process: bdr: catchup apply to 0/3090D48 (PID 2315) exited with exit code 1
2017-12-07 18:11:38 GMT [1248-377] LOG: unregistering background worker "bdr: catchup apply to 0/3090D48"
2017-12-07 18:11:38 GMT [2311-1] [unknown]@fusionpbx ERROR: catchup worker exited before catching up to target LSN 0/308EEC0
2017-12-07 18:11:38 GMT [1248-378] LOG: worker process: bdr db: fusionpbx (PID 2311) exited with exit code 1
2017-12-07 18:11:39 GMT [2312-1] [unknown]@freeswitch ERROR: catchup worker exited before catching up to target LSN 0/3090D48
2017-12-07 18:11:39 GMT [1248-379] LOG: worker process: bdr db: freeswitch (PID 2312) exited with exit code 1
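A log like this is mostly LOG noise around a few ERROR lines, so it can help to pull out just the errors per database. The sketch below assumes the line layout seen in the excerpt above (timestamp, `[pid-seq]`, an optional `user@database` field, then the level); the regex and function name are my own and would need adjusting for a different `log_line_prefix`.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt in this post (an assumption about
# the log_line_prefix in use; other setups will differ).
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+ \S+) \[(?P<pid>\d+)-(?P<seq>\d+)\] "
    r"(?:(?P<who>\S+) )?(?P<level>[A-Z]+): (?P<msg>.*)$"
)


def errors_by_database(log_text: str) -> Counter:
    """Count ERROR lines per database name (the part after '@')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None or match.group("level") != "ERROR":
            continue
        who = match.group("who") or ""
        database = who.split("@", 1)[1] if "@" in who else "(unknown)"
        counts[database] += 1
    return counts


SAMPLE = """\
2017-12-07 18:11:38 GMT [2314-1] [unknown]@fusionpbx ERROR: data stream ended
2017-12-07 18:11:38 GMT [1248-374] LOG: worker process: bdr: catchup apply to 0/308EEC0 (PID 2314) exited with exit code 1
2017-12-07 18:11:38 GMT [2315-1] [unknown]@freeswitch ERROR: data stream ended
"""

if __name__ == "__main__":
    for database, count in errors_by_database(SAMPLE).items():
        print(f"{database}: {count} error(s)")
```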
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
This is OK now; it was a locale issue. So, a note to everybody: make sure the locale is the same on both nodes.
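One way to check this before installing is to run `locale` on each node and compare the two outputs. The sketch below does that comparison on pasted text; the function names are hypothetical and the en_GB value is just an invented example of a mismatch, not what was found on the nodes in this thread.

```python
def parse_locale(output: str) -> dict:
    """Turn KEY=value lines (as printed by `locale`) into a dict."""
    settings = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # `locale` quotes some values; strip the quotes so they compare equal.
        settings[key.strip()] = value.strip().strip('"')
    return settings


def locale_differences(node1_output: str, node2_output: str) -> dict:
    """Return {key: (node1_value, node2_value)} for every setting that differs."""
    first, second = parse_locale(node1_output), parse_locale(node2_output)
    return {
        key: (first.get(key), second.get(key))
        for key in sorted(set(first) | set(second))
        if first.get(key) != second.get(key)
    }


if __name__ == "__main__":
    # Invented sample outputs; paste the real `locale` output from each node.
    node1 = 'LANG=en_US.UTF-8\nLC_ALL="en_US.UTF-8"\n'
    node2 = 'LANG=en_GB.UTF-8\nLC_ALL="en_US.UTF-8"\n'
    print(locale_differences(node1, node2) or "locales match")
```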
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
Why would you want another slave node?

You do know this is failover, not load balancing?
 
Sep 2, 2017
46
2
8
Hi! Newbie here. I wanted to say hello and give a really BIG Thank You to DigitalDaz for creating this. I've installed many * systems, and this is my first dive into FusionPBX.

I just deployed my first cluster! I do have one question -- Does this script set the hostname on the slave to the same hostname on the master or did I fat-finger something?
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
You must have fat-fingered something. Just change it in /etc/hosts and /etc/hostname, then reboot, and you should be OK.
 

Alessandro

New Member
Dec 1, 2017
8
1
3
46
Hello @DigitalDaz, I installed two nodes with your procedure and created a domain on node1.

When I try to access the node2 web interface I get an "invalid username and password" error.

How does the failover work? I thought that node1 and node2 had the same admin password. And how do I test failover? Can we set 2 DNS records on our DNS server with the same entry? And in FusionPBX I suppose we have to use domains.

Thank you
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
If you had read the instructions, they would have shown you that to log into server two you use admin@firstserverip as the username.

The failover will not work by itself at all; it is for you to decide how to do it and implement it.
 
Sep 2, 2017
46
2
8
You must have fat-fingered something. Just change it in /etc/hosts and /etc/hostname, then reboot, and you should be OK.

I did indeed. Fixed -- Thank you!

A note to anyone deploying this -- It's in your best interest to read the entire thread before attempting to do this. I ran into the different locale situation, and on node2 saw an endless 'Waiting for BDR sync.'. I deployed one node at Digital Ocean, and the second at Vultr. To avoid this, I ended up recreating both nodes. Then -- before updating/upgrading or starting the process in this thread, I changed /etc/environment to read:

LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

and rebooted both nodes.

The install went without a hitch after that.

EDIT: This is for US English of course. YMMV.
 