[Tutorial] Creating a two node FusionPBX cluster the easy way.

DigitalDaz · Jun 8, 2017

I am guessing fail2ban is catching you for some reason. Check the fail2ban logs to see if you are getting banned.

Matthew Main · Jun 8, 2017

yup. fail2ban being triggered

2017-06-08 23:15:12,586 fail2ban.actions[1226]: WARNING [freeswitch-udp] Ban 82.**.***.***
2017-06-08 23:15:13,181 fail2ban.actions[1226]: WARNING [freeswitch-tcp] Ban 82.**.***.***
2017-06-08 23:25:13,224 fail2ban.actions[1226]: WARNING [freeswitch-udp] Unban 82.**.***.***
2017-06-08 23:25:13,825 fail2ban.actions[1226]: WARNING [freeswitch-tcp] Unban 82.**.***.***
2017-06-08 23:27:46,992 fail2ban.actions[1226]: WARNING [freeswitch-tcp] Ban 82.**.***.***
2017-06-08 23:27:47,398 fail2ban.actions[1226]: WARNING [freeswitch-udp] Ban 82.**.***.***
2017-06-08 23:37:47,633 fail2ban.actions[1226]: WARNING [freeswitch-tcp] Unban 82.**.***.***
2017-06-08 23:37:48,038 fail2ban.actions[1226]: WARNING [freeswitch-udp] Unban 82.**.***.***

I'm not sure why: none of the profiles on the phones are set up incorrectly, and the network here is small and simple with nothing complicated.
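To see how often and for how long each jail bans the address, the log lines above can be summarised with a short script. This is only a sketch: it assumes the log format shown in this post, and the names `parse_events`, `ban_durations` and `SAMPLE` are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines in the format shown above, e.g.
# 2017-06-08 23:15:12,586 fail2ban.actions[1226]: WARNING [freeswitch-udp] Ban 82.**.***.***
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"fail2ban\.actions\[\d+\]: WARNING "
    r"\[(?P<jail>[^\]]+)\] (?P<action>Ban|Unban) (?P<ip>\S+)$"
)


def parse_events(lines):
    """Return (timestamp, jail, action, ip) tuples for Ban/Unban lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, m.group("jail"), m.group("action"), m.group("ip")))
    return events


def ban_durations(events):
    """Pair each Ban with the next Unban for the same jail and IP; return seconds."""
    open_bans = {}
    durations = []
    for ts, jail, action, ip in events:
        key = (jail, ip)
        if action == "Ban":
            open_bans[key] = ts
        elif key in open_bans:
            start = open_bans.pop(key)
            durations.append((jail, ip, (ts - start).total_seconds()))
    return durations


# First four lines of the log from this post.
SAMPLE = [
    "2017-06-08 23:15:12,586 fail2ban.actions[1226]: WARNING [freeswitch-udp] Ban 82.**.***.***",
    "2017-06-08 23:15:13,181 fail2ban.actions[1226]: WARNING [freeswitch-tcp] Ban 82.**.***.***",
    "2017-06-08 23:25:13,224 fail2ban.actions[1226]: WARNING [freeswitch-udp] Unban 82.**.***.***",
    "2017-06-08 23:25:13,825 fail2ban.actions[1226]: WARNING [freeswitch-tcp] Unban 82.**.***.***",
]

for jail, ip, secs in ban_durations(parse_events(SAMPLE)):
    print(f"{jail} banned {ip} for about {secs:.0f} seconds")
```

In the log above each ban lasts roughly ten minutes, and the address is banned again a couple of minutes after the unban, which suggests that something keeps retrying rather than a one-off mistake.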

DigitalDaz · Jun 8, 2017

If it's a static IP, add it to the fail2ban ignore list.

Matthew Main · Jun 9, 2017

It's not a static IP.
Are there any steps I can take that will help me understand why this is happening? There must be a reason for the triggering.
I will whitelist my IP and see what happens.

Cheers DigitalDaz

DigitalDaz · Jun 9, 2017

It may well be an auth failure. Stop fail2ban, then use fs_cli and watch the console for auth failures.

SteinerAcedo · Jul 14, 2017

Hi, DigitalDaz. I've used your failover scripts on two fresh VMs. I'm using the NAPTR/SRV scheme and everything seems to be working fine with the phone registrations: when I shut down FS on one server (let's say server A), the registrations go to the other one (server B) and new calls can be made without any problem. The problem is that when I have an active call on server A and shut down the FS service, FS on server B is supposed to recover that call. I've already enabled the option track-calls [true], and FS on server B does try to pick up the call, but it tries to bind the RTP IP of server A:

IP Server A: 192.168.123.80
IP Server B: 192.168.123.75

2017-07-14 17:20:12.969703 [NOTICE] switch_channel.c:1104 New Channel sofia/internal/501@fpbx.local [f1048ecc-a65b-4f76-8307-6fb9c785968c]
2017-07-14 17:20:12.969703 [NOTICE] switch_channel.c:1102 Rename Channel sofia/internal/501@fpbx.local->sofia/internal/501@fpbx.local [f1048ecc-a65b-4f76-8307-6fb9c785968c]
2017-07-14 17:20:12.969703 [DEBUG] switch_core_media.c:3057 Set Codec sofia/internal/501@fpbx.local PCMU/8000 20 ms 160 samples 64000 bits 1 channels
2017-07-14 17:20:12.969703 [DEBUG] switch_core_codec.c:111 sofia/internal/501@fpbx.local Original read codec set to PCMU:0
2017-07-14 17:20:12.969703 [DEBUG] switch_core_media.c:6874 AUDIO RTP [sofia/internal/501@fpbx.local] 192.168.123.80 port 17520 -> 192.168.123.62 port 5008 codec: 0 ms: 20
2017-07-14 17:20:12.969703 [DEBUG] switch_rtp.c:4108 Starting timer [soft] 160 bytes per 20ms
2017-07-14 17:20:12.969703 [ERR] switch_core_media.c:7562 AUDIO RTP REPORTS ERROR: [Bind Error! 192.168.123.80:17520] <= It should be 192.168.123.75

I guess this has happened to you already. Any advice?
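The bind error in the log is what one would expect whenever a process asks for a local address that the host does not own. The following is a minimal Python sketch of that behaviour, not FreeSWITCH's own code; the function name `can_bind_udp` is made up for illustration.

```python
import socket


def can_bind_udp(ip, port=0):
    """Return True if a UDP socket can be bound to ip on this host.

    Port 0 asks the OS for any free port, so only the address is tested.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((ip, port))
        return True
    except OSError:
        # Typically "Cannot assign requested address" when the IP
        # is not configured on any local interface.
        return False
    finally:
        sock.close()


print(can_bind_udp("127.0.0.1"))
```

Run on server B, `can_bind_udp("192.168.123.80")` would be expected to return False for as long as that address belongs only to server A, which matches the error above: the recovered call still carries server A's RTP address.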

Steiner,

DigitalDaz · Jul 15, 2017

It hasn't happened to me because, even in lab testing, I found sofia recover to be very flaky, so I do not use it. I also found that on a busy server it caused massive server load when trying to recover a large volume of calls. For my needs, and given how rarely freeswitch goes down (near enough zero times in five years), it just was not worth it. If we do hit that scenario, the customer looks at the dead phone, curses it and redials.

This is just a personal opinion, but over time I'm coming round much more to a method of operation where I just run my servers in pairs. I find this easier for cross-datacenter replication, and I no longer even need to replicate the freeswitch db, just the fusionpbx db. As long as you have a fairly stable service, which I have, this gives me a very lightweight replication method.

I know this won't suit everyone, but since I will easily do up to maybe 1000 concurrent calls on a single box, it also suits me not to keep all my eggs in one basket. You also get the added advantage that if a single box goes down, it only affects the group of customers on that box.

Samael28 · Jul 17, 2017

Not sure if you tested this, but the "members" table in the "freeswitch" database should not be replicated across servers, because mod_callcenter will not work: both mod_callcenter instances will try to update the info in the database and break functionality.

Samael28 · Jul 18, 2017

Also made own version of sql templates for bdr (based on DigiDaz's) - https://gist.github.com/samael33/47598763a394e4314e5205e66d841e5c

DigitalDaz · Jul 19, 2017

Samael28 said:
Also made own version of sql templates for bdr (based on DigiDaz's) - https://gist.github.com/samael33/47598763a394e4314e5205e66d841e5c

I see mcrane threw in his messed up opinion on the original source of the file and seems to incorrectly think I got it from the fusionpbx training. Shame he didn't check his dates and realise that I did mine in the days of Freeswitch 1.4

If I remember rightly, it was an age ago, I think I made a single install and allowed freeswitch to auto create schemas, then dumped it out and edited it. There were also a couple of tables added later that volga alerted me to that only become present when using verto or something similar.

Samael28 · Jul 19, 2017

I made mine by adding some of your fields, but otherwise just exporting the original Freeswitch-created database. (And yes, I was on Mark's training too, but his version was too buggy for me.)

MrTrueman · Nov 14, 2017

Hi All,

I have tried to set this up but I am seeing the following error on both servers:

FATAL: could not connect to the server in non-replication mode: timeout expired
DETAIL: dsn was: connect_timeout=30 keepalives=1 keepalives_idle=20 keepalives_interval=20 keepalives_count=5 host=xx.xx.xx.xx port=5432 dbname=fusionpbx fallback_application_name='bdr (6488268677696983006,1,16385,):bdrnodeinfo'
CONTEXT: SQL statement "SELECT * FROM bdr_get_remote_nodeinfo(node_local_dsn)"
PL/pgSQL function internal_begin_join(text,text,text,text) line 42 at SQL statement
SQL statement "SELECT bdr.internal_begin_join(
'bdr_group_join',
local_node_name,
CASE WHEN node_local_dsn IS NULL THEN node_external_dsn ELSE node_local_dsn END,
join_using_dsn)"
PL/pgSQL function bdr_group_join(text,text,text,text,integer,text[]) line 21 at PERFORM
SQL statement "SELECT bdr.bdr_group_join(
local_node_name := local_node_name,
node_external_dsn := node_external_dsn,
join_using_dsn := null,
node_local_dsn := node_local_dsn,
apply_delay := apply_delay,
replication_sets := replication_sets)"
PL/pgSQL function bdr_group_create(text,text,text,integer,text[]) line 84 at PERFORM
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

I then just see "Waiting for BDR sync" on the slave.

Any ideas?

Thanks in advance!

MrT

DigitalDaz · Nov 14, 2017

Are you definitely on postgres 9.4?

MrTrueman · Nov 14, 2017

Yes I have already checked it is 9.4.

The boxes are in Google compute, so was wondering if it is something to do with the local and public IPs.

The SELECT statement that fails shows the public IP, which is what I used for the install script, but there is also a local NIC on the VMs. I see that some of the scripts reference the hostname ("domain_name=$(hostname"); however, this hostname points to the local IP. Would this cause an issue?

Note: I can see postgres is bound to 0.0.0.0

DigitalDaz · Nov 14, 2017

I guess it probably is an IP thing. The best thing to do in your case would probably be to step through the script manually. I haven't run it for a while, so I'll get a couple of boxes up and try it, just to make sure all is well with the script.

MrTrueman · Nov 14, 2017

The following command is failing in the finish.sh script:

sudo -u postgres psql fusionpbx -c "SELECT bdr.bdr_group_create(local_node_name := 'node1_fusionpbx', node_external_dsn := 'host=$master_ip port=5432 dbname=fusionpbx');"

If I manually run the command and change the host IP to localhost it works, but the command does not work with either my local IP or my public IP.

I presume it is possibly a setting in "pg_hba.conf", but I have tried every combination I can think of and it still fails:

"FATAL: could not connect to the server in non-replication mode: timeout expired"

I have found some info here:
https://github.com/2ndQuadrant/bdr/issues/144
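One way to narrow this down is to check, from each server, whether port 5432 is reachable at all on every address that might end up in the DSN. The sketch below is an illustration only: `parse_dsn` and `can_connect` are hypothetical helper names, and the check covers plain TCP reachability, not PostgreSQL authentication.

```python
import shlex
import socket


def parse_dsn(dsn):
    """Split a 'key=value key=value' connection string into a dict.

    shlex handles single-quoted values that contain spaces, such as
    fallback_application_name in the error message above.
    """
    params = {}
    for token in shlex.split(dsn):
        key, sep, value = token.partition("=")
        if sep:
            params[key] = value
    return params


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# The DSN values below come from the failing command; the host is masked
# in the original post, so a real address has to be substituted.
dsn = parse_dsn("host=xx.xx.xx.xx port=5432 dbname=fusionpbx")
print(dsn["host"], dsn["port"], dsn["dbname"])

# Example (replace with the real local and public addresses):
# for host in ("localhost", "<local ip>", "<public ip>"):
#     print(host, can_connect(host, dsn["port"]))
```

If the connection times out rather than being rejected straight away, that would more often point to a firewall or routing rule between the address and the server than to pg_hba.conf, since a pg_hba.conf mismatch normally produces an immediate error from the server.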

yaboc · Nov 23, 2017

will this work with two servers in two different time zones? also is it possible to create this cluster with existing single node?

DigitalDaz · Nov 23, 2017

yaboc said:
will this work with two servers in two different time zones? also is it possible to create this cluster with existing single node?

Yes, it will work with servers in different time zones, and no, it will not work with an existing server.

Roget Hoffman · Nov 24, 2017

Hello All

I have been having an issue when I try to perform a "fsctl recover" on the failover server. Here is my config:

I have two identical servers from Linode. I used the tutorial "the easy way" here. For the IP failover side of things I am using keepalived, and from an IP/pinging standpoint that part seems to be working correctly. The logs show starting MASTER and starting BACKUP, and all works as expected (1, maybe 2, packets lost during cutover during a ping).

So:
1. I have 2 extensions registered (101 and 102). I place a call from one extension to the other; the MASTER FS makes the call and all is well. I check the FS db on both nodes while this call is up and I see data in the recovery and calls tables within it. Seems good so far...
2. I take down the eth0 port on the master and the call stops passing media (as I expect).
3. I see the IP get moved to the BACKUP.
4. In the fs_cli on the BACKUP, I manually run (for now) a "fsctl recover".
5. I see the following in the BACKUP fs_cli:

227a5117-4039-41ee-98ef-350c7a15891c 2017-11-24 12:11:58.573620 [DEBUG] switch_core_state_machine.c:646 (sofia/internal/101@192.168.23.143:5060) State RESET going to sleep
2017-11-24 12:11:58.753625 [DEBUG] switch_pgsql.c:415 Query (insert into channels (uuid,direction,created,created_epoch, name,state,callstate,dialplan,context,hostname,initial_cid_name,initial_cid_num,initial_ip_addr,initial_dest,initial_dialplan,initial_context) values('64800f2b-eee1-4a90-9cb0-f55c2f7341d7','inbound','2017-11-24 12:11:58','1511543518','sofia/internal/102@cluster.testpbx.com','CS_INIT','ACTIVE','XML','cluster.testpbx.com','vg-cluster-1','My MAC','102','96.239.out.ip','101','XML','cluster.testpbx.com')) returned PGRES_FATAL_ERROR
2017-11-24 12:11:58.753625 [DEBUG] switch_pgsql.c:415 Query (insert into channels (uuid,direction,created,created_epoch, name,state,callstate,dialplan,context,hostname,initial_cid_name,initial_cid_num,initial_ip_addr,initial_dest,initial_dialplan,initial_context) values('227a5117-4039-41ee-98ef-350c7a15891c','outbound','2017-11-24 12:11:58','1511543518','sofia/internal/101@192.168.23.143:5060','CS_INIT','DOWN','XML','cluster.testpbx.com','vg-cluster-1','My MAC','102','96.239.out.ip','101','XML','cluster.testpbx.com')) returned PGRES_FATAL_ERROR
2017-11-24 12:11:58.753625 [ERR] switch_pgsql.c:656 Error executing query:
ERROR: current transaction is aborted, commands ignored until end of transaction block

It appears the BACKUP box is trying to insert the call (which I assume FS got by doing a lookup in the replicated db at the beginning of "fsctl recover", although the fs_cli output doesn't show anything about that). By the way, I see basically the same fatal error in the postgres logs. The phones do not hang up on the BACKUP, but there is no media. I also noticed this entry in the log right after the above error:

64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [DEBUG] sofia.c:7084 Channel sofia/internal/102@cluster.testpbx.com entering state [terminated][503]
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [NOTICE] sofia.c:8273 Hangup sofia/internal/102@cluster.testpbx.com [CS_SOFT_EXECUTE] [NORMAL_TEMPORARY_FAILURE]
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [DEBUG] switch_ivr_bridge.c:712 sofia/internal/102@cluster.testpbx.com ending bridge by request from read function
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [DEBUG] switch_ivr_bridge.c:787 BRIDGE THREAD DONE [sofia/internal/102@cluster.testpbx.com]
227a5117-4039-41ee-98ef-350c7a15891c 2017-11-24 12:12:13.573665 [DEBUG] switch_ivr_bridge.c:787 BRIDGE THREAD DONE [sofia/internal/101@192.168.23.143:5060]

I have seen a video on YouTube that shows this recovery in action, and I see the PGRES_FATAL_ERROR in his logs too, but the call is created anyway. Maybe there is a setting I am missing?
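A note on reading the error: in PostgreSQL, "current transaction is aborted, commands ignored until end of transaction block" is reported for every statement sent after an earlier statement in the same transaction has already failed, so it is a follow-on message and not the original failure. A small sketch for picking the first real error out of a log is shown here; the helper name and the sample lines are hypothetical, and the first sample line is only a placeholder for whatever the real cause is.

```python
FOLLOW_ON = "current transaction is aborted"


def first_real_error(lines):
    """Return the first ERROR line that is not the follow-on 'aborted' message.

    Returns None if every ERROR line is just the follow-on message.
    """
    for line in lines:
        if "ERROR:" in line and FOLLOW_ON not in line:
            return line
    return None


# Hypothetical excerpt: the first line stands in for the statement that
# actually failed. It is NOT taken from the logs in this thread.
SAMPLE_LOG = [
    "ERROR:  <message from the statement that actually failed>",
    "ERROR:  current transaction is aborted, commands ignored until end of transaction block",
    "ERROR:  current transaction is aborted, commands ignored until end of transaction block",
]

print(first_real_error(SAMPLE_LOG))
```

Applied to the postgres log on the BACKUP, the line this returns (if any) is the one worth investigating; the repeated "aborted" lines only show that later inserts ran in the same failed transaction.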

ANY help would be appreciated!

thanks

DigitalDaz · Nov 24, 2017

@Roget Hoffman My advice would be to give up trying with it. I always found it very flaky; combine that with the additional resources it takes and it's just not worth it. At any volume at all you are going to get a huge surge of activity at failover that causes even more problems. Most people will just look at the phone, curse it and redial!
