High Availability Installation
Description
The module ProvHA provides functionality for failover setups.
This includes the following services:
- DHCP
- TFTP
- time
- DNS
Typically the load is balanced: master and slave share the provisioning load. If one instance is down, all work is done by the other machine, keeping your network alive. Your customers don't notice the incident.
The configuration files needed for provisioning are updated as follows:
- master: same as in a standalone installation (rebuilt every time a relevant database value changes)
- slave:
- rebuild every n seconds (configurable via the master GUI (Global config ⇒ ProvHA))
- rebuild on changes of several database tables (cmts, ippool)
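If you want to verify the periodic rebuild on the slave, you can for example watch the timestamps of the generated DHCP configs (a simple sanity check using the paths from this guide; the actual interval depends on the configured value of n):
watch -n 10 'ls -l --time-style=full-iso /etc/dhcp-nmsprime/cmts_gws/'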
DHCP
We use the failover functionality of ISC DHCP, which supports a setup with one master (primary) and one slave (secondary) instance, called peers. By default each server handles 50% of each IP pool; the load balance is configurable. The servers inform each other about leases: if one instance goes down, the failover peer takes over the complete pools. When both servers are active again, the pools are balanced automatically.
Configuration is done in /etc/dhcp-nmsprime/failover.conf; the pools in /etc/dhcp-nmsprime/cmts_gws/*.conf are configured with a failover statement.
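As a minimal sketch, the primary's failover declaration and a pool referencing it could look like the following (the peer name dhcpd-failover matches the log messages shown further below; IP addresses, ranges, timers and the split value are only illustrative, the files generated by NMSPrime may differ):
failover peer "dhcpd-failover" {
    primary;                      # the slave declares "secondary;" and omits mclt/split
    address 172.20.0.1;           # own IP
    port 647;
    peer address 172.20.0.2;      # failover peer's IP
    peer port 647;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 3600;
    split 128;                    # 128/256 = 50% of each pool
    load balance max seconds 3;
}
subnet 10.0.0.0 netmask 255.255.255.0 {
    pool {
        failover peer "dhcpd-failover";
        range 10.0.0.10 10.0.0.200;
    }
}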
TFTP
Theory
For TFTP we have to distinguish between DOCSIS versions:
- For DOCSIS versions below 3 one can only provide one TFTP server, realized via the next-server statement in global.conf. In our setup each of the two DHCP servers sets this to its own IP address. (check: will this cause problems if the configured server goes offline, or will the CM get new values in the DHCPACK from the failover peer?)
- For higher versions the value in next-server can be overwritten using option vivso (125.2: CL_V4OPTION_TFTPSERVERS); multiple TFTP servers can be configured. In our setup each DHCP server provides its own IP address first and the peer's IP address second.
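For orientation, vendor-identifying sub-options like this can be declared in ISC dhcpd roughly as follows (the option space name is arbitrary, enterprise number 4491 is CableLabs, and the IP addresses are placeholders; the config NMSPrime actually generates may differ):
option space cablelabs code width 1 length width 1;
option cablelabs.tftp-servers code 2 = array of ip-address;
option vsio.cablelabs code 4491 = encapsulate cablelabs;
# own IP first, failover peer second
option cablelabs.tftp-servers 172.20.0.1, 172.20.0.2;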
Practice
It looks like at least the Cisco 7225 is not able to make use of this information. Attached are two tcpdumps of the same DHCP ACK, one sent from the slave to the CMTS, the other sent from the CMTS to the modem: dhcp_ack_slave_to_cmts.txt
A screenshot of the diff shows:
- the CMTS replaces the IP in "next server" and "tftp server" with its own
- no matter in which order the IPs in option 125.2 are given, every single TFTP request of a modem is directed to the "Next server IP address"
- that means:
- if on an HA NMS the DHCP server is working, all ACKs of this server contain its own IP address
- if the TFTP server on this machine is dead or a config file is missing, the CM will be stuck in init(o) and reboot endlessly
Time
option time-servers
accepts a comma-separated list of IP addresses. Each DHCP server provides its own IP address first and the peer's IP address second.
DNS
option domain-name-servers
accepts a comma-separated list of IP addresses. Each DHCP server provides its own IP address first and the peer's IP address second.
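For illustration, assuming a master IP of 172.20.0.1 and a slave IP of 172.20.0.2, the statements in the master's generated config would then look like this (addresses are placeholders):
option time-servers 172.20.0.1, 172.20.0.2;
option domain-name-servers 172.20.0.1, 172.20.0.2;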
Open questions: What about the zone sections in global.conf? Should the peer IP be given as secondary? Does the DNS server configuration need to be changed, too?
Preparation
You will need two NMSPrime installations to provide failover functionality.
Master
One is defined as MASTER and is comparable to an installation without failover (a “classical” NMSPrime). This instance is the only one with write access to the databases – all tasks changing the database are done here:
- GUI actions (adding/changing/deleting elements such as contracts, modems, CMTS, etc.)
- API
- cacti polling
- communication with external API like “envia TEL” (module ProvVoipEnvia)
Slave
The other installation (currently only one slave is supported due to restrictions of ISC DHCPd) has read-only access to the database:
- the database cannot be changed from the slave
- provisioning of CM/MTA/CPE will be done completely by the slave if the master fails
Database
The database is set up as a galera cluster; this way all data is duplicated and stored on both the master and the slave machine (and possibly further nodes).
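To verify that the cluster is healthy you can, for example, check the wsrep status on any node (a generic MariaDB/galera check, not specific to NMSPrime):
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'; SHOW STATUS LIKE 'wsrep_local_state_comment';"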
CMTSs
CMTS configs need to be extended. For each of the two instances we need:
- a cable helper-address statement (in the interface Bundle x section)
- an ntp server statement
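A sketch of the corresponding Cisco IOS configuration, assuming master IP 172.20.0.1 and slave IP 172.20.0.2 (interface name and addresses are placeholders):
interface Bundle1
 cable helper-address 172.20.0.1
 cable helper-address 172.20.0.2
!
ntp server 172.20.0.1
ntp server 172.20.0.2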
Installation
Most of the work is done by our installation scripts deployed with the ProvHA module. Details are given for your understanding of the whole process and for later configuration changes.
Master
First set up your master following the Installation guide. The database should be redundant – we assume that you have set up a galera cluster.
Change /etc/nmsprime/env/*.env so that the master uses the galera cluster. If you are not using a proxy, add a comma-separated list of your galera nodes.
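The exact variable names depend on your NMSPrime version; as a hypothetical example, the database host entry in such an .env file could list all cluster nodes like this:
DB_HOST=172.20.0.100,172.20.0.101,172.20.0.102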
NMSPrime caches all configuration to improve performance. After changing the .env files run:
cd /var/www/nmsprime
php artisan config:cache
When facing messages like Host '172.20.0.100' is not allowed to connect to this MariaDB server … you have to add user privileges like:
GRANT ALL PRIVILEGES ON `nmsprime`.* TO 'nmsprime'@'172.20.0.100' IDENTIFIED BY '<yourSecurePassword>';
GRANT ALL PRIVILEGES ON `nmsprime`.* TO 'nmsprime'@'172.20.0.101' IDENTIFIED BY '<yourSecurePassword>';
GRANT ALL PRIVILEGES ON `nmsprime`.* TO 'nmsprime'@'172.20.0.102' IDENTIFIED BY '<yourSecurePassword>';
GRANT ALL PRIVILEGES ON `nmsprime_ccc`.* TO 'nmsprime_ccc'@'172.20.0.100' IDENTIFIED BY '<yourOtherSecurePassword>';
…
FLUSH PRIVILEGES;
Repeat this for all your cluster IPs and all databases used in the .env files (nmsprime, nmsprime_ccc, icinga2, cacti) and their users.
Make sure that everything is running as expected (check the NMSPrime GUI) and all devices can be provisioned (try to rebuild the config files and restart a CM):
cd /var/www/nmsprime
php artisan nms:dhcp
php artisan nms:configfile
systemctl status dhcpd
Then install the module ProvHA via yum install nmsprime-provha.
This:
- creates a new file /etc/nmsprime/env/provha.env ⇒ set PROVHA__OWN_STATE=master; this is the central setting defining the role of an NMSPrime instance
- creates a new file /etc/dhcp-nmsprime/failover.conf ⇒ set address to the master's IP and peer address to the slave's IP
- adds/comments out the include statement in /etc/dhcp-nmsprime/dhcpd.conf ⇒ check that dhcpd.conf contains the line include "/etc/dhcp-nmsprime/failover.conf"; otherwise add it manually after the deny bootp line
Allow incoming traffic from the slave IP on ports 7911/tcp and 647/tcp in your firewalld settings.
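As a sketch, assuming the default zone and a slave IP of 172.20.0.2, this could be done with rich rules like the following:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="172.20.0.2" port port="647" protocol="tcp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="172.20.0.2" port port="7911" protocol="tcp" accept'
firewall-cmd --reload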
Edit the ProvHA settings on the global configuration page (GUI) and set the master and slave IPs. Then refresh the configs:
cd /var/www/nmsprime
php artisan config:cache
php artisan nms:dhcp
Attention: At this point the DHCPd will not be able to provide IP addresses until the peer is reachable (you will see log messages like DHCPDISCOVER from ff:ee:dd:cc:bb:aa via 10.0.63.254: not responding (startup) forever).
Slave
Set up the slave following the Installation guide, then install module ProvHA.
Allow incoming traffic from the master IP (the failover peer) on ports 7911/tcp and 647/tcp in your firewalld settings.
On your galera cluster create read-only users for all databases (nmsprime, nmsprime_ccc, cacti, icinga2):
CREATE USER 'slaveread'@'172.20.0.%' IDENTIFIED BY '<yourNextSecurePassword>';
GRANT SELECT ON `nmsprime`.* TO 'slaveread'@'172.20.0.%';
FLUSH PRIVILEGES;
- use these credentials in your slave's /etc/nmsprime/env/*.env files and check that the slave has access to the database cluster
Open questions: What happens to cacti/icinga if database access is limited to read-only? Could e.g. CM HF diagrams be rendered on the slave instance?
Check journalctl -u dhcpd; there should now be entries like
failover peer dhcpd-failover: I move from communications-interrupted to normal
balancing pool 55983d4101d0 …
balanced pool 55983d4101d0 …
indicating that DHCP failover is set up properly.
Establish an ssh connection from the master to the slave:
- Create an ssh key (e.g. ssh-keygen -b 4096) on the master machine
- add the master's public key to the slave's /root/.ssh/authorized_keys
- test that you can establish an ssh connection from master to slave: ssh root@<slave_ip>
- rebuild the config files: php artisan nms:dhcp && php artisan nms:config
- on the master execute cd /var/www/nmsprime/ && php artisan provha:sync_ha_master_files; this should rsync the files to the slave and is expected to produce no errors
rsyslogd
To collect the log entries of both instances in one place, configure rsyslog so that the slave forwards its messages to the master, for example like the following:
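A minimal sketch, assuming forwarding over TCP port 514 and a master IP of 172.20.0.1 (file names and address are placeholders):
# on the slave, e.g. /etc/rsyslog.d/forward-to-master.conf
*.* @@172.20.0.1:514
# on the master, enable the TCP listener (e.g. in /etc/rsyslog.conf)
$ModLoad imtcp
$InputTCPServerRun 514
Afterwards restart rsyslog on both machines: systemctl restart rsyslog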
Remember to open 514/tcp in the master's iptables for the slave IP.
If all works fine you will see log entries with both the master's and the slave's hostname in the master's /var/log/messages.
DNS
On both machines we expect to find name resolution entries for every device (CM, MTA, CPE) in /var/named/dynamic/nmsprime.test.zone.jnl. Check this using
rndc sync -clean
which writes the binary journal file into the text file /var/named/dynamic/nmsprime.test.zone.
icingaweb2
To monitor the slave machine from the master one has to configure icingaweb:
- make sure that icinga has access to your database (https://your.nmsprime.domain:8080/icingaweb2/director#!/icingaweb2/director/importsources)
- if not: edit /etc/icingaweb2/resources.ini (see the sketch below) and check the privileges in galera
- reinstall the icingaweb director: yum reinstall -y icingaweb2-module-director
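As a rough sketch, a database resource in /etc/icingaweb2/resources.ini looks like this (section name, host and credentials are placeholders; use the values matching your galera setup):
[icinga2_db]
type = "db"
db = "mysql"
host = "172.20.0.100"
port = "3306"
dbname = "icinga2"
username = "icinga2"
password = "<yourIcingaDbPassword>"
charset = "utf8"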
There should be a netelement NMSPrime ProvHA slave in your NMSPrime, which should be detected by icinga. Some service checks should exist, too. Check this in icingaweb2.
In our test environment icinga is running on dev0 (HA master) as “configuration master” and on dev3 (HA slave) as “secondary master”. That means the configuration is done on dev0 in /etc/icinga2/zones.d/<ZONENAME> and automatically sent to dev3 (using the same zone name); this way they are redundant (both offering icingaweb and a backend) and tests have shown that they load-balance, too (see the zone sketch below).
Anyway: the setup is a bit tricky and unstable, and in most cases the error messages do not point to the real problem. For the initial config one can use the tool icinga2 node wizard.
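For orientation, a two-endpoint master zone as described above could look roughly like this in /etc/icinga2/zones.conf (host names follow the dev0/dev3 naming used here; addresses are placeholders):
object Endpoint "dev0" {
  host = "172.20.0.1"
}
object Endpoint "dev3" {
  host = "172.20.0.2"
}
object Zone "master" {
  endpoints = [ "dev0", "dev3" ]
}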
cacti
At the moment there is only one place where the .rrd files are stored, by default /var/lib/cacti/rra on the master instance. We use sshfs to mount this directory on the slave to show the cacti diagrams there, too. We highly recommend using autofs to ensure that the data is available when needed (see e.g. https://whattheserver.com/automounting-remote-shares-via-autofs).
This has to be configured manually (maybe you prefer a different setup?):
- master: As the .rrd files are world-readable (644) at the moment, we recommend adding a linux user for the slave on the master instance and using it for sshfs.
- check that you can connect from the slave to the master (ssh slaveuser@NMSPRIME_MASTER_IP); on problems check the authorized_keys file, the firewalld settings and /etc/hosts.(allow|deny)
- slave: create a directory for sshfs; we use /mnt/nmsprime-master/var__lib__cacti__rra (mkdir -p /mnt/nmsprime-master/var__lib__cacti__rra)
- slave: replace the directory /var/lib/cacti/rra by a symlink to the mount dir (rm -rf /var/lib/cacti/rra && ln -s /mnt/nmsprime-master/var__lib__cacti__rra /var/lib/cacti/rra)
- slave: create /etc/auto.master.d/nmsprime-master.autofs containing
/mnt/nmsprime-master /etc/auto.nmsprime-master.sshfs uid=0,gid=0,--timeout=600,--ghost
and /etc/auto.nmsprime-master.sshfs containing
var__lib__cacti__rra -fstype=fuse,ro,nodev,nonempty,noatime,allow_other :sshfs#slaveuser@NMSPRIME_MASTER_IP:/var/lib/cacti/rra
- slave: systemctl enable autofs && systemctl start autofs
- add a read-only user for the cacti database:
CREATE USER 'cactireader'@'172.20.0.%' IDENTIFIED BY '<secret_password>';
GRANT SELECT ON `cacti`.* TO 'cactireader'@'172.20.0.%';
FLUSH PRIVILEGES;
- configure the database access in /etc/nmsprime/env/provmon.env and /etc/cacti/db.php
- to see live modem values at your slave instance: edit the modem base configuration file (allow the slave IP for SNMP, too)
Now, every time the directory is visited it will be mounted automatically (and unmounted after --timeout seconds).
If your master is down, your modem values will not be updated. Making cacti fully redundant is out of the scope of this module at the moment.
Clean slave system
There are several files that should be removed on the slave system; in particular, the modem poller must not run on multiple systems.
You can use the following command:
php artisan provha:clean_up_slave
In later update/install cycles this command will be called automatically. The files are not deleted but moved to /var/www/nmsprime/storage/app/data/provha/moved_system_files in case you need them later on.