
Review and discussion welcome!

Overview

First of all, we distinguish between critical and non-critical services. Everything customer-facing is "critical"; everything that only concerns service staff is "non-critical" for now.

Critical Services (failover possible): everything customer-facing, i.e. DHCP, DNS / BIND, TFTP and the database (see the layers below).

Non Critical Services (no failover considerations at this time): all GUI related stuff

  • NMS PRIME GUI
  • Apache
  • Monitoring (Cacti)
  • Icinga / Nagios

Failover Layers

1. NMS Prime GUI

No failover at this time.

2. Apache

No failover at this time.

3. Database

A standard MySQL / MariaDB failover cluster with N nodes. Possible solutions:

  1. MaxScale
  2. MariaDB Galera Cluster (not tested; a sketch follows below)
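
As an untested sketch, a minimal Galera section in /etc/my.cnf.d/server.cnf could look like this (node addresses and the cluster name are assumptions, not a tested setup):

[galera]
# enable Galera synchronous multi-master replication
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
# all cluster nodes (addresses are placeholders)
wsrep_cluster_address="gcomm://10.0.0.10,10.0.0.11,10.0.0.12"
wsrep_cluster_name="nmsprime-db"
# Galera requires row-based binlog and InnoDB
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2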

4. NMS Prime Lower Layers

We distinguish between the master and N slave NMS PRIME instances. The master instance also runs the NMS PRIME GUI. Any change in the GUI triggers real-time changes in the master's configs, such as DHCP and TFTP. This is done via Laravel Observers or Jobs (e.g. the Modem Observer); a rough sketch follows below.
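
For illustration only, the trigger mechanism could look roughly like this; the real observer in the NMS PRIME codebase may differ, and the helper method names here are assumptions:

// hypothetical sketch of an observer regenerating configs on GUI changes
class ModemObserver
{
    // called by Laravel whenever a Modem model is updated via the GUI
    public function updated(Modem $modem)
    {
        // regenerate the DHCP entry and TFTP configfile in real time
        // (make_dhcp_cm() / make_configfile() are assumed helper names)
        $modem->make_dhcp_cm();
        $modem->make_configfile();
    }
}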

The slaves run on separate machines without a GUI. They rebuild the DHCP, BIND, and TFTP config files on a regular basis (e.g. every hour), e.g. via a cronjob. The slaves are independent of the master and are only connected to the MariaDB SQL cluster via a read-only SQL connection (see the sketch below). Any change on the master is therefore written directly to the SQL cluster and later fetched automatically by the slaves.
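
A minimal sketch of such a read-only account on the SQL cluster (user name, host range, and database name are assumptions):

-- slaves only need SELECT on the NMS PRIME database
CREATE USER 'nmsprime_ro'@'10.0.0.%' IDENTIFIED BY 'changeme';
GRANT SELECT ON nmsprime.* TO 'nmsprime_ro'@'10.0.0.%';
FLUSH PRIVILEGES;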

This concept offers:

  1. a master with real-time changes to all critical configs
  2. redundant slaves that are independent of the master
  3. a redundant database with load-sharing capability
  4. load sharing for DHCP, DNS, and TFTP across all modems

5. Critical Services

ISC-DHCP

Standard ISC DHCP failover with a master-slave (primary/secondary) concept; a configuration sketch follows below.

Slaves rebuild their DHCP configs by themselves after a defined time (see above).
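
A minimal dhcpd.conf failover sketch for the primary node; peer addresses, subnet, and timing values are examples, not tested here:

failover peer "nmsprime" {
    primary;                      # the secondary uses "secondary;" and omits mclt/split
    address 10.0.0.1;             # this node
    port 647;
    peer address 10.0.0.2;        # failover partner
    peer port 647;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 3600;
    split 128;                    # balance leases 50/50 between the peers
    load balance max seconds 3;
}

# every failover-managed pool must reference the peer
subnet 10.1.0.0 netmask 255.255.255.0 {
    pool {
        failover peer "nmsprime";
        range 10.1.0.10 10.1.0.250;
    }
}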

DNS / BIND

Slaves rebuild their configs by themselves after a defined time (see above).

More research is required, but a good starting point could be here:
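
In the meantime, one possible direction: since every slave rebuilds identical zone files from the database, each slave could simply act as an authoritative server and be listed as an additional NS record. A minimal named.conf zone block (zone name and file path are assumptions):

zone "customer.example.net" IN {
    type master;
    file "/var/named/customer.example.net.zone";
    allow-transfer { none; };   // each slave rebuilds its own copy, no zone transfer needed
};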

TFTP

A cronjob on the slave rebuilds all config files on a recurring basis (e.g. every hour). In NMS PRIME this could simply be done by running the commands below.


Possible cronjob(s) for slaves

e.g. a possible cronjob:
php artisan nms:dhcp && systemctl restart dhcpd

e.g. a possible cronjob:
php artisan nms:configfile
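
Put together as real crontab entries, e.g. in /etc/cron.d, this could look as follows (the file name and the installation path /var/www/nmsprime are assumptions):

# /etc/cron.d/nmsprime-slave
# rebuild the DHCP config hourly and restart dhcpd
0 * * * * root cd /var/www/nmsprime && php artisan nms:dhcp && systemctl restart dhcpd
# rebuild all device configfiles hourly, offset by 30 minutes
30 * * * * root cd /var/www/nmsprime && php artisan nms:configfile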


Github TODO: #687

Implementing this in the Laravel scheduling framework (for slaves only!) would be an improvement, especially if building all config files could take longer than the rebuild loop, since overlapping runs can easily be avoided using ->withoutOverlapping():

See: https://github.com/nmsprime/nmsprime/blob/dev/app/Console/Kernel.php#L35
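
A sketch of such a schedule entry in app/Console/Kernel.php, assuming a (hypothetical) env flag that marks slave instances:

// in app/Console/Kernel.php
// use Illuminate\Console\Scheduling\Schedule;
protected function schedule(Schedule $schedule)
{
    // only slave instances rebuild their configs on a timer
    // (IS_SLAVE is an assumed env key, see the env wish below)
    if (env('IS_SLAVE', false)) {
        $schedule->command('nms:dhcp')->hourly()->withoutOverlapping();
        $schedule->command('nms:configfile')->hourly()->withoutOverlapping();
    }
}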

I would love to see an /etc/nmsprime/env statement for a possible slave configuration, like:

SLAVE_CONFIG_REBUILD_INTERVALL=3600 # time in seconds



Workflow



Considerations on failover from 22 May 2019

Participants:

  • Ole Ernst
  • Torsten Schmidt
  • (Christian Schramm)

