Failover Architecture

Review and discussion welcome!

Overview

First of all, we distinguish between critical and non-critical services. Everything that affects customers is "critical". Everything that only affects service staff is "non-critical" for the time being.

Critical Services (failover possible)

Non-Critical Services (no failover considerations at this time)

All GUI-related stuff:

  • NMS PRIME GUI
  • Apache
  • Monitoring (Cacti)
  • Icinga / Nagios

Failover Layers

1. NMS Prime GUI

No failover at this time.

2. Apache

No failover at this time.

3. Database

Standard MySQL / MariaDB failover cluster with N nodes. Possible solutions:

  1. MaxScale

MariaDB's MaxScale is a smart proxy server for MySQL that speaks the database server's protocol. Besides solid high availability, the vendor also promises good scalability.
It is primarily an access proxy placed in front of a downstream HA database cluster (e.g. MariaDB Galera).

  2. MariaDB Galera Cluster (not tested)

A Galera cluster consists of any number of servers, each running a normal "mysqld" service. All MySQL instances in the cluster communicate with each other and thereby ensure that every node always holds the latest data. Combined with a load balancer, the setup appears to the outside as a single database; the client software does not notice that a larger number of instances is actually working in the background.

  3. MySQL Cluster

Professional solution, purchased from Oracle. Cost: approx. 10,000 - 30,000 €.


Discussion: Galera Cluster versus MySQL Cluster

→  https://bobcares.com/blog/mysql-cluster-vs-galera/

MySQL cluster vs Galera – How to make the right choice

Databases are literally the heart of any application.

As the app becomes popular, read/write access to the databases also increases. Unfortunately, the database can then become a bottleneck for application performance. That's where database clustering with MySQL Cluster, Galera, etc. helps.

At Bobcares, we often get requests from customers about choosing the best database clustering option as part of our Infrastructure Management Services.

Today, we’ll see a comparative analysis of MySQL cluster vs Galera and how our Dedicated Engineers recommend it as per customer requirement.

Why go for Database Clustering?

Let's now look at why database clustering is needed.

When a website is heavily accessed, the number of requests that actually reach the database is high, and the speed of the website depends on how quickly the server can get results from the database.

At the same time, the database server has to write data to the tables, so it has to handle reads and writes simultaneously. That's where clustering of database servers helps.

In other words, in database clustering a group of servers handles the workload rather than a single server. This provides data redundancy and load balancing, and it makes the web application highly available.

However, there is no "one size fits all" solution when it comes to database clustering. The choice depends largely on the type of web application, the amount of reads and writes to the databases, the type of content, and more.

Let’s now have a closer look at the clustering options Galera and MySQL cluster.

MySQL Cluster

Firstly, we'll look at the MySQL Cluster option.

MySQL Cluster contains data nodes that store the cluster data and a management node that stores the cluster's configuration. Here, MySQL clients first communicate with the management node and then connect directly to the data nodes.

To keep the data nodes in sync, MySQL Cluster uses a special storage engine called NDB (Network Database). Therefore, MySQL Cluster typically does not rely on classic MySQL replication between its data nodes; the NDB engine synchronizes them itself. It also uses automatic sharding, i.e. splitting a large database into smaller units.

MySQL Cluster works in a shared-nothing environment: no two components of the cluster share the same hardware. The cluster is fully operational as long as at least one node is up in each data node group. As a result, MySQL Cluster avoids a single point of failure and ensures 99.99% availability.

In real-time scenarios, MySQL Cluster can provide response times of less than 3 ms.

For all these reasons, our Support Engineers have achieved the best results with MySQL Cluster in scenarios that need to handle a high volume of traffic. It is also one of the best methods when you have to scale up read access to the databases.

Galera Cluster

Moving on, let's have a look at Galera Cluster too.

In simple terms, Galera Cluster consists of a database server that uses the Galera Replication Plugin to manage replication. It is a multi-master database cluster that supports synchronous replication. As a result, it provides multiple, up-to-date copies of the data. This makes it really useful in scenarios where instant failover is needed.

Galera Cluster allows reading and writing data on any node. Other typical benefits include guaranteed write consistency, automatic node provisioning, etc.

When the network connection between nodes is lost, the nodes that can still reach each other form a new cluster view, and the nodes that lost the connection keep trying to reconnect to the primary component. Once the connection is restored, the separated nodes sync back and rejoin the cluster automatically. This is really useful when you have servers in various geographical locations.
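
The partition behaviour described above can be observed on each node through Galera's wsrep status variables. The following is a minimal sketch, not a tested NMS Prime component: the function name galera_node_usable is hypothetical, and it assumes the status is fed in as tab-separated "name value" lines, as the mysql client prints them with -N -B.

```shell
#!/bin/sh
# Sketch: decide whether a Galera node is safe to send traffic to.
# Reads tab-separated "variable<TAB>value" lines on stdin and succeeds only
# if the node belongs to the primary component and is fully synced.
# The function name is hypothetical; the variable names are Galera's own.
galera_node_usable() {
    awk -F'\t' '
        $1 == "wsrep_cluster_status"      { status = $2 }
        $1 == "wsrep_local_state_comment" { state = $2 }
        END { exit !(status == "Primary" && state == "Synced") }
    '
}
```

A possible call would be: mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | galera_node_usable && echo "node usable". A load balancer health check could use the exit status in the same way.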

Additionally, it is pretty easy to scale up a Galera cluster by adding nodes, and monitoring the cluster status remains simple. Unlike MySQL Cluster, there is no need for a management node.

However, Galera comes with limited MyISAM support. But, it gives best results with the InnoDB storage engine.

That’s why, our Dedicated Engineers suggested Galera cluster to one of our customers who was scaling up the write access to his InnoDB database.

Conclusion

In short, database clustering with MySQL Cluster and with Galera each has its own advantages, and the right choice depends on the exact usage scenario. Today, we discussed the top features of MySQL Cluster and Galera and saw how our Support Engineers recommend the best solution based on customer requirements.

4. NMS Prime Lower Layers

We distinguish between the master and N slave NMS PRIME instances. The master instance also runs the NMS PRIME GUI. Any change in the GUI triggers real-time changes in the master config(s), such as DHCP and TFTP. This is done via Laravel Observers or Jobs (e.g. Modem Observer).

The slaves run on separate machines without a GUI. They rebuild the DHCP, BIND, and TFTP config files at a regular interval (e.g. every hour), for example via a cron job. The slaves are independent of the master; they are connected only to the MariaDB SQL cluster, via a read-only SQL connection. Any change on the master is therefore written directly to the SQL cluster and later fetched automatically by the slaves.

This concept offers:

  1. a master with real-time changes to all critical configs
  2. redundant slaves that are independent of the master
  3. a redundant database with the possibility of load sharing
  4. load sharing for DHCP, DNS and TFTP for all modems

5. Critical Services

ISC-DHCP

Standard ISC-DHCP failover with a master-slave concept.

Slaves rebuild their DHCP configs by themselves after a defined time (see above).

DNS / BIND

Slaves rebuild their configs by themselves after a defined time (see above).

More research required, but a good starting point could be here:

TFTP

A cron job on the slave rebuilds all config files on a recurring basis (e.g. every hour). In NMS Prime this can be done simply by running an artisan command. See below.


Possible Cronjob(s) for Slaves

e.g. possible cronjob (the service unit is usually named dhcpd on CentOS/RHEL)
php artisan nms:dhcp && systemctl restart dhcpd

e.g. possible cronjob
php artisan nms:configfile
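
The DHCP cron job above restarts the daemon even if the rebuilt config is broken. A slightly more defensive variant could validate the config first. This is only a sketch under assumptions: the function name rebuild_dhcp and the NMSPRIME_DIR default are hypothetical, and the unit name dhcpd is the usual one on CentOS/RHEL.

```shell
#!/bin/sh
# Sketch: rebuild the DHCP config on a slave and restart dhcpd only if the
# new config passes a syntax check.
# Assumption: NMS Prime is installed under /var/www/nmsprime (override via NMSPRIME_DIR).
NMSPRIME_DIR=${NMSPRIME_DIR:-/var/www/nmsprime}

rebuild_dhcp() {
    # Regenerate the DHCP config from the database.
    php "$NMSPRIME_DIR/artisan" nms:dhcp || {
        echo "DHCP config rebuild failed" >&2
        return 1
    }
    # dhcpd -t only tests the configuration syntax; it does not start a daemon.
    dhcpd -t -q || {
        echo "new DHCP config is invalid, keeping the running daemon" >&2
        return 2
    }
    systemctl restart dhcpd
}
```

With this in place the cron entry would call rebuild_dhcp instead of chaining the two commands, so a faulty rebuild leaves the running daemon untouched.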


Github TODO: #687

Implementing this in the Laravel scheduling framework (for slaves only!) would be an improvement, especially if building all config files could take longer than the rebuild interval, since overlapping runs can easily be avoided using ->withoutOverlapping():

See: https://github.com/nmsprime/nmsprime/blob/dev/app/Console/Kernel.php#L35
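
Until the rebuild is moved into the Laravel scheduler, a plain cron job can get a similar no-overlap guarantee with flock(1) from util-linux. The sketch below is a cron-level stand-in for ->withoutOverlapping(), not the Laravel mechanism itself; the function name run_exclusive and any lock file path are hypothetical.

```shell
#!/bin/sh
# Sketch: run a command only if no previous run still holds the lock.
# Usage: run_exclusive LOCKFILE COMMAND [ARGS...]
# Returns 75 (EX_TEMPFAIL) without running the command if the lock is busy.
run_exclusive() {
    lockfile=$1
    shift
    (
        # -n: do not wait for the lock; give up immediately if it is taken.
        flock -n 9 || {
            echo "skipped: previous run still active" >&2
            exit 75
        }
        "$@"
    ) 9>"$lockfile"
}
```

A slave's cron job could then call, for example, run_exclusive /var/lock/nms-configfile.lock php artisan nms:configfile, so that a slow rebuild is never started twice in parallel.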

I would love to see an /etc/nmsprime/env setting for a possible slave configuration, like

SLAVE_CONFIG_REBUILD_INTERVALL=3600 # time in seconds
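
How a slave-side script might consume such a setting can be sketched as follows. This is purely illustrative: the setting itself is only a proposal, the helper name read_env_value is hypothetical, and the env file path is passed in as a parameter because the exact file is not specified here.

```shell
#!/bin/sh
# Sketch: print the value of KEY from an env-style file (KEY=value # comment),
# or DEFAULT if the file is unreadable or the key is not set.
# Usage: read_env_value FILE KEY DEFAULT
read_env_value() {
    file=$1 key=$2 default=$3 value=
    if [ -r "$file" ]; then
        # Take the text after "KEY=", drop a trailing "# comment",
        # then strip whitespace and double quotes. The last match wins.
        value=$(sed -n "s/^${key}=\([^#]*\).*/\1/p" "$file" | tail -n 1 | tr -d '[:space:]"')
    fi
    printf '%s\n' "${value:-$default}"
}
```

A rebuild loop could then use something like interval=$(read_env_value "$envfile" SLAVE_CONFIG_REBUILD_INTERVALL 3600), falling back to one hour when the setting is absent.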


Now there is a collective ticket: #771


Workflow



Considerations on Failover from 22.5.2019

Ole Ernst

Torsten Schmidt

(Christian Schramm)