Breathe Easy with a Self-Healing Conjur Cluster 

 

“The great thing about software and servers is they always work as expected.”
– Said no one, ever

When your software securely stores the passwords, keys, and certificates that allow applications and infrastructure to work, scale, and connect to the bits they need, the uptime of your product becomes critically important. To provide a reliable, durable store, Conjur stands on the shoulders of giants, delivering high availability and predictable behavior under extreme load. 

Elements of a Conjur Cluster  

Before we pull back the cover and show how we achieve high uptime in Conjur, let’s talk about the three types of Conjur server instances: Masters, Standbys, and Followers. 

Master 

We start our story with the Conjur master. The master is responsible for the three R’s: Reading, wRiting, and Replication (plus a few other responsibilities). The master is where everything starts. All of your security policies and secret data are written to the master before being distributed. Without a master, a cluster cannot accept updates or changes. 

Standby 

A standby is like a backup quarterback, just waiting on the sidelines, staying warmed up, hoping to get a shot at being the master. Standbys are running, receiving updates from the master, but standbys are not allowed to service any user requests. They stand in waiting, ready for the opportunity to take over if the master fails. 

Standbys come in two flavors: Synchronous and Asynchronous. Database writes are made to Synchronous Standbys as part of any transaction made on the master. This means they are in lock-step with the master’s view of the world: anything on the master is guaranteed to be on the Synchronous Standbys. Synchronous Standbys are usually found in the same data center as the master. 

Asynchronous Standbys receive a continuous stream of updates from the master but use an eventually consistent model. That means their data updates will lag the master by some amount of time (usually not more than milliseconds or seconds depending on proximity, latency, and the size of the transactions recently made on the master). Asynchronous standbys can and should be distributed across data centers to ensure the cluster is tolerant of data center failure. If the master and all Synchronous Standbys are lost, an Asynchronous Standby can be promoted to master in its place. 

Followers 

Last, but certainly not least, is the follower. The follower is the workhorse of the cluster, providing read access to all the secure items stored in Conjur and servicing authentication requests. Since those requests form the vast majority of traffic in a Conjur cluster, a Follower is an ideal unit of scaling. 

Followers receive a continuous stream of updates and changes from the master. In return, they stream audit information back to the master about who and what has been accessing those credentials. Followers are found wherever secrets need to be accessed. 

Replicating Data 

Data is replicated from the master to standbys and followers using Postgres streaming replication. Regardless of type, every Conjur instance has two databases inside it: one for policy data, and another for audit data. Postgres streaming replication is a durable mechanism that is battle-tested, efficient, and predictable. If an instance becomes disconnected from the master, it replays all of the master’s transactions once it reconnects. 
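
To make that picture concrete, here is a minimal sketch (not Conjur tooling) that queries the master’s pg_stat_replication view to show which replicas are connected and whether they replicate synchronously or asynchronously. The hostname, user, and the Postgres 10+ column names are assumptions.

```python
# Minimal sketch: inspect streaming replication state on the master.
# Assumes the psycopg2 driver and Postgres 10+ column names.
import psycopg2

conn = psycopg2.connect(host="conjur-master.example.com",  # hypothetical host
                        dbname="postgres",
                        user="replication_monitor")        # hypothetical user
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state, sync_state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for name, state, sync_state, lag_bytes in cur.fetchall():
        # sync_state is 'sync' for synchronous standbys and 'async' for
        # asynchronous standbys and followers.
        print(f"{name}: state={state} sync={sync_state} lag={lag_bytes} bytes")
```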

Audit data streams from each follower back to the master.  Once again, the power of Postgres means that if a follower becomes disconnected, it continues to collect audit records, and pushes those records up to the master once connectivity is restored. 

To secure connections between instances in the cluster, Conjur uses Mutual TLS, which requires both sides of a connection to present valid certificates before the connection is established. Mutual TLS allows Conjur to be run securely across data centers and availability zones. 
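
As an illustration of the mutual TLS handshake, the sketch below uses Python’s standard library (this is a generic example, not Conjur’s internal code; the file paths and port are hypothetical). The server refuses any peer that does not present a certificate signed by the cluster CA.

```python
# Minimal sketch of mutual TLS with the Python standard library.
import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.verify_mode = ssl.CERT_REQUIRED               # require a client certificate
context.load_cert_chain("server.crt", "server.key")   # this node's identity (hypothetical paths)
context.load_verify_locations("cluster-ca.crt")       # CA that signs cluster members

with socket.create_server(("0.0.0.0", 8443)) as server:
    with context.wrap_socket(server, server_side=True) as tls_server:
        # The TLS handshake happens during accept(); it fails unless
        # both sides present certificates the other trusts.
        conn, addr = tls_server.accept()
        print("mutually authenticated connection from", addr)
```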

Auto-Failover

Ask any operations person how much they love being woken up at 3am when a key service goes down, and you’ll probably get a nasty look. Everyone on the CyberArk Conjur team values their sleep, which is why Conjur clusters can be configured to automatically fail over to a standby in the event a master becomes unhealthy. Failover is a coordinated, orchestrated process that promotes a standby into the role of master. Conjur leverages the battle-tested tool etcd to detect failure and elect a healthy standby. 

Before we dig into how to set a cluster up for auto-failover, let’s look at what happens when a failover occurs. 

Failure Detection

To detect a master failure, we use a “dead-man’s switch”: in etcd, an expired time-to-live (TTL) counter signals the failure of the master. In normal operation, the master continuously resets this counter. If the master becomes unhealthy, it stops resetting the counter. When the counter expires, the cluster members know that they must begin the election and subsequent promotion process. 
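
The pattern looks roughly like the following sketch, which uses the python-etcd3 client and hypothetical key names; it is not Conjur’s actual failover code, just the dead-man’s-switch idea.

```python
# Minimal sketch of a dead-man's switch: the master keeps refreshing a
# key bound to a short-lived lease; if it stops, the key disappears and
# the rest of the cluster knows to start an election.
import time

import etcd3  # assumption: the python-etcd3 client library

def master_is_healthy() -> bool:
    """Placeholder for a real local health check."""
    return True

client = etcd3.client(host="etcd.example.com", port=2379)   # hypothetical endpoint
lease = client.lease(ttl=10)   # the key vanishes 10s after the last refresh
client.put("/conjur/cluster/master-alive", "master-1", lease=lease)  # hypothetical key

while master_is_healthy():
    lease.refresh()            # reset the TTL; the switch stays armed
    time.sleep(3)
# Once refreshes stop, the lease expires, the key disappears, and the
# cluster members begin the election and promotion process.
```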

Promotion

In order for a cluster to initiate auto-failover, the cluster requires: 

  • Two or more standbys be present 
  • A majority of the cluster members be operational 
  • At least one synchronous standby be present 

When a failure event is detected using the consensus algorithm built into etcd, the eligible standbys race to obtain a promotion lock held within etcd. The standby that wins the race and gets the lock is the one that begins the promotion process. The others return to being standbys. 
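
The race itself can be pictured as an atomic “create this key only if it does not exist” operation: exactly one standby succeeds. The sketch below illustrates the idea with the python-etcd3 client and hypothetical key names; it is not the implementation Conjur ships.

```python
# Minimal sketch of the promotion race using an atomic
# create-if-absent transaction in etcd.
import etcd3  # assumption: the python-etcd3 client library

client = etcd3.client(host="etcd.example.com", port=2379)   # hypothetical endpoint
LOCK_KEY = "/conjur/cluster/promotion-lock"                  # hypothetical key name

def try_to_win_promotion(my_name: str) -> bool:
    # The transaction succeeds only if the lock key does not exist yet,
    # so exactly one standby can claim it and begin promotion.
    won, _responses = client.transaction(
        compare=[client.transactions.version(LOCK_KEY) == 0],
        success=[client.transactions.put(LOCK_KEY, my_name)],
        failure=[],
    )
    return won

if try_to_win_promotion("standby-2"):
    print("Won the lock: begin promotion to master")
else:
    print("Lost the race: remain a standby")
```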

In a successful promotion, the new master updates the cluster configuration, removing the old master (to ensure the old master cannot rejoin while believing it is still the master). The remaining standbys are rebased against the new master and continue streaming replication from it. 

Clustering

Having auto-failover in place is great, but without your services knowing where the new master is, we don’t really have a self-healing system. We again reach for industry-standard tools to solve this problem: DNS and load balancers. A correctly configured Conjur cluster sits behind a load balancer, with a FQDN pointing to the load balancer; the load balancer knows the addresses of the master and each synchronous standby. When queried via the /health endpoint, the master reports healthy to the load balancer. Each standby node reports that it is not accepting requests, so the load balancer sends all incoming requests to the master only. 
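
The check a load balancer performs can be as simple as the following sketch. The hostnames, port, and CA bundle path are hypothetical, and it assumes only that /health returns an HTTP 200 from the node that is willing to accept traffic (the exact response body isn’t important here).

```python
# Minimal sketch of a load balancer-style health check against /health.
import requests  # assumption: the requests HTTP client

NODES = [
    "conjur-1.example.com",   # hypothetical node names; any of them
    "conjur-2.example.com",   # could currently hold the master role
    "conjur-3.example.com",
]

for node in NODES:
    try:
        resp = requests.get(f"https://{node}/health", timeout=2,
                            verify="cluster-ca.crt")       # hypothetical CA bundle
        healthy = resp.status_code == 200   # only the active master reports healthy
    except requests.RequestException:
        healthy = False
    print(f"{node}: {'route traffic here' if healthy else 'do not route'}")
```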

During a failover, the old master reports unhealthy, either because it has actually failed or because it has been removed from the cluster. The load balancer then directs all traffic to the new master. All followers and asynchronous standbys connect to the master through the load balancer, which keeps them blissfully unaware of the promotion event. 

Wrapping it Up 

Standing on the shoulders of giants, Conjur can be run in a highly distributed and scalable configuration. Auto-failover protects against master failure. Distributed standbys provide insurance against data center or availability zone outages. Replication allows followers to continue operating through an upstream failure, bringing themselves back into sync and pushing audit information upstream once connectivity is restored. Together, these approaches help your operations, security, and development teams sleep at night (or at least keep them from losing sleep over the operation of Conjur). 

To read more about how you can set up a Conjur cluster with auto-failover enabled, read through our documentation here: Auto-Failover Docs