Cassandra is a highly-distributable NoSQL database with tunable consistency. What makes it highly distributable makes it also, in part, vulnerable: the whole deployment must run on synchronized clocks.
It’s quite surprising that, given how crucial this is, it is not covered sufficiently in literature. And, if it is, it simply refers to installation of a NTP daemon on each node which – if followed blindly – leads to really bad consequences. You will find blog posts by users who got burned by clock drifting.
In the first installment of this two part series, I’ll cover how important clocks are and how bad clocks can be in virtualized systems (like Amazon EC2) today. In the next installment, coming out next week, I’ll go over some disadvantages of off-the-shelf NTP installations, and how to overcome them.
About clocks in Cassandra clusters
Cassandra serializes write operations by time stamps you send with a query. Time stamps solve an important serialization problem with inherently loosely coupled nodes in large clusters. At the same time however, time stamp are its Achilles’ heel. If system clocks runs off each other, so does time stamps of write operations and you are about to experience inexplicable data inconsistencies. It is crucial for Cassandra to have clocks right.
Boot-time system clock synchronization is not enough unfortunately. No clock is the same and you will eventually see clock drifting, i.e. growing difference among clocks in the system. You have to maintain clock synchronization continually.
It is a common misconception that clocks on virtual machines are somewhat resistant against clock drifting. In fact, virtual instances are especially prone to, even in a dramatic way, if the system is under heavy load. On Amazon EC2, you can easily observe drift about 50ms per day on unloaded instance and seconds per day on a loaded instance.
How much clocks need to be synchronized? It depends on your type of work load. If you run read-only or append-only queries, you are probably fine with modestly synchronization. However if you run concurrent read-update queries it’s starting to be serious. And if you do so because of API calls or concurrent job processing, it’s critical down to milliseconds.
Unfortunately, there is great, off-the-shelf ready solution. Why unfortunately?
Network Time Protocol
Network Time Protocol (NTP) gets the time from external time source in the network and propagates it further down the network. NTP uses hierarchical tree-like topology, where each layer is referred to as “clock strata”, starting with Stratum 0 as the authoritative time source, and continuing with Strata 1, 2, etc. Nodes which synchronizes clocks with nodes on Stratum n become nodes on Stratun n+1. NTP daemon sends time queries periodically to specified servers, adjusts the value to network latency associated with the message transmission, and re-adjusts the local clock to the calculated time. Running NTP daemon will help to avoid clock drifting especially on loaded machines.
In order to make NTP work you need to specify a set of servers where the current time will be pulled from. NTP servers may be provided by your network supplier, or you can use publicly available NTP servers. The best list of available public NTP servers is NTP pool project where you can also find best options for you geographical region. It is a good thing to use this pool. You should not use NTP servers without consent of the provider.
How to install NTP daemon
Installing NTP daemon is as simple as:
aptitude install ntpd
and it works immediately. That’s because it is pre-configured to use a standard pool of NTP servers. If you look at
/etc/ntp.conf you will see servers defined with the server parameter, for example:
server 0.debian.pool.ntp.org iburst server 1.debian.pool.ntp.org iburst server 2.debian.pool.ntp.org iburst server 3.debian.pool.ntp.org iburst
This is default for Debian systems, you may see a slightly different list in your distribution. The
iburst parameter is there for optimization. If you want to check how NTP daemon works, run the following command:
ntpq -p. You will get a list similar to this one:
remote refid st t when poll reach delay offset jitter ============================================================================== *dns1.dns.imagin 220.127.116.11 3 u 17 64 7 1.979 0.035 0.235 -eu-m01.nthweb.c 18.104.22.168 2 u 19 64 7 1.064 9.067 0.094 +tshirt.heanet.i .PPS. 1 u 15 64 7 3.276 -0.193 0.066 +ns0.fredprod.co 22.214.171.124 2 u 15 64 7 0.818 -0.699 8.112
It shows you the list of servers it synchronizes to, its reference, stratum,
synchronization periods, response delay, offset from the current time and jitter.
NTP uses optimizing algorithm which selects the best source of current clock as well as a working set of servers it takes into account. Node marked with “*” is the current time source. Nodes marked with “+” are used in the final set. Nodes marked with “-” are discarded by the algorithm.
You can restart NTP daemon with
service ntpd restart
and watch grabbing a different set of servers, selecting the best source and gradually increasing the period when servers are contacted when the clock gets stabilized.
Works like a charm.
Why not to just install NTP daemon on each node
If NTP works so great out of the box, why not simply install it on all boxes? In fact, this is exactly the advice you commonly get for cluster setup.
With respect to Cassandra, it’s the relative difference among clocks that matters, not their absolute accuracy. By default NTP will sync against a set of random NTP servers on the Internet which will result in synchronization of absolute clocks. Therefore the relative difference of clocks in the C* cluster will depend on how clocks are synchronized to absolute values from several randomly chosen public servers.
Look at the (real) example output from the ntpq command, the offset column. The difference among clocks is about 0.1ms, 0.5ms, but there is also an outlier with 9ms difference. Synchronization to the millisecond is a reasonable requirement, which requires one to synchronize absolute clocks to 0.5ms after/before boundary.
How precise, in absolute values, are public NTP servers? We ran a quick check of 560 randomly chosen public NTP servers from the public pool. The statistics are:
- 11% are below 0.5ms drift
- 15% are below 1ms drift
- 62% are below 10ms drift
- 11% are below 100ms drift
There are also outliers, with one being off by multiple hours.
Assuming: (1) our checks are representative, (2) each NTP daemon picks up 4 random NTP servers, and (3) synchronizing to the second best option (this is optimistic) these are the probabilities of our cluster clocks being off:
Nodes 5 10 25 100 95% 2.489 5.180 9.349 19.723 50% 7.122 10.892 18.872 44.394 25% 10.917 16.969 30.855 54.197 10% 18.584 30.291 45.311 66.942
How to read it: assume a cluster of 25 nodes, then with the probability of 50% there will be two nodes with clock difference of more than 18.8ms.
The results may be surprising – even in a small cluster of 10 nodes they will be off by more than 10.9ms half of the time, and with a probability of 10% it will be off by more than 30ms.