Emanuel Evans

How etcd works with and without Kubernetes

Published in July 2021

TL;DR: In this article, you will learn why Kubernetes uses etcd as a database by building (and breaking) a 3-node etcd cluster.

If you've ever interacted with a Kubernetes cluster in any way, chances are it was powered by etcd under the hood.

But even though etcd is at the heart of how Kubernetes works, it's rare to interact with it directly on a day-to-day basis.

This article will introduce how etcd works so you can get a deeper understanding of the inner workings of Kubernetes, as well as giving you some extra tools in your cluster troubleshooting toolbox.

How etcd fits into Kubernetes

At a high level, a Kubernetes cluster has three categories of control-plane processes:

  • The Kubernetes control plane includes the controller manager, the API server, the scheduler and etcd (among other components).

  • Every node in the cluster has the kubelet — a Kubernetes agent that executes tasks such as creating containers, attaching them to the network, mounting volumes, etc.

  • The Kubernetes API is the glue that connects the internal controllers to the kubelet.

One of the interesting design choices in Kubernetes is that the API server itself does very little.

When a user or process performs an API call, the API server:

  1. Determines whether the API call is authorized (using RBAC).
  2. Possibly changes the payload of the API call through mutating webhooks.
  3. Determines whether the payload is valid (using internal validation and validating webhooks).
  4. Persists the API payload and returns the requested information.
  5. Potentially notifies subscribers of the API endpoint that the object has changed (more on this later).
  • Let's assume you want to create a deployment with kubectl apply -f deployment.yaml.

  • The API server receives the request and checks that you are a valid user (authentication) and you have the rights to create Deployments (authorization).

  • The Deployment definition is then saved in etcd.
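
As a rough sketch of that flow in practice (the manifest and Deployment name here are placeholders; any small Deployment will do):

bash

kubectl apply -f deployment.yaml              # authenticated, authorized, validated, then persisted in etcd
kubectl get deployment my-deployment -o yaml  # the API server serves the stored object back to you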

Everything else that happens in the cluster — determining which pods should run based on high-level objects like Deployments, scheduling pods to nodes with the right resources, setting up networking, and so on — is handled by controllers and node-specific processes.

  • As soon as a Deployment is created in etcd, the controller manager is notified of the new resource.

  • The Deployment and ReplicaSet controllers will eventually create and store the pods in etcd.

Architecturally speaking, the API server is a CRUD application that's not fundamentally different from, say, WordPress — most of what it does is storing and serving data.

And like WordPress, it needs a database to store its persisted data, which is where etcd fits into the picture.

Why etcd?

Could the API server use a SQL database like MySQL or PostgreSQL to persist its data?

It certainly could (more on this later), but it's not a great fit for the specific requirements of Kubernetes.

Thinking from first principles, what are some desirable traits for the API server's backing database?

1. Consistency

Since the API server is the central coordination point of the entire cluster, strong consistency is essential.

It would be a disaster if, say, two nodes tried to attach the same persistent volume over iSCSI because the API server told them both that it was available.

This requirement rules out eventually consistent NoSQL databases and some distributed multi-master SQL configurations.

2. Availability

API downtime means that the entire Kubernetes control plane comes to a halt, which is undesirable for production clusters.

The CAP theorem says that it's impossible to guarantee both strong consistency and 100% availability in the face of network partitions, but minimizing downtime is still a critical goal.

3. Consistent Performance

The API server for a busy Kubernetes cluster receives a fair amount of read and write traffic.

Unpredictable slowdowns from concurrent use would be a huge problem.

4. Change Notification

Since the API server acts as a centralized coordinator between many different types of clients, streaming changes in real-time would be an excellent feature (and turns out to be central to how Kubernetes works in practice).

5. Other considerations

It's also worth noting what's not needed by the API database: cluster state is a relatively small dataset, and the access patterns don't call for complex relational queries.

Traditional SQL databases tend to optimize for strong consistency with large and complex datasets, but it's often difficult to achieve high availability and consistent performance out of the box.

They're not a great fit for the Kubernetes API use-case.

Enter etcd

According to its website, etcd is a "strongly consistent, distributed key-value store".

Breaking down what that means: etcd stores data as keys and values, it replicates that data across multiple nodes, and it guarantees that reads reflect the latest committed write (strong consistency).

In addition, etcd has another killer feature that Kubernetes makes extensive use of: change notifications.

etcd allows clients to subscribe to changes to a particular key or set of keys.

Along with these features, etcd is relatively easy to deploy (at least as far as distributed databases go), so it's pretty easy to see why it was chosen for Kubernetes.

It's worth noting that etcd is not the only distributed key-value store available with similar characteristics.

Some other options include Apache ZooKeeper and HashiCorp Consul.

How etcd Works

The secret behind etcd's balance of strong consistency and high availability is the Raft algorithm.

Raft solves a particular problem: how can multiple independent processes decide on a single value for something?

In computer science, this problem is known as distributed consensus; it was famously solved by Leslie Lamport's Paxos algorithm, which is effective but notoriously tricky to understand and implement in practice.

Raft was designed to solve similar problems to Paxos but in a much more understandable way.

Raft works by electing a leader among a set of nodes and forcing all write requests to go to the leader.

  • When a cluster is formed, all nodes start in the follower state.

  • If followers don't hear from a leader, then they can become a candidate and request votes from other nodes.

  • Nodes reply with their vote.

  • The candidate becomes the leader if it gets votes from a majority of nodes.

Changes are then replicated from the leader to all other nodes; if the leader ever goes offline, a new election is held, and a new leader is chosen.

If you're interested in the details, here's a great visual walkthrough. The original Raft paper is also relatively easy to understand and well worth reading.

An etcd cluster always has a single leader node at any given time (elected using the Raft protocol).

Data writes follow a consistent path:

  1. The client can send the write request to any one of the etcd nodes in the cluster.
  2. If the client happened to communicate with the leader node, the write will be performed and replicated to the other nodes.
  3. If the client chose a node other than the leader, the write request is forwarded to the leader and the process is the same from then on.
  4. Once the write succeeds, an acknowledgement is sent back to the client.
  • In an etcd cluster, the nodes reach consensus using the Raft protocol. A node can be labelled either as a Follower, Leader or Candidate.

  • What happens when you want to write a value in the database? First, all write requests go to the leader. The leader adds the entry to its log, but it is not committed yet.

  • To commit the entry, the leader first replicates it to the follower nodes.

  • Finally, the leader waits until a majority of nodes have written the entry and commits the value. The state of the database contains my_key=1.

Read requests go through the same basic path, although as an optimization, you can allow read requests to be performed by replica nodes at the expense of linearizability.
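
etcd exposes this trade-off to clients. For example, with the etcdctl client you'll meet later in this article, you can ask for a serializable (possibly stale) read that the contacted member answers from its local copy instead of going through the leader. A quick sketch:

bash

./etcdctl get foo --consistency=s    # serializable read: faster, but may lag behind the leader
./etcdctl get foo --consistency=l    # linearizable read (the default): always reflects the latest committed write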

If the cluster leader goes offline for any reason, a new election is held so that the cluster can stay online.

Crucially, a majority of nodes have to agree to elect a new leader (2/3, 4/6, etc.), and if a majority can't be reached, the entire cluster will be unavailable.

What this means in practice is that etcd will remain available as long as a majority of nodes is online.

How many nodes should an etcd cluster have to achieve "good enough" availability?

As with most design questions, the answer is "it depends", but from looking at a quick table, it's possible to see a rule of thumb:

Total number of nodes    Failed node tolerance
1                        0
2                        0
3                        1
4                        1
5                        2
6                        2
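
The table follows from simple quorum arithmetic: a cluster of n nodes needs a majority of floor(n/2) + 1 members, so it can tolerate the loss of n minus that many. A quick sketch:

bash

# Quorum arithmetic for an n-node etcd cluster
for n in 1 2 3 4 5 6; do
  quorum=$(( n / 2 + 1 ))
  echo "nodes=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done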

One thing that jumps out from this table is that adding one node to an odd number of nodes doesn't increase availability.

For instance, etcd clusters of 3 or 4 nodes will both only be able to tolerate one node failure, so it's pointless to add the fourth node.

Therefore, a good rule of thumb is that it only makes sense to use an odd number of etcd nodes in a cluster.

How many nodes is the correct number?

Again, the answer is "it depends", but it's important to remember that all writes must be replicated to follower nodes; as more nodes are added to the cluster, that process will become slower.

So there's an availability/performance tradeoff: the more nodes are added, the better the availability of the cluster but the worse the performance.

In practice, typical production etcd clusters tend to have 3 or 5 nodes.

Giving etcd a spin

Now that you've seen etcd in theory, let's look at using it in practice.

Getting started with etcd is relatively straightforward.

You can download the etcd binaries directly from its release page.

Here's an example for Linux:

bash

curl -LO https://github.com/etcd-io/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
tar xzvf etcd-v3.5.0-linux-amd64.tar.gz
cd etcd-v3.5.0-linux-amd64

If you inspect the contents of the release, you can see three binaries (etcd, etcdctl, and etcdutl) along with some documentation.

Starting a single-node etcd "cluster" is as easy as running:

bash

./etcd
...
{"level":"info","caller":"etcdserver/server.go:2027","msg":"published local member..." }
{"level":"info","caller":"embed/serve.go:98","msg":"ready to serve client requests"}
{"level":"info","caller":"etcdmain/main.go:47","msg":"notifying init daemon"}
{"level":"info","caller":"etcdmain/main.go:53","msg":"successfully notified init daemon"}
{"level":"info","caller":"embed/serve.go:140","msg":"serving client traff...","address":"127.0.0.1:2379"}

You can see from the logs that etcd has already set up a "cluster" and started serving traffic insecurely on 127.0.0.1:2379.

To talk to this running "cluster", you can use the etcdctl binary.

There's one complication to be aware of: the etcd API changed significantly between v2 and v3.

To make sure etcdctl uses the v3 API, you have to explicitly set an environment variable before running it.

Run the following command in every terminal window you use:

bash

export ETCDCTL_API=3

Now you can actually use etcdctl to write and read key-value data:

bash

./etcdctl put foo bar
OK
./etcdctl get foo
foo
bar

As you can see, etcdctl get will echo the key and the value; you can use the --print-value-only flag to disable this behaviour.
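
For instance, the same read with the flag applied returns just the value:

bash

./etcdctl get foo --print-value-only
bar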

To get a more detailed response, you can use the --write-out=json option:

bash

./etcdctl get --write-out=json foo
{
  "header": {
    "cluster_id": 14841639068965180000,
    "member_id": 10276657743932975000,
    "revision": 2,
    "raft_term": 2
  },
  "kvs": [
    {
      "key": "Zm9v",
      "create_revision": 2,
      "mod_revision": 2,
      "version": 1,
      "value": "YmFy"
    }
  ],
  "count": 1
}

Here you get a glimpse at the metadata that etcd maintains.

The key/value data itself is in the kvs array, while the header describes the overall state of the cluster.

The etcd documentation describes what these values mean: version refers to the version of that particular key.

In contrast, the various revision values refer to the global revision of the overall cluster.

Every time a write operation happens in the cluster, etcd creates a new version of its dataset, and the revision number is incremented.

This system is known as multiversion concurrency control, or MVCC for short.

It's also worth noting that both the key and value are returned base64-encoded.

This is because keys and values in etcd are arbitrary byte arrays instead of strings.

Unlike with, say, MySQL, there's no built-in concept of string encoding in etcd.
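
You can verify this by decoding the base64 strings from the earlier response; they are just the raw bytes of the original key and value:

bash

echo "Zm9v" | base64 -d   # decodes to: foo
echo "YmFy" | base64 -d   # decodes to: bar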

When you overwrite the value, notice that the version and revision numbers are incremented:

bash

./etcdctl put foo baz
OK
./etcdctl get --write-out=json foo
{
  "header": {
    "cluster_id": 14841639068965180000,
    "member_id": 10276657743932975000,
    "revision": 3,
    "raft_term": 2
  },
  "kvs": [
    {
      "key": "Zm9v",
      "create_revision": 2,
      "mod_revision": 3,
      "version": 2,
      "value": "YmF6"
    }
  ],
  "count": 1
}

Specifically, the version and mod_revision fields have been incremented, but create_revision hasn't.

That makes sense: mod_revision refers to the revision where the key was last modified, and create_revision refers to the revision where it was created.

etcd also allows "time travel" queries using the --rev flag, which shows you the value of a key as it existed at a particular cluster revision:

bash

./etcdctl get foo --rev=2 --print-value-only
bar
./etcdctl get foo --rev=3 --print-value-only
baz

As you might expect, you can also delete keys.

The command is etcdctl del:

bash

./etcdctl del foo
1
./etcdctl get foo

The 1 returned by etcdctl del refers to the number of deleted keys.

But deletion isn't permanent, and you can still "time-travel" back to before the key was deleted with the --rev flag:

bash

./etcdctl get foo --rev=3 --print-value-only
baz

Returning multiple results

One nice feature of etcd is the ability to return multiple values at once.

To try this out, first create a few keys and values:

bash

./etcdctl put myprefix/key1 thing1
OK
./etcdctl put myprefix/key2 thing2
OK
./etcdctl put myprefix/key3 thing3
OK
./etcdctl put myprefix/key4 thing4
OK

If you give two arguments to etcdctl get, it will use a range query to return all key/value pairs in that range.

Here's an example:

bash

./etcdctl get myprefix/key2 myprefix/key4
myprefix/key2
thing2
myprefix/key3
thing3

Here, etcd returns all keys between myprefix/key2 and myprefix/key4, including the start key but excluding the end key.

Another way to retrieve multiple values is to use the --prefix flag, which (unsurprisingly) returns all keys with a particular prefix.

Here's how you would get all the key/value pairs that start with myprefix/:

bash

./etcdctl get --prefix myprefix/
myprefix/key1
thing1
myprefix/key2
thing2
myprefix/key3
thing3
myprefix/key4
thing4

Note that there's nothing special about the / here, and --prefix myprefix/key would have worked just as well.

Believe it or not, you've now seen most of what etcd has to offer in terms of database capabilities!

There are a few more options for transactions, key ordering, and response limits, but at its core etcd's data model is extremely simple.

But etcd has a few other interesting features up its sleeve.

Watching for changes

One key feature of etcd that you haven't seen yet is the etcdctl watch command, which works more or less the same as etcdctl get but streams changes back to the client.

Let's see it in action!

In one terminal window, watch for the changes to anything with the prefix myprefix/:

bash

./etcdctl watch --prefix myprefix/

And then in another terminal, change some data and see what happens:

bash

./etcdctl put myprefix/key1 anewthing
OK
./etcdctl put myprefix/key5 thing5
OK
./etcdctl del myprefix/key5
1
./etcdctl put notmyprefix/key thing
OK

In the original window, you should see a stream of all the changes that happened in the myprefix prefix, even for keys that didn't previously exist:

bash

PUT
myprefix/key1
anewthing
PUT
myprefix/key5
thing5
DELETE
myprefix/key5

Note that the update to the key starting with notmyprefix didn't get streamed to the subscriber.

But the watch command isn't just limited to real-time changes—you can also see a "time travel" version of events using the --rev option, which will stream all the changes since that revision.

Let's see the history of the foo key from earlier:

bash

./etcdctl watch --rev=2 foo
PUT
foo
bar
PUT
foo
baz
DELETE
foo

This handy feature can let clients ensure they don't miss updates if they go offline.
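
For example, suppose the last event a client processed carried revision 6 (a hypothetical number). After reconnecting, it can resume the watch from the next revision and replay anything it missed:

bash

./etcdctl watch --prefix myprefix/ --rev=7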

Setting up a multi-node etcd cluster

So far, your etcd "cluster" has involved just one node, which isn't particularly exciting.

Let's set up a 3-node cluster and see how high availability works in practice!

All the nodes would be on separate servers in a real cluster, but you can set up a cluster with all the nodes on the same computer by giving each node its own unique ports.

First, let's make data directories for each node:

bash

mkdir -p /tmp/etcd/data{1..3}

When setting up an etcd cluster, you have to know in advance what the IP addresses and ports of the nodes will be so the nodes can discover each other.

Let's bind to localhost for all three nodes, give "client" ports (the ports that clients use to connect) of 2379, 3379, and 4379, and "peer" ports (the ports that are used between etcd nodes) of 2380, 3380, and 4380.

Starting up the first etcd node is a matter of supplying the right CLI flags:

bash

./etcd --data-dir=/tmp/etcd/data1 --name node1 \
  --initial-advertise-peer-urls http://127.0.0.1:2380 \
  --listen-peer-urls http://127.0.0.1:2380 \
  --advertise-client-urls http://127.0.0.1:2379 \
  --listen-client-urls http://127.0.0.1:2379 \
  --initial-cluster node1=http://127.0.0.1:2380,node2=http://127.0.0.1:3380,node3=http://127.0.0.1:4380 \
  --initial-cluster-state new \
  --initial-cluster-token mytoken

Breaking down some of those options:

  • --data-dir is where the node stores its data, and --name gives the node a unique name within the cluster.

  • The peer URLs (--listen-peer-urls and --initial-advertise-peer-urls) are used for node-to-node communication, while the client URLs (--listen-client-urls and --advertise-client-urls) are where clients such as etcdctl connect.

  • --initial-cluster lists every node's name and peer URL so the nodes can find each other, --initial-cluster-state new indicates that a brand-new cluster is being bootstrapped, and --initial-cluster-token is a shared token that stops nodes from accidentally joining the wrong cluster.

Starting the second node is very similar; in a new terminal, run:

bash

./etcd --data-dir=/tmp/etcd/data2 --name node2 \
  --initial-advertise-peer-urls http://127.0.0.1:3380 \
  --listen-peer-urls http://127.0.0.1:3380 \
  --advertise-client-urls http://127.0.0.1:3379 \
  --listen-client-urls http://127.0.0.1:3379 \
  --initial-cluster node1=http://127.0.0.1:2380,node2=http://127.0.0.1:3380,node3=http://127.0.0.1:4380 \
  --initial-cluster-state new \
  --initial-cluster-token mytoken

And start a third node in a third terminal:

bash

./etcd --data-dir=/tmp/etcd/data3 --name node3 \
  --initial-advertise-peer-urls http://127.0.0.1:4380 \
  --listen-peer-urls http://127.0.0.1:4380 \
  --advertise-client-urls http://127.0.0.1:4379 \
  --listen-client-urls http://127.0.0.1:4379 \
  --initial-cluster node1=http://127.0.0.1:2380,node2=http://127.0.0.1:3380,node3=http://127.0.0.1:4380 \
  --initial-cluster-state new \
  --initial-cluster-token mytoken

To communicate with an etcd cluster, you have to tell etcdctl where to connect using the --endpoints option: a comma-separated list of etcd client endpoints (equivalent to the --listen-client-urls from earlier).

Let's verify that all the nodes joined successfully using the member list command:

bash

export ENDPOINTS=127.0.0.1:2379,127.0.0.1:3379,127.0.0.1:4379
./etcdctl --endpoints=$ENDPOINTS member list --write-out=table
+------------------+---------+-------+-----------------------+-----------------------+------------+
|        ID        | STATUS  | NAME  |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
+------------------+---------+-------+-----------------------+-----------------------+------------+
| 3c969067d90d0e6c | started | node1 | http://127.0.0.1:2380 | http://127.0.0.1:2379 |      false |
| 5c5501077e83a9ee | started | node3 | http://127.0.0.1:4380 | http://127.0.0.1:4379 |      false |
| a2f3309a1583fba3 | started | node2 | http://127.0.0.1:3380 | http://127.0.0.1:3379 |      false |
+------------------+---------+-------+-----------------------+-----------------------+------------+

You can see that all three nodes have successfully joined the cluster!

Don't worry about the IS LEARNER field; that's related to a special kind of node called a learner node, which has a specialized use-case.

Let's confirm that the cluster can actually read and write data:

bash

./etcdctl --endpoints=$ENDPOINTS put mykey myvalue
OK
./etcdctl --endpoints=$ENDPOINTS get mykey
mykey
myvalue

What happens when one of the nodes goes down?

Let's kill the first etcd process (using Ctrl-C) and find out!

The first thing to try after killing the first node is the member list command, but the output is a bit surprising:

bash

./etcdctl --endpoints=$ENDPOINTS member list --write-out=table
+------------------+---------+-------+-----------------------+-----------------------+------------+
|        ID        | STATUS  | NAME  |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
+------------------+---------+-------+-----------------------+-----------------------+------------+
| 3c969067d90d0e6c | started | node1 | http://127.0.0.1:2380 | http://127.0.0.1:2379 |      false |
| 5c5501077e83a9ee | started | node3 | http://127.0.0.1:4380 | http://127.0.0.1:4379 |      false |
| a2f3309a1583fba3 | started | node2 | http://127.0.0.1:3380 | http://127.0.0.1:3379 |      false |
+------------------+---------+-------+-----------------------+-----------------------+------------+

All three members are still present, even though node1 has been killed.

Confusingly, member list actually lists all cluster members, regardless of their health.

The command for checking member health is endpoint status, which does show the current status of each endpoint:

bash

./etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=table
{
  "level": "warn",
  "ts": "2021-06-23T15:43:40.378-0700",
  "logger": "etcd-client",
  "caller": "v3/retry_interceptor.go:62",
  "msg": "retrying of unary invoker failed",
  "target": "etcd-endpoints://0xc000454700/#initially=[127.0.0.1:2379;127.0.0.1:3379;127.0.0.1:4379]",
  "attempt": 0,
  "error": "rpc error: code = DeadlineExceeded ... connect: connection refused\""
}
Failed to get the status of endpoint 127.0.0.1:2379 (context deadline exceeded)
+----------------+------------------+-----------+------------+--------+
|    ENDPOINT    |        ID        | IS LEADER | IS LEARNER | ERRORS |
+----------------+------------------+-----------+------------+--------+
| 127.0.0.1:3379 | a2f3309a1583fba3 |      true |      false |        |
| 127.0.0.1:4379 | 5c5501077e83a9ee |     false |      false |        |
+----------------+------------------+-----------+------------+--------+

Here you see a warning about the failed node and some interesting information about the current state of each node.

But does the cluster still work?

Theoretically it should (since the majority is still online), so let's verify:

bash

./etcdctl --endpoints=$ENDPOINTS get mykey
mykey
myvalue
./etcdctl --endpoints=$ENDPOINTS put mykey newvalue
OK
./etcdctl --endpoints=$ENDPOINTS get mykey
mykey
newvalue

Things are looking good—reads and writes both work!

And if you bring back the original node (using the same command as before) and try endpoint status again, you can see that it quickly rejoins the cluster:

bash

./etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=table
+----------------+------------------+-----------+------------+--------+
|    ENDPOINT    |        ID        | IS LEADER | IS LEARNER | ERRORS |
+----------------+------------------+-----------+------------+--------+
| 127.0.0.1:2379 | 3c969067d90d0e6c |     false |      false |        |
| 127.0.0.1:3379 | a2f3309a1583fba3 |      true |      false |        |
| 127.0.0.1:4379 | 5c5501077e83a9ee |     false |      false |        |
+----------------+------------------+-----------+------------+--------+

What happens if two nodes become unavailable?

Let's kill node1 and node2 with Ctrl-C and try endpoint status again:

bash

./etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=table
{"level":"warn","ts":"2021-06-23T15:47:05.803-0700","logger":"etcd-client","caller":"v3/retry_i ...}
Failed to get the status of endpoint 127.0.0.1:2379 (context deadline exceeded)
{"level":"warn","ts":"2021-06-23T15:47:10.805-0700","logger":"etcd-client","caller":"v3/retry_i ...}
Failed to get the status of endpoint 127.0.0.1:3379 (context deadline exceeded)
+----------------+------------------+-----------+------------+-----------------------+
|    ENDPOINT    |        ID        | IS LEADER | IS LEARNER |        ERRORS         |
+----------------+------------------+-----------+------------+-----------------------+
| 127.0.0.1:4379 | 5c5501077e83a9ee |     false |      false | etcdserver: no leader |
+----------------+------------------+-----------+------------+-----------------------+

This time, there's an error message saying there's no leader available.

If you try to perform reads or writes to the cluster, you'll just get errors:

bash

./etcdctl --endpoints=$ENDPOINTS get mykey
{
  "level": "warn",
  "ts": "2021-06-23T15:48:31.987-0700",
  "logger": "etcd-client",
  "caller": "v3/retry_interceptor.go:62",
  "msg": "retrying of unary invoker failed",
  "target": "etcd-endpoints://0xc0001da000/#initially=[127.0.0.1:2379;127.0.0.1:3379;127.0.0.1:4379]",
  "attempt": 0,
  "error": "rpc error: code = Unknown desc = context deadline exceeded"
}
./etcdctl --endpoints=$ENDPOINTS put mykey anewervalue
{
  "level": "warn",
  "ts": "2021-06-23T15:49:04.539-0700",
  "logger": "etcd-client",
  "caller": "v3/retry_interceptor.go:62",
  "msg": "retrying of unary invoker failed",
  "target": "etcd-endpoints://0xc000432a80/#initially=[127.0.0.1:2379;127.0.0.1:3379;127.0.0.1:4379]",
  "attempt": 0,
  "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
}
Error: context deadline exceeded

But if you bring the other two nodes back up (using the original commands), things will return to normal pretty quickly:

bash

./etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=table
+----------------+------------------+-----------+------------+--------+
|    ENDPOINT    |        ID        | IS LEADER | IS LEARNER | ERRORS |
+----------------+------------------+-----------+------------+--------+
| 127.0.0.1:2379 | 3c969067d90d0e6c |     false |      false |        |
| 127.0.0.1:3379 | a2f3309a1583fba3 |     false |      false |        |
| 127.0.0.1:4379 | 5c5501077e83a9ee |      true |      false |        |
+----------------+------------------+-----------+------------+--------+
./etcdctl --endpoints=$ENDPOINTS get mykey
mykey
newvalue

A new leader has been elected, and the cluster is back online.

Thankfully, there was no data loss, even though there was some downtime.

Kubernetes and etcd

Now that you know how etcd works in general, let's investigate how Kubernetes uses etcd under the hood.

These examples will use minikube, but production-ready Kubernetes setups should work very similarly.

To see etcd in action, start minikube, SSH into minikube, and download etcdctl in the same way as before:

bash

minikube start
minikube ssh
curl -LO https://github.com/etcd-io/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
tar xzvf etcd-v3.5.0-linux-amd64.tar.gz
cd etcd-v3.5.0-linux-amd64

Unlike the test setup from before, minikube deploys etcd with mutual TLS authentication, so you have to provide TLS certificates and keys with every request.

This is a bit tedious, but a Bash variable can help speed things along:

bash

export ETCDCTL=$(cat <<EOF
sudo ETCDCTL_API=3 ./etcdctl --cacert /var/lib/minikube/certs/etcd/ca.crt \
  --cert /var/lib/minikube/certs/etcd/healthcheck-client.crt \
  --key /var/lib/minikube/certs/etcd/healthcheck-client.key
EOF
)

The certificate and keys used here are generated by minikube during Kubernetes cluster bootstrapping. They're designed for use with etcd health checks, but they also work well for debugging.

Then you can run etcdctl commands like this:

bash

$ETCDCTL member list --write-out=table
+------------------+---------+----------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |   NAME   |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------+---------------------------+---------------------------+------------+
| aec36adc501070cc | started | minikube | https://192.168.49.2:2380 | https://192.168.49.2:2379 |      false |
+------------------+---------+----------+---------------------------+---------------------------+------------+

As you can see, the minikube cluster has a single etcd node running.

How does the Kubernetes API store data in etcd?

A little investigation suggests that pretty much all the Kubernetes data has the /registry prefix:

bash

$ETCDCTL get --prefix /registry | wc -l
5882

Some more investigation shows that pod definitions are under the /registry/pods prefix:

bash

$ETCDCTL get --prefix /registry/pods | wc -l
412

The naming scheme is /registry/pods/<namespace>/<pod-name>.

For instance, here you can see the scheduler pod definition:

bash

$ETCDCTL get --prefix /registry/pods/kube-system/ --keys-only | grep scheduler
/registry/pods/kube-system/kube-scheduler-minikube

This uses the --keys-only option, which, as you might expect, just returns the keys matched by the query.

What does the actual data look like? Let's investigate:

bash

$ETCDCTL get /registry/pods/kube-system/kube-scheduler-minikube | head -6
/registry/pods/kube-system/kube-scheduler-minikube
k8s

v1Pod�
�
kube-scheduler-minikube�
                        kube-system"*$f8e4441d-fb03-4c98-b48b-61a42643763a2��نZ

It looks kind of like junk!

That's because the Kubernetes API stores the actual object definitions in a binary format instead of something human-readable.

If you want to see an object specification in a friendly format like JSON, you have to go through the API instead of accessing etcd directly.
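
For example, asking the API server for that same scheduler pod from your host (outside the minikube ssh session) returns human-readable JSON:

bash

kubectl get pod kube-scheduler-minikube --namespace=kube-system -o json | head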

Kubernetes' naming scheme for etcd keys should make perfect sense now: it allows the API to query or watch all objects of a particular type in a specific namespace using an etcd prefix query.

This is a widespread pattern in Kubernetes and is how Kubernetes controllers and operators subscribe to changes for objects that they're interested in.

Let's try subscribing to pod changes in the default namespace to see this in action.

First, use the watch command with the appropriate prefix:

bash

$ETCDCTL watch --prefix /registry/pods/default/ --write-out=json

Then, in another terminal, create a pod and see what happens:

bash

kubectl run --namespace=default --image=nginx nginx
pod/nginx created

You should see several JSON messages appear in the etcd watch output, one for each status change in the pod (for instance, going from Pending to Scheduled to Running statuses).

Each message should look something like this:

output.json

{
  "Header": {
    "cluster_id": 18038207397139143000,
    "member_id": 12593026477526643000,
    "revision": 935,
    "raft_term": 2
  },
  "Events": [
    {
      "kv": {
        "key": "L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9uZ2lueA==",
        "create_revision": 935,
        "mod_revision": 935,
        "version": 1,
        "value": "azh...ACIA"
      }
    }
  ],
  "CompactRevision": 0,
  "Canceled": false,
  "Created": false
}

To satisfy any curiosity about what the data actually contains, you can run the value through xxd to investigate interesting strings. As an example:

bash

$ETCDCTL get /registry/pods/default/nginx --print-value-only | xxd | grep -A2 Run
00000600: 5072 696f 7269 7479 1aba 030a 0752 756e  Priority.....Run
00000610: 6e69 6e67 1223 0a0b 496e 6974 6961 6c69  ning.#..Initiali
00000620: 7a65 6412 0454 7275 651a 0022 0808 e098  zed..True.."....

Here you can infer that the pod you created currently has a status of Running.

In real-world usage, you would rarely interact with etcd directly in this way and would instead subscribe to changes through the Kubernetes API.

But it's not hard to imagine how the API interacts with etcd using precisely these kinds of watch queries.
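
You can observe the client-side counterpart of this pattern with kubectl: a watch on pods is served by the API server, which in turn keeps itself up to date through etcd watches like the one above:

bash

kubectl get pods --namespace=default --watch    # streams pod changes as they happen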

Replacing etcd

etcd works terrifically in thousands of Kubernetes clusters in the real world, but it might not be the best tool for all use cases.

For instance, if you want an extremely lightweight cluster for testing purposes or embedded environments, etcd might be overkill.

That's the idea behind k3s, a lightweight Kubernetes distribution designed for precisely those kinds of use-cases.

One of the distinguishing features that sets k3s apart from "vanilla" Kubernetes is its ability to swap out etcd with SQL databases.

The default backend is SQLite, which is an ultra-lightweight embedded SQL library.

That allows users to run Kubernetes without having to worry about operating an etcd cluster.

How does k3s accomplish this?

The Kubernetes API offers no way to swap out databases — etcd is more or less "hardcoded" into the codebase.

Nor did k3s rewrite the Kubernetes API to support pluggable databases, which would work but would impose a huge maintenance burden.

Instead, k3s uses a special project called Kine (for "Kine is not etcd").

Kine is a shim that translates etcd API calls into actual SQL queries.

Kine is an adapter that lets you use MySQL, SQLite and PostgreSQL as replacements for etcd

Since the etcd data model is so simple, the translation is relatively straightforward.

For example, here's the template for listing keys by prefix in the Kine SQL driver:

query.sql

SELECT (%s), (%s), %s
FROM kine AS kv
JOIN (
  SELECT MAX(mkv.id) AS id
  FROM kine AS mkv
  WHERE
    mkv.name LIKE ?
    %%s
  GROUP BY mkv.name) maxkv
ON maxkv.id = kv.id
WHERE
    (kv.deleted = 0 OR ?)
ORDER BY kv.id ASC

Kine uses a single table that holds keys, values, and some extra metadata.

Prefix queries are just translated into SQL LIKE queries.

Of course, since Kine uses SQL, it won't get the same performance or availability characteristics as etcd.

In addition to SQLite, Kine can also use MySQL or PostgreSQL as its backend.

Why would you want to do that?

Sometimes the best database is one that's managed by someone else!

Suppose your company has an existing database team with a lot of experience running production-worthy SQL databases.

In that case, it might make sense to leverage that expertise instead of operating etcd independently (again, with the caveat that availability and performance characteristics may not be equivalent to etcd).
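
For completeness, here's roughly what pointing k3s at an external database looks like; the --datastore-endpoint flag selects the backend, and the credentials, host, and database name below are placeholders:

bash

k3s server \
  --datastore-endpoint="postgres://k3s:changeme@db.example.com:5432/kubernetes"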

Summary

Peeking under the covers of Kubernetes is always interesting, and etcd is one of the most central pieces of the Kubernetes puzzle.

etcd is where Kubernetes stores all of the information about a cluster's state; in fact, it's the only stateful part of the entire Kubernetes control plane.

etcd provides a feature set that's a perfect match for Kubernetes.

It's strongly consistent so that it can act as a central coordination point for the cluster, but it's also highly available thanks to the Raft consensus algorithm.

And the ability to stream changes to clients is a killer feature that helps all the components of a Kubernetes cluster stay in sync.

Even though etcd works well for Kubernetes, as Kubernetes starts getting used in more unusual environments (like embedded systems), it might not always be the ideal choice.

Projects like Kine let you swap out etcd with another database where it makes sense.
