Kubernetes Harbor Image Proxy Cache — From Minutes to Milliseconds

Element · 6 min read · Jul 20, 2023

When you deploy a workload to Kubernetes, the containers in a Pod are based on OCI container images. These images can be pulled from private or public registries of many kinds. Kubernetes caches images locally on every node that has pulled them, so that other Pods on the same node can reuse them. The settings for how and when Kubernetes pulls images (the imagePullPolicy) can be found in the documentation.
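
For reference, that pull behaviour is controlled per container by imagePullPolicy; here is a minimal sketch (the Pod and image names are just placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: redis-example
spec:
  containers:
    - name: redis
      image: redis:7.0
      # IfNotPresent reuses an image already cached on the node;
      # Always contacts the registry on every container start.
      imagePullPolicy: IfNotPresent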

In many use cases, though, that is not enough. Most cloud Kubernetes clusters today rely on autoscaling, with nodes dynamically allocated based on the customer’s usage. What happens when multiple nodes have to pull the same image? If that image is heavy, each pull can take minutes, and in the world of application autoscaling that is a long time.

The Solution

The solution needs a cache layer on top of Kubernetes, so that Kubernetes has a centralized image cache and all nodes “pull” from it. And because the cache needs to be very fast, the caching solution has to sit inside the cluster, where every node has the lowest possible latency towards it.

Here comes Harbor. Harbor is a CNCF Graduated project that functions as a container registry and, most importantly for our purposes, as a pull-through proxy cache.

A pull-through proxy cache is a caching mechanism designed to optimize the distribution and retrieval of container images within a container registry environment. It acts as an intermediary between clients (such as container runtimes or build systems) and the upstream container registry.

When a client requests a container image, the pull-through proxy cache checks if it already has a local copy of the requested image. If the image is present, the proxy cache serves it directly to the client, eliminating the need to download it from the upstream registry. This reduces network latency and conserves bandwidth.

If the requested image is not present in the local cache, the proxy cache acts as a regular proxy and forwards the request to the upstream registry. The proxy cache then retrieves the image from the registry and serves it to the client. Additionally, the proxy cache stores a copy of the image in its local cache for future requests.

This caching mechanism offers several benefits, including improved performance, reduced network traffic, and enhanced reliability. By reducing the reliance on the upstream registry, it can significantly speed up container image distribution and deployment in containerized environments.

In our case, the Harbor pull-through proxy cache sits on one (or more) of the local Kubernetes nodes, shares the cluster network, and is close to all other nodes latency-wise. In practice this means that instead of pulling from the internet, each node pulls from another node: a remote pull happens “one node, once” instead of “multiple nodes, multiple times”.

The How

The first component that needs to be set up on Kubernetes is Harbor. For this purpose, we can use the official Bitnami Harbor Helm chart. The values have to include an ingress, persistence for the image cache if you want to use it, and some other settings I’ve found optimal for this use case (you should include the label provided in commonLabels; it will be explained shortly).

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-harbor bitnami/harbor -f values.yaml

The values.yaml:

externalURL: https://my-harbor.my.domain
exposureType: ingress
adminPassword: Harbor12345
commonLabels:
  goharbor.io/harbor-container-webhook-disable: 'true'
ingress:
  core:
    hostname: my-harbor.my.domain
persistence:
  enabled: true
  persistentVolumeClaim:
    registry:
      size: 100Gi
      accessModes:
        - ReadWriteOnce
postgresql:
  architecture: replication
  readReplicas:
    extendedConfiguration: |
      max_connections = 1024
  primary:
    extendedConfiguration: |
      max_connections = 1024

After the installation is complete, you should log in to the ingress endpoint you have provided, using the default credentials.

Next, you should configure your first registry endpoint. You can point it at any registry you use; here we will use the DockerHub provider with the username/password of an account we own.

Next, you should create a project that points to your registry endpoint (the proxy cache); in my case it already exists. Set it to Public or not based on your use case: this defines whether the project itself requires permissions when pulling/pushing images.

This configuration can also be applied using Terraform and the Harbor Provider:

provider "harbor" {
url = "https://harbor-${var.cluster}.my.domain"
username = "admin"
password = "Harbor12345"
}

resource "harbor_project" "dockerhub" {
name = "dockerhub"
registry_id = harbor_registry.dockerhub.registry_id
vulnerability_scanning = false
public = true
force_destroy = true
}

resource "harbor_registry" "dockerhub" {
provider_name = "docker-hub"
name = "dockerhub"
endpoint_url = "https://hub.docker.com"
access_id = var.dockerhub_username
access_secret = var.dockerhub_password
}

Now, technically, you should be able to pull any DockerHub image through your Harbor dockerhub project. It looks like this.

Instead of:

docker pull redis:latest

You could now do:

docker pull my-harbor.my.domain/dockerhub/redis:latest

Neat! You can now enjoy the benefits of a proxy cache. The next stage answers the question: “How do we make Kubernetes use this proxy cache automatically?”
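
Until that automation is in place, you could reference the proxied path directly in a Pod spec yourself; a minimal sketch, assuming the Harbor hostname and project from above:

apiVersion: v1
kind: Pod
metadata:
  name: redis-through-harbor
spec:
  containers:
    - name: redis
      # Pulled through the Harbor dockerhub proxy-cache project
      # instead of going to DockerHub directly
      image: my-harbor.my.domain/dockerhub/redis:latest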

The Harbor Cache Mutating Webhook

In the harbor-container-webhook repository you will find a project that does exactly this. If you are not familiar with them, Kubernetes mutating admission webhooks are API extensions that let you alter objects in real time before they are persisted and provisioned. A mutating webhook is custom logic exposed as an API endpoint that the Kubernetes API server calls during admission.

In our case, this webhook will “convert” every image reference with a certain prefix to use Harbor instead of its original repository. This is completely transparent: you still apply your Pod as if it will reach the remote repository, but behind the scenes it pulls through the Harbor proxy cache.

The values of this webhook, installed through Helm, can include rules. These rules are regex-based and give you control over which images you would like to mutate. (Notice that I exclude Harbor itself, because Harbor cannot pull through Harbor if Harbor is down.)

rules:
  - name: 'docker.io rewrite rule'
    matches:
      - '^docker.io'
    excludes:
      - '.*goharbor.*'
    replace: 'my-harbor.my.domain/dockerhub'
    checkUpstream: true
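
To make the effect of the rule concrete, here is a minimal sketch of a hypothetical Pod and the rewrite it undergoes at admission time:

apiVersion: v1
kind: Pod
metadata:
  name: redis-rewritten
spec:
  containers:
    - name: redis
      # You apply this image reference...
      image: docker.io/redis:latest
      # ...and the rule above mutates it to:
      #   my-harbor.my.domain/dockerhub/redis:latest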

You can also add protections using labels on Namespaces and Pods, which lets you roll this feature out granularly rather than all at once (this is also why we added the goharbor.io/harbor-container-webhook-disable label to Harbor’s own Pods via commonLabels earlier). For example:

webhook:
  namespaceSelector:
    matchExpressions:
      - key: "goharbor.io/harbor-container-webhook-enable"
        operator: In
        values: ["true"]
  objectSelector:
    matchExpressions:
      - key: "goharbor.io/harbor-container-webhook-disable"
        operator: NotIn
        values: ["true"]

Now you have a fully automated Kubernetes image cache sitting as close to your nodes as possible. As proof, my workloads’ image pull time was reduced from seconds to milliseconds at large scale!

(Screenshots: image pull durations before (“the past”) and after (“the future”) the proxy cache.)

Hope you enjoyed this article. If you liked it please clap, and if you have any questions I’d be happy to answer in the comments.
