Shifting (even further) Left on Kubernetes Resource Compliance

Thomas Desrosiers
Google Cloud - Community
8 min read · Sep 14, 2022


As infrastructure-as-code takes hold in more and more organizations, the way we develop code has been evolving with it. This new status quo, with its faster development cycles and higher-quality releases, is often summed up as “shifting left.”

What does it really mean to shift left? That’s a little tricky to pin down. In simple terms, shifting left means automating more and more of the development pipeline and moving checks earlier in it, giving developers more control over what they do. That added freedom is then kept in check by automated testing of resources, ensuring a high standard of compliance and quality for, say, every push to the repository.

I mentioned infrastructure-as-code (IaC) earlier. In cloud environments, the actual computing resources are heavily abstracted away. We’re not going into the data center ourselves anymore, and all facets of equipment management and maintenance are handled by cloud providers. The benefit of this approach is that the infrastructure itself becomes symbolic. We say “I want to spin up four C6g machines on AWS EC2 to host this workload,” or “I want my Google Kubernetes Engine cluster to be resized up to 10 nodes.” Notice how command-like these phrases are. That’s IaC.

Let’s Talk Kubernetes Resource Definitions

I know that I’m starting to sound like I’m introducing Terraform, so I’ll get us back on track. Kubernetes resource definitions are YAML files that declaratively describe the desired state of resources in a Kubernetes cluster. Let’s look at this example for a minute:

# Sample nginx deployment on Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
      owner: "bob-robertson"
  template:
    metadata:
      labels:
        app: nginx
        owner: "bob-robertson"
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

In this file, we can see a number of fields that our cluster will act on, such as the type of resource (a Deployment, in this case), the name of the deployment, the person who owns it, how many replicas of the application to run, labels, and the container image to run. We can then use kubectl apply to deploy this resource to our cluster. Think about it: we’re using this file to control infrastructure!
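For instance, assuming the manifest above is saved as nginx-deployment.yaml (the filename is just an illustration), applying it and checking on it looks like this:

# Create (or update) the Deployment in the cluster that the current kubeconfig context points to
kubectl apply -f nginx-deployment.yaml

# Confirm that the Deployment and its three replicas came up
kubectl get deployment nginx-deployment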

Yes, a Kubernetes cluster may not be an actual physical cluster of computers (it could just be a bunch of VMs contained inside a single machine), but we’re using code to control, manage, update, and maintain computing resources, however abstracted they may be. How cool is that? Very cool. Until some developer at “X” Corporation decides to use kubectl apply on this deployment:

# Sample nginx deployment on Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

Do you spot the issue? Maybe you see the difference between the two YAML files but don’t think the discrepancy is particularly problematic. Well, I’m sorry to break it to you, but it really doesn’t matter what you think. It also doesn’t matter whether this particular developer saw it as a problem.

The Missing “owner” Label

The engineering manager at “X” Corporation has repeatedly told our developer to add owner labels to any resources they create for accountability purposes. They’re not too happy that this mistake just slipped on by and made it into the production environment. What are some possible ways to prevent something like this from happening in the future? Perhaps you could:

  1. Require manual code reviews for every commit to “X” Corporation’s codebase, or
  2. Write sticky-note reminders that your developers stick around their monitors, or maybe even
  3. Publicly scold our developer in front of their team, teaching them an important lesson through endless shame.

How about automated policy checks? What if “X” Corporation could maintain a repository of organization policies and constraints that the Kubernetes cluster could then use to verify each and every resource change? This is where tools like Open Policy Agent and Gatekeeper come into play.

Automating Security and Compliance

Attempting to ensure Kubernetes resource compliance manually would be error-prone and frustrating. Automating policy enforcement ensures consistency, lowers development latency through immediate feedback, and helps with agility by allowing developers to operate independently without sacrificing compliance.

In Kubernetes, you can decouple policy decisions from the API Server by means of admission controller webhooks, which are executed whenever a resource is created, updated or deleted. Gatekeeper makes use of these admission controller webhooks in order to enforce Custom Resource Definition-based policies executed by Open Policy Agent, which is a policy engine for Cloud Native environments.

Typically, an organization’s security team will maintain a repository of Gatekeeper resources, which can be identified by their kind field, such as ConstraintTemplate:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate

or a custom constraint kind defined by one of those templates:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels

Gatekeeper then uses these templates and constraints to determine how, when, and why a resource should fail a check, and what happens when it does. The Gatekeeper template contains logic that targets specific updates to Kubernetes resources, written in Rego, a policy language that focuses on providing powerful support for referencing nested documents and ensuring that queries are correct and unambiguous. The moment a resource is applied to a cluster, Gatekeeper can determine whether it truly belongs there or not, before it even has a chance to do damage. It’s pretty powerful stuff that I’m positive would really help “X” Corporation’s development team.
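To make that concrete, here is a sketch of what such a template-and-constraint pair might look like for the missing owner label, modeled on the k8srequiredlabels example from the Gatekeeper documentation. The constraint name and the decision to target Deployments are illustrative choices, not “X” Corporation’s actual policy:

# ConstraintTemplate: defines the reusable K8sRequiredLabels policy and its Rego logic
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
---
# Constraint: applies the template to Deployments and requires an "owner" label
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["owner"]

With these two resources in place, the second deployment from earlier would be rejected at admission time with a message naming the missing owner label.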

What about Speed?

The way I just described it, it sounds like Gatekeeper can only exist in the cluster itself. That’s not entirely true. For example, agile organizations with many code changes will likely be using a Continuous Integration / Continuous Deployment (CI/CD) architecture. A CI/CD pipeline will automatically build code, compile binaries, run unit tests, and safely deploy the new version of an application. Sometimes it’s ALL automated, but most of the time, there is at least a group of humans who look at proposed changes before they’re deployed in production. These pipelines do all the heavy lifting for developers, while also ensuring hands-off security and compliance. Sounds like a great deal, but there is one problem:

It’s slow. I’ve seen pipelines run for hours before ultimately failing. Luckily for me, I’ve only experienced waits on the order of minutes, but it still takes a CI/CD pipeline some time to run all of its testing and building steps.

However, having to wait several minutes until my pipeline shows me that Gatekeeper found a violation in one of my deployments is still not ideal. I’ll need to correct the mistake, reapply my commit, and then wait again to see if it went through successfully.

Running Gatekeeper Tests Locally

I’ve heard from many organizations that face this same problem. Obviously, it’s wonderful to be able to automate security (or some parts of it), but speed is always of the essence in organizations with agile development cycles.

Fortunately, within the Gatekeeper project is a spinoff project called Gator. Gator is a command-line interface (CLI) for evaluating Gatekeeper ConstraintTemplates and Constraints in a local environment, without the need for a local Kubernetes development cluster. Since Gator is part of the Gatekeeper open source project, it is continuously maintained and improved by the community. It is also highly performant, and can validate Kubernetes resources against constraints in seconds. That’s why it’s the main ingredient in a special project I’ve been working on along with my friend and coworker, Janine Bariuan. We’re calling it Pre-Validate, but please let us know if you can think of a catchier, snazzier name! If you’re interested in contributing to the code, feel free to fork the repository and give it a go!
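As a rough sketch of how that looks on the command line (the directory and file names here are placeholders), Gator can evaluate local manifests against a folder of templates and constraints in a single command:

# Test local manifests against local Gatekeeper policies;
# gator prints any violations and exits non-zero when one is found.
gator test \
  --filename=deployments/nginx-deployment.yaml \
  --filename=policies/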

Pre-Validate

Pre-Validate uses tools like Gator, Kpt, and Kustomize to automatically run compliance tests on Kubernetes resources any time a changed resource is committed to a branch. It brings the policy validation step of a CI/CD pipeline to the beginning of the pipeline, which helps to empower developers by providing quick, consistent feedback on their work’s adherence to organization policies.

It works by writing a Bash script to Git’s “hooks” folder. The pre-commit script is automatically run by Git before a commit can go through (as the name implies). The script will locate changed Kubernetes resources, download and organize Gatekeeper policy templates and constraints, and then use Gator to test each and every resource against those templates and constraints.
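For illustration, a minimal pre-commit hook along these lines might look like the sketch below; the policies/ directory and the direct gator test call are assumptions for the example, not the exact contents of the Pre-Validate script:

#!/usr/bin/env bash
# .git/hooks/pre-commit (illustrative sketch, not the actual Pre-Validate script)
set -euo pipefail

# Assumed location of a local checkout of the organization's templates and constraints
POLICY_DIR="policies"

# Collect the Kubernetes YAML files staged for this commit
mapfile -t changed < <(git diff --cached --name-only --diff-filter=ACM -- '*.yaml' '*.yml')
if [ "${#changed[@]}" -eq 0 ]; then
  exit 0
fi

# Build the gator invocation: each changed manifest plus the policy directory
args=()
for f in "${changed[@]}"; do
  args+=(--filename="$f")
done
args+=(--filename="$POLICY_DIR")

# Block the commit if any constraint is violated
if ! gator test "${args[@]}"; then
  echo "Commit blocked: Gatekeeper policy violations found." >&2
  exit 1
fi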

Here’s what a typical CI/CD pipeline might look like with OSS policy validations throughout. In this case, we use Gatekeeper to define Constraints and ConstraintTemplates, and they are stored in a Git Repo that can be accessed by the pipeline (Jenkins, Google Cloud Build, etc.).

Typical CI/CD pipeline with policy validation steps included throughout.

Left-shift validation duplicates these validation steps in the developer’s local development environment, much earlier in the pipeline, like so:

Developer attempts commit, and Pre-Validate handles the rest!

Important Things to Note

Left-shift validation is an Enhancement, not a Replacement

We do not intend for left-shift validation to replace other automated policy control systems. Instead, this is a project that can be used to support developers who work on Kubernetes manifests by enhancing the delivery pipeline. Using left-shift validation means you learn whether your deployments are going to fail in seconds, rather than minutes or hours.

If you don’t use left-shift validation to shift left on automated policy checks, the only consequence is that it takes longer to find out you have a problem. That’s all!

The script in action! I submitted our developer’s problematic code…

Closing Thoughts

While “shifting left” may sound gimmicky, it provides a helpful framework for getting the most out of fully-managed cloud environments and all of the managed services that come along with them. Kubernetes has been around for some time now, but because the hardware is so abstracted away in many cases, it can be hard to truly understand the security implications of improperly managed resources.

Tools like Open Policy Agent and Gatekeeper are there to help organizations respond to these new environments, and they both make a great case for automated compliance. They depend on their communities, and their communities depend on them. Janine and I couldn’t be happier to be working so close to these worlds, and we both hope to continue our work for a long time to come.

Thanks for your time, and if you have any feedback for our project, please visit us on GitHub! We’re always looking for help and suggestions for improvement.

