GKE with NGINX Service Mesh 1

Implement NGINX Service Mesh with strict mutual TLS

Joaquín Menchaca (智裕)
18 min read · Sep 15, 2022


A service mesh offers automation to secure and control traffic within your Kubernetes infrastructure. With a service mesh, you get automatic mutual TLS, so that traffic is encrypted between services. It does this by injecting a sidecar container next to the container hosting your service.

NGINX is famous for its high-performance reverse proxy and load balancer, so I was curious to explore using this solution for a service mesh.

This tutorial has two parts:

  • Part 1: NSM (NGINX Service Mesh) for east-west traffic to secure traffic between services in the cluster
  • Part 2: NGINX-Plus Ingress Controller for north-south traffic to secure incoming traffic from outside of Kubernetes into the service mesh.

Unlike other tutorials that might use a hello-world or a useless book-info application, for this tutorial we’ll use a real-world application that can support both gRPC and HTTPS traffic.

The real-world application is Dgraph, a highly performant distributed graph database, along with a pydgraph client to load data into the Dgraph cluster.

📔 NOTE: This was tested with the versions below and may not work if your versions are significantly different:

  • Kubernetes API v1.22
  • kubectl v1.22
  • helm v3.8.2
  • helmfile v0.144.0
  • gcloud 402.0.0
  • gsutil 5.13
  • external-dns v0.12.2
  • cert-manager v1.9.1
  • NGINX Ingress Controller 2.3.0
  • NGINX Service Mesh 1.5.0
  • nginx-meshctl v1.5.0
  • Docker 20.10.17
  • Dgraph v21.03.2

What is a Service Mesh?

First, what is a service mesh anyways?

[The service mesh] is a tool for adding observability, security, and reliability features to “cloud native” applications by transparently inserting this functionality at the platform layer rather than the application layer. (ref)

Solo also describes service mesh as aiding “in communication between services or microservices, using a proxy.” (ref).

The proxy for each service will “improve efficiency and reliability of service requests, policies and configurations”, and “will have capabilities that include load balancing and fault injection” (ref). Through this proxy, “the service mesh can automatically encrypt communications and distribute security policies including authentication and authorization” (ref).

A service mesh often has an observability component, where aggregate telemetry data is collected to determine health (such as traffic and latency), along with distributed tracing and access logs.

Service Mesh Architecture

A service mesh has three planes:

  • Control plane: responsible for configuration and management (ref).
  • Data plane: provides network functions valuable to distributed applications (ref).
  • Observability plane: tools that enable further monitoring and visualization.

Prerequisites

These are the prerequisites for part 1.

Accounts

  • Sign up for an F5 account to download tools like nginx-meshctl and any other components.
  • Google Cloud account with ownership of a project where you can deploy resources (with a billing account linked to the project)

Knowledge Requirements

You should be familiar with, or have exposure to, the following concepts to get a more thorough understanding of this tutorial:

For Kubernetes, experience with deploying applications with service and ingress resources is useful, but even if you don’t have this, this guide will walk you through it. You will also configure KUBECONFIG to access the Kubernetes cluster with the Kubernetes client (kubectl) and install charts with Helm (helm), so familiarity with these tools is useful.

For Google Cloud, you should be familiar with the Google Cloud SDK (gcloud tool) for setting up an account and a project and provisioning resources. This is important, as there are cost factors involved in setting these things up.

Tool Requirements

  • Google Cloud SDK (gcloud command) to interact with Google Cloud
  • Kubernetes client (kubectl command) to interact with Kubernetes
  • Helm (helm command) to install Kubernetes packages
  • helm-diff plugin to see what will change before deployment.
  • helmfile (helmfile command) to automate installing many helm charts
  • Docker Engine (docker command) to build and push the pydgraph client image to Google Container Registry.
  • NSM command line tool (nginx-meshctl) is the tool used to deploy and interact with the service mesh, as well as to automatically inject the init and sidecar containers into Kubernetes manifests.
    NOTE: This tool is gated behind https://downloads.f5.com, which seems to have a lot of problems, so it is recommended that you get this tool early.

Tools (Recommended)

  • POSIX shell (sh) such as GNU Bash (bash) or Zsh (zsh): the scripts in this guide were tested using either of these shells on macOS and Ubuntu Linux.
  • GNU stream-editor (sed) and GNU grep (grep): the scripts were tested with these tools; the macOS/BSD equivalents may not work.
  • curl (curl): tool to interact with web servers from the command line.
  • jq (jq): a JSON processor that can transform and extract objects from JSON, as well as provide colorized JSON output for greater readability.
  • grpcurl (grpcurl): tool to interact with gRPC servers from the command line.

These tools can be installed with Homebrew on macOS and with Chocolatey and MSYS2 on Windows.

Project Setup

This will set up all the content for this tutorial.

Directory structure

We want to create this directory structure in your project area:

~/projects/nsm
├── clients
│   └── fetch_scripts.sh
├── dgraph
│   ├── dgraph_allow_lists.sh
│   └── helmfile.yaml
├── nsm
│   └── helmfile.yaml
└── o11y
    └── fetch_manifests.sh

In GNU Bash, you can create the above structure like this:

export PROJECT_DIR=~/projects/nsm
mkdir -p $PROJECT_DIR/{clients,dgraph,nsm,o11y}
cd $PROJECT_DIR
touch {nsm,dgraph}/helmfile.yaml \
o11y/fetch_manifests.sh \
dgraph/dgraph_allow_lists.sh \
clients/fetch_scripts.sh

Environment variables

These environment variables will be used in this project. Create a file called gke_env.sh with the contents below, changing values as appropriate, and then run source gke_env.sh.
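A minimal sketch of gke_env.sh is below; every value is an illustrative assumption to replace with your own (GKE_SA_EMAIL and GCR_PROJECT_ID are referenced by later commands in this guide):

# gke_env.sh — environment variables for this tutorial
# NOTE: all values below are illustrative assumptions; change them to your own
export GKE_PROJECT_ID="my-nsm-gke-project"   # project hosting the GKE cluster
export GCR_PROJECT_ID="my-nsm-gcr-project"   # project hosting the GCR images
export GKE_CLUSTER_NAME="nsm-demo"
export GKE_ZONE="us-central1-a"
export GKE_SA_NAME="gke-worker-nodes"
export GKE_SA_EMAIL="${GKE_SA_NAME}@${GKE_PROJECT_ID}.iam.gserviceaccount.com"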

Nginx Service Mesh command line tool

Visit https://downloads.f5.com and download the FREE command line tool for the open source NGINX Service Mesh called nginx-meshctl.

📔 NOTE: The F5 download site may take several hours after you register with your personal information before you receive instructions granting access to downloads. Even then, there may be other issues, so I suggest downloading this tool earlier rather than later.

Download the appropriate package for your platform (or future platforms you may want to use) to your Downloads folder:

From either macOS or Linux (tested with Ubuntu), you can install the downloaded binaries with the following commands:
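A sketch of the install, assuming the binary landed in ~/Downloads with a platform suffix (adjust the file name to match the actual download):

# make the binary executable and move it into your PATH
chmod +x ~/Downloads/nginx-meshctl_linux
sudo mv ~/Downloads/nginx-meshctl_linux /usr/local/bin/nginx-meshctl
# verify the install
nginx-meshctl version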

Google project setup

Two Google Cloud projects will be created to provision cloud resources. One project will contain the GKE cluster, and the other will host GCR for the container images.

The reason for using two projects is that environments with shared cloud resources, especially ones requiring different levels of access, often use multiple projects. This makes it easier to manage cloud resources, especially with regard to security configuration.

You can set this up in the web console, or by typing these commands:
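A sketch of the command-line route, assuming the gke_env.sh variables above and a hypothetical $BILLING_ACCOUNT_ID (find yours with gcloud billing accounts list):

source gke_env.sh
# create the two projects
gcloud projects create $GKE_PROJECT_ID
gcloud projects create $GCR_PROJECT_ID
# link the billing account so resources can be provisioned
gcloud billing projects link $GKE_PROJECT_ID --billing-account $BILLING_ACCOUNT_ID
gcloud billing projects link $GCR_PROJECT_ID --billing-account $BILLING_ACCOUNT_ID
# enable the required APIs
gcloud services enable container.googleapis.com --project $GKE_PROJECT_ID
gcloud services enable containerregistry.googleapis.com --project $GCR_PROJECT_ID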

Provision cloud resources

These instructions will create the necessary cloud resources for this project.

Provision Google Kubernetes Engine cluster

The steps below will allow you to bring up a Kubernetes cluster with 3 worker nodes.
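A minimal sketch of the provisioning, assuming the gke_env.sh variables and a least-privilege service account for the worker nodes:

source gke_env.sh
# create a least-privilege service account for the worker nodes
gcloud iam service-accounts create $GKE_SA_NAME --project $GKE_PROJECT_ID
for ROLE in roles/logging.logWriter \
  roles/monitoring.metricWriter \
  roles/monitoring.viewer; do
  gcloud projects add-iam-policy-binding $GKE_PROJECT_ID \
    --member "serviceAccount:$GKE_SA_EMAIL" --role $ROLE
done
# bring up a zonal cluster with 3 worker nodes
gcloud container clusters create $GKE_CLUSTER_NAME \
  --project $GKE_PROJECT_ID \
  --zone $GKE_ZONE \
  --num-nodes 3 \
  --service-account $GKE_SA_EMAIL
# merge cluster credentials into your KUBECONFIG
gcloud container clusters get-credentials $GKE_CLUSTER_NAME \
  --project $GKE_PROJECT_ID --zone $GKE_ZONE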

You can test access to the cluster as well as the components installed with:

kubectl get nodes
kubectl get all --all-namespaces

Other useful commands on a new cluster show how many resources are available and how much is consumed:

kubectl top nodes
kubectl top pods --all-namespaces

Provision Google Container Registry

GCR (Google Container Registry) is implicitly available to all projects, provided the APIs are enabled, and it leverages GCS (Google Cloud Storage) buckets to store the images.

To set up local Docker access, so that you can push images to GCR:

gcloud auth configure-docker

We also need to allow GKE to access the images stored on GCS, which can be done with this command:

source gke_env.sh
gsutil iam ch \
  serviceAccount:$GKE_SA_EMAIL:objectViewer \
  gs://artifacts.$GCR_PROJECT_ID.appspot.com

This will grant read access to all containers running on the GKE cluster. Only the operator (or owner) has write access, using the credentials from their workstation (such as a development laptop), which were initially created with the gcloud auth command when first setting up the Google Cloud SDK.

Deploy NSM with Observability

NGINX provides some example manifests to install o11y (observability) tools with NSM. You can deploy these basic observability tools (Jaeger, Prometheus, Grafana, Otel Collector) with the instructions below.

Using Raw Manifests (Not Preferred)

The online tutorial for this area instructs readers to download several manifests and install them using kubectl. If you want to use this method, you can do so by running the following commands in GNU Bash (bash):
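A sketch of the approach; the repository path and manifest names below are assumptions, so substitute the URLs from the NSM observability tutorial:

kubectl create namespace nsm-monitoring
# location of the example manifests (assumption)
NSM_REPO=https://raw.githubusercontent.com/nginxinc/nginx-service-mesh/main
for MANIFEST in otel-collector prometheus grafana jaeger; do
  kubectl apply --namespace nsm-monitoring \
    --filename "$NSM_REPO/examples/$MANIFEST.yaml"
done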

Using ad-hoc Helm Charts (Preferred)

Alternatively, if you would like to install all of these as Helm charts embedded into a single helmfile.yaml, you can do that with the script below.

Save the following below as o11y/fetch_manifests.sh:
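The original script is not reproduced here; a sketch of the idea — download each example manifest (the URL and file names are assumptions), wrap each one into an ad-hoc local chart, and generate a helmfile.yaml that references them:

#!/usr/bin/env bash
# fetch_manifests.sh (sketch): wrap NSM o11y example manifests into ad-hoc charts
set -euo pipefail

NSM_REPO="https://raw.githubusercontent.com/nginxinc/nginx-service-mesh/main"
COMPONENTS="otel-collector prometheus grafana jaeger"

echo "releases:" > helmfile.yaml
for NAME in $COMPONENTS; do
  mkdir -p "charts/$NAME/templates"
  # fetch the example manifest into the chart's templates directory
  curl -s "$NSM_REPO/examples/$NAME.yaml" -o "charts/$NAME/templates/$NAME.yaml"
  # minimal Chart.yaml so Helm treats the directory as a chart
  printf 'apiVersion: v2\nname: %s\nversion: 0.1.0\n' "$NAME" > "charts/$NAME/Chart.yaml"
  # append a release entry referencing the local chart
  cat >> helmfile.yaml <<EOF
  - name: $NAME
    namespace: nsm-monitoring
    createNamespace: true
    chart: ./charts/$NAME
EOF
done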

Note that this script requires GNU Bash (bash), the GNU stream-editor (sed), and GNU grep (grep) to run.

When ready, run the following:

source gke_env.sh
pushd o11y && ./fetch_manifests.sh && helmfile apply && popd

You can verify the helm charts were deployed by typing:

helm ls --namespace nsm-monitoring

You can verify the components are running by typing:

kubectl get all --namespace nsm-monitoring

You should see something like this:

Deploy NGINX service mesh

The NGINX Service Mesh components (SPIRE, NATS, nginx-mesh-api, and others) can be installed by running the steps below.

Save the following at nsm/helmfile.yaml:
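The original file is not reproduced here; a minimal sketch, assuming the public NGINX Helm repository and chart name, with mtls.mode set to strict (the telemetry block is an assumption that wires tracing to the o11y collector deployed above):

repositories:
  - name: nginx-stable
    url: https://helm.nginx.com/stable

releases:
  - name: nsm
    namespace: nginx-mesh
    createNamespace: true
    chart: nginx-stable/nginx-service-mesh
    values:
      - mtls:
          mode: strict          # enforce mutual TLS for all meshed traffic
        telemetry:              # assumption: points at the otel-collector above
          samplerRatio: 1.0
          exporters:
            otlp:
              host: otel-collector.nsm-monitoring.svc.cluster.local
              port: 4317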

When ready to deploy, you can install NSM with the following command:

helmfile --file nsm/helmfile.yaml apply

You can check the status with:

kubectl get all --namespace nginx-mesh

This should look something like:

Observability dashboards

You can visit the various dashboards for observability components through the following steps.

Prometheus dashboard

Run the following in a separate terminal tab:

source gke_env.sh
kubectl --namespace nsm-monitoring port-forward svc/prometheus 9090

You can view it at http://localhost:9090.

Grafana Dashboard

Run the following in a separate terminal tab:

source gke_env.sh
kubectl --namespace nsm-monitoring port-forward svc/grafana 3000

You can view it at http://localhost:3000.

Jaeger Dashboard

Run the following in a separate terminal tab:

source gke_env.sh
kubectl --namespace nsm-monitoring port-forward svc/jaeger 16686

You can view it at http://localhost:16686.

Deploy Dgraph

Dgraph, the highly performant distributed graph database, will be deployed with the service mesh side-cars manually injected.

Create the following file below as dgraph/helmfile.yaml:
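Again, the original file is not reproduced; a minimal sketch, assuming the public Dgraph chart (whose HA defaults run 3 Zeros and 3 Alphas) pinned to the Dgraph version from the list at the top:

repositories:
  - name: dgraph
    url: https://charts.dgraph.io

releases:
  - name: dgraph
    namespace: dgraph
    chart: dgraph/dgraph
    values:
      - image:
          tag: v21.03.2   # match the Dgraph version tested in this guide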

This cannot be installed directly, as we first need to inject the sidecar proxy containers into the deployment manifests.

We also want to ignore or skip traffic between Dgraph cluster members on ports 5080 and 7080, as NSM otherwise runs into SSL_do_handshake errors with gRPC.

We can do all of this with the commands below:
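A sketch of the render-inject-apply pipeline, using nginx-meshctl's port-exclusion flags to skip Dgraph's internal peer ports:

source gke_env.sh
kubectl create namespace dgraph
# render the chart, inject the init/sidecar containers while skipping
# Dgraph's peer ports (5080, 7080), then apply the result
helmfile --file dgraph/helmfile.yaml template \
  | nginx-meshctl inject \
      --ignore-incoming-ports 5080,7080 \
      --ignore-outgoing-ports 5080,7080 \
  | kubectl apply --namespace dgraph --filename -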

This will deploy 3 Dgraph Alpha nodes and 3 Dgraph Zero nodes, along with the proxy sidecars that secure traffic for anything communicating with the service endpoints within the Kubernetes cluster.

You can verify the components are installed with:

kubectl get all --namespace dgraph

You should see something like this:

Notice that 2 of 2 containers are ready: dgraph-dgraph-alpha for the Dgraph container and nginx-mesh-sidecar for the proxy.

Deploy Pydgraph clients

For this exercise, the pydgraph client will be deployed twice: once without sidecar proxy injection (negative test) and again with sidecar proxy injection (positive test).

The purpose of this is to demonstrate that strict mode is working, so that any clients outside of the service mesh should NOT be able to communicate with the Dgraph service.

Clients that ARE part of the service mesh will be able to securely communicate with the Dgraph service.

Download Scripts

In this section, we’ll fetch some scripts used to build and publish the client image:

You can download them by running the following script:
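A sketch of clients/fetch_scripts.sh; the repository URL below is a hypothetical placeholder for the blog's source code repo (see References):

#!/usr/bin/env bash
# fetch_scripts.sh (sketch): download the pydgraph example files
# NOTE: REPO below is a hypothetical placeholder; point it at the real repo
REPO="https://raw.githubusercontent.com/EXAMPLE/blog-source/main/pydgraph"
FILES="Dockerfile Makefile helmfile.yaml load_data.py requirements.txt sw.nquads.rdf sw.schema"

mkdir -p examples/pydgraph
for FILE in $FILES; do
  curl -s "$REPO/$FILE" -o "examples/pydgraph/$FILE"
done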

Running the above should create the following directory structure:

./clients/
└── examples
    └── pydgraph
        ├── Dockerfile
        ├── Makefile
        ├── helmfile.yaml
        ├── load_data.py
        ├── requirements.txt
        ├── sw.nquads.rdf
        └── sw.schema

Build and publish the client image

For this part, it is important that Docker login to GCR is enabled and that read access is granted to the principal identity used by the worker nodes (GCE). Both of these were handled in an earlier step.

When ready, run the following:

source gke_env.sh
pushd ./clients/examples/pydgraph
make build
make push
popd

Negative Test: client is outside of the Mesh

This small exercise will do a negative test, where only the server is on the service mesh, which is set to strict mode.

When ready, run the following command to deploy the client for a negative test:

helmfile \
--file ./clients/examples/pydgraph/helmfile.yaml \
--namespace "pydgraph-no-mesh" \
apply

Run the following to exec into the client container:

export CLIENT_NAMESPACE="pydgraph-no-mesh"PYDGRAPH_POD=$(
kubectl get pods --namespace $CLIENT_NAMESPACE --output name
)
kubectl exec -ti --namespace $CLIENT_NAMESPACE \
${PYDGRAPH_POD} -- bash

Afterward, once in the container, you can run the following commands to test HTTP and gRPC connections.

Test that HTTP will fail with the following command:

curl --silent -v ${DGRAPH_ALPHA_SERVER}:8080/health

This should show something like the following:

Since the connection expects HTTPS, we can try that as well:

curl  --silent -k https://${DGRAPH_ALPHA_SERVER}:8080/health

This will show the following failure, where the server expects a client certificate to be sent, hence mutual TLS:

Test that gRPC will fail with the following command:

# test gRPC connection
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

This should show something like this:

Remove the -plaintext flag so that h2 traffic is used, not h2c (HTTP/2 in clear-text):

grpcurl -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

This should show something like:

Now attempt to upload data to Dgraph with the following command:

# Load Data with pydgraph-client
python3 load_data.py \
--plaintext \
--alpha ${DGRAPH_ALPHA_SERVER}:9080 \
--files ./sw.nquads.rdf \
--schema ./sw.schema

These should all time out or fail immediately with something like this:

When finished, log out of the container:

logout

Positive Test: client is part of the Mesh

This small exercise will do a positive test where both the client and server are on the same service mesh.

For this section, we have to inject the sidecar proxy with the following command:
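A sketch that mirrors the negative-test deployment, but pipes the rendered manifests through nginx-meshctl inject before applying them:

kubectl create namespace pydgraph-client
helmfile \
  --file ./clients/examples/pydgraph/helmfile.yaml \
  --namespace "pydgraph-client" \
  template \
  | nginx-meshctl inject \
  | kubectl apply --namespace "pydgraph-client" --filename -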

Afterwards, exec into the client with the following command:

export CLIENT_NAMESPACE="pydgraph-client"PYDGRAPH_POD=$(
kubectl get pods --namespace $CLIENT_NAMESPACE --output name
)
kubectl exec -ti --container "pydgraph-client" \
--namespace
$CLIENT_NAMESPACE \
${PYDGRAPH_POD} -- bash

Once inside the container, run these commands to test HTTP and gRPC connections.

Test that HTTP works with this command:

curl --silent ${DGRAPH_ALPHA_SERVER}:8080/health | jq

This should show something like this:

Test that gRPC works with this command:

# test gRPC connection
grpcurl -plaintext -proto api.proto \
${DGRAPH_ALPHA_SERVER}:9080 \
api.Dgraph/CheckVersion

You should see something like this:

Finally upload some data with this command:

# Load Data with pydgraph-client
python3 load_data.py \
--plaintext \
--alpha ${DGRAPH_ALPHA_SERVER}:9080 \
--files ./sw.nquads.rdf \
--schema ./sw.schema

These commands are expected to succeed and return results. If you have Grafana open through the port-forward command introduced earlier, you should see some new traffic generated.

When finished, log out of the container:

logout

What’s Next

The next article covers how to add an ingress controller using KIC (NGINX and NGINX Plus Ingress Controllers for Kubernetes) that integrates with the service mesh, so that traffic between the ingress controller and the service mesh goes through mTLS.

Clean up

To clean up the existing resources, follow the steps below.

Kubernetes components
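The original steps are not shown here; a sketch, assuming the namespaces and releases used above:

# remove the clients and the Dgraph cluster
kubectl delete namespace pydgraph-client pydgraph-no-mesh dgraph
# remove the mesh control plane
nginx-meshctl remove
# remove the observability components
pushd o11y && helmfile destroy && popd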

Cloud Resources
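A sketch of the cloud cleanup, assuming the gke_env.sh variables:

source gke_env.sh
# delete the GKE cluster
gcloud container clusters delete $GKE_CLUSTER_NAME \
  --project $GKE_PROJECT_ID --zone $GKE_ZONE
# delete the GCS bucket backing the GCR images
gsutil rm -r gs://artifacts.$GCR_PROJECT_ID.appspot.com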

References

These are resources that I came across when researching this topic.

Blog Source Code

These are further notes and code that I created while testing the NGINX Service Mesh solution.

NGINX Service Mesh Documentation

How-Tos and other articles

NSM vs other F5 solutions

Before NSM, NGINX paired with Istio to use NGINX as an alternative proxy, and F5 also offers an alternative solution built around Envoy instead of NGINX.

Concepts

Conclusion

I hope this journey exploring NSM (NGINX Service Mesh) was useful. As I have tried out other service meshes before, specifically Linkerd and Istio, I had some observations on NSM; see The Mini Review below.

Take-Aways

Besides the concepts presented in this article, there are some other implicit take-aways from the scripts and instructions used in this guide. These are some of the take-aways covered in this tutorial:

  • Provisioning GKE with the principle of least privilege (gcloud)
  • Granting read access to GCR without configuring secrets (gsutil)
  • Building and Publishing a container image on GCR (docker).
  • Deploying applications with Helmfile (helmfile) and Helm charts.
  • Interacting with Kubernetes: kubectl, helm
  • Testing traffic using both gRPC (grpcurl) and HTTP (curl)
  • Exposure to command line tools: bash, sed, grep, jq
  • Exposure to Dgraph distributed graph database
  • Exposure to using pydgraph library to access Dgraph via gRPC.
  • Deploying NGINX Service Mesh with mTLS set to strict mode.

The Mini Review: Observations

I think this was a fun learning experience exploring an alternative service mesh, but I do have some observations about NSM in its current state:

  • FREE Tools locked behind License Wall: the FREE nginx-meshctl tool for the open source mesh is locked behind a license wall on the F5 site that has been problematic and causes a lot of frustration. It should be released as open source and made publicly available.
  • NSM is not free: NGINX Service Mesh requires a commercial NGINX Plus license to integrate the ingress controller with NSM (covered in part 2).
  • Future of NSM: F5 is actively promoting another solution around Envoy as an alternative to NGINX solutions.
  • All or Nothing with Injection: Mixed mesh and non-mesh is not easily supported, as auto-injection is controlled centrally, rather than scanning for annotations that are conditionally configured.
  • All or Nothing with the NGINX Ingress Controller: When integrating with the NGINX Ingress Controller (covered in part 2), ingress to pods outside of the mesh will not work. Only pods that have the side-car proxy can use the ingress controller.

Also, I found several issues that came up when trying NSM, detailed in the observations below.

Observations: FREE Tools locked behind License Wall

The instructions direct users to free tools (nginx-meshctl) and images that are locked behind the myF5 download site, which, at the time of this writing, had numerous issues with registration and availability.

All of this adds an extra layer of unnecessary complexity and friction. Other service mesh solutions, like Linkerd and Istio, and also F5’s AspenMesh, do not have this issue, as their free tools are open source and publicly available.

Observations: NSM is not free

Integration with an ingress (covered in part 2) REQUIRES a commercial license for NGINX Plus, and even the free trial requires a business email. So if you are not part of an organization, you’ll need to register a DNS domain with an e-mail provider (or roll your own), configure MX records, DKIM, etc.

In the competitive landscape, other service meshes like Linkerd, Istio, Consul, Kuma, and even F5’s AspenMesh, support integration with an ingress that is open source, freely downloadable, or does not require a commercial subscription to use.

Observations: Future of NSM

Currently, F5, the owner of NGINX, is promoting an alternative solution called AspenMesh that is built around Envoy instead of NGINX.

Let me re-emphasize that,

F5, the owner of NGINX, is promoting Envoy, not NGINX.

Earlier, about three to four years ago, NGINX experimented with nginMesh, which is Istio with Envoy swapped out for NGINX. So it was a surprise that F5 would still push an Envoy-based service mesh, rather than figure out how to use NGINX with Istio to at least promote NGINX across the board.

In the area of actually promoting the NGINX Service Mesh solution, despite the phenomenal popularity of NGINX, NSM doesn’t seem to be all that popular (the GitHub project has 70 stars ⭐️⭐️), and it is not proactively promoted, in contrast to other open source projects like Calico, Istio, ArgoCD, Meshery, etc. You won’t find recent material, such as workshops, tutorials, or articles, for NSM. The exception being this article, of course.

The current promotional material for NSM posted recently on Twitter is recycled content more than a year old, and that content has low interest, e.g. no likes and fewer than a hundred views for their videos. There’s little response through the Twitter channel, and the sales channels require you to register a work e-mail (likely to generate sales leads), or there’s paid commercial support. For the trial, business development offered support, but didn’t follow up on questions. On Reddit, fortunately, I did find more proactive employees engaged with the community, and I have been able to get some interaction when reporting bugs through GitHub issues.

Observations: All or Nothing with Injection

One thing that I found a little unusual, in contrast to other service mesh implementations, is that NSM will automatically inject sidecars into ALL Deployments, StatefulSets, Pods, etc. deployed onto the Kubernetes cluster, instead of just the objects that are flagged for injection with annotations.

You can customize this auto-injection configuration centrally by configuring autoInjection.disabledNamespaces or autoInjection.enabledNamespaces in the Helm chart values, or simply disable it completely with autoInjection.disable=true.
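For illustration, the Helm values might look like this sketch (keys as named in this article; verify them against the chart values for your NSM version):

# excerpt of NSM Helm chart values (illustrative)
autoInjection:
  disable: false
  disabledNamespaces:
    - pydgraph-no-mesh   # hypothetical: leave this namespace out of the mesh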

You can do manual injection, automated through the nginx-meshctl tool. This is actually required if you need to exclude some TCP ports for either inbound or outbound traffic, which I definitely had to do due to the SSL handshake errors I encountered.

The unusual thing about this injection is that, unlike other service mesh implementations, where you add annotations at the namespace or pod level so that auto-injection can scan for them and inject the sidecars, with nginx-meshctl inject you are actually injecting the sidecar containers into the Pod specification.

This makes manual injection outside of nginx-meshctl automation difficult to support. If it were just a matter of an annotation, this could be done easily.

Observations: All or Nothing with the NGINX Ingress controller

When integrating the NGINX Ingress controller (the next article will cover this), the ingress will function ONLY for pods that are part of the service mesh; otherwise, ingress will not work (ref. Ingress disabled for non-mesh traffic).

The documentation confirms this:

“All communication between NGINX Plus Ingress Controller and the upstream Services occurs over mTLS, using the certificates and keys generated by the SPIRE server. Therefore, NGINX Plus Ingress Controller can only route traffic to Services in the mesh that have an mtls-mode of permissive or strict. In cases where you need to route traffic to both mTLS and non-mTLS Services, you may need another Ingress Controller that does not participate in the mTLS fabric.” (ref)

With this all-or-nothing integration with the ingress, you will have to compromise security to use it, because you will have to add ALL services to the service mesh. Working around this issue adds further complexity and expense. Here are some possible solutions to the problem created by the ingress implementation:

  • deploy two ingress controllers: one integrated with the mesh and one that is not.
  • use network policies, such as with the Calico network plugin, to restrict traffic.
  • configure the mesh to deny all traffic, and add exceptions by configuring the SMI Spec feature for traffic access control, which is currently in alpha.

Note, for the latter option, I tested this and found it only works for HTTP; gRPC traffic still gets through: https://github.com/nginxinc/nginx-service-mesh/issues/76.

Finally

As a student learning about service meshes, I do think it is worth exploring this solution, as there are good example configurations for some of the open source platforms that are integrated into NSM.

For those heavily invested commercially in NGINX Plus, this could be a simple way to get started with service meshes; if you already have an existing NGINX Plus license, NSM could be a nice fit.
