Unleashing the Power of Cilium CNI to Propel Trendyol’s Performance Up to 40%!

Trendyol implemented Cilium as the default CNI for its Kubernetes clusters starting from version 1.26. Discover our journey.

Emin Aktaş
Trendyol Tech

--


Written by Asım Sezai Ceylan and Emin Aktaş

Introduction

In this article, we explore the reasons behind our choice of Cilium, as well as the configuration process we undertook to enable it across our many large Kubernetes clusters.

At Trendyol, we had been relying on a rather simple Container Network Interface (CNI), Flannel, which provides basic pod-network connectivity. Recognizing the need to enhance our clusters and unlock new capabilities, we sought out a more advanced solution.

Our infrastructure had been efficiently managed by our network and traffic teams, which for a long time made advanced CNIs unnecessary. However, as our Kubernetes clusters grew in size and complexity, we recognized the need to upgrade our network infrastructure and explore alternatives such as Cilium.

Cilium: Beyond Simple Pod Connectivity

Cilium transcends the role of a typical CNI by encompassing a plethora of features. While delving into the extensive range of capabilities is beyond the scope of this article, it is crucial to highlight that Cilium is an open-source CNI tool that empowers us with enhanced security, policy management, visibility, and mesh technologies. Leveraging the power of eBPF (extended Berkeley Packet Filter), Cilium presents us with a robust framework to bolster our network infrastructure.

Evaluating Network Performance

To comprehensively assess network performance in various cloud environments, we established two distinct testbeds:

  1. OpenStack: A test environment specific to the OpenStack cloud.
  2. Bare Metal: A test environment utilizing bare metal servers.

To cover all aspects of generic pod traffic, we designed three types of tests:

  1. Localhost Tests: These tests were conducted within a single pod, between two containers.
  2. Same-Host Tests: These tests involved communication between two containers residing on different nodes within the same bare metal server.
  3. Different-Host Tests: These tests focused on communication between containers located on separate bare metal nodes or VMs.

Before executing each test type, we conducted node-to-node tests to establish baseline values. Our benchmark tool of choice is netperf.

To cover all conceivable network types, we performed multiple test scenarios, including:

  1. Stream Tests: These tests were instrumental in evaluating the throughput performance of pods and nodes.
  2. RR tests (Request-Response): These tests allowed us to assess the packet per second and latency performance of pods and nodes.
  3. CRR tests (Connect-Request-Response): By utilizing this scenario, we could evaluate the New Connection Per Second performance of pods and nodes.
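The three scenarios above map onto netperf's built-in test types. Below is a minimal sketch of how we might invoke them; the server address is a placeholder, a netserver instance must be listening on it, and the durations follow the values stated in this article (120 seconds for stream tests, 75 seconds for the others):

```shell
# Sketch of the three netperf scenarios. SERVER is a placeholder address;
# netserver must be running on it for these commands to work.
SERVER="10.0.0.2"

stream_test() {   # throughput (upload direction), one run per TCP message size
  for size in 128 1K 8K; do
    netperf -H "$SERVER" -t TCP_STREAM -l 120 -- -m "$size"
  done
}

rr_test() {       # request-response: packets per second and latency
  netperf -H "$SERVER" -t TCP_RR -l 75
}

crr_test() {      # connect-request-response: new connections per second
  netperf -H "$SERVER" -t TCP_CRR -l 75
}
```

Each function prints netperf's standard result table, which we then parsed into our results database.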

By employing this comprehensive testing approach, we gained valuable insights into the network performance of our infrastructure.

Based on our overall results, even without enabling advanced features like XDP, the default implementation of Cilium already showed improvements in network connectivity.

Test durations were as follows: 120 seconds for stream tests and 75 seconds for RPS tests. Also, CPU pinning was enabled during the tests.

Stream Tests:

Our focus was on measuring throughput from the client to the server, similar to upload scenarios. We experimented with multiple TCP message sizes (128, 1K, and 8K) to explore various results.

RPS Tests:

Cilium managed to outperform its peers in key performance indicators. Here are Cilium’s notable differences based on our test results:

  1. Throughput Tests (OpenStack): In comparison, Cilium showcased a 41.6% increase in throughput rates.
  2. Throughput Tests (Baremetal): Cilium outperformed by 39.6% in throughput tests conducted on physical servers.
  3. Packet per Second (PPS) Tests: In PPS tests, Cilium exhibited a 12% improvement.
  4. Latency Tests: While all CNIs exhibited similar performance in latency tests, Cilium boasted a 3.59% improvement in latency compared to its counterparts.

These benchmark results clearly illustrate the performance capabilities of Cilium, confirming its suitability for our specific needs. We should also note that Flannel provided us with only simple network connectivity, whereas Cilium offers visibility and security features along with many others.

Challenges

Throughout the testing process, we encountered several challenges. Here are the key challenges we faced and how we addressed them:

  1. Finding Reasonable Test Times: Initially, test durations were excessively long, impacting the number of tests that could be conducted within a given timeframe. To overcome this, we optimized and streamlined the test scenarios, reducing the test duration from 124 hours to a significantly shorter 7.5 hours.
  2. Non-Isolated Environments: We intended all test environments to be isolated. However, due to technical limitations, one of the test environments remained connected to public pools. This caused unintended interference from public nodes, affecting that environment’s results.
  3. Graphical Representations: The data output from the tests did not follow a structured time series format, posing challenges in creating meaningful charts and visual representations. To address this, we leveraged a multi-table database with PSQL, organizing the test results into separate tables based on each test method. This allowed for effective analysis and visualization of the data.
  4. Container Runtime Problems: In some test scenarios, we observed better performance in Different-Host Tests than in Same-Host Tests. To troubleshoot this issue, we conducted additional tests using a different container runtime, Docker. The results confirmed our expectations, with Same-Host Tests exhibiting better throughput.
  5. CNI Know-How: Working with multiple CNIs, each with its own specific configurations and modes, required technical expertise. This led to extensive learning sessions, involving trial and error to understand the behaviors and outcomes of different CNI modes.

Move Away from kube-proxy

In Kubernetes, kube-proxy utilizes iptables or ipvs mode to route services to assigned pods, enabling load balancing through a Round Robin scheme. Additionally, CNIs rely on iptables for implementing network policies. However, the use of kube-proxy can present challenges and inefficiencies. eBPF has emerged as an alternative that solves the following issues by completely replacing kube-proxy:

  • Resource Consumption: The continuous addition of rules to iptables or ipvs for new services, pods, or nodes can result in increased resource consumption. Over time, the accumulation of rules can lead to network overhead and resource burden.
  • Management Complexity: Writing network policies as sets of iptables rules can be challenging to manage. The structure of rules makes troubleshooting cumbersome and can slow down the overall process.
  • Scalability Issues: As the number of services grows, so does the resource consumption and overhead associated with iptables chains.

To address these challenges, Cilium has transitioned away from kube-proxy in most of its processes. Instead, Cilium leverages eBPF hooks to handle service routing and network policies. As a result, Cilium achieves enhanced efficiency, improved performance, and scalability in managing network traffic and policies.
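A quick way to confirm that the replacement is in effect is to query the agent's status. This is a sketch; it assumes the default install layout (a DaemonSet named cilium in kube-system):

```shell
# Sketch: verify that Cilium's eBPF kube-proxy replacement is active.
# Assumes the default DaemonSet name "cilium" in the kube-system namespace.
check_kube_proxy_replacement() {
  kubectl -n kube-system exec ds/cilium -- \
    cilium status --verbose | grep -i "KubeProxyReplacement"
}
```

With the configuration shown later in this article, the reported mode should be Strict.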

NodeLocal DNSCache Deployment Considerations

If you intend to utilize NodeLocal DNSCache, it is important to make some adjustments to the deployment method. Due to the inability of Cilium to track non-Cilium routing, certain features like observability may be lost.

Fortunately, Cilium has introduced the Local Redirect Policy, currently in beta. There have been discussions about promoting it to the general availability (GA) phase. So far, we have not encountered any issues with its implementation.
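For illustration, a Local Redirect Policy for NodeLocal DNSCache can look like the following sketch, based on the Cilium documentation. The node-local-dns label and the kube-dns service name are assumptions that depend on your deployment:

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
  name: nodelocaldns
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      # Traffic addressed to this service is redirected node-locally.
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP
```

With this policy in place, DNS traffic to the cluster DNS service is redirected to the node-local-dns pod on the same node, while remaining visible to Cilium.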


To get started, you can refer to this link here. However, please note that in our example, we are deploying with a network interface that is introduced to the cluster.

During the deployment process, pay attention to the following variables:

  • __PILLAR__CLUSTER__DNS__: This will be populated with the kubedns IP. If necessary, make changes accordingly.
  • __PILLAR__UPSTREAM__SERVERS__: If left unchanged, it will default to “/etc/resolv.conf”. Modify this variable as needed.

In the DaemonSet file, replace “<kube-dns IP address>” with the actual IP address of your kube-dns service. To retrieve the IP address, use the following command:

# In our cluster, we are using coredns so, we are getting coredns service IP.
$ kubedns=$(kubectl get svc coredns -n kube-system -o jsonpath={.spec.clusterIP})
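The placeholder can then be substituted into the manifest with sed. A minimal sketch follows; the sample line and the IP value are illustrative, and in practice you would run sed -i against the downloaded DaemonSet manifest:

```shell
# Sketch: substitute the kube-dns service IP into the placeholder used by
# the NodeLocal DNSCache manifest. Shown on a sample line for illustration.
kubedns="10.96.0.10"  # illustrative value; fetch yours with the command above
echo "forward . __PILLAR__CLUSTER__DNS__" \
  | sed "s/__PILLAR__CLUSTER__DNS__/$kubedns/g"
```

This prints `forward . 10.96.0.10`, confirming the substitution.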

You can access the manifests here.

Cilium Configuration

Thanks to Cilium’s exceptional out-of-the-box performance, the complex configuration is a thing of the past. With Cilium, you can enjoy a seamless experience without the need for intricate setup or configuration.

# Security context to be added to agent pods
securityContext:
  privileged: false
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
operator:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
enableK8sEventHandover: false # Set to true when etcd is enabled.
identityAllocationMode: "crd" # Set to kvstore when etcd is enabled.
bandwidthManager:
  enabled: true
  # BBR requires a v5.18.x or newer version of the Linux kernel
  bbr: false
localRedirectPolicy: true
endpointStatus:
  enabled: true
  status: "policy health controllers log state"
extraArgs:
  - --allow-localhost=policy
encryption:
  enabled: false
  nodeEncryption: false
ipam:
  mode: kubernetes
# If true, enables the sidecar-free service mesh
ingressController:
  enabled: false
# Envoy runs in the cilium pod along with the cilium agent for the sidecar-free design.
# If Envoy runs in the cilium pod, proxy.prometheus.enabled should be set to true to collect metrics.
proxy:
  prometheus:
    enabled: false
# For the replacement of kube-proxy, the value is set to strict.
kubeProxyReplacement: strict
# This provides access to the kube-apiserver over the nodes for Cilium to run.
k8sServiceHost: "127.0.0.1"
k8sServicePort: "6443"
# Enables eBPF masquerading and bypasses legacy host routing
bpf:
  masquerade: true
  hostLegacyRouting: false
hubble:
  enabled: true
  enableOpenMetrics: true
  metrics:
    enabled:
      - dns:query
      - drop:sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity
      - tcp
      - icmp
      - port-distribution
      - flow:sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity
      - httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction;sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity
    serviceMonitor:
      enabled: true
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: grpc_server_.*
          action: drop
  peerService:
    clusterDomain: cluster.local
  relay:
    enabled: true
    # Hubble Relay only provides server metrics,
    # for example, grpc_server_started_total and grpc_server_msg_received_total
    prometheus:
      enabled: false
      port: 9966
      serviceMonitor:
        enabled: false
    tolerations:
      - operator: Exists
  ui:
    enabled: true
    tolerations:
      - operator: Exists
tunnel: vxlan
# https://cilium.io/blog/2020/06/22/cilium-18/#socketlb
socketLB:
  enabled: true
  # Required for Istio to work properly.
  hostNamespaceOnly: true
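Values like the above are applied through Cilium's Helm chart. A minimal sketch follows; the release name, namespace, and values file path are assumptions, and you should pin the chart version you have validated:

```shell
# Sketch: install or upgrade Cilium with a values file like the one above.
# Release name, namespace, and values.yaml path are assumptions.
install_cilium() {
  helm repo add cilium https://helm.cilium.io/
  helm repo update
  helm upgrade --install cilium cilium/cilium \
    --namespace kube-system \
    --values values.yaml
}
```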

The Testbed

We have devised and tested various scenarios based on our environment, with the CNI Benchmark: Understanding Cilium Network Performance article serving as a valuable reference. With Terraform, Ansible, and network tools, we created an automated system to test different CNI and Linux configurations to decide which is best for our environment.

Our network test script is available for you to access here. This script enabled us to automate testing across various scenarios, generating results in a format that can be easily represented in Grafana.

OpenStack:

System:
Host: benchmark-test-node Kernel: 5.4.0-1046-kvm x86_64
bits: 64 Console: N/A Distro: Ubuntu 20.04.3 LTS (Focal Fossa)
Machine:
Type: Kvm System: Trendyol product: HPC v: 19.3.2
serial: **omitted**
Mobo: N/A model: N/A serial: N/A BIOS: SeaBIOS v: 1.10.2-1ubuntu1
date: 04/01/2014
CPU:
Topology: 8x Single Core model: Intel Xeon Platinum 8358 bits: 64
type: MT SMP L2 cache: 128.0 MiB
Speed: 2594 MHz min/max: 16 Cpu
Network:
Device-1: Intel 82371AB/EB/MB PIIX4 ACPI type: network bridge driver: N/A
Device-2: Red Hat Virtio network driver: virtio-pci
IF: ens3 state: up speed: -1 duplex: unknown mac: fa:16:5e:dc:26:de
IP v4: **omitted** type: dynamic scope: global
Memory: 31.37 GiB used: 371.7 MiB (1.2%)

Baremetal:

System:
Host: benchmark-test-node Kernel: 5.13.0-48-lowlatency x86_64 bits: 64
Console: N/A Distro: Ubuntu 20.04.4 LTS (Focal Fossa)
Machine:
Type: Server Mobo: HPE model: ProLiant DL360 Gen10 Plus
serial: **omitted** UEFI: HPE v: U46 date: 11/29/2021
CPU:
Topology: 2x 32-Core model: Intel Xeon Platinum 8358 bits: 64
type: MT MCP SMP L2 cache: 96.0 MiB
Speed: 902 MHz min/max: 800/3400 MHz Core speeds (MHz): 128 Core
Network:
Device-1: Mellanox MT27800 Family [ConnectX-5] driver: mlx5_core
IF: ens10f0 state: up speed: 25000 Mbps duplex: full
mac: **omitted**
IP v6: fe80::8ae9:a4ff:fe3f:5e6c/64 scope: link
Device-2: Mellanox MT27800 Family [ConnectX-5] driver: mlx5_core
IF: ens10f1 state: up speed: 25000 Mbps duplex: full
mac: **omitted**
IP v6: **omitted** scope: link
IF-ID-1: ens10f0.242 state: up speed: 25000 Mbps duplex: full
mac: **omitted**
WAN IP: **omitted**
Memory: 503.56 GiB

Test Layout

Test Topologies

Baremetal: These tests are conducted on bare metal operating systems. They provide valuable insights into bare metal performance, establishing a baseline for C2C (Container-to-Container) tests that utilize bare metal as the provider. Specifically, the “Baremetal-to-different-baremetal” tests are categorized under node-to-node different host tests.

Node-to-Node: These tests are performed on virtual machines (VMs) provided by virtualization stacks such as OpenStack and VMware. They serve to measure VM performance and establish a baseline for C2C tests that utilize virtualization stacks as the provider. Within this category, “Node2node-same-host” tests are conducted on the same bare metal with different VMs, while “Node2node-different-host” tests are conducted between different VMs on different bare metals.

Container-to-Container: These tests are carried out within Kubernetes containers and utilize various platforms as worker nodes, including bare metal, vCloud, and OpenStack. Within this category, “c2c-localhost” tests run within a single pod, between two containers. “c2c-same-host” tests are conducted between two containers located on different hosts within the same bare metal server. Lastly, “c2c-different-host” tests run between two containers located on VMs on different bare metal servers.

These testing categories provide comprehensive insights into performance across different environments, enabling Trendyol to optimize its infrastructure accordingly.

Conclusion

Cilium has proven to be a game changer for Trendyol’s Kubernetes clusters. With its advanced capabilities in networking, observability, and security, Cilium has met our expectations, outperforming previous CNIs.

We have successfully tackled challenges, optimized performance, and enhanced scalability by leveraging Cilium’s features such as eBPF-based networking, NodeLocal DNSCache, and the Local Redirect Policy.

Looking ahead, our future plan includes the creation of large Kubernetes clusters and the implementation of an external etcd for both Cilium and Kubernetes events. This will further bolster our infrastructure’s resilience and flexibility.

This is, of course, the work of the Trendyol team. Special thanks to the team leaders and members for their work. Also, special thanks to the Cilium community for providing such a CNI to the open-source ecosystem.
