The importance of limits for containerised JVM applications

Resource limits at the OS, Docker, Kubernetes, and JVM levels.

Nikola Stanković
Viascom Publications
8 min read · Aug 15, 2022


Preface

“A container is ‘randomly’ shutting down on a production server! And there are no log entries showing that anything like that happened!!”

That is what I heard from a client. Knowing that almost nothing in computing is genuinely random, anyone has to get suspicious at this point. What follows are my investigation steps for finding the root cause of an application that “randomly” shuts down, and my recipe for stable environments running containerised JVM applications.

Preconditions

Before we can jump in, there are some preconditions. And I assume that if you found this guide, you already meet them.

*Nope, I’m not secretly sponsored to list these links, nor do I get a gold star every time you click one. These are either Google-gifted or from the mysterious depths of my bookmarks. It’s just some developer-curated content for your benefit. So, enjoy this free tour through my biased digital lens! End of a cheeky disclaimer.

Investigation

Usually, I encounter JVM applications running on a Linux server, so this guide investigates the mentioned bug on such a system. If you use a different operating system, you will have to adapt the system commands.

Investigation plan

This was the resulting analysis plan:

  • Check Docker / Kubernetes resource usage stats
  • Check server resource usage stats
  • Check JVM RAM consumption stats

But let’s continue with a bit of Linux theory first 😅. It was the key to understanding and solving the quoted “randomness”.

Automatic process kills in Linux

The Linux Kernel may terminate one or more processes when the system runs low on resources. A widespread example is the out-of-memory (OOM) killer, which takes action when the system’s physical memory is exhausted.

When the system is running out of memory, the job of the OOM killer is to pick the minimum number of processes and terminate them. It uses a badness score to decide which processes to kill (see the snippet after this list). While making that decision, it tries to minimise the damage by making sure that it:

  • minimises the lost work
  • recovers as much memory as possible
  • doesn’t kill innocent processes, but only those that consume a lot of memory
  • minimises the number of killed processes (ideally, just one)
  • kills the process(es) that the user would expect
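
The badness score is exposed through the proc filesystem, so you can inspect it yourself. A minimal sketch, where <pid> is a placeholder for the process ID of the Java process in question:

cat /proc/<pid>/oom_score      # the current badness score the kernel has calculated
cat /proc/<pid>/oom_score_adj  # the adjustment value (-1000 to 1000) that tunes this score

The higher the score, the more likely the kernel is to pick that process when memory runs out.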

We must know that Linux does this job to prevent a RAM shortage from bringing the computer to its knees. Introducing limits on all involved levels is the answer to prevent this from happening to your production-proof services.

See the following reference for more details regarding the OOM killer: https://www.kernel.org/doc/gorman/html/understand/understand016.html

Investigation execution

Check Docker stats

docker stats $(docker ps --format '{{.Names}}') --no-stream

See the following reference for more details regarding the docker stats command: https://docs.docker.com/engine/reference/commandline/stats/

One of the essential facts we can get out of these Docker stats is that if the value in the MEM USAGE / LIMIT column equals our total physical RAM, we certainly have no limit set at all 😅.
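
If that is the case, a memory limit can be set when the container is started. A minimal sketch with illustrative values, where my-jvm-app and <container-name> are placeholders:

docker run -d --memory=512m --memory-swap=512m my-jvm-app          # start with a hard 512 MB limit
docker update --memory=512m --memory-swap=512m <container-name>    # or adjust an already running container

Setting --memory-swap equal to --memory prevents the container from falling back to swap once the limit is reached.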

Check Kubernetes stats

If you are using a Kubernetes environment, the following command will supply you with the same data:

kubectl top pod --all-namespaces | sort --reverse --key 3 --numeric

See the following reference for more details regarding the kubectl top pod command: https://kubernetes.io/docs/reference/kubectl/
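
Since we are hunting a memory problem, it can also help to sort by the memory column instead of the CPU column; with --all-namespaces, memory is typically the fourth column:

kubectl top pod --all-namespaces | sort --reverse --key 4 --numeric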

Check for RAM usage

free -m

See the following reference for more details regarding the free command: https://linuxize.com/post/free-command-in-linux/

Here, I realised that the problem must be connected with RAM, as only 97 MB were left and memory swapping was also happening.

Check last server reboot

who -b -H

The -b, --boot option tells who to print the time of the last system boot. If you want to print the column headings, add the -H (--heading) option.

See the following reference for more details regarding the who command: https://linuxize.com/post/who-command-in-linux/

Or you can execute the last reboot command:

last reboot

See the following reference for more details regarding the last command: https://linuxize.com/post/last-command-in-linux/

Check for kill events

Execute the following Linux command on your server to get information about any possible kill events:

dmesg -T | grep -i kill

See the following reference for more details regarding the dmesg command: https://linuxize.com/post/dmesg-command-in-linux/

Log entries containing kill are proof that the OOM killer took action. At the same time, this explains why no log entries or graceful shutdown messages are visible in our application log: the process was killed without any time to log anything.
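
On hosts running systemd, the same kernel messages can also be queried through the journal; as a sketch:

journalctl -k | grep -i kill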

CPU over-consumption

If CPU usage is too high, users will experience long load and save times, and in the worst-case scenario, programs will start to freeze because the processor is overloaded with too many processing commands. However, there is no automatic killing of processes for over-using the CPU.
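
Nevertheless, you can still cap how much CPU a container may use so that one service cannot starve the others. A minimal sketch with an illustrative value, where my-jvm-app is a placeholder:

docker run -d --cpus=1.5 my-jvm-app   # restrict the container to at most 1.5 CPU cores

Kubernetes offers the equivalent via resources:limits for CPU in the Pod specification.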

Java runtime stats: jmap, jstack, jcmd, jinfo

Usually, we run our applications in a container with a JRE. However, this brings the downside that we don’t have tools like jmap, jstack, jcmd and jinfo as they are part of the JDK.

I started to use jattach, a tool which combines all the tools mentioned above. I install it in my containers to have it ready when investigating.

By running the following command, we can see the reserved memory, which represents the total amount of memory our application can use, and the committed memory, which equals the memory our application is currently using.

jattach <pid> jcmd VM.native_memory

To be able to fetch this data, there is one precondition: native memory tracking must be enabled by setting the following JVM parameter:

-XX:NativeMemoryTracking=summary

See the following reference for more details regarding the native memory tracking feature: https://docs.oracle.com/en/java/javase/18/vm/native-memory-tracking.html

This is, in essence, how I used it to investigate the memory.
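
A minimal sketch of that flow, where app.jar and <pid> are placeholders:

java -XX:NativeMemoryTracking=summary -jar app.jar   # start the application with native memory tracking enabled
jattach <pid> jcmd VM.native_memory                  # then query the native memory summary from a shell inside the container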

Check out my example of a production-proof Dockerfile; I’ve included all those preconditions so that they are ready for upcoming investigations:

Another approach … VisualVM

We could also decide to use tools like VisualVM for the analysis, but that brings the downside of having to configure and open the JMX port, which in a container has to be included in the CMD part and can therefore only be done before we launch our application container.
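
For completeness, these are the standard JMX system properties such a setup typically adds to the java command. The port and hostname are purely illustrative, and disabling authentication and SSL is only acceptable in a protected environment:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.rmi.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.rmi.server.hostname=<host-or-container-ip>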

If you are interested in that path, check out the following guide:

Disadvantage: Runtime data only

With nearly all of the commands shown, you only get runtime data, yet when investigating a bug you usually need historical data. This is a problem especially in a containerised environment where no monitoring system is in place and, as a result, no such data is collected and persisted.

To help us investigate such errors, we can have the JVM create a heap dump automatically when a java.lang.OutOfMemoryError occurs by adding the following configuration to the JAVA_OPTS:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file-or-dir-path>

Remember that the resulting file can become quite big (up to your maximum physical RAM or the defined -Xmx value). If you run your containers with automatic restarts, you could end up with as many dump files as there were restarts. Take this into account when deciding to write out heap dumps!
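
Combined with a fixed heap size, the relevant part of the java command could look like this; the values and the path are illustrative only:

java -Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps -jar app.jar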

Summary / Take-away

Even in a simple deployment, three different memory limits are involved:

  • JVM via the -Xmx parameter
  • Docker via the docker-compose parameter /
    Kubernetes via the resources:limits in the Pod specification file
  • OS via the memory.limit_in_bytes cgroups parameter

So, whenever a process is getting killed, you must pay attention to the memory limits on all involved levels.
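
To make this concrete, here is a minimal, illustrative sketch of how the limits could be declared; the values and names are placeholders, not a recommendation, and the exact docker-compose syntax depends on your Compose version:

# JVM level (part of the container's start command or JAVA_OPTS)
-Xmx512m

# Docker level (docker-compose.yml)
services:
  app:
    image: my-jvm-app
    mem_limit: 768m

# Kubernetes level (Pod specification)
resources:
  limits:
    memory: "768Mi"

Note that the container limit is deliberately larger than the heap limit, since the JVM also needs memory outside the heap.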

Recommendations and further topics

Define -Xms, -Xmx and -Xss

Make sure to think about these three parameters and define them properly. Unfortunately, there is no simple sizing rule; it heavily depends on how many threads you run and what calculations you execute.
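
As a purely illustrative starting point, such a configuration might look like the following; the actual numbers have to come from measuring your application under realistic load:

java -Xms256m -Xmx512m -Xss1m -jar app.jar   # initial heap, maximum heap, and per-thread stack size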

Application Monitoring

You may have noticed the main issue: we were only able to analyse runtime data, as there was no historical data for anything we looked up. So, do yourself a favour and use an application monitoring system. I will not discuss which one to use, as there are many on the market, depending on your budget, use case, etc.

Here are some of the ones I have encountered:

Linux Process Priority Using nice and renice

You might get the idea to prioritise processes directly on the host so that your relevant services are not killed. More information regarding that can be found on the following website:

I would keep my fingers away from the idea of solving a Java memory issue by changing process priorities. It just sounds wrong and would only result in a hack.

JProfiler

It’s pretty pricey but at the same time worth every invested cent: JProfiler. It may be worth considering if you often need to analyse running JVM applications. As alternatives, I also showed the free tools I can recommend, like VisualVM and the jattach stack.

Following the link to JProfiler:

If you would like to use JProfiler for open-source projects, there is a way to get it for free: https://www.ej-technologies.com/buy/jprofiler/openSource

Feedback and updates matter 📝☕. Enjoy my articles? Show support with claps, follows, and coffee donations. I keep all content regularly updated.

Support: ko-fi.com/niksta | Discord: devhotel.io

Disclosure: This article was assisted by ChatGPT (OpenAI) and refined using Grammarly for spelling and style. Visuals created with Midjourney’s AI tool.
