In a prior post, Steve gave a good introduction to heap analysis and briefly touched on why you might want to do it. While I agree with Steve that an OutOfMemoryError means you need to take a look, the VM running out of memory is a solid indicator that you’ve already neglected your heap analysis duties for too long. By understanding what your application’s memory profile looks like and what a healthy application should look like, you can prevent problems like OutOfMemoryErrors from ever happening.

Here’s an example of what a fairly healthy JVM might look like. This image and the one that follows were generated from an internal Cacti instance.

A healthy(ish) JVM heap profile

Here’s an example of an unhealthy heap. I’m pretty sure the huge drop slightly to the right of the midpoint was the VM crashing.

An unhealthy JVM heap profile

Both images are from 24-hour time windows on one of our production VMs within the past two years. One is more recent – you figure out which. Putting the two images next to each other presents a strong contrast. One looks like a predictable pattern while the other is somewhat random. Even in the profile with the more regular pattern, there is some variability. So what’s normal?

In both, the reddish-orange at the top is new-gen, the light blue in the middle is old-gen, and the dark blue on the bottom is perm-gen. The uncovered purple area represents memory that is allocated to the VM but not currently used. Adding the two blues, the red and the purple gives you the total memory allocated to the VM. The reddish portion is not nearly granular enough to represent the orgy of allocation and garbage collection that is new-gen. In a sufficiently large JVM, new-gen can potentially allocate and collect gigabytes per second.
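If you don’t have graphs like these yet, the same breakdown can be read from inside a running JVM. Here is a minimal sketch using the standard `java.lang.management` API; the class name `HeapPoolSnapshot` and the output format are my own choices for illustration. Pool names depend on the collector in use, and on Java 8 and later you should expect a Metaspace pool where perm-gen used to be.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapPoolSnapshot {

    /** Formats one pool's numbers in whole megabytes; a negative max means "undefined". */
    public static String describe(String name, long used, long committed, long max) {
        long mb = 1024L * 1024L;
        String maxText = (max < 0) ? "undefined" : (max / mb) + "MB";
        return name + ": used=" + (used / mb) + "MB committed=" + (committed / mb)
                + "MB max=" + maxText;
    }

    public static void main(String[] args) {
        // One bean per memory pool (new-gen spaces, old-gen, perm-gen or Metaspace, ...).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            System.out.println(pool.getType() + " "
                    + describe(pool.getName(), usage.getUsed(), usage.getCommitted(), usage.getMax()));
        }
    }
}
```

Roughly speaking, "committed" minus "used" corresponds to the uncovered area in the graphs: memory the VM holds but is not currently using.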

Left to itself, a JVM will consume nearly all of the memory allocated to it before initiating a full, stop-the-world garbage collection; the full GC kicks in as the application nears the maximum allotted heap. Depending on your VM’s settings, it might also run full GCs before hitting the ceiling. The top image is GCing on an interval and never reaches the ceiling. The bottom profile appears to GC on an interval some of the time, but under memory pressure it also GCs when it hits the ceiling.
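One way to check how often your VM is actually collecting is to read the collector counters the JVM already exposes. The sketch below uses the standard `GarbageCollectorMXBean`; the class name `GcCounters` and the summary format are illustrative assumptions. Which bean corresponds to full collections depends on the collector you run, so treat the names it prints as something to look up for your own setup.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounters {

    /** Formats one collector's totals; the JVM reports -1 when the count is undefined. */
    public static String summarize(String name, long count, long timeMs) {
        if (count < 0) {
            return name + ": collection count undefined";
        }
        return name + ": " + count + " collections, " + timeMs + " ms total";
    }

    public static void main(String[] args) {
        // Typically one bean for the young-generation collector and one for the old.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(
                    summarize(gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
    }
}
```

Sampling these counters on a schedule and comparing successive readings shows whether collections are arriving on a steady interval or bunching up under memory pressure.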

The bottom image is a clear indication of one or more memory leaks. When a full GC happens, there should be a predictable floor to the space occupied by old-gen. If the old-gen floor ramps upward, or lands at a different, unpredictable level after each full GC, you have a memory leak. If you don’t have a fairly predictable pattern that looks something like the teeth of a saw, you need to start analyzing heaps.
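The "ramping floor" test can be made mechanical. The sketch below assumes you have already collected old-gen usage readings taken right after each full GC (in a live JVM, `MemoryPoolMXBean.getCollectionUsage()` reports a pool's usage after its most recent collection) and fits a least-squares line through them. The class name `OldGenFloorTrend`, the helper `looksLikeLeak`, and the tolerance are hypothetical; a sensible tolerance depends entirely on your application.

```java
public class OldGenFloorTrend {

    /** Least-squares slope of the readings against their index (units per sample). */
    public static double slope(double[] floors) {
        int n = floors.length;
        if (n < 2) {
            return 0.0; // not enough data to call it a trend
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double f : floors) {
            meanY += f;
        }
        meanY /= n;

        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = i - meanX;
            num += dx * (floors[i] - meanY);
            den += dx * dx;
        }
        return num / den;
    }

    /** Flags a floor that rises faster than the given tolerance per sample. */
    public static boolean looksLikeLeak(double[] floors, double tolerancePerSample) {
        return slope(floors) > tolerancePerSample;
    }

    public static void main(String[] args) {
        // Made-up post-GC old-gen readings in MB, for illustration only.
        double[] steady = {400, 405, 398, 402, 401};
        double[] ramping = {400, 450, 510, 560, 620};
        System.out.println("steady:  " + looksLikeLeak(steady, 5.0));
        System.out.println("ramping: " + looksLikeLeak(ramping, 5.0));
    }
}
```

A straight-line fit is a deliberately crude stand-in for eyeballing the graph. It will miss a floor that jumps around without trending, so it complements looking at the profile rather than replacing it.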

Despite its apparent superiority over the bottom image, the top image is not as healthy as it might appear. The top graph does exhibit some ramping in the floor. That ramp is nothing like the dramatic and consistent ramp in the bottom graph, but it’s there. In the top graph, the ramp and subsequent fall of the old-gen floor is the result of sessions on the VM aging out. While a session is alive it is not GC-able, so any objects it references are also not GC-able. That trend in the graph is one of the primary motivators behind the “stateless” effort that I and other authors on this blog have written about. Holding hard references to objects within long-lived objects like sessions is an anti-pattern.
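To make the session anti-pattern concrete, here is a small sketch contrasting the two approaches. Everything in it is hypothetical: a plain `Map` stands in for a real session object, `Report` stands in for any heavyweight object, and `REPORT_STORE` stands in for a shared store that can evict entries on its own schedule. It is not a description of how our sessions are implemented.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionFootprint {

    /** Hypothetical heavyweight object a request handler might produce. */
    public static final class Report {
        public final String id;
        public final byte[] payload;

        public Report(String id, int sizeBytes) {
            this.id = id;
            this.payload = new byte[sizeBytes];
        }
    }

    /** Stand-in for a shared store that is free to evict entries independently. */
    public static final Map<String, Report> REPORT_STORE = new HashMap<>();

    /** Anti-pattern: the session pins the whole Report until the session expires. */
    public static void storeObject(Map<String, Object> session, Report report) {
        session.put("report", report);
    }

    /** Alternative: the session keeps only a small key. */
    public static void storeKey(Map<String, Object> session, Report report) {
        REPORT_STORE.put(report.id, report);
        session.put("reportId", report.id);
    }

    /** Resolves the key on demand; returns null if the store has evicted the entry. */
    public static Report lookup(Map<String, Object> session) {
        Object id = session.get("reportId");
        return id == null ? null : REPORT_STORE.get(id);
    }
}
```

With the key-only approach, the lifetime of the heavy object is decided by the store rather than by however long the session happens to live, at the cost of callers having to handle a missing entry.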

Pay attention to the way your application uses memory. Once you start handling “real” traffic, being proactive about identifying memory problems can keep you out of serious trouble later.