
On the Rally Engineering blog we’ve written many articles on systems’ performance, monitoring, resilience, and recovery. An interesting event just happened at our Denver production facility that pushed our systems to their limits. It helped us answer the question:

How well does production survive in the face of all-out failures?

Maybe inspired by Netflix’s Chaos Monkey, some local wildlife found its way into our server room and literally tore apart our hardware.

Enter the “Rally Chaos Raccoon”

Last weekend the Chaos Raccoon, whom we now call “Cyril”*, chewed his way into our production server room. From security video footage we saw that at first he was cautious and quiet, just hiding behind the Dell 4210s, where it’s warm.

But after a full day of no food Cyril got angry.

Pissed angry.

He gnawed on the mounting brackets of our UPS and ripped open a SAN unit in use by app-server-01.

The instant app-01 failed our backup app-server spun into action and took over the production traffic from app-01. The process all had to happen automatically — when Aaron tried to enter the cage for a manual failover Cyril attacked and bit his cheek (yes, that cheek).


As Aaron left the server room, seeking medical attention and rabies shots, the Chaos Raccoon ripped into another Dell 4210 and chewed its innards into shreds. Our system smoothly recovered, load-balancing network traffic whilst simultaneously paging the operations team another warning. When Cyril finally chewed through the UPS he electrocuted himself to death.

It smelled bad.

In the future we’ll probably test our processes with a “Bad-ass Bear”, “Crazy Coy Carp”, or “Janky Jackalope”. Opening up your production systems to wildlife attack demonstrates confidence in monitoring, recovery, and backup processes. It stretches your failover strategies to their limits. You may think your systems are ready for anything but when a raccoon attacks there are no rules.

If you’re going to implement your own Chaos Raccoon we recommend you first deploy an array of recovery tools and test with non-endangered creatures. It’s the organic and ecologically sound option.

* real names changed to protect the innocent

Help! My Java application keeps crashing! OutOfMemoryError?! My hair is on fire! Please don’t let the invisible fire burn my friend!!!

1. Increase max heap size

Some applications are complex and large enough that they simply need more memory.
Try increasing the max heap size when you start the JVM by adding the flag -Xmx. So your start-up will look something like this: java -Xmx1024m MyApp (which is equivalent to java -Xmx1g MyApp)

The downside is that your application may suffer from longer garbage collection cycles, so it may be better to fix the greedy code.
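To confirm that the flag was actually picked up, the running JVM can report its own limits. This is a minimal sketch; the class name HeapSettings is made up for illustration, and the reported maximum may come out slightly below the -Xmx value depending on the collector in use.

```java
// Prints the heap limits the running JVM is working with.
final class HeapSettings {
    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024 * 1024;
        System.out.println("max heap (MiB):   " + maxHeapMiB());
        System.out.println("total heap (MiB): " + rt.totalMemory() / mib);
        System.out.println("free heap (MiB):  " + rt.freeMemory() / mib);
    }
}
```

Run it as java -Xmx1024m HeapSettings and compare the first line against what you asked for.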

But what if your application still runs out of memory? Seems like you have a leak.

2. Detect memory leak

Learn to use tools like jmap and Thread Dump Analyser (TDA). Run jmap -histo <pid> (where pid is the process-id of your Java application) a few times and compare the output. Read my earlier blog posts on fun with heap dump analysis.

3. Carefully read your code

There are many ways to leak memory in Java code. Watch out for thread local variables — especially in a thread pool. Each thread stores its own state so anything you put on there will last the lifetime of the thread. If your threads are recycled through a thread pool that state may hang around for a long time.
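The thread pool case can be reproduced in a few lines. The sketch below is illustrative only (the BUFFER field and class name are hypothetical): it runs two tasks on a single pooled thread and checks whether the second task can still see state left behind by the first.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalLeakDemo {
    // Hypothetical per-thread scratch buffer (name is illustrative).
    static final ThreadLocal<byte[]> BUFFER = new ThreadLocal<>();

    /** Returns true if state set by one task is still visible to the next task. */
    static boolean stateSurvives(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Task 1 stores state on the pooled thread.
            pool.submit(() -> {
                BUFFER.set(new byte[1024]);
                if (cleanUp) {
                    BUFFER.remove();
                }
            }).get();
            // Task 2 runs later on the same recycled thread.
            return pool.submit(() -> BUFFER.get() != null).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without remove(): " + stateSurvives(false));
        System.out.println("with remove():    " + stateSurvives(true));
    }
}
```

In real code the call to remove() usually belongs in a finally block so it also runs when the task throws.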

Mutable static collections are bad. Static fields are retained by the class and therefore its classloader. That means your collection will hang around for the lifetime of that class. Come to think of it, it is better not to use mutable statics.

If you re-implement the hashCode and equals methods and get them wrong you’ll shoot yourself in the foot. Here is why: when you store an object in a HashMap the collection will call your (wrong) hashCode method to determine that object’s position. On look-up, the map uses the (wrong) hashCode method again and then calls the (wrong) equals method to check it retrieved the correct object. If you break the implementations of these two methods you can add objects into the HashMap but can’t get them back. Worse, you can add the same thing repeatedly and it will not overwrite the old value because hashCode keeps returning different values for the same object.
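Here is a deliberately broken key that shows both symptoms. It is a contrived sketch (BadKey is invented for this example): hashCode returns a new number on every call, which violates the contract that equal objects must produce equal, stable hash codes.

```java
import java.util.HashMap;
import java.util.Map;

final class BrokenHashCodeDemo {
    /** A key whose hashCode changes on every call (deliberately wrong). */
    static final class BadKey {
        private static int calls = 0;
        private final String name;

        BadKey(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return ++calls; // unstable: the map can never find this key again
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        BadKey key = new BadKey("story-1");
        map.put(key, "first");
        map.put(key, "second"); // does not overwrite "first"
        System.out.println("size = " + map.size());
        System.out.println("get  = " + map.get(key));
    }
}
```

The map ends up holding two entries for one key and cannot return either of them, and both entries stay strongly referenced for as long as the map lives.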

4. Circuit breakers

If you do have to compute over potentially large collections of objects then use weak references as circuit breakers. I wrote a series of blog posts that will serve as a good primer for weak references. In particular: JVM Memory Primer, Java References and Reachability Explained.

Monitoring JVM memory usage and tweaking system options is something we do all the time. Once you’ve fixed your leaks you need to continue to monitor systems and run load tests to project usage into the future. Be proactive; don’t wait for an OutOfMemoryError to remind you.
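For a quick in-process check, the standard java.lang.management API exposes current heap usage. The following is a minimal sketch rather than our monitoring set-up; the HeapMonitor name and the 80% warning threshold are assumptions chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapMonitor {
    /** Fraction of the maximum heap currently in use, or -1 if the maximum is undefined. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        return max <= 0 ? -1.0 : (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        double used = heapUsedFraction();
        System.out.printf("heap used: %.1f%%%n", used * 100);
        if (used > 0.8) { // assumed threshold, tune for your application
            System.out.println("warning: heap usage is above 80%");
        }
    }
}
```

Sampling this value on a schedule and graphing it over time is one way to spot slow growth before it turns into an OutOfMemoryError.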

Engineering blog: How long have you been with Rally and what were you doing before?
Steve Neely: I joined Rally in Feb 2010. Before that I lived in Dublin, Ireland and worked at University College Dublin in the Systems Research Group. Before Dublin I lived in Glasgow, Scotland and worked as a lecturer at the University of Strathclyde.

EB: What made you choose Rally?
SN: I wanted to work with industry leaders in an environment where I could learn and have fun.

EB: What sorts of things have you worked on at Rally?
SN: Across the stack. I started on the Chuck Norris team building the frontend JS that comprises the users page; then was part of Das Schnitzel building Solr search; then as a member of YUNO? I upgraded our Ruby GUI testing infrastructure; and now I’m on the architecture/scalability/performance team (team names: Steel Panther and !pants).

EB: What are your favorite things to work on?
SN: The architecture work we’re doing now is complex and super interesting. A great thing about working at Rally is I’m free to move to any team. I could jump back into JS-land tomorrow if I wanted.

EB: Explain your nickname.
SN: According to one of the guys in the dev team, Groundskeeper Willie, from the Simpsons, is the most famous other Scottish person ever. In my first week they changed my Rally display name and photo. It stuck.

EB: Anything else you’d like to share about life at Rally?
SN: I learn something new every day and love coming to work. That’s an awesome thing to be able to say.

After my previous post on references one of the team asked me for an example use for phantom references. The best I could recall was to keep track of an object and perform some postmortem clean-up upon its destruction. Very important for database connections and caches.

But can’t you do that with the finalize method?

Well no, not with any guarantees. Here’s why:

Firstly, we don’t know when the garbage collector will run. It is non-deterministic. In fact, if your program doesn’t use up its allocated memory it doesn’t ever need to GC, so no objects would be finalized. Secondly, any objects that are strongly referenced when the runtime quits will not be GC’d. Their finalizer code will never be called.

Object finalization happens after an object has been marked as garbage but before memory has been reclaimed. You can screw around with this by creating a strong reference to a garbage object during its finalization:

myObject = this;

If you do this (please don’t!) the object comes back to life because it is strongly reachable via myObject. However, the object you have a reference to will never have its finalizer code called again. That’s bad. Phantom references don’t allow this:

“In order to ensure that a reclaimable object remains so, the referent of a phantom reference may not be retrieved: The get method of a phantom reference always returns null.”
[PhantomReference API]

Recall that the constructor for a PhantomReference takes a ReferenceQueue. Phantom references are enqueued once the object has been finalized and the collector has determined that it can be reclaimed. If you process this queue in your own cleanup thread then you are guaranteed that the referents can no longer be used or resurrected.

“Unlike soft and weak references, phantom references are not automatically cleared by the garbage collector as they are enqueued.”
[PhantomReference API]

So you must remember to dequeue and clear them yourself.

If that’s not enough to convince you that finalizers are troublesome then maybe an optimization argument will help: the runtime’s finalizer thread is responsible for calling the finalize method on garbage objects. This can slow up your application during GC cycles. Especially if your finalizer code does a synchronous wait for some external process (like a database confirming the connection was closed). Taking control of cleanup with your own thread(s) by processing a reference queue is likely to be way more efficient.
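A reference-queue based cleanup might look like the sketch below. ResourceRef and the resource name are hypothetical stand-ins for whatever bookkeeping you need (a connection id, a file handle), and because System.gc() is only a hint, the demo may legitimately report that nothing was collected.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

final class PhantomCleanupDemo {
    /** Phantom reference that remembers which (hypothetical) resource to release. */
    static final class ResourceRef extends PhantomReference<Object> {
        final String resourceName;

        ResourceRef(Object referent, ReferenceQueue<Object> queue, String resourceName) {
            super(referent, queue);
            this.resourceName = resourceName;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object connection = new Object(); // stand-in for a real resource owner
        ResourceRef ref = new ResourceRef(connection, queue, "connection-1");

        System.out.println("get() = " + ref.get()); // phantom references always return null

        connection = null; // drop the only strong reference to the referent
        for (int attempt = 0; attempt < 10; attempt++) {
            System.gc(); // a request, not a guarantee
            Reference<?> dead = queue.remove(100); // wait up to 100 ms for an entry
            if (dead != null) {
                System.out.println("cleaning up " + ((ResourceRef) dead).resourceName);
                dead.clear();
                return;
            }
        }
        // Using ref here also keeps the reference object itself alive during the loop.
        System.out.println(ref.resourceName + " not collected yet; GC timing is not guaranteed");
    }
}
```

In a long-running service the loop would live in a dedicated cleanup thread, and the ResourceRef instances would be held in a collection so they stay reachable until they are dequeued.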

Finalizerly, if the garbage collector never runs then unreachable objects are not identified which means phantom references will not be put into the reference queue. That’s important to remember too.

As a follow up to my last post on Java references I wanted to blog a clear explanation on reachability. But there is no way I can be more succinct than the Java API docs. So I’m going to invoke “fair use” and reference the source.

– begin quote –

Going from strongest to weakest, the different levels of reachability reflect the life cycle of an object. They are operationally defined as follows:

  • An object is strongly reachable if it can be reached by some thread without traversing any reference objects. A newly-created object is strongly reachable by the thread that created it.
  • An object is softly reachable if it is not strongly reachable but can be reached by traversing a soft reference.
  • An object is weakly reachable if it is neither strongly nor softly reachable but can be reached by traversing a weak reference. When the weak references to a weakly-reachable object are cleared, the object becomes eligible for finalization.
  • An object is phantom reachable if it is neither strongly, softly, nor weakly reachable, it has been finalized, and some phantom reference refers to it.
  • Finally, an object is unreachable, and therefore eligible for reclamation, when it is not reachable in any of the above ways.

– end quote –

I’m on a memory management kick. Understanding how the Java runtime partitions memory, allocates space for objects, moves objects around, and performs garbage collection (see previous post) is a good base for our quest to write memory-conscious code. In this blog post I’ll describe Java’s four reference types, which you can use to assist the JVM in its memory management duties.

In order from strongest to weakest these references are: Strong, Soft, Weak, Phantom.

Strong References

These are your regular object references:

Server server = new Server();

The variable server holds a strong reference to a Server object. OK, before you stop reading there is a point to this: objects that are reachable through any chain of strong references are not eligible for garbage collection. Usually this is what you want. But not always.

Imagine for a minute that you’re coding an Enterprise Web App backed by a large Oracle database. When users navigate your web app it loads data into memory from Oracle. Sometimes the same data is regularly accessed so you decide to cache it in a Map. By storing strong references you’ve just introduced a memory leak. “Ha!”, you say, “then I’ll write a memory manager that throws out least frequently used objects when we begin to run out of memory”. But doesn’t the JVM already manage memory for you?

Weak References

A weak reference will not pin an object into memory. An object that is identified as only weakly reachable will be collected at the next GC cycle.

WeakReference<Cacheable> weakData = new WeakReference<Cacheable>(data);

To access data call weakData.get(). This call to get may return null if the referent was garbage collected: you must check the returned value to avoid NPEs.
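A safe access pattern is to call get() once, keep the result in a local variable, and null-check that local. The sketch below uses an illustrative helper name; calling get() twice (once to check, once to use) would leave a window in which the collector could clear the reference in between.

```java
import java.lang.ref.WeakReference;

final class WeakGetDemo {
    /** Returns the referent's text, or a fallback if the referent is gone. */
    static String describe(WeakReference<StringBuilder> ref) {
        StringBuilder data = ref.get(); // take a strong reference first
        return data == null ? "<collected>" : data.toString();
    }

    public static void main(String[] args) {
        StringBuilder data = new StringBuilder("cached value");
        WeakReference<StringBuilder> weakData = new WeakReference<>(data);
        System.out.println(describe(weakData));

        weakData.clear(); // simulate what the collector does to a dead referent
        System.out.println(describe(weakData));
    }
}
```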

Java contains collections that use weak references. For example, the WeakHashMap class stores keys (not values) as weak references. If the key is GC’d then the value will automatically be removed from the map too.

Since weak references are objects too we need a way to clean them up (they’re no longer useful when the object they were referencing has been GC’d). If you pass a ReferenceQueue into the constructor for a weak reference then the garbage collector will append that weak reference to the ReferenceQueue when it is no longer needed. You can periodically process this queue and deal with dead references.

Soft References

A SoftReference is like a weak reference but it is less likely to be garbage collected. Soft references are cleared at the discretion of the garbage collector in response to memory demand. The virtual machine guarantees that all soft references to softly-reachable objects will have been cleared before it would ever throw an OutOfMemoryError.
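That guarantee makes soft references a reasonable fit for the caching scenario described earlier. Below is a minimal sketch of such a cache; SoftCache and its loader function are hypothetical names, and a production version would also prune map entries whose referents have been cleared (for example by using a ReferenceQueue).

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SoftCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a database lookup

    SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never cached or has been cleared. */
    V get(K key) {
        SoftReference<V> ref = entries.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftCache<String, String> cache = new SoftCache<>(key -> "row for " + key);
        System.out.println(cache.get("user-42"));
        System.out.println(cache.get("user-42")); // normally served from the cache
    }
}
```

The JVM decides when to evict, so there is no least-frequently-used bookkeeping to write by hand.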

Phantom References

In practice these are rarely used.

Key point: phantom references are the most tenuous of all reference types: calling get will always return null.

So how are they useful? When you construct a phantom reference you must always pass in a ReferenceQueue. This indicates that you can use a phantom reference to see when your object is GC’d. The phantom reference is enqueued only after its referent has been finalized and can no longer be resurrected, as opposed to weak references, which are enqueued before their referents are finalized.

Hey, so if weak references are enqueued when they’re considered finalizable but not yet GC’d we could create a new strong reference to the object in the finalizer block and prevent the object being GC’d. Yep, you can but you probably shouldn’t do this. To check for this case the GC cycle will happen at least twice for each object, unless that object is reachable only by a phantom reference. This is why you can run out of heap even when your memory contains plenty of garbage. Phantom references can prevent this.

The Java VM automatically manages allocation and deallocation of memory during program execution. The Garbage Collector (GC) is responsible for scanning the heap and freeing up space occupied by unreachable objects. In some ways, automatic memory management makes programming way simpler and less error prone.

Sweet! Memory is managed automatically so I don’t need to care about it!

Nope, not necessarily. If you understand just a little about memory management you can tune your application to help the JVM do its job more efficiently. So let’s learn a little background here and in a later post I’ll write about tweaking your code.

The JVM splits memory up into sections:

[Figure: JVM memory layout]

It turns out that the majority of Java objects are very short-lived: new’d up in a block, used, and destroyed. These objects are created in the Eden space and they are GC’d by a serial process which runs with minimal impact on the application.

Key point: GC in the Eden space is a relatively lightweight operation so we prefer this to happen.

Objects in the Eden space that are not culled by the GC are copied into the Survivor space. And occasionally, the runtime copies longer-lived objects from the short-term area into the Tenured space.

When Tenured fills up the JVM performs a major mark-and-sweep GC which will stop all threads whilst it clears up, reorganizes and compresses this space.
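You can ask a running JVM which heap sections it actually has. This sketch uses the standard java.lang.management API; the MemoryPools class name is made up, and the pool names it prints vary with the JVM and the garbage collector in use, so do not expect them to match the labels above exactly.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.ArrayList;
import java.util.List;

final class MemoryPools {
    /** Names of the heap memory pools reported by this JVM. */
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : heapPoolNames()) {
            System.out.println(name);
        }
    }
}
```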

This “stop-the-world” GC can cause your application to hang and the user will see a blank screen or spinning beach ball of doom.

Worse, your application can GC Storm. This happens where the tenured space is pretty much always full so the collector runs frequently. Yes, I’m talking about you IntelliJ.

So, what can you watch for? How about: premature object creation, holding references to objects you don’t need, String concatenation operations, statics, not closing JDBC Statements/ResultSets in a finally block…

Finally, measure, don’t guess: understand your application’s memory profile, generate realistic load with a tool like JMeter while you profile, and use a Memory Analyzer Tool to have fun with heap dump analysis.

I recently wrote a blog post about the Strategy Pattern and have been thinking more about how it’s a neat pattern for composition. That line of thought took the voices in my head to discussing ways we construct classes, with various forms of code reuse and interface definitions, by applying design patterns.

That’s a large topic so let’s narrow focus and talk about two options for code reuse: inheritance vs composition.

Object-oriented systems are characterized, in part, by inheritance. We use polymorphism and dynamic binding as tools for switching about objects and allowing the system to choose the best method implementation for a given call.

This flexibility makes code easy to change. Method calls don’t care about which method they call providing the signature/type look correct — you can dynamically swap out the method implementation behind the scenes without syntactically breaking legacy code. Inheritance makes code changes easy when you’re creating a new subclass.

We’re going to need an example (from our code base, heavily modified for illustration):

public class Artifact {
  protected void initializeDefaultValues(Project project) {
  }

  public boolean isScheduled() {
    return Objects.firstNonNullValue(transientIsScheduled, isScheduled);
  }
}

public class StoryCard extends Artifact {
  @Override
  protected void initializeDefaultValues(Project project) {
    state = getInitialCreationStateReadOnly();
  }
}

Artifact artifact = new StoryCard(); // polymorphism
artifact.initializeDefaultValues(project); // dynamic binding

StoryCard card = new StoryCard();
if (card.isScheduled()) {
 …
}

Superclass–subclass relationships are fragile. If we decide to change the signature of Artifact’s initializeDefaultValues method this forces us to change StoryCard to match the new signature (along with the seven other subclasses which override this method). This was part of the motivation for introducing the @Override annotation.

Inheritance is said to provide weak encapsulation. Look back at if(card.isScheduled()). The card object was declared as type StoryCard but since StoryCard extends Artifact it inherits the isScheduled method. Here be danger: changes to isScheduled() in Artifact must be aware that subclasses, such as StoryCard, are relying on it: subclasses know about the method and can call it. Yes, the method is encapsulated but with a weak abstraction.

So inheritance can make it difficult to change the interface of a superclass. As an alternative we can use composition. Here’s a trimmed down version of our example to illustrate:

public class Artifact {
  public boolean isScheduled() {
    return Objects.firstNonNullValue(transientIsScheduled, isScheduled);
  }
}

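public class StoryCard {
  // StoryCard has-a Artifact instead of extending it
  private Artifact artifact = new Artifact();

  // no @Override here: StoryCard no longer inherits isScheduled, it forwards the call
  public boolean isScheduled() {
    return artifact.isScheduled();
  }
}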

Building StoryCard by composing in Artifact brings the isScheduled method up-front; it is implemented in the back-end by Artifact. By having StoryCard explicitly call isScheduled in Artifact we’ve more strongly encapsulated the method. This forwarding or delegation means we can alter isScheduled in Artifact without breaking any code calling isScheduled on a StoryCard typed object.

Another advantage of composition is that you may delay the creation of back-end objects until you need them. This can be more efficient if there is a lot of overhead in constructing those back-end objects. And since you’re hiding these back-end objects you can dynamically switch out that back-end object at run time — something you cannot do with inheritance. On the flip-side, when using composition the addition of new subclasses requires more effort.
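Both ideas, lazy creation and run-time switching, fit in a short sketch. The Schedulable interface, the LazyStoryCard class and the supplier-based constructor are invented for illustration and are not taken from our code base.

```java
import java.util.function.Supplier;

/** Hypothetical interface shared by the front-end object and its back-end. */
interface Schedulable {
    boolean isScheduled();
}

final class LazyStoryCard implements Schedulable {
    private final Supplier<Schedulable> factory;
    private Schedulable backEnd; // created on first use

    LazyStoryCard(Supplier<Schedulable> factory) {
        this.factory = factory;
    }

    @Override
    public boolean isScheduled() {
        if (backEnd == null) {
            backEnd = factory.get(); // pay the construction cost only when needed
        }
        return backEnd.isScheduled(); // forward to the composed object
    }

    /** Swaps the back-end at run time; callers of isScheduled() are unaffected. */
    void switchBackEnd(Schedulable replacement) {
        backEnd = replacement;
    }

    boolean isBackEndCreated() {
        return backEnd != null;
    }

    public static void main(String[] args) {
        LazyStoryCard card = new LazyStoryCard(() -> () -> true);
        System.out.println("created before first call: " + card.isBackEndCreated());
        System.out.println("scheduled: " + card.isScheduled());
        card.switchBackEnd(() -> false);
        System.out.println("scheduled after switch: " + card.isScheduled());
    }
}
```

This version is not thread-safe; a shared instance would need synchronization around the lazy initialization.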

So how do you choose? Traditionally we say inheritance follows the is-a pattern. A StoryCard is-a Artifact so we use inheritance. I would take this a step further and say that the is-a relationship must be true for the entire lifecycle of an object. If that’s not always true then maybe composition is more appropriate.

TL;DR: don’t automatically choose inheritance for code reuse and polymorphism. There must be a full-lifecycle is-a relationship between objects; otherwise composition with interfaces may be a better alternative.

The Strategy Pattern is a new favorite design pattern of mine. Probably because it has made my life substantially easier on a bunch of occasions. You can use it for situations when you’d like to group together a family of algorithms and interchangeably drop them into other code units (e.g. objects) at run-time. These strategies are readily reusable, composable, and easy to mix in to your code.

Traditional implementations of the Strategy Pattern implement it with function pointers or first-class functions. These are stored away and later selected according to some given set of requirements. For example, say I write some awesome code for binding Subscription domain objects with incoming HTTP request data in a flexible manner. I can label my binding service as a strategy (suitable) for Subscription objects and they’ll automatically pick it up for binding.

In the Rally code base we implemented the Strategy Pattern using Java annotations. Our framework uses reflection to get classes annotated @StrategyFor and stores them in a Map. The association between consumers and strategies is based on type. Since types are hierarchical in Java we can write strategies for more than just a single class: we can annotate as a @StrategyFor some parent class and cover an entire sub-hierarchy.

Here’s a code example. Let’s say toward the end of some transaction we want all Artifacts to be processed and have their custom attributes set. So:

@StrategyFor(Artifact.class)
public class ArtifactPostProcessor implements DomainObjectPostProcessor {

    @Override
    public void process(DomainObject domainObject, Map arguments) {
        Artifact artifact = (Artifact)domainObject;
        artifact.setAllCustomAttributeValues();
    }
}
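The framework side of this, the annotation plus a type-keyed map with hierarchical lookup, could be sketched roughly as follows. This is not our actual implementation: StrategyRegistry, PostProcessor and the stand-in Artifact and StoryCard classes are hypothetical, and strategies are registered explicitly here instead of being discovered by scanning the classpath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.Map;

@Retention(RetentionPolicy.RUNTIME) // must be visible to reflection at run time
@Target(ElementType.TYPE)
@interface StrategyFor {
    Class<?> value();
}

/** Hypothetical strategy interface for this sketch. */
interface PostProcessor {
    String process(Object domainObject);
}

// Stand-in domain types for the example.
class Artifact { }
class StoryCard extends Artifact { }

@StrategyFor(Artifact.class)
class ArtifactPostProcessor implements PostProcessor {
    @Override
    public String process(Object domainObject) {
        return "processed " + domainObject.getClass().getSimpleName();
    }
}

final class StrategyRegistry<S> {
    private final Map<Class<?>, S> byType = new HashMap<>();

    /** Reads the @StrategyFor annotation and stores the strategy under that type. */
    void register(S strategy) {
        StrategyFor marker = strategy.getClass().getAnnotation(StrategyFor.class);
        if (marker == null) {
            throw new IllegalArgumentException("missing @StrategyFor on " + strategy.getClass());
        }
        byType.put(marker.value(), strategy);
    }

    /** Walks up the superclass chain so a strategy for a parent covers its subclasses. */
    S lookup(Class<?> consumerType) {
        for (Class<?> c = consumerType; c != null; c = c.getSuperclass()) {
            S strategy = byType.get(c);
            if (strategy != null) {
                return strategy;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        StrategyRegistry<PostProcessor> registry = new StrategyRegistry<>();
        registry.register(new ArtifactPostProcessor());
        System.out.println(registry.lookup(StoryCard.class).process(new StoryCard()));
    }
}
```

The lookup only follows superclasses; covering interfaces as well would need an extra walk over getInterfaces().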

Ok, so now you’re thinking, if a family of objects needs a given behavior why not just create some interface and have them all implement it? Sometimes that is too tightly coupled: changing the interface means having to change every single implementing object.

The Strategy Pattern is an alternative that encourages building objects by composition over inheritance. Using composition means you can change the description of the superclass (what would be the Strategy class) without breaking the subclasses.

Neat, huh?

Back in May I posted on how we connected build lights to our continuous integration server to increase visibility into the health of our builds. Since that post we’ve grown from a single build light to six lights here in Boulder and two in our Raleigh office. The light control code has become a little more sophisticated and our working agreements have changed over time. We’ve just instrumented a couple of new features so I thought you may be interested in a retrospective post on our experiences.

Too much red: rush to push

The build lights worked well for a while but a negative behavior began to emerge. They encouraged a “rush to push”. This rush was encouraged by extended periods of red lights due to randomly failing flaky tests. When tests would sometimes fail due to timing issues we would kick off a new build and wait for a green light. During these waiting periods local check-ins would pile up on dev machines and the moment the light went green everyone would rush to push code. These check-ins would spawn another build and if that failed we’d have a dozen or so commits to weed through to figure out which one was the culprit. PITA.

After the build lights had been running for a couple of months, our fortnightly departmental retro diverged into a long discussion on problems with too many red build lights. In short, they were encumbering our ability to work. The outcome was a team decision that we’d focus on flaky tests and protect the build light which, in turn, would produce green lights for the majority of the time. One agreement was: every time a flaky test failed we’d round-robin assign it to a team to fix.

This strategy worked. Today the light is green for the majority of the time. Our test suite is far healthier and devs are free to push code pretty much any time during the day. But we don’t want our efforts to slowly erode away with the introduction of new flaky tests or bad practices. Here’s a selection of actions we took:

Flaky finder and metrics

One trick we pulled was to introduce a “flaky finder” job that attempts to spot flakiness in new browser tests. When someone commits a new test it must pass the multi-execution concurrent gauntlet that is flaky finder on the build system before the light will turn green.
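The core idea of such a gauntlet, running the same test many times in parallel and counting failures, can be sketched in plain Java. This is an illustration under simplifying assumptions, not our Jenkins job: FlakyFinder is a made-up name and the test is modelled as a Callable that returns true on success.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FlakyFinder {
    /** Runs the test the given number of times across a worker pool and returns the failure count. */
    static int countFailures(Callable<Boolean> test, int runs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < runs; i++) {
                results.add(pool.submit(test));
            }
            int failures = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (!result.get()) {
                        failures++;
                    }
                } catch (ExecutionException e) {
                    failures++; // a thrown exception counts as a failed run
                }
            }
            return failures;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for test runs", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated flaky test: fails roughly one run in ten.
        Callable<Boolean> flaky = () -> Math.random() > 0.1;
        int failures = countFailures(flaky, 50, 8);
        System.out.println(failures == 0 ? "looks stable" : "flaky: " + failures + " of 50 runs failed");
    }
}
```

A new test would be allowed through only when the failure count is zero.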

At a later retro we talked about how it felt like our builds were doing mostly okay and we believed that things were getting better. But we really did not know. We had no metrics. An action item that came out of this meeting was to instrument ingestion of build logs into Splunk so we could measure the health of builds in the system in fine-grained detail. We built a dashboard of graphs and tables with predefined queries over the build log data and monitor these regularly.

More build lights: who is doing what?

As we introduced more build lights we noticed that a red light could result in two or more people grabbing the build and starting to work on it. This could be duplicated effort. To fix this we implemented a “claim this build” feature. Now a red light means we have a broken build and no one has claimed it. A yellow light means the build broke but someone claimed it and is working on the problem.

“Claim this build” works through the description field of a given build in Jenkins. At first we just wanted committers to put their names in the description field to claim the build. An interesting emergent behavior was people began adding more details about why the build failed. This further improved visibility into what was going on in Jenkins. I didn’t expect this usage pattern but it is highly valuable.
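The light logic boils down to a small decision function. The sketch below is a simplification with invented names, and it assumes that any non-blank description counts as a claim; fetching the build status and description from Jenkins is left out.

```java
final class BuildLight {
    enum Color { GREEN, YELLOW, RED }

    /**
     * Picks the light color from the build result and the build's description field.
     * Assumption: any non-blank description means someone has claimed the build.
     */
    static Color colorFor(boolean buildPassing, String description) {
        if (buildPassing) {
            return Color.GREEN;
        }
        boolean claimed = description != null && !description.trim().isEmpty();
        return claimed ? Color.YELLOW : Color.RED;
    }

    public static void main(String[] args) {
        System.out.println(colorFor(true, null));
        System.out.println(colorFor(false, ""));
        System.out.println(colorFor(false, "Steve: timing issue in search test, fixing"));
    }
}
```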

Another enhancement we made is that the build lights now track weekend release jobs when they exist. We had a broken release build for a few hours one day and no one noticed, which suggested that the team was relying on the lights and not watching the Jenkins UI. The build light script now checks Jenkins to see if there are new jobs for an upcoming release and the build lights automatically track them too.

Future work

An ongoing issue we have is that our full test suite can take up to 40 minutes to complete. This means our build light can be green for 40 minutes even though an earlier test failed. During this time devs could be checking in code on top of a broken build. We could use a fail-fast mechanism or faster test suites to combat this. We’ve been working on faster test turnaround with Selenium Grid as part of this.

That’s more work for the future.