Tue 13 Sep 2011
Come, Watson, come! The Game is Afoot!
What do you do if your application crashes?
Restore service ASAP!
Now what?
The investigation and forensic analysis begins… Watson, the game is afoot!
Postmortem analysis of application failure is an incredibly useful skill to learn. When an application terminates unexpectedly there is no image in memory that will allow you to attach a debugger and inspect the program state. By the time you actually get to the analysis stage it is likely the application will be up and running again on your servers in an healthy and active state.
Preparation and planning for outages will be invaluable when the crunch comes. At Rally we have an array of monitoring systems and reactive scripts, logging, event alerts, and well-known documented processes for what to do and when do it. With the right infrastructure in place your application will leave a trial of clues that you can follow to discover the smoking gun.
This trail of clues may include:
- Logs — from application server, web server and database.
- Data from monitoring systems — heartbeats, memory/heap recordings, request rates, network traffic.
- Configuration files — did they change? Are they correct?
- Thread dumps.
- Eyewitness reports — not always 100% reliable but can point your investigation in the right direction.
- Documentation recording what changed — from SCM tools, mailing lists, documents or posts to systems team blog.
The Rally core application is Java-based so thread dump analysis is an activity we sometimes perform. This isn’t something you have to wait for a crash to do. When the application slows down or behaves in an unexpected manner we take thread dumps and investigate to see what is happening. This process guides and prioritizes work for those of us working on architecture and scaling.
A thread dump is a snapshot of the threads and monitors in your Java application. To generate a thread dump you send a SIGQUIT on Unix systems kill -3 <process_id> and on Windows systems control-break from the console of the running application. This will create a snapshot of the internal system state and print it to standard out. A gotcha here is that start-up scripts don’t always capture standard out or may send it to /dev/null which is not what you want right now.
Here is an snippet of a thread dump from one of our production nodes that exhibited a high load one afternoon: "qd-app-0225602626-some-user@random-host.com" prio=10 tid=0x00002aaae40e5000 nid=0x681b runnable [0x0000000048cd6000] java.lang.Thread.State: RUNNABLE at java.util.AbstractMap.hashCode(AbstractMap.java:458) at com.f4tech.fcl.identity.Identifier.hashCode(Identifier.java:133) at com.f4tech.fcl.domain.DomainObjectImpl.hashCode(DomainObjectImpl.java:34) at java.util.HashMap.put(HashMap.java:372) at java.util.HashSet.add(HashSet.java:200) at java.util.AbstractCollection.addAll(AbstractCollection.java:305) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.domain.Artifact.getRecursiveSuccessors(Artifact.java:1097) at com.f4tech.slm.services.artifact.NotificationUserCollector.collectProjectNotificationOwnerNames(NotificationUserCollector.java:63) at com.f4tech.slm.services.artifact.NotificationUserCollector.collectProjectNotificationOwnerNames(NotificationUserCollector.java:54) at com.f4tech.slm.services.artifact.NotificationUserCollector.collectNotificationOwnerNames(NotificationUserCollector.java:26) at com.rallydev.slm.services.notification.ArtifactChangedMessageFactoryImpl.collectNotificationOwnerNames(ArtifactChangedMessageFactoryImpl.java:122) at com.rallydev.slm.services.notification.ArtifactChangedMessageFactoryImpl.getChangeNotificationUsers(ArtifactChangedMessageFactoryImpl.java:113) at com.rallydev.slm.services.notification.ArtifactChangedMessageFactoryImpl.createChangeNotification(ArtifactChangedMessageFactoryImpl.java:77) at com.rallydev.slm.services.notification.ArtifactChangedMessageFactoryImpl.createArtifactChangedMessage(ArtifactChangedMessageFactoryImpl.java:57) at [snip] com.f4tech.fcl.persistence.EclipseLinkPersistenceContext.commit(EclipseLinkPersistenceContext.java:310) - locked <0x0000000637383500> (a com.f4tech.fcl.persistence.EclipseLinkPersistenceContext) [snip] com.rallydev.annotations.persistence.PersistenceContextCommitAspect.processPersistenceContextCommit(PersistenceContextCommitAspect.java:43) at sun.reflect.GeneratedMethodAccessor1194.invoke(Unknown Source)
The first line of the dump is the thread information line which includes details on priority, thread ID and state runnable. The following section contains a trace of the thread’s stack at the moment of the thread dump. Yes, thread dumps get wordy. So you will need some assistance when reading them. Fortunately, there are tools to help with analysis, e.g. Thread Dump Analyzer and Samurai. I tend to use TDA and in particular its long-running-thread detection feature.
The key to thread dump analysis is to take 4 or 5 sets of thread dumps every 5-10 seconds. So you’ve got a snapshot of the application server state for around a minute. You are looking for blocked or long running transactions: try to find threads marked blocked or waiting. These may cause deadlocks in systems. By loading up all your thread dump files in TDA and right-clicking to access long running thread detection you’ll be able to find these problem areas fairly quickly.
The thread dump contains IDs for each thread and will show in the stack trace which line of code it was caught at. If you see the same thread in each dump at the same line of code this suggests an issue requiring investigation. In our example above we noticed that there were a bunch of threads processing artifact changed messages and building user notifications from them. Further analysis of our request logs showed that a user had uploaded a lot of new data through CSV import which put a strain on the server, causing a high load alert and the automatic capture of the thread dumps. Now we know that we should work on optimizing this code to prevent imports like this being an issue in the future.
