DevOps


We have been using the PER process now for a while and as we have run through a number of these we’ve made some changes to the process. As with all things at Rally, we rely heavily on inspecting and adapting. Here are some areas we’ve found we needed to adjust.

Communicating with teams outside those directly involved in the event

We have started to bring our PER process into the Dev/Ops sync meetings to raise visibility in that meeting of events that occur. After a PER is completed (meaning we’ve discussed it & all the action related stories are created) we present that PER at the start of our Dev/Ops sync meeting so that the broader team can ask questions or provide feedback. A big part of the PER process is simply to raise awareness of the types of issues which impact our service and this has been an easy way to briefly review these events with the team.

For larger events, in particular those where the system is unavailable, we have distributed the PER document to the teams and even the Executive team as necessary to answer questions about the event. We try to be as transparent as possible about these events & what we learn from them.

Creating and tracking stories from the event

One of the most difficult aspects of establishing a process like this is managing the large number of new stories which are created out of each PER. We do not focus on a single root cause for an event, instead we focus on discovering and documenting all the factors that led to the event or could lead to future events. Doing this means that we can generate a pretty large number of action items out of any given event – usually it’s 5-10 new stories.

Moving those stories through the process to get prioritized & completed is a challenge. We have created a “PER View” in Rally where we can see all our parent stories and their children and we have started to work on reviewing and prioritizing those incomplete items. The good news is that some are already complete, the bad news is that many others are not. We are continuing to evolve our process for reviewing these within Operations and with other teams to raise awareness about incomplete stories.

When to run a PER

When we first started this new process, we indicated that we would only run a PER for events which caused downtime. We have since started to expand our use of the PER to other events. This is driven by the fact that many events which do not cause downtime still expose areas where we could significantly improve things. Sometimes we get lucky and catch something in manual testing, or the sequence of events is such that production downtime is avoided, but it doesn’t mean we can’t learn from the event.

More recently we have tried to look at an event from the perspective of how much we think we can learn, rather than the impact it had on our system or customers. We run a PER on any event that causes downtime without question, but we’re looking for opportunities to use this process to learn about events which help us improve before we have downtime.

Developing Muscle Memory

After having done this a few times, we’ve noticed an interesting side effect. We’re getting better at recognizing if an event can lead to a PER, and we start the data collection process during the event itself. This has been immensely helpful in capturing detail without relying on an individual’s memory of the event. It also helps make our PER meetings more efficient by creating a document with nearly complete timeline, deltas and action items. As an example, during a recent event the document was created, the timeline was started, and an email was sent with a link to the PER. Those involved in the event could then immediately begin to contribute to that document as the event occurred.

A big part of this is people getting used to the PER process and having a template ready to go which folks can quickly fill in with details. Making it easy for anyone to start the document & allow everyone to contribute to it is key. We use Google Docs for these items but you can use just about any collaborative document system.

Making change easy

When the PER process was created it was intended to be a flexible process that allows for easy change to the process. We didn’t invest a lot of effort in creating any complex software or forms – we just created a Google Document. Over time we have modified that document (the template) a number of times as changes made sense. This was trivial to do, anyone could do it, and you could see a history of what had been changed in the document.

Although the process has changed a bit since it was started, we didn’t have to have a lot of discussion about those changes. The cost of change was low, and the ability to reverse a change that doesn’t work means there is very little risk to trying new things. This means we didn’t have to think too much about what the best process would be when we started – we just built something that was good enough and started using it, adjusting as we go.

Wrapping up

We will continue to evolve this process and we’ll talk about it more as we do, but the process seems to be working pretty well for collecting data about an event. The most important aspect of this is to be open to change & to try different ways of approaching things. Collecting ideas about what went wrong is the easy part though, the hard part is turning those ideas into actual changes to the software, process or the organization. When we get that part all figured out, expect another blog post with the results.

As we’ve mentioned in previous posts, we monitor a lot of different metrics. One of the challenges you have as your list of metrics grow is the impact metric collection has on your service. Typically monitoring methods expose a single metric per request so if I want to observe the current heap allocation I have to make a single request and if I want to observe the number of active threads I have to make another request. These requests all add up and when you query them every minute, like we do, there are concerns about being able to query all the metrics you want to in a 1 minute interval. There are also concerns about what impact all these requests have on the application itself.

Recently we have started trying to use a metrics library provided by Yammer to expose information about our application via JSON. Besides making it easy to access the metrics (they are exposed via HTTP), this library also makes it possible to collect those metrics in a single request which delivers a blast of JSON with everything in it. We’ve found adding new metrics to be pretty easy & you get a lot of metrics out of the box.

Having solved the problem of multiple queries nicely, the next problem was the overwhelming number of metrics that are now exposed. Being the lazy Ops guy who has to configure all of those into the monitoring system, I opted to automate the process so that when new metrics are added to the JSON we use our monitoring system’s API to add the new metrics & immediately start to collect data for them. This could be done for any respectable monitoring system but in our case we’re using Zabbix, so my example is based on that.

Moving from JSON to key/value pairs

The first challenge was that JSON presents data in a hierarchy and our monitoring system wants to assign a value to a key. We opted to just “flatten” the JSON structure so that the path to any given value is the key – thus we go from something like this:

"jvm" : {
  "memory" : {
    "memory_pool_usages" : {
      "Code Cache" : 0.029173533121744793,
    }
  }
}

To this:

jvm.memory.memory_pool_usages.Code_Cache = 0.029173533121744793

There are lots of examples of how to do this in whatever language you choose to use – here’s how we’re doing it in ruby:

def hash_flatten(input = {}, output = {}, options = {})
  input.each do |key, value|
    key = options[:prefix].nil? ? "#{key}" : "#{options[:prefix]}#{options[:delimiter]||"."}#{key}"
    if value.is_a? Hash
      hash_flatten(value, output, :prefix => key, :delimiter => ".")
    else
      clean_key = key.gsub(/\s/, "_")
      output[clean_key]  = value
    end
  end
  output
end

Once we have a list of key/value pairs we dump these metrics into a file and use the zabbix_sender tool to deliver all those metrics to Zabbix every minute.

The next issue is how to make sure all the metrics in the JSON show up in our monitoring system without Ops having to do any extra work. For this we leverage the Zabbix API.

Creating Metrics via the Zabbix API

We are using Templates to associate a list of monitored items with a host. This makes it easy for us to add a new host & apply a few templates to get all the metrics we want associated with that host. When we are adding new metrics via the API we want to make sure we add them to the Template, not to the individual host being monitored. The API has an authorization process that I’m not going to cover here, it’s covered pretty well in their documentation. We are going to focus on creating new metrics.

The Zabbix API offers a convenient method “item.exists” to query a particular key to see if it already exists. We use this to confirm a key is missing before we create it. Creating a key involves a fair amount of information about the key you want to create, so here is the code (ruby) and then we’ll talk about it a bit:

def create_item(key, apps, hostid, value_type)
  #appids = []
  # Get a list of application ID's
  appids = apps.collect { | app | get_app_id(app) }

  req_json = jinit("item.create")
  req_json[:params] = {
    :description => key,
    :key_ => key,
    :hostid => hostid,
    :applications => appids,
    :type => 2,
    :history => 90,
    :value_type => value_type
  }

  api_req(JSON.generate(req_json))
end

Ok, so when this is called we pass a number of parameters to create_item:

  • key – this is the name of the item (the “flattened” JSON path to the value listed above).
  • apps – this is an array of application names we want this key to be added to. Refer to the Zabbix manual for the relevance of these – just know you can associate multiple apps with a single key.
  • hostid – In our case, this is the template ID since we’re using templates, but this could be any host ID you want to associate this item with. The Zabbix API re-uses the hostid to pass the templateid for items associated with templates.
  • value_type – this defines the type of value this is: string, float, int, etc

We have to iterate over the list of apps provided and obtain their numeric IDs (apps.collect). To do this we have another bit of code called get_app_id() which resolves an app name to an app ID via the API. We setup the basic JSON request structure with a call to jinit() along with the API method we want to use, in this case ‘item.create’. Using jinit() takes care of making sure we have an authentication token, keeps track of request IDs and sets up the basic structure that’s common to all API requests.

This is what jinit looks like:

def jinit(method)
  # If we don't have a token and we aren't
  # being called to setup authentication then
  # we setup auth first
  if !$auth_token && method != "user.login"
    auth_setup
  end
  json = {
    :jsonrpc => "2.0",
    :method => method,
    :auth => $auth_token,
    :id => get_id(),
  }
end

Once we get back a hash from jinit we complete the params portion of the hash with all the values we were passed. The two static values, type & history, are common to all our requests. We keep 90 days of per-minute data on monitors & setting “type” to “2″ means that this is a “Zabbix Trapper” key type which is required to receive data sent to Zabbix via the zabbix_sender utility.

This generates a JSON request that looks something like this:

{
  "auth": "12345678901234567890",
  "method": "item.create",
  "id": 1,
  "params": {
    "key_": "test.db.query.composites.duration.p95",
    "history": 90,
    "hostid": "100100000010157",
    "applications": [
       "100100000001773"
     ],
     "type": 2,
     "description": "test.db.query.composites.duration.p95",
     "value_type": 0
   },
   "jsonrpc": "2.0"
}

Zabbix responds to this type of request with the itemid that was created as the result and keeps the “id” value the same as your request so you can associate the two:

 {
   "result": {
     "itemids": [
       "100100000039405"
     ]
   },
   "id": 1,
   "jsonrpc": "2.0"
 }

Now that the key is created zabbix will collect data about this value anytime it’s delivered via zabbix_sender.

How do you remove metrics that are no longer in use?

We don’t. Partly this is because I’m paranoid and wouldn’t want to clobber months or years of data because of a bug. Another reason is because the metrics are reasonably easy to remove via the Zabbix UI or API using a purpose built script and overall we rarely remove metrics from our systems. We also found that on application restart, if certain individual metrics had not been incremented they would not appear in the JSON. We’re working on fixing this for consistency but we wouldn’t want a problem like that in the application to cause us to remove metric data, we would rather just fail to collect.

Wrapping up

This method works well with a variety of monitoring systems but relies on the metrics being discoverable without knowing what they are. Typically this means they need to be exposed via JSON or XML and provided to the polling client with a single request. There is an additional side effect of doing this that may not be clear. If you make it automatic and easy to expose & graph metrics in your application then Developers can suddenly add metrics & observe those very easily without asking anyone. More metrics means better understanding & hopefully happier Dev & Ops folks.

Recently I blogged about our approach to using executable jar/war files for deployment. As part of this we came up with an interesting method of handling build numbers. We did not want to be restricted by things like incremental numbers since they can get weird if you branch (which for this app we do not). What we decided to do was to use a combination of the git SHA and the date. The SHA is sufficient for telling us the head of the repository for a specific build. The date portion just makes it more human readable with the additional ability of being able to group builds by day. Below is the ruby code we use within buildr to calculate our version number:

require 'date'

def calculate_version_name
  date_string = Date.today.strftime("%G.%m.%d")
  sha = `git log -n 1 --pretty=format:'%h'`
  "#{date_string}-#{sha}"
end

This ends up producing a build number that looks something like this – 2011.10.25-836ab79

The only drawback we’ve seen from this approach is that the build numbers cannot be easily sorted chronologically, but so far we haven’t needed that.

What do people think?

Recently Adam, Steve (the Chief) and myself have been working on a new service that involves a brand new web-application. I won’t bore you with details of what it does because it won’t be publicly facing. Regardless, we had to consider deployment options and after a few false starts we decided to build an “executable WAR”. Basically a war file that you can start up by calling “java -jar mywebapp.war”.

Why?
First we liked the idea of having a fully contained application in a neat and tidy package. Second, this isn’t a case where we can build the war and let our operations team pick the container. Our application requires a servlet 3.0 container so our options were limited. And finally, we just wanted to learn how to do it :)

How?
Our first naive thought was that if a war file has a manifest with a Main-Class declaration that everything would just work. We were very, very wrong. Since war files keep their classes in WEB-INF/classes and not in the root of the jar this failed. Next we put our main class in the root of the jar, it started but our app server classes couldn’t be found because we put them in WEB-INF/lib, *sigh*. Then it all started coming together, any dependency that we typically consider “provided” will get un-jarred along with our main class into the root of the jar, then the rest will be the typical war format. It looks like this:

executable_war_layout

We did this all in buildr by adding our own packager and calling it :executable_war.

module Rally
  module Packaging
    module ExecutableWar
      class ExecutableWarTask < Buildr::Packaging::Java::JarTask
        attr_accessor :main_class, :libs, :provided, :classes

        def initialize(*args) #:nodoc:
          super
          enhance { |jar|
            libs.each { |lib| lib.invoke if lib.respond_to?(:invoke) }
            classes.each { |dir| jar.include dir, :as => 'WEB-INF/classes' }

            classes.each do |dir|
              class_dir = dir.to_s
              class_dir_classes = "#{class_dir}/**/*.class"
              main_class_path = main_class.gsub("\.", "/")

              Dir.glob(class_dir_classes) do |file|
                jar.path(File.dirname(file.sub(class_dir, ""))).include(file) if file.include?(main_class_path)
              end
            end

            provided_target_dir = "target/provided"
            mkdir_p(provided_target_dir)
            cd(provided_target_dir) do
              provided.each do |dependency|
                `jar -xf #{Buildr.artifact(dependency)}`
              end
            end
            jar.path("/").include("#{provided_target_dir}/*")
            jar.path('WEB-INF/lib').include Buildr.artifacts(libs)
          }
        end

        def main_class=(main_class)
          @main_class = main_class
          add_to_manifest({'Main-Class' => main_class})
        end

        def version=(version)
          @version = version
          add_to_manifest({'Implementation-Build' => version})
        end

        private
        def add_to_manifest(options)
          if self.manifest
            self.manifest = self.manifest.merge(options)
          else
            self.manifest = options
          end
        end
      end

      def package_as_executable_war_spec(spec) #:nodoc:
        spec.merge(:type=>:jar)
      end

      def package_as_executable_war(file_name) #:nodoc:
        ExecutableWarTask.define_task(file_name).tap do |war|
          war.with path_to(:source, :main, :webapp)
          war.with :classes => [compile.target, resources.target]
        end
      end
    end
  end
end

class Buildr::Project
  include Rally::Packaging::ExecutableWar
end

And here is how we use it from our buildfile, where PROVIDED represents the buildr dependencies on things like Jetty and the Servlet API

package(:executable_war).with(
    :libs => compile.dependencies - PROVIDED,
    :main_class => "com.rallydev.server.jetty.EmbeddedJetty",
    :provided => PROVIDED,
    :version => "1.0")

And to start the server, as alluded to above chose Jetty 8, here is our code. Note the DEV_MODE flag so we can still do in place development when not deploying the jar file.

package com.rallydev.server.jetty;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.webapp.WebAppContext;

import java.net.InetSocketAddress;

public class EmbeddedJetty {

    public static void main(String[] args) throws Exception {
        InetSocketAddress address = new InetSocketAddress("0.0.0.0", 8080);

        String warPath = Boolean.getBoolean("DEV_MODE") ?
                "src/main/webapp" :
                EmbeddedJetty.class.getProtectionDomain().getCodeSource().getLocation().getPath();

        Server server = new Server(address);

        WebAppContext context = new WebAppContext();
        context.setContextPath("/");
        context.setWar(warPath);
        server.setHandler(context);

        server.start();
        server.join();
    }

}

What does everyone think? Leave a comment and tell us what we screwed up :)

Ever feel like your Ops team just doesn’t understand what’s going on in Development? Ever feel like your Development team is just “tossing stuff over the wall”? I’ve felt like that plenty in the past. There are all kinds of strategies to battle this, this post will talk about some of the things we do at Rally to keep things moving smoothly.

Some of these are dedicated meetings and some are just Ops being involved in the regular Development process. In general the idea is to try to have Operations and Engineering in the same room as much as possible when decisions are being made.

Stand Up Integration – Daily, 5-15 minutes

We have many different Dev teams working on a variety projects at the same time. That makes for a big challenge staying in touch with all those projects in Ops. One of the most effective ways that we have found to get visibility into what is going on in Dev is to attend their daily standups. This doesn’t require a ton of our time, usually 5-15 minutes per team, and has a huge benefit.

We try to have one person in Ops assigned to a particular stand-up and work with Development to stagger their stand-ups so that we can attend them. Ideally the same person in Ops goes to the stand-up each day.

A few things attending standup each day buys us:

  • Get to hear about new changes as they are being concieved & built.
  • Ability to gather feedback from Ops & Dev each day and keep both teams in sync as ideas evolve.
  • As issues come up in production for a given team they can be brought to the Dev team directly by someone from Ops who is familiar with their projects because they attend the stand-up each day.
  • Establish a relationship where the designated Ops person works as part of the Dev team. They aren’t necessarily responsible for all the tasks that Dev team needs, but they are an easy and available point of contact.

This doesn’t solve 100% of your problems, but it does help to open up communication & involvement between the teams and goes a long way toward getting things moving smoothly.

SoS (Scrum of Scrums) – Daily, 5-10 minutes

This typically happens following most of the individual team stand-ups. This is a very quick meeting for teams to raise any cross-team dependency issues. If Ops had a production issue the night before, we let everyone know at SoS. If there are problems that are blocking another team and they need help, they get raised at SoS. If there’s an important demo going on and we need to be extra vigilant, it gets raised at SoS.

SoS is a fantastic opportunity to make everyone aware of important current events. If there are issues that are raised which require additional discussion we typically have folks stay behind who want to discuss and that meeting happens immediately following SoS.

Dev/Ops Sync – Weekly, 1 hour

Once a week we pull the Operations & Engineering teams together for a meeting we call “Dev/Ops Sync”. This meeting has evolved over time but the basic premise & agenda is something like this:

  • Discuss changes in this weeks release, what features are being toggled on, what does Ops need to be aware of with the upcoming release.
  • Talk about any open issues we have for the Dev team. This could be defects we have created which are impacting production, new issues that have been discovered since the last meeting, or areas of concern where we want input from Developers to help steer our actions.
  • Make Development aware of any outages & share the associated PER document.
  • Talk about any issues Development needs to bring to Ops. Some of these issues are headed off by Ops attending stand-ups but sometimes there are larger issues that need to be discussed.

We try to have someone from each of the Engineering teams represented here which helps facilitate making sure we can cover any questions that come up for any of the teams.

Architecture Council – Weekly, 1 hour

This meeting isn’t strictly for keeping Dev and Ops in sync and more often than not it’s pretty Dev focused, but Operations involvement in this meeting is always significant. The Architecture Council at Rally is where our larger architectural issues are raised, discussed, and assigned to teams for prioritizing the work. This team is intended to make sure there is a forum for raising issues that could cause us problems down the road, for reviewing the design of new features and components being built, and to provide a way to keep all teams in sync around what important issues are being worked on.

Sometimes this team meeting is where decisions get made, sometimes it’s just to review what decisions are being made, but it’s one of the areas Dev & Ops come together to get consensus on how to move forward.

Retro & Demo – Bi-Weekly, 1 hour

Again, not strictly for the benefit of Ops, but these are two meetings where Dev and Ops come together to talk about how things are going. Every week on Friday we have either an Engineering Demo or a Retrospective. Both of these are opportunities to share what work is being done and what’s working or what isn’t.

The Engineering Demo

This is where our Development Team shows off what they’ve been working on. Sometimes it’s Will showing us some awesome new menu’s, other times it’s Jeff showing us an app that says “Hello World”. Either way, Dev & Ops get to ask questions and give feedback on work in progress, hang out and relax, and have fun in the process.

The Retrospective

Our Bi-Weekly retro’s are where each team reflects on what worked and what didn’t over the previous 2 week period and share those items that they think are important for the other teams. This is a great opportunity for Ops to raise examples of things that worked well and to call out areas where we need help. It is also a chance for any of the teams to raise items that aren’t working well and get those items assigned to an owner so they get fixed. It’s not perfect, some things take time to address, but if something continues to be a problem it continues to be raised here and usually gets fixed sooner than later.

Do what works for you

These are the things that are working for Rally today, but we change all the time and try new things. You may not need all of these meetings to have a successful Dev/Ops culture but hopefully some of these ideas help. The most important part of all of this is to foster communication and involvement by both teams – making team members accessible & constantly asking “what can we do better?”.

In Part I & Part II we talked about the overall structure of the PER process & the information we collect. In this post we will talk about the meeting we use to wrap all this up. Any time we have a service impacting event we have a PER meeting to discuss the event and gather all that information we discussed in Part II. Read on for details about that meeting.

A note about facilitating a PER meeting

At Rally we are big fans of meeting facilitators and we even have a group of people who have been trained to facilitate meetings for teams they aren’t on. Often meetings at Rally aren’t run by a manager or person on the team but rather by a facilitator who is asked to come in and facilitate. The facilitator can have a large impact on the outcome of the meeting, both in the quality of the result as well as whether or not you finish on time. We could write a whole post on meeting facilitation but here are a few things to keep in mind when you facilitate a PER meeting:

  • Avoid coming up with a solution to the problem. The PER meeting stops at identifying what the next action for a given delta is.
  • The Facilitator should represent the group. This is why a 3rd party facilitator is good – they don’t have an opinion about what is being discussed. As a member of the team, make sure you get the groups permission when you add something to the document & allow the group to drive the decisions. Note: this is hard.
  • Control cross-talk. Discussions like this can get pretty passionate and folks will start talking over each other – as the facilitator you need to make sure everyone has a chance to talk. Sometimes your most timid members will have great ideas – you want to hear those.
  • Keep it on time. These meetings are draining if they run too long – if you need to do some research & come back for a 2nd meeting do that, don’t hold everyone in a meeting trying to get answers you don’t have.

Scheduling the PER meeting

We have a rule that PER meetings must be held within 24 hours of an event to make sure things are fresh in folks minds. No, we don’t schedule them on weekends for end of the week events – we do them on Monday. For most events we will schedule 1 hour and we can typically wrap up a PER in that time. For larger teams you may need a longer meeting and for any large event where you have multiple teams you may have to extend the meeting.

Who should attend?

The PER meeting should include anyone who was involved in the event or needs to be present to come up with corrective actions. For problems where the component impacted is maintained by Operations, we will typically keep the meeting constrained to Ops, but for events where an application problem was involved we include a broader audience.

Folks at Rally who might get called into a PER meeting include

  • Operations – Familiar with the event and typically the ones who responded to it.
  • Development – Familiar with the applications & service as well as what might be tweaked to improve things.
  • Product Owner – Able to prioritize important changes within Development
  • Customer Support – Responsible for communication with customers during an event

For significant enough events you might even have folks from Marketing & Legal involved. Let’s hope that’s not the case – but don’t exclude anyone who would be able to help improve things.

Outcome of the PER Meeting

When we fire up a PER meeting we have a few specific goals for that meeting:

  • Make sure we have an accurate & complete timeline of the event.
  • Identify the most complete list of Pluses & Deltas possible.
  • Identify Corrective actions and assign ownership to someone.
  • Agree on the overall event metrics

That’s it. Now granted, getting those goals completed can be a long road sometimes. In the documents linked below we have included a detailed meeting agenda as well as the items that we cover for meeting preparation. I would encourage you to have a look at those documents to get an idea of what items we cover in the meeting & what the structure of the meeting is – there’s not a lot of reason to cover that here.

Wrapping Up

For folks who may not already be doing this we encourage you to find your own balance in this process. The best process is one that you actually use. Do not create a process that you avoid because it’s too much work or too heavyweight.

We’ve made our own documents around this process public to help you with some examples. You are welcome to copy as much of these as you would like & make them your own.

Post Event Retrospective Process

Post Event Retrospective Template

I appreciate you sticking with us to this point – if you have feedback on this process or ways that you’ve found which work in your environment please share them in the comments – we want to know what others are doing too.

I feel like we do a pretty good job of instrumenting our systems here at Rally. I know part of the reason for our instrumentation quality is our constant attention to how we can improve it & a feeling that it isn’t good enough. The reality is that it should never be good enough – things are always changing and you are always trying to get visibility into some dark corner of your system. All that said, we had a recent event that provided a good example of how we understand a high load event on one of our servers and I wanted to talk about how that went.

We use some pretty standard tools – we have invented a few things here and there to help us but in general these are all tools you can use yourself – the main components of our instrumentation are:

  • Cacti for on going trending data
  • Splunk for ad-hoc slicing and dicing of log data
  • Postgres data warehouse which stores data extracted from our logs
  • Tableau for visualization of more complex data & report generation

Todays story is mostly focused on Cacti & Splunk.

It starts with what you alert on

The investigation begins because we receive notifications of high load on one of our application servers. We set fairly low thresholds for load on our systems and keep them tuned pretty close to ‘normal’ so that any increase triggers an investigation. This can get annoying sometimes because it doesn’t always indicate a real problem – but it very often does indicate a change in behavior & sometimes shows us a problem in the making. We want to know when problems are brewing and certainly want to know before our customers do – so this approach suits us well (at the cost of some false positives). In this particular case, had we not set such low thresholds for load alerting we probably would not have noticed this issue because it never actually caused any problems with service delivery.

Here’s a load graph from a 24 hour period of this system – the window of concern is from 1pm to 3pm:

load-24

The ramp you see in the morning is pretty normal, but the ramp at 1pm is unusual and crossed our threshold and began to fire alerts.

Digging in – what’s driving the load

OS load is generally a poor indicator of system performance but it’s often useful as a broad view into the system activity trend and is useful for identifying anomalies. Beyond that – load doesn’t provide much insight into what is going on and you have to get more stats. Here are a few more graphs showing some other characteristics of this same system during this time.

Note that the rest of the graphs focus on a window from 12:30pm to 3pm to focus on the event. These graphs are all specific to the JVM on which this problem was being observed – they aren’t system-wide graphs. It’s important to be able to focus in on a single system as well as having system-wide views into activity.

Here is a graph showing a message queue used by the application – it shows that messages are beginning to back up in the system starting around 1pm.

message-q-timeframe

And here is a graph showing our garbage collection activity in this same window. You can see that this activity corresponds to the messages collecting in the queue above.

gc-timeframe

Looking at these 2 graphs we are starting to understand why our load is occurring. We have an increase of GC activity which typically will drive system load. That GC activity is likely more of a symptom than a cause. What these graphs do not tell us is whether the increasing message queue is due to an increase in message volume or an increase in message processing time – for that we need to start looking at individual requests.

We have to dig further – per request information

At some point aggregate statistics only help so much. The graphs above are general indicators to help focus a search but they are not enough to really identify what is going on. At some point you have to start looking at the activity going on during that time at a per-request level to weed out the noise and focus on those events which are driving this activity.

We use Splunk for this today – it does a good job of taking raw log data and allowing us to slice and dice it various ways to find specific activity. In order to make this useful we’ve instrumented our logs to expose a lot of detail about each request. We will talk more about that in a future post, but for this post just know that our ability to correlate log data to this event is highly reliant on having a lot of detail in those log messages that is easy to parse by a tool like Splunk.

We have a log message which indicates when message processing rates exceed a threshold – the application logs this message to help us identify outliers. Looking for this message in splunk gives us an interesting looking timeline:

Screen Shot 2011-09-29 at 12.00.22 PM

This is interesting, so during this same window we know that there was an increase in messages that took longer than we like them to.

Because these log messages are also instrumented with the customers ID we can look at all the activity for that customer and see if there’s any correlation to get an idea of volume.

Here’s a histogram of the volume of requests from that customer during this time:

Screen Shot 2011-09-29 at 12.04.01 PM

Now we are starting to understand what is going on. There was a dramatic increase in activity around 1pm which doesn’t directly correlate to the increase in processing times & GC activity but it’s pretty close.

What is also interesting is that while the number of messages taking over our threshold (first Splunk graph) seems to follow the same trend as the GC graph, the overall volume for this customer doesn’t follow that same trend. The timeline is similar – but it seems like there might be some specific requests in here which are driving these issues – not just the overall volume of requests.

One of the other important pieces of information we include in the log message for each request is the URI of the request. We can exclude endpoints which cannot generate a message (since the message queue is the real issue here) and understand what activity is driving the creation of these messages.

When we do that we get this histogram:

Screen Shot 2011-09-29 at 12.18.02 PM

Looking at this you might think “That looks nothing like the pattern above” and you would be right – except for a few things:

  • The start time of these requests correlates much more closely with the start of our event than the overall increase in request volume does.
  • The start/end of these requests correlate with plateaus and positive/negative slopes in the message queue graph.
  • If you envision these requests driving message generation and it taking time for the message processing to complete – you can start to correlate the graphs.

We can now look at these events & the notifications they produced and build a picture around what drove the other behavior in the system.

These graphs are great, but what was the actual problem you found?

The bottom line in this example is that we have a situation which isn’t causing us problems right now but which could grow into a problem in the future. The actual source of the event is complex and is a combination of the data this customer has, the way they are manipulating this information & the way our system is architected to handle that.

But this post isn’t about the source of the event – it’s about visibility.

During this event the system performed fine for our customers. Despite the graphs in this post, the warnings that triggered this investigation were just that – warnings. We didn’t have to take any action because there was no problem to resolve, the customer was using the system as it was intended and the system was working fine.

We know this is true because we have data to show us this is true.

We have graphs showing us server side response time, client side response time, failure rates, etc – all of those looked great. The thing that didn’t look great was the system load & it triggered an investigation just like it’s supposed to.

What do we do with all this data we have now that we’ve investigated?

Continuous improvement is an important concept at Rally and we take it very seriously. This event is something we want to understand better. We want to make sure this doesn’t turn into a problem in the future.

Shortly after this investigation concluded and we had a good idea of what occurred the Operations team summarized the findings in a document & that was distributed to the Engineering & Operations teams. Each week the Engineering & Operations teams meet for a meeting we call “Dev/Ops sync” – this issue was raised there and discussed. After some discussion this event had an owner in Engineering & a goal of investigating the effort to improve our handling of this for the future.

This effort will turn into a story and be prioritized for work in the future. We don’t stop there.

Another important group we have at Rally is our “Architecture Council”. This is a group of Engineering & Operations members who are passionate about the way that Rally evolves our service. Architecture Council was formed to track and address problems just like this – things that aren’t problems yet but will probably become problems in the future. That group has the ability to prioritize work into Engineering teams & get it done.

As time goes on, if we discover events like this being repeated they will continue to be raised at Dev/Ops sync & Architecture Council. Data such as the impact of these events on the system, the impact on customer experience, and the resources we consume each time they occur will drive the priority.

The important idea here is that it’s data driven, and it all starts and ends with visibility. You have to instrument your systems to know what they are doing & why. You then have to communicate those events to the team so that improvements can happen.

We have a new component that went into production 2 weeks ago. It had been under development for about 2 weeks when we got it running on production servers. At this point, the application itself does one thing – says “Hello World”. What had we been doing those two weeks before we first deployed it?

  • Evaluating & selecting the technology for the new component.
  • Determining packaging of the application
  • Building startup scripts
  • Building deployment scripts
  • Testing the deployment in test environments.

This application deployed to production two weeks ago onto new systems and will not see any production traffic for a while – it still only says “Hello World”. It will continue to live on VM hosts until we determine that it has specific hardware requirements.

Why is this being done? Because there is a lot we do not know and the sooner this application is in production the sooner we can get answers to those questions. We have performance testing environments that work pretty well for us – but there are lots of peripheral issues to work out when you get a system into production and we wanted to work through those early.

We have a continuous build environment and fully automated weekly deploys to production. We use feature flags and branch in code to control code availability in production. The hard part of building a new component is introducing it into that weekly flow without causing any problems – so in this case we are solving that problem first. Now that the application is in production it should be minor tweaks to build it up to the point where it’s getting production traffic. Each incremental change that’s made will go into production within the week – if Developers want to start diverting some traffic to this component for testing, they can do that with very little interaction from Ops (feature toggle most likely). Operations can begin to monitor the application, deal with logs, start collecting performance metrics – all before this application sees customer traffic.

When we get to the point where the application is fully feature complete Ops should have a pretty good handle on knowing if it’s working & have worked out any kinks related to it’s production readiness.

The caveat to all of this is that you must have mechanisms in place to make sure this new system isn’t going to cause harm in your production environment. This application goes through the same functional & performance testing environment every day that all of our applications do. From the very beginning, even though it only said “Hello World”, there was a performance test to exercise the query that produced “Hello World”. For every incremental change we make to the application, tests will be built to make sure that new functionality works as expected before it hits production. As new features are added to the application, things like DB connectivity, Ops is notified & we’re monitoring the impact these new changes have on the overall system. Because the change is incremental it’s a lot easier to identify any problems and respond to them.

There are a lot of benefits to this approach for new code:

  • Ops gains comfort with the new component very early & can provide feedback to Developers while it’s being built when there is the least possible resistance to change.
  • Rather than guessing at hardware requirements you acknowledge that you aren’t sure and deploy in the least costly way and get feedback on performance as early as possible.
  • Developers can test in production very early to catch issues that simply are not exposed in our testing environments.
  • Operations has a lot more time to get configs into our CMS (Puppet), get monitoring setup, and get log ingestion setup before customer traffic is flowing through the system.
  • As the product is developed, Operations is giving status at stand-up about the production environment readiness & any problems encountered.
  • Everyone is happier because they are working together from the beginning to make this happen.

Not every component we build goes out this way – and in fact, it’s an uncommon event for us to add components. So, how did we end up getting things prioritized so that the deployment tasks were done up front?

  • Everyone agreed that getting the application into production successfully was a critical part of the development process – and sorting it out early had a lot of benefits.
  • There were a lot of questions about what the software would need to do & how it would perform and our answers to those questions needed to be validated in both testing & production environments before the system went live.
  • The Dev team had design & testing to do before they could start building a system – this time was used for the build/release/ops teams to get moving on the deployment configurations, packaging & target systems.
  • An Operations team member was present from the very first stand up for this team.
  • The build/release team was present from the very first stand up for this team

This may not work for you, but we think it worked pretty well for us. As the system is developed and becomes an integral part of our system receiving production traffic, we can let you know how this whole process contributed to the quality of the product. For now, it delivers a damn fine “Hello World”.