"What the \$%^&? This worked just fine in test-kitchen! Why won't it converge on my host?"

- me, last week

At Rally we are working to build a service management platform that we treat like any other software project. Our goal is to provide developers with familiar tooling to manage everything from their local development environment to the Continuous Integration environment, all the way through managing the testing and production environments. A large part of this tooling involves using Chef to manage infrastructure and deploy code.

There are 10 types of problems that annoy me more than any other in Chef: dependency resolution, and dependency resolution. (Numbers don't always mean what we think they should ... sometimes they're more important than we realize.) In this post I'll cover some of the ways that Dependency Solving can go wrong and what we can do to avoid it. This usually comes down to having specific version constraints and a process that tests all of our cookbooks together, before they are converged on a production node.

TL;DR?

There isn't a TL;DR. This is a complex topic, so this post is long. Breaking it apart would make it more difficult to understand. (See the references at the end for even more reading.)

### The Goals

This post is largely about confidence and predictability in developing Chef cookbooks. As such, I'd like to define some goals we generally have when creating cookbooks or any software really.

Test and Prod Should Produce as Similar Results as Possible

The utopia of software development in general is to have a test environment that allows us to test behavior in such a way that we're confident we'll get the same behavior in any other environment we care about. A fundamental aspect of having predictable behavior is dependency resolution. If we can't know that we're running the same version of software in prod as we are in test, all bets are off.

Many software projects make it very far without solving these problems in highly reliable ways. You can disagree with the above goal, but the remainder of this post assumes you don't.

This post also assumes you're familiar with the expectations around SemVer. The goal of SemVer is to create a predictable pattern of version progression so a developer can choose the level of risk they are willing to assume in automatically pulling in updated dependencies.

Development on One Cookbook Shouldn't Impact Others

This is where things start to get rough. If we pull in the yum cookbook, we want to know that the expectations we have of that yum cookbook are going to remain the same until we decide to pull in later modifications. Dealing with unexpected changes is a great source of unplanned work. This can happen in Chef in a few ways.

1. A new cookbook version is uploaded to the Chef server which behaves differently, yet still gets used by our cookbook. This should be completely avoidable with appropriate use of SemVer and version constraints. Reality, however, has a much more complex sense of humor.
2. A completely different cookbook with the same name can be uploaded to the Chef server. An example of this is the two unrelated rbenv cookbooks (fnichol's and RiotGames'), both of which are named rbenv. You see, Chef has no namespacing and thus no way to ensure that cookbook name collisions don't occur. More on this later.
3. Chef-client can fall back to using an older version of a cookbook due to version conflicts. Again, more on this in just a bit.

Ability to Promote Changes Across Environments

We all know that test environments can only ever provide so much confidence. For high-risk changes or just for environments where change is high-risk, the ability to promote a change across environments is important. As much as I love Continuous Deployment and the idea of instant promotion across environments, it's not a reality for many environments. Even in Continuous Deployment, it pays to deploy one environment at a time and stop if there are problems.

### Assumptions

In this post we assume a few things, because this is how we do business:

• Our cookbooks each reside in their own repositories; we do not have a single central "Chef repo". As such, we have to resolve our dependencies before uploading to the Chef server.
• We are using Berkshelf for dependency resolution and for uploading our cookbooks to the Chef server.
• Cookbooks may only be published to the Chef server through CI; nobody uploads directly to the Chef server.

### Potholes

To get to these goals, we have to take a road with some potholes. Some are little holes: they spill our coffee, and make us feel a little burned. Some are bigger holes: they might give us a flat tire, and set us back a few hours. Some are gigantic sinkholes of despair that consume entire development teams. Here are some of those potholes to watch out for.

Berkshelf 2 Shortcuts

This pertains only to Berkshelf 2, but is an important idea as many folks are still using 2.x. Because interacting with remote APIs is expensive and makes things brittle and slow, Berkshelf 2 took some shortcuts.

Take a brand new workstation that's never run Berkshelf before, and let's say that we have the following dependencies in our cookbook:

depends 'thing1', '~> 1.2.0'
depends 'thing2', '~> 3.4'

Berkshelf resolves these and records the following locked versions in our Berksfile.lock:

"thing1": {
  "locked_version": "1.2.5"
},
"thing2": {
  "locked_version": "3.5.1"
}

Now, we find a bug in thing2, so we fix that bug and we push the new version 3.5.2 up to our Chef server. We come back to this cookbook and run berks update and still run into the same bug.

Because our version constraints are satisfied by using the 'local' version 3.5.1, which is in our ~/.berkshelf/cookbooks directory, berks never bothers going out to our Chef server or other sites to see if there's a newer version of thing2.

This can lead to inconsistent behavior across environments where the version used in test (because it was cached) is not the same version used in production (because it pulls from our Chef server every time).
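To make the shortcut concrete, here is a minimal sketch of "cache first" versus "look everywhere" resolution. This is a toy model and not Berkshelf's actual code: the method names (`best_match`, `resolve_cache_first`, `resolve_all_sources`) are made up for illustration, and RubyGems' `Gem::Requirement` stands in for Chef's own constraint handling because the `~>` operator follows the same rule.

```ruby
require 'rubygems'

# Highest version in `versions` that satisfies `constraint`, or nil if none does.
def best_match(versions, constraint)
  requirement = Gem::Requirement.new(constraint)
  match = versions.map { |v| Gem::Version.new(v) }
                  .select { |v| requirement.satisfied_by?(v) }
                  .max
  match ? match.to_s : nil
end

# Toy model of the shortcut: the remote is only consulted
# when nothing in the local cache satisfies the constraint.
def resolve_cache_first(constraint, cached:, remote:)
  best_match(cached, constraint) || best_match(remote, constraint)
end

# Toy model of the alternative: every known version is a candidate.
def resolve_all_sources(constraint, cached:, remote:)
  best_match(cached | remote, constraint)
end

cached = ['3.5.1']            # what sits in ~/.berkshelf/cookbooks
remote = ['3.5.1', '3.5.2']   # what the Chef server offers after our fix

puts resolve_cache_first('~> 3.4', cached: cached, remote: remote)
puts resolve_all_sources('~> 3.4', cached: cached, remote: remote)
```

The first call keeps handing back the cached version, which is exactly how the bug fix in thing2 goes unnoticed on a workstation that already has an acceptable copy.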

Berkshelf 3 addresses this particular issue, and this is one of the strong reasons for using it.

3rd Party Cookbook Versions Too Lax

Another scenario is when we introduce a cookbook that doesn't have a constraint on a particular dependency. This problem can happen in third-party cookbooks or our own, but the example I have is for a third party.

Imagine we're using the third-party nginx cookbook version 1.7.0 and it has the following in its metadata.rb (among other things):

%w{ build-essential yum apt runit }.each do |cb|
depends cb
end

This conveniently adds a list of dependencies for us, but none of them has version constraints. We'll come back to this in a bit.

Now we include the nginx cookbook in our own cookbook -- we do a good job of version constraining it and do everything right -- and our cookbook widget has the following metadata.rb (pared down for brevity):

name             'widget'
version          '1.0.0'
depends 'nginx', '~> 1.7'

This works fine when we first set it up and test it, but one day it begins to fail to converge in production. When we pull down the cookbook and run tests, it works just fine. The failure in prod is this:

Chef::Exceptions::RecipeNotFound
--------------------------------
could not find recipe epel for cookbook yum

Wait a second... I don't use yum in the widget cookbook, what's going on? What's going on is that the yum cookbook had a major refactor, bumped its major version (as is appropriate when there are backward-breaking changes), and we pulled in the new version. Let's break this down and take a look.

yum is being pulled in as a dependency of the nginx cookbook. We know from the above code in metadata.rb from the nginx cookbook that the version of yum is unconstrained, meaning we could pull in any version. Let's look at our versions of yum and see where this yum::epel recipe is defined and when it went away.

Version 2.4.4 of the yum cookbook is the last version where the yum::epel recipe appears. The cookbook was correctly incremented to 3.0.0 when that recipe was refactored out, a backward breaking change.

But why did this suddenly break? And why does it work ok in my test environment?

This is because our production Chef environment is not subject to the same constraints as our test environment. In testing, if we're using Berkshelf, we have a Berksfile.lock which pins very specific versions. It hasn't changed since the last time this worked, so it has the following versions:

"nginx": {
"locked_version": "1.7.0"
},
"yum": {
"locked_version": "2.4.4"
}

So when we run our tests, everything is constrained to these versions. However, when we converge our hosts in production they aren't subject to these constraints because we haven't put any constraints in the environment. When we look at our Chef server we see the following versions of these cookbooks available:

> knife cookbook show nginx
nginx   2.4.2  1.8.0  1.7.0  1.4.0
> knife cookbook show yum
yum   3.1.2  2.4.4  2.4.0  2.3.2  2.3.0  2.2.2

The trigger for this problem is simple enough: someone was working on a new cookbook that used version 2.4.2 of the nginx cookbook. That version of nginx uses version 3.1.2 of the yum cookbook. Because version 1.7.0 of the nginx cookbook has an unconstrained dependency on yum, when we converged our hosts in production they pulled in version 3.1.2 of yum (the most recent version). Unfortunately the nginx cookbook doesn't support the yum 3 cookbook because it assumes the yum::epel recipe exists and this causes our converge to fail.

This is just one real-world example of how unconstrained cookbook dependencies can bite you even if things work fine when you originally create your cookbooks. We'll talk about the solution to this in a minute.
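The selection rule at work here can be sketched in a few lines: among the versions on the server, take the newest one that satisfies every constraint in play. This is a simplification of what the Chef server does, `newest_allowed` is a hypothetical helper, and `Gem::Requirement` is used as a stand-in for Chef's constraint handling. The version list is the yum list from the knife output above.

```ruby
require 'rubygems'

# Versions of yum on our Chef server, from `knife cookbook show yum`.
YUM_ON_SERVER = %w[3.1.2 2.4.4 2.4.0 2.3.2 2.3.0 2.2.2].freeze

# Newest available version that satisfies every constraint, or nil.
# An empty constraint list means "anything goes".
def newest_allowed(available, constraints)
  requirements = constraints.map { |c| Gem::Requirement.new(c) }
  allowed = available.map { |v| Gem::Version.new(v) }
                     .select { |v| requirements.all? { |r| r.satisfied_by?(v) } }
  best = allowed.max
  best ? best.to_s : nil
end

# nginx 1.7.0 says `depends 'yum'` with no constraint at all.
puts newest_allowed(YUM_ON_SERVER, [])

# Had nginx said `depends 'yum', '~> 2.0'`, or had the environment pinned yum:
puts newest_allowed(YUM_ON_SERVER, ['~> 2.0'])
puts newest_allowed(YUM_ON_SERVER, ['= 2.4.4'])
```

With no constraint the newest upload wins, so anyone uploading yum 3.x for an unrelated cookbook changes what our nodes converge with.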

Unconstrained Versions as a Boat Anchor

Similar to the above scenario we can have the opposite problem -- I say "opposite" because instead of pulling in a newer version of a cookbook, our lack of version constraints actually prevents us from using a newer version. Sound strange? Read on.

For this example we start with 2 cookbooks: my_yum & base. Here's the metadata for each (pared down for brevity):

my_yum:

name             'my_yum'
version          '1.0.0'
depends 'yum' # No version constraint!

base:

name             'base'
version          '1.0.0'
depends 'my_yum', '~> 1.0'

We use my_yum as a wrapper cookbook for yum and it adds some of our own preferences around yum and such. It works great with the latest version 2.x of yum.

Version 3.x of the yum cookbook is released and it causes us some problems because our unconstrained my_yum cookbook wants to use that same yum::epel recipe that got removed in 3.0.0, so we refactor it and bump the version of my_yum to 2.0.0:

name             'my_yum'
version          '2.0.0'
depends 'yum', '~> 3.0'

We also have to bump base to use the newer version of my_yum, so we do that:

name             'base'
version          '1.1.0'
depends 'my_yum', '~> 2.0'

Sweet! It all works great ... We then introduce a third-party app cookbook and want our node to run both base and this new app cookbook - here's the metadata for app:

name             'app'
version          '1.0.0'
depends 'yum', '~> 2.0'

(The conflict should be obvious here.)

We test our app cookbook and it works great in testing. We push it up to the server and add it into a runlist that looks like this:

run_list:
- recipe[base]
- recipe[app]

When we converge our node it even seems to work fine - but we notice that some new settings we put into base aren't being applied to the node. In digging into this further we see that actually the node is downloading version 1.0.0 of base - but why?

The reason is that the only combination of compatible versions Chef can find is the older version of base, along with the older version of my_yum, which doesn't have a version constraint that conflicts with the app cookbook. Tested in isolation, base and app each do the right thing. Put together, they have conflicting version constraints and Chef quietly does something unexpected: it falls back to the older, unconstrained version of my_yum, and with it an older version of base.

The solution here is to run these together in test and expose those dependency conflicts as a converge failure during testing. With unconstrained dependencies we won't get a converge failure from the version conflict itself; we are relying on some incompatibility between the resolved versions to expose the problem, such as our base cookbook not doing what we expected. This isn't good, and doesn't give us confidence.
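The fallback is easier to believe once you watch a solver do it. Below is a toy backtracking search over the cookbooks from this example. It is not Chef's actual solver, only a sketch of the same idea: prefer newer versions, and back off until some combination satisfies everybody. `UNIVERSE` and `solve` are hypothetical names, the refactored base is labeled 1.1.0 purely as an assumption for illustration, and the yum versions are limited to two of those on our server.

```ruby
require 'rubygems'

# cookbook name => { version => { dependency name => constraint } }
UNIVERSE = {
  'base' => {
    '1.0.0' => { 'my_yum' => '~> 1.0' },
    '1.1.0' => { 'my_yum' => '~> 2.0' }   # assumed version of the refactored base
  },
  'my_yum' => {
    '1.0.0' => { 'yum' => '>= 0' },       # the unconstrained dependency
    '2.0.0' => { 'yum' => '~> 3.0' }
  },
  'yum' => { '2.4.4' => {}, '3.1.2' => {} },
  'app' => { '1.0.0' => { 'yum' => '~> 2.0' } }
}.freeze

# pending: list of [name, constraint] pairs still to satisfy.
# chosen:  name => version picked so far.
# Returns a complete name => version hash, or nil if nothing fits.
def solve(pending, chosen, universe)
  return chosen if pending.empty?

  (name, constraint), *rest = pending
  requirement = Gem::Requirement.new(constraint)

  # Already picked a version: it must satisfy this constraint too.
  if chosen.key?(name)
    ok = requirement.satisfied_by?(Gem::Version.new(chosen[name]))
    return ok ? solve(rest, chosen, universe) : nil
  end

  # Try candidates newest first, backtracking when a choice dead-ends.
  newest_first = universe.fetch(name).keys.sort_by { |v| Gem::Version.new(v) }.reverse
  newest_first.each do |version|
    next unless requirement.satisfied_by?(Gem::Version.new(version))

    deps = universe[name][version].to_a
    result = solve(rest + deps, chosen.merge(name => version), universe)
    return result if result
  end
  nil
end

p solve([['base', '>= 0']], {}, UNIVERSE)
p solve([['base', '>= 0'], ['app', '>= 0']], {}, UNIVERSE)
```

On its own, base resolves to the newest version. Add app to the run list and the only combination left is the old base with the old my_yum, and no error is raised anywhere. If my_yum 1.0.0 had carried a `'~> 2.0'` constraint on yum instead of none, the picture would at least have been explicit in the metadata.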

Namespacing Problems

Chef cookbooks are not namespaced. What that means is that if I publish a cookbook named rbenv and someone else publishes a cookbook named rbenv, there is no way to distinguish the two when they are uploaded to the Chef server. If you are using a tool like Berkshelf you can certainly point to different GitHub repositories to say which of these you would like to use for a given cookbook, but once they get published to your Chef server there is no difference. The only hope is that the versions are sufficiently different to avoid using the wrong one through version constraints -- a really poor protection mechanism in this case.

This is made worse by the fact that the Chef community site is not namespaced, so if I just put something like this in my metadata:

depends 'rbenv', '~> 1.0'

I'm basically saying "get the rbenv cookbook from wherever you can", which typically means the Chef community site. As with Highlanders, there can be only one rbenv cookbook on the community site. So if I went out to GitHub and found fnichol's rbenv cookbook and decided I'd put the above in my metadata to use it, I'm in for a bad time, because the cookbook on the community site is actually the RiotGames rbenv cookbook. They are very different.

Here's a list of the rbenv cookbooks we have:

> knife cookbook show rbenv
rbenv   1.7.1  0.7.3

Thankfully they are major versions apart because if you look more closely:

> knife cookbook show rbenv 1.7.1 recipes | grep name
name:        default.rb
name:        ohai_plugin.rb
name:        ruby_build.rb
name:        rbenv_vars.rb
> knife cookbook show rbenv 0.7.3 recipes | grep name
name:        user.rb
name:        default.rb
name:        system_install.rb
name:        vagrant.rb
name:        user_install.rb
name:        system.rb

The only recipe they have in common is default.rb and I guarantee you it does very different things in each cookbook.

What happens when the 2nd rbenv cookbook (fnichol's) needs to perform a major refactor? Jump to version 2.0.0? 10.0.0?

For this reason we have chosen to namespace the cookbooks we publish (see below for our standard naming convention for cookbooks). Implement your own namespacing -- it doesn't take much and will save you all kinds of headache down the road when someone else publishes the whatever cookbook.

Unconstrained Environments

All of these examples lead to the final pothole, the one with gigantic sinkhole proportions: unconstrained environments. This is where testing and production diverge, because your test environment is subject to the constraints of your Berksfile.lock but your production environment is not. I hope by now it's clear that extending the constraints you apply in testing to your production environment is the only way to have confidence in the behavior of things in production.

You could also argue that tightly controlling what is uploaded to the Chef server is another way to do this; but we don't want to be the Chef server police and we don't want to be the only team contributing to Chef to manage infrastructure. We want a repeatable process that provides high confidence that things work the same in test as they do in production.

### Protecting Yourself

You've read this far, so you probably want to know: how do we protect ourselves? The best we have today is the Environment Cookbook Pattern and Berkshelf 3. There are probably other strategies for solving the above issues, but this is the one we are most familiar with.

You can read the Environment Cookbook Pattern post to learn more, but the basic idea is that you use Berkshelf (berks apply specifically) to apply the same version constraints to your Chef environment that you have applied to your testing environment using the Berksfile.lock. Using this pattern you can have high confidence that every cookbook in your expanded run_list will match your test environment [*].
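Conceptually, the step looks something like the sketch below: every locked version becomes an exact pin in the environment's cookbook_versions. This is only an illustration of the idea and not Berkshelf's implementation. `environment_pins` is a hypothetical helper, the environment name is made up, and the input reuses the simplified lock fragment shown earlier rather than a real Berksfile.lock.

```ruby
require 'json'

# Turn locked versions into exact-match constraints for a Chef environment.
def environment_pins(locked)
  locked.each_with_object({}) do |(name, info), pins|
    pins[name] = "= #{info.fetch('locked_version')}"
  end
end

# Simplified lock data, in the shape of the fragments shown earlier.
LOCKED = JSON.parse(<<~LOCK)
  {
    "nginx": { "locked_version": "1.7.0" },
    "yum":   { "locked_version": "2.4.4" }
  }
LOCK

# 'production' is an example environment name.
environment = {
  'name' => 'production',
  'cookbook_versions' => environment_pins(LOCKED)
}

puts JSON.pretty_generate(environment)
```

Once the environment carries these pins, an unrelated upload of yum 3.x can sit on the server without our production nodes ever selecting it.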

For namespacing we've chosen to adopt a convention for all cookbooks, regardless of an existing conflict (rs = Rally Software):

• rs_cookbook : This is a library or application cookbook, the building blocks of our systems.
• rsw_cookbook : This is a wrapper cookbook, intended to be used in a way similar to a rs_cookbook but modifies the behavior of a 3rd party cookbook.
• rse_cookbook : This is an environment cookbook. This should never be a dependency of another cookbook and has a special CI workflow which performs a berks apply as the cookbook is promoted through environments.

This makes it highly unlikely that any cookbook we author (whether public or private) will conflict with existing public cookbooks. I wish more cookbook authors would do the same.
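A convention like this is also easy to check mechanically. The following is a small sketch of what such a check could look like. It is not our actual CI code; `cookbook_kind`, `environment_cookbook_dependencies`, and the cookbook names in the usage lines are all hypothetical.

```ruby
# Prefix => kind of cookbook, following the convention above.
PREFIX_KINDS = {
  'rse_' => :environment,
  'rsw_' => :wrapper,
  'rs_'  => :library_or_application
}.freeze

# Classify a cookbook by its name; anything without one of our
# prefixes is assumed to be a third-party cookbook.
def cookbook_kind(name)
  prefix = PREFIX_KINDS.keys.find { |p| name.start_with?(p) }
  prefix ? PREFIX_KINDS[prefix] : :third_party
end

# Environment cookbooks should never be a dependency of another
# cookbook, so return any that show up in a dependency list.
def environment_cookbook_dependencies(dependencies)
  dependencies.select { |dep| cookbook_kind(dep) == :environment }
end

puts cookbook_kind('rsw_nginx')
puts cookbook_kind('nginx')
p environment_cookbook_dependencies(%w[rs_widget rsw_nginx rse_production yum])
```

A non-empty result from the last call would be a reason to fail the build before the cookbook ever reaches the Chef server.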

The remaining issue is version constraints outside of environment cookbooks. In this case we've tried to follow a policy of the "most flexible pessimistic constraint" possible. We generally try to constrain cookbooks to '~> 1.0' for example. If we get burned by minor upgrades we can restrict this further, such as saying '~> 1.1.0' to only accept patch increments. The risk is that the more specific we get with our constraints, the less flexible we are across all of our cookbooks. For example if we constrain java to '~> 1.20' in a widely used cookbook, no other java app can pull in any java cookbook < 1.20. If you need to restrict the version you can do so in the environment cookbook & we prefer to enforce specificity there.
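If the pessimistic operator still feels fuzzy, it helps to poke at it directly. The sketch below uses RubyGems' `Gem::Requirement` as a stand-in for Chef's constraint handling, since `~>` follows the same rule in both; `allowed?` is a hypothetical helper and the version numbers are just examples.

```ruby
require 'rubygems'

# Does `version` satisfy `constraint`?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

checks = {
  '~> 1.0'   => %w[1.0.0 1.9.3 2.0.0],    # any 1.x
  '~> 1.1.0' => %w[1.1.0 1.1.7 1.2.0],    # patch increments of 1.1 only
  '~> 1.20'  => %w[1.19.9 1.20.0 1.25.1]  # 1.20 and later, still below 2.0
}

checks.each do |constraint, versions|
  versions.each do |version|
    verdict = allowed?(constraint, version) ? 'allowed' : 'rejected'
    puts format('%-10s %-8s %s', constraint, version, verdict)
  end
end
```

The last group is the java example from above: once a widely used cookbook says `'~> 1.20'`, anything that needs an older java cookbook can no longer share a run list with it.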

### Notes

[*]: This isn't true if you upload forks of cookbooks to your Chef server where the version on your Chef server and the version on GitHub or the community site differ. Be careful with that! Uploading a modified fork means there are two copies of the same cookbook version that behave differently -- this almost always leads to unexpected results and unhappy endings.