Configuration Management Sucks

I’ve tried them all, and they all suck.

Ok, I’ve not really had a good go at Puppet, but it’s 2015, no one uses Puppet for anything new anymore! /s

I work for a medium-sized web hosting provider. The majority of the servers we deploy are configured and then handed over to clients, at which point our involvement is merely supporting the configuration rather than managing it forever. A lot of new business is still buying dedicated servers or VMs - I refuse to call them ‘cloud’ because they’re not what I would define as a cloud instance. They’re pets, not cattle. We do offer an OpenStack cloud and a VMware cloud, but people who are buying those generally know what they’re doing and fall outside the scope of this discussion.

Currently most of our physical Linux boxes are installed using Cobbler, with VMs built from images, then customisation applied on top, mostly using horrendous shell scripts, archaic Perl line noise or, in some complex cases, by hand (mainly big cluster setups and other stuff where you need eyes-on anyway). It’s all old and creaky, but damn, it works. Unfortunately it’s not keeping up with changing times, and building more automation into the system is becoming increasingly difficult, so a few colleagues and I have tasked ourselves with modernising it.

As well as client machines, we’ve also got infrastructure where an agent-based system would be fine. Our loadbalancers, anti-DDoS, WAFs, and properly fully managed clients wouldn’t have any problem running an agent all the time, so I’m not totally against using an agent, but what we really need is something that we can run remotely from a central location and have it execute bits on servers with as little faffing around as possible.

The biggest problem we have is that the big players - Puppet and Chef mainly - tend to revolve around having an agent installed on the host. Yeah, you’ve got Chef Solo and Puppet’s masterless puppet apply mode, but they feel like something tacked on as an afterthought, and you’ve still got to pull in huge dependencies to run the damn things (reduced with Chef’s Omnibus packages, I suppose).

Puppet’s DSL feels really strict and I’ve heard horror stories about the hoops people jump through to get complex stuff working right with Puppet. Chef’s DSL is basically pure Ruby, which is really expressive and quite fun to work with. I wrote a fairly complex recipe with Chef to install and configure one of our products and it was quite nice, but also rather mind-bending. I’m a Python and Go man; Ruby is too much like Perl, and my colleagues have an irrational hatred of Ruby, which makes it difficult to get buy-in.

So naturally, I turned to Ansible. Simple, YAML-driven, agentless, no bullshit. Except, there’s lots and lots of bullshit when you want to do anything other than launch stuff from the command line.

Don’t get me wrong, out of the bunch Ansible is probably the best. It’s dead easy to get up and running and start pushing machine configurations out, but if you want to do anything reasonably complex with it you’re going to start banging your head against brick walls until you wish you hadn’t sold your Puppet Cookbook without even opening the first page.

We deploy lots of servers every day. They get installed, they need configuring, then the details get sent to the clients. Job done, hands off. No one wants to be sat at a command line bashing out ansible-playbook -i hosts lb/main.yml all day long to provision these boxes. What we need is an API we can hit to start provisioning and retrieve results of runs so we can track where servers are in the deployment process. I want to go from bare metal to fully provisioned without any human interaction beyond clicking the ‘Buy’ button.
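Something like the sketch below is all I’m really after. Everything in it is hypothetical - the endpoint names, the playbook layout, and shelling out to ansible-playbook is just a placeholder for whatever mechanism actually does the work - but it shows the shape: POST to kick off a run, GET to fetch the result.

```python
# Hypothetical provisioning API sketch. The endpoints, playbook layout
# and the shell-out to ansible-playbook are all placeholders.
import subprocess
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
runs = {}  # run_id -> result; a real version would persist this somewhere


@app.route('/provision', methods=['POST'])
def provision():
    host = request.json['host']
    playbook = request.json.get('playbook', 'lb/main.yml')
    run_id = str(uuid.uuid4())
    # Synchronous for the sketch; really this wants a job queue so the
    # billing system isn't left holding an open HTTP connection.
    # The trailing comma makes ansible treat the host as an inline
    # inventory list rather than a file path.
    proc = subprocess.Popen(
        ['ansible-playbook', '-i', host + ',', playbook],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    output, _ = proc.communicate()
    runs[run_id] = {'rc': proc.returncode, 'output': output.decode('utf-8')}
    return jsonify({'run_id': run_id, 'rc': proc.returncode})


@app.route('/runs/<run_id>')
def run_result(run_id):
    return jsonify(runs.get(run_id, {'error': 'unknown run id'}))


if __name__ == '__main__':
    app.run()
```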

So, I wander off and start poking around the Ansible Python API docs, which, to summarise, boil down to ‘pay for Ansible Tower or read through the code’. Paying for software? What a novelty.

Ansible Tower is their web-based UI for managing Ansible, and it comes with a RESTful API. Having looked at Tower, it doesn’t seem to cover our use case. They charge based on the nodes you’re managing, but we don’t want to manage nodes - we just want to run a few playbooks against a host and then hand it off to the client. There’s nothing in their documentation about how that would work in practice. Could we conceivably use Tower for free by staying under their 10-node limit, registering and removing nodes as we deploy them? The thought has only just occurred to me, but even so, it wouldn’t feel honest to do that.

So I started poking around with the Python API and tried to munge it into our existing infrastructure, to give us a RESTful API to poke at and deploy playbooks with. It hasn’t gone well. Ansible seems designed to be run from the command line, judging by the way it uses callbacks for tasks, and getting data back in a sensible way seems nigh on impossible. If the API were better documented I might be more inclined to keep faffing around, but as it stands I have no interest in investing the amount of time it would take to go balls deep into the Ansible codebase and get this working sensibly.
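For anyone who wants to see what I mean, this is roughly the dance the current 1.x internals make you do to run a playbook from Python - a sketch pieced together from my experiments, so don’t treat it as gospel:

```python
# Roughly what driving Ansible 1.x from Python looks like. Note how
# much of it is wiring up callback objects whose main job is printing
# nicely to a terminal.
from ansible import callbacks, utils
from ansible.playbook import PlayBook

stats = callbacks.AggregateStats()
playbook_cb = callbacks.PlaybookCallbacks(verbose=utils.VERBOSITY)
runner_cb = callbacks.PlaybookRunnerCallbacks(stats, verbose=utils.VERBOSITY)

pb = PlayBook(
    playbook='lb/main.yml',   # the same playbook we'd run by hand
    host_list='hosts',
    stats=stats,
    callbacks=playbook_cb,
    runner_callbacks=runner_cb,
)

# run() gives back a dict of host -> counts (ok/changed/failures/...);
# anything richer than that means writing your own callback plugin.
results = pb.run()
print(results)
```

Everything interesting happens inside those callbacks, which is fine for a terminal and miserable for an API.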

I also refuse to shell out.

I’ll hold my hands up at this point and say that I’m not a great programmer. My role is a mixture of system administration and gluing APIs and other random stuff together with Python or Go to make magic happen and convince the people around me that I know what I’m doing.

So that leaves Salt. Salt comes with a RESTful API built in, and it feels very much ‘batteries and kitchen sink included’. So much promise, but in the end so much disappointment (for our use case). My issue with Salt is that salt-ssh is still pretty primitive, which is a deal breaker, and the documentation says it won’t work with passworded sudo either - an immediate killer for us at present.

Their API seems great if you’re using Salt ‘minions’, but I’ve yet to get it working with the SSH stuff. I believe they’ve got this planned for their next major release, and I tried to test it using their development branch, but couldn’t get it working. On the upside, they’ve had a bug report from me and I got to interact with some of their friendly developers in #salt on Freenode.
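To be fair, the API itself is pleasant enough when minions are involved. This is roughly what my testing looked like against salt-api’s rest_cherrypy endpoint (master URL and credentials invented, obviously) - it’s only when you swap the client over to the SSH transport that things fall apart:

```python
# Talking to salt-api (the rest_cherrypy netapi module). The master
# URL and credentials here are made up. 'client': 'local' targets
# normal minions and works fine; the SSH client is the missing piece.
import requests

BASE = 'https://salt-master.example.com:8000'  # hypothetical master

# Authenticate first; salt-api hands back a session token.
login = requests.post(BASE + '/login', json={
    'username': 'deploy',
    'password': 'secret',
    'eauth': 'pam',
}, verify=False)
token = login.json()['return'][0]['token']

# Fire a function at matching minions and get structured JSON back.
resp = requests.post(BASE + '/', json=[{
    'client': 'local',   # the SSH equivalent is what we actually need
    'tgt': 'web*',
    'fun': 'test.ping',
}], headers={'X-Auth-Token': token}, verify=False)
print(resp.json())
```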

So what does this leave us with? I have no answer for this, unfortunately. If you were expecting me to come to some amazing conclusion at the end of this incoherent ranting, you’re going to be sorely disappointed.

I suppose we could continue using shell scripts, update the Perl to Python and carry on as we are, bolting on as much automation as possible. Hell, I could start using Paramiko or Fabric, but that would feel like a step backwards.
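For illustration, a post-install step in Fabric would be short enough - the host, user and package below are all invented:

```python
# What one of our post-install steps might look like in Fabric (1.x).
# Host, user and package are invented for illustration.
from fabric.api import env, run, sudo

env.hosts = ['newbox.example.com']
env.user = 'deploy'


def baseline():
    """Apply a hypothetical baseline config to a freshly built box."""
    sudo('apt-get update && apt-get -y install nginx')
    run('uname -r')  # sanity check: are we on the kernel we expect?
```

Drop that in a fabfile.py and run fab baseline. It works, but it’s basically our existing shell scripts wearing a Python hat, hence the step-backwards feeling.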

After talking it through with some of my colleagues, we were half-joking about writing our own configuration management system in Go - one that mixes the best bits of all the others but actually fits the hosting provider use case. I’m really in love with Go at the minute, so I’d be totally up for diving in at the deep end and writing something amazing.

Watch this space.

Or don’t, it might never happen.