As your DevOps team grows, scaling efficiencies across the group is imperative to maintaining a well-oiled unit. A small team of smart engineers can function well without much effort, but as your team gets bigger, you need to make sure you invest in the right tools and practices to help keep everyone on the same page. Throw in distributed teams across different time zones, and these issues are only exacerbated.
Below are a few key tips that you should consider as your development and operations teams grow:
1. Stop SSH-ing: If you are a team of one or two, and/or you are managing a small number of instances, it’s not unusual to have to SSH into a particular server to diagnose an issue or to perform some configuration tasks. However, with a team of ten or more this type of behavior can become unmanageable, because it is hard to keep track of who is doing what. In particular, when diagnosing issues you generally do not want your Dev team logging into individual servers to “take a look.” From a security perspective you likely don’t want to hand out the keys to the castle to the entire team if you don’t need to. On top of this, you don’t want a new junior developer logging onto a production system, where accidents can and certainly do happen. Furthermore, as your number of server instances increases to tens or hundreds, this practice simply becomes unrealistic. Instead you should give SSH access only to a few key team members, and use tools like New Relic, Logentries, and ScriptRock to give you a single, centralized view of what is happening across your environment. One of the primary reasons DevOps teams are adopting log management solutions so rapidly, for example, is the demand for a centralized place to access all of their logs without needing to log into individual instances.
2. Take an Application Centric View: Servers come and go, but applications last forever – or at least for a lot longer than your typical server instance 🙂 It thus makes sense to think of your systems in terms of applications rather than in terms of server instances. In practice this often means grouping server-related resources into applications (e.g. production-search-app, staging-search-app). For example, we find Logentries users regularly group logs from multiple servers into an aggregate view that relates to a given application: they route all access logs from production-app1 into a single log in the Logentries dashboard and all DB-error logs from production-app2 into another log. The efficiencies come when you have a lot of server instances and you need to analyze one of your apps. If you take an application centric approach, you don’t need to figure out which server had an issue before kicking off your investigation. Instead you can look at the application as a whole and begin your analysis from there.
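To make the grouping idea concrete, here is a minimal Python sketch. It is not how Logentries implements log grouping; the hostnames, the `SERVER_TO_APP` mapping and the `group_by_application` helper are all made up for illustration, and only the application names come from the example above.

```python
from collections import defaultdict

# Hypothetical mapping from server hostname to the application it serves.
SERVER_TO_APP = {
    "web-01": "production-search-app",
    "web-02": "production-search-app",
    "stage-01": "staging-search-app",
}

def group_by_application(events):
    """Group (hostname, message) pairs under the application each host serves."""
    grouped = defaultdict(list)
    for hostname, message in events:
        # Hosts we do not know about are collected under a catch-all name.
        app = SERVER_TO_APP.get(hostname, "unassigned")
        grouped[app].append(f"{hostname}: {message}")
    return dict(grouped)

# Illustrative log events, not real data.
events = [
    ("web-01", "GET /search 200"),
    ("stage-01", "GET /search 500"),
    ("web-02", "GET /search 200"),
]

for app, lines in group_by_application(events).items():
    print(app, lines)
```

With a view like this, an investigation starts from "what is production-search-app doing?" rather than from a guess about which server to look at first.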
3. Team-based alerting: Firstly, make sure the tools you are using give you alerts in real time. A lot of vendors will claim ‘real time’ capabilities but, in reality, will deliver alerts after several minutes rather than within seconds. Beware in particular of logging tools that run alerts as background jobs.
When an issue does occur – make sure the right individual gets notified immediately. Sending alerts to everyone all the time can result in alerts being treated as noise. Instead be disciplined with your alerts and use tools like PagerDuty to make sure your alerts are being routed to the right person at the right time.
Sharing alerts via internal communication tools like HipChat or Campfire can also be a good way to make your team aware of what is going on in your system – as well as providing a platform for communication about the particular issue that has occurred. This is particularly useful if you are not all sitting in the same office and are dotted at different locations across the globe.
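As a rough illustration of disciplined routing, the Python sketch below pages exactly one on-call person for critical alerts and posts everything else to a shared chat channel. It does not use the PagerDuty, HipChat or Campfire APIs; the rota, the recipient names and the `route_alert` function are assumptions made up for this example.

```python
# Hypothetical on-call rota: (start_hour_utc, end_hour_utc, engineer).
ROTA = [
    (8, 16, "engineer-dublin"),
    (16, 24, "engineer-boston"),
]
FALLBACK = "ops-lead"          # hypothetical escalation contact outside rota hours
CHAT_CHANNEL = "team-chat"     # hypothetical shared chat room

def on_call(hour_utc):
    """Return whoever is on call at the given UTC hour (0-23)."""
    for start, end, engineer in ROTA:
        if start <= hour_utc < end:
            return engineer
    return FALLBACK

def route_alert(severity, hour_utc):
    """Page one person for critical alerts; share everything else in chat."""
    if severity == "critical":
        return [on_call(hour_utc)]
    return [CHAT_CHANNEL]

print(route_alert("critical", 9))
print(route_alert("warning", 9))
```

The point of the design is that nobody receives every alert: a critical issue interrupts one named person, while lower-severity events stay visible to the whole team without paging anyone.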
4. Share Team intelligence: As your team grows, you tend to gain more and more in-house know-how and you build up knowledge about the systems that you all work within. You’ll often have specialists in the team who become experts on a given component or technology. Spreading this knowledge across your team is often a challenge, especially if you are growing quickly. A recent innovation we came up with at Logentries was to leave ‘post-it notes’ for other team members so that we could share information that we had picked up over time. These were ‘post-it notes’ with a difference, however – virtual post-it notes, in fact, that you could stick on to your log events 🙂 We’ve called them Team Annotations, and they came from a simple concept after working with a number of large distributed Dev teams. Consider this scenario:
- On Monday morning, Brendan, a developer based in Dublin sees some strange activity in his system. He spends 6 hours investigating the issue and identifies a resolution, puts a temporary fix in place (i.e. restarts the server) and writes a ticket for the bug to be fixed. Job done!
- Josh, a developer on Brendan’s team but based in Boston, arrives at the office on Monday morning (+ 5 hours), and sees the same issue that Brendan saw. He spends another 6 hours investigating the issue, comes to the same conclusion as Brendan, puts a fix in place and writes a ticket. Job done again…
Now if Brendan had simply been able to leave a log annotation for Josh to explain that (1) he was working on the issue, and (2) he had found a fix, Josh could have spent his efforts elsewhere. With the guys being a few thousand miles apart, using a real post-it note wouldn’t be possible. Enter Team Annotations. Annotations allow you to attach the equivalent of post-it notes to exceptions, errors or anything you deem important in your systems. You can in fact leave multiple post-it notes so that you build up conversations about particular issues, and you can then assign these to team members and mark them as in progress, pending, complete, etc. Since we released Team Annotations we have had many customers contact us about the efficiencies they are driving through their teams. These efficiencies seem to be amplified in larger teams, especially those dealing with different time zones.
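For readers who think in data structures, the concept can be sketched in a few lines of Python. This is only an illustration of the idea, not the Team Annotations implementation or API; the `LogEvent` and `Annotation` classes and the sample error message are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    author: str
    note: str
    status: str = "in progress"   # e.g. in progress, pending, complete

@dataclass
class LogEvent:
    message: str
    annotations: list = field(default_factory=list)

    def annotate(self, author, note, status="in progress"):
        """Stick another virtual post-it note on this event."""
        self.annotations.append(Annotation(author, note, status))

# Hypothetical log event standing in for the "strange activity" Brendan saw.
event = LogEvent("Unexpected error in search worker")
event.annotate("Brendan", "Investigated; restart is a temporary fix, ticket filed.")
event.annotate("Josh", "Seen again from Boston; nothing more to do until the fix ships.", "pending")

for a in event.annotations:
    print(f"[{a.status}] {a.author}: {a.note}")
```

Because the notes live on the event itself, Josh sees Brendan's conclusion the moment he looks at the same error, five hours later and an ocean away.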
5. Focus on KPIs: Last but not least, keeping your team focused on what is important is critical to having a team pulling in the same direction. It is important to establish a small number of key metrics that relate to the lifeblood of your system. How these numbers are determined, and what they are, naturally differ from system to system and from business to business, but from a DevOps perspective they generally relate to the performance of your system, what current system load looks like and whether you are in a good or bad place in terms of uptime.
Once these key performance indicators are determined, you want to make sure the team always has access to these whenever they need. My advice on this point is to go out and buy the biggest flat screen your budget allows and nail it to a wall in your office for all to see (DevOps, sales, marketing … everyone) and use a dashboard like Geckoboard to display the numbers that are important to you and your team. There’s nothing quite like the possibility of ‘airing your dirty laundry in public’ to keep the DevOps team focused on the job.
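As one small example of a KPI worth putting on that screen, uptime can be expressed as the percentage of a period during which the system was available. The Python sketch below shows the arithmetic only; the `uptime_percent` function and the sample numbers are illustrative assumptions, not part of Geckoboard or any other tool mentioned here.

```python
def uptime_percent(total_seconds, downtime_seconds):
    """Percentage of the period during which the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds

# Illustrative numbers: one week with 30 minutes of downtime.
week = 7 * 24 * 3600
downtime = 30 * 60
print(f"Uptime this week: {uptime_percent(week, downtime):.3f}%")
```

Whatever metrics you choose, keeping the calculation this simple makes it easy for everyone looking at the dashboard to understand what the number means.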