Log Analysis for Orchestration Change Management

Are you suffering from server sprawl? You might be and not even know it. Server sprawl occurs when an unknown number of rogue VMs accumulate: VMs with unknown workloads, owners, or purposes, which no one is fearless enough to delete or suspend. Orchestration tools make creating new nodes so easy that almost every organization that uses them suffers from server sprawl, at a high price and with a lot of confusion.

Server sprawl can be managed, especially if the effort comes from the technical team rather than from finance wondering why the cost of cloud services is rising at an unexpected and exponential rate.

The Top Reasons Why Server Sprawl is Bad:

1. It costs a lot, and the cost can be hard to quantify. I do not need to tell you how expensive unused but still running machines on AWS or other cloud platforms can be. Most cloud providers do not produce a bill that would point to the problem; only a good chargeback model can help, and most organizations do not have one in place.
2. It is hard to know which version of which orchestration script was used to create a machine. Versioning of the scripts themselves is common (and great), but that versioning does not go beyond the source repository, and most organizations do not know which machine was provisioned from which version of which script.
3. It can be hard to know which version of which orchestration script is associated with a particular build. Just as with scripts and machines, no one knows which script versions belong to each application version. What happens when a component running on the server, such as Composer for PHP, needs to be updated for a new release to work? Only the latest orchestration script has been configured to download and install the latest version. So how do you know which VMs are OK and which are not? In this case the best indicator is going to be a failed test. Realistically, most environments will have a many-to-one association, so it is even more complex.

All of this adds up to a concept that is not new. I am basically talking about change management, a dreaded term associated with red tape. But in modern development, the same tools used to automate the delivery chain can be used to manage it. One tool that helps is a log analysis service, which you can use to collect and analyze logs from your orchestration server, log application versions at each stage of the pipeline, and log machine identifiers.

Together, these logs give you what you need to correlate data streams and build dashboards to analyze your orchestration process. They also let you use alerting, anomaly detection, and inactivity monitoring.
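To make inactivity monitoring concrete, here is a minimal sketch of the idea, not any particular platform's feature. It assumes you can export log events as (machine ID, timestamp) pairs; the function name, the sample machine IDs, and the seven-day threshold are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def find_inactive_machines(events, now, max_silence=timedelta(days=7)):
    """Return IDs of machines whose newest log event is older than max_silence.

    events: iterable of (machine_id, timestamp) pairs, for example parsed
    from an export of your log analysis platform (an assumption here).
    """
    last_seen = {}
    for machine_id, timestamp in events:
        # Keep only the newest timestamp seen for each machine.
        if machine_id not in last_seen or timestamp > last_seen[machine_id]:
            last_seen[machine_id] = timestamp
    return sorted(m for m, ts in last_seen.items() if now - ts > max_silence)


# Hypothetical sample data standing in for real agent logs.
events = [
    ("vm-001", datetime(2024, 1, 1, 12, 0)),
    ("vm-002", datetime(2024, 1, 9, 8, 30)),
    ("vm-001", datetime(2024, 1, 2, 9, 0)),
]
now = datetime(2024, 1, 10, 0, 0)

print(find_inactive_machines(events, now))  # prints ['vm-001']
```

A machine that has been silent for longer than the threshold is not necessarily a rogue VM, but it is a good candidate for a closer look before anyone pays for it for another month.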

The best part is this can be easy to implement:

1. Make sure a logging agent is part of your orchestration script. With Chef, for example, a recipe can download, install, and then link the logging agent to the log analysis platform for every machine, on each provisioning.

2. Log your script runs: name, date, and version. The easiest way to do this is to store the orchestration server log files in your log analysis platform. You can set that up in just a few minutes: install your logging agent on your orchestration server and point it at the log files, such as the Chef log files in /var/log/chef-server.
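If the orchestration server logs do not already carry the name, date, and version in a form that is easy to search, one option is to emit a small structured record on every run. The sketch below shows one possible shape as a single JSON line; the field names, the `provisioning_record` helper, and the sample values are assumptions for illustration, not a format that Chef or any log platform requires.

```python
import json
from datetime import datetime, timezone


def provisioning_record(machine_id, script_name, script_version, run_at=None):
    """Build one JSON log line describing a single orchestration script run."""
    # Default to the current UTC time when the caller does not supply one.
    run_at = run_at or datetime.now(timezone.utc)
    return json.dumps(
        {
            "event": "provisioning_run",       # hypothetical event label
            "machine_id": machine_id,          # which machine was provisioned
            "script_name": script_name,        # which script ran
            "script_version": script_version,  # e.g. a tag or commit from the repo
            "run_at": run_at.isoformat(),      # when it ran, in ISO 8601
        },
        sort_keys=True,
    )


# Hypothetical values; print the line so it can be redirected to a file
# that the logging agent follows.
print(provisioning_record("vm-001", "webserver", "1.4.2"))
```

Because each run becomes one self-describing line, the log analysis platform can filter or group by machine, script, or version without custom parsing rules.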

The other thing that is important for change control is sharing with the team. Because orchestration is automated, and that automation is triggered by someone, the rest of the team can be left in the dark. All they usually know is the infrastructure is there.

Once orchestration logs are in your log analysis tool, no one needs to ask which scripts were run, when, and what is in them. At the very minimum you can track down the script version number and see for yourself.
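As a rough sketch of what tracking down the script version could look like, suppose each script run was logged as one JSON object per line. The fields `machine_id`, `script_name`, `script_version`, and `run_at`, the helper names, and the dotted numeric version scheme are all assumptions made for this example. The code finds the most recent script version applied to each machine and lists the machines that are behind a required version.

```python
import json


def latest_script_versions(log_lines):
    """Map each machine_id to the (script_name, script_version) of its newest run."""
    latest = {}
    for line in log_lines:
        record = json.loads(line)
        key = record["machine_id"]
        # ISO 8601 timestamps in one shared format and offset sort as plain strings.
        if key not in latest or record["run_at"] > latest[key]["run_at"]:
            latest[key] = record
    return {m: (r["script_name"], r["script_version"]) for m, r in latest.items()}


def version_tuple(version):
    """Turn a dotted numeric version such as '1.4.2' into (1, 4, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def machines_below(versions, script_name, minimum):
    """List machines whose newest run of script_name used a version below minimum."""
    return sorted(
        machine
        for machine, (name, version) in versions.items()
        if name == script_name and version_tuple(version) < version_tuple(minimum)
    )


# Hypothetical log lines standing in for a real export.
lines = [
    '{"machine_id": "vm-001", "script_name": "webserver", "script_version": "1.4.2", "run_at": "2024-01-05T10:00:00+00:00"}',
    '{"machine_id": "vm-002", "script_name": "webserver", "script_version": "1.3.0", "run_at": "2024-01-03T09:00:00+00:00"}',
    '{"machine_id": "vm-001", "script_name": "webserver", "script_version": "1.3.0", "run_at": "2024-01-01T08:00:00+00:00"}',
]

versions = latest_script_versions(lines)
print(machines_below(versions, "webserver", "1.4.0"))  # prints ['vm-002']
```

With a list like this, the question of which VMs are OK for a new release can be answered from the logs instead of waiting for a failed test.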

