We have started treating our monitoring dashboards like code, deploying them automatically alongside the services they monitor. Grafana’s REST API has made this automation possible, and we are noticing many benefits.
We have previously blogged about our transition to a microservices architecture and the monitoring required to make it work. Manually configuring and updating our monitoring dashboards, keeping them in sync across our various environments, and defining widgets for those dashboards all sat outside our natural workflow. None of this was ideal: neither the time spent configuring new dashboards nor the time spent having to think about new dashboards and their setup. We monitor in each of our environments, although we only alert on Live, and keeping the dashboards in sync carried significant overhead. Monitoring, while giving us insight, had become a burden.
Our culture has shifted to a DevOps environment. Developers are more involved with operations, and we have introduced infrastructure as code. Seeing the benefits this gives us, and becoming more interested in monitoring, we decided to apply the same concepts to remove some of the burdens described above. Our interest in monitoring has grown with our migration to a microservices architecture: there is a monitoring bill with microservices that needs to be paid again and again and again. Our unit of deployment for our services is a Docker container. The exact same Docker container is used throughout the deployment pipeline, and these deployments are all automated. We like this and wanted to do something similar with our monitoring.
We have started to move our dashboards along the deployment pipeline in step with the services they monitor. Each project has a monitoring folder checked into Git alongside the source code. This folder contains the dashboard definition; we provide a default one that new projects can use, covering the basic requirements. We no longer have to think about this: it is just another step in the creation of a new project, and the dashboards get promoted through environments along with the services being monitored. It has freed us up to think past the basic monitoring requirements and ask other questions. What is meaningful to monitor from a business perspective? How do we shorten the feedback cycle on a new feature to see whether it is a success or a Google+?
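As an illustration, a default definition checked into the monitoring folder might look something like the pared-down sketch below. The field layout follows Grafana's dashboard JSON (rows of panels, each with Graphite targets); the titles and metric path here are placeholders, not our actual defaults.

```json
{
  "title": "service-name",
  "rows": [
    {
      "panels": [
        {
          "type": "graph",
          "title": "Requests per second",
          "targets": [
            { "target": "services.service-name.requests.m1_rate" }
          ]
        }
      ]
    }
  ]
}
```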
Our monitoring infrastructure
Each of our services currently uses Dropwizard Metrics to expose metrics, which are pushed to Graphite for storage. Grafana is used to organise and display our dashboards. We really like Grafana. At a basic level it makes a big difference having pretty-looking widgets and graphs.
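Under the hood, what a Graphite reporter like Dropwizard's ultimately sends is Graphite's plaintext protocol: one `path value timestamp` line per metric, over TCP to the line receiver (port 2003 by default). A minimal sketch in Python, where the metric names and host are placeholders rather than our real ones:

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def push_metrics(lines, host="graphite.example.com", port=2003):
    """Send pre-formatted plaintext lines to Graphite's line receiver (hypothetical host)."""
    sock = socket.create_connection((host, port))
    try:
        sock.sendall("".join(lines).encode("ascii"))
    finally:
        sock.close()


# Example: Dropwizard-style dotted metric names for a counter and a timer mean.
lines = [
    graphite_line("services.checkout.requests.count", 1234, 1450000000),
    graphite_line("services.checkout.requests.mean_ms", 17.4, 1450000000),
]
```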
Our monitoring UX is stepping out of the year 2000 into the modern day! The key feature, however, is Grafana's REST API for creating, deleting and updating dashboards via a JSON definition. This is what allows us to treat our monitoring as code. The support for multiple backends, and the new ability (since Grafana 2.5) to mix datasources in one graph, is opening up a lot of possibilities for consolidating our logs and metrics.
Our dashboards are defined as JSON, and we use Grafana's REST API to push them through our deployment pipeline. The Grafana documentation has some simple examples of how to define dashboards and how the API works, so I won't rehash them here, but it is easy to see how the API could be used in your deployment process to keep your dashboards in sync. We use Bamboo, and as part of each environment's deployment we add a task which creates the dashboards.
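Such a deployment task can be small. The sketch below uses only the standard library; the endpoint and envelope follow Grafana's dashboard API (`POST /api/dashboards/db` with a `{"dashboard": ..., "overwrite": true}` body), while the base URL, API key and `monitoring/*.json` file layout are assumptions for illustration:

```python
import json
import urllib.request


def build_payload(dashboard, overwrite=True):
    """Wrap a dashboard definition in the envelope Grafana's API expects.

    overwrite=True makes the call idempotent: re-deploying replaces the
    existing dashboard rather than failing.
    """
    return {"dashboard": dashboard, "overwrite": overwrite}


def push_dashboard(dashboard, base_url, api_key):
    """POST one dashboard definition to Grafana's dashboard endpoint."""
    body = json.dumps(build_payload(dashboard)).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/dashboards/db",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Deploy-time usage (hypothetical URL and key, supplied by the build server):
#   import glob
#   for path in glob.glob("monitoring/*.json"):
#       with open(path) as f:
#           push_dashboard(json.load(f), "https://grafana.example.com", api_key)
```

Because the payload sets `overwrite`, running the task on every deployment keeps each environment's dashboards matched to the service version just deployed.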
We typically start work on an Epic. During the initial phase of boiling an Epic down into deliverable Stories, product managers and UX get a chance to specify monitoring metrics that are interesting to them. These are included in Stories right next to the acceptance criteria and are estimated along with the rest of the Story when it bubbles to the top of the backlog. During development we can spin up a local monitoring environment via Docker, both to create the new dashboards and to see how our service behaves locally. Once the feature starts moving through our environments towards production, the dashboards are deployed into each environment's monitoring as part of the code deployment, all of which is automated.
Benefits we are noticing
Having defaults that are automated and part of the development project means we no longer spend lots of time on the basics; it has freed us to think about other requirements. Our environments are in sync, and our dashboards match the version of the software that is deployed. It feels like monitoring has become part of what we do rather than something we bolt on afterwards.