What is observability?
Observability is a set of DevOps practices that augment monitoring and make it possible to deploy higher-quality code more frequently. At its core, observability comes down to using logging, monitoring, and tracing to answer questions about your environments and applications. The analysis is often performed in real-time or near-real-time dashboards.
Where can I start?
The first thing you’ll need to decide is what you want to know about your applications and environments that you don’t know today, or can’t easily see at a glance.
From an introductory standpoint, you can think of this data in two ways: queryable logs, and metrics.
No matter what system you plan on using, you should standardize how all your systems create logs. Consider the following:
- Format: If you can, standardize on a common log format. Every application should log, at minimum, a specific set of fields such as level, timestamp, application, host, request ID, log code, message, and context.
- Log Codes: In addition to level and message, try to choose log codes that use some consistent format, for example ADDRESS_VALIDATION_FAILURE. This will make querying much easier, especially in localized applications.
- Context: This is something important that we see missing from a lot of ecosystems. Since the goal is that all applications will log in the same format, what about data that is application-specific? This is where context comes in. In addition to your other fields, there should be a section of your log that can be called data, context, metadata, or any name that works for you. The goal here is to collect key/value pairs. For example if your application is logging a payment failure, the context data could have a key of “orderNumber” and a value of “12345”. This adds context to the log entry that can make analysis more robust.
- Transaction Tracing: In a microservices world, this is incredibly important. There are many tools and formats that can aid in this such as New Relic, OpenTracing, and Zipkin, which add a header into every request. This header is read by the next application so you can follow one request ID from the very beginning of a browser click all the way through dozens or hundreds of services.
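To make the points above more concrete, here is a minimal sketch of a shared log format using only Python's standard library. The field names (`requestId`, `logCode`, `context`), the `X-Request-Id` header, and the helper names are assumptions chosen for illustration. Real tracing tools define their own headers and formats, so treat this as one possible shape rather than a standard.

```python
import json
import logging
import socket
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same set of fields."""

    def __init__(self, application):
        super().__init__()
        self.application = application
        self.host = socket.gethostname()

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "application": self.application,
            "host": self.host,
            # The next three fields arrive through logging's `extra` argument.
            "requestId": getattr(record, "request_id", None),
            "logCode": getattr(record, "log_code", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(application):
    """Create a logger that writes one JSON line per entry to stderr."""
    logger = logging.getLogger(application)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(application))
        logger.addHandler(handler)
    return logger


def incoming_request_id(headers):
    """Reuse the caller's request ID if present, otherwise start a new one.

    "X-Request-Id" is a hypothetical header name for this sketch.
    """
    return headers.get("X-Request-Id") or str(uuid.uuid4())


# Usage: a payment failure with application-specific context.
logger = build_logger("checkout")
request_id = incoming_request_id({"X-Request-Id": "abc-123"})
logger.error(
    "Payment was declined",
    extra={
        "request_id": request_id,
        "log_code": "PAYMENT_FAILURE",
        "context": {"orderNumber": "12345"},
    },
)
```

Because every application emits the same top-level fields, a query on `logCode` or `requestId` works across services, while anything application-specific stays inside `context`.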
As your logs grow, querying becomes more expensive, since some logging platforms charge per GB scanned. Not only that, scanning terabytes of data can be slow. Consider a question such as: How many HTTP 500 errors of type Exception::ABC occurred in the past 7 days? Answering it from raw logs means scanning every entry in that window.
This is where metrics can be very helpful. Behind the scenes, metrics are stored as time-series data, typically as numeric values recorded per time interval, which makes them quick to query and observable in near real-time. Queries that could take minutes or hours to parse through full logs can be answered in milliseconds or seconds from metrics. Additionally, metrics can include both system-level and application-level data. Over time, you could track aspects of your environment such as memory and CPU usage, or application-specific metrics such as number of logins, inventory levels, sales, etc. Not every metric belongs in the same place. Some may be better suited to a reporting platform than to a monitoring or metrics system, and it really comes down to what’s best for your business.
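As a rough illustration of why metrics are cheaper to query than logs, the sketch below keeps one counter per metric per minute instead of one record per event. `CounterStore`, the metric name, and the one-minute bucket size are all hypothetical choices for this example, and a real metrics system would index and persist its data rather than hold it in a dictionary.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumption: one data point per metric per minute


class CounterStore:
    """Toy in-memory time-series store for counters."""

    def __init__(self):
        # (metric name, bucket start time in epoch seconds) -> count
        self._buckets = defaultdict(int)

    def increment(self, name, timestamp, amount=1):
        """Add to the counter in the bucket that contains `timestamp`."""
        bucket = int(timestamp) - int(timestamp) % BUCKET_SECONDS
        self._buckets[(name, bucket)] += amount

    def total(self, name, start, end):
        """Sum a counter over buckets starting in the range [start, end)."""
        return sum(
            count
            for (metric, bucket), count in self._buckets.items()
            if metric == name and start <= bucket < end
        )


# Usage: count HTTP 500 errors of one type over the past 7 days.
store = CounterStore()
now = 1_700_000_000  # fixed epoch timestamp so the example is repeatable
day = 86_400

store.increment("http.500.ExceptionABC", now - 3_600)    # one hour ago
store.increment("http.500.ExceptionABC", now - 3 * day)  # three days ago
store.increment("http.500.ExceptionABC", now - 10 * day) # outside the window

print(store.total("http.500.ExceptionABC", now - 7 * day, now))  # prints 2
```

Seven days of one-minute buckets is at most 10,080 numbers per metric, no matter how many requests were served, which is why this kind of question can come back in milliseconds.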
What can it do for me?
Being able to see the lead-in to an event can buy enough time to mitigate or prepare for that event. For example, if you perform a deployment and see your error rate climbing for an error that could lead to lost sales, you could manually roll back your deployment. In a more advanced environment with feature flags, the flag could flip back to its previous state automatically, based on metric thresholds.
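The threshold-driven flag flip described above could look something like the following sketch. The function names, the 3% threshold, and the minimum request count are assumptions for illustration. In practice the error and request counts would come from your metrics platform, and the result would be written back to your feature-flag system.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed, or 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def evaluate_flag(flag_enabled, errors, requests, threshold=0.03, min_requests=100):
    """Return the state the feature flag should have after this check.

    The flag is switched off when the error rate exceeds `threshold`.
    Checks on fewer than `min_requests` requests leave the flag unchanged,
    so a handful of early failures does not trigger a rollback.
    """
    if not flag_enabled:
        return False  # already off, nothing to roll back
    if requests < min_requests:
        return True   # too little traffic to judge
    return error_rate(errors, requests) <= threshold


# Usage: 50 errors in 1,000 requests is 5%, above the 3% threshold.
print(evaluate_flag(True, errors=50, requests=1000))  # prints False
print(evaluate_flag(True, errors=10, requests=1000))  # prints True
```

The minimum request count is a design choice worth tuning: set it too low and a single failed request right after deployment can flip the flag, set it too high and the rollback arrives late.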
Real-time insights can lead to better prediction, and much of their value comes from being tied to a point in time. The thing to keep in mind here is that in a robust system such as New Relic, you can not only watch your metrics in near real-time, but you can also look back in time at your entire system. This includes every host, application, and network: everything you’re reporting to your observability platform. Having both views makes it possible to compare what is happening now against what happened before.
Detection of anomalous behavior
You can’t prepare for every scenario, so give your observability platform as much information as possible through logs, metrics, and transactions so that the platform can detect anomalies. Let’s consider a manually created alert, something like “Send an email when the error rate is above 3%.” That’s simple enough, but imagine how many specific scenarios there are.
What if one API call to a specific system is erroring 1% of the time, when it normally errors at 0.2%? A static 3% alert would never fire. Modern observability platforms can be configured to detect anomalies, and that would certainly fit the bill. If you have anomaly detection enabled, you would receive an alert and be able to look into what’s happening right away.
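To show the difference between a static threshold and a baseline comparison, here is a small sketch that flags a value when it sits far above its own history. It uses a simple z-score, which is only one of many possible techniques and is not a description of how any particular platform detects anomalies. The function name, the sample history, and the cutoff of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Return True if `current` is more than `z_threshold` standard
    deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_threshold


# Usage: an endpoint that normally errors around 0.2% of the time.
history = [0.0020, 0.0018, 0.0022, 0.0019, 0.0021, 0.0020]
current = 0.01  # 1% error rate

print(current > 0.03)                  # static 3% alert: prints False
print(is_anomalous(history, current))  # baseline comparison: prints True
```

The static alert stays silent because 1% is well under 3%, while the baseline comparison fires because 1% is roughly five times this endpoint's usual error rate.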
What steps can I take?
In the next few articles, we’ll be addressing the following topics:
- New Relic (APM): An industry-standard observability platform
- Google Data Studio (Reporting): An entry-level dashboard platform integrated with Google’s suite of tools, as well as external tools
- Fluentd: A multi-source, multi-destination data collector to allow using multiple logging platforms