Let's Talk Observability

Photo credit: Hernan Fernandez Retamal (used under GNU Free Documentation License Version 1.2), European Southern Observatory (ESO)
[Originally posted on LinkedIn in Jan 2025]
Let’s talk “observability”.
Observability is, in essence, the set of metrics that describe the state of a platform or piece of software. Those metrics help explain business behavior and influence decisions about where to direct engineering effort.
In traditional operations, it’s saying “I need a trail of behavior and events to determine scale”. Early in my career, one of the things my team had to generate every week was an MS Excel spreadsheet with data culled from log files. This data built a month-over-month graph of users by underlying game, which in turn helped the business figure out what was most engaging, where to deploy servers, and which platforms to invest engineering effort in. These days, it’s old hat. You’d leverage any one of a dozen open source tools to build this information and have it on the fly. At the time, though, it was fairly new, and I was just as green as the dot-com boom.
Modern observability efforts created Time Series Databases (TSDBs). To make smart business decisions, you need counts of specific events, a value at a point in time, a way to store each value with its timestamp, and a way to roll up the data and present it on demand. Software like Grafana was created to turn that information into graphs.
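To make the “store a value with its timestamp, then roll it up” idea concrete, here is a minimal sketch in Python. It is not how any particular TSDB works internally. The `roll_up` function and the sample login counts are made up for illustration, and a real system would also handle retention, labels, and persistence.

```python
from collections import defaultdict


def roll_up(points, bucket_seconds):
    """Sum (timestamp, value) samples into fixed-width time buckets.

    `points` is an iterable of (unix_timestamp, value) pairs.
    Returns a dict mapping each bucket's start time to the summed value.
    """
    buckets = defaultdict(int)
    for ts, value in points:
        # Align the timestamp down to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start] += value
    return dict(sorted(buckets.items()))


# Hypothetical samples: (seconds since some epoch, logins seen at that instant).
samples = [(0, 1), (30, 2), (61, 1), (119, 4), (120, 1)]

# Roll the raw samples up into one-minute buckets for graphing.
per_minute = roll_up(samples, 60)
print(per_minute)
```

The same raw samples can be rolled up again at a coarser width (an hour, a day) when the graph needs a longer view, which is roughly what a dashboard asks the database to do on demand.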
Observability has also created things like Fluentd and Fluent Bit, which can parse logs and produce metrics for a TSDB without having to build metrics into the application itself, so long as the application generates logs that can be parsed and there is a destination to send the events to.
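As a rough illustration of deriving metrics from logs without touching the application, the sketch below counts log lines by component and level. It is plain Python rather than a Fluentd or Fluent Bit configuration, and the log format, the `LINE_RE` pattern, and the sample lines are all assumptions made up for the example.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <component>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<component>[\w.-]+): (?P<msg>.*)$"
)


def count_by_component_and_level(lines):
    """Turn raw log lines into counters keyed by (component, level)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Count unparseable lines too, so gaps in the format stay visible.
            counts[("unparsed", "unknown")] += 1
            continue
        counts[(match["component"], match["level"])] += 1
    return counts


# Hypothetical log lines, including one that does not fit the format.
sample_log = [
    "2025-01-15T10:00:00Z INFO auth: login ok user=42",
    "2025-01-15T10:00:01Z ERROR auth: login failed user=43",
    "2025-01-15T10:00:02Z ERROR matchmaker: queue timeout",
    "garbage line",
]

metrics = count_by_component_and_level(sample_log)
for (component, level), count in sorted(metrics.items()):
    print(component, level, count)
```

In practice each counter would be shipped to a TSDB with a timestamp instead of printed. The point is only that a parseable log line is enough to produce a metric.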
And, in recognition of that need, just about every cloud platform and monitoring service now provides these features as standard (GCP, AWS, and Datadog all have this built in, if you’re willing to pay for it), along with a fairly easy API to ingest event logs.
When designing a platform, requiring these functions early, even if only half-assed, pays dividends, as you’ll have to add them later regardless. Does your application generate a log? Can you trace each error to the line that generated it? Does your application generate several log types in a single binary? Can you determine which component of the binary generated the log line, without having to dig through the source code, run it locally to recreate the event, or build the environment from scratch on your laptop?
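One way to answer “which component, which line?” from the log itself is to give each component its own named logger and put the source location in the format string. The sketch below uses Python’s standard `logging` module. The `build_logger` helper, the `app.billing` name, and the order ID are hypothetical, and it writes to an in-memory buffer only so the result is easy to inspect.

```python
import io
import logging


def build_logger(component, stream):
    """Return a logger whose lines name the component, file, and line number."""
    logger = logging.getLogger(f"app.{component}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s "
            "%(filename)s:%(lineno)d %(message)s"
        )
    )
    logger.handlers = [handler]  # replace handlers so repeat calls don't duplicate
    return logger


# Capture output in memory for the example; a real service would log to
# stderr or a file that a log shipper can read.
buffer = io.StringIO()
billing = build_logger("billing", buffer)
billing.error("charge failed order_id=%s", 1001)

output = buffer.getvalue()
print(output, end="")
```

Each line then carries the logger name (the component) and `filename:lineno` (where it was emitted), so nobody has to rebuild the environment just to find out who wrote the message.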
If the business lives and dies by the reliability of a single application, the observability requirements for the app should be well defined early, with some engineering cycles spent refining them. That effort will reduce growing pains later, because it’ll let you run lean and mean sooner, reducing the overall cost of the platform.
And that’s not even going into monitoring and escalation, which I’ve written about earlier.