Originally posted on DZone - Author: Mike Moore
Hey internet humans! I’ve recently re-entered the observability and monitoring world after a short detour in the Internal Developer Portal space. Since my return, I have felt a strong urge to discuss the sad state of observability in the market today.
I still have a vivid memory of being knee-deep in Kubernetes configs, drowning in a sea of technical jargon, never quite sure whether I’d actually monitored everything in my stack, deploying heavy agents, and fighting with engineering managers and devs just to get their code instrumented, only to find out I didn’t have half the stuff I thought I did. Sound familiar? Most of us have been there. It's like trying to find your way out of a maze blindfolded while someone repeatedly spins you around and gives you the wrong directions. Not exactly a walk in the park, right?
The three pain points that are top-of-mind for me these days are:
- The state of instrumentation for observability
- The horrible surprise bills vendors are springing on customers, and the insanely confusing pricing models whose costs can’t even be calculated
- Ownership and storage of data - data residency issues, compliance, and control
Instrumentation
The monitoring community has a fantastic new tool at its disposal: eBPF. Ever heard of it? It's a game-changing technology (a cheat code, if you will, to get around that horrible manual instrumentation) that lets us trace what's going on in our systems without all the usual headaches. No complex setups, no intrusive instrumentation – just clear, detailed insight into our apps' performance. With eBPF, we can dive deep into the inner workings of applications and infrastructure, capturing data at the kernel level with minimal overhead. It's like having X-ray vision for our software stack without the pain of corralling all of the engineers to instrument the code manually.
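To make that "no app changes required" claim concrete, here's a minimal sketch of kernel-level tracing using the BCC toolkit (Python bindings for eBPF). It assumes the `bcc` package is installed and that you're running as root; the probe and messages are purely illustrative and not tied to any vendor's agent.

```python
# Minimal BCC (eBPF) sketch: watch every clone() syscall on the host
# without touching application code. Requires bcc and root privileges.
from bcc import BPF

# The eBPF program below runs inside the kernel; Python just loads and reads it.
prog = r"""
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() entered\n");
    return 0;
}
"""

b = BPF(text=prog)
# Attach to the clone syscall entry point; no instrumentation in the app itself.
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

print("Tracing clone() syscalls... Ctrl-C to stop.")
b.trace_print()
```

A real observability agent obviously does far more than print a line per syscall, but the shape is the same: the probes attach at the kernel level, so coverage doesn't depend on every team remembering to instrument their own code.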
I’ve had first-hand experience deploying monitoring solutions at scale during my tenure at companies like Datadog, Splunk, and, before microservices were cool, CA Technologies. I’ve seen the patchwork of APM, infrastructure monitoring, logs, OpenTelemetry, custom instrumentation, open-source tools, and more that gets stitched together (usually poorly) just to cover the basics. Each of these usually comes at a high technical maintenance cost and requires SREs, platform engineers, developers, DevOps, and others to coordinate (also usually ineffectively) to instrument code, deploy to everything they’re aware of, and cross their fingers hoping they’ll catch most of what should be monitored.
At this point, two things happen:
- Not everything is monitored because we have no idea where everything is. We end up with far less than 100% coverage.
- We start having those cringe-worthy discussions about “should we monitor this thing,” because the sheer cost of monitoring often exceeds the cost of the infrastructure our applications and microservices are running on. Let’s be clear: this isn’t a conversation we should be having.
Indeed, OpenTelemetry is fantastic for a number of things: it addresses vendor lock-in and has a large community working on it. But I must be brutally honest here: it takes A LOT OF WORK. It takes real collaboration between all of the teams to make sure everyone is instrumenting manually and that every single library we use is well supported, assuming we can properly validate that the legacy code we’re trying to instrument actually contains what we think it does. From my observations, this generally results in a patchwork that gives us a very incomplete picture 95% of the time.
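To give a sense of what “instrumenting manually” means in practice, here's a minimal sketch using the OpenTelemetry Python SDK. The service name, function, and attribute are made up for illustration, and the console exporter stands in for whatever backend you'd actually ship spans to.

```python
# Minimal manual OpenTelemetry instrumentation sketch (Python SDK).
# "checkout-service", process_order, and "order.id" are illustrative names only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One-time SDK setup: every service needs some variant of this boilerplate.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def process_order(order_id: str) -> None:
    # Every unit of work you care about gets wrapped in a span by hand.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... actual business logic would run here ...

if __name__ == "__main__":
    process_order("order-123")
```

None of this is hard in isolation. The pain is multiplying it across every service, every library that may or may not have an instrumentation package, and every team that has to agree on naming, sampling, and deployment.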
Circling back to eBPF: with proper deployment and some secret sauce, those two core concerns (incomplete coverage and arguing over what we can afford to monitor) simply go away, as long as there’s a simplified pricing model in place. We can get full 360-degree visibility into our environments with tracing, metrics, and logs, without the hassle and without wondering whether we can really afford to see everything.
The Elephant in the Room: Cost and the Awful State of Pricing in the Observability Market Today
If I had a penny for every time I’ve heard the saying, “I need an observability tool to monitor the cost of my observability tool.”
Traditional monitoring tools often come with a hefty price tag, and one that’s a big fat surprise when we add a metric or a log line, especially at scale! It's not just the initial investment – it's the unexpected overage bills that really sting. You see, these tools typically charge based on the volume of data ingested, and it's easy to underestimate just how quickly those costs add up.
We’ve all been there before - monitoring a Kubernetes cluster with hundreds of pods, each generating logs, traces, and metrics. Before we know it, we're facing a mountain of data and a sky-high surprise bill to match. Or perhaps we decide we need a new facet on a metric and get hit with a massive charge for metric cardinality. Or maybe a dev decides it’s a great idea to add one more log line to our high-volume application and our log bill balloons overnight. It's a tough pill to swallow, especially when we're trying to balance the need for comprehensive, complete monitoring against budget constraints.
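As a purely illustrative back-of-the-envelope (the label counts and per-series price below are assumptions, not any vendor's real rates), here's why adding one innocent “facet” to a custom metric hurts so much: each new label value multiplies the number of time series you pay for.

```python
# Hypothetical cardinality math: none of these numbers come from a real price list.
pods = 200               # pods emitting the metric
endpoints = 50           # existing label values on the metric
status_codes = 10        # the newly added "facet"
price_per_series = 0.05  # assumed $/time series/month, purely illustrative

before = pods * endpoints                # 10,000 series
after = pods * endpoints * status_codes  # 100,000 series

print(f"Before: {before:,} series -> ${before * price_per_series:,.0f}/month")
print(f"After:  {after:,} series -> ${after * price_per_series:,.0f}/month")
```

In this toy example, one extra label takes the metric from roughly $500 to $5,000 a month, and nothing in the deploy pipeline would have flagged it before the invoice arrived.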
I’ve seen customers receive tens of thousands of dollars (sometimes hundreds of thousands) in “overage” bills because a developer added a few extra log lines or someone needed additional cardinality on a metric. Those costs are very real, for very simple mistakes, and often there are no controls in place to keep them from happening. From my personal experience: I wish you the best of luck trying to negotiate those bills down. You’re stuck, as these companies have no interest in customers paying less once the bill has landed. As a customer-facing architect, I’ve had customers see red, and boy, that sucks. The ethics of surprise pricing are dubious at best.
This is where a modern solution should step in to save the day. By flipping the script on traditional pricing models and offering transparent pricing based on usage – not volume, ingest, egress, or some opaque metric nobody knows how to calculate – we can get specific about the cost of monitoring and set clear expectations, knowing we can see everything end-to-end without cutting scope because the cost might be too high. With eBPF and a bit of secret sauce, we'll never have to worry about surprise overage charges again. We can know exactly what we are paying for upfront, giving us peace of mind and control over our monitoring costs.
It's not just about cost – it's about value. We don’t just want a monitoring tool; we want a partner in our quest for observability. We want a team and community dedicated to helping us get the most out of our monitoring setup, providing guidance and support every step of the way. The impersonal, transactional approach of legacy vendors has to change.
Ownership and Storage of Data
The next topic I'd like to touch upon is the importance of data residency, compliance, and security in the realm of observability solutions. In today's business landscape, maintaining control over where and how data is stored and accessed is crucial. Various regulations, such as GDPR (General Data Protection Regulation), require organizations to adhere to strict guidelines regarding data storage and privacy. Traditional cloud-based observability solutions may present challenges in meeting these compliance requirements, as they often store data on third-party servers dispersed across different regions. I’ve seen this happen and I’ve seen customers take extraordinary steps to avoid going to the cloud while employing massive teams of in-house developers just to keep their data within their walls.
Opting for an observability solution that allows for on-premises data storage addresses these concerns effectively. By keeping monitoring data within the organization's data center, businesses gain greater control over its security and compliance. This approach minimizes the risk of unauthorized access or data breaches, thereby enhancing data security and simplifying compliance efforts. Additionally, it aligns with data residency requirements and regulations, providing assurance to stakeholders regarding data sovereignty and privacy.
Moreover, choosing an observability solution with on-premises data storage can yield significant cost savings in the long term. By leveraging existing infrastructure and eliminating the need for costly cloud storage and data transfer fees, organizations can optimize their operational expenses. Transparent pricing models further enhance cost efficiency by providing clarity and predictability, ensuring that organizations can budget effectively without encountering unexpected expenses.
On the other hand, relying on a Software-as-a-Service (SaaS) observability provider can introduce complexities, security risks, and compliance issues. With SaaS solutions, organizations relinquish control over data storage and management, placing sensitive information in the hands of third-party vendors. This increases the potential for security breaches and data privacy violations, especially when dealing with regulations like GDPR. Additionally, dependence on external service providers can lead to vendor lock-in, making it challenging to migrate data or switch providers in the future. Moreover, pricing fluctuations and service outages can disrupt operations and strain budgets, further complicating the observability landscape for organizations.
For organizations seeking to ensure compliance, enhance data security, and optimize costs, an observability solution that facilitates on-premises data storage is a compelling option. By maintaining control over data residency and security while achieving cost efficiencies, businesses can focus on their core competencies and revenue-generating activities with confidence.