Episode 5 – Hawthorne, Observability (and Effects!)

This week on Rent, Buy, Build, we're talking about monitoring, analytics, and observability. As always, we'll try to answer that age old question: if an app crashes in a container and no one is around to hear it, does it affect your SLA?

Observability! Observability! It’s a word, we swear!

James Hunt
Welcome to Rent / Buy / Build. I’m James Hunt.
Brian Seguin
And I’m Brian Seguin, and today we’re talking about observability, which can really be a broad range of topics like logging, monitoring, tracing, chaos engineering. And that really applies to both platform and applications. That’s a lot to talk about in one episode. So James, what are we going to focus on today?
James Hunt
Well, we’re going to cut that gigantic topic down into a much smaller, more bite-sized chunk. We’re actually gonna talk specifically about observability, as opposed to monitoring or logging or APM. The definition of monitoring, to me, is “is everything okay?” When the monitoring system goes off, sends out an alert, pages somebody or whatever, it has detected what it thinks is an anomaly or a problem with the service. This could be anything from a data center being offline, to fiber cables having been cut, to network latency going through the roof.
James Hunt
Monitoring is all-encompassing, across all levels of your platform, your stack, and your offering.
James Hunt
Logging is what happened in the past. Logging is like an audit trail. It’s a forensics tool for piecing back together what happened in any given incident that maybe the monitoring system tipped you off to. A lot of systems out there will blur the line between logging and monitoring. And that doesn’t even get into APM (application performance monitoring) or systems performance monitoring, where you’re trying to determine not only “is everything okay?” but also: are the servers and the application code performing to our expectations? Did we introduce a regression? Is there a problem? Are we consuming too much CPU? Do we need to scale?
James Hunt
That all kind of falls under “performance monitoring,” and performance monitoring is where you see what most people think of when they think of monitoring: pretty graphs with lines going up and down. Hopefully the requests per second are going up, and the amount of stuff used is going down.
James Hunt
Observability, to my mind, is a whole different breed of thing. Once you know there’s a problem, or think there’s a problem, observability gives you a window into the day-to-day operation of your application to figure out what it is doing; to kind of get to the bottom of “is this a problem?” And if it is a problem, what is it doing that is problematic?
Brian Seguin
Okay, so we’re talking about observability giving a window into the application to see if there’s something wrong. Now, are applications subject to the Hawthorne effect? For those that don’t know, the Hawthorne effect is where people perform better when they’re being watched, and it comes from experiments that were done in the 1920s and 1930s. So if you have observability into your application, does it perform better?
James Hunt
You mean, does the application become self-aware, know that it’s being watched, and then adjust its behavior accordingly?
Brian Seguin
That’s right!
James Hunt
In a sense, sometimes, yes.
Brian Seguin
(unconstrained giggling)
James Hunt
And that might come as a shock to some of our listeners. And the reason I say that is because if you’re building an application to be observable, you will build it in a slightly different way. And I think — whenever I’ve done this, I’ve found bugs that I would not have seen otherwise, because I wasn’t trying to dig into the internals.
James Hunt
But if you take something that’s not built for observability, and you instrument it with something that can cut across the implementation code, something like eBPF, or really an instrumentation like JMX at the runtime layer for the application ecosystem or the language, then no, you’re not really going to see anything different. What you will see, however, is the Pareto principle at work: you will find a whole bunch of low-hanging-fruit problems as you exercise observability in your stacks.
Brian Seguin
So the key is to make the developers aware that what they’re building is going to be observed and that will institute the Hawthorne Effect.
James Hunt
Right, there we go! It’s a second-order Hawthorne effect. The unobserved life is not one worth living.
Brian Seguin
Okay, so I guess from an observer, observability standpoint, what’s the “rent”?
James Hunt
I keep saying this on every one of these podcasts: rent is SaaS, right? So when you’re renting observability, what you’re actually renting is the instrumentation and the data crunching; the aggregation of all of this data and information, and a lot of the heuristics about what’s going on.
Brian Seguin
But I mean, are you paying someone else to watch your application for you? Or are you just getting a SaaS application that’s pointed toward your application to monitor what it’s doing?
James Hunt
Usually when you’re renting observability, what you’re renting is the orchestration of the observability. Not somebody watching, but making the tools available to you and your team so that it’s easier for you to watch. Because you can’t watch everything. That’s the main trade-off of any monitoring system: what do we think is important enough to alert a human operator for? And that’s why we build in these hysteresis models, these anomaly detection engines. That’s why machine learning is huge in this space: because there’s a lot of data to look through. And in that rental scenario, we really want someone to provide all of that expertise. So a lot of this is: you’re going to point your app code at a SaaS endpoint that is going to collect a bunch of information, and then it’s going to raise awareness on your team.
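(Editor’s sketch: the “hysteresis models” mentioned above can be illustrated with a two-threshold alert. This is a minimal example, not any vendor’s implementation; the class name, the function name, and the 90/80 thresholds are hypothetical. The alert starts firing only above the upper threshold and clears only below the lower one, so a value hovering near a single limit does not page a human over and over.)

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlert:
    """Two-threshold alert (hypothetical helper, for illustration only)."""
    raise_above: float   # start alerting when a sample exceeds this
    clear_below: float   # stop alerting only when a sample drops below this
    firing: bool = False

    def observe(self, value: float) -> bool:
        """Feed one sample and return whether the alert is firing afterwards."""
        if not self.firing and value > self.raise_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        return self.firing


def alert_states(samples, raise_above, clear_below):
    """Return the firing state after each sample, in order."""
    alert = HysteresisAlert(raise_above, clear_below)
    return [alert.observe(s) for s in samples]


if __name__ == "__main__":
    # Made-up CPU percentages: the dip to 85 does not clear the alert,
    # and the later climb back to 85 does not re-raise it.
    print(alert_states([50, 95, 85, 92, 75, 85], raise_above=90, clear_below=80))
```

The gap between the two thresholds is the tuning knob: a wider gap means fewer flapping alerts but slower recovery notices.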
Brian Seguin
Do most of these have customizable reporting solutions on their end, or is everything mostly out of the box, i.e. for most use cases?
James Hunt
The really good ones are both. They’ll give you a lot of starting ground, a lot of “here’s what we think”: you’re a Java app, or you’re a Ruby app, or you’re a REST-based application written in Node, and here are the types of things we think you’re going to need. And we’re talking about vendors like New Relic, Datadog, and Dynatrace. And like I said when we were talking about what exactly observability is, a lot of these vendors wear multiple hats. So they’re doing observability, but they’re also doing application performance monitoring, and some logging, and forensics trails for what’s happened in the past.
James Hunt
The one that I think is interesting is Honeycomb. honeycomb.io is an eBPF-based one, and I like eBPF, the extended Berkeley Packet Filter. To get slightly nerdy for just a little bit, eBPF is a way of taking an application that wasn’t built to be observed and adding scaffolding around it via the Linux kernel to intercept things. eBPF is fascinating, because it lets me take code that somebody else wrote and figure out what it’s doing without having to ask the developer, and without having to look at the code.
James Hunt
So you can do things like say: anytime this process opens a file descriptor, I want to know what the file descriptor number was, what it was opening, and if it succeeded. And you can then use this to do things like: let me know anytime this application tries to contact this other host on the network, or tries to access the database files on disk.
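(Editor’s sketch: this is not eBPF. As a rough analogy at the language-runtime layer, Python 3.8+ exposes audit hooks that report file opens without any change to the application code. `OpenTracer` and `application_code` are hypothetical names for illustration; `sys.addaudithook` and the `open` audit event are part of the standard library. The `enabled` flag stands in for the idea of switching visibility on only when it is needed.)

```python
import sys


class OpenTracer:
    """Records file-open events raised by the Python runtime (hypothetical helper)."""

    def __init__(self):
        self.enabled = False   # flip on only while investigating
        self.events = []       # (path, mode) tuples seen while enabled

    def hook(self, event, args):
        # The runtime calls this for every audit event; we keep only "open".
        if self.enabled and event == "open":
            path, mode = args[0], args[1]
            self.events.append((str(path), mode))


tracer = OpenTracer()
sys.addaudithook(tracer.hook)  # audit hooks cannot be removed, hence the flag


def application_code(path):
    """Stand-in for code somebody else wrote; it knows nothing about the tracer."""
    with open(path, "w") as fh:
        fh.write("hello")


if __name__ == "__main__":
    import os
    import tempfile

    target = os.path.join(tempfile.mkdtemp(), "data.txt")
    tracer.enabled = True
    application_code(target)
    tracer.enabled = False
    print(tracer.events)
```

A real eBPF probe does the equivalent from the kernel side, for any process in any language, which is what makes it attractive for code you cannot modify.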
James Hunt
And what the observability SaaS market has done with eBPF is take applications that are huge and externally instrument them, so that you get this vast wealth of information, literally a play-by-play of what the app is doing, without the development team having to stop the velocity on their implementation of features and backlog to put in all the observability.
James Hunt
If you have a 2-million-line application, you don’t want to have to go through every single function in it and put in print statements, or put in something that reaches out to an HTTP endpoint or a local daemon and says, “hey, I just did a thing.” eBPF lets you do that kind of for free, because it’s built into the Linux kernel. So it goes back to being part of the runtime. And most of them are moving that way, because it’s just plain easier: as a developer, you don’t have to do anything ahead of time.
Brian Seguin
Interesting, and there’s no additional tax on your infrastructure.
James Hunt
There’s a tiny amount. I mean, there’s overhead because the kernel is doing stuff. But it’s not enough to outweigh the benefits of not having to turn this stuff on and off, right? Because, in general, when the problem happens, it’s actually the reverse Hawthorne, right? The problem happens when you’re not looking. I have been through countless scenarios with clients, customers and my own applications where something breaks, and in order to get visibility into what’s going on, I have to add more instrumentation. And in doing so, I end up having to restart the process or deploy a new version of the application and the problem evaporates and it doesn’t come back. So now I don’t know if I fixed it. I don’t know what was wrong. I don’t know what the root cause was, and I have no context. But with things like eBPF, we can turn on the X-ray vision when we need it to see what’s going on in the application.
Brian Seguin
So for these SaaS rent solutions for observability, how are they charged? Is the licensing consumption-based? Is it license-based? How does it work?
James Hunt
They’re usually data retention-based.
Brian Seguin
Oh, interesting.
James Hunt
So if you keep 30 minutes of the time series data of what events happened, you’re going to pay less than if you keep 30 days. And that actually works out to the benefit of the customer, or the consumer, the person who is using that data, because in general this type of observability is an on-demand thing. Observability in this sense, being able to see what an application is doing step-by-step, is primarily a temporal proximity proposition: I want to know what happened right now. I’ve got a support call, and I’ve got customers that are having issues. I need to know what’s going on today; I’m not going to try and reproduce an environment from two weeks ago.
Brian Seguin
So once I implement one of these solutions, I should be able to predict fairly steadily what my cost is going to be based off of the retention model.
James Hunt
It depends. As you scale up the application layer, as you scale up more processes reporting into these systems, you will have more data. Because like I said, it’s retention, but it’s not so much time-frame retention: on the charging side, they charge you essentially per byte. But because you get to control how much of a timeframe you keep, and you’re generally gonna keep that slim, you know, a couple of hours usually, you can kind of control your costs there.
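(Editor’s sketch: a back-of-the-envelope model of the per-byte, retention-window pricing described above. Every number here is made up for illustration, including the price per gigabyte, and real vendors’ pricing will differ. The point is only that the retained volume grows with both the number of reporting processes and the length of the window.)

```python
def retained_bytes(processes, bytes_per_second_each, retention_seconds):
    """Bytes held at steady state: everything emitted within the retention window."""
    return processes * bytes_per_second_each * retention_seconds


def retention_cost(processes, bytes_per_second_each, retention_seconds, price_per_gb):
    """Cost of the retained data at a hypothetical flat price per gigabyte."""
    gigabytes = retained_bytes(processes, bytes_per_second_each, retention_seconds) / 1e9
    return gigabytes * price_per_gb


if __name__ == "__main__":
    thirty_minutes = 30 * 60
    thirty_days = 30 * 24 * 60 * 60

    # Ten processes, each emitting an assumed 1,000 bytes of events per second.
    short = retained_bytes(10, 1_000, thirty_minutes)
    long_ = retained_bytes(10, 1_000, thirty_days)
    print(short, long_, long_ // short)

    # Doubling the number of processes doubles the bill at the same window.
    print(retention_cost(10, 1_000, thirty_minutes, price_per_gb=2.0))
    print(retention_cost(20, 1_000, thirty_minutes, price_per_gb=2.0))
```

Under this simple model, keeping 30 days instead of 30 minutes multiplies the retained data by 1,440, which is why a short window keeps costs predictable even as the application scales out.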
Brian Seguin
So what’s a “buy” solution?
James Hunt
A “buy” solution is tricky, because again, a lot of these things are so big, and so heavy on things like machine learning, that “buy” is really just: are you going to run an on-prem version, or an Open Source APM tool, and use it as an observability tool?
Brian Seguin
Okay, what types of solutions are there for that?
James Hunt
Honestly, I haven’t seen most of these play out. We did some research on this because “buy” was an interesting proposition. It’s not something I’ve seen too many people do.
James Hunt
APM is so complicated. Most people don’t want anything to do with it and are happy to lift that burden from their developers. But there are a ton of Open Source things out there. They’re usually focused on a specific aspect of your infrastructure, for example SQL profiling, or on a specific language, runtime, or ecosystem, like Java or .NET. For example, AppMetrics is an APM tool for .NET, for both .NET Core and .NET Framework. SkyWalking is a thing from the Apache Foundation. All of these things are for more specific audiences than I think the SaaS offerings are, and the amount of effort and expertise that you need to effectively pull off a “buy” solution is only exceeded by what you need for a “build” solution.
Brian Seguin
Okay, so what is a build solution here?
James Hunt
You take a lot of those Open Source components, and you weave them into a complicated mesh of your own observability that’s tailored specifically to what you need. The only real advantage to a build solution is that — the SaaS offering, they come with all these out of the box reporting options, right? All these things you can look at. But in reality, I haven’t found there are a ton of KPIs that are actually valuable for any given application.
James Hunt
The challenge in front of APM companies is they have to appeal to a very broad and very wide audience of developers: people in different languages and in different ecosystems and in different application types and niches, whether it’s backend processing or frontend web apps or API work. So they have to go very wide and very broad in what they offer. But if you’re building your own, things like eBPF and JMX will make an excellent base, and you can target specifically the things you know you are worried about with your application.
James Hunt
I’ve seen numerous cases where people didn’t even think about observability until something bad happened. And they were in the middle of a massive outage or an inexplicable or unexplained slowdown or latency increase. And they turn to tools like JMX or eBPF that were already available and entrenched in the lower echelons of their stack. And they were able to just kind of zero in with like, almost surgical precision and find things like: how many times was this particular process going out across the network to do something, or how many times was it allocating more memory for garbage collection? Those kinds of stories — they make for great conference talks, for starters, because they’re self-contained, everybody’s had these problems. And it’s a really whiz bang bit of technology that you get to show off.
Brian Seguin
So are you saying that the build scenario is more of a shadow IT place, where they don’t have a solution incorporated already, and then they just have to get something implemented right away, and they may not have the budget to do it or may not have the procurement cycles?
James Hunt
I don’t know that I look at it as “shadow IT” so much as I look at it as “commodity application development” or infrastructure. Say, for example, an application team comes across a need for key-value storage. Very rarely does the dev team (or, if it’s a DevOps team, the combined technique and talent of both of those disciplines) go out to procurement and say, “hey, do we have a multimillion dollar key-value storage solution that we already pay for that we can use?” No, they go out and they download Redis. They put Redis into the mix, and really, it’s not a piece of infrastructure so much as it becomes part of the architecture.
James Hunt
So I don’t know that I’d call it “shadow IT.” I’d call it more full stack development, or full stack DevOps. There we go, maybe we can coin that term: Full Stack DevOps. Because you’re not just talking about the development, but you’re also talking about the execution environment and instrumentation.
Brian Seguin
Interesting. So it sounds like the recommendation, and please correct me, is that you’re always going to want to rent your observability solution, because there’s not many choices out there for buy, and building is very niche. Is that right? Or…?
James Hunt
I think it depends on your scale.
Brian Seguin
Okay.
James Hunt
If you have hundreds of developers, and you’re a larger company, you know, you’ve got enough application developers that for instance, you’re running your own PaaS. Or you have, you know, a hefty footprint, and you’ve got a lot of people who are very agile, moving along, doing what they do all day, and you can afford it, the rental for an observability system is a no brainer. If you’ve got enough people that will use it, it’s very cost effective.
James Hunt
And when you talk about Datadog, or New Relic, or Dynatrace, they’re not just observability. So you’ve also got that piece of the pie to consider. There’s the logging and forensics, there’s the machine learning for alerting and latency drops and other things that can engage your support organization.
James Hunt
If you’re big, I think rental is a good thing.
James Hunt
If you’re small; if you have, you know, a handful of developers, and not enough users to bring in enough revenue to pay for that service offering; if it’s not cost effective, “build” is an excellent approximation of what you would need. It does, however, as most “build” solutions we talk about do, require a little bit more skill set and expertise. But the nice thing about the Internet is that you can go out and find all kinds of people talking about things like eBPF and using JMX and deep monitoring into these platforms, because you’re not solving new problems; you’re solving problems that other people have also had.
Brian Seguin
Interesting.
James Hunt
Alright, join us next time on Rent / Buy / Build as we discuss security, specifically the management of secrets in your infrastructure. How do we keep the passwords private and the secret keys secret? We’re going to talk about secret zero, and we’re gonna go through a couple of options for things like Vault, KMS, and plain old files on disks. Should be fun. I’ll see you next time, Brian.
Brian Seguin
See you next time, James.
James Hunt
You can find all episodes of Rent / Buy / Build online at rbb.starkandwayne.com or wherever podcasts are sold. Hey, before we go, I just want to let you know Brian and I will be talking at Open Source North, a conference being held virtually this year, on May 20. We’ll be talking about pretty much what we talked about here on the podcast: renting, buying, building all the pieces of cloud native. Hope to see you there.