Episode 7 – Distributed Monitoring, Everywhere

We're back on the Observability bandwagon, this time to talk about Distributed Monitoring! Ever wanted a globally distributed, outside-in view of your application's latency and response time, but didn't know if such a thing was available for purchase, rental, or assembly? You're in luck...

Brian Seguin
Hello, you’re listening to Rent / Buy / Build. I’m Brian Seguin.
James Hunt
And I’m James Hunt.
Brian Seguin
Today we’re talking about monitoring, but more specifically: distributed monitoring, why you need to consider it and what you might be missing. James, what is distributed monitoring?
James Hunt
So when I told Brian, “Hey, I want to have an episode, I want to talk about distributed monitoring,” Brian, who is a much better podcast co-host than I am, went out and did a ton of background research.
Brian Seguin
Yeah, background research on monitoring itself, right.
James Hunt
So the word monitoring-
Brian Seguin
Just the word monitoring, right. So like, I’m here, I’m researching how Ring monitors (like Ring, the doorbell app, right?), how that monitors different people’s movements. Then I started to realize I need to go into Cloud Native monitoring. Like, monitoring is a massive, massive industry.
James Hunt
Indeed. So yeah, he sends me the Gartner report and a whole bunch of other stuff. And I said, “No, no, Brian, Brian, how many hours?” And he said, “I haven’t seen my family in three days.” And I said, “That’s about right.” But seriously.
Brian Seguin
So a few hours ago, James says, “Hey, we’re talking about distributed monitoring, right? Not anything that you’ve researched before.” So this should be an interesting podcast.
James Hunt
Fear not, we will definitely touch on tracing and auditing and logging and all the other stuff throughout the course of this podcast, because as Brian noted (correctly so), monitoring is a massive topic. So today, we’re gonna talk about distributed monitoring. I saw a talk at a conference, a virtual conference, because all conferences these days are virtual, put on by the Kong folks. I am blanking on the name right now, but I will definitely put it in the show notes. It was an excellent talk about distributed monitoring.
Brian Seguin
Isn’t Kong an API thing? What does monitoring have to do with Kong?
James Hunt
Well, I mean, if your API is down, it’s not much of an API, is it? Right? I mean, it’s “if a tree falls in a forest,” right? If an HTTP request hits a dead port, does it make a response? No, it does not, as the old saw goes. I will point out that the speaker was not from Kong; he was a guest speaker. Usually you go to these types of industry trade shows put on by a single vendor, and it’s like, “Oh, now we’re going to talk to this field CTO and that field CTO.” They actually had a fairly good mix of topics. But I’m getting off topic. So the talk was about distributed monitoring and whether you should be doing it, which I thought was an interesting take, because it’s one of those things where oftentimes with monitoring, it’s not a “should you be doing it?”, it’s “how should you be doing it?”
Brian Seguin
Right, everybody’s monitoring solution has to be different because they have to track different types of application performance metrics, because everybody’s customer is different, right? So I think distributed monitoring is the actual act of looking at your application from different places.
James Hunt
Right! Because the internet is wide and vast. And there are lots of places where your users or your clients could be connecting from. You may be running your services in AWS east, right, you may be in the Virginia data centers, you might be on the west coast, you know, in the other AWS data center. But you’ve got people all over the place, and they’re all going to have a slightly different experience in getting to your application. They’re going to have different latencies, they’re going to have connectivity issues, right? They’re going to have response time issues. And distributed monitoring is: how can we detect those before it becomes a revenue-impacting event or a user-base-impacting event?
Brian Seguin
From a distributed monitoring standpoint, it’s not for every single customer — distributed monitoring is for companies that have a wide breadth of customer base, you know, they might have customers in Asia, they might have customers in Europe, they might also have customers in the United States, or, you know, Latin America, or even Australia and random New Zealand. But–
James Hunt
Random-
Brian Seguin
Random New Zealand, yes
James Hunt
That’s not like the regular, it’s-
Brian Seguin
different. It’s the random New Zealand.
James Hunt
And it’s not even just where the customers sit. In the modern world of VPN services, it’s not unusual for someone to be sitting within, you know, 100 miles of your data center, but be coming from across the ocean, because their VPN service is routed through the EU, or they’re sitting on VPN through another cloud provider. There’s all kinds of different ways that people could get to you. So even if you think, physically or geographically speaking, your user base is concentrated in one region, there’s always a possibility that they’re coming from a different region.
Brian Seguin
Well, not even just from the region standpoint: are you monitoring inside of your own network? Or are you monitoring as the customer sees it, outside of your own firewall rules? I mean, we’ve definitely seen it where a company sets up a monitoring solution inside of their own network environment that kind of tags along with their applications. But their security / networking team all of a sudden just created some new firewall, and one feature of the app is not available anymore, because the customers can’t get into it. And the monitoring didn’t pick it up because the monitoring was internal. So it’s important to also ensure that your distributed monitoring solution is outside of your own network, and that it’s also geographically dispersed, you know, either similar to your customer base or in strategic points, right?
James Hunt
I mean, the easiest, most surefire way of ensuring your monitoring is as well-covered as possible is to write a massively popular browser extension that maybe blocks ads, and then use it as a base of operations for testing connectivity to your website from all those browsers out there; effectively creating a botnet. That’s right, that’d be a botnet. Yeah, we don’t want to do that. So since we can’t do that, the next best thing is to find infrastructure footholds throughout the globe, and put small agents there.
Brian Seguin
Luckily, all of your IaaS providers that you might be using have footholds. And if you’re using on-prem solutions, it’s even better to have this type of distributed monitoring system on top of an IaaS provider. I think this is one of the cases where either you’re renting from a distributed monitoring SaaS solution, or you’re procuring space inside one of the IaaS providers to put your own monitoring solution in a distributed fashion. Right?
James Hunt
Right. So if we want to look at the technical components, really what we’re talking about is you’re putting a “ping” command running from a whole bunch of different places on the internet, across different continents, across different geographic boundaries and different regions. You can either have somebody else do that for you, and pay for access to a certain number of transactional tests per minute, per hour, per whatever. Or you can go out and build your own footprint. You can easily sign up to AWS or GCP or Linode, and spin up small VMs that do nothing but try to reach your website every X minutes. The real question is: is that going to be cost-effective from the consumption perspective? And is it going to be valid? Or rather, is it going to be too much work to maintain all that apparatus? Because now you’re on the hook for that stuff being up, right? It’s similar to secret zero, which we talked about when we talked about secrets management. And the problem doesn’t get better by having more things to manage. Right? Who watches the watchmen, right? Who monitors to make sure that all the parts of the monitoring system are up? So it’s usually better, unless you already have a lot of footholds, and we’re not talking like three points on the globe. We’re talking, you know, 20 to 40 different points of origin for doing these tests so that you-
Brian Seguin
You might say it’s federated, right, it’s interwoven throughout wherever people will be connecting from.
James Hunt
Right. And there are systems and companies out there that will rent you that for a monthly fee, based on number of transactions: places like Keynote Systems, Pingdom, SiteUptime-dot-com. These kind of first-order “is my website working?” monitoring systems actually work fairly well for gathering response time and latency.
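Editor’s note
The build-your-own option James describes above (small VMs that do nothing but try to reach your website every X minutes) could be sketched roughly as follows. This is a minimal illustration, not anything used by the hosts or the vendors named: `ProbeResult`, `http_fetch`, `probe`, and the origin labels are all hypothetical names, and it times a full HTTP request rather than an ICMP ping.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    origin: str            # hypothetical label for where this probe runs
    url: str
    ok: bool               # True when a 2xx/3xx status came back
    status: Optional[int]  # HTTP status, or None if no response arrived
    seconds: float         # wall-clock time spent on the attempt


def http_fetch(url: str, timeout: float) -> int:
    """Fetch the URL with the standard library and return the HTTP status."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # include the body download in the measured time
        return resp.status


def probe(origin: str, url: str, timeout: float = 10.0,
          fetch: Callable[[str, float], int] = http_fetch) -> ProbeResult:
    """Run one outside-in check and record reachability and response time."""
    start = time.perf_counter()
    status: Optional[int] = None
    try:
        status = fetch(url, timeout)
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx status
    except Exception:
        status = None      # DNS failure, timeout, refused connection, etc.
    elapsed = time.perf_counter() - start
    ok = status is not None and 200 <= status < 400
    return ProbeResult(origin, url, ok, status, elapsed)
```

On each VM, a cron entry or a small loop with `time.sleep` would call `probe` on the chosen interval and ship the result to wherever alerts are handled. The `fetch` parameter exists only so the check can be exercised without a network.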
Brian Seguin
So I think what’s interesting to call out is that, you know, in the Cloud Native space, there are a ton of monitoring solutions, both in rent and buy. There’s also a ton of monitoring solutions that are custom made by clients, because there’s a specific need. But specifically in the distributed monitoring solutions, not all of the, you know, cloud native monitoring software solutions that are out there, that are marketed, can actually do the distributed monitoring. Is that correct?
James Hunt
Right. We were talking before the podcast, and the idle chit chat had to do with monitoring systems and large monitoring service providers. And the question that came up (it wasn’t really so much a question) was how these big vendors like the DynaTraces, the New Relics of the world, the DataDogs, they just seem to have so many offerings.
Brian Seguin
They do.
James Hunt
Right?
Brian Seguin
And very confusing pricing structures, by the way.
James Hunt
I think they do that on purpose. No; they have a lot of stuff, you know, in this, I need to talk about AI and ML, right, the new kids on the block from an observability perspective. But you’ve got a ton of services, because it’s worth it, once you’ve chosen a monitoring vendor, to have them do more with your data.
Brian Seguin
Yeah.
James Hunt
If they’ve already got access into the application via some sort of sidecar agent, why not also have them try and do dependency validation? Is this Tomcat server being down causing other outages? Why don’t we correlate those events? So you get into this relationship with your monitoring vendor that I think is unique to the observability space, where you want to give them more stuff, because it’s cheaper. But-
Brian Seguin
It’s so easy to have add-ons to your data once you already have it all in one place.
James Hunt
And once you have a good pipeline for ingesting that data and a good visualization engine, I mean, heck, why not just throw on more slicing and dicing of the data? But we don’t see that with distributed monitoring, because it costs money to do that sort of outside-in monitoring. And that’s really what distributed monitoring is about: it’s an outside-in view of the system.
Brian Seguin
And because it’s an outside-in perspective, distributed monitoring is an effort to try to eliminate those, you know, Twitter posts and those social media posts of, “Oh, this is so slow, this is down,” you know, right? It’s something that you might not actually catch if you’re not doing it from the outside-in perspective.
James Hunt
Right. One of the things that I find useful with distributed monitoring is you can use it in a couple of different ways. The most obvious one is as a differential diagnostic tool: the site is down... is it? If your customer logs a support ticket that says, “I can’t get to the website to log in to do the thing that I do,” the first thing your support engineer is going to do is open a web browser and check. And if it works for them, now you’ve got two points of data: (a) it doesn’t work for this person, (b) it does work for the support engineer. And you have to find the line that separates those two worlds, the boundary, to try and figure out what the problem is. With distributed tracing, or distributed monitoring rather, you already have a whole bunch of active, external agents feeding information into your network operation center, your support center, saying, “Hey, it’s down from Pennsylvania, but it’s not down from Switzerland. And here’s all the places on the globe where it’s reachable.” And you can kind of build that vision, that image of where this problem might be, and you can focus your detective work. If you find out that half the points of origin can’t reach the website and half of them can, then you need to start looking at backbone network issues, right? You need to look at provider cuts and other-
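Editor’s note
The differential diagnosis James walks through above, comparing which points of origin can and cannot reach the site, can be illustrated with a small sketch. The function name `triage`, the three labels it returns, and the origin names are assumptions made for illustration; the episode does not prescribe them.

```python
from typing import Dict, List, Tuple


def triage(reachable: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Summarize per-origin reachability into a first guess at where to look.

    `reachable` maps a point-of-origin label to whether the last check from
    that origin succeeded. Returns a label plus the sorted failing origins.
    """
    if not reachable:
        raise ValueError("need at least one point of origin")

    failing = sorted(origin for origin, up in reachable.items() if not up)

    if not failing:
        # Every origin can reach the site; a single user's report is
        # more likely something local to that user.
        return ("reachable-everywhere", [])
    if len(failing) == len(reachable):
        # No origin can reach the site; start with the site itself.
        return ("unreachable-everywhere", failing)
    # Some origins fail and others succeed; look at what the failing
    # origins have in common, such as networks or providers in between.
    return ("partial", failing)
```

For the example in the conversation, `triage({"pennsylvania": False, "switzerland": True})` gives `("partial", ["pennsylvania"])`, which points the detective work at what sits between Pennsylvania and the site.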
Brian Seguin
I would assume that the need for distributed monitoring grows as you start to have these clusters deployed in different regions as well, right? Because you might actually be having a cluster down that’s in Europe. But your normal monitoring solution, which might be you know, based in the US, might only be getting the traffic — might be naturally getting the traffic routed to the cluster in the US.
James Hunt
Especially when you’re talking CDNs.
Brian Seguin
Yeah.
James Hunt
Right. CDNs and geolocation and all that stuff; edge computing again. Hey, edge computing. Hi, I swear to God, we’ll get to you eventually in the topic list. Another huge area for us to discuss.
Brian Seguin
I think, I mean, basically, I think every single company needs to have some form of monitoring, at least for the core critical components of what their customers are using, what their core business logic is. From a distributed monitoring standpoint, it’s really growing in importance. But everybody needs to have a distributed monitoring solution at some point, right? They need to be looking at it from the customer standpoint, especially as they grow out regionally or globally.
James Hunt
And I think this scale is the important factor here. The other thing that distributed, or rather outside-in, monitoring gets you is: if your provider goes offline, your external monitoring provider most likely will not, unless there’s been a massive outage on the Internet, right? Whereas, to your point earlier, you said: well, what if the monitoring system’s right next to the k8s cluster, the Kubernetes cluster that houses the app? What happens when the whole thing goes down?
Brian Seguin
Everything goes down!
James Hunt
Which does happen. Everything goes down and you don’t get an alert telling you that, because the monitoring system died before it could get the message out. So to your point, I think it is a requirement for everybody to have some sort of outside-in monitoring; whether or not it’s distributed gets into your customer base.
Brian Seguin
So it almost seems that rental is a really good option from that standpoint, right? So what would you say about renting a distributed monitoring solution, you know, outside-looking-in? Is that easy?
James Hunt
I think it’s the best bet for 80 to 90% of people running things on the public Internet, because you can grow into it. You can start small: “I only need to check the front page, and I really only need to check it every couple of minutes, or every hour even,” because you’re using it not so much as an early warning, first detection; you’re using it as “there’s a bigger problem that has survived longer than five minutes that we need to take care of.” It’s a latency issue, it’s a response time issue, or it’s an outright connectivity issue. And maybe your user base is concentrated. So you can say, “Look, for launch, we’re just going to do one West Coast US, one East Coast US, and one EU probe, or point of origin.” As you scale out (and this was kind of the point of the talk from the Kong thing), you can add more. Let’s say you’re an e-commerce platform, and you’re mostly targeting North America: it doesn’t make sense to drop a point-of-origin probe in Asia. But as soon as you start to do business in Asia Pacific, there’s a whole bunch of places you should really be checking from to make sure that you are aware of connectivity issues getting to your e-commerce platform, if for no other reason than that it might help to chase down discrepancies in traffic and volume. If your website normally does (and I’m going to use completely fictitious numbers, and Brian’s gonna cringe because they’re completely unrealistic), let’s say, $1,000 in e-commerce on a Monday, and then on Tuesday you do $100, why did it drop off? Absent any other indicators of what happened, what are all the possible ways that that might be explained?
Brian Seguin
There’s so many different things, right? There might be a sale going on on Monday. You know, there’s so many non-technical circumstances that might impact it, it just might be consumer behavior, right? You don’t know. But I guess you have to rule out a technical failure. Right?
James Hunt
Right. So if you have distributed monitoring, and it tells you, “Hey, by the way, the EU had a spike in load times from 140 milliseconds to over four seconds,” well, that clearly is a problem that most likely explains the volume. And it lets you kind of focus the search, right? You dig into the numbers, you find out: yeah, the US sales were the same as the US sales from the day before, but the EU sales essentially vanished. And the working theory would be that the response times drove them to other e-commerce platforms, other vendors that also sell the same thing you sell: to your competitors. And knowing that means you don’t have to change your advertisements. You don’t have to change out your marketing promo. It’s really not a thing that your behavior can do. You just need to fix this response time issue.
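Editor’s note
The kind of per-region signal James describes above (EU load times jumping from 140 milliseconds to over four seconds while the US stays flat) could be surfaced with a comparison like the one below. This is a sketch under stated assumptions: the function name `spiking_regions`, the use of medians, and the threshold of three times the baseline are illustrative choices, not anything stated in the episode, and the sample numbers are invented to match the example.

```python
from statistics import median
from typing import Dict, List, Tuple


def spiking_regions(
    baseline_ms: Dict[str, List[float]],
    current_ms: Dict[str, List[float]],
    factor: float = 3.0,
) -> Dict[str, Tuple[float, float]]:
    """Return regions whose current median load time is at least `factor`
    times their baseline median, mapped to (baseline, current) medians."""
    flagged: Dict[str, Tuple[float, float]] = {}
    for region, samples in current_ms.items():
        base_samples = baseline_ms.get(region)
        if not base_samples or not samples:
            continue  # nothing to compare against for this region
        base = median(base_samples)
        now = median(samples)
        if base > 0 and now >= factor * base:
            flagged[region] = (base, now)
    return flagged


# Invented sample data in the spirit of the EU example above.
yesterday = {"eu": [135, 140, 150], "us": [90, 95, 100]}
today = {"eu": [3900, 4100, 4300], "us": [92, 96, 101]}
flagged = spiking_regions(yesterday, today)
```

With these samples, only the EU is flagged, which is the cue to check response times before touching advertisements or promotions.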
Brian Seguin
Yeah, you want to make sure that the monitoring is implemented in a way that it gives you the right metrics to identify the technical issues, so you don’t assume there’s a business issue associated with something. You don’t want phantom business issues created by technical issues, because you just can’t detect what the technical issue is.
James Hunt
Now say that 10 times fast.
Brian Seguin
Yeah, so what is a rent solution here? Like, what kind of companies can you rent distributed monitoring from?
James Hunt
Like I said, there’s a ton; usually they fall in the market classification of “is my website up?” And you can use this for everything from straight-up API calls, to REST endpoints that are public, that browsers consume, to actual “How long does it take to render the front page?” They vary in complexity and sophistication along those axes. SiteUptime-dot-com is one; Pingdom is another. I mentioned Keynote Systems; I’ve used them in the past. But really, that’s from the rental standpoint. You can build these, like I said, for something like probably $5 to $10 per point of origin per month. You can run your own very small thing on Linode or AWS or GCP.
Brian Seguin
Well, I think that’s important, especially if you want to test specific things. I think monitoring is one of those cases where there’s one specific mission-critical, business-critical thing that you need to make sure is working all the time, so you’re just going to say, “Hey, my DevOps team is just going to write a monitoring solution just to make sure that one thing is up and operating.” And I’m sure that’s happened to you in the past, right?
James Hunt
Right. No, I mean, I’m a monitoring person by trade. I spent eight years in the biz, building that very same thing for a company that focused very heavily on web presence. And we wrote hundreds, if not thousands, of custom plugins for testing all manner of different things where off-the-shelf monitoring systems couldn’t really do it. And if you look at things like Prometheus or Nagios, they’re all extensible for a reason.
Brian Seguin
So I think monitoring is one of those type of solutions, and even distributed monitoring is one of the solutions that you’re going to have a lot of use cases for build. Rent might not cut it. Buy might not cut it.
James Hunt
Yeah, I can’t actually think of too much in the “buy” space. To be perfectly honest, there’s not much. It’s a niche, but growing, part of the observability space. Especially as edge computing kind of takes off, there’s a lot more points of place, because that’s just talking straight. What we’re talking about right now is using existing data centers: big, big rows of servers that are running virtualization farms for the cloud providers. But once Edge starts to take off, you might see the rise in the next, you know, four to five years of companies that provide Edge points of origin, not just for compute near customers, but also for verifying what the latency looks like, not just on the eastern seaboard, but in this neighborhood in DC, or this neighborhood in Beverly Hills. That’s a possibility, I can definitely see that. I think at that point, you might have a licensed solution or a bigger rental solution, but the build is really: spin up your own servers in a whole bunch of different IaaSes and put either custom monitoring or ping checks on them, which, if we’re honest, is a good first approximation.
Brian Seguin
I think rent is really good if you have very general needs. And build is definitely a solution; you might want to rent and build here, right? Because build might be for, like I said, these mission-critical features that drive my revenue for the customer. I need to make sure these are working all the time, and I want that redundancy.
James Hunt
You do have to draw the line between functional monitoring and network monitoring. So while functional monitoring almost invariably benefits from being custom to the application being monitored, network monitoring, not so much. A ping is a ping, regardless of what your application does. And the response time, assuming you hit an HTTP endpoint, is going to be the same, right? There’s no custom logic in making sure that the main API call finishes within 200 milliseconds. And what you’re really getting as a value from the distributed monitoring is all those different points of origin. So while you should be writing custom monitoring to make sure that the platform is functional, the functionality rarely depends on where the end user’s coming from. Right, GitHub doesn’t act differently if you’re in China versus the US. The code doesn’t care where you come from. So we do have to be careful about that functional divide. The other good thing for a build scenario really depends on how many places you already have a foothold in. You mentioned CDN and geolocation. Having clusters that are closer to your customers actually opens up a very interesting opportunity for you from a build perspective. If you’ve already got, for example, Kubernetes clusters, and they’re in the EU and America, then you’ve already got the groundwork laid for doing your own distributed monitoring from one system to the others. Right, you don’t even have to take on additional infrastructure spend: if you’re already operating in those data centers, you might as well take advantage of that to kind of watch yourself in the mirror, as it were.
Brian Seguin
So I think this is one of the sections of the very large topic of observability
James Hunt
Obervability?
Brian Seguin
Obervability ba ba bla bla
James Hunt
One day I’ll spell it right, I swear
Brian Seguin
So this is one of the sections of the very large topic of observability. We will be touching on other topics of the very broad observability space, and even more specifically, we’ll be talking more about the actual monitoring space in and of itself. This podcast (episode) is specifically to discuss distributed monitoring. And we still have a lot of other monitoring-type solutions to discuss, like performance monitoring and whatever else James tells me to do.
James Hunt
And you’re the one who did all the research, Brian. So you tell me what we’re gonna talk about.
Brian Seguin
(laughter)
James Hunt
So Brian, are you excited for May 20th?
Brian Seguin
What’s May 20th?
James Hunt
Open Source North conference. You may have heard of it.
Brian Seguin
Oh, yes. I’m very excited about May 20th at Open Source North.
James Hunt
Why are you excited about May 20th at Open Source North?
Brian Seguin
Because we’re talking about Rent / Buy / Build.
James Hunt
Oh, my gosh, we are?!
Brian Seguin
We are presenting at that conference. Hopefully not about something that I’ve had no time to prepare for.
James Hunt
Is it observability?
Brian Seguin
No.
James Hunt
Oh, what are we talking about, Brian?
Brian Seguin
I hope it’s not about observability. We are talking about Rent / Buy / Build and how to create a process inside of your own institution on how you decide whether you rent things, buy them, or build them yourself.
James Hunt
Oh, we’re giving away the secret to the podcast.
Brian Seguin
Yeah, it’s not so much of a secret; every single company has to do this.
James Hunt
I think tickets are still available. They might still be available by the time this podcast hits the airwaves. Waves? Air packets?
Brian Seguin
I think this podcast is airing on May 20th.
James Hunt
No it’s not! No, it’s like two weeks prior. In any event, I hope to see all of you there. It’ll be fun.
Brian Seguin
Please ask me stupid questions, so I don’t feel so bad.