Episode 8: It’s Log! Log!

Continuing (a bit) from where we left off last episode, in this episode of Rent Buy Build, we tackle logging! When you should do it, whether you should do it, and how you should do it.

James Hunt
Hello, you’re listening to Rent Buy Build, the podcast where we look at the pieces and parts of a cloud-native platform and ask the question: should you rent this, buy this, or build it yourself? I’m your co-host, James Hunt.
Brian Seguin
And I’m Brian Seguin.
James Hunt
And today we’re going to talk about logging, yet another pillar of observability.
Brian Seguin
Yeah, James, these pillars of observability are so confusing. They are so intertwined: you try to look for something that just solves your logging issue, and it has all these other things baked inside of it. Same with application performance monitoring; that’s often linked to logging as well. There are a lot of add-ons in this industry. You can’t get just a logging solution; you’re getting a logging solution with, you know, some type of auditability or…
James Hunt
Right. I mean, observability is huge. That’s why they made a whole new term, because observability didn’t exist. Ten years ago, we just called it logging and monitoring and performance and metrics and graphing. So when we talk about logging, though, we really need to define: what is a log? And what is the act of logging?
Brian Seguin
Yes.
James Hunt
And from a technical perspective, logging is the application doing something and making a note of what it did, preferably with all of the context necessary to understand what happened at that point in time, and nothing more. The goal is to allow developers, systems administrators, or platform operators to go back and reconstruct what was going on at any given point, for troubleshooting, performance analysis, and other purposes. What has happened over the years is people have realized that if we have more data, we can bring bigger and bigger tools to bear on that data and find patterns in it. So logging went from being a transactional record of what happened to “let’s just dump as much data down the firehose as possible, and later we’ll find it useful.” And that, I think, is where you get this explosion of services. Because I’m logging more than I ever have, I need the ability to make sense of those logs, whether that’s search, visualization, pattern detection, machine learning, or AI. I’m not gonna say blockchain, because it has no bearing on anything. But those kinds of things muddle up what we mean by logging, and I have a fairly provocative stance on logging in general in modern applications.
Brian Seguin
You do?
James Hunt
Yes, I do. I don’t think applications need to log nearly as much as they do.
Brian Seguin
Well, that’s a very interesting point, because the industry has taken the stance: we’ll just log more, and we’ll be able to find out more about what went wrong. Or we’ll be able to use our logging to fulfill an audit request, or to help satisfy security requirements. And that’s not what logs are for, right?
James Hunt
Right.
Brian Seguin
And that also gets very cumbersome when it comes to actually having to store everything, because a lot of audit rules require storage over a long period of time. And if you’re storing all of your logs, and you’re creating one gigabyte of logs per minute, depending on your scale, you could be storing massive amounts of data, which can be extremely costly.
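To put the one-gigabyte-per-minute example in numbers, here is a minimal Python sketch of the storage arithmetic. The flat per-GB monthly price is a hypothetical placeholder, not any vendor’s actual rate; real providers also bill for ingest, indexing, and extra features, so treat the result as a lower bound.

```python
def retained_gb(gb_per_minute: float, retention_days: float) -> float:
    """Steady-state volume on disk when every log is kept for retention_days."""
    minutes_per_day = 60 * 24
    return gb_per_minute * minutes_per_day * retention_days


def monthly_storage_cost(gb_per_minute: float, retention_days: float,
                         price_per_gb_month: float) -> float:
    """Hypothetical cost model: a flat price per GB stored per month.

    Real providers also charge for ingest, indexing, and add-on features,
    so plug in your own numbers and treat this as a floor.
    """
    return retained_gb(gb_per_minute, retention_days) * price_per_gb_month


if __name__ == "__main__":
    rate = 1.0  # GB per minute, the example rate from the episode
    for days in (7, 30, 365):
        print(f"{days:>3} days retained: {retained_gb(rate, days):,.0f} GB")
```

At 1 GB per minute, a seven-day window holds 10,080 GB (about 10 TB), while a year of retention holds 525,600 GB (about 526 TB), which is why the retention period, not the ingest rate alone, drives the bill.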
James Hunt
Right. And a couple of things have exacerbated that, like structured logging. When I was a kid, back in my day, we logged human-readable messages, like “connection incoming from this IP address on this port.” Nowadays, everything’s a JSON object, which means it’s a structured, serialized, non-scalar value. So you’ll have keys and attributes and metadata, and you can put lists in it. All of that serialization adds more weight to the log messages, but it does allow tools to pull the data out without having to resort to heuristics. So instead of saying, well, the message pattern is “connection incoming from,” and the next token is the IP address, and then it says the words “on port” and then a number, you could just say: pull the IP field, pull the port field, show me a summary of how many connections are coming from each IP address, and give me the top 5% to find abuse. But I would posit that the problem with logging is that it gets lumped together: any event-based piece of context being stored counts as logging, which is technically true, but not a very useful definition. And I think we need to differentiate audit logs from transactional logs from debug logs.
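The difference between parsing a human-readable message and reading a structured field can be sketched in a few lines of Python. This is an illustration, not any logging tool’s API: the plain-text wording, the JSON field name `ip`, and the sample addresses (from documentation ranges) are all assumptions. Both paths feed the same “top 5% of sources” summary.

```python
import json
import math
import re
from collections import Counter

# Hypothetical plain-text format: "connection incoming from <ip> on port <port>"
PLAIN = re.compile(r"connection incoming from (\S+) on port (\d+)")


def ip_from_plain(line: str):
    """Heuristic parse: breaks as soon as the wording of the message changes."""
    match = PLAIN.search(line)
    return match.group(1) if match else None


def ip_from_json(line: str):
    """Structured parse: just read the (assumed) "ip" field."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record.get("ip") if isinstance(record, dict) else None


def top_sources(lines, extract, fraction=0.05):
    """Count connections per source IP and return the busiest `fraction` of sources."""
    counts = Counter(ip for ip in map(extract, lines) if ip is not None)
    k = max(1, math.ceil(len(counts) * fraction))  # always report at least one
    return counts.most_common(k)


plain_lines = [
    "connection incoming from 203.0.113.7 on port 443",
    "connection incoming from 203.0.113.7 on port 8443",
    "connection incoming from 198.51.100.2 on port 443",
]
json_lines = [
    '{"msg": "connection", "ip": "203.0.113.7", "port": 443}',
    '{"msg": "connection", "ip": "203.0.113.7", "port": 8443}',
    '{"msg": "connection", "ip": "198.51.100.2", "port": 443}',
]
```

Both `top_sources(plain_lines, ip_from_plain)` and `top_sources(json_lines, ip_from_json)` return the same answer, but only the regex version has to be rewritten if someone rewords the message.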
Brian Seguin
Now, logging was originally created for debugging purposes. It wasn’t created to record transactions, and it wasn’t created to fulfill audit requirements. It was really for developers to be able to go in and figure out what went wrong and how to recreate situations. But what we’ve actually seen is: since you already have the data, why not use this big data pool to pull out what you need for transaction monitoring and audit logging? And I think the real answer is, you can; that’s fine. But make sure you’re pulling it out and putting it into some type of separate database, some type of separate structure, so that if you’re storing it over time, you don’t have to query the entire log stash.
James Hunt
Right, and that’s one of the other benefits you get with a generic logging system. I’m going to use Splunk as the example, because I think most of our listeners are familiar with it; I’m definitely familiar with it. Splunk takes those log messages and stores them as semi-structured data for search; it will pull out and parse fields from log messages. But instead of using that as the end state of your audit and transactional logging, I think you really need to build it into the architecture of the application. Where tools like Splunk excel is where you’re not in control of how the logging gets done: you have a third-party component, it’s your web server logs, or your database logs, or something else that’s logging, and you need to see the data. That’s helpful from a troubleshooting perspective. But if you’re looking at transactions where you’re generating the log messages and you’re consuming the log messages, I think in general you’re going to be better off building a specific part of your API or your application to handle that.
Brian Seguin
That is a fantastic point, because what has actually happened in the industry is that companies have implemented logging as an afterthought. They don’t want to fix their applications, they don’t want to fix the issues surrounding auditing and transactional logs, and they don’t want to create separate databases for search history and things of that nature. They’re just taking these logs and plugging them into things they weren’t made for.
James Hunt
And that gets costly, especially as you start moving into cloud-based logging services, or even licensed on-prem services, which we could talk about.
Brian Seguin
So I think the exact point I wanted to make for this podcast is that, as a business owner or a product owner, you need to weigh re-architecting the application to meet your audit and transactional record-keeping needs against just putting logs on top of it to solve the problem now. A lot of people get stuck once they put these logs in place. They’ve been doing it that way for so long that they don’t want to go back and fix it, because it’s too much to untangle.
James Hunt
Right. And to go along with that, I don’t think a lot of product owners consider business customers or auditors specifically as product customers. Most of the use cases you’ll see out of product teams are about the user: the person who’s actively interacting with the software every day, using the application, paying for the application, or, for internal applications, required by their job to use it. Very few product owners prioritize the audit side of the business. But if you consider the auditor, or the security team, or the performance engineers as customers of your product and the data flowing through it, then it starts to make more sense not to just rely on logging. To flip that on its head, imagine you were running an e-commerce shop. Rather than have a relational database of the orders every customer has put through, with their status, when they shipped, and all this, you just read through your log stream from the beginning of time. Someone says, “Hey, I need to know: what open orders do I still have?” “Okay, hang on. I’m going to go to the log source and read from the beginning. Is this your order? Is this your order? Is this your order? Oh, here’s an order. You put this in on this date, okay. And then three days later this happened.” And you could do that, right? You could read all the way through the log, but that would be horribly inefficient.
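The deliberately absurd shop can be made concrete. A minimal Python sketch, assuming a hypothetical stream of (order_id, status) events: answering “which orders are still open?” from the log means replaying every event ever written, while keeping the current state in a table (an in-memory SQLite database here, standing in for Postgres) turns it into an indexed query. The table and function names are illustrative only.

```python
import sqlite3

# Hypothetical event stream, in the order the events were logged.
EVENTS = [
    ("A100", "placed"), ("B200", "placed"), ("A100", "shipped"),
    ("C300", "placed"), ("C300", "cancelled"), ("D400", "placed"),
]


def open_orders_from_log(events):
    """'Read from the beginning of time': every query replays the entire log."""
    latest = {}
    for order_id, status in events:
        latest[order_id] = status  # later events overwrite earlier ones
    return sorted(oid for oid, status in latest.items() if status == "placed")


def apply_event(db, order_id, status):
    """Application-side write: keep only the current state of each order."""
    db.execute("INSERT OR REPLACE INTO orders (id, status) VALUES (?, ?)",
               (order_id, status))


def open_orders_from_table(db):
    """Uses the status index instead of scanning every event ever recorded."""
    rows = db.execute("SELECT id FROM orders WHERE status = 'placed' ORDER BY id")
    return [order_id for (order_id,) in rows]


db = sqlite3.connect(":memory:")  # stand-in for Postgres
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
db.execute("CREATE INDEX orders_by_status ON orders (status)")
for order_id, status in EVENTS:
    apply_event(db, order_id, status)  # in practice, written as each order changes
db.commit()
```

Both functions return the same open orders, but the replay gets slower with every event ever logged and forbids pruning the log, while the table only holds one row per order.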
Brian Seguin
It definitely doesn’t scale. There’s a lot of storage, and like we said, it’s very costly. But I think the worst part isn’t necessarily the cost of storing all the data; it’s probably the response time. The more logs you accumulate, the more you’re searching through to get that transaction history, and the longer the customer has to wait, which means you’re losing customers.
James Hunt
And that’s a ridiculous example. It’s a ridiculous example on purpose.
Brian Seguin
Is it? I’m pretty sure there are people out there doing that. It may not be for orders, right? But it’s probably for other things.
James Hunt
And that’s what you’re doing to your auditor. You’re saying, “Look, I don’t care enough to actually pull out the events that we care about from an audit perspective or from a transaction perspective.” If you want to see what transactions are happening, you’re having to pick through a whole bunch of unrelated stuff. If I’m looking at an e-commerce platform from a transactional standpoint, do I care what session ID they came in on for this order status inquiry? No, I don’t. Do I care what IP address they came from? No. Am I getting that from web logs? Yes, 100%. Do I care what images they loaded when they loaded up the orders page? No, but I’m getting that from my nginx logs. So I think, for the most part, logging doesn’t need to happen the way it happens today; I think the industry incumbents are invested in it happening. And that’s why, when you say it’s hard to find just a logging solution, it’s because, to differentiate themselves and keep people pumping data through their log systems, the big providers had to build in analytics and machine learning and pattern recognition, all this stuff.
Brian Seguin
And because of that complexity, it is almost impossible to find a consistent pricing model for these logging systems. When I was looking at rent and buy scenarios, it was all over the board. The only common thing I could deduce is that with a SaaS solution you’re paying for storage of the actual logs, per gigabyte, which can add up immensely. If you’re implementing a licensed logging solution, like putting sidecars onto your applications, that also causes some overhead. It might not be SaaS, but you’re still paying for a lot of the storage on your own systems. So there’s still that huge storage component you have to think about, not to mention you might be paying licensing for other features of the different platforms. It gets very confusing very quickly. There are a million different names for all these different tools and for which ones are market leaders. If I go through the typical logging solutions, I can find a Gartner report saying each one is at the top of the Magic Quadrant for the one particular facet that has been built onto logging. But as…
James Hunt
IAM, or audit.
Brian Seguin
Yes, or APM, or what have you, which are all tangential, all under the umbrella of observability.
James Hunt
Observability.
Brian Seguin
They’re all under the umbrella of observability. But if you’re just looking for a pure logging solution, there’s not much out there.
James Hunt
The easiest one I can think of is Papertrail. I’ve had good luck with Papertrail, though they have a few other things too. I think the closest you’re gonna find to a pure logging solution is actually old-school syslog, which just puts logs into files and doesn’t do anything else. It might rotate them, right? logrotate is a thing. But yeah, there’s not a lot. The other thing I want to talk about in my provocative no-logs stance goes back to stepwise debugging. Those stepwise debug logs are useful, I think, but not as logs. I say that because modern advances in kernel technology, along with cloud technology and containerization, have brought us one of my favorite technologies of the last 10 years: eBPF, the extended Berkeley Packet Filter, which has nothing to do with packet filtering anymore. Really, it’s a way of loading your own code into the Linux kernel. So the application is running along, humming, and then you say, “Hey, we have a problem. We need to see what the application is doing.” Either through high-level tools or by writing low-level eBPF code manually, you can build probes that say, “Let me know who this thing is talking to, let me know what system calls it’s making, what is this application doing inside?” And because the probes are independent of the application code, the application developer doesn’t need to write them ahead of time. That’s always the failure of debug logging: either you have too much of it because you’re debug logging everything, or you don’t have logging around the one part that’s failing. And by the time you add it and roll out a new update, the problem may have disappeared or shifted to another part of the architecture, another part of the cluster, a different service instance. And so you’re chasing that problem.
Whereas with things like eBPF, you can cut through all those layers and zero in on what’s going on with your database. That completely obviates any need for debug logging, which leaves you with audit and transactional logging, which, as I said earlier, I firmly believe you need to build custom things for.
Brian Seguin
That’s always a fun thing, seeing developers working on something: “Well, did you trace the logs? Did you go through the logs?” “Yes, but the problem’s not there.” It happens so many times. Either there are too many logs and they can’t find it, or it’s not actually logging the one component that’s related, and the problem is somewhere else.
James Hunt
Right. And that first problem, being able to sift through a massive amount of log data to get exactly what you want, is what most of the log service providers are trying to solve. I think you should turn that problem around: just not log anything, and be able to look into the application through something like eBPF probes. That said, eBPF is, not abstruse, but it is a complicated topic. It’s getting easier to use and to develop for, though; there’s a lot of really good end-user, or end-developer, tooling out there. And I do think there’s a case for logging. Just like with really anything in the cloud, there’s an entry level: just do this, it’s good enough. As we get into the rent, buy, build decision, I’m gonna go with my normal gut: you rent the log provider when you’re small, and you pare down as you figure out what you need. It gets you up and running quickly. It gets you to market. You don’t have to spend a ton of time. Later, you can figure out: what are the auditable events? What transactions do we care about? And how do we instrument the application with eBPF as it spins?
Brian Seguin
Well, I mean, you kind of already made the other point of the podcast earlier, when you said that no one should log.
James Hunt
I did not say no one should log. I just said I think logging is not necessary in as many cases as it sees use.
Brian Seguin
So whether you should rent it, buy it, or build it: you shouldn’t do any of the three. But if you have to…
James Hunt
I think it’s in moderation. I think logging makes sense…
Brian Seguin
Logging in moderation…
James Hunt
I don’t think logging is your end state. And that’s why I think buy or build is right out. I don’t think anyone should be spending the time to build a logging system for generic logs. If you’re going to build something, build an API or a service that handles the domain-specific data correctly: audit events with who did it, when it happened, where it happened, and what happened; transactional logs that say what account, what customer, what transaction took place. But that’s outside the realm of generic logging at that point. It’s part of your application.
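A domain-specific audit service along these lines might look like the following minimal Python sketch. It assumes a hypothetical `AuditStore` class backed by its own SQLite table (standing in for a real database), with the who/when/where/what fields from the discussion; none of these names come from an existing product.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    who: str    # actor, e.g. a user or service account ID
    when: str   # ISO-8601 UTC timestamp
    where: str  # origin, e.g. service name or client address
    what: str   # the action that was taken


class AuditStore:
    """Hypothetical audit store kept apart from generic application logs."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS audit_events ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " who TEXT NOT NULL, ts TEXT NOT NULL,"
            " origin TEXT NOT NULL, action TEXT NOT NULL)"
        )
        # Auditors usually ask "what did this person do, and when?"
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS audit_by_who ON audit_events (who, ts)"
        )

    def record(self, who, where, what, when=None):
        ts = when or datetime.now(timezone.utc).isoformat()
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO audit_events (who, ts, origin, action) VALUES (?, ?, ?, ?)",
                (who, ts, where, what),
            )
        return AuditEvent(who, ts, where, what)

    def by_actor(self, who):
        rows = self.db.execute(
            "SELECT who, ts, origin, action FROM audit_events"
            " WHERE who = ? ORDER BY ts, id",
            (who,),
        )
        return [AuditEvent(*row) for row in rows]
```

Because these records live in their own table, they can be kept as long as the auditors require while the generic logs are pruned on a short schedule.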
Brian Seguin
For generic logging, unrelated to transaction history and audit trails, you should be implementing a SaaS solution, understanding that you’re going to be paying for storage and consumption of your logs. But you’re also going to be charged for all of these other features that you may or may not use. What you should really be doing as an organization is understanding the problems logging is typically used to solve and solving them in a more sustainable way: re-architecting your application to store audit trails in a separate database, and to store your transaction logs in a separate database, so you have them ready and parsed out to hand over to the auditors, or to whatever search you need to do. That’s your ideal situation. A pure, quote-unquote, logging solution almost doesn’t exist, right? Just to reiterate that point: you’re implementing a logging solution that has all of these other features built into it, data science and all of these things, but they’re solving a symptom, not the root problem.
James Hunt
Correct, you’re treating the symptom, not taking care of the issue. And I’m not saying logging isn’t a thing you should do. I just don’t think log retention should last for more than a week. If you have something in your logs that you need to keep for more than a week, that needs to go in a transactional database somewhere; that needs to go in Postgres or Redis.
Brian Seguin
That is probably the best rule of thumb you can have for this: how long are you going to have to retain these logs? If it’s more than a week because you’re grabbing some audit transactions or something, you’re doing it wrong.
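The one-week rule is simple enough to state as code. A minimal Python sketch, assuming log entries are available as in-memory (timestamp, message) pairs; in practice this job belongs to log rotation or to your provider’s retention setting, and `prune` and `belongs_in_logs` are hypothetical helper names for illustration.

```python
from datetime import datetime, timedelta, timezone

# The one-week rule of thumb from the episode; tune to your on-call coverage.
RETENTION = timedelta(days=7)


def prune(entries, now, retention=RETENTION):
    """Split (timestamp, message) entries into (kept, expired) around the cutoff."""
    cutoff = now - retention
    kept = [(ts, msg) for ts, msg in entries if ts >= cutoff]
    expired = [(ts, msg) for ts, msg in entries if ts < cutoff]
    return kept, expired


def belongs_in_logs(required_retention, retention=RETENTION):
    """Anything you must keep longer than the log window belongs in a database."""
    return required_retention <= retention


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    entries = [(now - timedelta(days=d), f"entry from {d} days ago") for d in (1, 5, 9)]
    kept, expired = prune(entries, now)
    print(len(kept), "kept,", len(expired), "expired")
```

The second helper is the rule of thumb itself: if a record’s required retention fails the check, it is audit or transactional data and should be written to its own store instead.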
James Hunt
Right. If you look at the completely absurd e-commerce platform that has no relational database, where everything’s a log, you can never prune your logs in that system, because the logs hold the customer data. The reason we put all of that information in a database is so that we can prune logs, for performance and all the other stuff. But yeah, I think you should be able to rent and keep your costs down by keeping your retention tight: three to seven days. You usually want to be able to cover a long weekend, unless you’re running a 24/7 NOC, a network operations center. If you have a 24/7 NOC, you can keep much less in the logs, because the logs are really there as a last resort: if all of your other monitoring and observability tools fail, or your instrumentation tools aren’t working, or something happened and it cured itself, you might find evidence in the logs. So you still need them; I just don’t think you need them to the same degree. You mentioned a gigabyte of logs a minute; that’s not unheard of. I’ve seen systems that do that, and they invariably cost way more than the value they bring, just in terms of keeping logs. The build or buy scenario for this, I guess, really is: let’s put something on prem for our existing legacy applications and not ship it out to the cloud. And the licensing there is half about managing spend, right? A site license for Splunk is going to be better than paying per gig. But the other half is about security and air gapping. If you don’t want to send all of your sensitive log data out to the cloud, because all your stuff is on prem or you just don’t have direct access to the internet, you’re probably going to end up licensing something, or finding an open source logging solution like Elasticsearch and Kibana, putting those in place, and maintaining and managing them. The same caveats that apply to all of our buy decisions apply here: you have to manage vendor relationships, you have to manage the hardware it runs on, and you have to keep it running and keep it happy and healthy.
Brian Seguin
From a storage standpoint, if you’re in an air-gapped environment, you also have to take into consideration how many logs you’re storing. Because if you have your logs in an on-prem, air-gapped facility and suddenly run out of space, you’re going to be in trouble; it could take you a year to procure new equipment, right?
James Hunt
You mean I can’t just resize my hard drives?
Brian Seguin
I mean, you could, but it might hurt.
James Hunt
But this is the cloud, Brian! Everything’s resizable and unlimited. Yeah, that’s a good point. If you’re talking on prem, you’re almost invariably talking either SAN or NAS, and that storage has upper limits that you will hit faster than you’ll hit in AWS or GCP. S3, right? S3, the unlimited hard drive. If you run that on prem, you have to provide the unlimited hard drives that Amazon’s giving you.
Brian Seguin
So, to summarize: logging is really meant for developer troubleshooting. The rule of thumb is to try not to keep more than a week’s worth of logs; if you’re keeping more than that, you should really figure out why. If you’re using logging to solve audit or transactional problems, you should be figuring out how to fix that going forward. You should have it on your roadmap to remediate that and get those into their own separate tracking systems. And there is no vanilla logging solution you can just plug in; everything has its own features. And yeah, it’s complicated. Logging is complicated because it’s been used to solve so many different issues, simply because the data is there. But there’s also danger in having the data there and using it this way.
James Hunt
I think of logging as a last-resort thing. It’s the first thing you implement and your last resort. And I think you should operate from there.
Brian Seguin
I think Logging Last Resort is actually the name of a bar in Alaska.
James Hunt
Or a ski chalet in the alpine forests where you can go skiing and then lumberjacking, if you’re so inclined. Lumberjacking, by the way, is hard, just like logging. Yes, there’s no…
Brian Seguin
That’s right. Once you get all those logs down, you have to put them somewhere.
James Hunt
And they jam up the river on their way down to the mill. You pay per current? I don’t know.
Brian Seguin
Thanks for listening to this episode of Rent Buy Build. I’m Brian Seguin.
James Hunt
This is the amateur comedy hour version of the podcast. Yeah, no, logging: it’s one of those necessary evils. I don’t think we’ll ever get rid of logging. I just think the best case is to minimize it and build it into your data model. What are we gonna talk about next week?
Brian Seguin
Next week?
James Hunt
Are we done? Did we do all the cloud stuff? No, that’s it.
Brian Seguin
Thanks for listening to Rent Buy Build. Next time we are going to be talking about image building and registry stuff.
James Hunt
Yes, we’re talking about whether you should run your own Docker image registry using an open source one, pay for one, or borrow one from your friend, and then how we build images and how that whole thing works. There seems to be a fair amount of confusion, even among seasoned technical people I talk to, over what exactly is in an image. So we’ll probably go over the OCI specs: the distribution spec and the image format spec. Brian’s eyes will glaze over, he’ll get real quiet on the podcast, and you’ll have to listen to me for a while.
Brian Seguin
I will hardly be able to contain myself.
James Hunt
So join us next time as we talk about images. Rent Buy Build is sponsored by Linode. Linode has data centers all around the world, consistent, affordable pricing, and fantastic tech support, whether you’re looking for compute, block storage, object storage, or managed Kubernetes. If it runs on Linux, it runs on Linode.