At a Tech Field Day event I attended recently, a presentation was hopefully titled something along the lines of “Visual Root Cause Analysis.” Well! The presentation had my attention. That title was tantalizing. Visual RCA…that would be glorious!
The promise of root cause analysis (RCA) is that software will look at a series of infrastructure events and, paired with current status, infer what needs to be fixed. In a console awash in red, RCA will get to the heart of the matter and send infrastructure engineers on a mission to fix the broken thing causing all the woe.
The problem with most RCA systems is that they don’t know what’s wrong. That is, they can tell you all day long that an interface has excessive errors, a disk in the array is on its last legs, or that a database queue is too high. But those are contextless problems. These issues have some real-world impact on a user tapping a phone app or sitting at a keyboard, but RCA systems don’t really know what that context is.
Root Cause Of What, Exactly?
Think of it this way. Your non-technical boss doesn’t pop by your desk and ask you why there are excessive OutDiscards accumulating on Et4/0/36. You get asked why the network is slow, or why the CRM application is down. Those questions are context.
Therefore, I think RCA is a matter of definition. Do you want “root cause” to mean there’s a dead disk drive bringing the array to its knees, while you figure out the implications of that from there? Or do you actually want a correlation between Et4/0/36 throwing frames away and the user experience that results?
When I hear a vendor say “root cause analysis”, I think of the latter. That’s the non-existent panacea I’ve wanted forever.
From a business perspective, I don’t actually care about high memory pressure, that a drive died, or that an interface is choking. I care that the business problem that’s been expressed to me has a known root cause which software has identified for me.
I want software to correlate a real world business problem with a technical root cause.
Wishing Upon A Star?
On a simple level, this seems like a silly desire. “Well, of course we know that if that disk drive is dying the database is going to be slow and if the database is slow that the CRM app running queries against it will be slow and that if the CRM app is slow when hitting the database the users are going to complain. I mean…duh.”
That’s true in simple environments with just a few applications that are well-defined and hosted in-house. The larger problem is one of dependency trees. What are the odds that disk drive is on a system hosting a single database? And who are we kidding that all our apps run on physical infrastructure sitting in racks we can touch?
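To make the dependency tree problem concrete, here is a minimal sketch of walking from one failing component out to everything that depends on it. The component names and the DEPENDENTS map are invented for illustration; a real environment would have to discover this map, and it would be far larger and constantly changing.

```python
from collections import deque

# Hypothetical dependency map: each key lists the things that depend on it.
# All names here are made up for illustration.
DEPENDENTS = {
    "disk-7": ["storage-array"],
    "storage-array": ["crm-database", "billing-database"],
    "crm-database": ["crm-app"],
    "billing-database": ["billing-app"],
    "crm-app": ["sales-users"],
    "billing-app": ["finance-users"],
}


def impacted(component, dependents=DEPENDENTS):
    """Return everything downstream of a failing component (breadth-first)."""
    seen = set()
    queue = deque([component])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # One dying disk on a shared array touches more than one application.
    print(sorted(impacted("disk-7")))
```

Even in this toy map, the dying disk reaches two databases, two apps, and two groups of users. The walk itself is trivial. Keeping the map accurate is the hard part.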
Nowadays, we’ve got cloud deployments. Highly shared infrastructure. Changing traffic patterns of stateless applications. Microsegmentation. Multi-cloud networking. It’s time for root cause analysis to step up its game.
Here’s what I’m looking for in my root cause analysis systems.
1. Actionable items. Don’t offer me contextless symptoms. Symptoms are pointers to the actual problem. I can’t fix a symptom. I can make symptoms go away when I fix the problem, though.
Give me an action item. Attach the related symptoms to that action item (see “correlated events” below), but don’t throw a bunch of symptoms in red at me and wish me luck.
2. Correlated events. When troubleshooting a problem, I want my RCA systems to show me the symptoms tied back to the fixable problem. I don’t want to have to click around through five loosely integrated applications to figure out that the long mail queue in Exchange is tied to a database problem which is tied to a network problem which is actually a congested interface which is weirdly filled with backup traffic in the middle of the day.
3. Cloud awareness. I need to monitor the goings-on in AWS and Azure as effectively as I do the metal in my data center. If my performance problems are tied to whatever PaaS might be a critical part of my application delivery, I need to know that.
4. Customizable dashboards, not single panes of glass. Please vendors, stop telling me about your SPoG. A SPoG is not a selling point. What I actually need are multiple views into the application serving environment. “Application serving environment”–the infrastructure. Whatever the pile of switches, servers, storage, security, and cloud that is serving up the applications.
If you must give me a SPoG, then it needs to be a beautifully clean screen that tells me how the infrastructure is performing as a whole, lists my actionable items, and lets me drill into correlated events from there.
5. Distributed tracing. Eventually, I also want tracing. In other words, don’t just show me what’s wrong. Show me the entire transaction chain from user to app and back–every call against every hop along the way, including ephemeral containers that might not be around anymore.
This is more than simply a transaction graph that shows how long DNS calls, TCP transport, database calls, HTTP response, etc. took. Rather, this is a correlated view of a transaction from the perspective of every hop along the way. This is a hard problem to solve.
I’ve worked with solutions that used agents for this in the pre-cloud days. They were expensive, and let’s face it–everyone hates agents. But maybe OpenTracing takes off, and RCA product vendors could begin to rely on it as an information source.
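Items 1 and 2 in the list above can be sketched together: take the set of things currently alerting, find the ones with no alerting dependency upstream of them, and hang the remaining symptoms off those as a single action item. This is only an illustration under a big assumption, namely that an accurate depends-on map already exists. The DEPENDS_ON map and node names are hypothetical, loosely following the mail queue example from item 2.

```python
# Hypothetical map of what each component depends on (assumed to be known).
DEPENDS_ON = {
    "exchange-mail-queue": ["mail-database"],
    "mail-database": ["network-path"],
    "network-path": ["Et4/0/36"],
    "Et4/0/36": [],
}


def upstream(node, depends_on):
    """Return every component that `node` depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def action_items(alerting, depends_on):
    """Group alerting components under the alerting component they trace back to.

    A component counts as a candidate root cause when nothing it depends on
    is also alerting. Every other alert is attached to it as a symptom.
    """
    alerting = set(alerting)
    roots = {n for n in alerting if not (upstream(n, depends_on) & alerting)}
    return {
        root: sorted(n for n in alerting if root in upstream(n, depends_on))
        for root in roots
    }


if __name__ == "__main__":
    alerts = ["exchange-mail-queue", "mail-database", "Et4/0/36"]
    for root, symptoms in action_items(alerts, DEPENDS_ON).items():
        print(f"Fix {root}; related symptoms: {symptoms}")
```

Note what this does not do: it says nothing about why the interface is congested (the midday backup traffic), and it treats "no alerting upstream dependency" as a stand-in for root cause, which is a simplification.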
Are Educated Guesses Still Good Enough?
This post started as a coffee-fueled ragequit on all things RCA. But then I deleted it–like an angry email I never really meant to send–and tried again.
Most of my career has been spent making educated guesses about why something is borked. With enough experience, educated guesses are pretty darn good guesses indeed. So good, you can even put a price on them in the form of consulting fees or a salary.
Educated guesses will continue to serve infrastructure experts well, but I feel we are getting beyond the abilities of intuition to help businesses out of technology outages effectively. Infrastructure is simply too complex.
Therefore, our monitoring systems need to be more capable. Root cause analysis that connects infrastructure problems with business outcomes is needful.
Are we there today, where this is a pervasive problem that must be solved? Let’s ask a few questions.
- Are you evaluating Kubernetes to deploy your apps and keep them running? Got folks training internally to run it, or did you believe the lie that Kubernetes just works? How do you know K8s is working? Got the monitoring for that sorted out? Who watches the watcher?
- Are you migrating to SaaS for whatever services (who isn’t doing this)? Is the business cool with, “We don’t know what’s wrong, must be O365!” as an answer to why mail is weird today?
- Are you running composable infrastructure or HCI internally? Got that all nailed down? Of course you do. The vendors have their monitoring of these systems all nailed down for you, and these systems are redundant anyway, right? What could go wrong?
I could go on, but my point is really this–we’ve all bought whatever mix ‘n’ match solution fit the budget a couple of years ago, we’re doing this cloud thing because we’re supposed to, and we’ve still got this legacy (legendary) infrastructure hanging around because either we can’t kill it yet or we’ll need it forever.
Monitoring and root cause analysis tools are not unified in this space. They are fragmented. They fill niches. They are specialized. And yet, what the infrastructure does is not specialized. All infrastructure has the common goal of serving applications.
Where’s my unified monitoring platform? Where’s my software analysis that can provide business context around disparate types of infrastructure? Even SolarWinds doesn’t have all the answers here despite a seemingly endless stream of acquisitions over the years.
Maybe there’s no answer. Maybe the best we can do is continue on as we’ve done–educated guesses helped along by several different monitoring systems feeding us the information we need if we know where to look.
In a way, I’m okay with that. I’ll just keep raising my consulting rate.
The fundamental problem is that RCA is a process and not an artifact within a technical architecture. Additionally, causation is never singular, as there are ALWAYS multiple underlying issues.
The focus is mostly on technical interfaces, and the exclusive representation of infrastructure in tools becomes the issue. Tools highlight a loss in technology or infrastructure. Because of this limitation, errors or faults are not always highlighted. And I’ve yet to see any vendor address human or process streams in their tools, never mind the ability to map these to applications or business-critical functions.