
Digital resilience - it’s all about striking the right business balance, says Splunk Chief Technical Advisor Mark Woods

Chris Middleton, June 20, 2024
Don’t regard resilience solely in technical terms, says Splunk’s Mark Woods. It is about balancing technology and culture.


It is better to build resilient digital systems than ones that are always robust and up to date. Moreover, observability in critical systems should be aligned to an organization’s real-world behaviour, not just to its technology concerns. So says Mark Woods, Chief Technical Advisor for machine data analytics company Splunk, now part of Cisco.

What does he mean by this? Likening all digital systems to the McLaren Formula 1 and virtual eSports teams – the partnership between Splunk and the automotive brand provides telemetry backhaul and performance analytics – Woods explains:

We host roundtables and open days for customers at the McLaren Technology Centre, and also sponsor McLaren. But why should customers care? Well, we talk about resilience and what that has to do with an F1 team. And it is about that balance [between system design and real-world usage]. 'I need to finish the race, but there’s no point in finishing the race last'. So, [in that example] I don't build robust systems, I build resilient systems. You sacrifice some of that robustness to make sure that you finish the race as far up the order as you possibly can. But equally, there's no point in winning a race then not having a car for the rest of the season.

He continues the analogy:

So, we talk to customers’ technical teams and say, ‘When’s your season, and where’s your race?’. And they say, ‘We don’t have one… but we have this promotion, or this peak in demand’. And we say, ‘So that’s your race’. And your season is your business year, which your board has to report on, and maybe a quarter is your lap, and so on. So, when can you make that call to change something? Most of the time for your technical team, once they've started to extract that data, once they’ve started on day-to-day usage, there are only so many small windows of opportunity where they can make a critical decision, such as, ‘Am I going to have to deploy a new server?’ Or, ‘Do I need to completely change my code?’

OK. Yet metaphors can be dangerous, as author Milan Kundera once observed, and are not to be trifled with – his point being that they can distort critical thinking and make you believe something that may not be true. But taking Woods’ example at face value, he is saying that systems simply need to be good enough for the job – resilient in real-world usage, rather than so encumbered by the need to be perfect or inviolable that they are not up to the tasks for which they were designed. But does this concept really apply in the enterprise technology space, where customers rely on providers such as Splunk to be more than just ‘good enough’? Woods argues:

For Splunk’s services themselves, it does not really apply of course. But for our customers’ own services, absolutely. A critical system is going to have to survive for however long before they renew and renew and renew. You can build the most robust product in the world, but you still need to upgrade it every six months. In the meantime, it needs to survive for that period of time. So, resilience is really about understanding what buffer you have to be able to make some changes, or even to go faster.

He refers back to the pandemic, during which the world was pushed – in some cases unwillingly – towards relying on cloud-based systems in order to work from home, explaining:

There weren’t that many major crises [in general terms], because we [providers] had built in enough buffer for those systems to be resilient, but had no idea that they were to that extent! There were lots of panicked decisions being made [in the tech buyers’ market], but the main systems had enough of that buffer to be resilient.

Downtime costs

Last week at its user conference in Las Vegas, Splunk published research on the hidden costs of systems downtime – findings that could be read as the hidden cost of non-resilient systems. Surveying 2,000 executives from Forbes Global 2000 companies, the research estimated that the total cost of unplanned downtime for those enterprises was $400 billion a year, including both direct and hidden costs, such as lost revenue, reduced shareholder value, damage to customer trust and brand reputation, and a lesser ability to innovate at speed.

By my calculations, that equates to an average of $200 million per company, which seems implausibly high. That aside, while 56% of downtime is caused by phishing incidents and similar attacks, 44% stems from software failures and other infrastructure issues. Overall, human error is the overriding cause.

The precise financial figures aside, therefore, the basic premise of the research is hard to argue with. Launching the report, ‘Chief Splunker’ and now Cisco President of Go-to-Market, Gary Steele, noted:

How an organization reacts, adapts, and evolves to disruption is what sets it apart as a leader. A foundational building block for a resilient enterprise is a unified approach to security and observability to quickly detect and fix problems across their entire digital footprint.

I put these issues to Woods, in the context of his definition of resilience as systems that are good enough – rather than, necessarily, always watertight and robust. He says:

We've always understood that downtime costs. I think the really good part of the research was not just looking at those things in isolation, such as what happens if the server is down, when it directly impacts things. The real learnings for us have been how important it is to have that service-level context, and how to help our customers relate that to the broader business impact.

As we will see later, this has ramifications for his own business, since the Cisco deal went through. He continues:

Often when you talk about such risks or impacts, they are separated, abstracted, from the actual data. […] Because it's not just, ‘Can I resolve this within X hours, X minutes, or immediately’. It's about starting to understand when those things begin to have a broader impact in many areas, or when the customer is affected. And bringing those things together. Often when people talk about observability, they think, ‘Is this a technical thing?’. They think, ‘I have moved to a new way of hosting in cloud. So, if I need some observability now, is that just a proxy for monitoring, so I can get more funding?’ But actually, what it means is understanding the interactions first. Perhaps the user is thinking, ‘I know something’s wrong, but the customer isn't affected at the moment’. And I'm able to track that, I’ve got time to understand it before I make a decision.’

In terms of the broader security picture, the question is to what extent are we now seeing AI-enabled attacks on enterprise systems? At the Chatham House Cyber 2024 conference earlier this month (see diginomica, passim), security consultant Jen Ellis seemed to pooh-pooh the idea that hostile actors are using AI at scale to launch attacks. Her reasoning was that it would be more costly and complex than traditional attacks, which still pay handsomely. She told delegates:

We've heard so much about this, except from the attackers themselves! And most security researchers are like, ‘Not yet they're not!’ It doesn't mean it won't happen, though. It probably will at some point. But here's the reality: attackers don't make their lives more complicated and expensive than they need to.

From Woods’ technical perspective, does that match the reality of what he, and Splunk, are seeing on the ground? He says:

Do actors use every technology they can possibly get their hands on? Certainly. And if any person on the street is able to access something, then it’s not much of a jump to assume [an attacker would too].

But then he makes a more interesting and useful point:

There is a problem when it comes to the definition of AI [in that context]. For example, if we are just talking in terms of generative systems and Large Language Models, as opposed to the broader spectrum of AI, taking in everything from deterministic, to non-deterministic systems. There have been some experiments – including by our own Threat Intelligence team – to see how much more effective [attacks] are when you start to apply generative techniques. And actually, there is not that huge jump forward in effectiveness. But the same may not apply to other forms of AI.

He adds:

Looking towards the future, what those systems can't currently do is understand intent or context [but they may be able to in future]. But today, what people in national defence communities are more concerned about, for example, is the subtle influence of these things. Especially when we get to generative techniques that can make a small change that has a large impact on sentiment. What many commercial enterprises are worried about is the sentiment of users and how [the use of generative systems to impact that] is a hard thing to track. It’s easy to nudge things at the moment. We are also moving to this paradigm where lots of people are putting context-driven questions into generative systems. And most of these systems, at the moment, are consumer grade, not commercial grade systems. But that may change.

Some people are concerned that much of the drive to regulate AI is, currently, being driven by relatively trivial uses – for example, by individuals using generative systems to write a piece of text, create a picture, or make a video. On that point, Woods says:

I think it’s a twofold risk. One is that this approach is ineffective and does little to protect people. And the second is that, by being broad brush and just about the technology, you stifle innovation. With AI systems, they are only as good as the data that feeds them. So, actually we should look far more at ownership of that data. Do you just own the raw data? Or do you also own the content that's being pushed out too?

Splunk has recently issued new AI Assistants – like every other enterprise provider. Is there a risk that users are facing ‘Assistant overload’? That anyone with, say, a dozen different business applications might have a dozen different AI Assistants attached to them? Woods says: 

We've got ones for security, observability, and for service intelligence, and one that is specifically for our query language. And that's because they have to be context and domain specific. They need to be tuned for that. But there have to be guardrails in place. Otherwise, you end up with recommendations you might not want, or you get to the point where you abstract so far away from reality that it is not really ‘human in the loop’ anymore. But a future in which [more advanced] AIs are arguing with each other… Personally, I don't think we should be using them directly, but I think they're great as an interface to other systems.

The future

On the user conference stage, the ‘Chuck and Gary show’ made great play of the two leaders focusing on integrating the two companies: their brands, cultures, staff, and – above all – technology portfolios. But for any veteran watcher of this industry, massive integration projects are where deals often begin to go south, or become a dead weight on the business.

The relationship between Cisco and Splunk has been presented to delegates, partners, and staff as a shared journey between compatible partners, rather than a big, established company absorbing a smaller one’s tech, contracts, and talent. Is that the reality? Is it really the amicable walk in the park of the event’s careful messaging – a stroll down the Vegas Strip? Or more like a massive gamble in the Big Tech casino? Woods says:

So far? Great. But I don't think it'll ever be a walk in the park. Anyone that goes into it thinking that way? Well, talk is cheap, but action is what counts. From the security side of things, the portfolio makes sense. Anyone looking at that and putting it together would go, ‘Yes, I can do that. I absolutely see how this all sits together’. And, ‘I can see that you've got better, more fidelity coverage at device level. You have extended threat intelligence and you're able to bring in that network element, which makes complete sense.’ But still, there are more integrations to come.

On observability, I think it's going to be harder, because observability is one of those areas that needs to align more to how companies act. So, is it through TechOps? Is it a service play? We’ve still not quite worked that out. In most companies, observability practices normally sit within the product development teams, or within the infrastructure and technical teams, but very rarely on the service desk. The challenge now is, how do you strike that service and product balance?

My take

In short, watch this space. Now that the successful messaging of Splunk’s .conf24 with Cisco has passed, the reality of two very different – if complementary – players bedding in together begins. Come next year, we will see how it all played out.
