Observability is a hot topic, but Dynatrace shows it can be much more than seeing

Martin Banks, October 24, 2023
Summary:
Rick McConnell, CEO of Dynatrace, on observability and beyond.

Observability is a hot topic, with many vendors now saying they offer it. The idea is not new - dashboards have been around for many years. But the cloud has created not only its own market growth but also a growth in applications development - often using new languages and tools, and the collaborative nature of open source development - which has led to Dev/Ops and Continuous Delivery of both new and regularly updated applications.

The result is two-fold. There can be great success for businesses through greater agility and innovation. But there can also be greater chaos and the entropic scattering of applications that didn't quite make the cut, but never quite got cut out. Some may still even have occasional bit-parts to play in other applications.

This leads to a thought - just as many vendors start promoting their observability credentials, is the word and its context becoming passé? A better word would certainly be remediation, for it focuses on the real point of observability: the capability to set right a wrong that has been observed, to do it quickly as part of the same process, and quite possibly to do it automatically. It is certainly good that there is an increasing number of services and tools that can observe that a problem is occurring with a business process. Better still if they can identify the functional process(es) that is/are the cause of the problem. But if they then leave the user to start the re-engineering process largely unsupported, they can even make matters worse, not better.

Look, it’s Azure

This was a clear underlying message pushed across by Dynatrace CEO Rick McConnell at the company's recent Innovate conference held in Barcelona. His 'headline news' story from the event was that Dynatrace is now available as SaaS on the Microsoft Azure cloud platform, following its availability as SaaS on AWS since the back end of last year. That news does have its own value for users. In particular, the availability of Grail makes this the appropriate time to make the move to Azure.

As to why AWS first, according to McConnell:

At the end of the day, you just look at the share of overall cloud, and AWS has a much bigger share. We just prioritized where our customers were most likely to be. Yeah, the good point that you make is the Azure customer base is more aligned to ours than AWS, where you've got a lot of SMB, small and midsize businesses, whereas Azure tends to be more enterprise. We started work on Grail almost five years ago, so this is a monumental step forward for us. And five years ago, the market shares were even more differentiated. We never would have made an Azure choice five years ago.

Offering Microsoft's view, Priya Satish, Senior Director for ISV Sales at Microsoft Azure, suggested that having Dynatrace, and particularly Grail, available now would meet today's digital economy's need for instant gratification when it comes to the applications and tools people use. Not many, she suggested, would want to give a second chance to any application if it delivered slower performance, or failed to provide an intuitive, seamless experience. This is why observability is becoming an essential tool now.

Observability yes, but there has to be more

In a half-hour private chat, McConnell talked more about how observability is really about what you can do with what you have seen, not just having the ability to observe. Making sense of the raw input and acting upon the subsequent results is the important element. It also leads to a diversity of applications and capabilities that go beyond the obvious direct results of observation. These, of course, are based around the ability not only to identify a problem as it is occurring, but also to use AI services to define and plan remedial actions that can either be implemented by skilled staff or, increasingly, by automated services.

AI services, in the case of Dynatrace, means its own Davis AI system, which has been in service for some 10 years now and as a consequence has built up an extensive knowledge base on problem identification and remedial actions. This makes it a good basis on which to build Automated Process Management (APM) services, which are now one of the company's strong suits.

One of the issues which obviously follows is that many potential customers will already have some level of observability available, usually in the form of proprietary dashboards for different applications. Individually, these can be very useful, but with the growing range of applications generated by cloud services, coupled with their increasing complexity and the business process requirement for them to collaborate to achieve business goals, the result is now a growing multiplicity of dashboards and an even higher abstracted level of complexity and confusion.

McConnell makes no secret of the view that the only sensible option is to rip-and-replace such old environments. Yes, he would say that, wouldn't he? While a unified Dynatrace implementation would no doubt work well, there is investment amortization to consider, as well as the fact that some applications may be important to keep, and manage as is. So identifying appropriate candidates for rip-and-replace can be an important effort in its own right. With Davis as part of the environment, there is a predictive AI component available that can help users to identify candidate applications, tools and services before they actually become a post hoc and troublesome event.

McConnell sees most customers, as they face up to such an event, starting with causal AI and root cause analysis - something breaks and they will want it fixed, fast. Most, he says, when they first come to try Dynatrace, are impressed with the way it can cut days of waiting for a resolution down to hours, and hours down to minutes. It is then that they realise that the Davis AI system can, in all probability, identify that a specific router is beginning to flap and needs to be bypassed and taken offline, or that disk space is being used up fast and memory capacity needs expanding, with it all correlated to a new application load. Before, the only observable factor had been an increase in performance challenges, he suggests:

I think, especially now with overloaded legacy systems, there's going to be an increasing logic which says there's an increasing number of users getting close to hitting the wall. So just to be clear, on a workload-by-workload basis we would argue they're using the horse and carriage but want the car, or whatever the right analogy is. So yes, it's rip-and-replace pretty much every time.

Very similar arguments pertain to what McConnell calls "tool sprawl", where every new process or code problem instantly spawns a handful of new tools intended to resolve said issue. Almost regardless of success or otherwise, the end result is a glut of tools cluttering up the system and little information as to their efficacy or otherwise. And the chances are that none of them have even been optimised.

This, he suggests, is being made worse with the arrival of generative AI and its ability to assist applications developers in their work. The rate at which applications can be created even starts to pose problems in terms of sustainability, if only because every one of those applications is a consumer of energy, even if only indirectly as stored code, regardless of their value to the business. There would seem to be a role for Dynatrace in weeding out duff applications as rapidly as possible. It is a problem McConnell has already confronted:

I was recently talking to the CTO of a large Australian bank and I said, how many applications do you have, and how many do you want to have? The answer was, 'I have 2,000. I want to get rid of 1,000’. There was this realization that the bank had way too many applications and needed to shrink that number. We can provide analytics on performance and usage and all sorts of things to provide some indication of how critical an app really is.

Organizations are worried about the problem accelerating in the wrong direction. With gen AI, a guy is enabled to create the next 1,000 apps before the management can get rid of even 10 of them. We help in the sense that we can make the explosion of apps more manageable. On the challenging side, we obviously can't prevent an organization's people from building a huge number of additional applications that may or may not get used.

One last application of that combination of observability, identification and (increasingly automated) remediation is using Dynatrace to create sandboxes, where users can read in test environments or production environments for deeper testing and examination. McConnell says that a growing number of users have taken to using it this way. From there it would seem to be only a question of scaling to have a Dynatrace environment capable of functioning as a digital twin of a production environment.

Given the increasing levels of complexity in applications, the growing use of Dev/Ops and Continuous Delivery, and the way they are used together in collaborative environments, the chances of an update or modification of one application causing problems in other applications over time are growing exponentially. This could make running digital twins of production environments within an observe/identify/remediate test environment an interesting proposition. This doesn't faze McConnell:

There's no reason in principle that I can think of why we couldn't do that.

My take

It is one thing to be able to see what is happening with a process, be it business, industrial or game: that does at least give one a chance of having some idea of what is going on, which can be helpful, especially when it is all going wrong. But if that is happening, then making it work right, as quickly as possible, is the obvious goal. That takes a good deal more than just observability, for observability alone still leaves the more difficult task of fixing it.
