Scottish agritech business Intelligent Growth Solutions (IGS) says use of low-code auto-remediation workflows in its Kubernetes environment has radically reduced the need for human callouts. It has also improved overall efficiency by a very welcome 40%.
IGS is a vertical farming company based in Edinburgh. It sees itself as helping to meet the need for sustainable, high-yield forms of food production to feed a growing global population that is starting to put stress on both resources and the environment. IGS produces a number of vertical farming products, including trays and ‘tower’ systems, many of which it has been awarded patents for. To date, it has received over £42m in investment, including a recent oversubscribed Series B round, and has some 150 staff.
The organization is a major user of hi-tech, claiming to be 100% IoT-enabled and using artificial intelligence and machine learning as part of its mission to enhance indoor food-growing environments in either agricultural or commercial spaces. That's because this form of non-traditional farming relies heavily on smart technology and sensors. But none of that was helping reduce stress on its in-house support team of Site Reliability Engineers (SREs), whose role is to make sure the vertical farms are running as they should.
Site Reliability Lead Owen Adams explained that the technology implemented to run vertical farms is highly sophisticated, and often requires dedicated and highly experienced staff to keep it running smoothly. Finding such staff in a technology market already facing a major skills gap is hard, but recruitment and retention were becoming even more challenging due to the apparently endless out-of-hours problems that needed to be solved. He said:
In one way, vertical farming is a very new technology, but it's kind of an extension of the classical ideas around the greenhouse. The actual concept of vertical farming with full environmental control is quite a new concept, though, because the technology hasn't really been there to do it successfully until recently.
Unfortunately, this new IoT technology is quite sensitive, and Adams said he was starting to lose track of the numerous ‘toil events’ in the environments which could completely shut down their system. He adds:
Earlier in the year, and often overnight or at the weekend, we often saw a specific scenario happen whose solution is essentially restarting a bunch of services. This also really mattered, as the impact of this issue is essentially, we were flooding crops. What kept happening was that it was taking a bit of time to debug and solve the original fix, because it kind of comes in this weird interaction between the hardware and software. Me and the rest of the SRE team were starting to have to work around the clock to ensure the system remained optimal, with any number of 2am beeper situations.
A simple solution where an event gets triggered from the hardware side of things which we could then pick up on and then remotely run a series of auto-remediation steps would, I decided, really save a lot of time and human intervention. So, when we estimated that we might need to hire 170 more qualified SREs to cope, I knew we needed to find a better way.
Automated support tools
Not just any 170 IT people would help, he adds, as the IGS environment is "quite a complex product with a lot of moving parts", so training people to get up to speed would also be a resourcing issue.
To meet these challenges, Adams implemented a tool called Relay from infrastructure automation specialist Puppet to lower the human cost of fixing problems. The product works by monitoring alerts, incidents, and tickets and then combining event-based triggers and workflows to automate cloud operations.
Relay is a cloud-native workflow automation platform that helps CloudOps teams quickly build and share fully automated workflows, regardless of coding or scripting experience. The aim is to ensure hybrid cloud environments are secure, compliant, and cost-contained. It does this by listening for events in a system and responding to them immediately, faster than the "speed of a commit".
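To make the event-trigger idea concrete, here is a minimal Python sketch of the general pattern: an incoming event is matched to a workflow, and the workflow's steps run in order. It does not use Relay's own workflow format or API. The `Event`, `WorkflowRegistry`, and `restart_service` names, the `irrigation_fault` event kind, and the service names are all hypothetical, and the steps only return log messages rather than touching real services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A hypothetical event raised by hardware or monitoring."""
    source: str
    kind: str
    payload: dict = field(default_factory=dict)


# A step takes the triggering event and returns a log message.
Step = Callable[[Event], str]


class WorkflowRegistry:
    """Maps an event kind to an ordered list of remediation steps."""

    def __init__(self) -> None:
        self._workflows: Dict[str, List[Step]] = {}

    def on(self, kind: str, steps: List[Step]) -> None:
        self._workflows[kind] = list(steps)

    def dispatch(self, event: Event) -> List[str]:
        steps = self._workflows.get(event.kind)
        if steps is None:
            # No automation is registered, so a human still has to look at it.
            return [f"no workflow for {event.kind}; escalate to on-call"]
        return [step(event) for step in steps]


def restart_service(name: str) -> Step:
    """Build a simulated 'restart' step (assumption: no real service is touched)."""
    def step(event: Event) -> str:
        return f"restarted {name} for {event.source}"
    return step


registry = WorkflowRegistry()
registry.on(
    "irrigation_fault",
    [restart_service("pump-controller"), restart_service("scheduler")],
)

log = registry.dispatch(Event(source="tower-3", kind="irrigation_fault"))
unknown = registry.dispatch(Event(source="tower-3", kind="door_open"))
```

In this sketch, an event with no registered workflow falls through to a human, which mirrors the split between auto-remediated incidents and those that still need a callout.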
By combining a low-code experience with powerful triggers and steps, IGS has thus been able to create automated tools that handle a great deal of overnight auto-remediation, and which anyone on the team can use.
The latter is an important point, he added:
This needs to be something which is very easy for anyone in the team to maintain and make the changes that need to be made if I am ever busy or in bed, which does happen sometimes!
Use of the software, which was installed in February 2021, has since reduced out-of-hours callouts by 40%. That metric is calculated by comparing the number of times a Relay workflow has resolved an issue out of hours with the number of out-of-hours issues his team has had to deal with. Adams said:
My biggest metric of success is the number of times that it prevented an on-call code, the number of times it's performed a successful auto-remediation without having to get someone out of bed.
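The callout metric described above can be sketched as a simple ratio. The article does not give the exact formula, so this assumes one plausible reading: the share of out-of-hours incidents resolved by a workflow, out of all out-of-hours incidents (auto-resolved plus those handled by a person). The function name and the sample counts are illustrative only and are not IGS's figures.

```python
def callout_reduction(auto_resolved: int, human_handled: int) -> float:
    """Fraction of out-of-hours incidents that never reached a person.

    Assumption: every incident is either auto-resolved by a workflow
    or handled by the on-call engineer.
    """
    total = auto_resolved + human_handled
    if total == 0:
        return 0.0  # no incidents, so nothing was avoided
    return auto_resolved / total


# Illustrative numbers only: 40 auto-remediated, 60 handled by a person.
example = callout_reduction(auto_resolved=40, human_handled=60)
print(f"{example:.0%} of out-of-hours incidents avoided a callout")
```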
IGS is now expanding use of the software, taking it beyond incident generation. An issue might be pricing related, or about a piece of the system being down or misconfigured, and a report will go to IT. Now, he said, the team can very rapidly create workflows that solve the problem. This is starting to happen as various common fixes are automated and chained together.
So instead of waiting for hours for a problem to get addressed and picked up, fixing it can happen within a couple of minutes and the situation can be reviewed later. That's actually a big step for us because it starts to make the whole system so much easier to maintain, and I'd say this software is now pretty key in our operations.
Next steps for us will be to see if it can even further reduce the amount of work required to create new workflows and new auto-remediations.
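The chaining of common fixes that Adams describes might look something like the following Python sketch: fixes are tried in order until a health check passes, and anything still unresolved is left for a person. This is an illustration of the pattern, not Relay's implementation. The `run_chain` function, the fix names, and the simulated `state` dictionary are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A fix is a (name, action) pair; the action takes no arguments.
Fix = Tuple[str, Callable[[], None]]


def run_chain(fixes: List[Fix], is_healthy: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Apply fixes in order, stopping as soon as the system is healthy.

    Returns (resolved, names_of_fixes_applied). If resolved is False,
    the incident would be escalated to the on-call engineer.
    """
    applied: List[str] = []
    for name, action in fixes:
        if is_healthy():
            break  # earlier fixes were enough; skip the rest
        action()
        applied.append(name)
    return is_healthy(), applied


# Simulated system state (assumption: stands in for real health signals).
state: Dict[str, bool] = {"queue_stuck": True, "cache_stale": True}


def is_healthy() -> bool:
    return not any(state.values())


fixes: List[Fix] = [
    ("clear_cache", lambda: state.update(cache_stale=False)),
    ("restart_queue", lambda: state.update(queue_stuck=False)),
    ("reboot_controller", lambda: state.update(queue_stuck=False, cache_stale=False)),
]

resolved, applied = run_chain(fixes, is_healthy)
```

Ordering the chain from least to most disruptive means the heavier fix, here the simulated controller reboot, only runs when the lighter ones have not cleared the fault.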