British Gas fires up its connected homes project with DataStax


The utility company is building out its connected home capabilities using DataStax products. But a British Gas engineer warns that the shift to real-time data is a challenging one.

British Gas Connected Home is moving from batch processing to real-time analytics via huge ‘data pipes’, streaming energy information from hundreds of thousands of homes across the UK.

The project utilises DataStax technology, which meshes together a number of Apache open source database technologies and wraps them with enterprise capabilities. It’s a project that is growing and will become increasingly central to consumers looking to manage their home energy better.

I got the chance to sit down with Josep Casals, Head of Data Engineering at British Gas Connected Home, at DataStax’s European Summit this week in London.

Casals said that he started the project approximately a year and a half ago, looking to address energy analytics problems, where he needed a system that could handle “huge amounts of data”. He said:

We were looking for a database solution that could scale. From an architectural point of view, DataStax and Cassandra were the ideal product, because you can easily keep adding and adding nodes and there is no reason why it should work differently whether there are 10 nodes or 1,000.

The first product that Casals and his team built was the British Gas My Energy Report, a web application that lets consumers see information about their household energy consumption. Casals said:

Telling information about your energy usage is difficult because we don’t get the disaggregated consumption, so we don’t know which appliances are on. We just know that at a certain moment in time there is a certain amount of consumption. To be able to tell you how much you have spent on heating, or how much you spent on appliances, we needed to apply analytics and data science techniques.

We have data scientists and engineers in our team. And DataStax was well suited for that because it combines Cassandra, which is very good for time series data, and it combines Apache Spark, which is very good for machine learning and data science learnings.

With that we built My Energy Report, which 850,000 British Gas customers get. There are around 2 million smart meters deployed and, of those, 850,000 get the report because they have consented to give the data.

The system is also used for millions of other customers who still use the older style meters in their homes. However, this data is less accurate than that transmitted by the smart meters, given that it relies on patterns of consumption, rather than discrete data.
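The disaggregation problem Casals describes, inferring heating versus appliance spend from a single whole-home reading, can be illustrated with a deliberately naive sketch. British Gas's actual data-science models are not public; the hypothetical `disaggregate` function below simply treats the minimum observed reading as an always-on baseline:

```python
def disaggregate(readings):
    """Split each meter reading (kWh) into (baseline, variable) components.

    The baseline is estimated as the minimum observed reading, a crude
    stand-in for fridges, routers and other always-on appliances; the
    remainder is attributed to variable loads such as heating.
    """
    if not readings:
        return []
    baseline = min(readings)
    return [(baseline, r - baseline) for r in readings]

readings = [0.12, 0.15, 0.60, 0.55, 0.14, 0.90]  # half-hourly kWh
parts = disaggregate(readings)
baseline_total = sum(b for b, _ in parts)  # always-on spend
variable_total = sum(v for _, v in parts)  # heating/appliance spend
```

A real disaggregation model would use far richer signals (time of day, weather, tariff data), but the split into baseline and variable components is the shape of the output the My Energy Report needs.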

Moving to real-time

The My Energy Report is currently generated once every month for customers, but British Gas is now working on more real-time streaming products. Casals said:

There are two now that are being delivered to customers, one is Boiler IQ, which is about predictive maintenance and being able to tell you if there is something wrong with your boiler. That’s in the very early stages.

We have another one, which is a trial with 4,000 customers, and we are getting energy information from these customers every 10 seconds. With that we can tell you things in real-time. The first thing we have built is one that warns you if you have an abnormal pattern of consumption.

However, Casals said that the monitoring of abnormal patterns of consumption is quite simple compared to what he and his team want to do. British Gas wants to connect the whole home and make it possible to monitor every detail of your energy consumption.
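The article doesn't describe how the abnormal-pattern warning works internally, but one common approach on a 10-second stream is a rolling z-score: flag a reading that sits far outside the recent mean. The sketch below is hypothetical, with illustrative window and threshold values:

```python
from collections import deque
from statistics import mean, pstdev

def make_anomaly_detector(window=360, threshold=3.0):
    """Flag a reading as abnormal if it deviates from the rolling mean
    by more than `threshold` standard deviations. With 10-second
    readings, window=360 covers roughly the last hour."""
    history = deque(maxlen=window)

    def check(watts):
        abnormal = False
        if len(history) >= 30:  # need some history before judging
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0 and abs(watts - mu) > threshold * sigma:
                abnormal = True
        history.append(watts)
        return abnormal

    return check

detect = make_anomaly_detector()
# Steady household load around 500W: nothing flagged.
normal = [detect(w) for w in [500 + (i % 5) for i in range(100)]]
# A sudden 5kW draw (e.g. an appliance left on) is flagged.
spike = detect(5000)
```

A production detector would also account for time of day and seasonality, but the stateful per-home check is the part that only works if readings arrive as a stream rather than a monthly batch.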

We have other things in store that we think will be more interesting to customers than just telling you that you have an abnormal pattern of consumption. Real-time information, or information based on your real-time consumption. Things like, for example, a connected thermostat – where you can control it from a mobile application. But we are adding many more devices, like door sensors, motion sensors, anything that is connected in the home.

With that we can do more interesting things – since we have a stream of temperature from your home, we can do a thermal model of your home, so we can tell you if you’ve been wasting energy by setting your heating too early in the morning because it heats up quicker than you think.

It’s a big change

Casals also had some interesting comments about the shift from batch to real-time using Apache database tools. He said:

It is a change in the architecture. Before we were using analytics in a batch way, we were storing everything into Cassandra and then once a month just running them. Now what we do is we use technology called Apache Kafka, which is like the nervous system of our infrastructure. We get messages from smart meters, from thermostats, from sensors, and all of that comes as a stream to us.

You can think of Cassandra as the long term memory, it is where we store everything that we derive from these messages. And then to glue everything together we use Apache Spark Streaming, which processes the data continuously – that’s like our short-term memory, we keep a state of what is the last temperature from a home, for example.
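The "short-term memory" Casals describes can be sketched as a stateful fold over the message stream. In production that state would live in Spark Streaming, with Kafka supplying the messages and Cassandra as the long-term store; here plain Python stands in for the streaming framework, purely to show the shape of the processing:

```python
def process_stream(messages, state=None):
    """Fold a stream of (home_id, temperature) messages into state.

    The state dict plays the role of the short-term memory: it always
    holds the last temperature seen for each home.
    """
    state = {} if state is None else state
    for home_id, temperature in messages:
        state[home_id] = temperature  # last reading wins
        # The long-term memory would be an append to Cassandra here,
        # e.g. an INSERT into a time-series table keyed by home_id.
    return state

# One micro-batch of messages off the stream:
batch = [("home-1", 19.5), ("home-2", 21.0), ("home-1", 20.1)]
state = process_stream(batch)
```

The key property is that the state is updated continuously as messages arrive, rather than recomputed from the full history once a month.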

He said that there is no way that British Gas could have used traditional relational databases for the project, claiming that when you are dealing with these amounts of data it “would never work”.

Casals said that other companies considering projects of this type need to understand how dramatically the approach differs from batch/relational tools. He said:

I’d advise that they start thinking about their architecture in a streaming way from the very beginning. It’s a very different kind of architecture. If you set up things in the old way, with a data lake with Hadoop, then you can only deliver things in a slow way. You can get insightful results, but they need to come at set times.

Whereas if you design everything in a streaming way, you can update these results in real-time. A state of the art stack is more like assembling pipelines of data. And if you don’t have this in mind from the beginning, the transition is very difficult.

The difficulty is that the kind of processes that you need to design are completely different. So you would have to rewrite all your code. You need different infrastructure.

Keeping all of them running, 24/7, as data starts to increase, is challenging. It is a kind of architecture that is not very well tested in terms of support. Operations and support are used to seeing if some computer is running or if there is disk space, for example.

But it’s more difficult to determine that one of the pipelines is still alive but not producing anything. Tooling for support isn’t very up to date. We are getting more and more data, some pipes have 30,000 messages per second. Everything works well in a test environment, but when you scale you experience a whole new set of challenges.
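The failure mode Casals highlights, a pipeline process that is up but has stopped producing, is typically caught with a heartbeat or watermark check rather than a host-level probe. The sketch below is a hypothetical monitor: the monitor records the timestamp of the last message each pipe emitted, and alerts when the silence exceeds a threshold:

```python
import time

def pipeline_stalled(last_message_ts, now=None, max_silence=60.0):
    """True if no message has flowed for more than max_silence seconds,
    even though the pipeline process itself may still be running."""
    now = time.time() if now is None else now
    return (now - last_message_ts) > max_silence

# 30 seconds of silence on a pipe is fine; two minutes means it has
# stalled and should page someone, even if the process looks healthy.
ok = pipeline_stalled(1000.0, now=1030.0)
stalled = pipeline_stalled(1000.0, now=1120.0)
```

The right `max_silence` depends on the pipe: at 30,000 messages per second even a few seconds of silence is suspicious, whereas a low-traffic sensor feed may legitimately go quiet for minutes.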