Cloudera CTO on Big Data analytics and security risks

By Martin Banks, September 23, 2014
Summary:
Big Data analytics help create a central location for all of the most valuable and sensitive business data, but Cloudera argues that this can create a significant security target.

Much is made of how Big Data analytics help create a central location for all of the most valuable and sensitive data a business can hold, but does that also create one of the most significant security targets going – and a single point of failure for all security systems?


According to Amr Awadallah, CTO of Cloudera, it does:

It is an issue across the board, but it is something we have covered. We have been doing implementations in the finance space, the government space and the healthcare space, and these people are fanatics when it comes to security.

We feel Cloudera has a very strong security story with our authentication, access management and encryption tools, and we couple that with a very tight system on the auditing side, for security is only as good as the auditing.

The encryption element is new, and is why we bought Gazzang in June this year. We are unique in the level of security we provide for our customers. Our competitors still have to work with other security vendors to provide a similar level of security, and that slows things down.

Continuing to improve and develop the security aspects of Cloudera’s distribution of the widely used Big Data analytics tool, Hadoop, is now to become one of the prime focuses for the company’s new Centre of Excellence.

The goal here, Awadallah says, is to make Big Data analytics ever more solid for users, particularly by improving ease of use and exploiting more automation. Making the system as secure as possible is a key component here, for it reduces user concerns about having to hand-manage security requirements.

The goal, he suggests, is to give users more confidence in putting all their data into the system so they can extract more value from it.

The company will also be focussing on developing support for cloud implementations of Cloudera. At present the company still only supports on-premise implementations of its Hadoop distribution. He says:

We are now starting to see some of our customers building hybrid implementations where they want some of their Hadoop cluster out in the cloud, so we are investing in how that can be managed.

The reason most of the cloud implementations so far have been for new applications is the issue of moving data to and from the analytics cluster. With new applications, all of it is being born in the cloud, and that is where the data is, so it makes sense to also have the applications there.

But for legacy applications, the hassle of moving all that data into the cloud is just too much. In fact it becomes faster to ship the data on disk than over the Internet.
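A back-of-the-envelope sketch shows why shipping disks can win. The data volume, link speed and courier time below are illustrative assumptions, not figures given by Awadallah, and the calculation assumes the link runs at full speed the whole time.

```python
def transfer_days(data_terabytes: float, link_megabits_per_sec: float) -> float:
    """Days needed to push the data over a link at full, sustained speed."""
    data_bits = data_terabytes * 1e12 * 8            # decimal terabytes -> bits
    seconds = data_bits / (link_megabits_per_sec * 1e6)
    return seconds / 86_400                          # 86,400 seconds per day


# Hypothetical figures, chosen only for illustration.
DATA_TB = 500        # size of a legacy data set
LINK_MBPS = 1_000    # a dedicated 1 Gbit/s uplink
COURIER_DAYS = 3     # assumed door-to-door time for a crate of disks

network_days = transfer_days(DATA_TB, LINK_MBPS)
print(f"network: {network_days:.1f} days, courier: {COURIER_DAYS} days")
```

Under these assumptions the network transfer takes roughly a month and a half, against a few days for the courier. A faster link or a smaller data set shifts the balance back towards the network.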

The other area where he claims a significant advantage over the competition is the company’s partnership with Intel, struck earlier this year. According to Awadallah, Intel invested $740 million in Cloudera, and in the process folded its own distribution of Hadoop into the company.

That move alone looks likely to give Cloudera significant global coverage and market penetration, given Awadallah’s observation on the pairing:

Intel’s distribution is very big in China, while ours is very big in the USA.

Intel is also working on developments, such as on-chip encryption and decryption technologies, that Cloudera expects to be exploiting as soon as they are available. This means that it will no longer be necessary to run software to encrypt and decrypt data. It will happen automatically in hardware, and will also happen much faster:

We have customers that are afraid of doing encryption because it can slow everything down. With this development they won’t have to worry about that.

Intel obviously sees this investment as an important lever in helping it keep its estimated 96 percent share of the server marketplace, and according to Awadallah it is an example of the company following its traditional path of aligning itself with new applications and software developments as they become important in the marketplace.

However, while Intel may have the lion’s share of the server processor marketplace, there is considerable talk about not only new server and even datacentre architectures to accommodate the demands of cloud computing, but also new processor architectures.

For example, there is the growth in the use of Continuous Delivery of applications code to users; the expected move to much smaller, single function applications with short life-cycles; the switch from operating systems and towards new types of browser that can run these new applications within the browser; and much smaller servers that, while they may still be running multiple server instances in a virtualised environment, will be logically optimised to run these applications as fast as possible.

Changed relationships?

This raises the question of whether such changes might affect the relationship between Cloudera and Intel, for Intel may yet lose out to one or more of the many semiconductor manufacturing partners of ARM, which now has a 64-bit server processor chipset design available. Awadallah suggests:

The microserver architectures are not suitable for big data applications. The world that we are trying to disrupt right now is that let’s have all data stored in a SAN on one side and all computation on the other side, with a network in between.

That is what the world looked like before Hadoop. And that did not work with big data because the data had to travel over the network for the compute to run my task. The network speed is not growing at the same rate as big data.

The model that we have with Hadoop is 'let’s have a bunch of servers with 12 hard disks, two Intel sockets with six or maybe 12 cores and then spread out the data over all these servers'. Then when a query comes in, the big problem is turned into micro-problems and they are spread out across the servers. Each server does its computation and we join back the results.

The pro of the big data approach is that it scales way, way better; the con is that to add more storage you have to add more servers, and more CPUs whether you need them or not.
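The pattern Awadallah describes, splitting a query into micro-problems that run next to the data and then joining the results, can be sketched in a few lines. This is a toy illustration only: threads in one process stand in for servers, the shard contents are made up, and none of the names correspond to Hadoop's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the data held on one server's local disks.
# The values and the three-server layout are invented for illustration.
SHARDS = [
    [12, 7, 33, 5],
    [8, 41, 19],
    [27, 3, 15, 22, 9],
]


def local_sum_over(shard, threshold):
    """Micro-problem: runs where the data lives and returns one small number."""
    return sum(value for value in shard if value > threshold)


def run_query(shards, threshold):
    """Scatter the query to every shard, then join the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(
            pool.map(lambda shard: local_sum_over(shard, threshold), shards)
        )
    return sum(partials)


total = run_query(SHARDS, threshold=10)
print(total)  # 169
```

Only the small partial sums cross the "network" here, not the raw data, which is the point of moving the computation to the storage rather than the other way round.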

But doesn’t this then map straight on to the highly parallelised microserver architecture which is seen as coming down the line? Awadallah reckons:

We may be talking about the same thing. The new architecture being developed by the likes of HP is moving to the idea of the rack becoming the computer, the rack will be the server. So it is not going to be a single server or a compute cluster, it is going to be the rack, and within this rack we are going to have a bunch of CPU blade servers, with only CPUs and memory on them.

There will also be a bunch of disks, which can then be logically mounted onto the blades within the rack. The problem is that there is not enough network bandwidth to do this between racks, only within racks.

He acknowledges that this type of architecture is the way things are moving, and cites HP’s The Machine as an example. Intel is now also working on building networking natively into the chip, so that blades in a rack can talk directly with the storage. So in his view Intel is cognisant that this is where the future is going.

But this architecture would appear to be one that would suit ARM-based servers. Awadallah, however, is not so sure.

ARM has no track record in datacentres, so the big question mark is can they take the IP that they have for building mobiles and tablets and bridge that into the data center? That is a very big question mark. I think ARM will have very much better luck with the Internet of Things (IoT) and I think they will make a big mark there.

He sees no advantage in having commonality of platform between IoT and big analytics systems, even if the latter will be working with data sourced by the former. This is because what is happening in IoT is largely about logging data from sensors, not about the heavyweight compute functions needed for analytics.

That compute capability is something at which Intel has, historically, always been very good, he suggests, giving an estimate that it ‘owns’ 100 percent of the high performance computational systems marketplace. ARM, on the other hand, has been very good at providing low power solutions. Awadallah argues:

Nirvana is having low power and high compute, and I don’t think that is possible in one device. What is needed is a mix of both.

He does see scope for large arrays of highly parallelised microservers playing some part in big data analytics, if the computational requirements are small: for example, multiple repetitions of large volumes of the same, simple compute function:

And the proof is in the pudding. ARM announced some three years ago that it would be doing a server version of the processor, and there is still no sign of it out there… nothing. There is no certification from Red Hat, which is the key for the next distribution, and no production deployments that I know of.

He sees this as a problem for both ARM and its chipset-producing partners. Unlike Intel, he argues, ARM has not shown the ability to push its partners in the direction it wants:

ARM has not been able to push Red Hat and others more quickly. The issue is not the chip, the issue is taking these chips and putting them in a platform that actually ships, and then convincing Dell and HP to go and make that platform in conjunction with the software that supports the platform. That is the challenge.

Despite all that, he did indicate that there is nothing exclusive in Cloudera’s relationship with Intel that would preclude the company from investigating such possibilities. He is, therefore, keeping a close watch on ARM, and is aware that any industry having too much dependence on a single vendor is bad for both business and consumers.

My take

Awadallah is probably right about the position of microserver penetration into Big Data analytics, certainly for the short term.

But a combination of factors is likely to drive a faster and more complete transition to architectures of millions of microservers than expected: the pace of chip technology and architecture development; the possibility of IoT-related analytics being bigger and, through the need for real-time event management, increasingly complex; and the increasing need for analytics to be performed ever closer to the point of need.