Getting the Big Data 'ducks' into rows
Summary:

- Big Data has little to do with the bigness of the data, but with the ability to make sense and order out of it.
- To achieve the best possible result from the analytics process, data from a wide range of sources has to be included.
One of the real problems here is that Big Data, almost by its very nature, is a mish-mash of different types of data. To achieve the best possible result from the analytics process, data from a wide range of sources has to be included – highly structured back-office database records rub shoulders with office productivity tool file systems and social media postings. Then there are audio and video files to work with as well.
All of these are grist to the data analytics mill, but getting them organised into a form that can be used effectively can take time, usually because it has to be done by systems not designed for such tasks. This makes I/O between the storage holding the data and the compute function one of the big bottlenecks in delivering Big Data analytics services.
Solving this problem has become one of the development targets for DataDirect Networks (DDN), which recently unveiled the results of its endeavours, along with a new storage appliance called EXAscaler that brings HPC-standard, open-source-based parallel scale-out storage capabilities to the enterprise marketplace.
DDN’s solution to the big data I/O issue is the Infinite Memory Engine (IME), which is still being trialled in beta form, though it is expected to see a formal release into service early next year.
The company’s Chief Marketing Officer, Molly Rector, claims it is the furthest ahead with this type of technology, which aims to speed data analytics applications to two to three times current performance levels.
It achieves this by introducing a software layer to manage and optimise the I/O function between the storage environment and the compute function. It comes in two basic parts: the first is what Rector describes as a 'light client' that sits on the compute system.
This intercepts I/O calls to storage and sends the requests to the second element, the IME engine, for processing. According to DDN, this process is seamless and transparent, and does not impede performance in any way.
Because of the nature of Big Data analytics, these I/O calls are usually requesting data from a wide range of different sources, in any order. This creates a major bottleneck in processing while the data is sorted.
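The two-part split Rector describes – a thin shim on the compute node that captures I/O calls and forwards them to a separate engine for re-ordering – can be caricatured in a few lines of Python. This is an illustrative sketch only; every class and method name here is invented for the example, and none of it is DDN's actual API:

```python
# Illustrative sketch of the client/engine split described above.
# All names are invented for this example; this is not DDN's API.
from collections import deque

class ReorderingEngine:
    """Stands in for the IME engine: collects forwarded requests, then
    hands them back sorted by offset so the backing store sees an
    ordered stream rather than a random one."""
    def __init__(self):
        self.pending = deque()

    def submit(self, offset, length):
        self.pending.append((offset, length))

    def drain(self):
        batch = sorted(self.pending)   # re-order requests by offset
        self.pending.clear()
        return batch

class LightClient:
    """Stands in for the 'light client' on the compute system: it
    presents an ordinary read() interface, but every call is
    intercepted and forwarded to the engine instead."""
    def __init__(self, engine):
        self.engine = engine

    def read(self, offset, length):
        self.engine.submit(offset, length)   # intercept and forward

engine = ReorderingEngine()
client = LightClient(engine)
for offset in (7042, 12, 5316, 13, 900):     # scattered requests, as issued
    client.read(offset, length=1)

print(engine.drain())   # requests now reach storage in offset order
```

The point of the split is that the application keeps calling what looks like a normal read interface, while the ordering work happens elsewhere.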
Targeting
It is this that DDN has targeted with a dedicated solution. The engine re-orders the data for delivery to the compute system in a manner that allows it to start analysis processing far faster than currently possible, as Rector points out:
Because of the random nature of this standard approach, a system that should be able to process say 10 Gbytes/sec of data only delivers around 100 Mbytes/sec, just because of the way it was written.
This is typical of all systems at the moment. You run a job on the compute system and it then sits idle while the results trickle into storage. Tests have shown that IME can improve the I/O performance by up to 1,000 times. What is more, once re-ordered, datasets can then be re-used in that form, increasing performance even more.
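Why sorting the request stream opens up such a large gap between potential and delivered throughput can be seen with a toy cost model: charge a fixed 'seek' penalty every time the next block does not follow the previous one, plus a small per-block transfer cost. The numbers below are invented for illustration and bear no relation to DDN's measurements:

```python
# Toy model of why re-ordering a random request stream helps.
# The seek penalty and transfer cost are made-up numbers, chosen
# only to show the shape of the effect, not DDN's figures.
import random

SEEK_COST_MS = 8.0      # hypothetical cost to reposition between non-adjacent blocks
TRANSFER_COST_MS = 0.1  # hypothetical cost to transfer one block

def modelled_io_time(block_offsets):
    """Sum transfer costs, charging a seek whenever the next block
    does not immediately follow the previous one."""
    total = 0.0
    prev = None
    for off in block_offsets:
        if prev is None or off != prev + 1:
            total += SEEK_COST_MS
        total += TRANSFER_COST_MS
        prev = off
    return total

random.seed(0)
# An analytics job touching most of a file, but in arrival order:
requests = random.sample(range(1200), 1000)

random_cost = modelled_io_time(requests)            # as issued
reordered_cost = modelled_io_time(sorted(requests)) # sorted before dispatch

print(f"random order: {random_cost:.0f} ms")
print(f"reordered:    {reordered_cost:.0f} ms")
print(f"speed-up:     {random_cost / reordered_cost:.1f}x")
```

In the random stream almost every block pays the seek penalty; after sorting, most blocks follow their neighbour and the penalty largely disappears, which is the same shape of effect as the 100 Mbytes/sec versus 10 Gbytes/sec gap Rector describes.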
That level of performance improvement may not seem to square with the claimed overall two-to-three-times performance improvement for complex analytics applications.
But according to Rector, this improvement gets multiplied because it means that one application is no longer saturating the compute system, freeing up resources for other applications, or instances of the same application, to be run concurrently.
As a side issue, she indicated that IME is also expected to reduce energy consumption footprints by between 20 and 30 percent for sites running large, high-end, compute- and storage-intensive environments.
DDN is also working with systems vendor SGI, which is testing IME in conjunction with SAP’s HANA in-memory processing environment. According to Rector, it may have a significant role to play in improving how data is moved out of HANA:
There are applications that have been designed to use memory well. For those designed for heavy in-memory compute work one possible issue is that it can be hard to get data out of memory. If it is possible to get the data out of memory and keep the high performance level that would be a great thing. But we still need to do that testing.
DDN’s other main announcement, EXAscaler, is its second introduction into the highly scalable parallel file system market, joining GRIDscaler, which was launched a couple of months ago.
The primary difference is that EXAscaler is built on the open source Lustre management system, whose source code is maintained by Intel. The objective, according to Rector, is to create an open source system that will appeal to the growing number of enterprise users who are moving to open source wherever possible.
The one trouble with that is the poor ease of use that can come with open source applications and tools.
So DDN has taken the latest version, 2.5, of Lustre to create a fully supported distribution of the software, and coupled it with hardware to build a packaged, pre-configured appliance capable of delivering data at 20 GBytes/sec, which Rector acknowledges is a downgrade from what is possible with the software alone and sufficient hardware:
Yes, it is a performance hit, but most end users won’t reach that 20 Gbyte/sec level, so it doesn’t really matter. And we are doing it as an appliance because many enterprise IT guys will not have worked with Lustre, so the more we can do for them, the better.
My take
Another example of how Big Data has little to do with the bigness of the data, but with the ability to make sense and order out of it, so that meaningful results can be achieved that bit faster.
And as appliances are making inroads into the complexity-reduction side of the issue, I would not bet against a cost-effective IME 'box' appearing as demand grows.