RichRelevance crafts recipe for ingesting big data

By Kenny MacIver, March 8, 2015
Summary:
Recommendations engine service provider to the giants of omnichannel retail turns to Pentaho’s Hadoop-friendly ETL tool to streamline client data on-boarding.

Pictured: Marc Hayem, RichRelevance

Analytics has always been the sexy bit of data management. That’s where the nuggets of insight are teased to the surface and millions are made by understanding why diapers sell beer, who is newly pregnant, or how to route a jet so it burns 25% less fuel. But, behind that, there has always been the grunt work of extracting data from multiple, disparate sources, cleansing it of partial or bogus records, transforming it into a consistent and usable format, and loading it into the target analytics engine.

The trouble for companies who are trying to exploit the new opportunities presented by big data is that the necessary extraction, transformation and loading (ETL) work is amplified many times over and across multiple dimensions.

San Francisco-based RichRelevance may be little-known outside of its core audience of consumer retail, but it has been progressively opening up Amazon-like, live recommendation engines to many of the world’s largest department stores, a crowd often more associated with bricks and mortar than cutting-edge online. Its 200 clients (a blue-chip bunch that includes Saks, Sears, Urban Outfitters, Williams-Sonoma and JCPenney in the US and John Lewis, Monsoon, L’Oréal and Quelle in Europe) are aggressively using its hosted data analytics engine and arsenal of around 125 propensity algorithms to generate near real-time recommendations to online customers for the products they are, in all probability, about to buy next.

The Amazon link is hardly a fluke. The founders of the company, most notably David ‘Selly’ Selinger, moved from roles in data mining and personalization at the online giant in 2006 to build a company that could help other online retailers to “catch up,” according to Marc Hayem, RichRelevance’s VP of platform transformation. But it actually found the greater sense of urgency lay not with Amazon’s online rivals but with traditional retail chains who saw themselves “in an arms race with online.”

The persuasion of personalization

But as the business has grown, RichRelevance has hit a fundamental problem: the task of on-boarding clients and “ingesting” the ever-larger amounts of catalog data from which their recommendations are drawn.

Hayem outlines the challenge:

“We host about 400 retail [web] sites on our multi-tenant platform at our 12 data centers spread across Asia, Europe and the US. Two of those are data science centers, large Hadoop clusters which run the propensity models, asking things like ‘Is this customer likely to buy this jacket?’ The others are proximity data centers, providing recommendations to those customers, typically in 60-70 milliseconds, based on what is happening in their retail session.

“To enable that, each retailer supplies a new version of their product catalog for uploading each day or, in some cases, multiple times a day. These are relatively unstructured [files], comprising prices, product names, descriptions, universal product codes and so on. The fact is many retailers never delete a single product, so we end up with some pretty massive catalogs. Sears’ catalog, for example, has millions of products.”

The issue for RichRelevance has not necessarily been the size of those imports but the fact that retailers provide it with very diverse file formats — and file formats that might vary from those previously supplied or even agreed upon. “There are always surprises,” says Hayem. “So it’s a complex problem. As always in IT, the devil is in the detail.”
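To make the problem concrete, here is a minimal sketch in Python of what one hand-written feed adapter might have to do: map whatever column names and delimiter a retailer happens to send onto one canonical record, and set aside rows that cannot be repaired. It is not RichRelevance’s feed-herder or anything built with Pentaho. The canonical fields, the header aliases and the `normalize_feed` function are all assumptions made up for illustration, based only on the kinds of fields Hayem mentions.

```python
import csv
import io

# Hypothetical canonical schema. The article names these kinds of fields
# (UPC, product name, description, price) but not an actual schema.
CANONICAL_FIELDS = ("upc", "name", "description", "price")

# Hypothetical header aliases, standing in for the way each retailer
# labels the same column differently.
HEADER_ALIASES = {
    "upc": "upc",
    "universal_product_code": "upc",
    "name": "name",
    "product_name": "name",
    "title": "name",
    "description": "description",
    "desc": "description",
    "price": "price",
    "unit_price": "price",
}


def normalize_feed(text, delimiter=","):
    """Parse one delimited catalog feed.

    Returns (clean_rows, rejected_rows). Clean rows use the canonical
    field names; rejected rows are the raw records that lacked a UPC,
    a name, or a parseable price.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    clean, rejected = [], []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            # csv.DictReader uses a None key for surplus columns and a
            # None value for missing ones; skip both.
            if header is None or value is None:
                continue
            key = HEADER_ALIASES.get(header.strip().lower())
            if key is not None:
                row[key] = value.strip()
        try:
            price_text = row["price"].replace("$", "").replace(",", "")
            row["price"] = float(price_text)
        except (KeyError, ValueError):
            rejected.append(raw)
            continue
        if not row.get("upc") or not row.get("name"):
            rejected.append(raw)
            continue
        row.setdefault("description", "")
        clean.append(row)
    return clean, rejected


if __name__ == "__main__":
    # Two invented feeds with different headers and delimiters.
    feed_a = "UPC,Product_Name,Unit_Price\n012345678905,Wool Jacket,$129.00\n"
    feed_b = "title|price|upc|desc\nRain Boots|45.5|036000291452|Waterproof\n"
    print(normalize_feed(feed_a))
    print(normalize_feed(feed_b, delimiter="|"))
```

Every new retailer, and every unannounced change to an existing feed, means another entry in a table like `HEADER_ALIASES` or another special case in the parsing loop, which is the kind of recurring custom work the company wanted to move out of engineering.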

For the nine-year-old, 200-employee company, the signing of each new client, or any change to the file formats supplied by an existing client, has meant that highly skilled engineers have been pulled off product development to work on its ‘feed-herder’, the mechanism for ingesting whatever files retailers provide. As Hayem explains:

“As the company grew there came a time where we had more engineers writing this kind of stuff than we really wanted. We have actually a product to evolve and we couldn’t do it at the pace we wanted because we were basically using some of our best engineers to write those feeds in a pretty custom way every time.”

Spinning YARN

To break the logjam and bring some level of automation to the process, RichRelevance went looking for an ETL tool: not one that had a heritage in classic business intelligence, but rather one that was both flexible and optimized to ingest feeds into the kind of Hadoop infrastructure that sits at the heart of its business.

Its choice, Pentaho Data Integration (PDI) from the eponymous open source BI platform vendor, has allowed it to pass the task of on-boarding new and changed catalogs from engineering to client service teams, with a graphical user interface simplifying the process, says Hayem. A critical factor in that decision was the Pentaho product’s ability to perform well in a Hadoop environment, he outlines:

“All our infrastructure is Hadoop-based; we don’t really have anything else. So something that is very important about Pentaho’s Data Integration product is that it allows us to ingest files in the Hadoop cluster where they execute.”

What its decision also showed is the evolution of some Hadoop environments into products that a CIO might recognize as enterprise-class. Importantly for Hayem, PDI supports YARN, the Apache Software Foundation sub-project that takes Hadoop beyond its batch processing nature to enable multiple data processing engines (such as interactive SQL, real-time streaming, data science and batch processing) to handle data stored in a single platform. As he explains:

“Support for YARN was a big decision factor for us, because it allows us to scale these jobs in our Hadoop cluster. There aren’t too many ETL products today that support YARN. Why is that important? Well, with the first version of Hadoop, there was little choice but to use MapReduce, which means individual jobs will take all the resource they can find in the Hadoop cluster. That is fine when you are doing something like search, but not with these kinds of extraction jobs. We want to allocate them a certain amount of memory or resources; moreover, we want them to happen immediately rather than be queued as would typically happen with MapReduce. YARN basically allows us to use our Hadoop capacity in what might be seen as a normal computing fashion.”

No doubt with such considerations in mind, Hitachi Data Systems (HDS), the data storage subsidiary of Japan’s Hitachi, swooped on Pentaho in February. The deal is billed by the companies as “the largest private big data acquisition transaction to date” (though no figures are public yet). Pentaho will continue to follow its current business model and retain its own brand under the leadership of CEO Quentin Gallivan. (See Diginomica’s interview with Gallivan.)

Scale and speed

The benefits of the Pentaho implementation have been pretty easy to quantify for RichRelevance, in terms of both people and costs. It has enabled its engineers to return full-time to evolving the core product. And it has allowed the company to create a clear delineation between engineering and consulting, and so start charging for the time consultants spend on implementations.

Vitally, it has also accelerated the process for on-boarding catalog feeds, says Hayem:

“It’s definitely a big gain for us. Rather than specialist engineers writing those feeds, we have five or six people in the client services team who are now able to create them using a visual tool. So we can scale much better and add files much faster.”

My take

Personalization is the new black for many big retailers. And for good reason. Recommendation engines feeding into an individual’s retail session have been shown to drive significantly greater sales. In a survey spanning the UK, Germany and France, tech market analyst IDC found that between 40% and 60% of online shoppers cite personalized product recommendations as influencing their purchasing habits. Indeed, that data shows that big spenders online are the most likely to respond positively to such recommendations: 40% of those spending over €600 ($650) in an online session bought at least one product that was recommended to them.

The challenge for many retailers is to achieve such personalization at vast scale and speed, and with enterprise-level robustness. And tools such as Pentaho are among the few (at least in the view of customers like RichRelevance) which are able to satisfy that demand in big data environments — perhaps something that traditional BI vendors need to respond to if they are to play a bigger role in the new world of omnichannel, personalized retail.

Image credits: Featured image © Infinity — Fotolia.com; Portrait image — RichRelevance