Netflix shows off how it does Hadoop in the cloud:
Netflix is the undeniable king of computing in the cloud — running almost entirely on the Amazon Web Services platform — and its reign extends to big data workloads, too. In a Thursday evening blog post, the company shared the details of its AWS-based Hadoop architecture and a homemade Hadoop Platform as a Service that it calls Genie.
That Netflix is a heavy Hadoop user is hardly news, though. In June, I explained just how much data Netflix collects about users and some of the methods it uses to analyze that data. Hadoop is the storage and processing engine for much of this work.
As blog post author Sriram Krishnan points out, however, Hadoop is more than a platform on which data scientists and business analysts can do their work. Aside from a 500-plus-node cluster of Elastic MapReduce instances devoted to that kind of analysis, there’s another equally sized cluster for extract-transform-load (ETL) workloads — essentially, taking data from other sources and making it easy to analyze within Hadoop. Netflix also deploys various “development” clusters as needed, presumably for ad hoc experimental jobs.
And while Netflix’s data-analysis efforts are interesting in their own right, running Hadoop in the cloud makes the architecture itself worth a look. For starters, Krishnan explains how using S3 as the storage layer instead of the Hadoop Distributed File System means, among other things, that Netflix can run all of its clusters separately while sharing the same data set. It does, however, use HDFS at some points in the computation process to offset the inherently slower access to data stored in S3.
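To make that split concrete, here is a minimal, hypothetical sketch (using today's boto3 SDK, not anything Netflix has published) of an Elastic MapReduce job flow that reads shared input from S3, stages intermediate output on the cluster's local HDFS, and writes final results back to S3 where any other cluster can read them. The bucket names, scripts, and instance settings are all made up for illustration.

```python
# Illustrative only: an EMR job flow whose steps read the shared data set from S3,
# keep intermediate output on the cluster's local HDFS, and publish final results
# back to S3. Bucket names, script names, and instance settings are hypothetical.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

steps = [
    {
        "Name": "etl-stage",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://example-bucket/code/parse_logs.py,s3://example-bucket/code/aggregate.py",
                "-input", "s3://example-bucket/raw-logs/2013-01-10/",   # shared data set lives on S3
                "-output", "hdfs:///tmp/etl-intermediate/",             # intermediate data on faster local HDFS
                "-mapper", "parse_logs.py",
                "-reducer", "aggregate.py",
            ],
        },
    },
    {
        "Name": "publish-results",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-input", "hdfs:///tmp/etl-intermediate/",
                "-output", "s3://example-bucket/reports/2013-01-10/",   # final output back to S3 for other clusters
                "-mapper", "cat",
                "-reducer", "cat",
            ],
        },
    },
]

response = emr.run_job_flow(
    Name="example-etl-cluster",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=steps,
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started job flow:", response["JobFlowId"])
```

The underlying idea, as Krishnan describes it, is that S3 holds the durable copy of the data that every cluster shares, while HDFS serves as faster working storage during computation.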
Netflix also built its own PaaS-like layer for Amazon Elastic MapReduce, which it calls Genie. Genie lets engineers submit jobs via a REST API without having to know the specifics of the underlying infrastructure. This matters because Hadoop users can submit jobs to whichever clusters happen to be available at any given time (Krishnan goes into some detail about the resource-management aspects of Genie) without worrying about the sometimes-transient nature of cloud resources.
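For a sense of what that looks like from an engineer's seat, here is a hypothetical sketch of submitting a job to a REST-based job service in the style of Genie. The endpoint, payload fields, and status values below are assumptions made for the example, not Genie's actual API.

```python
# Illustrative only: submitting a Hadoop job through a REST-based job service
# in the style of Genie. The URL, payload fields, and status values are
# assumptions for this sketch, not Genie's actual API.
import time

import requests

JOB_SERVICE_URL = "http://genie.example.com/api/jobs"  # hypothetical endpoint

job = {
    "name": "daily-viewing-report",
    "type": "hadoop",
    # The caller describes *what* to run; the service picks *which* cluster
    # runs it, so transient clusters stay invisible to the submitter.
    "commandArgs": "jar reports.jar --date 2013-01-10",
    "tags": ["team:data-science", "sla:standard"],
}

resp = requests.post(JOB_SERVICE_URL, json=job, timeout=30)
resp.raise_for_status()
job_id = resp.json()["id"]
print("Submitted job:", job_id)

# Poll until the service reports a terminal state.
while True:
    status = requests.get(f"{JOB_SERVICE_URL}/{job_id}", timeout=30).json()["status"]
    if status in ("SUCCEEDED", "FAILED", "KILLED"):
        print("Job finished with status:", status)
        break
    time.sleep(30)
```

The point of the abstraction is the one Krishnan makes: the submitter never names a cluster, so the service is free to route work to whatever capacity exists at that moment.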
We’ve long been pushing the intersection of big data and cloud computing, although the reality is that there aren’t really a lot of commercial options that mix user-friendliness and heavy-duty Hadoop workload management. There’ll no doubt be more offerings in the future — Infochimps and Continuuity are certainly working in this direction, and Amazon is also pushing its big data offerings forward — but, for now, leave it to Netflix to build its own. (And if you’re interested in custom-built Hadoop tools, check out our recent coverage of Facebook’s latest effort.)