Monday, May 13, 2013

Snake oil peddlers

Quartz has a great article about big data. It includes figures on the size of the data-processing jobs that run on clusters at Yahoo and Facebook. If those guys don't need clusters for "big data", why do smaller companies?

I'm not saying that some companies don't. But it's a simple question that should be answered before you go down that path: why do you need a multi-node cluster running MapReduce to process a few gigabytes of data? If you can't answer that question, then you are probably just wasting money on servers, and even more money on developers to build frameworks on those servers.
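To make the point concrete, here is a rough sketch of the kind of job I mean, done on one machine with no cluster at all. It is my own illustration, not something from the Quartz article or the paper. The tab-separated layout, the key column, and the function names are all assumptions made up for the example.

```python
from collections import Counter


def count_keys(lines, field=0, sep="\t"):
    """Tally the value found in column `field` of each line.

    This is the "map" (pull out a key) and the "reduce" (sum per key)
    of a typical counting job, done in a single pass in one process.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip("\n").split(sep)
        if len(parts) > field:  # ignore rows too short to have the column
            counts[parts[field]] += 1
    return counts


def count_keys_in_file(path, field=0, sep="\t"):
    # Hypothetical helper: iterating over the file object streams it line
    # by line, so memory grows with the number of distinct keys rather
    # than with the size of the file.
    with open(path, encoding="utf-8") as f:
        return count_keys(f, field, sep)
```

Because the file is streamed rather than loaded, a few gigabytes of input is not a problem for a single server as long as the set of distinct keys fits in memory.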

Architecture should be as simple as possible.

The backing paper has a great summary: "...analytic jobs — in particular Hadoop MapReduce jobs — are often better served by a scale-up server than a scale-out cluster"

...

It seems to me that somewhere along the line, the "cloud" went from being service-oriented to being data-oriented. Having a cluster that provides services is an incredibly useful thing, especially if those services are accessed in a standard way. Both Amazon and Google have infrastructures like this: Google App Engine, although not my favorite, does exactly this, as does Amazon Web Services.

Those are clusters that run many people's services all on the same hardware. Instead of having a cluster of machines that are all focused on one giant problem, you have a cluster that solves many problems at once.

What many people seem to be selling is the quest for the holy grail of data: take all the data, run a giant complex algorithm on it, and all of the answers will be made clear. In other words, snake oil.

I keep MapReduce on my resume - it's a great paradigm to know. But I am very cautious about how I use that technology - there are just too many companies using Hadoop as a catch-all answer for every problem.
