My consulting work centers on the strategic side of analytics, so part of my job is to meet with companies that are just getting started with data. It’s interesting to talk with them in those early stages, as I get to ask (and get asked in return) a lot of important questions that will have long-term impact. I hear some questions so often that I figure I should write down the answers and share them here. This will be the first such post, and the question is:
“Do I need a Hadoop cluster?”
I’ll walk you through some key considerations so you can answer that question for yourself. Much of what I’ll say similarly holds for investing in other Big Data tools; so even if you’re not pondering a Hadoop install – maybe you’re thinking of Cassandra, or MongoDB, or the like – read on. This may save you some money and headaches down the line.
Back to the question at hand: do you need a Hadoop cluster? Sometimes people phrase this as: “how much data do I need to merit a Hadoop cluster?”
On the one hand, I tell these people they’re sharp to phrase it in terms of their data needs. That’s a good start.
On the other hand, I tell them that they oversimplify the issue by thinking only of data size. Consider other important decisions, and ask whether you’d be comfortable moving based on just a single factor. For example: “How large a team should I have to merit renting this office space?” You’d also do well to consider where the rest of your team lives and how they’d get to the office, the quality of the office space and surrounding neighborhood, or whether you should have an office at all. Even after you take these and other considerations into account, you’ll still be left with a judgment call.
See what I mean? It’s both more complex than just “data size,” and also messier than a simple “yes” or “no.”
As with many questions, you need to determine the relevant factors and how to weight them, if you’re going to get a useful answer.
When it comes to the Hadoop decision, then, what are those factors? The list here is hardly exhaustive, but these broad strokes should take you pretty far:
Business use cases: If you’re starting with tools, you’re already moving in the wrong direction. Start with business use cases. Understand what problems you want to solve, then go shopping for tools that fit those problems. That means exploring your organization’s data needs, lining up a few different solutions, and seeing which one(s) will best address the most use cases.
Incumbent solutions: Unless you started collecting and analyzing data just last week, you’ve likely identified those use cases already and have implemented solutions – homegrown or third-party – to address them. Are you happy with how those solutions are working? Do compute jobs complete in a reasonable amount of time? Are you able to solve your known problems with those solutions? If you’re hearing lots of “yes” here, then Hadoop will be a tougher sell. Why pay for a cluster, pay to train your team, pay to modify internal routines to run on Hadoop … just to end up right where you were before?
Computational needs: Related to that previous point, let’s say you currently run all of your data crunching on a single, hefty machine. If your jobs are taking longer and longer to complete, or you project you’ll soon outgrow that single machine, don’t run for the cluster just yet. Ask yourself: “what if we just buy a bigger machine?” Some consider it a sin to throw hardware at the problem; but if a new, fatter machine will get your business through the next year or two, it’s a fair investment. Then again, if you project your compute needs are about to grow by a wide margin, well beyond what’s possible with a single machine, that’s starting to sound like a Hadoop possibility. One of Hadoop’s strong points is its easy, long-range scalability. Just pop another box or two into the cluster, and you’ve instantly got more power.
Hadoop model compatibility: Does your problem fit Hadoop’s compute models? In my Hadoop talks, I used to say: “if you can’t express your problems in terms of MapReduce, Hadoop won’t be a good fit.” MapReduce and Spark now share space in Hadoop’s computation layer, but the general idea still holds: if, deep down, you can’t decompose that problem into small, independent units of work – if you need a single path of code to see the entire dataset at once – you won’t make use of Hadoop’s distributed computing muscle.
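To make that “independent units of work” idea concrete, here’s a minimal sketch in plain Python (not an actual Hadoop job) of the classic word-count problem, which decomposes cleanly into map tasks that each see only one document, plus a reduce step that merges their results. The function names and sample documents are illustrative, not part of any Hadoop API:

```python
from collections import Counter

def map_task(document: str) -> Counter:
    # Each map task sees only its own slice of the data,
    # so many tasks can run in parallel on separate machines.
    return Counter(document.lower().split())

def reduce_task(partial_counts) -> Counter:
    # The reduce step merges the independent partial results.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

documents = ["the quick brown fox", "the lazy dog", "the fox"]
word_totals = reduce_task(map_task(doc) for doc in documents)
print(word_totals["the"])  # each map task counted "the" independently
```

If your problem can’t be split up this way, and a single path of code truly needs the whole dataset at once, that’s the warning sign described above.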
Your budget: How much money do you have to spend right now? Hardware keeps getting better every year, but a decent, production-grade cluster is still far from free. Don’t forget that you’ll also need to have Hadoop administrators, and optionally some third-party support. Add in training for your developers and analyst staff, plus the time and dollar cost to convert any existing processes to Hadoop jobs, and you’re looking at a price tag beyond the cost of the cluster hardware. Do you have the money to spend on this right now?
Whether you are cloud-able: One of the perks of using a cloud-based Hadoop cluster – whether you build it yourself, or use a hosted service such as Elastic MapReduce – is that it costs you nothing to simply get rid of it should you determine the use cases no longer fit. That’s quite a savings of dollar cost and headaches compared to trying to ditch the hardware for a self-hosted, on-site cluster. (The cloud option may also cost less than building and running an on-site cluster, but the numbers don’t always work out that way. The main benefit to the cloud-based cluster is the instant “walk away” option.) The catch is that regulatory issues may prohibit you from using cloud-based services, so check with your legal department before you make this move.
Your data size today … and tomorrow … and beyond: Finally, we get to data size. While I try to convince people that size is not the only factor to consider, it’s still an important one. What are your projections for your company’s data growth? Hadoop really shines when you anticipate large, year-over-year increases in data volume. If you project only a mild increase in your storage needs, and your data fits on a single machine’s disk (or even a small RAID array), then Hadoop is a less attractive prospect.
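A quick back-of-envelope projection can help you reason about that growth question. This sketch uses made-up numbers (2 TB today, 50% annual growth) purely for illustration; plug in your own figures:

```python
# Hypothetical capacity projection -- the starting volume and growth
# rate below are illustrative assumptions, not benchmarks.
current_tb = 2.0      # data volume today, in terabytes
annual_growth = 0.5   # assumed 50% year-over-year growth

projected = current_tb
for year in range(1, 4):
    projected *= 1 + annual_growth
    print(f"Year {year}: {projected:.1f} TB")

# If the year-three figure still fits comfortably on one machine's
# disks (or a small RAID array), Hadoop's scale-out storage is a
# less compelling reason to build a cluster.
```

Even a rough projection like this turns “we have a lot of data” into a number you can weigh against single-machine storage.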
The list goes on, but this is a good start. Sharp-eyed readers will note that the common thread here is business need. Not every factor will apply to every situation; but if you’re considering a Hadoop investment for your business, use your business plans as a guide and you’ll keep yourself on the right path.
Hopefully I’ve convinced you to have a long think before you rush out to build a Hadoop cluster. Hadoop has earned its reputation as an important data platform, and with good reason, but it’s not the best fit for every company.
What other criteria did you use to decide whether to use Hadoop (or Cassandra, or any other data tool)? Tell me, and I’ll summarize the results in a future post.
Are you still working through this Hadoop-or-not decision for your company? As a consultant, I can guide you on this and other strategic matters around data. [Contact me](/contact/) to get started.