(This is the second post in the "I get this question a lot" series. Feel free to check out the first one, "Do I Need a Hadoop Cluster?")
"Data scientist" is supposed to be the sexy job of the century, and "data science" -- or Big Data, or analytics, or whatever term you prefer -- is supposed to advance every business by leaps and bounds. There's truth in both statements, but I sometimes meet people who insist on going about this the hard way. Two key (and closely related) mistakes companies make are to assume they need only data scientists, and to try to hire several data scientists before there's enough work for them to do.
These problems are two sides of the same coin, and you can solve them both by understanding the various roles of a data science team. In the same way an operating theatre needs more than surgeons, and a trading firm needs more than traders, a successful data science effort needs more than data scientists. A proper analytics team involves several roles, each with different responsibilities.
What, then, are the roles you need for a well-rounded data science team and successful analytics efforts?
Champion/Sponsor - Someone must take leadership-level responsibility and be a driving force to make analytics an important issue in the company. They must also stand firm if other company leadership is dubious as to the value or possibilities. This person is most likely the company's CEO or CTO.
Data Strategist - Helps the Champion/Sponsor and Data Lead (described next) set the strategic vision for the use of data in the company, explaining how to align data analysis with business need and laying out the road map of how to proceed. This role is important for any analytics effort, and critical for firms embarking on their first data journey. Depending on their experience, the Champion/Sponsor or Data Lead may fulfill this role. They may also choose to engage an outside consultant to take the helm.
Data Lead - Responsible for building and managing the data team. They work with the Champion/Sponsor and Data Strategist to align analytics to the company mission and set direction.
Data Scientist - Runs analyses, develops algorithms, and otherwise transforms raw data into actionable insights.
Data Engineer - Builds and maintains data pipelines, to manage data delivery, storage, and quality. This role is critical to a successful data practice, as worthwhile analysis is impossible without it.
Tool Admin - Manages those Hadoop clusters, Cassandra installs, and other special systems for storing and crunching data. (Despite what some folks wish to believe, managing such a system in a production-ready state involves far more than simply, "follow the default install docs.") This work may fall on existing members of your IT ops team, but could just as easily exist under a separate umbrella over time.
IT Staff (developers and ops) - Works with Data Engineers and Data Scientists to collect data, implement data-related app features, and otherwise weave data findings into your organization's existing homegrown applications and IT infrastructure.
Everyone here works to put data to use for the Customer, who represents some business function or business unit (or even several business units). While the Customer is not part of the analytics team, per se, their needs ultimately determine why the company even needs an analytics team and what that team will do.
If you've been counting along with me, the Data Scientist is one of seven key roles and sits in the middle of this list. Data scientists are responsible to the roles ahead of them in the list, and are customers of the roles that follow them.
That last point bears special mention. It's why the list of roles includes your IT staff. Your first forays into data may be small, isolated affairs that have minimal interaction with the rest of your IT staff; but over time, expect your data efforts to become regular patrons of your existing infrastructure and tap into your app/dev stack.
Seven roles sounds like quite a menagerie. Some might say that is simply too many people for a company's first analytics project. Maybe. But remember, these are roles, not people. (Likely, though, "IT Staff" already exists as a group of people.) In the early proof-of-concept stages one person will likely take multiple roles. You simply won't have enough data work to engage seven people on a full-time basis. Also, taking such a lightweight approach will permit you to move quickly and nimbly, both of which are key elements of proof-of-concept projects.
Take care in how you assign these roles to people. Consider business need, but also align according to skills and incentives. I've seen it done the other way -- the Data Scientist who spends most of their time playing Data Engineer, to the point they can't do the analysis work; the company that wants a Data Scientist who will also manage the production-grade Hadoop cluster -- and it's a recipe for unhappiness.
Most of all, prepare for growth: make it easy for someone to spin out one of their roles to another person when the time comes.
Let's say your company has identified a Customer's need and is ready to explore its first data effort. You'll probably go far on a single, well-rounded Data Scientist and a Champion/Sponsor, plus a little help from the in-house IT Staff. When I say "well-rounded" Data Scientist, I emphasize someone who has the skills to fulfill the Data Engineer role and handle some minor Tool Admin responsibilities. They can split the Data Lead role with the Champion/Sponsor.
Expect the company to quickly develop an appetite for data. This will saturate the first Data Scientist with work, and it may be tempting to simply hire more Data Scientists to handle the workload. Not so fast! See whether your lone Data Scientist has been filling any of these other roles, and bring in new people accordingly.
Most likely, you'll first need to spin out the Data Engineer work from the Data Scientist. You'll also need more involvement from your IT staff, both the ops crew to set up regular data exports and your developers to add new data collection taps in your homegrown apps. Also, over time, the Tool Admin duties will spin out from their existing owner (be it the first Data Engineer or the IT ops crew) into a separate person. Finally, the first Data Scientist would ideally have leadership potential such that they could manage the analytics team as it grows.
It can be exciting to launch data science efforts in your company. Just remember that you'll need more than data scientists to make this happen, and you'll need to grow the team in the proper order. Build a well-rounded team, both in terms of technical roles and leadership abilities, to improve your chances of success.
I offer consulting services on just this matter -- everything from data strategy to serving as interim analytics lead -- and would be keen to hear how we could work together. Contact me to get started.
Many thanks to Ken Gleason, Tim Knight, and Joshua Ulrich for reviewing this article.