My work covers a broad spectrum, and I have experience with many technical tools. I prefer a solutions-centric approach to problems, but I do receive inquiries about particular roles or technologies. I'll highlight some of those here for easy reference.
R and Hadoop
I'm co-author of Parallel R: Data Analysis in the Distributed World, a book on strategies for R parallelism. Half of the book explores using Hadoop as a means to drive R in big-data and big-compute scenarios. Since then I've delivered several talks on the Hadoop/R strategies, covering both the low-level technical detail and high-level strategic vision of blending Hadoop and R.
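Hadoop is a powerful platform for spreading work across a cluster of machines. When you run R through Hadoop, you can apply R's analysis tools to data sets and workloads that are too large for a single machine.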
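To make the idea concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that Hadoop applies across a cluster. It is written in plain Python and runs on one machine only, so it illustrates the shape of a job rather than Hadoop itself. The `key,value` record format and the function names (`map_record`, `shuffle`, `reduce_group`, `run_job`) are assumptions made for this example; in an R-on-Hadoop job, the map and reduce steps would be R code.

```python
from collections import defaultdict
from statistics import mean


def map_record(line):
    """Map step: parse one 'key,value' line into a (key, float) pair."""
    key, value = line.split(",")
    return key.strip(), float(value)


def shuffle(pairs):
    """Shuffle step: group values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: summarize all values seen for one key (here, the mean)."""
    return key, mean(values)


def run_job(lines):
    """Run the three steps in sequence on a single machine."""
    pairs = (map_record(line) for line in lines)
    groups = shuffle(pairs)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical sample records: region name and a measurement.
    sample = ["east,10", "west,4", "east,20", "west,6"]
    print(run_job(sample))
```

On a cluster, the map and reduce steps run in parallel on many machines, which is where the large-scale capability comes from.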
Are you curious about using Hadoop to drive your large-scale R analyses? Please contact me. In particular, I could help you:
- determine whether a mix of R and Hadoop will help solve your problems. (Hadoop is not a solution to every problem, nor is R always the best choice for working under Hadoop.)
- design your proof-of-concept. (Problems during these early stages risk derailing the entire effort.)
- answer certain technical questions concerning the implementation. (Even if the big picture is a fit, there are still some technical issues to consider.)
I'm interested in data infrastructure in general, and Hadoop in particular. Elastic MapReduce (EMR) is the Amazon Web Services hosted Hadoop platform. I've delivered talks and tutorials on this topic (using my own teaching materials) and could help you put EMR to work in your company.
Many companies want Hadoop's power, but the cost of an on-site, self-managed cluster can be quite a shock. A cloud-based, on-demand cluster dramatically reduces Hadoop's barrier to entry.
Do you want the power of Hadoop, but find an on-site cluster is not cost-efficient? Are you considering Hadoop in the cloud? Let me know. I could help you:
- confirm that Hadoop will solve your problem. (Hadoop isn't always the right tool.)
- determine whether you'd be better off using EMR, a self-managed cluster on EC2, or a local (in-house) Hadoop cluster
- design your EMR proof-of-concept project
I have several years' experience designing and developing software. A topic of special interest to me is the design of asynchronous, message-driven systems. I've worked hands-on with JMS and other Java-based messaging systems (including ActiveMQ). More recently, I've developed and delivered tutorials on Amazon Simple Queue Service (SQS), the Amazon Web Services hosted message-queuing service.
Done well, message-driven systems can support a variety of robust, scalable applications. They also offer "free" concurrency, because producers and consumers run independently of one another. (Contrary to popular belief, async messaging is not just for trading systems.)
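As a rough illustration of that "free" concurrency, the sketch below uses Python's standard `queue` and `threading` modules. The in-process queue is only a stand-in for a real message broker such as SQS or ActiveMQ, and the names `worker` and `process`, along with the squaring "work", are hypothetical choices for this example. The producer only puts messages on the queue, and adding consumers requires no change to the producer.

```python
import queue
import threading


def worker(inbox, results):
    """Consume messages from the inbox until a None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:
            break
        # Placeholder for real work: square the incoming number.
        results.put(message * message)


def process(messages, num_workers=3):
    """Send messages to a pool of independent consumers and collect results."""
    inbox = queue.Queue()
    results = queue.Queue()

    workers = [
        threading.Thread(target=worker, args=(inbox, results))
        for _ in range(num_workers)
    ]
    for thread in workers:
        thread.start()

    # The producer knows nothing about the consumers; it only enqueues.
    for message in messages:
        inbox.put(message)

    # One sentinel per worker tells each consumer to shut down.
    for _ in workers:
        inbox.put(None)
    for thread in workers:
        thread.join()

    # All workers have finished, so the results queue is complete.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)


if __name__ == "__main__":
    print(process([1, 2, 3, 4]))
```

Results arrive in no guaranteed order, which is why the sketch sorts them before returning. Handling that kind of ordering question is part of designing a message-driven system well.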
Are you exploring a message-driven architecture for your applications? Contact me. I know where asynchronous messaging works, and where it doesn't. That means I could help you:
- understand where async messaging fits in your world, or whether it would fit at all. (It's powerful when it works, and powerfully messy when it does not.)
- design disconnected application architectures
- apply SQS, and see whether it would be a suitable alternative to a self-hosted messaging product