Tuesday, March 06, 2012

Filling The Pipeline

My past few work engagements have centered on cloud computing and big data - everything from managing large data centers to machine learning to map/reduce clusters. At VMworld 2011 last year I took the opportunity to ask the "Big Compute and Big Data Panel" about leveraging vector processing hardware such as NVIDIA's Tesla for data processing. The five panelists (Amr Awadallah of Cloudera, Clint Green of Data Tactics, Luke Lonergan of EMC, Richard McDougall of VMware and Paul Kent of SAS) largely agreed on a few main sticking points in vector processing for massively parallel systems:
  • The toolset is still relatively immature (maybe three years behind general CISC architectures)
  • The infrastructure has not yet reached commodity level
  • Big Compute works well with vector processing clusters, but big data does not, since the latter is all about data locality rather than in-memory processing
  • Commodity GPU processing is greatly constrained by memory paging - there's too much latency in transferring large in-memory datasets to GPU memory.

AMD had a few interesting announcements over the past few weeks that may pave the way for making cloud and big data/compute clusters more efficient and more "commodity." The first is their acquisition of SeaMicro, whose emphasis is around massively parallel, low-power computing cores with high-speed interconnects. This addresses one big issue brought up during the panel - that interconnects on big data clusters are going to become a prevailing issue as data needs to be transferred across nodes more rapidly to keep otherwise idle compute resources busy. CPUs can't crunch data sets if the data takes forever to arrive over the wire.

The next big announcement, which may be a huge sleeper hit, is AMD's unified memory architecture that's supposed to arrive in 2012. The slide on AnandTech shows that in AMD's 2012 product line the "GPU can access CPU memory," which is a HUGE development in vector processing. Imagine a data set being loaded in 64 GB of main memory, having 8 CPU cores clean the data using branch-intensive algorithms and then that same in-memory dataset being transformed by 512 stream processors. That kind of compute power without the need to stream data across a PCI-E bus could be a really, really big deal.

Still, the issue that remains is the tooling available to make this happen. Very likely a developer would need to write generic C code for the branch-intensive passes and then launch a separate OpenCL kernel to transform the data, while sharing memory pointers so that nothing has to be swapped or paged out. In a world full of enterprise software developers, that kind of software engineering agility isn't easy to find. If Cloudera were able to unleash this kind of power, AMD would have a big hit on their hands. Maybe AMD needs to start looking towards Cloudera as the final stage in the pipeline - an open-source framework that unlocks the potential of their infrastructure.
