Showing posts with label vector processing.

Monday, January 05, 2015

Telling a Tale for Ten Years

Exactly ten years ago I started this ridiculous blog as a way to collaborate with other game developers. At the time I thought I would dig deep and push out at least one title. This eventually led to the "Desktop Distractions" studio concept and the nascent title Deskblocks. I also worked within the CrystalSpace engine as a contributor to the PlaneShift team. I just couldn't give these projects traction, however, so I gave it up and moved on to tinkering.

Even though the driving force behind the blog had faded away, and even though no one else reads this blog (aside from the State of .NET Integration Frameworks post), I kept updating it. Writing - even if it exists only for your own edification - really does help with communication and critical thinking regardless of what you write about. Even though my day job has nothing to do with garage door openers, my posts on my garage door security system helped me organize the build in a way that informed the Hack Clock project. Back in 2006 I began investigating vector processing, and the resulting frameworks have helped me think about and design microservice architectures. Of course, there was plenty of venting as well with my favorite software companies being dissolved or SuSE Linux winning and failing and winning and failing again. All of this writing helped me when performing comparative analysis at work, or designing parallel architectures, or watching trends in software development.

It is hard to believe a decade has slipped by. It doesn't even seem real. I don't think I've evolved much since that one cram session in a crystal chair, but I'm glad to have my collected ramblings to reflect back on.

Monday, December 02, 2013

Massively Parallel Compute as a Service

Back in the Spring of 2012, I asked several panelists at VMWorld to weigh in on vector processing with GPUs as a big data/big compute solution. The response was a resounding "not yet": the infrastructure had not yet reached commodity level, and GPU processing was greatly constrained by memory paging. It now seems like both obstacles are being removed.

Amazon Web Services is now offering "G2" EC2 instances that expose virtualized NVIDIA Kepler GPUs. These instances support H.264 encoding, OpenCL, CUDA and OpenGL, which lets relatively mature toolsets build apps targeted at vector processing. Commodity toolchains running on commodity infrastructure allow for massively parallel processing on demand.

Memory paging should soon be addressed by NVIDIA via CUDA 6, and by AMD with its upcoming Kaveri architecture. Once memory addressing is unified, swapping memory regions between host and GPU should become unnecessary, allowing memory to be addressed locally without paging. This simplifies application development, virtualization and hardware architectures considerably.

I believe that very soon we will see vector processing at scale garner as much attention as map/reduce clusters currently do. Massive data parsing has been commoditized, and now we have an opportunity to commoditize massive algorithmic crunching.

Tuesday, March 06, 2012

Filling The Pipeline

My past few work engagements have been centered around cloud computing and big data - everything from managing large data centers to machine learning to map/reduce clusters. When I was at VMworld 2011 last year I took the opportunity to ask the "Big Compute and Big Data Panel" about leveraging vector processing hardware such as NVIDIA's Tesla to do data processing. The five panelists (Amr Awadallah of Cloudera, Clint Green of Data Tactics, Luke Lonergan of EMC, Richard McDougall of VMware and Paul Kent of SAS) largely agreed on a few main sticking points in vector processing for massively parallel systems:
  • The toolset is still relatively immature (maybe three years behind general CISC architectures)
  • The infrastructure has not yet reached commodity level
  • Big Compute works well with vector processing clusters, but not big data, since the latter is all about locality rather than in-memory processing
  • Commodity GPU processing is greatly constrained by memory paging - there's too much latency in transferring large in-memory datasets to GPU memory.
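That last constraint is easy to see with a back-of-envelope sketch (the figures below are my own rough approximations, not numbers from the panel):

```python
# Rough illustration of the memory paging bottleneck: a large
# in-memory dataset has to cross the PCI-E bus before the GPU can
# touch a single byte. Bandwidth figure is approximate (PCIe 2.0 x16).
dataset_gb = 16.0        # dataset sitting in host memory
pcie_gb_per_s = 8.0      # ~8 GB/s effective PCI-E bandwidth

transfer_seconds = dataset_gb / pcie_gb_per_s
print(transfer_seconds)  # 2.0 seconds of dead time per transfer
```

Two seconds per pass doesn't sound like much until an iterative algorithm has to shuttle that dataset back and forth on every pass.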

AMD had a few interesting announcements over the past few weeks that may pave the way for making cloud and big data/compute clusters more efficient and more "commodity." The first is their acquisition of SeaMicro, whose emphasis is around massively parallel, low-power computing cores with high-speed interconnects. This addresses one big issue brought up during the panel - that interconnects on big data clusters are going to become a prevailing issue as data needs to be transferred across nodes more rapidly to keep otherwise idle compute resources busy. CPUs can't crunch data sets if the data takes forever to arrive over the wire.

The next big announcement, which may be a huge sleeper hit, is AMD's unified memory architecture that's supposed to arrive in 2012. The slide on AnandTech shows that in AMD's 2012 product line the "GPU can access CPU memory," which is a HUGE development in vector processing. Imagine a data set being loaded in 64 GB of main memory, having 8 CPU cores clean the data using branch-intensive algorithms and then that same in-memory dataset being transformed by 512 stream processors. That kind of compute power without the need to stream data across a PCI-E bus could be a really, really big deal.
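The workflow imagined above can be sketched in plain Python (the data and logic here are invented for illustration): phase one is branch-heavy cleaning suited to CPU cores, phase two is a uniform, branch-free transform that maps naturally onto stream processors, and both phases touch the same buffer, which is exactly what a unified memory space would buy you.

```python
# Hypothetical sketch of the CPU-then-GPU workflow over one shared
# buffer; in a unified memory architecture nothing below would need
# to be copied across the PCI-E bus between phases.
data = [3.5, -1.0, None, 7.25]

# Phase 1: branchy cleaning (per-element conditionals -> CPU cores)
cleaned = [0.0 if v is None or v < 0 else v for v in data]

# Phase 2: uniform transform (same op per element -> stream processors)
transformed = [v * 2.0 for v in cleaned]

print(transformed)  # [7.0, 0.0, 0.0, 14.5]
```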

Still, the remaining issue is the tooling available to make this happen. Very likely a developer would need to write generic C code to do the branching and then launch a separate OpenCL app to transform the data, while still sharing memory pointers so that nothing has to be swapped or paged out. In a world full of enterprise software developers, this kind of software engineering agility isn't exactly easy to find. If Cloudera were able to unleash this kind of power, AMD would have a big hit on their hands. Maybe AMD needs to start looking towards Cloudera as the final stage in the pipeline - an open-source framework that unlocks the potential of their infrastructure.

Friday, December 12, 2008

And So It Begins - Vector Processing Standards Battle!

I'm not sure why I'm such a fan of this topic... maybe just because I enjoy watching the inevitable march towards entirely new CPU architectures.

Nvidia just released something that's been on Apple's wishlist for a while: OpenCL 1.0. Finally a "standard", royalty-free specification for developing applications to leverage vector processing units currently available on GPUs. While the processors on such high-end video cards aren't geared towards general computing per se, they absolutely blaze through certain workloads - especially those that work through sequential processing pipelines.

Microsoft's competing specification is, for some reason, available for DirectX 11 only, which makes absolutely no blimmin' sense to me. This basically means the specification is limited to Vista... which rather defeats the concept behind a "standard." Not only do you get vendor lock-in, you get implementation lock-in. Sweet.

Can you imagine what this might do for Nvidia tho? Picture it now: tons of cheap commodity motherboards laid end-to-end, each with six PCI-E slots filled to the brim with Nvidia cards and running OpenCL apps on a stripped-down Linux distro. Supercomputing clusters for cheap.

Although I imagine the electricity bill might suffer.

Sunday, March 30, 2008

Intel Not Killing VPU After All

Looks like Intel isn't killing the VPU after all, but instead birthing it. Larrabee, their GPU/HPC processor, is supposedly an add-in proc slated for 2009/2010. I'm going to go out on a limb, though, and say it will probably become part-and-parcel of their mainline CPU and, instead of being a discrete co-processor, will quickly be absorbed as additional cores of their consumer processor line. But I digress.

Additional information about Larrabee continues to trickle out, but it definitely seems to introduce vector processing instruction sets meant for general computing, not just graphics.

Even if this comes out as a daughterboard or discrete chipset, it should be a compelling reason to pick up a good assembly programming book and start hacking again. How long will it take (non-Intel) compilers to optimize for the vector instruction sets?

Sunday, February 17, 2008

Make a VPU Socket Already! Get It Over With!

My lands. If the floating point unit took this long to become mainstream, I'd still be using a Core 2 Duo with a math co-processor.

Not to go all Halfhill or anything, but a vector co-processor or on-die VPU has looked like an immediate need for at least the past two or three years. Both Nvidia and AMD are bringing GPU's closer to the CPU, and AMD's multi-core platform has appeared to fold VPU/FPU/integer math/memory controller/CISC/RISC/misc./bacon&swiss together, taking many types of tasks and integrating them under one die's roof.

And now that Nvidia has wisely acquired AGEIA and their PhysX platform, it seems a general purpose vector processing platform is getting closer. A standalone PhysX card never took off in the consumer marketplace, and purchasing another Radeon or GeForce just for physics processing (as both AMD and Nvidia were touting at electronics expos) never caught on either. But a generalized, open-platform physics API that takes advantage of unused GPU cycles would definitely catch on. Spin your GPU fan faster and get real-time smoke effects... sign me up.

Nvidia has been extremely forward-thinking with their Linux drivers, and I hope they continue to be trend setters with the PhysX API. The PhysX engine and SDK were made freely available for Windows and Linux systems prior to Nvidia's acquisition, but hardware acceleration currently only works within Windows. Since Nvidia is porting PhysX to their own CUDA programming interface, it seems entirely probable that the Linux API would plug into Nvidia's binary-only driver. And why not release the PhysX API under the GPL? They could port to CUDA (whose specification is already open, available and widely used) and then reduce their future development efforts by letting a wide swath of interested engineers maintain the codebase as needed.

Widely available drivers, development kits and APIs will help drive hardware sales in an era where Vista w/ DirectX 10 adoption isn't exactly stellar. I won't invest in being able to run Crysis in DX10 at the native resolution of a 22" LCD, but I will invest to get more particle effects or more dynamic geoms. At that point you're adding to the whole gameplay proposition instead of polishing up aesthetics for continuously diminishing returns.

Saturday, November 18, 2006

AMD's Multi-Core Not Just Multiple Cores

Hopefully this will put an end to my incessant postings on this topic. It appears AMD is indeed developing a processor with vector units, and using them in the actual processor pipeline instead of reserving them for graphics processing. This means their CPU will be able to crunch straight-line algorithms (like Folding@Home) much more easily, by virtue of an on-die unit that caters to that type of processing.

Sunday, October 22, 2006

Nvidia's CPU

Now that ATI and AMD are one, Nvidia is working on a CPU as well. There are a lot of people saying they're working on a "GPU on a chip," but I seriously doubt it. It seems a lot of people are wagering this will go toward the embedded or low-power market, just like when VIA acquired Cyrix. But to do so would be extremely narrow-minded of ATI and Nvidia, even considering the boom of embedded devices and the increasing horsepower needed by set-top boxes.

I'm much more inclined to think that Nvidia is working with Intel (which would be a huge surprise, considering how they've been fighting over SLI on nForce vs. Intel chipsets) to compete with AMD & ATI, who are working on bringing vector processing units to CPU's.

I incessantly bring this topic up, but I had to mention this since it seems that my predictions are actually going to happen. I don't necessarily think this will cause GPU's-on-CPU's to happen, although it might be a by-product. Instead I think this is going to allow for increased parallelization, faster MMX instructions (or 3DNow! if that's your taste) and a movement of putting the work of those physics accelerator cards back on the CPU die.

We may see a CPU with four cores, each with integer and floating point units, then a handful of separate vector processors (a la IBM's Cell) alongside them. Given how the Cell reportedly absolutely sucks for many types of algorithms, this may give developers the best of both worlds. Rapidly branching and conditional logic can be done on the integer units with branch prediction and short instruction pipes, while long, grinding algorithms can go to the vector processors. This has already worked for projects like Folding@Home, and could work for many similar algorithms.

Or AMD/ATI and Nvidia could just go for stupid, embedded GPU's sitting on-die with the CPU. But they'd be passing up something much cooler.

Thursday, September 07, 2006

Exactly How Many PCI-E Slots Does One Need?

I'm pretty much hammering this subject into the ground, but then again so are component makers.

Recently we saw Ageia's physics accelerator come to market as an add-on expansion card, astride your existing video accelerator and sound card. Next we saw the Killer NIC reviewed by IGN, a "network interface accelerator" which promises to offload the assembly of TCP and UDP packets to take load off the CPU. It's actually an embedded Linux instance, assembling your TCP stack instead of Windows. Windows just sends the Killer NIC raw datagrams, then the Killer NIC does all the work of turning those datagrams into signals on the wire.

Now there's supposedly an accelerator in the works, the Intia Processor, dedicated to nonplayer character AI acceleration. It would offload AI processing from the CPU and instead offer a dedicated API for pathfinding, terrain adaptation and line-of-sight detection.

So let's count the possible gaming accelerators here:
  • Sound (e.g. Sound Blaster X-Fi)
  • Video (e.g. NVIDIA GeForce)
  • Video (additional NVIDIA GeForce for SLI)
  • Physics (Ageia PhysX)
  • Networking (Killer NIC)
  • AI (Intia Processor)

So that eats at least six slots, but if you have double-wide video cards, more like eight. Oh yeah, and you'll spend nearly $1,500 on just the above hardware accelerators alone.

This is a cycle that repeats itself in the computer world, tho. Popular software algorithms become API's. API's become baked into hardware. The hardware becomes too inflexible and turns into firmware. The firmware isn't flexible so it goes into software. And on and on we go.

But this hasn't necessarily happened for OpenGL or graphics acceleration. Why? Because there you're not baking an API into silicon, you're using a different type of processing unit altogether to solve a different genre of problems. Multiple vector processing pipelines can streamline linear algorithms much more effectively than complex instruction set CPU's. The same analogy can be made for floating point units when they were introduced alongside elder CPU's that could only handle integer math.
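To make the "linear algorithms" point concrete, consider SAXPY (y = a*x + y), a canonical branch-free kernel. This plain-Python sketch only models the per-element semantics; actual vector hardware would apply the one operation across many elements in lockstep:

```python
# SAXPY: one branch-free operation applied uniformly to every element.
# A vector pipeline runs many of these lanes simultaneously; this
# sketch just models the elementwise math.
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
# [12.0, 24.0, 36.0]
```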

Not only that, we're looking at chip makers shying away from raw clock speeds and instead cramming as many CPU cores as possible onto a single die. This allows every CPU stamped out of the factory to be a multiprocessor machine. Once API's become more threadsafe, this crazy specialized hardware will instead just be an API call sent to one of the four idle CPU's on someone's desktop.

I think the eventual result of this crazy hardwareization (feel free to use that one) is that people are going to want to fit their algorithms into one of three processing units: the CPU pool (by "pool" I mean a pool of processors upon which to dump your asymmetric threads), the VPU (vector processing unit) or the FPU. The CPU & FPU have already become as one, but as any C/C++ programmer can attest, the decision to use integer-based math as opposed to floating point math still rests heavily on the programmer's mind.

Given people are already compiling & running applications on a GPU that have nothing at all to do with graphics, it's only a matter of time before chipset manufacturers come up with a way to capitalize on the new general purpose processing unit.
Tuesday, July 25, 2006

AMDTI? ATIMD? The Vector and the CPU

I hate to say I told you so, but...

AMD and ATI are now one. It's the talk of the town: AMD has taken over ATI, to make... umm... AMDTI?

The hardware ramifications aren't well known... who knows if this will spawn a new core-logic chipset, a new southbridge, a new means of fabricating ATI/AMD procs, new GPU's and physics accelerators, or just new marketing. However... if the core reason they've joined together is for intellectual property, we may start to see some vector units in CPU's. Who knows.

From my perspective, this smells like a bad deal. ATI's Linux support, while existent, is shallow. One need only try to manage a dual-screen setup or hack an xorg.conf file to realize how kaboshed ATI's Linux driver support is. Contrast it with the robustness of Nvidia's unified driver model and there is simply no comparison. Nvidia wins by not inventing their own wacky configuration schema, but instead streamlining their X11 integration with the mainline accepted standards and augmenting it only when necessary.

And yet, AMD has proven to be a big Linux supporter. So maybe their ingestion of ATI will be a driving force for them to get their collective butts in gear. Here's to hoping ATI will be the big winner in all this, and that they get in line with AMD's more competitive practices.

Monday, March 20, 2006

GPU - The Next FPU?

What was the difference between a 486/SX and a 486/DX2? The additional floating point processor. Hellz yeah.

In the budding days of personal computers and 486 goodness, CPU's were strictly integer machines. Floating point math was simply too much for the little chicklets that didn't even need active cooling. I recall that back in the olden days of ISA cards and SIMMs you could purchase separate FPU's to hammer into your brute-force socket... this granted you massive power that your Excel spreadsheets never previously dreamed of.

I recall how up-in-arms everyone was that Quake actually demanded a freakin' FPU. Who did id think we were, Rockefeller?

Now that graphics cards have become cheap, plentiful and pretty freakin' powerful, vector processing has become all the rage. Projects like BrookGPU and GPGPU have made it (relatively) easy to create userspace apps that run on a GPU just as they would on a CPU. This means your GeForce 7800 could work on certain serialized tasks, crunching numbers while it remained idle between SecondLife sessions.

Of course, GPU's are only good for vector processing, which means only certain types of algorithms are really suited to them. You don't need to look much farther than critiques of IBM's Cell architecture to see how people feel about a rash of vector processing units versus a CPU that can span and branch effectively.

It was interesting to see that Nvidia and Havok have teamed up to offload physics to the GPU. Bear in mind this doesn't work with an actor's physics, since that branches and can't be effectively streamlined or predicted, but it does extremely well with cloth and particle physics.

Nvidia could be realizing that Ageia's physics accelerator is more than vaporware, and is stepping up their offering to get into that product space. However, from the specs it appears that only Ageia's PhysX (*groan*) will be able to handle physics asynchronously. Although, I guess if you run physics calculations like shaders, you could send multiple calculations down each pipe on the graphics card itself.

Should be interesting to see how the space pans out. The GPU is quite a hoss nowadays, and needs extremely fast access to memory, so it may not be incorporated as a side-by-side processor or on-die any time soon. But still... one has to wonder if multiple fast vector units are the next big thing to go on a proc.