Nice :) I used to like AMD's CPUs a lot, and in theory their 12-core CPUs sound pretty great compared to Intel's 6-core + HT option. But I've been surprised to find that, at least in the systems I've worked on, the AMD CPUs show what looks like comparatively massive memory latency, which results in generally worse performance for multi-threaded applications. I was quite disappointed :/
It makes sense that, if the limiting factor is memory latency, you wouldn't gain by going from 12 virtual cores to 12 real cores. Do you suppose the thing is memory-bandwidth limited, though? If it's just latency, then AMD could probably improve the situation quite a lot by adding "hyperthreading".
DRAM bandwidth is pretty tight on modern systems! You get around 8 GB/sec per DDR3 channel (2 DIMMs per channel, so a fully populated channel has 8 GB on it with expensive 4 GB DIMMs or 4 GB with cheaper 2 GB DIMMs), so if you're running a bandwidth-hungry code you'll rapidly exhaust the 25 GB/sec available from a 3-channel i7. At Cray the rule of thumb for the cheaper systems was "one byte per second of memory BW per FLOP per second of compute" or "a GB/sec per GFLOPS". The other rule of thumb was "it should take about a second to read your memory". Those rules aren't directly applicable to non-FP non-tightly-parallel jobs, but it's fun to consider.
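For what it's worth, here is a rough sketch of that arithmetic, using only the figures quoted above (about 8 GB/sec per channel, 3 channels, 2 DIMMs per channel). The function names are made up for illustration, and the round 8 GB/sec figure gives 24 GB/sec rather than the 25 GB/sec quoted, so treat the output as ballpark.

```python
# Figures taken from the comment above; all names here are illustrative only.
GB_PER_SEC_PER_CHANNEL = 8.0   # "around 8 GB/sec per DDR3 channel" (rounded)
CHANNELS = 3                   # 3-channel i7
DIMMS_PER_CHANNEL = 2


def total_bandwidth_gb_s(channels=CHANNELS, per_channel=GB_PER_SEC_PER_CHANNEL):
    """Aggregate DRAM bandwidth across all channels, in GB/sec."""
    return channels * per_channel


def seconds_to_read_memory(dimm_gb, channels=CHANNELS,
                           dimms_per_channel=DIMMS_PER_CHANNEL,
                           per_channel=GB_PER_SEC_PER_CHANNEL):
    """Time to stream through all installed DRAM once at full bandwidth."""
    capacity_gb = channels * dimms_per_channel * dimm_gb
    return capacity_gb / total_bandwidth_gb_s(channels, per_channel)


def balanced_gflops(bandwidth_gb_s):
    """Compute rate the "a GB/sec per GFLOPS" rule would pair with this bandwidth."""
    return bandwidth_gb_s


if __name__ == "__main__":
    bw = total_bandwidth_gb_s()
    print("total bandwidth (GB/sec):", bw)
    print("seconds to read memory, 4 GB DIMMs:", seconds_to_read_memory(4))
    print("seconds to read memory, 2 GB DIMMs:", seconds_to_read_memory(2))
    print("GFLOPS matched by the Cray rule:", balanced_gflops(bw))
```

With 4 GB DIMMs the fully populated box lands right on the "about a second to read your memory" rule; with 2 GB DIMMs it has twice the bandwidth per byte that rule asks for.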
( ... )
Oh you know, it's been pretty hard to find a multi-socket Opteron that *doesn't* spread its RAM map! Which is total death for average latency, but avoids worst-case behavior from NUMA-ignorant OSes, so is the only reasonable option for CPUs running Windows or anything short of 2.6.32 or thereabouts.
IIRC they're spread on a per-4KB page basis, so you can in theory at least find the pages in your mem map that are local (and ergo 3x shorter latency than remote pages)! I never finished testing this on the system I was curious about, though, so I could be misunderstanding.
(Sketch of test: allocate more than a single socket worth of RAM; test individual page access latency. How to do that is left as an exercise; TSC overhead makes it hard-ish to accurately measure single page latency. Histogram the results and see if you have a fast peak down in the 50-90 ns range, plus another smeared peak out around 150-200 ns.)
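The histogram half of that sketch could look something like the following. This only covers the analysis step; the actual per-page timing is still left as an exercise. The latencies here are SYNTHETIC stand-ins drawn from the two ranges guessed at above, not measurements, and the 120 ns split point and all function names are my own assumptions.

```python
import random
from collections import Counter


def histogram(latencies_ns, bin_ns=10):
    """Count samples per bin; keys are the lower edge of each bin in ns."""
    return Counter(int(x // bin_ns) * bin_ns for x in latencies_ns)


def split_local_remote(latencies_ns, threshold_ns=120):
    """Split samples at an assumed threshold between the two expected peaks."""
    local = [x for x in latencies_ns if x < threshold_ns]
    remote = [x for x in latencies_ns if x >= threshold_ns]
    return local, remote


def synthetic_latencies(pages=1000, local_fraction=0.5, seed=0):
    """SYNTHETIC data only: a fast 50-90 ns group and a slow 150-200 ns group."""
    rng = random.Random(seed)
    samples = []
    for _ in range(pages):
        if rng.random() < local_fraction:
            samples.append(rng.uniform(50, 90))
        else:
            samples.append(rng.uniform(150, 200))
    return samples


if __name__ == "__main__":
    samples = synthetic_latencies()
    # Crude text histogram: one '#' per 5 pages in each 10 ns bin.
    for edge, count in sorted(histogram(samples).items()):
        print(f"{edge:4d} ns {'#' * (count // 5)}")
    local, remote = split_local_remote(samples)
    print(len(local), "pages look local,", len(remote), "look remote")
```

On real data you'd feed your measured per-page latencies in place of `synthetic_latencies()`; if the pages really are interleaved per 4KB, the two peaks should show up with no samples in the gap between them.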
That's 4.8× what I paid for my last computer, exclusive of the monitor. (Although the case is one I picked up on the sidewalk after someone threw it out; it says "Preparado año 2000" on a sticker on the back.) I wonder if a four-machine cluster with cheaper CPUs and RAM could outperform it for the stuff you're doing. It would certainly be less prone to crippling failure.
On the other hand, an X25-V or a Radeon HD 5670 would sure be nice...
I recently put a 1.6 GHz Atom D510 in a cast-off ATX case with one 2GB DIMM and a similar storage system to what I specced above (2 spindles plus one SSD). Since I'm primarily constrained by (1) physical space and (2) conceptual overhead, fitting as much capacity as possible into a single case and "unit of management" is a pretty good optimization for me right now. The Atom is dual core, HT, basically in-order, and draws around 10-15 watts (including chipset). I suspect one Atom core @ 1.6 GHz is about 1/6 of a Core i7 core @ 2.8 GHz. It's also missing several interesting features, and is severely limited WRT IO bandwidth.
DRAM is basically uniform cost per byte as long as you stay at the sweet spot (which is currently 2GB DIMMs both in DDR2 and DDR3), so if the goal is to provide N GB of DRAM, it's almost certain to be cheapest to buy the smallest motherboard that provides N/2 DIMM sockets rather than 2 motherboards that provide N/4 DIMM sockets.
Hmm, my off-the-cuff "1:6" was pretty accurate -- compiling 2.6.35-rc3's sched.c on an X200s Core i7-640LM 2.13 GHz takes { 2.301, 2.303, 2.310 } seconds, on my Atom D510 { 10.073, 10.093, 10.079 } seconds, for a ratio of 4.38; extrapolating to 2.8 GHz gives 5.75x. However the 640LM has 4/6 Turbo Boost while the i7-930 has 1/1/1/2, so the ratio may be lower than that. Complicating things further is HT, of course; I don't have a good parallel benchmark at hand (unfortunately "make -j8" had a run-in with Amdahl's law and rarely gets better than 80% utilization on reasonable configs). The i7 quad-core at work shows really surprisingly good scalability up to 8 threads -- IIRC 4 threads was showing less than 70% of 8 thread throughput.
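For anyone checking the arithmetic, a quick sketch of how those two numbers fall out of the timings. Taking the best of the three runs is my assumption (it's what reproduces the 4.38), the names are illustrative, and the clock scaling is naively linear, so it ignores the Turbo Boost caveat entirely.

```python
# Compile times in seconds, as quoted in the comment above.
I7_640LM_RUNS = [2.301, 2.303, 2.310]
ATOM_D510_RUNS = [10.073, 10.093, 10.079]


def speed_ratio(slow_runs, fast_runs):
    """Ratio of best-case times (assumption: best of three, not the mean)."""
    return min(slow_runs) / min(fast_runs)


def scale_to_clock(ratio, measured_ghz, target_ghz):
    """Naive linear extrapolation of the ratio to a different clock speed."""
    return ratio * target_ghz / measured_ghz


if __name__ == "__main__":
    ratio = speed_ratio(ATOM_D510_RUNS, I7_640LM_RUNS)
    print("measured ratio:", round(ratio, 2))
    print("extrapolated to 2.8 GHz:", round(scale_to_clock(ratio, 2.13, 2.8), 2))
```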
On the other hand, an X25-V or a Radeon HD 5670 would sure be nice...
I guess you've already made the purchase, though.