The Multicore Speed Limit

But it is Law…

Amdahl’s Law is interesting, but what does it really tell us? If you are not sure of the details, read a few links and come back.

The Law is a good place to start talking about multicore performance. And really, isn't that the whole reason we are here? The idea is to make the computer do more stuff in the same amount of time. The end of Moore's Law and the limits on leakage current and clock frequency have broken the happy climb up the performance curve. Suddenly multiple processors sound like a good idea.

The problem with multiprocessor performance is the naive assumption that two processors can do twice as much. Amdahl figured this out in 1967, which was a pretty sharp thing to do in 1967.

The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor…At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome.
-Gene Amdahl, 1967

If I put two engines in a car, it will not go twice as fast, and four engines will not make it go four times as fast.

So when the EE Times article about a new 200-processor chip from some hot startup crosses the web, the little bullshit warning light in your head should come on.
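A quick back-of-the-envelope sketch in C makes the skepticism concrete. The 95% parallel fraction below is an assumption picked just for illustration, and it is a generous one:

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
 * fraction of the work that can run in parallel and n is the
 * number of processors. p = 0.95 is an assumed, generous figure. */
static double speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.95;
    const int counts[] = { 2, 4, 8, 200 };

    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("%3d processors: %.1fx\n", counts[i], speedup(p, counts[i]));
    /* Prints roughly 1.9x, 3.5x, 5.9x, and about 18.3x for 200. */
    return 0;
}
```

Even with only 5% of the work stuck being sequential, 200 processors buy a speedup of about 18x, and no processor count will ever push it past 20x.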

Is this the whole story?

Amdahl was smart, but he was only talking about software. The software that runs on a parallel machine is tightly tied to that machine; if there is a solution out there that is not, I have not seen it. Most embedded systems are tightly tied to the hardware anyway. Tight coupling is a bad thing, and a portable implementation would be really nice.

The hardware presents its own limits to performance, and Gene was only talking about software. His assumption in 1967 was that the processor would never sit idle because it could not get data. I am talking about memory access: if the hardware has two, four, or eight full-speed processors sharing a single memory, there is an obvious bottleneck.

Give each processor its own memory, you say. Well, I do not want to add the cost, and we have just added another sequential path: how do programs and data get transferred between those memories?
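To see why that transfer path hurts, here is a sketch of what private memories force on the software. The mailbox below is hypothetical, not any vendor's API; every shared item has to squeeze through a copy like this one.

```c
#include <string.h>

/* A hypothetical mailbox between two private memories. With no shared
 * address space, every piece of shared data is serialized through a
 * copy like this one. (A real version also needs a memory barrier
 * before setting 'full'; this is only a sketch.) */
struct mailbox {
    volatile int  full;       /* set by sender, cleared by receiver */
    unsigned char data[256];
};

static void mailbox_send(struct mailbox *mb, const void *msg, size_t len)
{
    while (mb->full)
        ;                     /* wait for the receiver to drain it */
    memcpy(mb->data, msg, len);
    mb->full = 1;             /* hand the copy over */
}
```

That copy, and the handshake around it, is sequential work that Amdahl's Law counts against us.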

No, no, just add big caches. Well, how does the system keep the data in one processor's cache consistent with the data in another's? Here we have the classic lost-update problem: two writers editing the same data. Processor 1 reads the data and starts calculating. Processor 2 reads the same data and starts a different calculation. Processor 2 writes, then processor 1 writes. Now the result is wrong.
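The same lost update is easy to reproduce in software. A minimal pthreads sketch, with a shared counter standing in for whatever the two processors were calculating:

```c
#include <pthread.h>
#include <stdio.h>

/* volatile forces each read-modify-write out to memory, but it does
 * nothing to make the sequence atomic -- the race is still there. */
static volatile long total = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        total = total + 1;    /* read, calculate, write back */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Both threads read the same value, both write: one update is
     * lost. The result is almost always well short of 2000000. */
    printf("total = %ld\n", total);
    return 0;
}
```

Build it with gcc -pthread and run it on any multicore machine; the total comes up short because updates are lost exactly as described above.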

But there is hope: the hardware can catch such errors. Well, sometimes. Say we do catch the error; now processor 1 holds an invalid item. What should it do? Start the calculation over? Maybe, or maybe not. Perhaps the changed value does not matter, or it is just the next value in a sequence, or some other case. Either way, the performance advantage just went out the window.
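"Start the calculation over" is exactly the strategy lock-free code uses. A minimal sketch with C11 atomics, assuming the redone work is cheap:

```c
#include <stdatomic.h>

static _Atomic long total;

/* Optimistic update: read, compute, then publish only if nobody
 * changed the value underneath us. On a conflict we throw the work
 * away and redo it. */
static void add_sample(long sample)
{
    long seen, next;
    do {
        seen = atomic_load(&total);
        next = seen + sample;     /* stand-in for a real calculation */
    } while (!atomic_compare_exchange_weak(&total, &seen, next));
}
```

When conflicts are rare this is fast; when every core hammers the same data, the retries are where the performance advantage goes out the window.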

As an example of this sort of hardware assist, consider the ARM solution. The key to the new dual-core ARM Cortex-A9 systems is the level-two cache controller; that is the real multicore piece of the system. Basically it is a hardware-assisted coherency system: the cache controller has hardware to see when one core writes data that a second core then reads, which keeps memory consistent between the two processors. To make this work, the level-one caches must be write-through. Even with this very helpful piece of hardware, the system can break itself unless the designer builds in awareness of the multiple processors. We still have to handle the error cases, and encapsulating things the right way becomes very important, very early in the design.
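"Encapsulating things the right way" can be as simple as refusing to touch shared state directly. A sketch of my own, not anything from ARM's libraries:

```c
#include <pthread.h>

/* All access to the shared state goes through these two functions;
 * nothing else in the program even knows the mutex exists. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_state;

void state_set(int value)
{
    pthread_mutex_lock(&lock);
    shared_state = value;
    pthread_mutex_unlock(&lock);
}

int state_get(void)
{
    pthread_mutex_lock(&lock);
    int value = shared_state;
    pthread_mutex_unlock(&lock);
    return value;
}
```

Nothing outside this module knows how the protection works, so when the locking strategy has to change, only one file changes.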

The problems of memory management in multiprocessor systems are big and ugly. Redesigning the whole memory system to do transactions seems a bit like overkill, but maybe it works.
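For a taste of what "do transactions" might look like at the language level, GCC ships an experimental transactional memory extension (build with -fgnu-tm). A sketch of the idea, not an endorsement:

```c
/* Build with: gcc -fgnu-tm ...  (experimental GCC extension) */
static long balance_a, balance_b;

void transfer(long amount)
{
    /* Either both updates commit or neither does; conflicting
     * transactions on other cores are detected and re-executed. */
    __transaction_atomic {
        balance_a -= amount;
        balance_b += amount;
    }
}
```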

Where from here?

Performance deserves a post all to itself, so it will get one.

That one example shows how difficult this can be, but there is hope. Just thinking about the problem in the paragraphs above has given us some requirements for the ideal multicore system.

  • We don't need to scale to 200 processors; something that works with 2-8 is OK for now.
  • Coupling between the hardware and software should be minimized, but cannot be avoided.
  • Application-level awareness of multiple processors is bad.
  • We must measure the performance. (Not really above, but yep, we do; a minimal timing sketch follows.)
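As a down payment on that last requirement, a minimal wall-clock timing harness. This is my own sketch; clock_gettime with CLOCK_MONOTONIC is a POSIX assumption (older glibc may need -lrt):

```c
#include <stdio.h>
#include <time.h>

/* Read a monotonic wall clock as a double, in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double start = now_seconds();
    /* ... run the parallel workload here ... */
    printf("elapsed: %.6f s\n", now_seconds() - start);
    return 0;
}
```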

So, please, have faith. I am going someplace with all this. It will take me a bit to get there, but it will be worth the trip.
