Highly threaded software is extremely difficult to write. It can even be impossible, depending on your task.

Suppose you need to do four complicated math problems:

1) Complex problem #1 gives you variable W which depends on known values. Thus, W can be calculated right away.

2) Complex problem #2 gives you variable X which depends on known values. Thus, X can be calculated right away.

3) Complex problem #3 gives you variable Y which depends on variable X. Thus, this problem cannot be solved until X is known.

4) Complex problem #4 gives you variable Z which depends on variables W and X. Thus, this problem cannot be solved until W and X are known.

- On a single-threaded CPU, it does step 1, then 2, then 3, then 4, using 100% of the processor each time.
- On a dual-threaded CPU, it does steps 1 and 2 at the same time, each on its own thread (~100% processor usage). Then it does steps 3 and 4 at the same time, each on its own thread (~100% processor usage).
- On a quad-threaded CPU, it does steps 1 and 2 at the same time on two of the threads (~50% processor usage), but steps 3 and 4 cannot start because W and X are not yet known, so the other two threads sit at 0%.
- Now imagine trying to do that with the 20 threads of your E5-2670v2 processor. Two of the 20 threads can operate (~10% of the processor power), two others are idle until W and X are known, and the remaining 16 threads have absolutely nothing to do at the moment.
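The dependency chain above can be sketched with a thread pool. The `solve_*` functions here are hypothetical stand-ins for the four "complicated math problems"; the point is that Y and Z simply cannot be submitted until the results for W and X come back, no matter how many worker threads the pool has.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four complicated math problems.
def solve_w():      # step 1: depends only on known values
    return 2

def solve_x():      # step 2: depends only on known values
    return 3

def solve_y(x):     # step 3: depends on X
    return x * 10

def solve_z(w, x):  # step 4: depends on W and X
    return w + x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Steps 1 and 2 can run in parallel right away...
    fw = pool.submit(solve_w)
    fx = pool.submit(solve_x)
    # ...but we must block here until W and X are known.
    w, x = fw.result(), fx.result()
    # Only now can steps 3 and 4 be started; the other workers sat idle.
    fy = pool.submit(solve_y, x)
    fz = pool.submit(solve_z, w, x)
    print(fy.result(), fz.result())  # -> 30 5
```

Even with `max_workers=4` (or 20), at most two workers ever have anything to do at once in this example.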

Thus having more threads doesn't help in this example. A good programmer might be able to break steps 3 and 4 into smaller bits that can be partially calculated while steps 1 and 2 are running. But lots of times this just isn't physically possible. Trying to break every problem into exactly 20 pieces for your 20-thread chip is quite difficult work. Plus, what happens if you can break the problem into exactly 20 pieces but one piece finishes sooner? That thread has to sit idle waiting for the others to catch up.
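That "one piece finishes sooner" problem is easy to demonstrate. In this sketch (with made-up, deliberately unequal chunk sizes), the short chunk's thread is done almost immediately, but the overall wall time is set by the slowest chunk, so that thread just waits:

```python
from concurrent.futures import ThreadPoolExecutor

# A hypothetical unit of work whose cost scales with n_iterations.
def chunk(n_iterations):
    total = 0
    for i in range(n_iterations):
        total += i
    return total

with ThreadPoolExecutor(max_workers=2) as pool:
    short = pool.submit(chunk, 10_000)      # finishes quickly
    long_ = pool.submit(chunk, 5_000_000)   # dominates the wall time
    results = [short.result(), long_.result()]
# The short chunk's thread sits idle until the long chunk completes;
# total elapsed time ~= the slowest piece, not the average piece.
```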

And this doesn't even get into the overhead of transferring data from thread to thread (or worse, between CPU and memory). You might be able to split the problem into 20 exactly equal parts, but if the data set is large enough, the CPU can outrun the memory and just has to sit there waiting for data.

Finally, you have a 10-core CPU that runs 20 threads, two per core. In an ideal world, a core could run two different threads at the same time if they use different resources within the core. But often, if the math is similar, the two threads need the same part of the core. In that case, only 10 of your 20 threads can actually run simultaneously.