MSimon wrote: Then I suggest you investigate how much trouble in FP performance has been caused by that great x87 FPU Forth-like stack in the past, and why SSE2 is so much of an improvement over it in terms of performance, even for scalar computations.
Oh. I'm sure. I've done math on the 8051 using the stack and if you don't think out the problem in advance and really think it through you can get in a LOT of trouble.
But if you do THINK about what you are doing and order the operands correctly data stacks really simplify things a lot.
Hey, no quarrel about that. Stack-based architectures ARE simple and quite effective if your HW is poor. That is why many early compilers used a stack-based intermediate code ("p-code") and usually built the compiled result on stack operations. Converting standard expressions to stack ops is trivial. Been there, done that...
But it is SLOW.
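To show how little it takes to turn an ordinary expression into stack ops, here is a minimal sketch. It assumes the textbook shunting-yard method and four left-associative operators; the names to_stack_ops and run_stack_ops are made up for illustration and do not come from any particular p-code compiler.

```python
import operator

# Operator precedence and implementation; all four are left-associative.
OPS = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "*": (2, operator.mul),
    "/": (2, operator.truediv),
}

def to_stack_ops(tokens):
    """Convert infix tokens to postfix (stack-op order) via shunting-yard."""
    out, pending = [], []
    for tok in tokens:
        if tok in OPS:
            # Emit pending operators of equal or higher precedence first.
            while pending and pending[-1] in OPS and OPS[pending[-1]][0] >= OPS[tok][0]:
                out.append(pending.pop())
            pending.append(tok)
        elif tok == "(":
            pending.append(tok)
        elif tok == ")":
            while pending[-1] != "(":
                out.append(pending.pop())
            pending.pop()  # discard the "("
        else:
            out.append(tok)  # operand: becomes a "push"
    while pending:
        out.append(pending.pop())
    return out

def run_stack_ops(ops, env):
    """Execute postfix ops on a data stack, looking operands up in env."""
    stack = []
    for tok in ops:
        if tok in OPS:
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(OPS[tok][1](lhs, rhs))
        else:
            stack.append(env[tok])
    return stack.pop()

ops = to_stack_ops("a + b * ( c - d )".split())
print(" ".join(ops))
print(run_stack_ops(ops, {"a": 1, "b": 2, "c": 7, "d": 4}))
```

Note that every operator pops its inputs from the top of the stack and pushes its result back, so each op waits on the one before it. That serial chain is the cost being argued about here.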
But if you just want to brute force your way through problems without a lot of deep thinking modern processors are excellent.
Well, you must be solving problems of a different scale then.
And let me re-iterate - we have the best tools possible (hardware and software) for the way things are currently done. Long pipelines are real speed ups for long routines.
What do long routines have to do with that?
And then all that wonderful branch predicting so you can be ready for a branch.
Branch prediction is not about "being ready". It is about "ignore the branch" (and redo if it went the other way).
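A toy model of "guess, keep going, redo if wrong" might look like the sketch below. The 2-bit saturating counter is a textbook scheme that I am assuming for illustration, not what any specific CPU implements, and the 15-cycle penalty is an arbitrary placeholder.

```python
def simulate(outcomes, penalty=15):
    """Count cycles lost to mispredictions with a 2-bit saturating counter.

    outcomes: sequence of booleans, True meaning the branch was taken.
    penalty:  assumed cycles of speculative work discarded per misprediction.
    """
    counter = 2  # 0-1 predict not taken, 2-3 predict taken
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            mispredicts += 1  # speculated down the wrong path: flush and redo
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredicts, mispredicts * penalty

# A loop branch taken 99 times, then falling through once.
loop_branch = [True] * 99 + [False]
print(simulate(loop_branch))

# A branch that alternates every time is far harder for this simple scheme.
print(simulate([True, False] * 50))
```

When the guess is right the branch costs nothing extra in this model; only a wrong guess pays the redo penalty.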
It seems to me that a two stack architecture which is always ready to branch would be simpler. And if you design your processor right a return instruction can be included with most other types of instruction. So you are already fetching from the stack while doing your add (or whatever) no branch predictor required.
Sorry, but you do not seem to have a clue....
Actually, I do not blame you. I have observed that most of my coworkers do not really know how a modern out-of-order CPU works... In fact, it is not easy to understand.
OK, just for starters: Something like "already fetching from the stack" is trivial. OOO CPUs with branch prediction actually EXECUTE up to hundreds of instructions ahead of the branch. Well, some of them - those that have their data available. While finishing instructions before the branch. While renaming registers to reduce dependencies.
Is it simple? No way. It is hard as hell to design. But as you correctly say, we are approaching the speed of light. Simple recipes for performance do not work anymore. You cannot bump the frequency. What you CAN do is increase parallelism and avoid bottlenecks. And OOO excels at both tasks.
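To make the "execute whatever has its data available" idea concrete, here is a deliberately crude sketch. It assumes unlimited execution units, tracks only true data dependencies (as if register renaming had already removed the false ones), and uses invented latencies; the function name and the five-instruction program are hypothetical.

```python
def last_completion_cycle(program, in_order):
    """Return the cycle at which the last instruction completes.

    program:  list of (dest, sources, latency) tuples in program order.
    in_order: if True, an instruction may not start before its predecessor
              has started; if False, it starts once its sources are ready.
    """
    ready = {}      # register name -> cycle its value becomes available
    prev_start = 0  # start cycle of the previous instruction
    last_done = 0
    for dest, sources, latency in program:
        # Registers never written by this program are ready at cycle 0.
        start = max((ready.get(s, 0) for s in sources), default=0)
        if in_order:
            start = max(start, prev_start)  # no overtaking
        prev_start = start
        ready[dest] = start + latency
        last_done = max(last_done, ready[dest])
    return last_done

program = [
    ("r1", [], 10),           # slow load, e.g. a cache miss
    ("r2", ["r1"], 1),        # needs the load result
    ("r3", ["r4", "r5"], 3),  # independent multiply
    ("r6", ["r3", "r3"], 3),  # depends only on the multiply
    ("r7", ["r2", "r6"], 1),  # joins both chains
]
print(last_completion_cycle(program, in_order=True))
print(last_completion_cycle(program, in_order=False))
```

In this toy the in-order machine parks the independent multiplies behind the stalled load, while the out-of-order one overlaps them with it. Real hardware has finite windows and issue widths, so treat the numbers as an illustration only.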
We are really more stuck by our way of thinking than by what is possible.
Really?
And your point about caches? Well taken. I really like small simple processors and devoting the rest of the available silicon to on chip RAM. It speeds things up a lot.
Let me note that the average cache cell is 5-10 times more expensive than a DRAM cell and requires more power...
Of course, if all you need to do is some simple 8051-class tasks, your approach is fine.
If you want to simulate Polywell, you had better stick with high-performance computing. Get the latest Intel or AMD CPU and a good C++ compiler.