Programming languages

Discuss life, the universe, and everything with other members of this site. Get to know your fellow polywell enthusiasts.

Moderators: tonybarry, MSimon

blaisepascal
Posts: 191
Joined: Thu Jun 05, 2008 3:57 am
Location: Ithaca, NY
Contact:

Post by blaisepascal »

MSimon,

I will fully admit that I haven't really tried Forth on a modern Forth system. The last time I tried to seriously use Forth was in the 1980s. If modern Forth systems have advanced the art significantly, I haven't seen it because I haven't been looking.

The few times I've looked at Forth since then, I've been hampered by old documentation. Starting FORTH and Thinking FORTH are now very old books, and do not reflect the changes I found in the system when I tried to use it. I'm not just talking about "library" stuff, but more like: how do I edit a program so the Forth runtime can see my changes?

What modern Forth system would you recommend? What modern Forth documentation would you recommend?

Please keep in mind that this would be for exploration and learning purposes only -- my work is done in C#, and is not amenable to integration with or conversion to Forth. As such, cost is a serious concern, as is platform. I am running Linux on an amd64 chip at home.

If I can get my hands on a Forth system to try, I promise I'll do some programming katas using it to get a feel for how to work in it.

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

OK. I have two anecdotes. Go to Forth Inc for more.

Or look up the work on the software for the first shuttle arm.

The speed of light limitation says that many small cores each running a part of a process will be faster.

In any case it doesn't matter. I'm going to do what I'm going to do and we shall see where it leads.

Chuck Moore has a cute multi-core (144 cores) machine where each core runs at 650 MHz and consumes 4.5 mW. Each pin effectively has its own processor. So what is the peak (all cores running) processing power? 650e6 x 144, at a cost of about 650 mW. Even if you divide that by 5 to account for 18-bit vs. 64-bit words, it is pretty impressive: close to 20 GHz of peak throughput, delivered in 650 MHz chunks.
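For a rough sanity check of those figures, here is a back-of-envelope calculation in C. It uses only the per-core numbers quoted above, and the divide-by-5 word-width discount is the same hand-wave as in the text.

/* Back-of-envelope check of the GA144 figures quoted above.
 * Only the per-core numbers from the post are assumed. */
#include <stdio.h>

int main(void) {
    const double core_ops = 650e6;   /* quoted per-core instruction rate      */
    const double core_mw  = 4.5;     /* quoted per-core active power, in mW   */
    const int    cores    = 144;

    double peak_ops   = core_ops * cores;   /* ~93.6e9 ops/s                   */
    double peak_mw    = core_mw  * cores;   /* ~648 mW with every core busy    */
    double scaled_ops = peak_ops / 5.0;     /* crude 18-bit vs 64-bit discount */

    printf("peak:   %.1f G-ops/s at about %.0f mW\n", peak_ops / 1e9, peak_mw);
    printf("scaled: %.1f G-ops/s after dividing by 5\n", scaled_ops / 1e9);
    return 0;
}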

http://www.greenarraychips.com/home/doc ... 4-1-10.pdf

http://www.greenarraychips.com/home/doc ... index.html

Power is managed by data flow: no data, no current consumption for a given core. That is the entire power-management scheme.

And he has moved the design in a direction I would have liked - wider instruction word (5 bits was too cramped) and more RAM per core.

The previous chips he designed were too cramped, and too odd in how they worked, even for my strange tastes.
Engineering is the art of making what you want from what you can get at a profit.

Luzr
Posts: 269
Joined: Sun Nov 22, 2009 8:23 pm

Post by Luzr »

MSimon wrote: The speed of light limitation says that many small cores each running a part of a process will be faster.
- if you can divide the process into many small parts, which is not usually possible

- and if bandwidth issues are resolved; once small cores start competing for main-memory data, you are back in the kind of trouble that can only be resolved with OOO (out-of-order) execution

Certainly, for some tasks such an approach is superb. For most, it is not.

BTW, if you like to play with a lot of cores, you might be interested in modern GPUs. If you can fit your software to them, they are true monsters.

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

Luzr,

I forgot to mention that the GA 144 was done on a 180 nm process. Just think of what they could do with 90 nm.

And compare what they have done at 180 nm with what Intel is currently delivering with 45 nm.
Engineering is the art of making what you want from what you can get at a profit.

Luzr
Posts: 269
Joined: Sun Nov 22, 2009 8:23 pm

Post by Luzr »

MSimon wrote:Luzr,

I forgot to mention that the GA 144 was done on a 180 nm process. Just think of what they could do with 90 nm.

And compare what they have done at 180 nm with what Intel is currently delivering with 45 nm.
Actually, what have they done? They put 144 very simple, low-performance cores on a single die. So what?

For SOME tasks, such a CPU can show tremendous gains. For most, it will perform much worse than an Intel Core 2.

And for ALL the tasks where it would be faster than a Core 2, it would be much, much slower than a modern GPU (which has a similar number of cores, but all of them floating-point).

BenTC
Posts: 410
Joined: Tue Jun 09, 2009 4:54 am

Post by BenTC »

MSimon wrote:
BenTC wrote:
MSimon wrote:Well no matter. If you want ultimate speed design a FORTH CPU in FPGA.
What size FPGA do you need? (also, how do you rate FPGA size?)
eg 5000 LUT, 12000 LUT?
So a 32-bit machine might be around 2,000 LUTs. That would allow you to implement a simple 16-cycle 32x32 multiplier (a two-word adder and a 32-bit shifter), i.e. you multiply 2 bits at a time. Given carry speed in the adders it might actually take 32 clock cycles, but it saves a LOT of gates vs. a full-up multiplier.

If you can run the chip at 20 MHz (very conservative) you get a full 32x32 multiply with a 64-bit product about every 2 µs. Adequate for most control situations. A divider could run at about the same speed.

That means a PID loop could run at a 100 kHz rate. More than adequate for a 10 kHz process.

http://www.xilinx.com/support/documenta ... /ds015.pdf

The XC4036 looks nice for a 32-bitter.

Costs:

http://search.digikey.com/

I also like the TI DSP chips.
With my limited knowledge I am trying to compare the XC4036 to the Lattice XP2-5 (scroll down to the selection guide).
That indicates the terms CLB and LUT are roughly equivalent, and another reference (slide 3.16) indicates the XC4000 series has two LUTs per CLB. Looking at Table 1 (p6-157) in the datasheet you referenced, I am guessing "Logic Cells" == LUTs. So: you mention implementing multipliers. Would the multipliers mentioned for the XP2-5 mean you would need less LUT real estate, and would they run faster?
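As an aside, here is a rough C model I made of the 2-bits-per-pass shift-and-add multiply you describe above, just to convince myself how 16 passes give the 64-bit product. It is a software model only; the real thing would be an adder and a shifter in the FPGA fabric.

/* Software model of the 2-bits-per-pass shift-and-add 32x32 multiply:
 * 16 passes through a small adder and shifter, yielding a 64-bit product. */
#include <stdint.h>
#include <stdio.h>

static uint64_t mul32_2bits(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    uint64_t shifted = a;              /* multiplicand, shifted left 2 bits per pass */
    for (int pass = 0; pass < 16; pass++) {
        uint32_t bits = b & 3u;        /* look at 2 multiplier bits per pass       */
        acc += shifted * bits;         /* add 0x, 1x, 2x or 3x of the multiplicand */
        shifted <<= 2;
        b >>= 2;
    }
    return acc;
}

int main(void) {
    uint32_t a = 123456789u, b = 987654321u;
    printf("model: %llu\n", (unsigned long long)mul32_2bits(a, b));
    printf("check: %llu\n", (unsigned long long)((uint64_t)a * b));
    return 0;
}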
In theory there is no difference between theory and practice, but in practice there is.

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

They put 144 very simple, low-performance cores on a single die. So what?


I don't know; 650 MHz for 4.5 mW seems like high performance to me. And if they sell for $3 a pop in quantity (1 cent a pin cost and 2 cents a pin profit), you put 10 of them on a board and you have what? About 1 THz of peak processing power for $30, at 6.5 watts peak. Seems like you might be able to do something useful with that much HP.

And since a processor only fires up when it has something to do... You can probably pass on fans.

And that is with 180 nm technology. Think of what they could do at 90 nm.
Engineering is the art of making what you want from what you can get at a profit.

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

Ben,

Here is a very good book on stack processors. Free.

http://www.ece.cmu.edu/~koopman/stack_c ... index.html

Read it. Ask me questions. Then we can go back and talk implementation.
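For a taste of what the book is about, here is a toy two-stack machine in C (my own illustration, not from the book): a data stack for operands and a return stack for calls, which is the split a stack processor gives you in hardware.

/* Toy two-stack machine: data stack for operands, return stack for calls.
 * Just an illustration of the execution model; Koopman covers the real thing. */
#include <stdio.h>

enum { LIT, ADD, MUL, CALL, RET, HALT };

int main(void) {
    /* push 3, push 4, call the ADD "subroutine" at index 10, push 5, multiply */
    int prog[] = { LIT, 3, LIT, 4, CALL, 10, LIT, 5, MUL, HALT, ADD, RET };
    int ds[16], rs[16];                 /* data stack and return stack */
    int dsp = 0, rsp = 0, pc = 0;

    for (;;) {
        int op = prog[pc++];
        switch (op) {
        case LIT:  ds[dsp++] = prog[pc++];            break;
        case ADD:  dsp--; ds[dsp - 1] += ds[dsp];     break;
        case MUL:  dsp--; ds[dsp - 1] *= ds[dsp];     break;
        case CALL: rs[rsp++] = pc + 1; pc = prog[pc]; break;
        case RET:  pc = rs[--rsp];                    break;
        case HALT: printf("result: %d\n", ds[dsp - 1]); return 0;
        }
    }
}

Running it prints 35, i.e. (3 + 4) * 5, with the ADD done in a "subroutine" via the return stack.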
Engineering is the art of making what you want from what you can get at a profit.

hanelyp
Posts: 2261
Joined: Fri Oct 26, 2007 8:50 pm

Post by hanelyp »

One difficulty with running umpteen processors in parallel is inter processor communication, and depending on the approach, shared memory access and coordination. Somehow or another you need hardware support for that to work well.

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

hanelyp wrote:One difficulty with running umpteen processors in parallel is inter processor communication, and depending on the approach, shared memory access and coordination. Somehow or another you need hardware support for that to work well.
Data-flow architecture solves the intercommunication problem. Since these devices are mainly intended for control, each pin will have its own program.

Also there are high speed serial channels for interchip communication and shared memory.

BTW the chips are unclocked. So how do you handle regular event timing? You feed a clock to a pin and the processor only wakes up on clock edges.
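If it helps, here is a rough desktop analogue of one such data-flow node, written in C with POSIX pipes standing in for the chip's blocking inter-core ports. The node blocks on its input, does its one small job (made up here), hands the result on, and blocks again; while it is blocked it costs essentially nothing, which is the whole power story.

/* Desktop analogue of a "per-pin" data-flow node: block on input, do one
 * small job, pass the result downstream, block again. Pipes stand in for
 * the chip's blocking inter-core ports. */
#include <stdio.h>
#include <unistd.h>

static void scale_node(int in_fd, int out_fd) {
    int sample;
    while (read(in_fd, &sample, sizeof sample) == (ssize_t)sizeof sample) {
        sample *= 2;                                /* the node's one small job */
        write(out_fd, &sample, sizeof sample);
    }
}

int main(void) {
    int up[2], down[2];                             /* "ports" into and out of the node */
    pipe(up);
    pipe(down);
    if (fork() == 0) {                              /* the child plays the small core */
        scale_node(up[0], down[1]);
        _exit(0);
    }
    for (int i = 1; i <= 3; i++) {
        int result;
        write(up[1], &i, sizeof i);                 /* hand the node a sample       */
        read(down[0], &result, sizeof result);      /* wait for it to pass data on  */
        printf("%d -> %d\n", i, result);
    }
    close(up[1]);                                   /* no more data: the node goes quiet */
    return 0;
}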

A lot of the ideas in this chip were first seen in something called the Transputer, a dual-stack machine.

http://www.cs.bris.ac.uk/~dave/transputer.html
For some time in the late 1980s many considered the transputer to be the next great design for the future of computing. While INMOS and the transputer did not ultimately live up to this expectation, the transputer architecture was highly influential in provoking new ideas in computer architecture, several of which have re-emerged in different forms in modern systems.

http://en.wikipedia.org/wiki/Transputer
T9000:

http://hsi.web.cern.ch/HSI/dshs/publica ... per_1.html

Note the CERN address.
Transputer systems have been successfully used in real-time High-Energy Physics (HEP) systems for some time, for example in the ZEUS, UA6, and OPAL experiments [1]-[6].

Data acquisition and triggering systems for experiments at the Large Hadron Collider (LHC) and other demanding applications will require large-scale parallel systems [7]-[9]. In these applications, systems will primarily be based on high-speed point-to-point serial links and switches rather than shared buses. The INMOS T9000 Transputer [10] and the associated C104 packet routing chip [11] are commercially-available integrated circuits which can be used to build large networks of the type required for future experiments.

The main objective of the ESPRIT project GP-MIMD is the design and construction of large-scalable parallel computers using the T9000 and C104. As part of this project a 54-node T9000 network, using C104 packet routing switches, has been integrated into an existing 32-node T805 real-time data acquisition system in the CPLEAR experiment [11]-[12]. The T805 system was also developed as part of the GP-MIMD project.

The T9000 network operated as a processor farm running the standard CPLEAR off-line event reconstruction program. Initial experience with this prototype system are presented, together with computational and communications performance measurements.

http://hsi.web.cern.ch/HSI/dshs/publica ... l#HEADING1
I do have a leg up because I have studied similar architectures from Chuck Moore for about three or four years. Plus I was a Transputer fan in its day. What I find interesting is that every few years Chuck comes out with a new way of thinking and you have to drop ALL your preconceived notions to make good use of it. Not bad for a guy in his 70s. I find that my thinking usually takes a few years to catch up to his.
Engineering is the art of making what you want from what you can get at a profit.

Luzr
Posts: 269
Joined: Sun Nov 22, 2009 8:23 pm

Post by Luzr »

MSimon wrote:
They put 144 very simple, low-performance cores on a single die. So what?


I don't know; 650 MHz for 4.5 mW seems like high performance to me. And if they sell for $3 a pop in quantity (1 cent a pin cost and 2 cents a pin profit), you put 10 of them on a board and you have what? About 1 THz of peak processing power for $30, at 6.5 watts peak. Seems like you might be able to do something useful with that much HP.
I quite doubt it. What actually can you do with 144 simple cores? Are there any practical applications?

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

Luzr wrote:
MSimon wrote:
They put 144 very simple, low-performance cores on a single die. So what?


I don't know; 650 MHz for 4.5 mW seems like high performance to me. And if they sell for $3 a pop in quantity (1 cent a pin cost and 2 cents a pin profit), you put 10 of them on a board and you have what? About 1 THz of peak processing power for $30, at 6.5 watts peak. Seems like you might be able to do something useful with that much HP.
I quite doubt it. What actually can you do with 144 simple cores? Are there any practical applications?
Nothing.
Engineering is the art of making what you want from what you can get at a profit.

JohnFul
Posts: 84
Joined: Sat Feb 27, 2010 7:18 pm
Location: Augusta, Georgia USA

Post by JohnFul »

I quite doubt it. What actually can you do with 144 simple cores? Are there any practical applications?
You can use 35,000 cores to apply texture files to CGI images and create a movie called Avatar.

http://wellington.scoop.co.nz/?p=19750

J

kunkmiester
Posts: 892
Joined: Thu Mar 12, 2009 3:51 pm
Contact:

Post by kunkmiester »

You'd stack them up until you have enough to accomplish a task. I'm thinking that'll eventually be the future--break the chip down to as many cores as possible, and arrange them as needed for what you need to do.

Multi-threading is becoming more common; as it takes hold, more support will be put into chips for however the software ends up doing it.
Evil is evil, no matter how small

MSimon
Posts: 14334
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

kunkmiester wrote:You'd stack them up until you have enough to accomplish a task. I'm thinking that'll eventually be the future--break the chip down to as many cores as possible, and arrange them as needed for what you need to do.

Multi-threading is becoming more common; as it takes hold, more support will be put into chips for however the software ends up doing it.
One other thing such a model gives you: if the silicon (or carbon nanotubes) is fast enough you can replace hardware with software.

The complexity goes into the software, which is easier to build, test, and change. Respins don't take months; they take minutes.

Bit-banging an SPI port, for instance. Software UARTs, and so on.
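For the SPI case, the inner loop of a bit-banged (mode 0) byte transfer looks roughly like the sketch below. On real hardware set_pin() and get_pin() would poke GPIO registers; here they just model a loopback wire so the sketch runs on a PC, and the pin names and the delay are placeholders.

/* Sketch of a bit-banged SPI (mode 0) byte transfer, the sort of job a
 * single small core could handle in software. set_pin()/get_pin() model a
 * loopback wire (MOSI tied to MISO) so this runs anywhere. */
#include <stdint.h>
#include <stdio.h>

enum { PIN_SCK, PIN_MOSI, PIN_MISO, NUM_PINS };
static int pins[NUM_PINS];

static void set_pin(int pin, int level) {
    pins[pin] = level;
    if (pin == PIN_MOSI) pins[PIN_MISO] = level;    /* the loopback "wiring" */
}
static int  get_pin(int pin)  { return pins[pin]; }
static void short_delay(void) { /* would set the bit-clock rate on hardware */ }

static uint8_t spi_xfer_byte(uint8_t out) {
    uint8_t in = 0;
    for (int bit = 7; bit >= 0; bit--) {
        set_pin(PIN_MOSI, (out >> bit) & 1);        /* present data, MSB first    */
        short_delay();
        set_pin(PIN_SCK, 1);                        /* rising edge: sample MISO   */
        in = (uint8_t)((in << 1) | (get_pin(PIN_MISO) & 1));
        short_delay();
        set_pin(PIN_SCK, 0);                        /* falling edge: next bit     */
    }
    return in;
}

int main(void) {
    printf("sent 0x5A, got back 0x%02X\n", spi_xfer_byte(0x5A));
    return 0;
}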

When I did some work on I2C a long time ago, I used a parallel port (I'm dating myself) and a 100 MHz Pentium to do I2C development, with Forth as my development language. IIRC I could adjust the timing in 0.4 µs increments. Good enough to get the job done.

Now think about being able to run a bunch of A/Ds, D/As, EEPROM, memory, parallel ports, displays, UARTs, an I2C bus, etc. simultaneously, with a master core as the conductor. And since it is data flow, the master core massages the data and then just waits for the next batch of data to start it up again. As each "sub-core" completes its task and passes on its data, it goes back to sleep until called on, waking only when given something to do.

Intermediate cores can be used to do things like low-pass filtering, slope detection, change detection, zero-crossing detection, decimation, rate-of-change computation, Hilbert transforms, or whatever.
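To make one of those concrete, here is a sketch of a single-pole low-pass filter in integer arithmetic, the sort of small fixed job you would park on one intermediate core. The coefficient and test data are made up for illustration.

/* Single-pole low-pass filter in integer arithmetic: one subtract, one
 * shift, one add per sample, so it fits comfortably in a tiny core. */
#include <stdio.h>

/* y += (x - y) / 2^k : smoothing factor of 1/2^k */
static int lowpass_step(int y, int x, int k) {
    return y + ((x - y) >> k);
}

int main(void) {
    int samples[] = { 0, 0, 1000, 1000, 1000, 1000, 1000, 1000 };   /* a step input */
    int y = 0;
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        y = lowpass_step(y, samples[i], 2);     /* k = 2, i.e. alpha = 1/4 */
        printf("x = %4d   y = %4d\n", samples[i], y);
    }
    return 0;
}

The filtered output eases up toward the step, which is exactly the kind of pre-digested data you would want the master core to receive.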
Engineering is the art of making what you want from what you can get at a profit.
