A 1000-processor chip powered by an AA battery (2016)

amelius · on Feb 15, 2021

One of the big difficulties with conventional processors is the memory hierarchy, where the entire memory is viewed as a shared memory space. There is no information in the article on how these processors communicate, so I'm guessing that this processor side-stepped the entire shared memory issue by offering only local memory combined with a simpler form of communication (e.g. message passing).

vidarh · on Feb 15, 2021

This is reminiscent of Parallella [1]/Epiphany (which unfortunately failed to gain traction). They didn't release any designs with core counts like that, but the entire point of the design was to enable large numbers of cores with in-core memory and predictable latency for accessing the memory of cores elsewhere in the grid.

Epiphany did have (slower) access to external memory too, but the challenge of these architectures is finding problems that requires too much branching to be able to take proper advantage of GPUs, yet so parallel that relatively low single-core performance is outweighed by throwing more cores at it and/or where the low power usage of the Epiphany would be worth it, and it seems they just didn't find enough of a market for it.

I have two of the kickstarter boards sitting around. Wish they'd gotten larger scale versions out.

[1] https://www.parallella.org/

UncleOxidant · on Feb 15, 2021

I have two of those boards too. They were working on a 1024 core version, but alas, they didn't survive as a company to make it.

myself248 · on Feb 15, 2021

Also arguably spiritual descendants of the Transputer?

vidarh · on Feb 15, 2021

Yes, there are definitively similarities in concept. The whole idea of an on-chip network between cores stems from the Transputer as far as I know.

cromwellian · on Feb 15, 2021

Reminds me of the PS/2 CELL Architecture, with the SPUs having limited local memory, and needing to DMA stuff around in a streaming fashion to build up larger programs.

rbanffy · on Feb 15, 2021

The Cell was a pain to program because of that hateful memory architecture. The anemic PowerPC CPUs didn't help much either.

Nothing really prevents an architecture where part of the address space is core-local (with low latency) and not directly accessible from other cores, and the rest points to a (much higher latency) shared memory pool. I have a feeling this would be much nicer to program than a Cell or the average GPU.

jabl · on Feb 15, 2021

> Nothing really prevents an architecture where part of the address space is core-local (with low latency) and not directly accessible from other cores, and the rest points to a (much higher latency) shared memory pool. I have a feeling this would be much nicer to program than a Cell or the average GPU.

Many GPU's have this kind of local scratchpad memory. Nvidia GPU's even allows the programmer to partition the local SRAM between the scratchpad memory (called shared memory in CUDA docs) and L1 cache.

rbanffy · on Feb 15, 2021

It's a shame nobody came up with an OS that runs entirely on a GPU.

breatheoften · on Feb 15, 2021

Seems like the way to go ... maybe even 3 layers of memory -- core local memory, memory that provides shared reader and writer consistency, and memory that provides shared reader, single writer consistency (for ownership transfers). Maybe the second form doesn't actually need to exist at all?

Minimizing ownership transfer cost across hardware elements seems like a fundamental concept to me that future architectures will need to specifically optimize in order to maximize the benefits that can be reaped from "task specific" hardware.

The more cores we add the more we have to gain by allowing them to be different from each other I think ...

djmips · on Feb 15, 2021

Nevertheless people were able to harness the Cell for good performance. It took a re-think and I am not denigrating your experience but our experience was that we benefited everywhere when we reorganized our data and algorithms around the SPUs.

als0 · on Feb 15, 2021

That was the PS3.

tornato7 · on Feb 15, 2021

Yep, at KnuEdge we built some processors with thousands of cores in that style architecture. It's a bit different to reason about but you can do a lot of wild computing that way. Too bad the company folded before any of our processors were available.

nextaccountic · on Feb 15, 2021

Yes, "Cores operate at an average maximum clock frequency of 1.78 GHz, and they transfer data directly to each other rather than using a pooled memory area that can become a bottleneck for data."

bee_rider · on Feb 15, 2021

This makes sense, everyone loves debugging MPI programs, so of course we'd want to replicate that experience on the desktop.

djmips · on Feb 15, 2021

You might have to be dragged kicking and screaming to the multicore because CPUs aren't getting faster.

bee_rider · on Feb 15, 2021

The distinction here is between distributed and shared memory, not between single and multicore programmming. Distributed memory is more often used on clusters. I would say distributed memory is generally considered easier on the hardware, harder on the programmer (although one could argue that a message passing communication scheme is more explicit than fork/join, so the programmer has more control).

superrad · on Feb 15, 2021

Should we be training on TIS-100[0] to be ready for when these chips become a necessity.

[0]https://en.wikipedia.org/wiki/TIS-100

kitd · on Feb 15, 2021

Cores ... transfer data directly to each other rather than using a pooled memory area that can become a bottleneck for data

This sounds like transputers. IIRC they had direct memory linkage too.

thawkins · on Feb 15, 2021

yep each transputer had 4 serial links, which could be connected in matrixes, and allow data to be moved across the array directly. They used a language called OCCAM, and there was an OS called TAOS that allowed you to run various transputer topologies.

TickleSteve · on Feb 15, 2021

TAOS also ran on many other architectures (68K, x86, ARM). It implemented a virtual processor (VP1) and evolved into Elate but didn't get much recognition despite its novelty.

TAOS was definitely ahead of its time, but rarely gets mentioned unfortunately.

rbanffy · on Feb 15, 2021

Dick Pountain wrote a lot of articles for BYTE about that.

This is one: https://sites.google.com/site/dicknewsite/home/computing/pri...

kitd · on Feb 15, 2021

OCCAM!! That was it! Thx I was trying to remember the name of the language they used.

I wonder whether OCCAM would be usable here, or if not, then something based on the lessons learnt from it.

TickleSteve · on Feb 15, 2021

OCCAM implemented CSP (Communicating Sequential Processes) that Go took a lot of lessons from.

(https://en.wikipedia.org/wiki/Communicating_sequential_proce...)

Zenst · on Feb 15, 2021

Some more detailed information here: https://en.wikichip.org/wiki/uc_davis/kilocore

No idea how it has progressed or if sample units became accessible - which would be nice.

karmakaze · on Feb 15, 2021

In addition to the memory access I was wondering how they got to 1000 cores and not 1024.

> The chip is designed as a massively parallel processor array, with 992 cores arranged as a grid 32 by 31. Eight additional cores are found along with 12 memory modules of 64 KB SRAM each (for a total of 768 KB). Communication between cores is done via a dual-layer source-synchronous circuit-switched network and a very-small-area packet router (see wormhole routing). The circuit-switched network supports communication between adjacent and distant processors, as resources allow, with each link supporting a maximum rate of 28.5 Gbps. Maximum throughput is 45.5 Gbps per router. Both network types contribute to an array bisection bandwidth of 4.2 Tbps.

AtlasBarfed · on Feb 15, 2021

Process/Fab errors?

24 cores of failure out of 1000 would be an error rate around 2%, which would help in yields.

nsteel · on Feb 15, 2021

That's a good reference.

I had assumed there were 1000 instruction memory SRAMs in there but of course that'd be impossible considering the leakage numbers and there's actually just 12 x 64KB SRAMs shared by all the cores:

     Per core
        640 bytes (128x40-bit) local instruction memory
        512 bytes (256x16-bit) local data memory
    768 KB SRAM on-die
        12 shared SRAM memory modules, 64 KB each

Those local per-core memories are tiny and presumably implemented with registers. But can you really do much with just 128 instructions? The following makes it sound like they might be instruction caches:

> Instructions may come from the local instruction memory or they may be fetched from one of the independent memory module

But then goes onto say:

> the KiloCore's processors do not contain traditional caches.

Hopefully their paper has more info.

The survey that the "World’s First 1,000-Processor Chip" claim is based on (http://vcl.ece.ucdavis.edu/misc/many-core.html) doesn't seem to include any network processors, some of which are technically programmable. But even if it had, they wouldn't have found any of the information required for the table and definitely not a datasheet!

code-scope · on Feb 15, 2021

Maybe one can build/run Tensorflow stack on top of this with just customize the tensor sw on each node?

Or is the 16 bits CPU too limited for training operations?

klyrs · on Feb 15, 2021

This is kinda wild. It seems like one could implement this in a large FPGA today and reasonably expect to hit a similar clock frequency.

DrNosferatu · on Feb 15, 2021

...from 2016. back in grad school, I did a comparison between the (then new) XeonPhi and GPUs to run my own N-Body engine using the same OpenCL implementation - the rest is history: GPUs still rule today for high intensity computational needs. Look at Machine Learning: GPUs are so good at SIMD, that if your domain can benefit from faster computation, /i you should make it SIMD i/, if it's not already. ;)

klelatti · on Feb 15, 2021

It's sad that OpenCL development seems to have lost its way. I have some OpenCL workloads where a running on a CPU is faster and some where GPUs are much quicker. The beauty of OpenCL is that I can choose between the two with zero code changes.

But CUDA and ML have steamrollered everything!

DrNosferatu · on Feb 15, 2021

Indeed, OpenCL runs on everything - if you can set it up... :/

Actually, at the time, what I did was a 3-way comparison between GPU, XeoPhi and classic CPU (was it 8 cores?) running the same OpenCL code. Anyways, the "many-core" specimen, XeonPhi, benched at the geometric mean between classic CPU and GPU - so closer to classic CPU! (we were using beta hardware, our intel specialist - nice, helpful guy actually - went mute after I got those results :D ) On top of that, reimplementing the same algo in CUDA gave ~10% advantage over OpenCL on NVidia hardware. I wonder how it would fare on a FPGA of the same price/TDP?

And more relevant for 2021: - I wonder if we would get the same kind of results using Vulkan?

- Speaking of which, does anyone know of any good books / tutorials / runnable examples of Vulkan in a compute-oriented application?

Cheers!

DrNosferatu · on Feb 15, 2021

PS: Does anyone have any experience with ArrayFire? https://arrayfire.com/ (I believe it's from the people that did the [GPU]Jacket for Matlab)

MayeulC · on Feb 15, 2021

I heard of Hallide quite recently, and find it promising: https://halide-lang.org/

Not only you can decide on CPU vs GPU, you have more control over the kind of parallelism and intermediate results.

logimame · on Feb 15, 2021

I really hope SYCL (https://en.m.wikipedia.org/wiki/SYCL) can gain some traction (since Intel also jumped on the GPU competition and are backing a substantial amount of it)

oblio · on Feb 15, 2021

We're probably a programming language revolution away from these kinds of CPUs being practical. Isn't parallelism complicated and practically dangerous in most mainstream languages?

Is there any language where you can just go:

  parallel-for:

    do-my-stuff

and have the programming language Do The Right Thing with no further hassle? The right thing usually being blocking execution until every parallel do-my-stuff finishes execution, in my experience.

In most programming languages I've seen this is a 20+ line boiler plate full of caveats and risks.

hvidgaard · on Feb 15, 2021

It's simple to do what you write, multiple languages already do just that. It's the class of problems that is called "embarrassingly parallel" (https://en.wikipedia.org/wiki/Embarrassingly_parallel). However, it turns out that it places some rather serious restrictions on the kind of calculations you can perform this way. In general for two pieces of code to be executed in parallel, they must be independent of each other. I.e for the following

    1: tmpA := A(input)
    2: tmpB := B(tmpA)
    3: tmpC := C(tmpB)
    4: result := D(tmpC)

it's impossible to calculate a line before the previous line has finished. You cannot do this calculation concurrently. On the other hand if it was like this:

    1: resA := A(input)
    2: resB := B(input)
    3: resC := C(input)
    4: resD := D(input)

You can calculate all 4 concurrently and achieve a great speed-up if you have resources to do it. Except if B, C, or D somehow access the result of a previous line. For mathematical notation this is simple, but for general programming languages there are plenty of ways to do this.

We know that an optimal optimization, i.e. finding the fastest way a program can be executed is known uncomputable. Not difficult, not brute forceable, but impossible to compute with traditional computers. The best we can do is identify _some_ concurrency optimizations, but so far it has not been a fruitful adventure for general programming languages. So what we can do is use patterns and constructs that rely on dividing the problems into concurrent pieces and use proper guarding on shared state. This is however notoriously easy to get wrong.

oblio · on Feb 15, 2021

I know about "embarrassingly parallel" problems, and the thing is, they're only "embarrassingly parallel" to solve in math.

When programming the level of friction is from annoying to high unbearable, depending on the programming language. It should be trivial to implement these solutions, as trivial as the sequential solutions.

It's not. Until this is the case in mainstream programming languages, I doubt we'll have CPUs with 1000 cores in our smartphones or laptops. I mean, we'll have them but they'll be useless because 99% of software today is practically single-threaded. The reason multiple cores are good for laptops, for example, is because we're multi-tasking between multiple applications, not because those cores are frequently used by day-to-day apps. I'm talking about random small apps, not compute intensive ones that have to use multi-threading intensively (rendering, compiling, what have you, professional level software).

rbanffy · on Feb 15, 2021

> but they'll be useless because 99% of software today is practically single-threaded.

I used to joke developers should get workstations with SPARC Niagaras or Xeon Phis for that reason: core count is going up and a CPU with a dozen small cores is much cheaper to build than one with 4 beefy ones. Now some of the low-end chips Intel is pushing have 4 SMT2 cores. HEDT is on the 16 SMT2 core range and, if our software continues to be single threaded, there will be a lot of silicon being used for nothing more than spreading heat.

astrange · on Feb 15, 2021

That wouldn't be good if they're targeting battery-powered devices, because you usually don't want to turn on extra cores there. Mobile programs can actually go faster if you make them single-threaded - either the device doesn't have more cores available to run your extra threads, or your process isn't high-priority enough to spend more power on.

colejohnson66 · on Feb 15, 2021

Isn’t the point of high core counts for workloads that can actually be parallelized? Like compilation or graphics rendering?

rbanffy · on Feb 15, 2021

A lot of tasks can be parallelized if you think hard enough. Your CPU is busy reordering instructions so they keep as many execution ports busy as it can so that it can retire as many instructions per core cycle as possible. SMT was invented to keep those execution units busy by running more than one instruction stream at a time.

The incentive to do it when most computers have two SMT2 cores is small, but as the average device starts getting 4 or 8 SMT2 cores, or 8 asymmetric cores, the incentives to make things run faster in parallel get better and better.

vidarh · on Feb 16, 2021

A lot more things can be parallelised than we do today.

Consider this: We mostly parallelize compilation today at the unit of a single compilation unit.

We do this for two reasons: It's simple, for languages where separate compilation is possible, and it's good enough when you have a small-ish (smaller than the number of compilation units) number of fast cores.

But a lot of compilation can be reduced into far more fine-grained parallel steps.

E.g. parsing can be turned into a chain of array operations along the line of finding possible start/ends of tokens, finding possible starts and ends of quoted segments, finding possible starting points for matches against various rules based on the possible token starts etc. You can further build syntax trees as arrays of vectors. If you have many enough cores it may pay off to subdivide the work to extreme levels like that.

But it's hard to reason about. We're not used to it, and most of us are no good at conceptualising it. Some are.

For a real-life example of a compiler that actually runs on GPU's and aims for massive parallelisation (it compiles a subset of Dyalog APL), check out "co-dfn" by Aaron Hsu. It's very terse and complex to read, and frankly I've only spent time looking at it very superficially, but it's an amazing piece of work. Now, if it was easier for mere mortals to actually write code like that, and we had a more compelling reason (many, but slower cores), we'd probably see a lot more parallelisation. In the short run it'd be extremely painful, but in the long run it might well benefit us.

tenebrisalietum · on Feb 15, 2021

Imagine an OS where each new thread or process gets allocated its own core. You'd be limited to 1k threads/processes on this CPU, but what would otherwise be the benefits/drawbacks?

hvidgaard · on Feb 16, 2021

.net has TPL, and most languages with locking constructs has similar libraries. Even if you have to roll it by hand, it's the hard part is, for embarrassingly parallel problems, not more difficult than other problems programming professionals encounters. In my experience anyway.

chmod775 · on Feb 15, 2021

> We know that an optimal optimization, i.e. finding the fastest way a program can be executed is known uncomputable. Not difficult, not brute forceable, but impossible to compute with traditional computers.

I'm really curious about the proof of this, since there's only a finite number of ways you can re-arrange instructions in a program. Likewise there's only a finite number of ways you could shard them across threads.

For this to be true you'd clearly need to phrase the problem in a way that allows you to have an infinite number of possibilities, then prove you can't arrive at the optimal solution through some means that doesn't involve checking an infinite number of them. (Edit: Or prove that it is impossible to decide on which one is the fastest at the time the optimizer runs. When does the optimizer run?).

Was the assumption that you'd also be (infinitely) unrolling loops or maybe 'rewriting' the program into something equivalent but faster? What does "finding the fastest way a program can be executed" mean? Just re-ordering instructions and distributing them across threads, or something more?

hvidgaard · on Feb 16, 2021

Optimization is not only rearranging instructions. You can make a program of twice the size that runs twice as fast but does the same computation. There is no upper bound.

But even so, minimizing the program size is uncomputable as well. Given a Turing Machine M, create a program P(i) that simulates M for i steps and return true if the machine reaches an accepting state. If you provide an optimizing program OPT, then OPT(P) would be "return false;" if and only M does not halt. I.e. the optimizer would solve the halting problem which we know is not possible.

chmod775 · on Feb 16, 2021

> Optimization is not only rearranging instructions.

And not only finding parallelizable instructions either, I assume.

Thanks for the clarification, because such a proof would have surprised me a lot.

Denzel · on Feb 15, 2021

Finding the optimal execution time of an arbitrary program is equivalent to the halting problem. [1] If you narrow the “arbitrary” constraint to only include well-structured, analyzable, guaranteed-to-terminate programs, then you can at least start to approximate a solution. Finding the true optimal case, even under those conditions, would be computationally expensive.

[1]: https://en.wikipedia.org/wiki/Halting_problem

chmod775 · on Feb 15, 2021

This wasn't about anything related to optimal execution time. That an optimizer needs to know whether the code it optimizes terminates (or how long it runs) would also need proof if you'd want to go that route.

The assumptions and optimizations the optimizer is allowed to use and what kinds of programs we're talking about, and what even is considered an "optimal program" is also still unclear (Edit: I'm assuming lowest number of instructions executed serially?).

Please do refer me to a paper. Please do not link me vaguely related Wikipedia articles.

vidarh · on Feb 15, 2021

It's equivalent to the halting problem because there exists a set of problems for which for any input that optimiser creates the optimal solution for, there exists another input for which that problem is not optimal. The parallel to the halting problem is that with access to the output of the optimiser, you can always construct a problem where whatever the optimiser produces can be obstructed by producing an input that makes the optimised program non-optimal for the given input.

A trivial example of such a program is a function that sorts it input and returns the sorted result, as no sort is optimal for all inputs.

chmod775 · on Feb 16, 2021

> It's equivalent to the halting problem because there exists a set of problems for which for any input that optimiser creates the optimal solution for, there exists another input for which that problem is not optimal.

You're trying to prove that no optimal solution can exist, not that it's impossible to find one. Which is fine, because it's a stronger claim.

But it heavily depends on your definition of optimal. If you define optimal as "lowest average number of instruction across all possible inputs", then there exist optimal sorting algorithms.

If you define optimal as: "A solution is optimal only if for every possible input there exists no solution that requires a lower number of instructions.", then yes, there can be no optimal sorting algorithm.

I would argue that in practice only the first definition is useful.

Further, you're assuming that "optimizations" that replace an existing algorithm with an equivalent one are allowed. Which brings me back to one of my original questions: What optimizations was hvidgaard allowing to be me made when he claimed an optimal solution is incomputable.

vidarh · on Feb 16, 2021

> You're trying to prove that no optimal solution can exist, not that it's impossible to find one. Which is fine, because it's a stronger claim.

I did not prove no optimal solution can exist. I gave an example of a subset of programs which would resist optimal optimisation. For a lot of programs an optimal version will exist. E.g. "return false" is obviously trivial to generate an optimal version of.

[I'll also note hvidgaard provided a far simpler proof this problem is equivalent to the halting problem that invokes the halting problem directly as part of the solution by forcing the optimiser itself to solve the halting problem to determine the valid resolution, but my example shows that the optimiser can not solve the general case even if the program being optimised is guaranteed to halt]

> But it heavily depends on your definition of optimal. If you define optimal as "lowest average number of instruction across all possible inputs", then there exist optimal sorting algorithms.

The commenter you replied to gave a definition: "finding the fastest way a program can be executed".

However, I believe your definition won't help (even if I ignore the "lowest average number of instructions" - nobody cares about the number of instructions executed, but at about wall-clock time or space complexity, or possibly the size of the generated code), because the input to the program and the input to the sorting routine here are not necessarily the same. You can extend the sort problem to a more complex program that can incorporate the element that reads back in its own binary and perturbs the inputs fed to the sort component. Indeed to "borrow" from hvidgaard, you have no guarantee that the program in totality will halt at all, but even if you disregard not halting, as long as the program can read its own code, the problem remains.

> I would argue that in practice only the first definition is useful.

In practice that definition is very often totally useless, because we very often have knowledge about the expected input data for which the way we express an algorithm and the specific choice of algorithm is the only way most programming languages allow us to express the necessary expectations.

Effectively most optimisers aren't advanced enough anyway that this is a consideration, because we're hitting space-time complexity tradeoffs for the cost of running the optimisers already when trying to do unambiguously beneficial optimisations.

> What optimizations was hvidgaard allowing to be me made when he claimed an optimal solution is incomputable.

Given the claim was stated without any constraints, it is natural to assume the claim was general. That is, that a valid output of the optimiser is a program that given the same input produces the same output. The point of the claim was to point out the sheer complexity of aiming for auto-parallelisation beyond smaller transformations.

chmod775 · on Feb 16, 2021

I do not understand why you are trying to drag the discussion from mathematically provable statements into whatever territory this is.

In any case your response is not very readable - some sentences are cut off halfway through or mangled. Maybe you could fix it up and I'll give it a second reading afterwards.

> Given the claim was stated without any constraints

The original commenter was suggesting there is some mathematical proof for the statement: "finding the fastest way a program can be executed is known uncomputable."

Given the context it would seem likely that the optimizations they were talking about are specifically about finding which instructions can be executed in parallel.

Consider their next sentence: "The best we can do is identify _some_ concurrency optimizations."

We won't be getting anywhere like this either way. This started with me asking for clarification and a reference, and you keep trying to present me with something that is neither. I can appreciate you wanting to add some insights, but only hvidgaard itself would know what they meant to say. Edit: They posted a response 3 hours ago which I did not see until now. That settles that.

vidarh · on Feb 16, 2021

I've fixed the incomplete statement, but it does not make a material change. If you don't understand the argument, I'm not sure how to restate it to make it clearer.

> The original commenter was suggesting there is some mathematical proof for the statement: "We know that an optimal optimization, i.e. finding the fastest way a program can be executed is known uncomputable."

And they have provided a proof for that, which was short and concise.

> Given the context it would seem likely that the optimizations they were talking about are specifically about finding which instructions can be executed in parallel.

I'd rather - given that it matches their followup proof - assume that the statement is intended to be taken as written rather than assuming additional constraints that they did not state anywhere.

> Consider their next sentence: "The best we can do is identify _some_ concurrency optimizations."

Yes, because we can not create an optimal program. This is not attaching any limitations to the hypothetical optimiser from the previous statement.

> This started with me asking for clarification and a reference, and you keep trying to present me with something that is neither.

I stated a proof by giving an example that can not be optimally optimised according to the criteria hvidgaard stated. I'm sorry if it wasn't clear enough for you. It was more complicated than hvidgaards very elegant proof, granted, though it also goes further, in showing that even for programs that do terminate there exists a subset which there is no solution that meets hvidgaards criteria.

> I can appreciate you wanting to add some insights, but only hvidgaard itself would know what they meant to say.

hvidgaard has also stated a proof, by showing that you can devise a program that requires the optimiser to solve the halting problem to return the optimal program, that just like mine made no assumptions about the set of allowed optimisations.

I don't know what you're still arguing here given that whatever you think of my sort example, hvidgaards own proof for their statement is simple an unambiguous. This is also not at all a new discovery either, but a well settled topic of computer science.

chmod775 · on Feb 16, 2021

I did not see hvidgaards' response until now. It's fairly recent (3h) and way up the thread.

And yes, I'm well aware that the issue is settled for general optimizations.

At risk of repeating myself for a third time: I was curious whether, when only using optimizations that involve re-ordering instructions/finding instructions that can be executed in parallel, there is proof that there exists no optimal solution.

If that's still not clear, I'll quote myself from my original comment:

> Was the assumption that you'd also be (infinitely) unrolling loops or maybe 'rewriting' the program into something equivalent but faster? What does "finding the fastest way a program can be executed" mean? Just re-ordering instructions and distributing them across threads, or something more?

Quite disappointingly that wasn't what they were speaking of, because I would've been quite surprised by such a proof for reasons outlined in my original comment.

hvidgaard · on Feb 16, 2021

There always exists _an_ optimal solution for any given algorithm. But question is if you can compute such an optimal solution. Or even if you can compute if a solution is optimal.

I would say that a reasonable definition of an optimal solution to a problem, excluding concurrency for now, is the same as minimizing the number of steps needed to complete the algorithm. Take my proof from the other comment, it states that if you have a perfect optimizer, you have solved the halting problem.

Adding concurrency to the optimization, i.e. the OPT function can return a program that executes concurrently in which ever way possible on some given hardware. It will still return "return false;" if and only if the original algorithm never terminates.

chmod775 · on Feb 16, 2021

Yes. I should've said "proof that the optimal solution is uncomputable". My bad.

hvidgaard · on Feb 17, 2021

Then I do not understand what you're asking. My proof is just that. If you have a non terminating TM and a program simulating it after i steps and return true only if it's in an accepting state, the optimal solution to that would always be "return false" for TMs that do not ever terminate, which is solving the halting problem. This goes for any optimization you can think of.

chmod775 · on Feb 17, 2021

You already answered my question earlier, perhaps by accident. It was mostly about clarifying which optimizations are being talked about. I'll paraphrase myself here:

I was curious whether, when only using optimizations that involve re-ordering instructions/finding instructions that can be executed in parallel, there is proof that the optimal solution is incomputable.

In context it appeared that may be what you meant, but you clarified that you didn't.

Under these constraints one might say it appears somewhat obvious that the optimal solution, i.e. a solution that executes the most instructions in parallel is computable. It's done by building the dependency graph of instructions, then ordering/distributing the instructions across execution units. For anything more than that you will run into the halting problem, but anything more would involve changing/removing/adding instructions or otherwise making changes to the program logic, which is not allowed.

One might complain that the optimizer now in some cases can't know whether two instructions depend on one another without running the program (writing, then reading unknown memory addresses for example), but any attempt to optimize cases where we can't statically infer or where the memory addresses are only sometimes the same would result in changing or adding instructions anyways (inserting a branch)[1], so as far as we are concerned these always depend on one another. We're not allowed to add/change instructions after all.

A valid criticism of this class of optimizers is that they fail spectacularly for self-modifying programs or programs that read their own instructions. A contrived example would be a program that reads some parts of its own code. We can't really do anything about this since we're not allowed to rewrite these instructions to make them still work correctly - if we even can. It's more useful to just exclude that class of program.

Anyways, a short proof that it is possible might look like this: There is only a finite number of permutations of the original program that satisfy these constraints (re-ordered/parallelized instructions), generating them all is trivial and choosing the one with the most instructions executed in parallel is as well.

Which brings me back to:

> The best we can do is identify _some_ concurrency optimizations

So my point was really, that assuming "concurrency optimizations" only includes re-ordering instructions and distributing them across execution units using whatever technology, intuitively it would seem that we can identify all of them.

But that's not what you meant. You included things such as completely replacing the algorithms with equivalent ones in "concurrency optimizations".

[1] There is an assumption here that one could disprove by example. The assumption is this: You can statically infer two values are the same when they are always the same. If this doesn't hold and the instructions using them could be run in parallel but aren't, I'm just wrong.

hvidgaard · on Feb 18, 2021

That is true, if you restrict yourself to only reorder and execute in parallel, you can probably optimize for number of instructions. That is just not particular interesting as you miss many optimizations, and you certainly cannot find the optimal optimization.

yoshuaw · on Feb 15, 2021

MSVC's C++ Parallel algorithms library gets close [1], and so does Rust's Rayon crate [2]. C# has parallel for loops [3], and apparently so does PowerShell [4]

Though none of these are quite as convenient as I think they could be. I suspect there is a path for Rust to add parallel iteration syntax in the future [5]. There are quite a few steps needed to get there, but the result could look something like this:

    let mut listener = TcpListener::bind("127.0.0.1:8080")?;
    println!("Listening on {}", listener.local_addr()?);
    
    par for stream? in listener.incoming() {
        println!("Accepting from: {}", stream.peer_addr()?);
        io::copy(&stream, &stream)?;
    }

[1]: https://devblogs.microsoft.com/cppblog/using-c17-parallel-al...

[2]: https://docs.rs/rayon/1.5.0/rayon/

[3]: https://dotnettutorials.net/lesson/parallel-for-method-cshar...

[4]: https://devblogs.microsoft.com/powershell/powershell-foreach...

[5]: https://blog.yoshuawuyts.com/parallel-stream/#future-directi...

eptcyka · on Feb 15, 2021

What'd be the concurrency primitive used in the parallel block? Is scheduling done per loop or globally? All of these questions are trivial if there's a runtime and become very cumbersome if the language has ambitions to be usable without any standard libraries.

volta83 · on Feb 15, 2021

If you use the standard library, its parallel run-time gets automatically linked. This is often the right choice when building binaries for a platform with an operating system, etc.

If you do not include the standard library _AND_ if you use these features, then you'd need to provide your own parallel run-time or your application won't compile.

There are many valid implementations of these run-times tuned for different applications, these run-times can be as simple or complex as you'd like, in some cases some run-times only work on particular bare metal targets, and in most cases you just pull a library with the implementation that you'd want to use.

Since a valid implementation of the run-time is to just run all tasks sequentially, you can provide a tiny (often "zero-size") run-time for those environments on which code-size trumps everything.

You can also provide the standard parallel run-time, but this might require operating-system like functionality, like thread, mutex, etc. support. If your embedded target does not provide those, you'd need to provide implementations of these yourself.

The important thing is that you don't need to change any code that uses the run-time, e.g., all the app code can still use fork-join parallelism, but you control how that's executed by swapping the modular runtime component.

yoshuaw · on Feb 15, 2021

Yes, exactly. Rust already allows for switching global allocators through `#[global_allocator]`. But now also allows plugging allocators inline through the allocator-api methods.

I'd imagine something similar could be introduced for threadpools / runtimes.

Someone · on Feb 15, 2021

I don’t think you can do that well without some global state and code to handle it (aka ‘a runtime’).

It’s not like separate parts of a program can each optimize for the number of threads and their memory for their goals. Optimization/tuning has to happen globally because the resources used (threads, memory) are shared between those parts.

I think most solutions have some global state that magically decides how many threads to run in parallel (more for I/O bound code), how many to assign to each part (which don’t even have to be perfectly separate. Part P might parallellisme a loop where each iteration calls into part Q, which also runs a parallel loop. If each of these decides to run one thread per CPU for N threads in total, the result still c/would be that N² threads run in parallel, likely with suboptimal results)

AtlasBarfed · on Feb 15, 2021

I love GPars for Groovy, but unfortunately it isn't CompileStatic compatible or Groovy 3.0 compatible.

nicoburns · on Feb 15, 2021

Rust's Rayon library is pretty good for these use-cases https://github.com/rayon-rs/rayon

    use rayon::prelude::*;
    fn sum_of_squares(input: &[i32]) -> i32 {
        input.par_iter() // <-- just change that!
             .map(|&i| i * i)
             .sum()
    }

foerbert · on Feb 15, 2021

I'm not sure this is the best link, but I don't have anything better in mind and it was the first thing that popped up, so...

Anyway, there's a language called ParaSail[0] that's doing stuff in that direction. A little bit back the guy behind it got hired by AdaCore (Ada compiler folks) and they seem to have adopted it and continuing with it. I haven't heard too much about it though.

[0]https://adacore.github.io/ParaSail/

jackpeterfletch · on Feb 15, 2021

Occam has exactly this. SEQ executes statements sequentially. PAR executes them in parallel, though the language is based on CSP and Cooperative Multitasking, so the idea of it doing 'the right thing' with a PAR is a little different.

Think of Occam programs as big Factorio factories.

https://en.wikipedia.org/wiki/Occam_(programming_language)

Alot the ideas that came out of Occam and CSP have been brought into various coroutine libraries though. Kotlin Coroutines, Goroutines, etc. So its definitely something that is explored.

jacquesm · on Feb 15, 2021

I was looking for this comment before posting. Thank you.

rbanffy · on Feb 15, 2021

> Is there any language where you can just go:

Nothing prevents a map function to run in parallel.

There are other things that can happen in parallel, however - the whole instruction flow is reordered in the CPU before it is executed and if you have enough execution ports and enough in-flight instructions to keep them fed, you can accomplish a lot in a single clock cycle before your source code even needs to acknowledge that instructions don't happen exactly in the sequence they are written.

flohofwoe · on Feb 15, 2021

You can get that without a programming language revolution by reformulating the problem a bit.

For instance 3D-API shading languages do just that under the hood, just without the "parallel-for". Instead you tell a 3D-API what code should run per-vertex or per-pixel in (usually) a traditional sequential C-style language, and the GPU driver and hardware care about the parallelization and scheduling.

Kuinox · on Feb 15, 2021

Yes, SQL. Write your query and it will be magically parallelised without you knowing it.

gpderetta · on Feb 15, 2021

#pragma omp parallel for

oblio · on Feb 15, 2021

What's the operational cost of doing that?

gpderetta · on Feb 15, 2021

What do you mean exactly?

You need a C/C++/fortran compiler that supports openmp, which is most of the mainstream ones (I believe MSVC still only support an older standard, still more than enough for parallel for). Compilers that do not support openmp are supposed to ignore the pragma and the code will still run fine (although serially) and be fully compatible.

Other than that and possibly passing the correct command line flag, openmp is fairly low friction. A parallel for can provide significant speedup for embassingly parallel problems and these days it even support custom random access iterators in C++.

Of course it is more involved to express less embarrassingly parallelizable problems, but it is still doable.

gens · on Feb 15, 2021

While reading the article i thought of Erlang.

Multiple independent programs sending messages to each other fits well into multiple independent cores sending messages to each other.

Someone · on Feb 15, 2021

In scala (https://docs.scala-lang.org/overviews/parallel-collections/o...):

Serial:

  list.map(_ + 42)

Parallel:

  list.par.map(_ + 42)

In C# (https://docs.microsoft.com/en-us/dotnet/standard/parallel-pr...)

Serial:

  for (int i = 0; i < matARows; i++)

Parallel:

  Parallel.For(0, matARows, i =>

(Function takes a lambda that gets called for each value)

Swift has DispatchQueue.concurrentPerform (https://developer.apple.com/documentation/dispatch/dispatchq...). That is a bit less clear and less flexible (won’t iterate over an array’s values, only over its indexes) but also fairly easy to use.

There probably are many more languages with similar constructs.

snak · on Feb 15, 2021

Yes, I do agree C# does have many constructs (Parallel For/ForEach/Invoke...) but there's many things to consider that are not designed with thread safety in mind (e.g. generic collections, List<T>, Dictionary<T>, etc... must be replaced by their System.Collections.Concurrent respectives) and many other risks that parallel computing poses.

  int x = 0;
  Parallel.For(0, 99999, z => {
    x++;
  });
  Console.WriteLine(x);

Something as simple as an integer sum won't produce the expected output unless you're using Interlocked increments.

And that definetely does not match what oblio means with "have the programming language Do The Right Thing with no further hassle".

api · on Feb 15, 2021

Rust is much safer for parallelism and has a lot of safety rules around closures, so this would probably be achievable in Rust without much pain via a clever template library.

The problem is that not all problems are easy to parallelism at the algorithmic level, and for some it may be impossible.

Alongside massively multi core chips I would also love to see more efforts to push the envelope on single threaded performance.

cfstras · on Feb 15, 2021

Java has this since 8:

    Arrays.asList("hello", "world").parallelStream()
        .forEach(s -> System.out.println(s));

If you want to calculate stuff, you can

    List<String> suffixed = Arrays.asList("hello", "world").parallelStream()
         .map(s -> s + "_suffix")
         .collect(Collectors.toList())

leprechaun1066 · on Feb 15, 2021

q has this out the box: https://code.kx.com/q/basics/peach/

Faaak · on Feb 15, 2021

It's builtin into prolog's logic and its derivates, alas they are rarely used in the industry :'(

mhh__ · on Feb 15, 2021

D does this as a library i.e. just add parallel to your foreach aggregate.

As to languages built around this, there are quite a few, e.g. Intel have Data Parallel C++

legulere · on Feb 15, 2021

This architecture is called a manycore processor. There are several programming paradigms that are researched for it, among others actor languages, where the idea is to run parts of your program on different cores that communicate with each other.

https://ne.wikipedia.org/wiki/Manycore_processor

4gotunameagain · on Feb 15, 2021

MATLAB has parfor, but that's barely a language..

bipson · on Feb 15, 2021

ADA is quite good at this.

But not every problem (i.e. the Algorithm for the problem) can be divided easily without any coordination (i.e. communication) between the chunks. This overhead can quickly become significant.

There is a reason why desktop software (particularly the computationally intensive) is having a hard time exploiting multi-core benefits (besides the programming language and habits in general).

tonyedgecombe · on Feb 15, 2021

C# has it: https://docs.microsoft.com/en-us/dotnet/api/system.threading...

Of course when you start doing things in parallel there are automatically a number of caveats and risks.

prennert · on Feb 15, 2021

Dont have many languages already some sort of map implementation that parallelizes pure function calls? Even python has ThreadPoolExecutor.map and ProcessPoolExecutor.map.

It gets difficult once you have side effects and/or synchronization (join) of the results are required.

PartiallyTyped · on Feb 15, 2021

In python, with numpy, you can use `np.vectorize`, there exists a jax equivalent that can enable accelerators to do your parallel stuff. Cuda exists, but it isn't as simple as `parallel-for`. For trivial for-loops, you can use OMP.

rand_r · on Feb 15, 2021

Unfortunately np.vectorize doesn’t do anything smart.

From the docs:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

PartiallyTyped · on Feb 15, 2021

Ah right, yes I remember. Perhaps .vmap from pytorch?

kevin_thibedeau · on Feb 15, 2021

Hardware description languages are parallelized by default.

jiofih · on Feb 15, 2021

Pony? https://www.ponylang.io/

spiritplumber · on Feb 15, 2021

Parallax's Spin lets you do this.

GTP · on Feb 15, 2021

>Isn't parallelism complicated and practically dangerous in most mainstream languages?

Yes, but functional languages solve this problem.

dang · on Feb 15, 2021

Discussed at the time: https://news.ycombinator.com/item?id=11935999

Bluestein · on Feb 15, 2021

"The chip is the most energy-efficient 'many-core' processor ever reported, Baas said. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery. The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor."

Bluestein · on Feb 15, 2021

(Reminds me of this other chip - built by a "giant" of computing - that was designed to be -very- energy efficient, while having a very large number of very simple cores ...

... ran a special FORTH if I am not mistaken ...

I cannot remember where I found a very nice presentation the gent gave about it ... :/ )

palsecam · on Feb 15, 2021

http://www.GreenArrayChips.com/ by Chuck Moore, father of Forth.

ajb · on Feb 15, 2021

There were previous iterations as well. I think this was all funded by some patent they licenced to Intel.

The most astonishing part was that he rolled his own VLSI CAD system: https://mschuldt.github.io/www.colorforth.com/vlsi.html

brandonmenc · on Feb 15, 2021

A video of him demonstrating that system:

https://www.youtube.com/watch?v=Dbd7Xu0ibJM

Bluestein · on Feb 15, 2021

Bluestein · on Feb 15, 2021

So kind :)

Thank you.-

Right on the dot ...

(I found it fascinating, when I first saw it :)

agumonkey · on Feb 15, 2021

There was two talks of Moore on GA chips. Multi core forth processors were (and still are) so inspiring.

Bluestein · on Feb 15, 2021

That's it, indeed :)

karmakaze · on Feb 15, 2021

Yeah, thought it was interesting that the subject chip didn't have a heat sink/fan and one supporting? chip on the board behind it did.

peter_d_sherman · on Feb 15, 2021

This might make a good candidate architecture to become a future GPU (see discussion here):

Nyuzi – An Experimental Open-Source FPGA GPGPU Processor

https://news.ycombinator.com/item?id=26132726

FpUser · on Feb 15, 2021

This is very interesting stuff. I'd love for these kind of chips to be sold in a form of a coprocessor cards for PCs for example. For some type of servers and other uses it'd be a blessing.

SuchAnonMuchWow · on Feb 15, 2021

It does exist in various forms. One I'm familiar with is Kalray MPPA (massively parallel processor array): https://www.kalrayinc.com/technology/ Their current generation of processors have 80 cores exposed to the userspace (the previous one had 256), and they are mainly used as PCIe acceleration boards for all kinds of applications.

But as another poster said, the difficulty with those kinds of architecture is the memory hierarchy: you don't want all the cores to access the DDR directly, as that would be a massive bottleneck. And this becomes a larger issue as you increase the number of cores.

jpm_sd · on Feb 15, 2021

Those are called GPUs! Or, if you like, GPGPUs.

Readily available in The Cloud(TM) and everything.

https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html

FpUser · on Feb 15, 2021

Nope. GPUs are SIMD. KiloCore is however 1000 totally independent cores.

captainbland · on Feb 15, 2021

My understanding is that on Nvidia cards, each SM is basically an independent SIMD processor. So a RTX 3090 is like a processor with 82 independent but super-wide cores. I think the truth is actually even more complicated still with multiple warps being able to execute on a single SM with some memory subsystem limitations but it's something like that. The equivalent is true with AMD GPUs, too, just substitute the terminology where appropriate.

Either way still not close to 1000 independent cores.

desertrider12 · on Feb 15, 2021

In a way, there are many more independent cores in an Nvidia GPU than #SMs, since the SIMD (warp) is only 32 threads wide, and >1k threads can be active at a time per SM. Each warp has its own independent program counter. So if you really wanted to, you could have a kernel with blocks that are Nx32, and code like:

if(blockIdx.x == 0) doThing1(); else if(blockIdx.x == 1) doThing2(); ...

etc., and the doThings could run at the same time, even on the same SM. Not super practical but it's allowed.

guerrilla · on Feb 15, 2021

I can't escape the nagging question: is it really 1,000 or actually 1,024 and simplified for the press release?

Bluestein · on Feb 15, 2021

Useful comment upthread :)

- https://news.ycombinator.com/item?id=26143976

kristianp · on Feb 16, 2021

Clearly there's enough runway in conventional low core-count processors that this isn't a viable path to take any time soon. Will it become the way forward in a decade or so?