As someone who uses supercomputers, I'm not sure I entirely understand the marke...

vidarh · on April 16, 2013

You can't do individual branching for every single strand of computation for a typical GPU. With their chips they are trying to create a "third" category between highly parallel, but few individual instruction threads, and normal CPU's.

And the $99 is not for a mass produced card or chip, but for an initial small production run computer that includes one of the parallela chip.

You pay for a dual-core ARM computer based on the Zynq SoC (which means you get an FPGA in the deal) with an Epiphany chip. Getting a Zynq dev-board for that price in itself makes it worth it for a lot of people.

trotsky · on April 16, 2013

I think this may just be a novel way to sell a dev board for their custom silicon and get some of that heavy kickstarter press coverage.

If you figure that what they're really trying to do is get people familiar with it and see how well it might augment one of their existing ARM products it starts to make a lot of sense.

For instance I have a low end 4 bay arm based nas. It's insanely modest specs (1.6ghz single core + 512MB ram) actually are quite sufficient for most nas tasks. But it's really more like a home server platform as they have all sorts of addons that include things like CCTV archiving, DVR, ip pbx - you get the picture. But if you really start treating it like a general purpose server you quickly realize that some common workloads perform horribly on that arm core and it's frustrating.

It can easily push 800mbps or so with nfs smb or cifs, but if you want to rsync+ssh you're looking at less than a 10th of that because of the various fp needs of that chain. Native rsync with no ssh/no compression does somewhat better but still poorly due to its heavy use of cryptographic hash functions for delta transfers.

There are plenty of other examples - file system compression, repairing multipart files with par2 (kind of like raid for file sets). Face detection, file integrity hashing) And if it could do on the fly video transcoding (don't even think about it) it could happily replace another full system i have running plex server.

There's probably a lot of devices that the designers default to arm but have to skip features that are heavy fp. If somebody in the firm has played around with a chip you can just drop in and not change your soc or toolchain that starts sounding pretty good i'd guess - and likely still far cheaper than an atom soc.

sliverstorm · on April 16, 2013

Interesting observations. I know AMD is making an ARM chip (http://www.anandtech.com/show/6418/amd-will-build-64bit-arm-...). Have they said anything about FP?

wuest · on April 16, 2013

It's an interesting (read: niche) market, to be sure. It's the same market as someone who might buy 8 raspberry pi boards, or 4 ODroid U2 machines for the purpose of learning about parallel computation.

The Epipchany chip (the coprocessor on these boards) is supported as of GCC 4.8, so we also may see some novel ways to offload work to this chip in the future.

lgeek · on April 16, 2013

You can buy an AMD 7750 for $99, which theoretically gives you in the range of 800-900 GFLOPS (single precision).

eleitl · on April 17, 2013

Each TigerSHARC-like DSP core has 32 kByte embedded memory, and memory-like access to other core's memory space. The system has GBit Ethernet, runs Linux and takes only 2-3 W power.

This is ideal cluster material.

rys · on April 16, 2013

You can get close to 1TF in the GPU space for $99 these days, but at much lower power efficiency. The Adapteva hardware looks to be around 18GF/W, whereas a GPU is about half that at $99.

mrb · on April 16, 2013

The AMD Radeon HD 7790 ($150) does 21 GFLOPS/Watt. So I would say Adapteva is definitely comparable to GPUs in this approximate price range.

marshray · on April 17, 2013

And that's just at the Kickstarter price. GPUs have had decades of competition to squeeze the extra costs out and build up economies of scale.

amalag · on April 16, 2013

Comparing it with a GPU is a natural discussion to have. I believe there is more information on their website to answer the question. But maybe someone else can explain the programming paradigm difference. http://www.adapteva.com/introduction/

mkl · on April 16, 2013

Each of the 64 cores can independently run arbitrary C/C++ code, which is much more flexible than a GPU. Each core has 32KB of local memory, which can also be accessed by the other cores, and there's 1GB of external memory too.

Specs: http://www.adapteva.com/products/silicon-devices/e64g401/

Architecture: http://www.adapteva.com/wp-content/uploads/2012/10/epiphany_...

SDK docs: http://www.adapteva.com/wp-content/uploads/2013/04/epiphany_...

jareds · on April 16, 2013

I wonder how much work it would take to port cpuminer to this platform for use in mining bitcoins and litecoins. Probably not worth it for bitcoins with the new FPGA hardware but could be interesting for litecoin depending on the memory access speed to main memory.

josephagoss · on April 16, 2013

You mean the new ASIC hardware? The FPGA are ancient in the crazy world of Bitcoin, although a Litecoin FPGA will emerge in the next few month considering Litecoins value.

jareds · on April 16, 2013

Looks like I was about a year behind.

wladimir · on April 16, 2013

Interesting architecture. I like how well-documented everything is. Usually, either the low-level ISA for accelerator chips is not documented at all (like with GPUs), or detailed documentation is only available under NDA, and only proprietary development tools are available (like with FPGAs).

The topology reminds me of this paper "The Landscape of Parallel Computing Research: A View from Berkeley" http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-18...

lgeek · on April 16, 2013

> the low-level ISA for accelerator chips is not documented at all (like with GPUs)

Let me show you the AMD Southern Islands ISA specs: http://developer.amd.com/wordpress/media/2012/10/AMD_Souther...

wagtail · on April 16, 2013

Exactly. That's the real point. Are you going to be able to run these together and get on the TOP500. Lord no! But, can you learn how to parallel program on one? Sure, and that's no small matter.

ludston · on April 16, 2013

Could it be good for path finding and other NP Hard problems? It could do good things for the AI world if that's the case.

jholman · on April 16, 2013

First of all, which is the NP-hard pathfinding problem you're talking about? When I hear "pathfinding" I think "shortest path", which is (deterministic) polynomial time (exact class dependant on exactly which variant of the problem, but even Floyd-Warshall is Θ(|V|³)).

---

Anyway, no. If you have an NP-hard problem, and you want an exact answer (i.e. you are denying yourself approximate solutions), and you want to solve it for large inputs, unless you have either proven that P=NP by construction (heh), or you have a non-deterministic computing machine (heh), you're basically screwed. Going parallel isn't going to help, any more than a hypothetical billion-GHz serial CPU is going to help. Asking this question suggests a fundamental lack of understanding about what is interesting (or rather, infuriating) about NP-hard problems.

Parallel processing models give you, at best, linear speedup. If your problem is O(too-big), and your input is large, linear speedup doesn't help, no matter how much linear speedup you have.

1amzave · on April 16, 2013

> Parallel processing models give you, at best, linear speedup.

Most commonly, yes, though in practice superlinear speedups can occur in some situations. (Not that this negates your overall point, just a nitpick.)

jholman · on April 17, 2013

Can you give one or more examples?

1amzave · on April 17, 2013

Say you're running a problem of size N on a processor with a cache of size N/2. Adding another processor with another N/2 cache means your whole problem now fits in cache, so you'll probably suffer fewer cache misses, and thus could end up running more than 2x faster.

Wikipedia also points out possibilities at an algorithmic level: https://en.wikipedia.org/wiki/Speedup#Super_linear_speedup

jholman · on April 18, 2013

Hmn. I don't find those examples compelling, but maybe I'm just missing something.

I don't have a thought about the wikipedia backtracking example.

Regarding cache-vs-RAM, or regarding RAM-vs-disk, I see no reason you cannot take the work that the parallel units do, and serialize it onto a single processor. Let me consider the example you gave.

Initially, you have two processors with caches of size N/2, and a problem of size N on disk. You effortlessly split the problem into two parts at zero cost (ha!), load the first part on procA, the second part on procB. You pay one load-time. Now you process it, paying one processing-time, then store the result to disk, paying one store-time. Now you effortlessly combine it at zero cost (again, ha!), and you're done.

In my serial case, I do the same effortless split, I load (paying one load), compute (paying one compute), then store (paying one store). Then I do that again (doubling my costs). Then I do the same effortless combine. My system is 1/2 as fast as yours.

In short, I think "superlinear speedup" due to memory hierarchy is proof-by-construction that the initial algorithm could have been written faster. What am I missing?

1amzave · on April 18, 2013

OK, bear with me as we get slightly contrived here...(and deviate a little from my earlier, somewhat less thought out example).

Say your workload has some inherent requirement for random (throughout its entire execution) access to a dataset of size N. If you run it on a single processor with a cache of size N/2, you'll see a lot of cache misses that end up getting serviced from the next level in the storage hierarchy, slowing down execution a lot. If you add another processor with another N/2 units of cache, they'll both still see about the same cache miss rate, but cache misses then don't necessarily have to be satisfied from the next level down -- they can instead (at least some of the time) be serviced from the other processor's cache, which is likely to be significantly faster (whether you're talking about CPU caches relative to DRAM in an SMP system or memory relative to disk between two separate compute nodes over a network link).

For a more concrete example somewhat related (though not entirely congruent) to the above scenario, see http://www.ece.cmu.edu/~ece845/docs/lardpaper.pdf (page 211).

jholman · on April 18, 2013

Hmn. I think the reason there's superlinear speedup in the paper you linked is because the requests must be serviced in order. If you only care about throughput, and you can service the requests out-of-order, then you can use LARD in a serial process too, to improve cache locality, and achieve speed 1/Nth that of N-machine LARD. But to serve requests online, you can't do that reordering, so with one cache you'd be constantly invalidating it, thus the increased aggregate cache across the various machines results in superlinear speedup.

So, mission accomplished! I now believe that superlinear speedup is a real thing, and know of one example!

eleitl · on April 17, 2013

It runs MPI and has a GBit Ethernet interface. They've already bult an 8-node demo cluster.

The Zynq 7020 arguably has enough spare capacity and I/O ports to implement a 3d torus with ~GByte/s throughput for each link.