As someone who uses supercomputers, I'm not sure I entirely understand the market for this product. It's really cool and I'd love to have one to tinker with, but given its high degree of parallelism, I see no benefit to using this over a graphics card. I'm not sure whether $99 can get you a GPU that reaches 90 GFLOPS, though... perhaps that's where the benefit lies.
EDIT: After reviewing their website, I notice they state
> One important goal of Parallella is to teach parallel programming...
In this respect, I can see how this is useful. Adapting scientific software to GPUs can be difficult and isn't the easiest thing to get into for your average person. This board, with its open-source toolkit and community could make this process a lot easier.
On a typical GPU you can't do individual branching for every single thread of computation. With their chip they're trying to create a "third" category between GPUs (highly parallel, but with few independent instruction streams) and normal CPUs.
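To make the distinction concrete, here's a minimal sketch (plain C with pthreads, not Epiphany code) of what per-thread branching looks like: every worker follows its own control flow at full speed, whereas on a GPU, threads in the same warp that diverge like this get serialized.

```c
/* Conceptual sketch only: ordinary pthreads standing in for MIMD cores.
 * Each worker runs a completely different instruction stream depending on
 * its id; a SIMD/GPU warp would have to execute both paths for a divergent
 * group of threads, while independent cores just take their own branch. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg;
    if (id % 2 == 0) {
        long acc = 0;                         /* even workers: accumulate a sum */
        for (long i = 0; i < 1000000; i++) acc += i * id;
        printf("worker %ld: sum path, acc=%ld\n", id, acc);
    } else {
        long acc = 1;                         /* odd workers: entirely different work */
        for (long i = 0; i < 24; i++) acc *= 2;
        printf("worker %ld: product path, acc=%ld\n", id, acc);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}
```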
And the $99 is not for a mass-produced card or chip, but for an initial small-production-run computer that includes one of their Epiphany chips.
You pay for a dual-core ARM computer based on the Zynq SoC (which means you get an FPGA in the deal) with an Epiphany chip. Getting a Zynq dev-board for that price in itself makes it worth it for a lot of people.
I think this may just be a novel way to sell a dev board for their custom silicon and get some of that heavy Kickstarter press coverage.
If you figure that what they're really trying to do is get people familiar with it and see how well it might augment one of their existing ARM products, it starts to make a lot of sense.
For instance, I have a low-end 4-bay ARM-based NAS. Its insanely modest specs (1.6 GHz single core + 512 MB RAM) are actually quite sufficient for most NAS tasks. But it's really more like a home server platform, as they have all sorts of add-ons that include things like CCTV archiving, DVR, IP PBX - you get the picture. But if you really start treating it like a general-purpose server, you quickly realize that some common workloads perform horribly on that ARM core, and it's frustrating.
It can easily push 800 Mbps or so with NFS, SMB, or CIFS, but if you want rsync+ssh you're looking at less than a tenth of that because of the crypto and checksum work in that chain. Native rsync with no ssh and no compression does somewhat better, but still poorly, due to its heavy use of cryptographic hash functions for delta transfers.
There are plenty of other examples: filesystem compression, repairing multi-part files with par2 (kind of like RAID for file sets), face detection, file-integrity hashing. And if it could do on-the-fly video transcoding (don't even think about it), it could happily replace another full system I have running a Plex server.
There are probably a lot of devices where the designers default to ARM but have to skip features that are FP-heavy. If somebody at the firm has played around with a chip you can just drop in without changing your SoC or toolchain, that starts sounding pretty good, I'd guess - and likely still far cheaper than an Atom SoC.
It's an interesting (read: niche) market, to be sure. It's the same market as someone who might buy 8 Raspberry Pi boards or 4 ODroid U2 machines for the purpose of learning about parallel computation.
The Epiphany chip (the coprocessor on these boards) is supported as of GCC 4.8, so we may also see some novel ways to offload work to this chip in the future.
Each TigerSHARC-like DSP core has 32 KB of embedded memory and memory-mapped access to the other cores' memory space. The system has Gbit Ethernet, runs Linux, and draws only 2-3 W.
You can get close to 1 TFLOPS in the GPU space for $99 these days, but at much lower power efficiency. The Adapteva hardware looks to be around 18 GFLOPS/W, whereas a $99 GPU is about half that.
Comparing it with a GPU is a natural discussion to have. I believe there is more information on their website to answer the question. But maybe someone else can explain the programming paradigm difference.
http://www.adapteva.com/introduction/
Each of the 64 cores can independently run arbitrary C/C++ code, which is much more flexible than a GPU. Each core has 32KB of local memory, which can also be accessed by the other cores, and there's 1GB of external memory too.
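To give a rough feel for that model, here's a hypothetical sketch; the function names and offsets are placeholders of mine, not the real eSDK API or address map, so treat it as an illustration of "plain loads and stores into another core's SRAM" rather than working Epiphany code.

```c
/* Hypothetical illustration of a flat-address, core-to-core memory model.
 * Assumption (not taken from the SDK docs): each core's 32KB SRAM is visible
 * at some globally known base address, so neighbours can read/write it with
 * ordinary pointer dereferences. */
#include <stdint.h>

static inline volatile uint32_t *remote_word(uintptr_t core_base, uint32_t offset) {
    return (volatile uint32_t *)(core_base + offset);
}

/* A core publishes a result in its own SRAM, then raises a flag. */
void publish_result(uintptr_t my_base, uint32_t value) {
    *remote_word(my_base, 0x0000) = value;   /* data word    */
    *remote_word(my_base, 0x0004) = 1;       /* "ready" flag */
}

/* A neighbouring core polls the flag and pulls the value directly. */
uint32_t wait_for_neighbour(uintptr_t neighbour_base) {
    while (*remote_word(neighbour_base, 0x0004) == 0)
        ;                                    /* spin until the flag is set */
    return *remote_word(neighbour_base, 0x0000);
}
```

The point is just that communication between cores can be done with plain stores into a neighbour's local memory, which is a very different paradigm from launching GPU kernels over global memory.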
I wonder how much work it would take to port cpuminer to this platform for mining Bitcoin and Litecoin. Probably not worth it for Bitcoin with the new FPGA hardware, but it could be interesting for Litecoin, depending on how fast the cores can reach main memory.
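For reference, a quick back-of-the-envelope on the memory question, assuming Litecoin's standard scrypt parameters (N=1024, r=1, p=1):

```c
/* scrypt needs a 128*r*N-byte scratchpad per hash. With Litecoin's usual
 * parameters that's 128KB, which doesn't fit in a single core's 32KB of
 * local SRAM, so each hash would be leaning on external (or neighbouring
 * cores') memory -- hence the question about memory access speed. */
#include <stdio.h>

int main(void) {
    const unsigned N = 1024, r = 1;           /* Litecoin's scrypt parameters */
    const unsigned scratchpad = 128u * r * N; /* bytes required per hash      */
    const unsigned local_sram = 32u * 1024;   /* per-core memory on Epiphany  */
    printf("scrypt scratchpad: %u KB, local SRAM: %u KB\n",
           scratchpad / 1024, local_sram / 1024);
    return 0;
}
```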
You mean the new ASIC hardware? FPGAs are ancient in the crazy world of Bitcoin, although a Litecoin FPGA will emerge in the next few months considering Litecoin's value.
Interesting architecture. I like how well-documented everything is. Usually, either the low-level ISA for accelerator chips is not documented at all (like with GPUs), or detailed documentation is only available under NDA, and only proprietary development tools are available (like with FPGAs).
Exactly. That's the real point. Are you going to be able to run these together and get on the TOP500? Lord no! But can you learn how to do parallel programming on one? Sure, and that's no small matter.
First of all, which NP-hard pathfinding problem are you talking about? When I hear "pathfinding" I think "shortest path", which is (deterministic) polynomial time (the exact class depends on which variant of the problem, but even Floyd-Warshall is Θ(|V|³)).
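For concreteness, plain Floyd-Warshall, cubic in |V|:

```c
/* Floyd-Warshall all-pairs shortest paths: three nested loops over the
 * vertices, Θ(|V|^3) time -- firmly polynomial, nothing NP-hard about it. */
#include <stdio.h>

#define V   4
#define INF 1000000   /* "no edge" sentinel, large enough to never be chosen */

int main(void) {
    int d[V][V] = {
        {0,   5,   INF, 10 },
        {INF, 0,   3,   INF},
        {INF, INF, 0,   1  },
        {INF, INF, INF, 0  },
    };
    for (int k = 0; k < V; k++)
        for (int i = 0; i < V; i++)
            for (int j = 0; j < V; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
    for (int i = 0; i < V; i++) {
        for (int j = 0; j < V; j++)
            printf("%8d", d[i][j] >= INF ? -1 : d[i][j]);  /* -1 = unreachable */
        printf("\n");
    }
    return 0;
}
```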
---
Anyway, no. If you have an NP-hard problem, and you want an exact answer (i.e. you are denying yourself approximate solutions), and you want to solve it for large inputs, unless you have either proven that P=NP by construction (heh), or you have a non-deterministic computing machine (heh), you're basically screwed. Going parallel isn't going to help, any more than a hypothetical billion-GHz serial CPU is going to help. Asking this question suggests a fundamental lack of understanding about what is interesting (or rather, infuriating) about NP-hard problems.
Parallel processing models give you, at best, linear speedup. If your problem is O(too-big), and your input is large, linear speedup doesn't help, no matter how much linear speedup you have.
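Rough numbers to make that concrete (illustrative figures, not benchmarks of anything):

```c
/* Exhaustively searching the 2^100 assignments of 100 boolean variables,
 * at an optimistic 10^9 checks per second per core, across a million cores:
 * linear speedup barely dents an exponential search space. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double candidates = pow(2.0, 100);   /* brute-force search space              */
    double per_core   = 1e9;             /* checks per second per core (generous) */
    double cores      = 1e6;             /* a very large parallel machine         */
    double seconds    = candidates / (per_core * cores);
    printf("%.2e seconds, about %.2e years\n",
           seconds, seconds / (365.0 * 24 * 3600));
    return 0;
}
```

That works out to on the order of 10^7 years, and throwing in another factor of a thousand machines only brings it down to the order of 10^4 years.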
Say you're running a problem of size N on a processor with a cache of size N/2. Adding another processor with another N/2 cache means your whole problem now fits in cache, so you'll probably suffer fewer cache misses, and thus could end up running more than 2x faster.
Hmn. I don't find those examples compelling, but maybe I'm just missing something.
I don't have a thought about the Wikipedia backtracking example.
Regarding cache-vs-RAM, or regarding RAM-vs-disk, I see no reason you cannot take the work that the parallel units do, and serialize it onto a single processor. Let me consider the example you gave.
Initially, you have two processors with caches of size N/2, and a problem of size N on disk. You effortlessly split the problem into two parts at zero cost (ha!), load the first part on procA, the second part on procB. You pay one load-time. Now you process it, paying one processing-time, then store the result to disk, paying one store-time. Now you effortlessly combine it at zero cost (again, ha!), and you're done.
In my serial case, I do the same effortless split, I load (paying one load), compute (paying one compute), then store (paying one store). Then I do that again (doubling my costs). Then I do the same effortless combine. My system is 1/2 as fast as yours.
In short, I think "superlinear speedup" due to memory hierarchy is proof-by-construction that the initial algorithm could have been written faster. What am I missing?
OK, bear with me as we get slightly contrived here...(and deviate a little from my earlier, somewhat less thought out example).
Say your workload has some inherent requirement for random access (throughout its entire execution) to a dataset of size N. If you run it on a single processor with a cache of size N/2, you'll see a lot of cache misses that end up getting serviced from the next level in the storage hierarchy, slowing down execution a lot. If you add another processor with another N/2 units of cache, they'll both still see about the same cache miss rate, but the misses no longer necessarily have to be satisfied from the next level down: at least some of the time they can be serviced from the other processor's cache, which is likely to be significantly faster (whether you're talking about CPU caches relative to DRAM in an SMP system, or memory relative to disk between two separate compute nodes over a network link).
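If you want to poke at the underlying effect on a single box, here's a rough pointer-chasing sketch. It only shows the working-set-vs-cache-size cliff (the sizes below are arbitrary, adjust them around your own last-level cache), not the remote-cache servicing itself, which needs two real nodes or sockets.

```c
/* Dependent random loads over a working set that either fits in cache or
 * spills to DRAM. The jump in time per step once the array no longer fits
 * is the miss penalty the two-processor argument above is exploiting. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(uint32_t);
    uint32_t *a = malloc(n * sizeof(uint32_t));
    for (size_t i = 0; i < n; i++) a[i] = (uint32_t)(rand() % n);  /* random "next" index */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint32_t idx = 0;
    for (size_t i = 0; i < steps; i++) idx = a[idx];               /* dependent random loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(a);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs + (idx == UINT32_MAX ? 1 : 0);                     /* keep idx observable */
}

int main(void) {
    size_t steps = 50u * 1000 * 1000;
    printf("fits in cache  (256KB): %.3f s\n", chase(256u * 1024, steps));
    printf("spills to DRAM (128MB): %.3f s\n", chase(128u * 1024 * 1024, steps));
    return 0;
}
```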
Hmn. I think the reason there's superlinear speedup in the paper you linked is because the requests must be serviced in order. If you only care about throughput, and you can service the requests out-of-order, then you can use LARD in a serial process too, to improve cache locality, and achieve speed 1/Nth that of N-machine LARD. But to serve requests online, you can't do that reordering, so with one cache you'd be constantly invalidating it, thus the increased aggregate cache across the various machines results in superlinear speedup.
So, mission accomplished! I now believe that superlinear speedup is a real thing, and know of one example!