One thing that Nvidia has going for it is the stickiness of CUDA. Developers don...

singhrac · on May 26, 2023

Sorry, I have to point out: Tensorflow is not comparable to CUDA. Tensorflow is a (arguably) high level library that links against CUDA to run on NVIDIA GPUs, as does PyTorch (the main competitor).

Comparatively few people have “deep” experience with CUDA (basically Tensorflow/Pytorch maintainers, some of whom are NVIDIA employees, and some working in HPC/supercomputing).

CUDA is indeed sticky, but the reason is probably because CUDA is supported on basically every NVIDIA GPU, whereas AMD’s ROCm was until recently limited to CDNA (datacenter) cards, so you couldn’t run it on your local AMD card. Intel is trying the same strategy with oneAPI, but since no one has managed to see a Habana Gaudi card (let alone a Gaudi2), they’re totally out of the running for now.

Separately, CUDA comes with many necessary extensions like cuSparse, cuDNN, etc. Those exist in other frameworks but there’s no comparison readily available, so no one is going to buy an AMD CDNA card.

AMD and Intel need to publish a public accounting of their incompatibilities with PyTorch (no one cares about Tensorflow anymore), even if the benchmarks show that their cards are worse. If you don’t measure in the public no one will believe your vague claims about how much you’re investing into the AI boom. Certainly I would like to buy an Intel Arc A770 with 16GB of VRAM for $350, but I won’t, because no one will tell me that it works with llama.

somethoughts · on May 26, 2023

With respect to the incompatabilities with PyTorch and TensorFlow - given that the AMD and Intel GPU drivers are more likely to be open sourced - do you believe the open source community or a third party vendors will step in to close the gap for AMD/Intel?

It would seem a great startup idea with the intent to get acqui-hired by AMD or Intel to get into the details of these incompatibilities and/performance differences.

At worst it seems you could pivot into some sort of passive income AI benchmarking website/YT channel similar to the ones that exist for Gaming GPU benchmarks.

eslaught · on May 26, 2023

Drivers are only the lowest level of the stack. You could (in principle) have a great driver ecosystem and a nonexistent user-level ecosystem. And indeed, the user-level ecosystem on AMD and Intel seems to be suffering.

For example, I recently went looking into Numba for AMD GPUs. The answer was basically, "it doesn't exist". There was a version, it got deprecated (and removed), and the replacement never took off. AMD doesn't appear to be investing in it (as far as anyone can tell from an outsider's perspective). So now I've got a code that won't work on AMD GPUs, even though in principle the abstractions are perfectly suited to this sort of cross-GPU-vendor portability.

NVIDIA is years ahead not just in CUDA, but in terms of all the other libraries built on top. Unless I'm building directly on the lowest levels of abstraction (CUDA/HIP/Kokkos/etc. and BLAS, basically), chances are the things I want will exist for NVIDIA but not for the others. Without a significant and sustained ecosystem push, that's just not going to change quickly.

pmoriarty · on May 27, 2023

"NVIDIA is years ahead not just in CUDA, but in terms of all the other libraries built on top."

How big an effort would it take to get those libraries to work with AMD drivers?

singhrac · on May 26, 2023

I think this is what George Hotz is doing with tiny corp, but I have to admit I have little hope. Making asynchronous SIMD code fast is very difficult as a base point, let alone without internal view of decisions like “why does this cause a sync” or even “will this unnecessary copy ever get fixed?”. Unfortunately AMD and especially Intel don’t “develop in the open”, so even if the drivers are open sourced, without context it’ll be an uphill battle.

To give some perspective, see @ngimel’s comments and PRs in Github. That’s what AMD and Intel are competing against, along with confidence that optimizing for ML customers will pay off (clearly NVIDIA can justify the investment already).

pca006132 · on May 26, 2023

This kind of software development is hard and expensive. I do not think that this can enable you to make enough income from benchmark website or YT channel, considering most people are not interested in those low level details.

_fjb4 · on May 26, 2023

Theoretically the ARC should work with llama.cpp using OpenCL, but I haven't seen benchmarks or even a confirmation that it works.

oktwtf · on May 26, 2023

This has always been in the back of my mind anytime AMD has some new GPUs with nice features. Gamers will say this will be where AMD will win the war. But I fear the war is already won on the compute that counts, and right now that’s CUDA accel on NVIDIA.

AnthonyMouse · on May 26, 2023

This has been the case for a while because AMD never had the resources to do software well. But their market cap is 10x what it was 5 years ago, so now they do. That still takes time, and having resources isn't a guarantee of competent execution, but it's a lot more likely now than it used to be.

On top of that, Intel is making a serious effort to get into this space and they have a better history of making usable libraries. OpenVINO is already pretty good. It's especially good at having implementations in both Python and not-Python, the latter of which is a huge advantage for open source development because it gets you out of Python dependency hell. There's a reason the thing that caught on is llama.cpp and not llama.py.

dogma1138 · on May 26, 2023

AMDs problem with software goes well beyond people they can’t stick with anything for any significant length of time and the principal design behind ROCm is doomed to fail as it compiles hardware specific binaries and offers no backward or forward compatibility.

CUDA compiles to hardware agnostic intermediary binaries which can run on any hardware as long as the target feature level is compatible and you can target multiple feature levels with a single binary.

CUDA code compiled 10 years ago still runs just fine, ROCm require recompilation every time the framework is updated and every time a new hardware is released.

AnthonyMouse · on May 26, 2023

That's all software. There is nothing but resources between here and a release of ROCm that compiles existing code into a stable intermediate representation, if that's something people care about. (It's not clear if it is for anything with published source code; then it matters a lot more if the new version can compile the old code than if the new hardware can run the old binary, since it's not exactly an ordeal to hit the "compile" button once or even ship something that does that automatically.)

dogma1138 · on May 27, 2023

It’s a must, published source code or not it doesn’t help.

First there is no forward compatibility guarantee for compiling and based on current history it always breaks.

Secondly even if the code is available a design that breaks software on other users machine is stupid and anti user.

Plenty of projects could import libraries and then themselves be upstream dependencies for other projects, many of which may not be supported.

CUDA is king because people can and still do run 15 year old compiles CUDA code on a daily basis and they know that what they produced today is guaranteed to work on all current and future hardware.

With ROCm you have no guarantee that it would work on even the hardware from the same generation and you pretty much have a guarantee that the next update will break your stuff.

This was a problem with all AMD compilers for GPGPU and ROCm should’ve tried to solve it from day 1 but it still adopted a poor design and that has nothing to do with how many people are working on it.

AnthonyMouse · on May 27, 2023

> Secondly even if the code is available a design that breaks software on other users machine is stupid and anti user.

Most things work like this. You can't natively run ARM programs on x86 or POWER or vice versa, but in most languages you can recompile the code. If you have libraries then you recompile the libraries. All it takes is distributing the code instead of just a binary. Not distributing the code is stupid and anti-user.

> This was a problem with all AMD compilers for GPGPU and ROCm should’ve tried to solve it from day 1 but it still adopted a poor design and that has nothing to do with how many people are working on it.

It isn't even a design decision. Compilers will commonly emit machine code that checks for hardware features like AVX and branch to different instructions based on whether the machine it's running on supports that. That feature can be added to a compiler at any time.

The compiler is open source, isn't it? You could add it yourself, absent any resource constraints.

dogma1138 · on May 27, 2023

No most thing’s definitely don’t work like this. I don’t expect my x86 program to stop working after a software update or not to work on new x86 CPUs that’s just ridiculous.

Also if you expect anyone to compile anything you probably haven’t shipped anything in your life.

ROCm is a pile of rubbish until they throw it out and actually have a model that would guarantee forward and backward compatibility it would remain useless for anyone who actually builds software other people use.

AnthonyMouse · on May 29, 2023

> I don’t expect my x86 program to stop working after a software update or not to work on new x86 CPUs that’s just ridiculous.

Your x86 program doesn't work on Apple Silicon without something equivalent to a recompile. Old operating systems very commonly can't run on bare metal new hardware because they don't have drivers for it.

Even the IR isn't actually machine code, it's just a binary format of something that gets compiled into actual machine code right before use.

> Also if you expect anyone to compile anything you probably haven’t shipped anything in your life.

Half the software people run uses JIT compilation of some kind.

paulmd · on May 26, 2023

The only real remaining fronts in the war are consoles and smartphones, and NVIDIA just signed a deal to license GeForce IP to mediatek so that nut is being cracked as well, mediatek gives them mass-market access for CUDA tech, DLSS, and other stuff. Nintendo has essentially a mobile console platform and will be doing DLSS too on an Orin NX 8nm chip soon (very cheap) using that same smartphone-level DLSS (probably re-optimized for lower resolutions). Samsung 8nm is exactly Nintendo's kind of cheap, it'll happen.

The "NVIDIA they might leave graphics and just do AI in the future!" that people sometimes do is just such a batshit take because it's graphics that opens the door to all these platforms, and it's graphics that a lot of these accelerators center around. What good is DLSS without a graphics platform? Do you sign the Mediatek deal without a graphics platform? Do you give up workstation graphics and OptiX and raysampling and all these other raytracing techs they've spent billions developing, or do you just choose to do all the work of making Quadros and all this graphics tech but then not do gaming drivers and give up that gaming revenue and all the market access that comes with it? It's faux-intellectualism and ayymd wish-casting at its finest, it makes zero sense when you consider the leverage they get from this R&D spend across multiple fields.

CUDA is unshakeable precisely because NVIDIA is absolutely relentless in getting their foot in the door, then using that market access to build a better mousetrap with software that everyone else is constantly rushing to catch up to. Every segment has some pain points and NVIDIA figures out what they are and where the tech is going and builds something to address that. AMD's approach of trying to surgically tap high-margin segments before they have a platform worth caring about is fundamentally flawed, they're putting the cart before the horse, and that's why they keep spinning their wheels on GPGPU adoption for the last 15 years. And that's what people are clamoring for NVIDIA to do with this idea of "abandon graphics and just do AI" and it's completely batshit.

Intel gets it, at least. OneAPI is focused on being a viable product and they'll move on from there. ROCm is designed for supercomputers where people get paid to optimize for it - it's an embedded product, not a platform. Like you can't even use the binaries you compile on anything except one specific die (not even a generation, "this is binary is for Navi 21, you need the Navi 23 binary"). CUDA is an ecosystem that people reach for because there's tons of tools and libraries and support, and it works seamlessly and you can deliver an actual product that consumers can use. ROCm is something that your boss tells you you're going to be using because it's cheap, you are paying to engineer it from scratch, you'll be targeting your company's one specific hardware config, and it'll be inside a web service so it'll be invisible to end-users anyway. It's an embedded processor inside some other product, not a product itself. That's what you get from the "surgically tap high-margin segments" strategy.

But the Mediatek deal is big news. When we were discussing the ARM acquisition etc people totally scoffed that NVIDIA would ever license GeForce IP. And when that fell through, they went ahead and did it anyway. Because platform access matters, it's the foot in the door. The ARM deal was never about screwing licensees or selling more tegras, that would instantly destroy the value of their $40b acquisition. It was 100% always about getting GeForce as the base-tier graphics IP for ARM and getting that market access to crack one of the few remaining segments where CUDA acceleration (and other NVIDIA technologies) aren't absolutely dominant.

And graphics is the keystone of all of it. Market access, software, acceleration, all of it falls apart without the graphics. They'd just be ROCm 2.0 and nobody wants that, not even AMD wants to be ROCm. AMD is finally starting to see it and move away from it, it would be wildly myopic for NVIDIA to do that and Jensen is not an idiot.

Not entirely a direct response to you but I've seen that sentiment a ton now that AI/enterprise revenue has passed graphics and it drives me nuts. Your comment about "what would it take to get Radeon ahead of CUDA mindshare" kinda nailed it, CUDA literally is winning so hard that people are fantasizing about "haha but what if NVIDIA got tired of winning and went outside to ride bikes and left AMD to exploit graphics in peace" and it's crazy to think that could ever be a corporate strategy. Why would they do that when Jensen has spent the last 25 years building this graphics empire? Complete wish-casting, “so dominant that people can’t even imagine the tech it would take to break their ubiquity” is exactly where Jensen wants to be, and if anything they are still actively pushing to be more ubiquitous. That's why their P/E are insane (probably overhyped even at that, but damn are they good).

If there is a business to be made doing only AI hardware and not a larger platform (and I don’t think there is, at that point you’re a commodity like dozens of other startups) it certainly looks nothing like the way nvidia is set up. These are all interlocking products and segments and software, you can’t cut any one of them away without gutting some other segment. And fundamentally the surgical revenue approach doesn’t work, AMD has continuously showed that for the last 15 years.

Being unwilling to catch a falling knife by cutting prices to the bone doesn’t mean they don’t want to be in graphics. The consumer GPU market is just unavoidably soft right now, almost irregardless of actual value (see: 4070 for $600 with a $100 GC at microcenter still falling flat). Even $500 for a 4070 is probably flirting with being unsustainably low (they need to fund R&D for the next gen out of these margins) but if a de-facto $500 price doesn’t spark people’s interests/produce an increase in sales they’re absolutely not going any lower than that this early in the cycle. They’ll focus on margin on the sales they can actually make, rather than chasing the guy who is holding out for 4070 to be $329. People don't realize it but obstinently refusing to buy at any price (even a good deal) is paradoxically creating an incentive to just ignore them and chase margins.

It doesn’t mean they don’t want to be in that market but they’re not going to cut their own throat, mis-calibrate consumer expectations, etc.

Just as AMD is finding out with the RX 7600 launch - if you over-cut on one generation, the next generation becomes a much harder sell. Which is the same lesson nvidia learned with the 1080 ti and 20-series. AMD is having their 20-series moment right now, they over-cut on the old stuff and the new stuff is struggling to match the value. And the expectations of future cuts is only going to dampen demand further, they’re Osborne Effect’ing themselves with price cuts everyone knows are coming. Nvidia smartened up - if the market is soft and the demand just isn’t there… make less gaming cards and shift to other markets in the meantime. Doesn’t mean they don’t want to be in graphics.

jejeyyy77 · on May 26, 2023

Tensorflow is optimized for TPU's which isn't really consumer-grade hardware.

caeril · on May 26, 2023

Unrelated question for the HN experts:

My sibling commenter is shadowbanned, but if you look into their comment history, there are occasionally comments that are not dead. How does this happen?

philipkglass · on May 26, 2023

Somebody clicked on the timestamp of that post and used the "vouch" link to unhide it. I sometimes do that for comments from new accounts that been hidden by some overzealous anti-spam heuristic.

RobotToaster · on May 26, 2023

Helpful to know, I've seen a few hidden posts that seem reasonable but didn't know I could do that.

RobotToaster · on May 26, 2023

Isn't the coral stick a TPU?

giobox · on May 26, 2023

Yes, although availability recently has been pretty bad following the chip shortage, and prices skyrocketed to ~300 dollars. Not sure if situation returning to normal yet. Similar woes to the Raspberry Pi etc.

I needed two for a project and ended up paying a lot more than I wanted for used ones.

For those not familiar, consumer/hobbyist grade TPUs:

https://coral.ai/products/

joseph_grobbles · on May 26, 2023

Google's first TPU was developed a year after Tensorflow. And for that matter, Tensorflow works fine with CUDA, was originally entirely built for CUDA, and it's super weird the way it's being referenced in here.

Tensorflow lost out to Pytorch because the former is grossly complex for the same tasks, with a mountain of dependencies, as is the norm for Google projects. Using it was such a ridiculous pain compared to Pytorch.

And anyone can use a mythical TPU right now on the Google Cloud. It isn't magical, and is kind of junky compared to an H100, for instance. I mean...Google's recent AI supercomputer offerings are built around nvidia hardware.

CUDA keeps winning because everyone else has done a horrendous job competing. AMD, for instance, had the rather horrible ROCm, and then they decided that they would gate their APIs to only their "business" offerings while nvidia was happy letting it work on almost anything.

HellDunkel · on May 26, 2023

Best explanation so far. I am surprised OpenCL never gained much traction. Any idea why?

blihp · on May 26, 2023

The same reason most of AMD's 'open' initiatives don't gain traction: they throw it out there and hope things will magically work out and that a/the community will embrace it as the standard. It takes more work than that. What AMD historically hasn't done is the real grunge work of addressing the limitations of their products/APIs and continuing to invest in them long term. See how the OpenCL (written by AMD) Cycles renderer for Blender worked out, for example.

Something AMD doesn't seem to understand/accept is that since they are consistently lagging nVidia on both the hardware and software front, nVidia can get away with some things AMD can't. Everyone hates nVidia for it, but unless/until AMD wises up they're going to keep losing.

onetimeusename · on May 30, 2023

what did you do to get all your posts automatically dead?

seydor · on May 26, 2023

frameworks can be agnostic to the underlying library. What are formidable alternatives to cuda ?