I was not aware that the PyMC developers have forked and continued Theano: https...

blt · on Dec 16, 2020

Re. the last point, was trying to think of computations where 1) an efficient in-place version is possible, and 2) the most efficient out-of-place version is significantly faster than copying the input and executing the in-place version.

In 1D convolutions, the in-place version would need to use O(filter size) scratch space for lookahead, but this doesn't seem like it would be too significant. However, it might start to become significant in higher-dimensional convolutions.

Any particular example that occurs to you?

albertzeyer · on Dec 16, 2020

In Big-O notation, there will not be any difference, because copying the data will just be O(N), and whatever you do in the op will be at least O(N), so no change.

But in absolute terms, it could make a difference. Think of y = x + 1 vs y = x; y += 1. I would expect that the former is slightly faster. But actually I'm not really sure.

Actually, I implemented most of my native ops exactly in this way, i.e. I implemented the inplace version, and the non-inplace version would just additionally copy it and then call the inplace version.

btwillard · on Dec 16, 2020

Hello, I'm the person spearheading this Theano fork! Your comments match my experience with the old Theano very well, so I have to respond.

> Apparently, the main new feature for Theano will be the JAX backend.

The JAX transpilation feature arose as a quick example of how flexible Theano can be, both in terms of its "hackability" and its simple yet effective foundation (i.e. "static" graphs). It's definitely not the main focus of the fork, but it is easily the newest feature that stands out at the user-level.

The points you raised about the old Theano are actually the main focus, and we've already made large internal changes that address a few of them directly. At the very least, nearly all of them are on the roadmap toward our new library named "Aesara".

The `Scan` `Op` and its optimizations are definitely going to change, and I have no intention of sacrificing improvements for backward compatibility, or anything else that would constrain the extent of improvements. I too have dealt with the difficulties involved in writing Scan optimizations (e.g. https://github.com/pymc-devs/symbolic-pymc/blob/master/symbo...) and am painfully aware of how unnecessary most of them are.

> - The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. ...

The most important graph optimization performance problems are not actually related to Python performance; they're demonstrably design and implementation induced. That is unless you're talking exclusively about graphs so large they reach the "natural" limits of Python performance by definition. Even then, a nearly one-to-one C translation isn't likely to solve those scaling problems.

For example, the graph optimization/rewriting framework would require entire graphs to be copied at multiple points in the process, and this was almost completely due to some design oddities. We've already made all of the large-scale changes needed in order to remedy this design constraint, so we're well on our way to fixing that. See https://github.com/pymc-devs/Theano-PyMC/pull/158

The rewriting process also doesn't track or use node information very well (or at all), so the whole optimization process itself can take an unnecessary number of passes through a graph. For instance, its "local" optimizations have a "tracking" option that specifies the `Op` types to which they apply; however, that feature isn't even used unless the local optimizations are applied by a `LocalOptGroup`. I've noticed at least a few instances in which these local optimizations are applied to inapplicable `Op`s on each visit to a node. Worse yet, within `LocalOptGroup` those local optimizations aren't applied directly to the relevant `Op`s, even though the requisite `Op` type-to-node information is readily available. In other words, optimizations could be directly applied to the relevant nodes in these cases and dramatically reduce the amount of blind graph traversals performed.

At best, a reimplementation in a language with a better compiler, like C, would largely amount to a questionable brute-force attempt at performance, and the ease of manipulating graphs and developing graph rewrites would suffer. With Aesara, we're going for the opposite. We want a smarter framework and _more_ focus on domain-specific optimizations (e.g. linear/tensor algebra, statistics, computer science) from the domain experts themselves, so code transparency and ease of development really matters. When we need raw performance in specific areas of the code, we'll pinpoint those areas and write C extensions, in standard Python fashion.

> ... When switching to TensorFlow, building the graph felt almost instant in comparison. ...

Last I checked, TensorFlow had almost no default graph optimizations, aside from some basic CSE and minor canonicalization and algebraic simplifications in the `grappler` module, so it absolutely should be instantaneous. More importantly, TensorFlow isn't designed for graph rewriting, and definitely not at the Python level where rapid prototyping and testing is possible outside of Google.

Otherwise, if you're talking about initially _building_ a graph and not calling `theano.function`, there are no optimizations involved. Latency in that case would be something entirely different and well worth reproducing for an issue. If what you were observing was the effect of calling `theano.function`, the latency was most likely due to the C transpilation and subsequent compilation. That's a feature that necessarily takes time, but produces code that's often faster than TensorFlow even today.

In summary, the changes we're most focused on right now are for developers like yourself who have had to deal with the core of Theano, so, please, stop by the fork and help us make a better `Scan`!

albertzeyer · on Dec 16, 2020

Hey! Thanks for the answer!

By graph building, I actually meant graph compilation. In TF the first `session.run`, or in Theano the `theano.function`.

I did not get too much into the internals of the graph compilation + optimization (despite writing a couple of simple own optimization passes), so I don't really know whether sth is done really inefficient, but I can easily believe that. I agree, if sth is inefficient there, it should be rewritten in a more efficient way. But I also think that even if you have it as efficient as it can be, it still would be slow, compared to a C/C++/Rust implementation, easily by a factor of 100 or so. And even in C/C++ it can still be slow, when I consider how much time LLVM or GCC takes in their optimization passes.

Yes, TensorFlow does not have much optimization, although I think the idea was always to extend that. But then, as you say, this also is one of the reasons the graph compilation is so fast. But comparing the runtime performance of Theano vs TF, in most cases, TF was just as fast or faster (which is likely dependent on the specific model; but as far as I remember, that was the general observation by the community). So because of that, I was questioning whether all that heavy graph optimization is really worth it. Numerical stability is another topic, of course. But you can also have some simple logic for that, e.g. implement your own `safe_log`, which checks if the input is `softmax(x)`, and then directly returns `log_softmax(x)`. See e.g. here: https://github.com/rwth-i6/returnn/blob/6cd6b7b3b3d3beb33140...

Btw, graph rewriting in TF is certainly also possible, and not so complicated. But it's not really optimized for that. You cannot rewrite parts of the graph inplace. You would need to create a new copy. (Although, technically, I think it would not be too complicated to allow for more graph rewriting, also inplace. But it was/is just not a high priority.)

About `Scan`: I think the main problem is the API itself. I think it is easier if the underlying op would be `WhileLoop` or so, very similar to `tf.while_loop`. Then everything becomes very natural. However, then you would need some good way to accumulate your outputs, if you actually want to have the logic of `scan`. Sth like `ys = concat(ys, [y])` inside the loop. And then it probably is necessary to have specific optimizations on that to make that efficient. Or introduce sth like `TensorArray`. But in both cases, I think this is easier than working with `Scan` as the underlying op for loops.

Btw, in the blog post, it is written that TF is focusing on dynamic graphs now. While this indeed was an important focus when TF2 was introduced, I'm not sure whether they might take a step back again. Of course this is just speculation. But I think even internally, they are seeing the problems with dynamic graphs, and many groups still use the non-eager mode with static graphs and don't have any intention to switch away from that.