Hacker News
Prototype PHP interpreter using the PyPy toolchain - Hippy VM (morepypy.blogspot.com)
157 points by kingkilr on July 13, 2012 | 68 comments


There is also another project named HappyJIT doing much of what is described there. The corresponding paper was presented at last year's Dynamic Languages Symposium [http://dl.acm.org/citation.cfm?id=2047854]

It would be nice if there were comparable benchmark results available, or a discussion of what is different between both approaches/implementations.

EDIT: bb link didn't work, replaced it with ACM portal link.


Indeed, HappyJIT is not included in the comparison here. I've read the paper (which I of course cannot access via ACM) and their benchmark results were not that good.

They were using an old version of PyPy and did not use some of the advanced features of the JIT generator.


As one of the authors of Happy, I found your post very interesting. We wrote the paper over a year ago and used a version of PyPy from back then. It's possible that we would get a bigger speedup with a newer PyPy.

It's interesting that you mention the advanced features. I looked at Hippy, and the most interesting JIT feature it uses is _virtualizable2_, which virtualizes all function locals and unboxes them. We tried using it ourselves, but it forces each function to have a static list of variables and no dynamic variable accesses (like $$x). It looks like Hippy falls back to the regular implementation for dynamic variable accesses, where the entire list is stored in a dictionary. Now I'm wondering how often this happens in real-world code; we assumed it happens often enough to matter.

Also, I'm working on posting a publicly-available version of the paper. I'll post a link when I do that.


For dynamic variable access you do what PyPy (the Python interpreter) does for globals. You do fall back to a dict, but you keep a dict of cells, that is, indexes into the list, rather than a dict of the variables themselves. That way you have an extra indirection, but all static accesses stay efficient even if there is a dynamic access somewhere (which does need a lookup, but too bad). I fear this is not the best place to discuss this further, though; you have my mail.
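The "dict of cells" idea described above can be sketched roughly as follows. This is only an illustration in plain Python, not code from Hippy or PyPy; the `Frame` class and its method names are hypothetical. Static accesses use a pre-resolved slot index, while dynamic accesses (such as `$$x`) pay for one extra dict lookup to find the same slot.

```python
class Frame:
    """Locals stored in a flat list; names map to slot indexes (hypothetical)."""

    def __init__(self, static_names):
        # Slot indexes for the names known at compile time.
        self.index_of = {name: i for i, name in enumerate(static_names)}
        self.slots = [None] * len(static_names)

    # Static access: the compiler already resolved the name to an index.
    def load_fast(self, index):
        return self.slots[index]

    def store_fast(self, index, value):
        self.slots[index] = value

    # Dynamic access (e.g. $$x): one extra dict lookup to find the slot.
    def load_dynamic(self, name):
        return self.slots[self.index_of[name]]

    def store_dynamic(self, name, value):
        index = self.index_of.get(name)
        if index is None:
            # Name first seen at run time: grow the list and record its slot.
            index = len(self.slots)
            self.index_of[name] = index
            self.slots.append(None)
        self.slots[index] = value


frame = Frame(["x", "y"])
frame.store_fast(0, 10)        # $x = 10, compiled to slot 0
frame.store_dynamic("x", 11)   # $$name = 11 where $name == "x"
frame.store_dynamic("z", 5)    # variable created dynamically
```

Because both paths end up at the same list slot, a dynamic write is visible to later static reads, and the static path never touches the dict.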


Right, we can continue this by e-mail.


Please note that some of us find the technical aspects of this conversation fascinating. It's nice to jump into the mind of another developer and see how they approach problems.


Please don't, we are interested! :)


I found this link, I assume you found the time to upload it?

http://www.ics.uci.edu/~ahomescu/happyjit_paper.pdf


I love this. It's always nice to see PHP being implemented on new platforms (we already have it to some degree running on C++, on JVM, on .NET, now on PyPy). Sadly most of those projects don't really make it to a production-ready state :/

In any case, keep up the work!


Using the PyPy toolchain is very different from running on the JVM or .NET. It does not reuse any part of the Python VM, but uses RPython as the implementation language. You don't share bytecode and you don't share the object model. You share the garbage collector, and you generate the JIT for your language (rather than reusing an existing one).


This is one of the most fascinating things about PyPy, and one of the least understood (most likely because the primary output of the PyPy interpreter generator happens to be a Python interpreter written in RPython, a static language much like Python).

I've been toying with the idea of building a little language, just for fun. The mess of building an interpreter has always kept me from it, but as I've spent some time looking at PyPy, I think I might just have a go.


I have more faith in this project, seeing that it was sponsored by Facebook. I was actually saying the other day to some programmer friends that with all the engineering effort Facebook puts into getting PHP to scale for their own system they should just take over the project, rewrite the VM and deal with future features. If anything, I think a PHP fork from Facebook has a better chance of surviving than all the other forks we've seen.


Don't forget Pipp (PHP for the Parrot/Rakudo VM)


Given that in the conventional PHP interpreter everything is executed from scratch on each request, I wonder how the JIT behaves here, especially since most PHP programs live and die in less time than it seems to take the PyPy JIT to warm up. If/when a version with the web side of the stack is implemented, will JITed code survive across requests?


In short - yes. There is nothing stopping the JITed code from surviving across requests, as long as you keep the process alive.


php over fastcgi is a good example of how to reuse the same php process. I use http://php-fpm.org/ and nginx and it's great.


The issue is not dissimilar to using existing opcode caching.


As mentioned, there is the opcode cache. I'm also thinking about FastCGI setups, where PHP is running the whole time, just waiting to execute code requests.


Is there a good resource for learning how to implement languages using PyPy?



Thank you.


This is very cool. I never thought the improvements in something like PyPy could end up benefiting PHP in such a direct way.


From: http://doc.pypy.org/en/latest/faq.html#what-is-pypy pypy is "a framework for implementing interpreters and virtual machines for programming languages, especially dynamic languages."

One of the big goals of the project is to create a general tool for people to implement interpreters and VMs. The intention isn't to just provide a JIT vm for python, but to allow all sorts of new or existing languages to be (re)implemented easily. Don't be surprised if you start hearing about other languages being implemented in RPython.


Seeing a speedy Ruby implementation in PyPy would be a tipping point, I think. :)


I started one as a toy project some months ago (in April, tells me the filesystem), just for kicks.

I used ply to write the parser (aiming at 1.9.3, which has a lex/yacc parser). That was the first time I used an LALR parser, so it's very experimental, and code quality is lacking, to say the least. Also, since ply uses (at least) kwargs, it won't pass through the PyPy build step. Still, early performance comparisons between CPython and PyPy give a clear edge to PyPy.

To give you an idea of how experimental it is, it's not even in my ~/Workspace (which is usually the step before GitHub), but still in ~/Sandbox/pypy/pyby. I actually had a hard time finding it again.


Note that this requires quite some work to actually benefit PHP, because the rest of the interpreter is missing (as well as say web server integration). However, once this is done, JIT improvements will benefit both.


The concept is cool, but I think that once all the tokenization is done on the full construct of the language, it probably won't be substantially faster than the HipHop C++ converter. This is only a 1.0 with benchmarks based on limited functionality. As the code base grows, so will the pains of optimizing for speed.


I disagree. Have you seen the benchmarks where they showed Python code in pypy running faster than C equivalents? Sure they were contrived, but they support the idea that theoretically any JIT runtime should be faster than ahead-of-time compilation. It makes sense to me, since the JIT runtime has more information with which to optimize the code - pypy knows what's going on with the code while it's executing, whereas the hiphop compiler only has information about the code itself.

In addition, an implementation in pypy should be able to support eval, which is impossible (or perhaps extremely difficult) with hiphop.


They were far from all contrived: many were taken from bottlenecks in real Python applications. Off-hand, at least the django, genshi, and html5lib benchmarks are very much real code, and none of the three benchmarks are optimized for PyPy in any way.


Yes, running Django in pypy gives you enormous speed boosts (I've seen it myself), but that's not what I was talking about.

I was referring to the benchmarks in which they measured Python code in pypy running against pure C code (see some examples below). While I'm not familiar with the benchmarks you're referring to, I doubt anybody implemented all (or even parts) of Django, genshi or html5lib in C.

http://morepypy.blogspot.com/2011/02/pypy-faster-than-c-on-c... http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-a...


See http://speed.pypy.org/ for the benchmarks I was referring to: that's the general collection of benchmarks used for PyPy (several, such as the html5lib one, imported from unladen-swallow), though obviously in specific cases comparisons are made otherwise.


It sounds like he prepared for that: "get as close to PHP as possible, implementing enough warts and corner cases to be reasonably sure that it answers hard problems in the PHP language"


I wonder about the motives behind Facebook's (apparent) decision to continue investing in PHP.

If I understand it correctly, most of their backend services are implemented in some other language, and their PHP code is mostly used as a more flexible templating language. If that's the case, it shouldn't be all too hard to migrate away from PHP, if they chose to do so.

It seems like they spend a lot of engineering effort in optimizing PHP, and probably also a lot of CPU cycles in executing it. I assume there is a tipping point somewhere, when the investment in PHP stops making sense, even given effort to port legacy code.

There are a lot of arguably nicer alternatives to PHP, and I bet they are not benefiting from the one really superior feature of PHP (easy deployment on shared hosting).


This topic has been covered quite extensively on Quora - here's a relevant thread: http://www.quora.com/Why-hasn-t-Facebook-migrated-away-from-...


I am honestly asking: is there a simple reason why, let's say, Python can't be easily deployed on shared hosting? Take Heroku, probably the easiest way. It will take some persistence: https://devcenter.heroku.com/articles/quickstart + https://devcenter.heroku.com/articles/python and that's just the hello world. So my question is: why is it like that, and can it be solved? What are the technical limitations? I know complicated things are complicated, and if there were some simple way around it, it would already be done that way, but then again, what is the reason?


> I am honestly asking: is there a simple reason why, let's say, Python can't be easily deployed on shared hosting?

The standard execution model for PHP apps is, request comes in, code is executed, code goes away, resources returned. PHP is optimised to suit this model. For nearly anything else, you have one or more long-running processes which take requests, thus skipping the code loading/interpreting/compiling stage, which tends to be expensive.

It's quite easy to do python CGI on shared hosting without concerns, but it's not going to be fast. Once you start looking at things like FastCGI or WSGI, you have a process which, at least some of the time, is persistent.
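To make the contrast concrete, here is a minimal sketch of the persistent-process model, written as a WSGI callable. The names `application`, `call_once`, `LOADED_AT` and `REQUESTS_SERVED` are illustrative only. Module-level code runs once when the process starts, while the callable runs once per request, so state (and loaded code) survives between requests. Under plain CGI the whole module would be loaded again for every request.

```python
import time

# Module-level work happens once, when the long-running process starts.
LOADED_AT = time.time()
REQUESTS_SERVED = 0


def application(environ, start_response):
    """WSGI callable: invoked once per request inside the same process."""
    global REQUESTS_SERVED
    REQUESTS_SERVED += 1
    start_response("200 OK", [("Content-Type", "text/plain")])
    body = "request %d served by a process started at %.0f\n" % (
        REQUESTS_SERVED, LOADED_AT)
    return [body.encode("utf-8")]


def call_once(app):
    """Drive the app without a server, roughly as a WSGI container would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"REQUEST_METHOD": "GET", "PATH_INFO": "/"}, start_response)
    return captured["status"], b"".join(chunks)


first = call_once(application)
second = call_once(application)
```

The counter reaching 2 on the second call is the point: the process, and everything it has loaded, is still there. That is also why this model fits shared hosting less comfortably than "run the script, throw everything away".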



No: with PHP you can upload a bunch of files over FTP and you're done. With Python you pretty much always need to ssh into the server after pushing the files. For someone who knows nothing, this makes PHP much easier.


Does anyone have any idea what "problems" one "inherits", per the article, in building dynamic languages on bytecode VMs like Parrot and the JVM? It seems to be talking about the fact that the JVM was designed for Java the language, and how that's induced some kind of drag on porting other kinds of languages to it, but I really don't see how that's relevant to Parrot, which was designed to be language agnostic (modulo its bias towards supporting the needs of dynamic languages) from the get-go.


He's just wrong about the JVM, specifically with statements like this:

"the benefits of the JVM only apply to languages that map well onto Java concepts"

With InvokeDynamic (InDy) this is no longer true at all. InDy lets you manage the entire lifecycle of method dispatch at a particular call site in a way that HotSpot can optimize across. This lets you define your own dispatch semantics that can potentially do things like complicated argument conversions (e.g. rolling all of the arguments up into an array) and it will still dispatch as fast as Java. This is true because InDy lets you define your own polymorphic inline method caches at call sites, so you can have a slow lookup for a method handle which is cached and invokes as fast as InvokeVirtual the next time it's called.

HotSpot can also inline code using InDy just like it can inline virtual method dispatch in Java. This lets the JIT compiler optimize across several methods, potentially implemented in many languages, as if they were a single body of code.

InDy will play a particularly important role in JDK8, where it's used to implement lambdas.
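The per-call-site caching described above can be illustrated with a small sketch. This is plain Python, not the JVM's invokedynamic API; the `CallSite` class is hypothetical and only shows the shape of the technique: a slow, language-defined lookup runs once per receiver class, and later calls at the same site reuse the cached target.

```python
class CallSite:
    """One call site with a small cache keyed on the receiver's class."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cache = {}        # receiver class -> resolved function
        self.slow_lookups = 0

    def invoke(self, receiver, *args):
        cls = type(receiver)
        target = self.cache.get(cls)
        if target is None:
            # Slow path: language-specific lookup, done once per class.
            self.slow_lookups += 1
            target = getattr(cls, self.method_name)
            self.cache[cls] = target
        # Fast path: call the cached target directly.
        return target(receiver, *args)


class Dog:
    def speak(self):
        return "woof"


class Cat:
    def speak(self):
        return "meow"


site = CallSite("speak")
results = [site.invoke(animal) for animal in (Dog(), Cat(), Dog(), Cat())]
```

Four calls trigger only two slow lookups, one per class seen at this site. On the JVM, the corresponding cache is something the JIT can see through and inline, which this sketch does not attempt to model.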


I tried to pin down some problems, but let me elaborate on this. As far as Parrot goes, it's not done despite years of development and it has lots and lots of problems, some with the approach, some with the actual implementation. But the truth is you can't do much with it today.

As for the JVM, which is definitely "done", you don't get access to low-level concepts and you don't get support for most dynamic language concepts. For example, in Jython escape analysis does not work at all, because everything escapes via frames (which in Python are accessible from application level using sys._getframe). Invokedynamic only helps marginally here - you still need a JIT that can optimistically remove unlikely paths of execution, or a very smart compiler-to-the-JVM, which is unnecessary in PyPy.

The other element is low-level stuff. The JVM is opaque: you can't code in terms of C structures. RPython is opaque as well, but it lets you do so if you insist, and sometimes there are very good reasons to insist. Look in the rpython/ directory in the hippy checkout for an implementation of an ordered dict with all its oddities. In the JVM, if such a primitive is not provided (and it's unlikely to be), you're out of luck.
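The point about frames escaping via sys._getframe can be seen in a few lines of ordinary Python. This is a small illustration, with the function names `peek_at_caller` and `compute` made up for the example; sys._getframe is a CPython implementation detail that other interpreters may or may not provide.

```python
import sys


def peek_at_caller():
    # sys._getframe(1) is the caller's frame; f_locals exposes its variables.
    caller = sys._getframe(1)
    return dict(caller.f_locals)


def compute():
    secret = 42
    total = secret * 2
    # Any callee can observe `secret` and `total`, so an optimizer cannot
    # simply assume they stay private to this function.
    return peek_at_caller()


seen = compute()
```

Since any callee might do this, a naive escape analysis has to treat every local as potentially escaping, which is the problem described above for Jython.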


To be fair, while there is invoke dynamic, he might be alluding to other aspects of the JVM (and referencing Parrot hints to that).

Eg. class loading in the JVM is (afaik) still static, classes once loaded cannot be modified, which is a problem for languages such as Ruby and JS, the basic data types of the JVM might not match your language, etc.

Not sure if that makes a big difference, in essence you'll have to implement those semantics in either case, be it in RPython for PyPy or Java for the JVM.


Classes can be changed after loading using HotSwap; it's just that their method signatures can't change:

http://docs.oracle.com/javase/1.4.2/docs/guide/jpda/enhancem...


It'd be pretty neat if there were an R[1] interpreter written in RPython, to give any R script a free speed boost; that said, R has a pretty hairy grammar, so it might end up being easier to just switch to (for example) Julia.

[1]: http://www.r-project.org/


I dunno, the grammar you use in R is pretty weird, but it can all be expressed in simple function calls. For example "["(x, 2) is equivalent to the more typical x[2]. I've actually been thinking about re-implementing R, and PyPy does sound like the best way (I was also thinking of that Clojure implementation in C, given that R started out as a Scheme interpreter). It would be a lot of work, but the rewards are massive. Reimplementing R the language is a lot less work than reimplementing the entire set of libraries.
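As a rough sketch of why "everything is a function call" simplifies an interpreter, here is a tiny evaluator in Python. It is not R and ignores most of R's semantics (lazy evaluation, environments, vectorization); the `FUNCTIONS` table and `evaluate` are hypothetical names. The point is only that once the parser has rewritten x[2] as "["(x, 2), the evaluator needs a single rule for calls.

```python
import operator

# Function table: operators are ordinary entries, like "[" and "+" in R.
FUNCTIONS = {
    "[": lambda seq, i: seq[i - 1],   # R indexes from 1
    "+": operator.add,
    "c": lambda *items: list(items),  # stand-in for R's vector constructor
}


def evaluate(expr):
    """Evaluate a call tree of the form (function_name, arg, arg, ...)."""
    if isinstance(expr, tuple):
        name, *args = expr
        return FUNCTIONS[name](*[evaluate(arg) for arg in args])
    return expr  # literal value


# x[2] + 1 with x = c(10, 20, 30), written as nested calls:
# "+"("["(c(10, 20, 30), 2), 1)
tree = ("+", ("[", ("c", 10, 20, 30), 2), 1)
value = evaluate(tree)
```

The hard part, as the replies note, is the parser that produces such a tree from real R source.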


The function call transformation is good, but one still has to parse the code to be able to do this. I remember reading an article a while ago that described the hoops the author had to go through to get a proper R parser working, though I can't find it now.


Please submit it if you ever do, I'd love to read it.



I'd love to see a Kickstarter project for this so that author can be funded and complete the implementation.


From Kickstarter FAQ:

To be eligible to start a Kickstarter project, you need to satisfy the requirements of Amazon Payments:

  —You are 18 years of age or older.
  —You are a permanent US resident with a Social Security Number (or EIN).
  —You have a US address, US bank account, and US state-issued ID (driver’s license).
  —You have a major US credit or debit card.
So it's a big no-no for a lot of people who happen not to be residents of the US, like me.


http://www.quora.com/What-Kickstarter-alternatives-exist-for...

Not sure if any of these are worth looking into. You definitely lose out on the brand weight that comes with kickstarter.


I have no idea on the details; but is there a member of the PyPy project (or of the Hippy VM project) who does satisfy those requirements, and who you can arrange to proxy the funds through?


That sort of setup might be considered fraud by Kickstarter, or their financial services.


The "arrange" was meant to mean: arrange so that it is completely legal & not considered fraud (etc). I have literally no idea if this is possible though.


Since the results are so dramatic, either Facebook will continue to fund him, or they will have an internal team finish it (and probably release it as open-source like they did with their other PHP tools).


As an author I can say that neither of these is likely to happen. Facebook has invested a lot of time and effort in their homegrown solution, and note that this one is incomplete.

It's absolutely not my call to say whether making Hippy complete is easier than making HipHop fast. I can provide estimates on the former, if requested; I cannot possibly provide estimates on the latter.


It would be really nice to see a bigger comparison, not only between different versions of PHP but also including Python on PyPy, Ruby, and some Java alternatives for good measure. If I am sticking with PHP this might be nice, but if I need to do some refactoring to get it to work, I want to know how it compares to the speedup I might see from refactoring to a different platform.


Is this different from LLVM? Or does it use LLVM? Could someone explain?


It's very different. LLVM is a portable assembler. PyPy is a compiler generation toolchain. Using LLVM means you don't have to write your assembler backend; however, there is much more to a just-in-time compiler than that. LLVM is good at optimizing, but all its optimizations are low level. In order to provide a dynamic language VM, you need optimizations like escape analysis, frame removal, etc., which are out of scope for the LLVM project. PyPy, on the other hand, gives you that level of optimization for free (although, for example, it supports fewer assembler backends and its low-level optimizations might be sub-par).

In short - LLVM is great if you want to write a language like C, PyPy is great if you want to create a dynamic language VM like Python or PHP.


Would be interesting if someone implemented CoffeeScript using this.


I'm looking into this and even talked to Jeremy Ashkenas about it last year. Unfortunately, I've not had the time to get any further than reading the PyPy tutorials:

http://morepypy.blogspot.nl/2011/04/tutorial-writing-interpr...

However, I hope to have some time in the fall in which to pursue this. If anyone else takes it up in the meantime, I'd love to contribute.

Interestingly, a native CoffeeScript interpreter already exists: Poetics, which runs on the Rubinius VM. Very cool project. Another very interesting CoffeeScript project is the MoonScript project, which compiles a variation of CoffeeScript into Lua.


I'm confused. Coffeescript compiles to JavaScript which usually runs in a vm with a jit.


That's why it would be interesting to see it as a language on its own.


Has anyone tried to write a scheme interpreter?


https://bitbucket.org/pypy/lang-scheme I think it predates the JIT days though, so a bit of work would be required to make it fast (not too much)


<troll(?)> PLEASE, for the love of god, stop improving PHP and strictly PHP-related technologies! I know it's fun and you're doing cool stuff... but you're just like the scientists doing bio weapons research because it's cool stuff to them! Let's just bury this language once and for all! </troll(?)> ...but really, considering the amount of effort and man-hours the Facebook guys invested in improving PHP, with Hiphop and now this, it really saddens me ...so much wasted resources (and I mean HUMAN resources, like hours of people's lives!) almost wasted!


Have you seen any of Facebook's extensions to PHP?


I was just thinking of Hiphop and XHP, and they both seem like a huge amount of work and I've heard about improvements to the interpreter to optimize it for their servers... of course it might all be FUD for competitors, they might actually have everything coded in Common Lisp behind the scenes ;)



