Program comprehension is a fascinating (and IMHO very enlightening) field.
For anyone curious, I wrote up some of what I learned while exploring the research a few years ago:
http://www.clarityincode.com/readability/
(I apologise for the less than stellar formatting; I haven’t updated that page for some time. I also apologise to the actual researchers I cited, if I’ve dumbed down their work too much in aiming for a non-expert audience.)
For the eye tracking reported here, I wonder whether the early emphasis on the top part was a combination of trying to figure out the data flow in the between() function and then its significance to the wider program.
I think it would be interesting to compare the results with a similar eye track of a program written with more emphasis on data flow rather than control flow, e.g.,
def between(numbers, low, high):
    return [n for n in numbers if low < n < high]

def common(list1, list2):
    return [i for i in list1 if i in list2]

x = [2, 8, 7, 9, -5, 0, 2]
x_btwn = between(x, 2, 10)
print(x_btwn)

y = [1, -3, 10, 0, 8, 9, 1]
y_btwn = between(y, -2, 9)
print(y_btwn)

xy_common = common(x, y)
print(xy_common)
It might also be interesting to compare the results with a functional programming language that expresses those ideas more concisely and/or with tools like between() and common() as part of the standard library that programmers would probably be familiar with.
Final thought: How much does the absence of a clearly marked starting point (like a main() function in C) affect how a reader approaches unknown code in Python? If this had been a C program, would the reader have aimed straight for main() and then worked down from there to functions like between() and common()?
I think it is very important to use code from real-life applications, rather than synthetic examples like that, or like the ones presented in the original article.
Otherwise the results of the research would be biased towards comprehension of meaningless chunks of code, rather than that of a real thing.
That's step 2. Step 1 is using contrived examples with clear differences in structure. You need to start small like this because in order to answer the basic questions, you need to eliminate experimental variables. That way, you can establish basic principles for how people tend to build up an understanding of code.
Then you can move on to what we're really interested in: how do people build an understanding of complex applications with many interacting parts? But you can't go there first; if you do, you won't be able to interpret your results because there will be just as many fundamental questions as experimental variables.
I agree, but isn’t this the curse of most experimental research in programming techniques?
Ideally, we’d compare like-against-like using industrial scale applications, implemented by professional practitioners, controlling for everything except what we’re trying to investigate. Finding opportunities to do that in practice is rather harder, because obviously most real software development projects don’t get implemented twice by identical teams making exactly one significant change in their approach.
What I thought was interesting here was that even though it is only a toy program, there were still some patterns to how the reader explored it that might suggest more general trends. As long as we understand the limitations of the experiment and don’t overgeneralise any conclusions, isn’t some data still better than random conjecture?
Yes, but why can't some regular code (from an open-source project) be used in the experiments? I'm sure there must be a way to do that.
It's just that this synthetic code is broken. It doesn't read. There is no flow in it. At first glance it looks like an uninteresting, unimportant piece that doesn't do anything, so there is no point in reading it.
Does that sound convincing enough that there is a big difference between reading synthetic examples and actual code?
Yes, it does. There was a study of how well people could remember positions of chess pieces on a board. The general public did equally badly for both random arrangements and those that could actually appear in a game, but skilled chess players did well remembering arrangements that could appear in a game, but were no better than the general public on random configurations.
> isn’t this the curse of most experimental research in programming techniques?
Of most experimental research with humans in general, for that matter. It's a big topic of controversy in psychology, because a lot of quantitative psych results are from synthetic tasks in lab settings, which leads to argument over the extent to which those are accurate proxies for real-world behavior.
I’m not sure if this is what you were alluding to there, but I think there is a real problem with experimental research in this kind of field that often the test subjects are students, who tend to have much less skill and experience than working programmers with a few years of practice behind them.
It’s a specific instance of a more general problem: it seems clear that programmers often work differently depending on their familiarity both with programming generally and with the specific domain they’re working in, but we’ve only scratched the surface in identifying exactly how they work differently, and therefore what practical steps we might take to make things better for programmers in different situations.
This is often the problem with a lot of psychology research, where the only easily available subjects are university students who receive credit for participating in the research (or a nominal payment).
For an experiment like this it's important to have small, self-contained code fragments. It wouldn't be fair to expect someone to decipher the whole Linux kernel source in a short amount of time, for example. Given a sufficiently small chunk of code, almost everything looks synthetic. I imagine many real-life applications have sections that take two arrays, find the numbers within a certain range and then take the intersection like this program does.
Actually the Linux kernel is very readable. And there is nothing synthetic about it. Just try opening a random page of Linux kernel source code and you'll see. Say: http://www.linux.fm/source/crypto/xor.c/
> I think it would be interesting to compare the results with a similar eye track of a program written with more emphasis on data flow
It would also be interesting to compare with code with decent syntax highlighting. For some reason I don't quite understand, most syntax highlighting out there highlights the wrong thing: keywords. It needs to highlight the site-of-definition of functions, variables, etc. For example: http://david.rothlis.net/code_presentation/distracting_synta...
This is something I'm definitely interested in. I don't have any syntax highlighting in the current experiment to avoid controversy over the "right" way to highlight, and to put the focus on content of the code. For the future, though, this would be an excellent experimental manipulation to try. Thanks!
Some of the initial analysis Mike has done shows that programmers spend the majority of their time looking at identifiers. I think this supports your point.
It's pretty easy to understand why keywords are highlighted, even if it isn't justifiable. 1) Programmers are lazy, 2) regexes are the quickest way to implement basic syntax highlighting, and 3) if your keywords are reserved words (as most are), then you can get infallible highlighting with the expression /(^|\s):listofreservedwords:(\s|$)/.
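To illustrate the parent's point, here is a minimal sketch of regex-based keyword highlighting in Python. The keyword list is illustrative, and ANSI bold escapes stand in for whatever styling a real editor would apply:

```python
import re

# Illustrative subset of reserved words; a real highlighter would use
# the language's full keyword list.
KEYWORDS = ["def", "return", "if", "in", "for", "print"]

# \b word boundaries play the role of the (^|\s)...(\s|$) anchors above.
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")

def highlight(source):
    # Wrap each keyword in ANSI bold-on/bold-off escapes.
    return PATTERN.sub(lambda m: "\033[1m" + m.group(1) + "\033[0m", source)

print(highlight("def between(numbers, low, high):"))
```

This is exactly the "quickest way" being described: no parsing at all, just pattern matching on reserved words, which is why it can't find definition sites the way Emacs's heuristics can.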
Emacs, at least, already provides primitive parsing / heuristics that work reasonably well for most languages. See the example in the link I posted above: All I had to do was customise the colour and font-weight of various things already recognised by Emacs: function-name, variable-name, type, etc.
Maybe what I don't understand is all the effort people put into making pretty themes for their editors --there are millions of these themes out there-- and almost all of them insist on highlighting (drawing attention to) the language's keywords of all things. The least important part of any program.
This seems to be massively more readable to me. Looking at the video version of the code I spent quite a few seconds looking at the non-chained comparison ("everyone chains comparison when checking bounds. Why isn't it that way here?"), and the "winner" containers (is there something winning something?).
FWIW, the most important conclusions from exploring the research that I’ve drawn personally are:
1. Establishing familiar patterns is good, and being able to identify deviations from patterns (intentional or otherwise) is probably far more important than we usually assume.
2. You’re going to need to figure out the data flow for some code you’re reading anyway, and a lot of the time the control flow is just noise/accidental complexity, so why not put the data flow front and centre if possible?
Applying these to the example code we’ve been discussing, I might prefer to write it something like this:
x = [2, 8, 7, 9, -5, 0, 2]
y = [1, -3, 10, 0, 8, 9, 1]
print filter ( 2 < _ < 10) x
print filter (-2 < _ < 9) y
print intersection x y
with the proviso that the previous common() function doesn't quite do what an intersection() usually would, which in itself might prompt us to think about whether the previous common() function that is asymmetric in its parameters is actually what we wanted. (Maybe it was, or maybe it was an unintentional error but no-one noticed because the results are the same with this set of sample data.)
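To make that asymmetry concrete, here is a small Python sketch; the input lists are made up specifically so the results diverge:

```python
def common(list1, list2):
    return [i for i in list1 if i in list2]

# Duplicates in the FIRST argument survive, and order follows the first
# argument, so the function is not symmetric the way a true set
# intersection would be:
print(common([2, 2, 8], [8, 2]))             # [2, 2, 8]
print(common([8, 2], [2, 2, 8]))             # [8, 2]
print(sorted(set([2, 2, 8]) & set([8, 2])))  # [2, 8]
```

With the original sample data the difference happens not to show up, which is exactly how an unintentional asymmetry could go unnoticed.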
I have a theory that part of the attraction/intrigue of functional programming for a lot of working programmers comes from the way it inherently emphasizes these ideas, particularly making data flow almost the default form of presentation.
I also have a theory that maybe there's another mental model to be found/learned/created, somewhere above control flow and data flow but below the general domain model, capturing the flow of observable interactions with the outside world: side effects, interactions/dependencies between concurrent operations, and so on. I suspect that part of the reason we find functional programming more difficult for some things (see the “awkward squad” paper) is that when you remove the implicit control flow, you also remove the implicit ordering of events/effects that you get with imperative programming, and there is a cost to that if you don’t replace it (or at least the parts of it that matter for that higher-level view of what your code is doing) with something else.
“One of the things that stood out to me in watching the video was how much my mind seems to work like a computer.”
One of the things that stood out to me in watching the video was how much his mind seemed emphatically not to work like a computer at all. His process gave the appearance of a network self-training for a little while, then simultaneously training and producing output. The more times an area of the program was visited, the better the training, and consequently the longer it could be retained. Notice how the results of calculations are “picked up” from the source and “dropped” almost immediately into the output, as though they’re heavy and difficult to hold on to!
It's awesome to see that people are interested in my research! I've made a blog post with another video and a few more details: http://synesthesiam.com/?p=218
I'll be very interested in findings on the terseness of code and its effects on readability by experienced coders.
For example, the ternary operator is considered by many to be an elegant solution to simple if statements, but from a comprehension (and therefore bug-finding) standpoint, is it superior? Also, how does this effect change as the size of the codebase grows from a single page (as depicted in the video) to a more complex class file?
value = test ? (some_value * multiple) : false_value

# vs

if (test) {
    value = some_value * multiple
} else {
    value = false_value
}
I would also like to know this. At first the ternary operator seems difficult to comprehend, but after a little bit of mucking around with it, it really seems more natural to me. When I'm thinking about a binary question, rarely do I ever utter if/else in my internal monologue. I ask myself the question and the response is either a yes or a no.
I would say it is a matter of personal preference. However, first and foremost whichever thing you prefer should be codified in your project's style guide. Consistency trumps everything else by a large margin.
My personal opinion is that this is an excellent use of the ternary operator. When writing software, you want your code to be as simple and short as possible provided it is readable[1]. I find both of them to be equally readable, therefore I prefer the shorter one by a lot. We are talking about 5 lines of code vs 1, which can really add up if this appears commonly all over your codebase. One limited asset every engineer has is monitor real estate, and the more code you can fit on your screen, the better (provided all of that code is the readable variety).
I also find in this particular example, the ternary could get a slight nod for readability as well. This is because with the ternary operator, you start with:
value =
Ok, value is getting assigned to something... then you read the ternary expression. If you are grepping through this code and looking to see what value might be set to, you get to this line and you know you have found it, then you parse the rest of it to figure out what it is being set to.
With the other example, your grepping will lead you to a "value =" that is within a conditional, so now you have to look up and down and explore a little more to see what it might be set to. This is because the ternary operator can only do one small thing, whereas the if statement can do lots of things. Since you are only doing the simple thing, using the simplest possible operator to do that in some sense helps future people reading the code [2].
[1] Which might seem like a spectrum, but I think of it as much more binary. Code is either readable or it isn't. This might seem strange, but my fellow engineers and I spend quite a bit of time grading code submissions to our programming challenges, and of course the "readability" of the code is a key thing we grade for. I think we probably match up our independent up or down votes on readability at least 90% of the time.
[2] If you want to frustrate experienced engineers, have them look through a bunch of code that hasn't been factored down to its simplest form. This is extremely common among inexperienced engineers who refactor code and don't delete a bunch of cruft that now exists due to code or logic changes because "who cares? the code works!", and you get stuff like this:
if ($has_phone_number) {
    return TRUE;
} else if ($has_phone_number || $has_email_address) {
    return TRUE;
} else if (!$has_phone_number) {
    return FALSE;
} else {
    return FALSE;
}
return FALSE;
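A quick exhaustive check (a Python transcription of the snippet; the flag names follow the example above) confirms that the whole branchy block collapses to a single disjunction:

```python
from itertools import product

def cruft(has_phone_number, has_email_address):
    # Direct transcription of the branchy original.
    if has_phone_number:
        return True
    elif has_phone_number or has_email_address:
        return True
    elif not has_phone_number:
        return False
    else:
        return False

def simplified(has_phone_number, has_email_address):
    return has_phone_number or has_email_address

# The two agree on every possible input, so every branch past the
# first two lines of the original is pure noise.
for phone, email in product([False, True], repeat=2):
    assert cruft(phone, email) == simplified(phone, email)
print("equivalent on all inputs")
```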
since you asked for my opinion, i think that ? : syntax is a failure. languages where everything is an expression (usually functional languages) have a much better implementation:
value = if (test) {some_value * multiple} else {false_value}

value = if (test) {some_value * multiple}
        else if (test2) {second_value}
        else {false_value}
which falls in a straightforward way out of the if/else syntax, which the language authors have put a lot more thought into than the rarely used ?: syntax.
value = test ? some_value * multiple
      : test2 ? second_value
      : false_value;
Which, you know, isn't that bad. But it looks really weird since people don't expect you to use ternary that way, while multi-part if statements are fairly normal.
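For what it's worth, Python's conditional expression chains the same way; this sketch (with made-up test values and names following the example above) shows the same table-like layout:

```python
# Illustrative values so the expression has something to evaluate.
test, test2 = False, True
some_value, multiple, second_value, false_value = 3, 4, 7, 0

# Chained conditional expression, indented to read like a lookup table:
value = (some_value * multiple if test
         else second_value if test2
         else false_value)
print(value)  # 7, since test is False and test2 is True
```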
That's basically the same syntax but with keywords. How is it 'much better'? And if you're doing something complex like nested ternary (please don't) you can use () to mark sections just as well as your method uses {} to mark sections.
it's better because 1) it falls out of the regular semantics of the language. it's just a regular if expression. it's a consequence of a more elegant and unified design, in other words. that ternary syntax by comparison is ad-hoc. to begin with, it's the only ternary operator in those kinds of languages. (thus people call it "the" ternary operator). it shows poor design restraint, imo
2) it's not punctuation. a ternary expression is a 3-input mapping to begin with. punctuation only makes it noisier.
A table is fine, but the equivalent of 'else if' isn't true nesting. I meant making an elaborate tree, which should not be squished into a single statement unless you have a really good way of organizing it.
Assuming I know the ternary operator, I find the first version superior. Why? It only says "value = " once. I know immediately what the action of the code is when I start reading. Starting with an if makes me worry that the two code paths diverge, and also makes it take analysis to realize that 'value' is definitely assigned.
From my personal perspective, readability comes from the "beauty" of the code. Humans instinctively go on their first impressions, so code that looks appealing at first glance is code that is desired to be worked on. The absolute readability difference between the ternary operator and an if statement is inconsequential outside of which better fits the design context, I think.
I always like to think the text editor is my canvas and the characters are my brush. My job is to make something functional, of course, but also something that looks beautiful.
This is actually one of the things that's driving Mike's research. When the latest programming paradigm comes out, usually people argue it's better based on the code being shorter or prettier. Judging programming styles based on beauty is inherently subjective, so Mike is trying to find some quantitative basis for comparing programming systems.
Language has to be an important factor here. If I re-write this in APL --and you know APL-- reading the solution is pretty linear:
R ← data BETWEEN limits
R ← ((data>limits[1])∧(data<limits[2]))/data
R ← a COMMON b
R ← (∨/a ∘.= b)/a
x ← 2 8 7 9 ¯5 0 2
y ← 1 ¯3 10 0 8 9 1
x_btwn ← x BETWEEN 2 10
y_btwn ← y BETWEEN ¯2 9
xy_common ← x_btwn COMMON y_btwn
If you know APL the above pretty much reads like the palm of your hand.
Since APL is unknown to most, here's a quick explanation.
R ← data BETWEEN limits
Dyadic function declaration. Takes two arguments.
R ← ((data>limits[1])∧(data<limits[2]))/data
Let's break this up:
(data>limits[1])
Takes the "data" vector and compares it to the first element in "limits", which happens to be the "low" limit. You get a binary vector as the result with a "1" anywhere the comparison is true and "0" otherwise.
If "limits" is 2 10:
0 1 1 1 0 0 0
Now:
(data<limits[2])
Does the same thing with the upper limit:
1 1 1 1 1 1 1
Then:
0 1 1 1 0 0 0 ∧ 1 1 1 1 1 1 1
Performs a logical AND of the two binary vectors, resulting in a new vector:
0 1 1 1 0 0 0
Finally:
R ← 0 1 1 1 0 0 0/data
Selects elements from the "data" vector based on the values in the binary vector and returns the result vector:
8 7 9
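For readers more comfortable with Python, the same mask-and-select step can be transliterated with NumPy (a sketch of the idea, not a full APL equivalent):

```python
import numpy as np

# The walkthrough above: build a boolean mask from element-wise
# comparisons, then use it to select from the data vector.
data = np.array([2, 8, 7, 9, -5, 0, 2])
low, high = 2, 10

mask = (data > low) & (data < high)   # like (data>limits[1])∧(data<limits[2])
print(mask.astype(int))               # [0 1 1 1 0 0 0]
print(data[mask])                     # [8 7 9]
```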
Anyhow, that's why I think that language is important. If I wrote this in Forth the pattern would be very different and the thought process required to understand it more convoluted. Probably true as well for assembly.
I don't think eye tracking can do any good, because recognition of familiar shapes, and then mapping them to familiar constructs and "structures", comes first.
So, reading code with a familiar shape is one thing (that is why Lisp has such emphasis on the form of an expression, and Python takes that to an extreme), while reading long.chains.of.unfamiliar.methods.is.another.)
Then comes recognition of familiar zones (areas) of an expression, expecting particular kinds of sub-expressions here and there. Then you match what you have seen with known whole things.
Let's say it is a recursive process of reduction to something already known, by examining shapes, forms, and details.
So, shape matters. Small procedures, around ten lines, matter. Naming matters, and, especially, using one-letter, non-confusing (no meaning) names for mere placeholders matters.
Let's say this is solved in Lambda Calculus (by a naming strategy), and then in Lisp (by a shaping strategy) by accident.)
I find it rather amusing, given the context, that I spent a long time (more than 30 seconds) trying to figure out the match for your unbalanced parenthesis. I don't think I would have done that if it hadn't been a paragraph about Lisp: I kept going back thinking "Wait, I didn't realize I was in a parenthetical expression... does this change the interpretation of what is being said?".
Seconded. Surprising how distracting something so trivial is given recent (unrelated) context. My mind immediately sprang to the conclusion that perhaps this sentence is partially applied - a legacy of an evening spent writing some interesting functional code for treating MIDI as observable sequences.
One of the things I noticed from the Eye Tracking was that this is very similar to how we were trained to work through programming problems in high school level computer science competitions. Some competition problems involve tracing through code and determining output (of usually recursive functions) but the 'proper' method we were taught is almost exactly how he describes his thought process.
100 times more frequently I read code I already read (or wrote) before. It works differently.
I page through, scanning the code, noting its "shape" as I go without actually reading any lines.
Once I've found the place the problem might be, I start hand-executing (or eye-executing?) for problems.
Does it really matter much how we read new code? We read many times faster than we write; re-reading is the norm.
Funny, I mentioned this a few days ago. I was going to do some googling on other interesting code quality/dev productivity/language metrics, but, uh, never got around to it.
It seems to me that in the first few passes, when on-sighting or sight-reading code (as climbers and pianists call it), you're looking for easy-to-comprehend structures while simultaneously blocking off the difficult-to-decipher ones, so maybe there's a bimodal distribution at work here.
It looked like a commercial eye tracker to me, so we wouldn't be able to open source it. However, a quick Google search revealed some things that might be promising.
Eye tracker hardware is tricky and therefore usually expensive, but on the software side there are good tools. I've used VisionEgg[0] with an Eyelink II system and found it quite easy to deal with.
Not that expensive. For a couple of hundred dollars you can build a very nice 60-90fps IR eye tracking system. You'll need an IR camera, an IR tele lens (very important), a couple of IR sources and open source tracking software. See http://www.gazegroup.org/develop/
There is also an alternative approach. If you are just recording and don't need interactivity, all you need is a regular camera with zoom and a small tripod. You can record on an SD card and synchronize the stream time in post-processing. Just flash the monitor screen a couple of times at the start and the end of the recording, and play a tracking alignment sequence.
"One of the things that stood out to me in watching the video was how much my mind seems to work like a computer."
A more accurate way of thinking about this is: how much the computer works like our mind. Not surprising, considering that people build things based on how we understand everything else, whether consciously or subconsciously.
I read somewhere once that we tend to think of our mind in terms of whatever the current most complex technology is. At one point we thought of the brain as a fantastically complicated plumbing system, and later as a telephone switchboard. I wonder if in another decade or so will the idea that our brains work like computers seem as wrong as the idea that they work like plumbing systems.
I'm surprised no one pointed out that the answer given in the video is wrong. The second line should be "10 0 8 1" and the third "8 9 0". The author writes "10 0 9 1" and "9".
Perhaps it's not relevant to the experiment, but it seems worthy of mention at least. Following along, I was second-guessing myself because my answer didn't match...
... that modern languages, such as Scala, offer advantages as human communication mediums. I describe an experiment, using an eye-tracking device, that measures the performance of code comprehension.
That code is jarring to my eye because of the variable name, my eye tracking would keep going back to 'winners'.
Why winners? What a bizarre variable name, to me anyway. Does anyone else use that? I always go with `matches`, `retVals` or `returnValues` depending on the language/IDE I'm using.
I went back and forth over what to name this variable. I wanted it to convey some meaning (winners are the items that passed or "won" the test), but not give away too much.
Several people have commented on this, however, so I may just change the name to "matches" or something. Thanks for the suggestion :)
The experiment has several variants of each program. In some cases the names make sense, and in other cases they are just random words. One of the goals is to measure how much choosing good names matters.
This is very interesting. Have they considered using multiple languages (functional, oo) as well as including 'garbage' languages (brainfuck) to track differences? I'd also be curious to see how it compares to a standard word problem (two trains leaving chicago etc).
Makes me wonder if future languages will be designed to be read in serial. Perhaps that will decrease the amount of time we need to look at code to understand what it does.
Probably not. I don't think that in the large that helps with readability / maintainability. Even in the small, I find languages which don't allow forward references (Ruby, sh) rather annoying. It forces relatively insignificant implementation details to be read first, rather than starting with the big picture of what the program does.
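Python is a useful middle case here: names are resolved at call time rather than at definition time, so a file can lead with the big picture and defer the details. A minimal sketch (function names are made up for illustration):

```python
# The top-level entry point can come first and refer to a helper
# defined further down the file...
def main():
    return summarize([2, 8, 7, 9])

# ...because "summarize" only needs to exist by the time main() runs.
def summarize(numbers):  # implementation detail, read later
    return min(numbers), max(numbers)

print(main())  # (2, 9)
```

Ruby and sh don't allow this for top-level calls, which is the annoyance being described: the reader is forced through the details before the summary.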
I would expect similar results. Factor and other stack-based languages are not particularly linear, thanks to, well, factoring. Sure, data flow generally runs from left to right and top to bottom, but word and tuple definitions are just as scattered about as in any procedural language.
Interesting how he starts from the components and then rolls up to the main logic, whereas I would start at the main logic then drill down into the routines it uses.
This video seems incorrect to me; I read code line by line, one at a time, most of the time. The video seems to be showing that the eye is jumping all around.
Your eyes jump around a lot more than you think they do, no matter what you're looking at. Your brain compensates for the effect somewhat, so it's impossible to notice the full extent of the movement on a conscious level.
Consider that you can conceptualise reading code as a linear narrative even if your eyes tell a different story. Our perceptions of ourselves are often radically different from our actual behaviour. One fine example of this is in phonetics: the two /t/ sounds in “tomato” are quite different, yet we think of them as the same sound.