Robot Has No Heart

Xavier Shay blogs here


Tail call optimization in erlang

fact(1) -> 1;
fact(N) -> N * fact(N - 1).

You’ve all seen the classic recursive factorial definition. Problem is, it’s not really usable. 50000 factorial, anyone? Each recursive call creates a new stack frame, very quickly blowing out your memory usage. Let’s look at a classic erlang structure, a message processing loop:

loop() ->
  receive
    hello ->
      io:format("hello"),
      loop()
  end.

That looks mighty recursive also – one would be inclined to think that saying hello a couple of thousand times would quickly chew through memory. Happily, this is not the case! The reason is tail call optimization.

As you can see, the above hello program is really just a loop. Note that when we call loop(), there’s no reason to maintain the stack for the current call, because there is no more processing to be done. It just needs to pass on the return value. The erlang compiler recognises this, and so can optimize the above code by doing just that – throwing away the stack (or transforming it into a loop, whichever you prefer).
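The same shape can be sketched in Ruby, with a Queue standing in for the Erlang mailbox (the names here are mine, not from the post). Since Ruby won’t eliminate the tail call, the sketch is written directly as the loop the Erlang compiler would produce:

```ruby
# A message-processing loop, hand-written as a loop -- the form the
# Erlang compiler reduces the tail call to. Queue stands in for the
# process mailbox.
mailbox  = Queue.new
received = []

handler = Thread.new do
  loop do                          # the tail call, as a loop: constant stack depth
    case mailbox.pop
    when :hello then received << "hello"
    when :stop  then break
    end
  end
end

# Saying hello a couple of thousand times doesn't grow the stack.
2000.times { mailbox << :hello }
mailbox << :stop
handler.join
```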

With the factorial example, optimization cannot be done because each call needs to wait for the return value of fact(N-1) to multiply it by N – extra processing that depends on the call’s stack.

Tail call optimization can only be done when the recursive call is the last operation in the function.

With this knowledge, we can rewrite our factorial function to include an accumulator parameter, allowing us to take advantage of the optimization.

fact(N)    -> fact(N, 1).
fact(1, T) -> T;
fact(N, T) -> fact(N - 1, T * N).

Or, since we’ve established that tail recursion is really just a loop, you could always write it that way yourself.

fact(N) -> lists:foldl(fun(X, T) -> X * T end, 1, lists:seq(1, N)).

I haven’t used erlang enough to make a call as to which is nicer. Probably the first one. I’m a ruby guy at heart, so for old time’s sake here’s a version you can use in ruby, which I think is quite pretty (be warned ruby doesn’t do tail call optimization).

def fact(n)
  (1..n).inject(1) {|t, x| t * x}
end
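For comparison, here is the loop that the accumulator version boils down to, written out by hand (fact_iter is my name, not from the post). It runs in constant stack depth, so even fact_iter(50000) won’t blow the stack, though Bignum arithmetic makes it slow:

```ruby
# The tail-recursive accumulator version, hand-compiled to a loop:
# t plays the role of the accumulator parameter T.
def fact_iter(n)
  t = 1
  while n > 1
    t *= n
    n -= 1
  end
  t
end
```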

A Banana a Day

SQL Optimization

Consider the following join (a is a char(32), b is varchar):

select * from t1, t2 where instr(t1.a, t2.b) > 0

The actual code I worked with was a bit more complex, but essentially that is it. It works fine, however performance is rather lacking. This is because for every row in t2, the DBMS must evaluate the instr() function against every row in t1, checking whether t2.b appears in t1.a, until it finds a match.

If a field is used in a function, any indexes on that field cannot be used

That’s bad. In this case, since a is fixed length, we can rewrite the query thusly:

select * from t1, t2 where substr(t1.a, 1, 32) = t2.b

This way, the substr is only performed once for each row of t1, and the result can be quickly checked against an index on t2.b.

In the particular case I was working on (there were multiple joins to be rewritten), this cut execution time down from 2 minutes to just under a second.

Lonely Hammer Syndrome

When all you have is a hammer, every problem looks like a nail. After optimising the query referred to above, I found out its context. It’s a data collection problem: a server log where each url contains a UID, a csv containing details about each uid (page title, etc…), and the two need to be collated. The original process was:

  1. Import both tables into an access db (the server log into a table with one field – whole_line)
  2. Using link tables and SQL string functions, convert the server log into a friendlier format in a table on a local oracle server
  3. Collate the two files using the above mentioned sql
  4. Export collated data to csv (from access)
  5. Append to master file

Basically, we’re going from csv, to access, to oracle, back through access out to csv again. Additionally, many of these steps required manual intervention. This process has to be done every week – how can it be optimised? Of course there are many ways – I looked into Java and JDBC to cut access out of the loop, but thought if I’m going to the effort I may as well cut out oracle as well. Perl is renowned for its log-munging ability, so I put together a script (rather quickly – thank you regular expressions!) which can now automatically produce formatted data in barely a few minutes, compared to up to 60 minutes of manual labour using the old method.
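For the curious, here is roughly what that kind of script looks like in ruby rather than perl. The file formats, field names, and the uid= pattern below are invented for illustration – the real log and csv differed:

```ruby
require "csv"

# Hypothetical sample data standing in for the real files.
DETAILS_CSV = <<~CSV
  uid,title
  42,Home Page
  99,Contact Us
CSV

SERVER_LOG = <<~LOG
  10.0.0.1 - [01/Jan/2007] "GET /page?uid=42 HTTP/1.1" 200
  10.0.0.2 - [01/Jan/2007] "GET /page?uid=99 HTTP/1.1" 200
LOG

# Build a uid => title lookup from the details csv.
titles = CSV.parse(DETAILS_CSV, headers: true)
            .to_h { |row| [row["uid"], row["title"]] }

# Collate: pull the uid out of each request line with a regex
# and pair it with its page title.
collated = SERVER_LOG.lines.filter_map do |line|
  uid = line[/uid=(\d+)/, 1]
  [uid, titles[uid]] if uid
end

collated.each { |uid, title| puts "#{uid},#{title}" }
```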

The moral – expose yourself to as many tools as possible. You don’t need to be an expert in them (I had to google pretty much every perl command :S), but if you know the pros and cons of each you can cut down both development and operational time substantially.
