Friday, November 11, 2005

Ninth SoCal Piggies Meeting

The SoCal Piggies had their ninth meeting at USC (Salvatori Computer Science Center) on November 10th at 7:00 PM. Eight Piggies attended -- Daniel Arbuckle, Steve Williams, Grig Gheorghiu, Diane Trout, Titus Brown, Mark Kohler, Howard Golden and George Bullis.

The first presenter was Daniel Arbuckle, who [WWW] talked about "Python and Unicode". Daniel started by introducing general Unicode concepts such as the Universal Character Set (UCS) and the Basic Multilingual Plane (BMP). A somewhat curious detail about UCS is that, due to a limitation of the UTF-16 encoding, the high 9 bits of the 4 bytes taken by Unicode characters are not being used, so we end up with 1114111 possible characters, which should still be plenty enough for everybody's needs. In any case, the most widely used Unicode characters are the first 65536, conveniently fitting on 2 bytes, which is why many Unicode implementations use only 2 bytes per character (and this includes Python, if you compile it from source and use the default build settings).

In terms of codecs, Daniel talked mainly about UTF-8 and UTF-16, the two most common ones. In short, if you're dealing mainly with latin characters, you're better off using UTF-8, which will offer a more compact encoding in this case. If your app involves other character sets, such as Chinese or Japanese, you're better off using UTF-16, which will usually use only 2 bytes per character, while UTF-8 will tend to end up using more than 2 bytes in this case.

In Python, you use myunicode.encode(codec_name) to turn a Unicode object myunicode into a byte string containing encoded Unicode characters. To go back from a byte string mystring to a Unicode object, use mystring.decode(codec_name). Here is an example showing how to go back and forth between Unicode and normal byte strings:

u"\ud723abcow".encode("utf-8").decode("utf-8") == u"\ud723abcow"

A good strategy of dealing with Unicode in your application is this, in Daniel's words: "decode everything on input, leave it in unicode format while processing, and encode everything on output (preferably using the same codec throughout)". When this is not possible, for example when your code interfaces with a library that sometimes returns Unicode and sometimes returns strings, you can use an adapter function such as this one that Daniel put together:

def adapt_unicode(value, codec):
if isinstance(value, unicode):
return value
try:
return unicode(value, codec)
except TypeError:
return unicode(str(value), 'ascii')

The next presenter was Mark Kohler, who showed us some Python code he put together while taking part in the [WWW] Scrapheap Challenge at [WWW] OOPSLA '05. Dubbed "A workshop in post-modern programming", the Scrapheap Challenge consisted in solving problems in a relatively short amount of time (60 to 90 minutes), using the Internet as the scrapheap, i.e. reusing as much code as you can find on the Internet. The problems were short enough that they could be solved in this manner, yet complicated enough so that solving them from scratch would not be practical in the allotted time. Mark was proud to report that for one of the 3 problems, a Sudoku puzzle solver, his team (each team consisted of a pair of programmers) produced the only solution from all the competing teams. Theirs was of course a Python-based solution, using a [WWW] recipe found via Google on the [WWW] Python Cookbook site. Mark's team used the w3m text-based browser to sanitize the Web page containing the problem, fed the output to grep for filtering the lines containing the actual Sudoku grid lines, munged the grep output via a few lines of custom Python code, and sent the result to the Sudoku solver from the recipe. Nice and fast! We had some fun yesterday making the few custom lines of Python code even more compact by various tricks such as slicing a list while skipping every other character. We could have reduced everything to just one line via a list comprehension, but we stopped short of that :-)

Another fun problem was to build a so-called "integrationometer", a dial that shows how far a local copy of an svn repository has diverged from a the repository. Mark's team's solution was to use svn diff to get the number of lines changed between the local copy and the repository, then use PIL to draw a nice semicircle as a dial that showed the percentage of differing lines in red on a green background. Very neat.

A third problem (actually the first one chronologically) was to provide some way for a user to find questions that had remained unanswered in their email for more than three days. This one was the toughest one to solve according to Mark, and the solution found by the winning team involved some gmail and greasemonkey tricks that made certain assumptions about the Inbox and the Sent mailbox. Generic solutions for parsing mailboxes are not easy to get by; after all, it is not for naught that [WWW] Zawinski's Law states that "Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."

We ended the meeting with an impromptu presentation by Titus Brown, who gave us an update on his progress in writing unit tests for [WWW] twill, his Web app testing tool. Titus handed out some printouts with some nifty code he put together for testing twill itself by starting a server process that he runs twill scripts against. The server process can be anything, in his case it's a Quixote application that enables him to test form submission and other features of twill. Speaking of submission, Titus also showed us the aptly-named 'collar' application (get it? submission...collar...it's funny, laugh), a Quixote-based Web app he created for submitting and reviewing conference papers. He's currently creating a suite of twill tests that will serve as a regression test harness and will make the application more solid across changes in Quixote, changes in cucumber2 (which is an ORM layer that Titus wrote for talking Pythonically to PostgreSQL databases), etc.

The meeting ran pretty long, we were there until 10 PM or so, but it was well worth it. Many thanks to Daniel, Mark and Titus for presenting and to Daniel again for hosting the meeting. We won't have a regular meeting in December, because many people, including Daniel, will take time off, but Titus proposed we meet for dinner one evening, maybe in Pasadena. This sounds like a great idea and we'll discuss details etc. on the mailing list.

Here is the tentative agenda for the next regular meeting, in January 2006:

  • "metakit overview" - Howard Golden

  • "Testing a Python Web app with Selenium" - Grig Gheorghiu