Skip to content

July 2008 Japan Trip, Day 1

23-Jul-08

My favorite place in Tokyo is the Hachiko exit around Shibuya and its surroundings, and it was the first place my friends and I visited once we arrived in Japan and dropped off our things at the hotel. It was a hot muggy afternoon but it was still as crowded as ever, especially the famous scramble crossing:

Shibuya scramble crossing

I spent some time walking around the shopping area and seeing what changed, which wasn't much:

Shibuya streets

And hey, the Condomania store was still there:

Condomania

After meeting up in Shibuya with all the SJEC people we all went to eat at 録 ~Roku~:

Dinner at 録 ~Roku~

and then after that we had a blast at karaoke:

Karaoke

And no blog post about Japan would be complete without some Engrish. Keen-eyed readers would notice a mistake in the mural on the wall above. Here's a close-up:

Mermaid Princess, Peal legend

and here's some of the text on the other walls:

At the bottom of the quiet sea, Beautiful mermaids' fighting starts now

The precious treasure which sleeps to deep sea

Parallelizing FLAC encoding

05-May-08

One thing I noticed ever since getting a multi-core system was that the reference FLAC encoder is not multi-threaded. This isn't a huge problem for most people as you can simply encode multiple files at the same time but I usually rip my audio CDs into a single audio file with a cue sheet instead of separate track files and so I am usually encoding a single large audio file instead of multiple smaller ones. Even so, encoding a CD-length audio file takes under a minute but I thought it would be a fun and useful weekend project to see if I could parallelize the simpler example encoder. The format specification indicates that input blocks are encoded independently which makes the problem embarassingly parallel and trawling through the FLAC mailing lists reveals that no one has had the time nor the inclination to look into it.

However, I was able to write a multithreaded FLAC encoder that achieves near-linear speedup with only minor hacks to the libFLAC API. Here are some encode times on an 8-core 2.8 GHz Xeon 5400 for a 636 MB wave file (some caveats are discussed below):

baseline34.906s
1 threads31.424s
2 threads16.936s
4 threads10.173s
8 threads6.808s

I took the simple approach of sharding the input file into n roughly equal pieces and passing them to n encoder threads, assembling the output file from the n output buffers. In general this is not a good way of partitioning the workload as time is wasted if one shard takes significantly more time to process but for my use case this isn't a significant problem.

The best way to share the input file among the encoding threads is to map it into memory. In fact, memory-mapped file I/O has so many advantages in general that I'm surprised at how little I see it used, although it does have the disadvantage of requiring a bit more bookkeeping. Here is how I use it in my multithreaded encoder (slightly paraphrased):

#include <fcntl.h> /* open() */
#include <sys/mman.h> /* mmap()/munmap() */
#include <sys/stat.h> /* stat() */
#include <unistd.h> /* close() */

int main(int argc, char *argv[]) {
  int fdin;
  struct stat buf;
  char *bufin;

  fdin = open(argv[1], O_RDONLY);
  fstat(fdin, &buf);
  bufin = mmap(NULL, buf.st_size, PROT_READ, MAP_SHARED, fdin, 0);

  /* The input file (passed in via argv[1]) is now mapped read-only to
     the memory region in bufin up to bufin + buf.st_size. */

  /* Note that you can work directly with the mapped input file
     instead of fread()ing the header into a buffer. */
  if((buf.st_size < WAV_HEADER_SIZE) ||
     memcmp(bufin, "RIFF", 4) ||
     memcmp(bufin+8, "WAVEfmt \020\000\000\000\001\000\002\000", 16) ||
     memcmp(bufin+32, "\004\000\020\000data", 8)) {
    /* Invalid input file: print error and exit. */
  }

  for (i = 0; i < num_threads; ++i) {
    shard_infos[i].bufin = bufin + WAV_HEADER_SIZE + i * bytes_per_thread;
    /* bufsize for the last thread may be slightly larger. */
    shard_infos[i].bufsize = bytes_per_thread;
  }

  /* Spawn encode threads (which calls encode_shard() below) passing
     an element of shard_infos to each. */

  ...

  munmap(bufin, buf.st_size);
  close(fdin);
}

FLAC__bool encode_shard(struct shard_info *shard_info) {
  FLAC__StreamEncoder *encoder = FLAC__stream_encoder_new();

  ...

  /* The input file is paged in lazily as this function accesses
     bufin from shard_info->bufin. */
  FLAC__stream_encoder_process_interleaved(encoder,
                                           shard_info->bufin,
                                           shard_info->bufsize);

  ...

  FLAC__stream_encoder_delete(encoder);
}

However, handling the output file is a bit trickier. Since the encoded FLAC data output by the threads vary in size we have to wait until all encoding threads are done before we know the right offsets to write the output data. A convenient and fast way to handle this is to use asynchronous I/O; we know where to write the output data for a shard as soon as the encoding thread for all previous shards finish so we simply wait for the encoding threads in shard order and queue up a write request after each thread finishes. Here I use the POSIX asynchronous I/O API in my multithreaded encoder (again, slightly paraphrased):

#include <aio.h> /* aio_*() */
#include <pthread.h> /* pthread_*() */
#include <string.h> /* memset() */

int main(int argc, char *argv[]) {
  int fdout;
  pthread_t threads[MAX_THREADS];
  struct aiocb aiocbs[MAX_THREADS];
  unsigned long byte_offset = 0;

  /* mmap input file in. */

  ...

  fdout = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC);

  /* Spawn encode threads passing an element of shard_infos to
     each. */

  ...

  /* Wait for each thread in sequence and queue up output writes. */

  /* We need to zero out any aiocb struct that we use before we fill
     in any members. */
  memset(aiocbs, 0, num_threads * sizeof(*aiocbs));
  for (i = 0; i < num_threads; ++i) {
    pthread_join(threads[i], NULL);
    aiocbs[i].aio_buf = shard_infos[i].bufout;
    aiocbs[i].aio_nbytes = shards_infos[i].bytes_written;
    aiocbs[i].aio_offset = byte_offset;
    aiocbs[i].aio_fildes = fdout;
    aio_write(&aiocbs[i]);
    byte_offset += shard_infos[i].bytes_written;
  }

  /* Wait for all output writes to finish. */

  for (i = 0; i < num_threads; ++i) {
    const struct aiocb *aiocbp = &aiocbs[i];
    aio_suspend(&aiocbp, 1, NULL);
    aio_return(&aiocbs[i]);
  }

  close(fdout);
}

The POSIX API is a bit unwieldy for this use case; ideally, there would be a version of aio_suspend() that would suspend the calling process until all of the specified requests have completed. As it is now the simplest way is to loop through the requests as above, especially since the maximum number of simultaneous asynchronous I/O requests is usually quite small (16 on my system).

Also, I found that the OS X implementation of aio_write() did not obey this part of the specified behavior:

If O_APPEND is set for aiocbp->aio_fildes, aio_write() operations append to the file in the same order as the calls were made. If O_APPEND is not set for the file descriptor, the write operation will occur at the abso- lute position from the beginning of the file plus aiocbp->aio_offset.

but it was just as easy (and clearer) to explicitly set the correct offset.

I had to hack up libFLAC a bit to implement my multithreaded encoder. I exposed the update_metadata_() to make it easy to write the correct number of total samples in the metadata field and also to zero out the min/max framesize fields. I also exposed the FLAC__stream_encoder_set_do_md5() function (which it should have been in the first place) so that I can turn off the writing of md5 field in the metadata. Finally, I added the function FLAC__stream_encoder_set_current_frame_number() so that the correct frame numbers are written at encode time.

For comparison purposes I turn off md5 calculation in my multithreaded encoder as well as the baseline one. Since calling FLAC__stream_encoder_set_current_frame_number() causes crashes with vericiation turned on I also turn that off. The numbers above reflect that so they're underestimates of how a production multithreaded encoder would perform. However, the essential behavior of the program shouldn't change much.

Here is a patch file for the flac 1.2.1 source that implements the hacks I described above. Here is the source for my multithreaded FLAC encoder. I've tested it with i686-apple-darwin9-gcc-4.0.1 and i686-apple-darwin9-gcc-4.2.1 on Mac OS X. I got the above numbers compiling mt_encode.c with gcc 4.2.1 and the switches -Wall -Werror -g -O2 -ansi.

bfpp - embed brainfuck in C++

23-Apr-08

Okay, I lied; you can't really embed brainfuck in C++ but you can get pretty close. Here is an example:

#include "bfpp.h"

int main() {
  // Prints out factorial numbers in sequence.  Adapted from
  // http://www.hevanet.com/cristofd/brainfuck/factorial.b .
  bfpp
    * + + + + + + + + + + * * * + * + -- * * * + -- - -- & & & & & -- +
    & & & & & ++ * * -- -- - ++ * -- & & + * + * - ++ & -- * + & - ++ &
    -- * + & - -- * + & - -- * + & - -- * + & - -- * + & - -- * + & - --
    * + & - -- * + & - -- * + & - -- * -- - ++ * * * * + * + & & & & & &
    - -- * + & - ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ * -- & + * - ++ + * *
    * * * ++ & & & & & -- & & & & & ++ * * * * * * * -- * * * * * ++ + +
    -- - & & & & & ++ * * * * * * - ++ + * * * * * ++ & -- * + + & - ++
    & & & & -- & -- * + & - ++ & & & & ++ * * -- - * -- - ++ + + + + + +
    -- & + + + + + + + + * - ++ * * * * ++ & & & & & -- & -- * + * + & &
    - ++ * ! & & & & & ++ * ! * * * * ++ 
  end_bfpp
}

I call this variant "bfpp" as it has some pretty significant differences from brainfuck. First of all, some commands had to be adapted; although + and - remain the same,

  • < and > were changed to & and *,
  • . and , were changed to ! and ~ (mnemonic: ! contains . within it and ~ is kind of like a sideways ,),
  • and [ and ] were changed to -- and ++ (mnemonic: [ and ] are the most complex brainfuck commands [to implement, at least] and so deserve to be mapped to the wider and more prominent operators).

This magic is made possible by the fact that brainfuck has exactly eight commands and C++ has exactly eight overloadable symbolic unary operators. Add some macros to hide the C++ scaffolding behind some delimiters and you have a convincing illusion of an embedded language.

bfpp.h implements a simple (<100 lines) bfpp interpreter and the magic described above, and bf2bfpp.c is a straightforward translator from brainfuck to bfpp. Gotta love C++!

A musical week

03-Apr-08

Just today I got the chance to see the violinist Anne-Sophie Mutter play the three Brahms violin sonatas. Her playing (as well as her accompanist Lamber Orkis') was impeccable, although I found only the third sonata to be really interesting. She also came out to play four (!) encore pieces, also by Brahms. After dazzling performances of three of the Hungarian Dances (the first two were Nos. 7 and 1 but I did not catch the third one) she played the Weigenlied before bidding good night.

Also this past weekend I and a group of fellow pianists played at a party hosted by their former piano teacher. I had the rare chance to hear the formidable solo version of Gershwin's Rhapsody in Blue performed (first part here and second part here) as well as one of Ginastera's more accessible works, the second of the Tres Danzas Argentinas (1:40 in this clip). I with my duet partner played the Waltz and Romance from Rachmaninoff's second two-piano suite and his Rhapsody on a Theme of Paganini, arranged by Rachmaninoff himself for two pianos (first part here, second part here, and third part here, with the famous 18th variation being 5:24 in the second part).

I also finally started looking at new pieces again. I ordered some of Fazil Say's published sheet music (these pieces, among others) and I am also looking at Scriabin's third sonata (first movement here, second movement here, third movement here, and fourth movement here) which has always been a favorite of mine. It is shaping up to be a fine challenge; one particularly obnoxious section has an ossia (a simplified version of a passage) with the footnote:

This passage, difficult to perform at a rapid tempo, was played differently by Scriabin himself[.]

Hopefully this marks an end to the musical slump I've been in; it's been too long since I've prepared new pieces.

Mac Pro woes

27-Jan-08

I woke today to find my computer restarted instead of waking from sleep. Then 10 minutes later it froze with some graphics corruption. Looks like I’ve been hit with the problems described in this thread and on this page. I will try an SMC reset and PRAM reset but if it doesn’t work I might have to exchange this for another one.

On a brighter note this tip solved my gripe with Parallels’ coherence mode.

Mac Pro first impressions

27-Jan-08

I was long overdue for a computer upgrade so I decided to take the plunge and get the newly-upgraded Mac Pro. Here are my thoughts in no particular order.

First, this thing is seriously heavy, weighing 43.4 lbs when removed from the box. It is much quieter than my old Athlon, though.

I was waiting to buy this because I was expecting the 8-core (or “arachnocore” if you prefer) models to go from stupidly expensive ($4000) to moderately expensive ($2800). I bought the stock configuration; not buying RAM or extra hard drives from Apple was a no-brainer, and although I wanted to get the 8800 GT instead of the default 2600 XT it would have pushed the shipment date out over a month. I can always upgrade later which might cost a little more, depending on how much I can sell the 2600 XT for.

Babbage:~ akalin$ for i in `seq 1 8`; do (yes $i >/dev/null &); done
Babbage:~ akalin$ top -o cpu
  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
53600 yes         99.1%  0:25.62   1    13     18  208K   188K   416K    18M
53612 yes         99.0%  0:25.58   1    13     18  208K   188K   416K    18M
53604 yes         98.6%  0:25.27   1    13     18  208K   188K   416K    18M
53598 yes         98.0%  0:25.74   1    13     18  208K   188K   416K    18M
53608 yes         95.6%  0:25.30   1    13     18  208K   188K   416K    18M
53602 yes         94.6%  0:25.84   1    13     18  208K   188K   416K    18M
53606 yes         94.5%  0:25.49   1    13     18  208K   188K   416K    18M
53610 yes         93.1%  0:25.77   1    13     18  208K   188K   416K    18M

Yeap, it’s 8 cores all right. I’m a bit disappointed that the overall CPU usage in “top” is properly normalized to [0%, 100%] instead of the more impressive [0%, 800%], though.

It is nice to be running a *nix as a primary OS again. However, one thing Windows does well is support weird display configurations, like my two rotatable widescreen monitors. I’m still using one of the monitors for my old computer but hopefully OS X will be able to switch monitor orientations (preferably with a hotkey) like my old setup. I’ll find out soon.

Setting up Boot Camp was a breeze but I had a ton of problems with Parallels. First I tried using it with my Boot Camp partition but it never detected my keyboard and mouse no matter how long I waited. Worse yet, it messed up the driver configuration enough so that even with booting back into Windows from Boot Camp I was unable to use my keyboard and mouse. After I started having weird freezes when switching back and forth between OS X and Windows I decided to scrap the Windows partition and start over. After installing Windows a second time I used Disk Utility to image the partition before I tried Parallels on it again. After confirming that it still didn’t work, I just restore the Boot Camp partition from the backup and had Parallels work on its own separate image. It’s slightly more inconvenient but it’s probably a more stable setup.

Another problem with Parallels is that it refuses to start up after a reboot, complaining about not being able to communicate with services. Reinstalling Parallels solves the problem but that’s annoying to do on every reboot. I’ll have to investigate further.

Parallels performs really well even with only 512 MB of RAM allocated to it on business apps and non-3d games. However, its 3d support is still lacking; Mythos suffers from show-stopping graphics corruption and crashes, and even when it runs I get only about 10 fps. However, Mythos performs beautifully with Boot Camp, getting ~40 fps with everything but shadows turned all the way up. This makes me curious as to how well WoW performs, but that’s tempting fate.

Booting into Boot Camp to play games is a bit of a pain but it’s mitigated by being able to hibernate in Windows. Curiously, OS X is lacking in this department; you can use a widget called “Deep Sleep” to put it into the equivalent “Safe Sleep”, but starting the computer again does not let you switch partitions as far as I can tell, so it’s useless for switching quickly between operating systems. I’ll have to investigate further into this, too.

Some miscellaneous gripes:

  • Disk Utility has this really weird bug where it would only rarely let me drag the disk to restore from an image to the “Destination” field. I just ended up using the equivalent command-line utility “asr”.
  • Not having an eject button on optical drives is annoying, especially when in Windows.
  • Safari does not close tabs on middle-click.
  • Safari is fickle when deciding when to let you use cmd-shift-left/right to move between tabs.
  • Coherence mode in Parallels leaves the entire taskbar visible. Perhaps this has something to do with me using the classic theme/Start menu.
  • I wish there was a way to turn off the “Caps Lock protection” in the new thin Apple keyboards as I bind it to Ctrl. I’m not impressed with its feel anyway, so I’ll probably end up buying another keyboard. I’m seriously considering getting the Happy Hacking Keyboard Professional 2, but it’s stupidly expensive.
  • I find the Mighty Mouse quite uncomfortable to use due to its shape. It is also quite finicky and sometimes the pointer goes off wildly to a corner of the screen. I’ll probably stick with an IntelliMouse Explorer.

Overall I’m quite satisfied with this computer. I look forward to learning the ins and outs of OS X which I was unable to really do when interacting with it only on laptops.

Backwards day at akalin.cx

27-Jan-08

If any of you have glanced at my blog lately or (more likely) I’ve complained about it to you, you’d know that blog entries have been appearing in reverse order for a while. From other people with the same problem it looks like this MySQL bug was the culprit, and indeed this post from my host confirms that MySQL was upgraded recently. 5.0.52 is out only for enterprise customers now but this only affects older versions of WordPress so I just gritted my teeth and went through the pain of upgrading to the latest version of WordPress.

We now return to your regularly scheduled programming.

Finding the longest palindromic substring in linear time

28-Nov-07

Another interesting problem I stumbled across on reddit is finding the longest substring of a given string that is a palindrome. I found the explanation on Johan Jeuring's blog somewhat confusing and I had to spend some time poring over the Haskell code (eventually rewriting it in Python) and walking through examples before it "clicked." I haven't found any other explanations of the same approach so hopefully my explanation below will help the next person who is curious about this problem.

Of course, the most naive solution would be to exhaustively examine all n (n + 1) / 2 substrings of the given n-length string, test each one if it's a palindrome, and keep track of the longest one seen so far. This has worst-case complexity O(n3), but we can easily do better by realizing that a palindrome is centered on either a letter (for odd-length palindromes) or a space between letters (for even-length palindromes). Therefore we can examine all 2n + 1 possible centers and find the longest palindrome for that center, keeping track of the overall longest palindrome. This has worst-case complexity O(n2).

It is not immediately clear that we can do better but if we're told that an O(n) algorithm exists we can infer that the algorithm is most likely structured as an iteration through all possible centers. As an off-the-cuff first attempt, we can adapt the above algorithm by keeping track of the current center and expanding until we find the longest palindrome around that center, in which case we then consider the last letter (or space) of that palindrome as the new center. The algorithm (which isn't correct) looks like this informally:

1. set the current center to the first letter
2. loop while the current center is valid
  a. expand to the left and right simultaneously until we find the largest
     palindrome around this center
  b. if the current palindrome is bigger than the stored maximum one, store
     the current one as the maximum one
  c. set the space following the current palindrome as the current center
     unless the two letters immediately surrounding it are different, in
     which case set the last letter of the current palindrome as the current
     center
3. return the stored maximum palindrome

This seems to work but it doesn't handle all cases: consider the string "abababa". The first non-trivial palindrome we see is "a|bababa", followed by "aba|baba". Considering the current space as the center doesn't get us anywhere but considering the preceding letter (the second 'a') as the center, we can expand to get "ababa|ba". From this state, considering the current space again doesn't get us anywhere but considering the preceding letter as the center, we can expand to get "abababa|". However, this is incorrect as the longest palindrome is actually the entire string! We can remedy this case by changing the algorithm to try and set the new center to be one before the end of the last palindrome, but it is clear that having a fixed "lookbehind" doesn't solve the general case and anything more than that will probably bump us back up to quadratic time.

The key question is this: given the state from the example above, "ababa|ba", what makes the second 'b' so special that it should be the new center? To use another example, in "abcbabcba|bcba", what makes the second 'c' so special that it should be the new center? Hopefully, the answer to this question will lead to the answer to the more important question: once we stop expanding the palindrome around the current center, how do we pick the next center? To answer the first question, first notice that the current palindromes in the above examples themselves contain smaller non-trivial palindromes: "ababa" contains "aba" and "abcbabcba" contains "abcba" which also contains "bcb". Then, notice that if we expand around the "special" letters, we get a palindrome which shares a right edge with the current palindrome; that is, the longest palindrome around the special letters are proper suffixes of the current palindrome. With a little thought, we can then answer the second question: to pick the next center, take the center of the longest palindromic proper suffix of the current palindrome. Our algorithm then looks like this:

1. set the current center to the first letter
2. loop while the current center is valid
  a. expand to the left and right simultaneously until we find the largest
     palindrome around this center
  b. if the current palindrome is bigger than the stored maximum one, store
     the current one as the maximum one
  c. find the maximal palindromic proper suffix of the current palindrome
  d. set the center of the suffix from c as the current center and start
     expanding from the suffix as it is palindromic
3. return the stored maximum palindrome

However, unless step 2c can be done efficiently, it will cause the algorithm to be superlinear. Doing step 2c efficiently seems impossible since we have to examine the entire current palindrome to find the longest palindromic suffix unless we somehow keep track of extra state as we progress through the input string. Notice that the longest palindromic suffix would by definition also be a palindrome of the input string so it might suffice to keep track of every palindrome that we see as we move through the string and hopefully, by the time we finish expanding around a given center, we would know where all the palindromes with centers lying to the left of the current one are. However, if the longest palindromic suffix has a center to the right of the current center, we would not know about it. But we also have at our disposal the very useful fact that a palindromic proper suffix of a palindrome has a corresponding dual palindromic proper prefix. For example, in one of our examples above, "abcbabcba", notice that "abcba" appears twice: once as a prefix and once as a suffix. Therefore, while we wouldn't know about all the palindromic suffixes of our current palindrome, we would know about either it or its dual.

Another crucial realization is the fact that we don't have to keep track of all the palindromes we've seen. To use the example "abcbabcba" again, we don't really care about "bcb" that much, since it's already contained in the palindrome "abcba". In fact, we only really care about keeping track of the longest palindromes for a given center or equivalently, the length of the longest palindrome for a given center. But this is simply a more general version of our original problem, which is to find the longest palindrome around any center! Thus, if we can keep track of this state efficiently, maybe by taking advantage of the properties of palindromes, we don't have to keep track of the maximal palindrome and can instead figure it out at the very end.

Unfortunately, we seem to be back where we started; the second naive algorithm that we have is simply to loop through all possible centers and for each one find the longest palindrome around that center. But our discussion has led us to a different incremental formulation: given a current center, the longest palindrome around that center, and a list of the lengths of the longest palindromes around the centers to the left of the current center, can we figure out the new center to consider and extend the list of longest palindrome lengths up to that center efficiently? For example, if we have the state:

<"ababa|??", [0, 1, 0, 3, 0, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]>

where the highlighted letter is the current center, the vertical line is our current position, the question marks represent unread characters or unknown quantities, and the array represents the list of longest palindrome lengths by center, can we get to the state:

<"ababa|??", [0, 1, 0, 3, 0, 5, 0, ?, ?, ?, ?, ?, ?, ?, ?]>

and then to:

<"abababa|", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]>

efficiently? The crucial thing to notice is that the longest palindrome lengths array (we'll call it simply the lengths array) in the final state is palindromic since the original string is palindromic. In fact, the lengths array obeys a more general property: the longest palindrome d places to the right of the current center (the d-right palindrome) is at least as long as the longest palindrome d places to the left of the current center (the d-left palindrome) if the d-left palindrome is completely contained in the longest palindrome around the current center (the center palindrome), and it is of equal length if the d-left palindrome is not a prefix of the center palindrome or if the center palindrome is a suffix of the entire string. This then implies that we can more or less fill in the values to the right of the current center from the values to the left of the current center. For example, from [0, 1, 0, 3, 0, 5, ?, ?, ?, ?, ?, ?, ?, ?, ?] we can get to [0, 1, 0, 3, 0, 5, 0, ≥3?, 0, ≥1?, 0, ?, ?, ?, ?]. This also implies that the first unknown entry (in this case, ≥3?) should be the new center because it means that the center palindrome is not a suffix of the input string (i.e., we're not done) and that the d-left palindrome is a prefix of the center palindrome.

From these observations we can construct our final algorithm which returns the lengths array, and from which it is easy to find the longest palindromic substring:

1. initialize the lengths array to the number of possible centers
2. set the current center to the first center
3. loop while the current center is valid
  a. expand to the left and right simultaneously until we find the largest
     palindrome around this center
  b. fill in the appropriate entry in the longest palindrome lengths array
  c. iterate through the longest palindrome lengths array backwards and fill
     in the corresponding values to the right of the entry for the current
     center until an unknown value (as described above) is encountered
  d. set the new center to the index of this unknown value
4. return the lengths array

Note that at each step of the algorithm we're either incrementing our current position in the input string or filling in an entry in the lengths array. Since the lengths array has size linear in the size of the input array, the algorithm has worst-case linear running time. Since given the lengths array we can find and return the longest palindromic substring in linear time, a linear-time algorithm to find the longest palindromic substring is the composition of these two operations.

Here is Python code that implements the above algorithm (although it is closer to Johan Jeuring's Haskell implementation than to the above description):

def fastLongestPalindromes(seq):
    """
    Behaves identically to naiveLongestPalindrome (see below), but runs in linear time.
    """
    seqLen = len(seq)
    l = []
    i = 0
    palLen = 0
    # Loop invariant: seq[(i - palLen):i] is a palindrome.
    # Loop invariant: len(l) >= 2 * i - palLen. The code path that 
    # increments palLen skips the l-filling inner-loop.
    # Loop invariant: len(l) < 2 * i + 1. Any code path that increments
    # i past seqLen - 1 exits the loop early and so skips the
    # l-filling inner loop.
    while i < seqLen:
        # First, see if we can extend the current palindrome.  Note that
        # the center of the palindrome remains fixed.
        if i > palLen and seq[i - palLen - 1] == seq[i]:
            palLen += 2
            i += 1
            continue
 
        # The current palindrome is as large as it gets, so we append it.
        l.append(palLen)
 
        # Now to make further progress, we look for a smaller palindrome
        # sharing the right edge with the current palindrome.  If we find
        # one, we can try to expand it and see where that takes us.
        # At the same time, we can fill the values for l that we neglected
        # during the loop above. We make use of our knowledge of the length
        # of the previous palindrome (palLen) and the fact that the values
        # of l for positions on the right half of the palindrome are
        # closely related to the values of the corresponding positions on
        # the left half of the palindrome.
 
        # Traverse backwards starting from the second-to-last index up
        # to the edge of the last palindrome.
        s = len(l) - 2
        e = s - palLen
        for j in xrange(s, e, -1):
            # d is the value l[j] must have in order for the palindrome
            # centered there to share the left edge with the last palindrome.
            # (Drawing it out is helpful to understanding why the - 1
            # is there.)
            d = j - e - 1
 
            # We check to see if the palindrome at l[j] shares a left edge
            # with the last palindrome.  If so, the corresponding palindrome
            # on the right half must share the right edge with the last
            # palindrome, and so we have a new value for palLen.
            if l[j] == d: # *
                palLen = d
                # We actually want to go to the beginning of the outer
                # loop, but Python doesn't have loop labels.  Instead,
                # we use an else block corresponding to the inner loop,
                # which gets executed only when the for loop exits normally
                # (i.e., not via break).
                break
 
            # Otherwise, we just copy the value over to the right side.
            # We have to bound l[i] because palindromes on the left side
            # could extend past the left edge of the last palindrome,
            # whereas their counterparts won't extend past the right edge.
            l.append(min(d, l[j]))
        else:
            # This code is executed in two cases: when the for loop
            # isn't taken at all (palLen == 0) or the inner loop was
            # unable to find a palindrome sharing the left edge with the
            # last palindrome.  In either case, we're free to consider
            # the palindrome centered at seq[i].
            palLen = 1
            i += 1
 
    # We know from the loop invariant that len(l) < 2 * seqLen + 1, so we
    # must fill in the remaining values of l.
 
    # Obviously, the last palindrome we're looking at can't grow any more.
    l.append(palLen)
 
    # Traverse backwards starting from the second-to-last index up until
    # we get l to size 2 * seqLen + 1. We can deduce from the loop invariants
    # we have enough elements.
    lLen = len(l)
    s = lLen - 2
    e = s - (2 * seqLen + 1 - lLen)
    for i in xrange(s, e, -1):
        # The d here uses the same formula as the d in the inner loop above.
        # (Computes distance to left edge of the last palindrome.)
        d = i - e - 1
        # We bound l[i] with min for the same reason as in the inner loop
        # above.
        l.append(min(d, l[i]))
 
    return l

(* An exercise for the reader: in this place in the code you might think that you can replace the == with >= to improve performance. This does not change the correctness of the algorithm but it does hurt performance, contrary to expectations. Why?)

And here is a naive quadratic version for comparison:

def naiveLongestPalindromes(seq):
    """
    Given a sequence seq, returns a list l such that l[2 * i + 1] holds the
    length of the longest palindrome centered at seq[i] (which must be odd),
    l[2 * i] holds the length of the longest palindrome centered between
    seq[i - 1] and seq[i] (which must be even), and l[2 * len(seq)] holds
    the length of the longest palindrome centered past the last element of
    seq (which must be 0, as is l[0]).
 
    The actual palindrome for l[i] is seq[s:(s + l[i])] where s is
    i // 2 - l[i] // 2. (// is integer division.)
 
    Example:
    naiveLongestPalindrome('ababa') -> [0, 1, 0, 3, 0, 5, 0, 3, 0, 1]
 
    Runs in quadratic time.
    """
    seqLen = len(seq)
    lLen = 2 * seqLen + 1
    l = []
 
    for i in xrange(lLen):
        # If i is even (i.e., we're on a space), this will produce
        # e == s.  Otherwise, we're on an element and e == s + 1, as
        # a single letter is trivially a palindrome.
        s = i / 2
        e = s + i % 2
 
        # Loop invariant: seq[s:e] is a palindrome.
        while s > 0 and e < seqLen and seq[s - 1] == seq[e]:
            s -= 1
            e += 1
 
        l.append(e - s)
 
    return l

Note that this is not the only efficient solution to this problem; building a suffix tree is linear in the length of the input string and you can use one to solve this problem but as Johan also mentions, that is a much less direct and efficient solution compared to this one.

At long last…

21-Nov-07

The project I was working on at Amazon finally launched!

It’s drawn a lot of criticism since it was first leaked more than a year ago; I wonder if Amazon’s take on the e-book reader is bold enough to overcome the current limitations of e-ink and its high price tag. Hopefully this will kick-start an industry that has so far languished in obscurity.

A foray into number theory with Haskell

06-Jul-07

I encountered an interesting problem on reddit a few days ago which can be paraphrased as follows:

Find a perfect square s such that 1597 s + 1 is also perfect square.

After reading the discussion about implementing a brute-force algorithm to solve the problem and spending a futile half-hour or so trying my hand at find a better way, someone noticed that the problem was an instance of Pell's equation which is known to have an elegant and fast solution; indeed, he posted a one-liner in Mathematica solving the given problem. However, I wanted to try coding up the solution myself as the Mathematica solution, while succinct, isn't very enlightening since the heavy lifting is already done by a built-in function and an arbitrary constant was used for this particular instance of Pell's equation.

Pell's equation is simply the Diophantine equation x2d y2 = 1 for a given d; being Diophantine means that all variables involved take on only integer values. (In our original problem, d is 1597 and we are asked for y2.) The solution involves finding the continued fraction expansion of √d, finding the first convergent of the expansion that satisfies Pell's equation, and then generating all other solutions from that fundamental solution. We rule out the trivial solution x = 1, y = 0 which also implies that if d is a perfect square then there is no solution. As a rule we'll avoid considering trivial cases and re-stating obvious assumptions (like d having to be a positive integer).

A continued fraction is an expression of the form:

x = a_0 + \cfrac{1}{a_i +
\cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{\ddots\,}}}}

where all ai are integers and all but the first one are positive. The standard math notation for continued fractions is quite unwieldy so from now on we'll use <a0; a1, a2, ...> instead of the above.

The theory of continued fractions is a rich and beautiful one but for now we'll just state a few facts:

  • The continued fraction expansion of a number is (mostly) unique.
  • The continued fraction expansion of a rational number is finite.
  • The continued fraction expansion of a irrational number is infinite.
  • A quadratic surd is a number of the form \frac{a + \sqrt{b}}{c} where a, b, and c are integers. Except maybe for the first term, the continued fraction expansion of a quadratic surd is periodic; that is, it repeats forever after a certain number of terms. This applies in particular to the square root of an integer.
  • Truncating an infinite continued fraction to get a finite continued fraction gives (in some sense) an optimal rational approximation to the irrational number represented by the infinite continued fraction.

Given a quadratic surd it is fairly easy to manipulate it into the form a + \frac{1}{q} where q is another quadratic surd. This fact can be used to come up with an algorithm to find the continued fraction expansion of a square root. Wikipedia explains it pretty well so I won't go over it, but here is my Haskell implementation:

sqrt_continued_fraction n = [ a_i | (_, _, a_i) <- mdas ]
    where
      mdas = iterate get_next_triplet (m_0, d_0, a_0)
 
      m_0 = 0
      d_0 = 1
      a_0 = truncate $ sqrt $ fromIntegral n
 
      get_next_triplet (m_i, d_i, a_i) = (m_j, d_j, a_j)
          where
            m_j = d_i * a_i - m_i
            d_j = (n - m_j * m_j) `div` d_i
            a_j = (a_0 + m_j) `div` d_j

and here are some examples:

Prelude Main> take 20 $ sqrt_continued_fraction 2
[1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]

Prelude Main> take 20 $ sqrt_continued_fraction 103
[10,6,1,2,1,1,9,1,1,2,1,6,20,6,1,2,1,1,9,1]

Prelude Main> take 20 $ sqrt_continued_fraction 36
[6,*** Exception: divide by zero

(Note that we're assuming that we won't be called with a perfect square. Also, do you notice anything interesting about the periodic portion of the continued fractions, particularly of √103?)

For those who are unfamiliar with Haskell, here's a quick list of key facts:

  • The first line takes a list of triplets and forms a list of all third elements, which is what we're interested in. (The other two elements of the triplet are auxiliary variables used by the algorithm.)
  • iterate is a function which takes in another function f, an initial variable x, and returns the infinite list [ x, f(x), f(f(x)), f(f(f(x))), ... ].
  • Note that Haskell uses lazy evaluation and so this function does not take an infinite amount of time to run; all its elements are evaluated (and memoized) only when needed.
  • The rest of the function is a straightforward representation of the meat of the algorithm described in the above Wikipedia entry.

It may not be clear what √d and its continued fraction expansion has to do with solving Pell's equation. However, notice that if x and y solve Pell's equation then manipulating Pell's equation to get √d on one side reveals that xy is a good approximation of √n. In fact, it is so good that you can prove that xy must come from truncating the continued fraction expansion of √d.

This leads us to the following: if you have an infinite continued fraction <a0; a1, a2, ...> you can truncate it into a finite continued fraction <a0; a1, a2, ..., ai> and simplify it into the rational number piqi. The sequence p0q0, p1q1, p2q2, ... forms the convergents of <a0; a1, a2, ...> and converges to its represented irrational number.

It turns out you can calculate pi + 1 and qi + 1 efficiently from pi, qi, pi - 1, qi - 1, and ai + 1 using the fundamental recurrence formulas (which can be proved by induction). Here is my Haskell implementation:

get_convergents (a_0 : a_1 : as) = pqs
    where
      pqs = (p_0, q_0) : (p_1, q_1) :
            zipWith3 get_next_convergent pqs (tail pqs) as
 
      p_0 = a_0
      q_0 = 1
 
      p_1 = a_1 * a_0 + 1
      q_1 = a_1
 
      get_next_convergent (p_i, q_i) (p_j, q_j) a_k = (p_k, q_k)
          where
            p_k = a_k * p_j + p_i
            q_k = a_k * q_j + q_i

and some more examples:

Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 2
[(1,1),(3,2),(7,5),(17,12),(41,29),(99,70),(239,169),(577,408)]

Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 103
[(10,1),(61,6),(71,7),(203,20),(274,27),(477,47),(4567,450),(5044,497)]

Prelude Main> take 8 $ get_convergents $ sqrt_continued_fraction 1597
[(39,1),(40,1),(1039,26),(1079,27),(2118,53),(3197,80),(27694,693),(113973,2852)]

Prelude Main> let divFrac (x, y) = (fromInteger x) / (fromInteger y)

Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 2
[1.0,1.5,1.4,1.4166666666666667,1.4137931034482758,1.4142857142857144,1.4142011834319526,1.4142156862745099]

Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 103
[10.0,10.166666666666666,10.142857142857142,10.15,10.148148148148149,10.148936170212766,10.148888888888889,10.148893360160965]

Prelude Main> take 8 $ map divFrac $ get_convergents $ sqrt_continued_fraction 1597
[39.0,40.0,39.96153846153846,39.96296296296296,39.9622641509434,39.9625,39.96248196248196,39.9624824684432]

Here are a few more quick facts to help those unfamiliar with Haskell:

  • The expression a : as forms a new list from the element a and the existing list as (equivalent to cons in Lisp).
  • zipWith3 is a function that takes in a function f, three lists a, b, and c of the same (possibly infinite) length n, and forms the new list [ f(a0, b0, c0), f(a1, b1, c1), ..., f(an, bn, cn) ].
  • Note that the result of zipWith3 is part of the variable pqs which itself appears (twice!) in the arguments to zipWith3. This is a Haskell idiom and reflects the fact that the recurrence formulas define a convergent in terms of its two previous convergents. A simpler example (using the Fibonacci sequence) can be found in the Wikipedia entry for lazy evaluation.
  • Haskell has built-in data types for integers of arbitrary size which is necessary as the numerators and denominators of the convergents get large quickly. In fact, Haskell has built-in data types for rational numbers (represented as fractions) but it doesn't help us much here.

Since we are guaranteed that some convergent eventually satisfies Pell's equation, we can write a simple function to generate all convergents, test each one to see if it satisfies Pell's equation, and return the first one we see. Here is the Haskell implementation:

get_pell_fundamental_solution n = head $ solutions
    where
      solutions = [ (p, q) | (p, q) <- convergents, p * p - n * q * q == 1 ]
 
      convergents = get_convergents $ sqrt_continued_fraction n

Note the use of the Haskell's list comprehension syntax, similar to Python, which expresses what I just described in a matter reminiscent of set notation. Here is the full Haskell program designed so its output may be conveniently piped to bc for verification:

module Main where
 
import System (getArgs)
 
sqrt_continued_fraction :: (Integral a) => a -> [a]
{- ... the sqrt_continued_fraction function explained above ... -}
 
get_convergents :: (Integral a) => [a] -> [(a, a)]
{- ... the get_convergents function explained above ... -}
 
get_pell_fundamental_solution :: (Integral a) => a -> (a, a)
{- ... the get_pell_fundamental_solution function explained above ... -}
 
main :: IO ()
main = do
  args <- System.getArgs
  let d      = (read $ head $ args :: Integer)
      (p, q) = get_pell_fundamental_solution d in
    putStr $ "d = " ++ (show d) ++ "\n" ++
             "p = " ++ (show p) ++ "\n" ++
             "q = " ++ (show q) ++ "\n" ++
             "p^2 - d * q^2 == 1\n"
and here is it in action:
$ ./solve_pell 1597
d = 1597
p = 519711527755463096224266385375638449943026746249
q = 13004986088790772250309504643908671520836229100
p^2 - d * q^2 == 1

The solution to the original problem is therefore: 5054112910466227478111803017176109047976100000000.

Now that we've found a method to get a solution, the question remains as to whether it's the only one. In fact it is not, but it is the minimal one, and all other solutions (of which there are an infinite number) can be generated from this fundamental one with a simple recurrence relation as described on the Wikipedia article. My program above can be easily extended to generate all solutions instead of just the fundamental one (I'll leave it to the reader as an exercise).

One remaining question is the efficiency of this algorithm. For simplicity, let's neglect the cost of the arbitrary-precision arithmetic involved and assume that the incremental cost of generating each term of the continued fraction expansion and the convergents is constant. Then the main cost is just how many convergents we have to generate before we find one that satisfies Pell's equation. In fact, it turns out that this depends on the length of the period of the continue fraction expansion of √d, which has a rough upper bound of O(ln dd). Therefore, the cost of solving Pell's equation (in terms of how many convergents to generate) for a given n-digit number is O(n 2n ⁄ 2). This is pretty expensive already, although it's still much better than brute-force search (which is on the order of exponentiating the above expression). Can we do better? Well, sort of; it turns out the length of the answer is of the same order as the expression above, so any algorithm that explicitly outputs a solution necessarily takes that long. However, if you can somehow factor d into s d′, where s is a perfect square and d′ is squarefree (i.e., not divisible by any perfect square), then you can solve Pell's equation for the smaller number d′ and output the solution for d′ as the smaller fundamental solution and an expression raised to a certain power involving it. Note that in general this involves factoring d, another hard problem, but for which there exists tons of prior work. An interested reader can peruse the papers by Lenstra and Vardi for more details.

As a final note, one of the things I really like about number theory is that investigating such a simple program can lead you down surprising avenues of mathematics and computational theory. In fact, I've had to omit a lot of things I had planned to say to avoid growing this entry to be longer than it already is. Hopefully, this entry helps someone else learn more about this interesting corner of number theory.