mogrify

Let's see where this goes.

A Love Note to KeyedCollection

At DEQ, we are doing a project to move our external website out of OpenCMS and into DotNetNuke. One of my tasks is to create a tool that will take a bunch of HTML files exported from OpenCMS and import them into DotNetNuke. It’s not a simple transfer - the HTML needs to be parsed to extract metadata fields, and modified in various ways to fit into the new system.

The tool is a console application written in C#. It takes a list of files from an Excel spreadsheet (which our team is maintaining as the project moves forward), runs each file through an HTML parser, modifies the page, and then calls a series of SQL Server stored procedures to complete the import into DotNetNuke.

The HTML parsing and modification is handled by the excellent HTML Agility Pack. The biggest part of it is that the links in each page need to be rewritten to conform to our DotNetNuke system’s URL format.

Where this gets tricky is that according to our rules, a page’s new URL isn’t known until the page is parsed. It could be based on the content of the <title> element, or on the first <h1> element. As a result, it has to have (at least partially) parsed all the pages before the first one can be completed. It doesn’t have enough memory to hold the content for all the pages (about two thousand), so it is doing the parsing in two passes - once to determine all of the metadata (including the new URL), and once, just before import, to retrieve and modify the body content.

During the first parsing pass, the tool builds a collection of all the pages and their metadata fields. During the second, it traverses this collection, parsing, rewriting, and then importing each page in turn. My initial (naïve) implementation used a standard List<T>, to preserve the order of the pages. This is important because in DotNetNuke’s model, each page may have a parent page, which needs to be imported before any of its children. So it builds the page hierarchy in breadth-first order.

However, this resulted in absolutely horrendous link rewriting performance. To find the new URL of a page by its current URL, the list needs to be traversed in order until a match is found. This has to happen for every single internal link on every single page on the site. A single run of the import tool was taking over six hours to complete, with the vast majority of the time spent in the link rewriting step.

Obviously, a List<T> is not the right data structure for this problem. A Dictionary<K,V> would be better for the link rewriting step, since it would provide nearly instantaneous retrieval, but it wouldn’t preserve the item order.

This past Valentine’s Day, I decided to fix this problem. One way to approach it would be to maintain two data structures - a list for importing, and a dictionary for URL lookups. But I decided to do some research to see if there was an existing collection class in the .NET framework that could preserve order while providing O(1) retrieval by key.

Turns out, there is: KeyedCollection<K,V>. It’s an abstract class that allows both index- and key-based retrieval in (near-)constant time. To achieve this, it maintains both a list of items for iteration and index-based retrieval, and a lookup dictionary for key-based retrieval. Using it is as simple as extending it, and providing an implementation of GetKeyForItem that describes how the key should be determined for each item. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
class ImportItem
{
  public string Url { get; set; }
  // other fields
}

class ImportItemCollection : KeyedCollection<string, ImportItem>
{
  public string GetKeyForItem(ImportItem item)
  {
      return item.Url;
  }    
}

static void Main(string[] args) {
  var collection = new ImportItemCollection();
  collection.Add(new ImportItem { Url = "http://some/url/1" });
  collection.Add(new ImportItem { Url = "http://some/url/2" });

  // acts like a Dictionary when you need it
  var importItem = collection["http://some/url/1"];

  // and like a List when you need it
  foreach (var item in collection) {
      // ...
  }
  importItem = collection[1];
}

As always, there are a few caveats. First, and perhaps obviously, KeyedCollection uses additional memory. Instead of a List and a Dictionary, you now have both. There’s still only one copy of each value (if it’s a reference type), but the additional collection itself is going to consume more memory. Second, KeyedCollection<K,V> doesn’t actually implement IDictionary<K,V>, so you can’t use it interchangeably with code that expects a dictionary. (It does implement the same interfaces as List<T>, so it’s best to think of it as a list with key lookups rather than a dictionary with index lookups.) Third, when you use the collection’s Contains method, watch out: the override that takes the key parameter uses the fast dictionary lookup, but the override that takes the item parameter will cause an expensive iteration over the list items. And finally, serialization performance can be slow because it serializes both the list and the dictionary. (Source for the last two items)

After switching to KeyedCollection for my page list, I had improved the tool’s run time from six hours to three minutes. Yes, three minutes. Now, this enormous improvement wouldn’t have been possible if I hadn’t started with a ridiculously awful implementation to begin with, but it’s still a testament to the awesomeness of this collection class.

And that’s my Valentine’s love note to KeyedCollection.

Sometime in the last month or so, I realized that at some point, I had pretty much stopped writing code on my own time. There are a lot of reasons for this: very busy time at my day job, three very energetic children, ongoing house renovations, electronic distractions, and other day-to-day preoccupations. But it was a shock to me to look back and realize that the only code I’d written recently was either for work or for Monotask.

The thing is that I define myself as a programmer, and not so much as a dad, husband, gamer, or a do-it-yourselfer. I am all of those things, and I would say that being a family man is (by far) my top priority; but when I introduce myself, I start with “programmer” (or some variation, which is probably a good topic for a future post).

And to me, being a programmer means writing code on your own time. It means learning a new programming language every year. It means being curious about the way software works, reading others’ code, tinkering with new technologies. It means not being able to sleep until you’ve solved a particular problem that’s been burning in your mind. It means having half a dozen side projects in various stages of completion, and hopefully even a healthy selection of public repositories on GitHub.

Of course there are programmers who don’t fit this description, but this is the kind of programmer I think I am. But it’s not the kind of programmer I’ve been lately.

So I’m going to do something about it. Starting today.

Here’s my n-step plan:

  1. Uninstall Skyrim. I mean, c’mon, I finished it a long time ago. I don’t need to spend any more time in Tamriel right now.

  2. Set up a nice, clean, shiny new development environment. For me, this is a fresh install of Linux Mint 12, which I’ve never tried (I’ve been an Ubuntu user for many years). If it’s new and interesting, I’ll want to tinker with it, and that, in turn, will give me ideas.

  3. Start this blog. To write about programming, you have to think about programming, which makes you a better programmer. The blog platform is Octopress (and therefore Jekyll), which I’ve never tried, so I’ll want to tinker with it, and… well, you get the idea. I’m not going to pressure myself to post to it, but occasionally I’ll want to, and here it will be.

  4. Beef up my GitHub profile. There must be something I’ve written in the past year that’s worth sharing. I know everyone is dying to see my AbstractBaseProviderFactoryObserverFactory. And any new tinkering I do needs to go up on GitHub as well.

  5. Read more programming books. I’m reading Code Complete at the moment, but there are dozens of books out there I’d like to read. Again, thinking about programming makes you a better programmer.

  6. Start learning a new language. So far, I’ve flirted with functional languages but have never gotten very fluent in one. I think I’ll go through SICP and learn me some Scheme.

So that’s the plan. It’s time I became a programmer again.