Tuesday, May 20, 2008

Designing a Data Structure

Students are rarely given an opportunity to design an algorithmically non-trivial data structure. We might give them a scenario and ask them to design how to represent the data, but that design decision is usually something like choosing between a hash table or an array or a linked-structure. Once that high level decision is made, there is very little left to design.

One reason for this is that many data structures depend on non-obvious invariants. Most students are pretty shaky on the whole idea of invariants to begin with; they're never going to spontaneously come up with something like the rules describing red-black trees or leftist heaps. With these kinds of data structures, we basically present the finished product and possibly ask them to implement and/or analyze the data structure, but we never ask them to design something like that.

For several years, I've been trying to give students a taste of real data structure design, using the sample priority queue below. I usually do this in class, where I can help steer them with gentle questioning, but I try very hard to make sure that they come up with the important parts.

I always do this at some point after the students have studied binary heaps (the kind typically used in heapsort). I briefly review the priority queue operations binary heaps support (insert, findMin, and deleteMin) and the basic idea of heap-ordered trees. Then, I introduce the idea of merging two heaps. With binary heaps, this takes O(N) time, or even O(N log N) time, if done naively. I ask them to design a new data structure to support merging in O(log n) time, while the other priority queue operations run in the same bounds as for binary heaps, O(1) for findMin and O(log N) for insert and deleteMin. I suggest that they use heap-ordered binary trees (real trees, rather than simulating trees using arrays like in binary heaps), but that they can modify these trees in any way that is helpful.

With heap-ordered trees, findMin is trivial—it just returns the root. To get their creative juices flowing, I then ask them to implement insert and deleteMin in terms of merge. This is also pretty easy: insert creates a new singleton tree and calls merge; deleteMin discards the root and calls merge on the two children of the root. So the whole problem boils down to merging two trees efficiently.

At this point, they usually flounder for a while, which is both ok and expected. Some floundering is good. Naturally, they'll start to get a little frustrated. That, too, is ok and expected. I just don't want them to get frustrated to the point where they shut down. If the floundering continues too long without any progress, I'll ask them what they can tell immediately about the result tree by only looking at the roots of the input trees. They'll quickly reply that the smaller of the two input roots becomes the root of the result tree. So I have them draw that much of the result. Depending on how things are going, I may further suggest that they draw two question marks for the children of the root in the result tree.

Then the students will play for a while with how to fill in those question marks. There are two slots for subtrees, but we have three subtrees that need to go in those slots: the two subtrees of the “winning” input root plus the “losing” input root. To fit three subtrees into two slots, naturally we need to merge two of them and leave one unchanged. But which two should be merged?

Usually, the students will begin with some fixed strategy, such as always merge the two subtrees of the winning root. But it's easy to come up with counter-examples where such a strategy will take O(N) time.

Once it becomes clear that some smarter strategy is required, the students focus on the question of which two of the three subtrees to merge. Usually the next thing they try is merging the two subtrees with the smallest roots, which fails on the same kinds of counterexamples as before.

Eventually, based on the intuition that merge should run faster on small trees than on big trees, they try merging the two smallest trees. This is always a good learning point about ambiguity, as we realize that there are three different possible meanings of “smallest”: the smallest root according to the ordering on keys, the smallest subtree according to number of nodes, or the smallest subtree according to the height of the subtrees. We've already shot down the first one, but either of the other two will work, so I just go with whichever one the class prefers (usually smallest in number of nodes).

One catch is that to efficiently compare sizes of subtrees (either number of nodes or height), we need to store the sizes at every node. But that's easy to do by adding an extra field, and maintaining that size information is also easy, albeit slightly easier for number of nodes than for height.

Typically, we don't have a lot of time left in class at this point, so I help them with the analysis, but it's easy to show that this design supports merge in O(log N) time. And there was much rejoicing!

I call this data structure maxiphobic heaps because of the way merge avoids the biggest subtree at every step. But I don't tell the students that name until after they've already come up with the design.


Anonymous said...

I wish you had taught my algorithms/data structures courses. :)

I think very often in courses professors don't emphasize the importance of the student designing (not simply implementing). I find that reading research publications are often a wonderful way of learning because you can delve into the design process and insights of the researcher, not simply the outcomes, which textbooks tend to focus upon. However undergrads (and sometimes grad-students) rarely have the opportunity to read research publications in a course... sad.

Too bad I'm still a poor Master's student, I really want to purchase your book: Purely Functional Data Structures. :)

P.S. -- I am friends with one of your students, Scott Lobdell, at West Point.

pjz said...

Somewhat simpler would be an exercise in combining structures to come up with hybrids like: a linked list with O(1) lookup via a hash table, 'threaded' trees (linked list + tree structure) and even simply linked lists that are 'multithreaded' for efficiency.

Jonas said...

I agree with the first comment, if we had done exercises such as this in my algorithm courses I might still use some of the data structures apart from the linked list and hash table.

Very nice description, makes me want to revisit my algorithm exercises and do something with them.

Fred said...

Hmm...it seems like you could just keep the two trees in memory separately, pick the least of the min's at findMin/deleteMin time, still hit all of your constraints, and probably have it be a bit simpler. Is there something I'm missing? It's a neat exercise, though!

Chris Okasaki said...

Fred: I'm not sure which two trees you mean, the original two trees being merged or the two subtrees being recursively merged.

Keeping the original two trees would work if there was only ever a single merge, but when there can be arbitrarily many merges, then deleteMin ends up taking linear time.

Keeping the two subtrees is a problem, because then the new root effectively has three children instead of two. Merge again and you have a node with four children, then five, six, etc. Again, you end up with operations taking linear time.

Ravi said...

"One reason for this is that many data structures depend on non-obvious invariants. Most students are pretty shaky on the whole idea of invariants to begin with"

Good Point. What do you think students can do to increase their familiarity with such invariants? read research papers? learn more theory? Is there a list of "design your own data structure" problems somewhere?

I am very interested in learning how to design (vs implement) data structures. Any pointers/advice would be greatly appreciated.

Thanks in advance.

Chris Okasaki said...

Ravi: I think a good way to learn about how these invariants work is to study a bunch of the standard data structures. But instead of stopping at just learning how that data structure works, really pay attention to the invariant and notice exactly where the algorithms take advantage of it. Think of variations of the invariant and explore how those variations would affect the algorithms.

It can be very useful to compare different designs of the same kind of data structure to see how they solve the same problems in different ways. For example, compare red-black trees and AVL trees, or leftist heaps and binomial heaps.

You don't need a list of problems, just start with things like priority queues, dictionaries, and sequences.

In my book, Purely Functional Data Structures, I discuss a bunch of techniques for designing data structures. Many of them are particularly suited to functional data structures, but many of them apply more widely. Probably the most widespread is "numerical representations", where you design the data structure in analogy to a number system, and design the algorithms on the data structure in analogy to operations on those numbers.

Raoul Duke said...

i dare say that you won't care about invariants until you have to debug the code of somebody who didn't know about the idea, and wrote up a mess.

in other words, invariants can be hard to motivate because they can seem like bondage or esoteric math foo, rather than something that just helps me ship the darned product.

i don't currently agree with that thought, but acknowledge that some folks (apparently) think that way. and i probably did when i was first told about them in CS211, or whatever class it was.