Tuesday, May 13, 2008

On Balanced Trees and Car Insurance

I usually don't teach our Data Structures course, but I often give a guest lecture about balanced binary search trees. This comes right after they've learned about plain (unbalanced) binary search trees. Often the previous lesson has ended with an illustration of what happens when values are inserted into the tree in sorted order.

After reviewing that pathological case and how it can easily arise in practice, I dive in and explain how to keep the trees balanced. In my case, I use red-black trees, but the exact choice of balancing scheme isn't critical to my point here. At the end of class, I ask them how much faster they think the balanced trees are than the unbalanced trees. They're shocked when I tell them that, most of the time, the balanced trees are slower than the unbalanced trees. As long as the data arrives in a reasonably random order, the plain binary search trees will actually end up being fairly balanced, without doing any extra work. No mucking around with colors, no rotations—of course the balanced trees are slower!

So why bother?

I tell them that it's like buying car insurance. When you buy car insurance, you hope that you're just throwing your money away. At the end of the year, if you haven't been in any accidents, you're happy to have wasted that money. Why bother? To protect against the uncommon case, to keep one unlucky event from turning into catastrophe. You pay a little bit extra all the time to keep from paying a lot extra every once in a while.

6 comments:

Epic Fail Guy said...

At the risk of sounding stupid and naive, I'd like to ask what would happen if you simply shuffled a (possibly) sorted list in advance of interning it into a normal non-balanced binary tree? The odds are, especially for very large lists, that the tree would retain something around O(log n) performance. Right?

Oh and nice blog btw, keep writing

Unknown said...

Epic Fail > You could shuffle all data, if you have access to it beforehand.

If the data is slowly trickling in, you do not have the luxury of being able to shuffle it.

Epic Fail Guy said...

LOL, I suppose you could buffer it.

Mattias said...

Loi please.

Perhaps in some cases you could buffer it. But consider a car insurance database searchable by the policy end date. New records will be added every day, with generally strictly increasing dates. And you cannot buffer the data for long enough enough to make a difference.

In the case where you need to buffer it, it makes sense to just use a balanced binary tree and avoid hacks (that will have a performance penalty as well) to make the unbalanced trees work.

haf said...

Very good comments. Thank you http://www.6i6.de

Anonymous said...

even when buffering is not feasible, (e.g. in the car insurance database example), it may be possible to use an unbalanced tree, and balance it every night (or during any expected breaks in the insertion process) by shuffling the entries. depending on the performance requirements, this might be better than suffering the overheads of a balanced tree at every insertion.