Trees with nested sets, conditional SQL with the Case Statement, SQL beyond the very basics made easier

As Chad Fowler notes  in his wonderful book “The Passionate Programmer”, the relationship between programmers and databases is not always the best one. Towards the end of chapter 7, “Be a Generalist”, he writes that:

“software developers are growing increasingly lazy and ignorant about how to work with databases.”

We are delegating too much to DBAs. I found it reassuring that I wasn’t the only one to have had the same problem. When we fear something, we avoid studying it in detail, and I used to fear databases. Then I decided to get to know them better, and have been learning something new every week at least.

Lately I picked up a book called “SQL Antipatterns” which is very interesting, and is making this voyage of going beyond the very basics of (My)SQL more fun and easy. It’s written by Bill Karwin and is published by the Pragmatic Bookshelf in 2010. It presents 24 antipatterns (bad practices), legitimate uses of those, and possible alternatives. It ends with some thoughts on normalization.

“To call SELECT only one statement in that language is like calling an engine only one part of an automobile.”

The book invites us to look beyond the basics and the obvious and the widespread habits in SQL.

The second  antipattern in the book is about trees. Naive trees. It notes that if we have to keep comments in a database, and each comment can be a reply to another comment, if we create our table in a way that each comment row “knows” its parent comment, we can only query so many levels of the tree at a time, because the JOIN statement will only take us so far. Adding a node is easy with this Adjacency List method, it’s enough that the new row “knows” its parent’s id. But deleting is more complex.

There are three alternatives: Path Enumeration (fun to learn that is what a UNIX paths like /usr/local/lib really is), Nested Sets (the one I took an interest in), and Closure Table (a neat separate table which contains node relationships).

It was fun to discover that, with nested sets, each node knows not its parent, but knows its “territory”, that is its “left and right” values in between which its descendants are found.  If our nodes fall between another node’s left and right, we have an ancestor. This means that we can simply query the descendants and the ancestors with a ON … BETWEEN … WHERE ,  and we can insert a node by first doing a big shift with a CASE statement.

The Nested Set Model

The Nested Set Model diagram from Wikipedia, http://en.wikipedia.org/wiki/File:NestedSetModel.svg

The Nested Sets Model

The Nested Sets Model Representation from Wikipedia, http://en.wikipedia.org/wiki/File:Clothing-hierarchy-traversal-2.svg

What’s the Case Statement? It’s the way to have some conditional logic in your SQL code (CASE/WHEN/THEN/ELSE). Curious about how to do all of that? Click “Continue reading” and get your queries 🙂 If you have actually used any of this, especially the nested sets or the Case Statement, please leave a comment! It would be very useful.So, we describe a tree in a way that each node knows not its parent, but its “territory”, that means that if a node’s left value is 2 and its right value is 9, all that falls in between is their descendants, and the node with the left value 1 and the right value bigger than 9 such as 22 (see the above diagram) is the ancestor.

Which means we can get all our descendats with a single query. BTW all queries can be downloaded from the books’ web site at this url, the nested sets directory is code/Trees/soln/nested-sets/. The descendants query is:

SELECT c2.*
FROM Comments AS c1
JOIN Comments as c2
ON c2.nsleft BETWEEN c1.nsleft AND c1.nsright
WHERE c1.comment_id = 4;

And to query the ancestors use:

SELECT c2.*
FROM Comments AS c1
JOIN Comment AS c2
ON c1.nsleft BETWEEN c2.nsleft AND c2.nsright
WHERE c1.comment_id = 6;

Or just the direct parent:

SELECT parent.*
FROM Comment AS c
JOIN Comment AS parent
ON c.nsleft BETWEEN parent.nsleft AND parent.nsright
LEFT OUTER JOIN Comment AS in_between
ON c.nsleft BETWEEN in_between.nsleft AND in_between.nsright
AND in_between.nsleft BETWEEN parent.nsleft AND parent.nsright
WHERE c.comment_id = 6
AND in_between.comment_id IS NULL;

The depth can be calculated with:

— Reports depth = 3
SELECT c1.comment_id, COUNT(c2.comment_id) AS depth
FROM Comment AS c1
JOIN Comment AS c2
ON c1.nsleft BETWEEN c2.nsleft AND c2.nsright
WHERE c1.comment_id = 7
GROUP BY c1.comment_id;

DELETE FROM Comment WHERE comment_id = 6;

— Reports depth = 2
SELECT c1.comment_id, COUNT(c2.comment_id) AS depth
FROM Comment AS c1
JOIN Comment AS c2
ON c1.nsleft BETWEEN c2.nsleft AND c2.nsright
WHERE c1.comment_id = 7
GROUP BY c1.comment_id;

SELECT c1.comment_id, COUNT(c2.comment_id) AS depth FROM Comments AS c1
JOIN Comments AS c2
ON c1.nsleft BETWEEN c2.nsleft AND c2.nsright
WHERE c1.comment_id = 7 GROUP BY c1.comment_id;

And the insert statement starring our fantastic Case Statement is:

— make space for NS values 8 and 9
UPDATE Comment
SET nsleft = CASE WHEN nsleft >= 8 THEN nsleft+2 ELSE nsleft END,
nsright = nsright+2
WHERE nsright >= 7;

— create new child of comment #5, occupying NS values 8 and 9
INSERT INTO Comment (nsleft, nsright, author, comment)
VALUES (8, 9, ‘Fran’, ‘Me too!’);

Not everyone knows that you can use conditional logic with MySQL and not everyone who knows that has actually done it before. Nested sets are a fun opportunity to give it a try, at least as an exercise.

Are nested sets really worth all the extra trouble? The “SQL Antipatterns” concludes that:

[p. 44] “Nested Sets is a clever solution – maybe too clever. It also fails to support referential integrity. It’s best used when you need to query a tree more frequently than you need to modify the tree.”

What about performance issues? I haven’t yet had the time to do any benchmarking. Wikipedia reports that [copied 2011-09-11]:

“Nested set are very slow for inserts because it requires updating lft and rgt for all records in the table after the insert. This can cause a lot of database thrash as many rows are rewritten and indexes rebuilt.

The Nested interval model does not suffer from this problem, but is more complex to implement, and is not as well known. The nested interval model stores the position of the nodes as real numbers expressed as quotients (n/d). “

But it’s not as critical about querying:

“Queries using nested sets can be expected to be faster than queries using a stored procedure to traverse an adjacency list, and so are the faster option for databases which lack native recursive query constructs, such as MySQL. However, recursive SQL queries can be expected to perform comparably for ‘find immediate descendants’ queries, and much faster for other depth search queries, and so are the faster option for databases which provide them, such as PostgreSQL, Oracle, and Microsoft SQL Server.”

More advice and experiences can be found searching http://www.stackoverflow.com, either through internal search form or using Google with site:stackoverflow.com like I do every day 😀 Some people denormalize things a bit by keeping track of the parent anyway, for example. I’m no expert here, if you have any thoughts or ideas, write in the comments. Any comment or constructive criticism is welcome, hope this post is useful for you and not too boring.

Advertisements

About apprenticecoder

My blog is about me learning to program, and trying to narrate it in interesting ways. I love to learn and to learn through creativity. For example I like computers, but even more I like to see what computers can do for people. That's why I find web programming and scripting especially exciting. I was born in Split, Croatia, went to college in Bologna, Italy and now live in Milan. I like reading, especially non-fiction (lately). I'd like to read more poetry. I find architecture inspiring. Museums as well. Some more then others. Interfaces. Lifestyle magazines with interesting points of view. Semantic web. Strolls in nature. The sea.
This entry was posted in books, SQL and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s