Software Development

No fuss SQL joins without a database

CPAN Perl provides an excellent generic database driver, but it does not include join support. Learn how to easily add join functionality with this Perl code.

As I was browsing through the CPAN Perl archives, I discovered the DBD::Sprite module, a database driver you can use when you don't have a database. DBD::Sprite can be a great alternative to costly database implementations, but its most obvious shortcoming is a failure to support SQL joins. Fortunately, we've found a way to add those joins by coding directly in Perl.

The DBD::Sprite database driver
In terms of complexity, database servers can be ranked roughly as follows: Oracle and SQL Server at the high end; perhaps followed by PostgreSQL; down to MySQL or even further to db; and at the bottom, flat file formats such as comma-separated values (CSV). Perl's DBI module provides general infrastructure and offers a specific DBD driver for each of these database models.

The DBD::Sprite driver's complexity lies somewhere between that of MySQL and CSV files. It does operate on simple flat files, but it has enough polish to provide a SQL-like interface. Because it's implemented entirely in Perl, there's no server to set up or connect to and no libraries that need linking to the Perl interpreter.

When you're installing DBD::Sprite, you'll also install the tool makesdb.pl (a copy is installed into the Perl DBD module directory). This tool creates a single flat file of trivial details about database name and location. There's nothing sacred about this file; it's mostly plain text. You can make one for yourself by hand, as long as you glance at the Sprite.pm module to see how passwords are made with Perl's crypt() function. Database and user names are trivial features of DBD::Sprite that are designed to fit in with the DBI interface and to make DBD::Sprite mimic a grown-up database. There's no real multiuser support.

In addition to the .sdb (Sprite DataBase) file, there's an .stb (Sprite TaBle) file for each relational table you create. These files are in a CSV-like format. That's all there is to DBD::Sprite; there are no transaction log files or other complexity.

SQL with DBD::Sprite
The file insert.sql, shown in Listing A, provides an example of how to create and fill relational tables with DBD. DBI's AutoCommit feature is on by default, so each SQL operation is a separate transaction. Here's a sample command:
 
$db->do("INSERT INTO singers VALUES (1, 'Billy Holiday', 170)");
 

This command looks like any command in any other SQL database, but DBD::Sprite has amazingly limited SQL support. Most datatypes are faked: Only Perl strings and numbers are actually supported as types. There are no keys or primary keys and no indexes. Each table is just a set of anonymous records. On top of this basic arrangement is some SQL syntax that's a lot like Oracle or SQL92, depending on your perspective. You get the chance to express yourself in SQL-language and not much more.

Still, familiar is good. The DBD::Sprite authors warn that performance is lousy, but they must be referencing big corporate databases where tables have millions of records. In our experiments, we could easily see subsecond response for a table containing thousands of records. That's enough for a simple mailing list database.

In this article's example we've created two tables: singers and actors. We want to find a singer and an actor with the same height (stored in centimeters), so we can pair them up for a vaudeville show. That's a problem for DBD::Sprite, because there is no join support. All queries act on one table only; there are no subselects or other join substitutes.

The problem of joins
Writing logic to do a join (we mean a natural join, equi-join, or inner join, on one column only) is fairly simple. The main problem is deciding which join algorithm to use. Joins are traditionally slow and expensive operations. If two large tables are to be joined, the choice of the join algorithm is critical.

This brings us to computer science's O notation. O means Order and O(n) means Order-of-n. These statistics measure the total operations needed to complete an algorithm. O(n*n*n) describes worse performance than O(n*n) because as n gets big, n*n*n grows faster than n*n.

Two well known join algorithms are the so-called nested-loop join - O(n*n) and the sort-merge join - O(n*log(n)). On the face of it, sort-merge must always be faster since n*n gets bigger quicker than n*log(n). That, however, is just a generality. If one of the joined tables is very small and one is very large, the nested-loop join looks more like O(n), which is then faster than merge-sort. So for the best possible result, have more than one join option handy. We'll run through samples of both.

Implementing joins in Perl
DBD::Sprite, DBI in general, and many other Perl technologies use data structures that are lists-of-lists. Any Perl join will have to manipulate these lists. In Perl, it's common to suck up lists of records from a file until all the rows are held in memory. This is a good idea for a lightweight system like DBD::Sprite, since memory requirements should stay moderate. Join algorithms are easier with this approach, as well. Performance will be good enough for Web site mailing lists (or audition lists); it's not enough for big applications.

To allow DBI/DBD to hand back from a SQL SELECT, you use a data structure that is a reference to an array that contains references, each of which refers to an array containing data items for the fields of one row. Basically, it's an array of arrays.

The same structure expressed directly in Perl literal style looks like either:
 
[ [ a1, a2, .., aN ], [ b1, b2, … , bN ], … , [z1, z2, …, zN ]
 

or
 
\( \(a1,a2,…,aN), \(b1,b2, …, bN), …, \(z1, z2, …, zN) )
 

where a, b … z indicate a single row of data each. For a two-table join, you'll have two of these structures.

The script join.pl, shown in Listing B, has a routine for each of these joins. Both routines construct and return a third table datastructure in the same format as the two supplied tables—that's the joined output. Perl novices should brush up on the perlref and perldata man-pages before examining the code.

Subroutine nested_loop_join() is just as it seems: two simple loops inside each other. The routine spins through each record of table A, and for each record, spins through the records of table B. If the join column matches, it adds the two rows to the result.

Subroutine sort_merge_join() is a little trickier. Each of the tables is separately sorted, using Perl's built-in sort function, which we hope is O(nlog(n)) at worst. The routine looks at the top of both sorted lists, comparing elements. If there are matches, it creates join records. If there are no matches, it discards a top element and compares the tops of the two tables again. Subroutine sort_merge_join() repeats until both tables are empty; by then all matches will have been found.

Summary
DBD::Sprite is a quick-and-easy database solution for simple purposes, but don't run your corporate CRM system on it. With a little extra effort, you can add tolerable join functionality. Joins greatly increase the flexibility of SQL queries, and you can implement them once and never think about them again, except to choose the right one for the right occasion.

 

 

 

 
0 comments