Storing binary large objects (BLOBs) in a database is a common requirement. Unfortunately, the SQL92 standard doesn’t define any suitable data type for BLOBs, so each vendor provides its own proprietary solution, if any. In this article, I’ll present a solution for PHP and PostgreSQL. It’s based on standard features, so it should be easy to port to any language/database.
What's a BLOB?
BLOB is an acronym for binary large object. It refers to any random large block of bits that needs to be stored in a database, such as a picture or sound file. The essential point is that a BLOB is an object that can’t be interpreted within the database itself.
PostgreSQL large objects
The whole thing started when I needed to store images in a PostgreSQL database as part of a PHP Web application. Although PostgreSQL provides support for BLOBs in the form of Inversion Large Objects, you can’t manipulate them with regular SQL commands. PostgreSQL large objects have their own API, which is available in PHP through the functions pg_locreate, pg_loopen, pg_lowrite, pg_loread, pg_loreadall, pg_loclose, and pg_lounlink (see Table A).
To deal with large objects, PostgreSQL provides a separate API that mimics file I/O, exposed by PHP as functions (see the PHP/PostgreSQL documentation for details).
You must store the object identifier (oid) returned by pg_locreate, usually as a column of a related table, in order to get access to the large object later. PostgreSQL provides a data type (oid) for that. This can be a problem because if, for any reason, you lose the large object’s oid, there’s no way to retrieve or delete the object other than by examining the system tables in order to discover its oid.
Given a clean, bug-free application, this shouldn’t happen in normal situations, but there’s always a risk because external factors can break the application, leading to dangling large objects. So, to be safe, a regular clean procedure would be necessary.
Another problem with the PostgreSQL solution is that there’s no SQL command to manipulate large objects, so you must write scripts to perform even the most trivial tasks. This isn’t critical since your application should provide the needed operations, but I like to be able to use psql (the PostgreSQL front end) to manually manipulate tables and rows.
A standard solution
Storing large objects as a column in a table solves both problems because transactions guarantee the database integrity (no unreferenced large object) and the table can be manipulated through SQL. However, the SQL92 standard doesn’t define any suitable data type for that.
Among the data types defined by the standard, the best choice is character varying, which stores a variable-length string. However, character varying is intended to store characters, not binary data, and although the SQL92 standard doesn’t specify any limitation on a data type’s size, each implementation imposes an upper limit.
PostgreSQL text data type
I used the PostgreSQL text data type instead of character varying because it can hold any number of characters (within the limitations of the PostgreSQL implementation) without the need to specify the upper limit. The physical size of text is basically just the size of the characters stored, while character varying always spends the maximum size specified. If this isn’t true for your database, no problem. Otherwise you should use an equivalent of the PostgreSQL text data type—many databases provide similar data types. The actual data type doesn’t matter as long as it stores strings. If you use character varying, remember that space is wasted every time the actual string stored is smaller than the maximum. However, you can adjust the block size to minimize lost storage: Use a large block size to store many large objects, a small block size for small objects, and something in between if your application mixes large and small objects.
The first problem is easy to solve—just convert the data to text using any suitable encoding scheme. In the case of PHP, the functions base64_encode and base64_decode do exactly that. The function base64_encode takes a binary string and returns it encoded according to the base64 encoding; base64_decode does the opposite. The base64 is a MIME standard that encodes binary data as ASCII characters.
You can use whatever encoding scheme is available in your language. If no encoding scheme is built in, you can use your own. Implementing base64 is not difficult, and plenty of implementations are freely available for several languages.
The solution to the second problem is not as straightforward, but it’s not difficult either. After encoding the data, break it into blocks of a size compatible with your database limitations (I call it the block size) and store each block as a row of a table, with a sequence number identifying each block so that you can reconstruct the original data later.
When you want to retrieve your large object, just get the rows in the right order (using the sequence number), concatenate the blocks, and decode the data. This way, you can store objects of any size within the physical limitations of your database.
A simple PHP implementation
Listing A is a set of PHP functions that implements the solution described so far, and Table B lists each function with a short description. My implementation stores the large objects in a separate table, named BLOBs, specific for them (see Table C). The functions in Listing A implement BLOBs in a portable way, using only regular SQL commands and tables. Although they were written for PostgreSQL, it’s not difficult to port them to another database.
Notice that each large object is identified by a unique integer (id) used to access the object later, much like the PostgreSQL oid. However, the id must be passed as a parameter to the createBlob function. So determining a unique id for each large object is application dependent.
I chose to do that just because doing it the other way, letting createBlob return the id, would be implementation dependent. However, you can easily create a wrapper (or change createBlob directly) to determine a unique id using a PostgreSQL sequence, for example.
The most important function is createBlob. Given a large object and its id, createBLOB encodes the data, breaks it into chunks of a predefined size (I used a block size of 4096 bytes, which is compatible with PostgreSQL limitations), and saves them in the BLOBs table. You can change the block size, changing the constants inside createBlob (just two places). Each chunk is numbered so that it can be retrieved later in the correct order to reconstruct the original data.
The function getBlob retrieves a large object, given its id. It retrieves the blocks, concatenates them, decodes the data, and returns the result. The function deleteBlob deletes a large object, given its id. The updateBlob function replaces a large object with a new one—updateBlob is actually a shortcut that calls deleteBlob to delete the old object and createBlob to put the new one in the table.
The functions xxxxxFromFile do the same things as their counterparts; however, they take the large object from a file, given its name, instead of a PHP string. They are handy when uploading files because PHP saves the uploaded data in temporary files.
Implementing binary large objects using standard SQL is not a difficult task. With proper design, it can be as easy to use as native implementations. Since this solution for PHP/PostgreSQL relies only on standard features, it shouldn’t be difficult to port it to other databases or languages.
There’s room for improvement, however. One limitation with my implementation is that it represents the large object in memory as a PHP string, which can be a problem if you intend to deal with huge objects. In that case, an API more like the PostgreSQL inversion large objects would be welcome.