Information Overload - Part II: Data Difficulties

This is the second in a two-part series.

In the first part of this series ("Information Overload - Part I: Too Many Options"), I discussed how developers create too many options and data paths for users, creating a situation of information overload. In this part, I will review how issues with the data itself contribute to the problem, and how developers can help resolve it.

Walled Gardens

Far too many content and communications systems are walled gardens. What this means is that while working within the system is great, it has no way of communicating with the outside world. MySpace is the biggest example of this. The blogs you post there cannot be accessed via RSS, the songs, videos, and images you post cannot be accessed outside of their system through URLs, and you are unable to send messages outside (or into) MySpace. In a nutshell, it is like having a cell phone that is unable to call any phone except other cell phones with that provider. What this does is force users to spend far too much time accessing and using far too many devices and systems in order to work with everyone in their network.

This is a bad case of developer hubris. Like the issues in Part I, too many developers seem to have the idea that their users will "live" in their application, on their device, or on their Web site. This is completely false. MySpace and Outlook are the only "homes" for people. People really do not need another. Outlook is hardly a "walled garden." MySpace sadly is.

The rationale behind a "walled garden" is that you want your users to spend as much time there as possible. It is a problem particular to Web sites that earn money via advertising. MySpace is the perfect example, as stated earlier. If you know enough people on MySpace, you have almost no need for an email account, a Web site, an RSS reader, or many other applications (or Web applications). However, this comes at a price: you are cut off from anyone not on MySpace.

Users really do not like "walled gardens." They only tolerate them, and only if there is enough in the garden to be satisfactory. As soon as a more open or more attractive option becomes available, the users flee. Look at the failure of every single portal in the late 1990s. AOL is another good example of what happens to "walled gardens." Its closed system could not survive the Internet.

At the end of the day, try to avoid creating a "walled garden." Sometimes business dictates that you must, but whenever possible, hook into known, open standards such as SMTP, NNTP, POP3, LDAP, IMAP, etc. The more you allow your system to interact with others, the more likely you are to gain and retain users.
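As a rough illustration of what hooking into an open standard buys you, here is a minimal Python sketch that builds a message in the standard Internet mail format using only the standard library. The function name build_message and the example addresses are hypothetical, invented for this sketch. Nothing is sent; the point is that the text it produces can be read by any standards-aware mail system, not just the one that wrote it.

```python
from email import message_from_string
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a message in the standard Internet mail format (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


# Serialize to the plain-text wire format that SMTP servers carry.
wire_text = build_message(
    "alice@example.com",
    "bob@example.org",
    "Hello from inside the garden",
    "This message can leave the system that created it.",
).as_string()

# Any other standards-aware program can parse the same text back.
parsed = message_from_string(wire_text)
print(parsed["To"], "|", parsed["Subject"])
```

Handing the message to smtplib for delivery would be the next step, but that needs a real mail server, so it is left out here.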

Lack of Storage Unification

Even for systems that are not walled gardens, there is a complete lack of storage unification. For example, a message sent through Yahoo! Mail does not appear in my Outlook "Sent Items" folder. A file stored in Flickr is completely cut off from my local file system; a change in one requires manual intervention on the other. As a result, users end up with a large number of places for data to be stored and hidden. Users need fewer storage areas, not more, and those areas need to seamlessly sync and interact with each other. Even more importantly, simple data such as basic word processor documents need an easy way to appear on Web sites.

Systems and users are headed toward a new version of thin client computing, but unlike traditional thin client computing, the data is stored all over the place, not on a central server. Because of this, each system has its own rules for data storage and retrieval, and there are higher barriers between them. In the thin client days (as well as the client/server days), you could count on each piece of software having access to the same data, either through the file system or a database connection. Now, with the data locked behind a registration screen and no standard APIs for over-the-Web authentication (do you use HTTP authentication headers? Pass usernames/passwords through GET or POST variables? What are the fields called?), it is much harder for various systems to read from or write to each other. At best, one site will provide you with an HTML snippet (YouTube style) to post your content from one site onto another site.
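To make the "HTTP authentication headers" option concrete, here is a small Python sketch of the one variant that is standardized: HTTP Basic authentication, where the credentials travel in an Authorization header rather than in site-specific form fields. The helper name basic_auth_request, the URL, and the credentials are all made up for illustration, and the request is only constructed, never sent.

```python
import base64
import urllib.request


def basic_auth_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying an HTTP Basic Authorization header."""
    # Basic auth is "username:password", base64-encoded, prefixed with "Basic ".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req


# Hypothetical endpoint and credentials, for illustration only.
req = basic_auth_request("https://api.example.com/documents/42", "justin", "s3cret")
print(req.get_header("Authorization"))
```

Because the header name and encoding are fixed by the standard, a client written this way does not need to know what each site calls its username and password fields. Basic authentication only encodes the credentials, it does not encrypt them, so it belongs on an encrypted connection.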

Of course, this is really not different from what the Web was originally conceived to do: allow data from many different sources to be put together in one place. But there is now a business model attached to it. A company cannot make a living if you link directly to an image or movie on their site; they need to wrap it in their packaging. Again, look at YouTube. They could provide you with code to allow their movies to appear in the user's media player of choice within the page. Instead, they put a wrapper around it that encourages you to visit their site. In the future, will they be splicing their ads into it as well? Google Maps followed this path. They waited until a critical mass of Web sites were using their API, then suddenly started plastering their ads on those maps.

This would not be as much of a problem, except that data is now scattered all over the Web, with different logins and different methods of retrieving or updating it. This leads back to "too many options." Even with a self-contained system, like Salesforce.com, the result is chaos. Some data is within the SOA provider's cloud, some of it is stored within the LAN. Getting applications (and users) to work with two separate pieces of data like this is not easy at all.

Unfortunately, there is no easy fix for this at this time. The only hope is for a set of standards to be developed which help resolve this situation. Anything short of being able to access a piece of data just like network storage is not good enough.

The Metadata Problem

Metadata is great. Not only does it provide useful information, but it is a great aid in the data searching process. Unfortunately, very few systems automatically generate metadata. While they offer systems for the manual creation of metadata, rare indeed is the user who tags, marks up, or enters metadata. More systems need to find ways of properly providing the metadata automagically.

You, as a developer, need to find a way to determine the applicable metadata within your user's data, and make it available easily and openly. Plain text search just is not good enough either. There needs to be relevant, searchable metadata. It is a shame that WinFS is dead; I loved the idea. In general, file systems do not store any truly relevant metadata.
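As a very small sketch of what "automagic" metadata could look like, the Python below pulls a few fields out of a block of user text without asking the user to tag anything. Everything here is an assumption made for illustration: the function name extract_metadata, the choice of fields, the tiny stopword list, and the simplistic patterns. A real system would need far better extraction than this.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "that", "this", "with", "from", "before", "send", "have"}


def extract_metadata(text: str, top_n: int = 3) -> dict:
    """Derive simple, searchable metadata from plain text (illustrative only)."""
    words = re.findall(r"[a-z']+", text.lower())
    candidates = [w for w in words if len(w) > 3 and w not in STOPWORDS]
    keywords = [word for word, _ in Counter(candidates).most_common(top_n)]
    return {
        "word_count": len(words),
        # Simplistic email pattern, good enough for a sketch.
        "emails": sorted(set(re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text))),
        # Only ISO-style dates (YYYY-MM-DD) are recognized here.
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        "keywords": keywords,
    }


sample = (
    "Budget review on 2007-03-15. Send the budget figures to "
    "pat@example.com before the budget meeting."
)
metadata = extract_metadata(sample)
print(metadata)
```

The useful part is less the extraction itself than what happens next: once these fields exist, they can be indexed and exposed openly so that other systems can search on them.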

Poor Search and Lookup

Most search systems offer simple text search, and sometimes allow the user to refine their search based upon metadata. This is just not enough. For example, if you remember that someone emailed you their phone number, which is better? Searching for all emails from that user, or being able to use a pre-defined regular expression (such as "find all emails from John that contain a phone number")? More systems need to support regular expressions in a way that the average user can understand, and pre-define appropriate searches. In addition, more systems need to search through automagically generated metadata.
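Here is one way the "find all emails from John that contain a phone number" search might be sketched in Python. The Email class, the NAMED_PATTERNS table, and the find_emails function are hypothetical names made up for this example, and the phone pattern only covers common US-style numbers. The idea is that the user picks a plain-English name and never sees the regular expression behind it.

```python
import re
from dataclasses import dataclass

# Pre-defined searches the user can pick by name. The patterns are
# simplified assumptions: the phone pattern handles US-style numbers only.
NAMED_PATTERNS = {
    "phone number": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
    "email address": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


@dataclass
class Email:
    sender: str
    body: str


def find_emails(emails, sender: str, containing: str):
    """Return emails whose sender matches and whose body matches a named pattern."""
    pattern = NAMED_PATTERNS[containing]
    return [
        e for e in emails
        if sender.lower() in e.sender.lower() and pattern.search(e.body)
    ]


inbox = [
    Email("John Smith", "Lunch on Friday?"),
    Email("John Smith", "My new number is (555) 123-4567, call me."),
    Email("Mary Jones", "Reach me at 555-987-6543."),
]

matches = find_emails(inbox, sender="John", containing="phone number")
for email in matches:
    print(email.body)
```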

This is a tall order. One thing I think would be helpful is if languages (or regex libraries) had some standard regex's built in. It would also be great if SQL supported regex's, for those applications that query against databases. Languages also need to do a better job at supporting regex's. OOP languages in particular make regex's too much work. I like that Perl brought regex's to the level of string operators. One reason why Perl is "so good" at regex's compared to other languages (despite other languages typically using Perl-like regex syntax) is that a regex is a string operator.

That aside, if you want to support truly useful search/replace in your software, not only do you need to provide regex support, but it has to be done in a user friendly way. Pre-built regex's (email addresses, phone numbers, mailing addresses, other relevant data) help, but the syntax itself needs to be friendlier too, maybe something along the lines of Office string formatting codes.
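To give a feel for what a friendlier syntax might look like, this Python sketch translates a made-up, format-code-style pattern into an ordinary regular expression. The codes are purely hypothetical and only loosely inspired by the placeholder style of Office format strings: "#" stands for a digit, "@" for a letter, and "*" for any run of text. Every other character is taken literally.

```python
import re

# Hypothetical user-friendly codes and the regex fragments they stand for.
FRIENDLY_CODES = {
    "#": r"\d",        # one digit
    "@": r"[A-Za-z]",  # one letter
    "*": r".*",        # any run of text
}


def friendly_to_regex(pattern: str) -> str:
    """Translate a friendly pattern into regex source, escaping literal characters."""
    return "".join(FRIENDLY_CODES.get(ch, re.escape(ch)) for ch in pattern)


# The user types something that looks like the data they want to find.
phone = re.compile(friendly_to_regex("(###) ###-####"))
found = phone.search("Call me at (555) 123-4567 tomorrow.")
print(found.group() if found else "no match")
```

The trade-off is obvious: a syntax this simple cannot express everything a regex can. For the average user hunting for a phone number or a part code, that may be exactly the point.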

Data Format Chaos

Right now, all of our data is in far too many formats. Format incompatibility and conflicts are the kinds of issues that only vendors care about; users simply do not. The user does not care if they are using ODF or OpenXML; they only notice the format if it gives them a problem. Documents need to be able to be seamlessly shifted to the Web or from the Web to an environment where they can be edited. Vendors need to learn that the value they add to the users does not come from a file format, but comes from their business logic. Furthermore, as code increasingly gets developed within managed code with massive standard libraries (Java, .Net) or uses common open source libraries, the urge to develop custom file formats is reduced.

Pick a common file format, whether it is ODF, HTML, XML, OpenXML, CSV, or whatever is applicable to your data needs, and support it natively and seamlessly. This file format should be one that other applications in the same market, as well as related markets, support. A good example of a data format that "refuses to die" is the dBase format. Despite the fact that no one has used dBase itself in I do not know how long, the format still is around because dBase was popular enough that everyone allowed importing and exporting from it. As a result, everyone retains that dBase compatibility, and the format itself has developed a second life as a common database data sharing format. Maybe if OpenXML is as open as ODF (or nearly as open) it will achieve the same status. HTML is at that point already, but unfortunately it is not well suited for data transportation, only data display. XML in and of itself is not a very good format for data exchange, because if different vendors do not use the same schema (or interpret the schema in different ways) it becomes useless. XML really relies upon certain standards (such as RSS) to become a useful format.
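As a small illustration of supporting a common format natively, the Python sketch below writes a list of contacts out as CSV and reads it back using only the standard library. The field names and the functions export_contacts and import_contacts are assumptions invented for this example. Note that the column names act as the shared schema: the round trip only works because both sides agree on them, which is the same agreement XML needs between vendors.

```python
import csv
import io

# The agreed-upon column names act as the shared "schema" (assumed for this sketch).
FIELDS = ["name", "email", "phone"]


def export_contacts(contacts) -> str:
    """Write contacts (a list of dicts) as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(contacts)
    return buffer.getvalue()


def import_contacts(text: str):
    """Read CSV text produced by any program that uses the same columns."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


contacts = [
    {"name": "James, Justin", "email": "jj@example.com", "phone": "555-123-4567"},
    {"name": "Pat Doe", "email": "pat@example.org", "phone": "555-987-6543"},
]

csv_text = export_contacts(contacts)
round_tripped = import_contacts(csv_text)
print(csv_text)
```

The comma inside the first name is quoted automatically by the csv module, which is the kind of detail a home-grown format tends to get wrong.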

If you can, get together with other vendors to work out a common data format. Doing business based upon data format lock-in, as opposed to competing on actual feature sets and performance, hurts the users. Look at the trouble Microsoft has had over the years because of their various attempts to lock users in based on formats. It does nothing but generate bad press and bad feelings. Reducing the data format mess is a joint effort, and you cannot do it alone unless you are working in a well established market with well established standards.

J.Ja

About Justin James

Justin James is the Lead Architect for Conigent.
