What feeds data into enterprise systems

Justin James says that by understanding the relationship between your application and its underlying sources of data, you are well on your way to writing a better application.

As the IT industry matures, the way value is added to data changes. It used to be that simply having data was the value proposition. After all, very few organizations had computers, let alone data that could easily be retrieved or searched or that was otherwise useful.

As computers became more widely adopted, ease of access became the differentiating factor. Relational database management systems (RDBMSs) enabled businesses to access data in a standardized way and allowed the database vendor to do a lot of the worrying about transaction integrity and performance, but enterprises still had to generate the data. In the current environment, data comes from all over the place, and much of it is standardized and/or commoditized. So where does all of this data actually come from, and why does it matter where it comes from? I'll answer these questions in reverse order.

It is important to know where the data comes from because much of the work on a project is often about getting the data into the system. The origin of the data affects things such as the trustworthiness of the data, how much scrubbing is needed, and what kinds of transformations may be applied. In short, where the data comes from dictates much of the work that needs to be done -- even if the data is already neatly placed in a SQL database by the time your software sees it.

You may think that the answer to the question "where does the data come from" is obvious, but it's not anymore. Even as recently as 10 years ago, much of the data came from one of a few sources. What these sources had in common was a relationship that had been established through a channel separate from the data exchange itself. There is a measure of trust that is implicit when that occurs, which is much like the way a public key is trusted in the encryption world once it has been verified through a separate channel. You trust a data vendor's weekly updates because you saw their building and know that they're not a fly-by-night company; or you trust a user because you know that Bob the systems administrator set up the user's account in person.

Many applications now use a great number of automated and untrusted data sources. When a data tape came in from a vendor and a manual process was needed to load the data, there were a lot of chances to see that the data might be bad, such as noticing a file size much different from that of the previous data tape or eyeballing the data for things that are obviously wrong. When the data comes in automatically or is linked in real time through a Web service, the opportunity for spot-checking the data is lost. Another example of this effect is that we are allowing anonymous users (or users who sign up within the application itself, with no verification of identity) to add data to the system, which is subsequently used by and shown to other users.
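The manual spot checks described above can be partly automated. What follows is only a minimal sketch, assuming the feed arrives as CSV text; the function name spot_check, the "price" column, and the 50 percent tolerance are all made up for illustration and are not taken from any particular system.

```python
import csv
import io


def spot_check(feed_text, previous_row_count, tolerance=0.5):
    """Return a list of reasons to hold a feed for human review (empty if none).

    Hypothetical helper: the "price" column and the tolerance are assumptions.
    """
    rows = list(csv.DictReader(io.StringIO(feed_text)))
    problems = []

    # Automated version of "this file is much bigger or smaller than last time".
    if previous_row_count <= 0:
        problems.append("no baseline row count to compare against")
    else:
        change = abs(len(rows) - previous_row_count) / previous_row_count
        if change > tolerance:
            problems.append(f"row count changed by {change:.0%}")

    # Automated version of eyeballing the data for obviously wrong values.
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        try:
            price = float(row["price"])
        except (KeyError, TypeError, ValueError):
            problems.append(f"line {line_number}: missing or non-numeric price")
            continue
        if price <= 0:
            problems.append(f"line {line_number}: non-positive price {price}")

    return problems
```

A check like this does not replace a person looking at the data, but it gives an automated load a chance to stop and ask for one.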

While there is nothing inherently wrong with this, it does require a number of additional layers of protection; these layers are inconvenient to build and are often skipped in the rush to ship. Since the Web application boom, we have seen the rise of SQL injection attacks, followed by cross-site scripting (XSS) and cross-site request forgery (CSRF) attacks. The prevalence of XSS and CSRF attacks has already forced security-conscious programmers to reduce functionality. Beyond security, the accuracy of the information itself needs to be verified if the data is to be used to make business decisions, not to mention the legal liability that comes with acting on bad data.
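Two of those layers are cheap enough to show. The sketch below assumes a hypothetical comments table in SQLite and uses only Python's standard library: placeholders keep user-supplied text out of the SQL statement, and escaping on output keeps stored markup from running as script in another user's browser. It does not address request forgery, which typically needs per-session tokens.

```python
import html
import sqlite3


def save_comment(conn, author, body):
    # Placeholders (?) pass the values separately from the SQL text,
    # so the input cannot change the structure of the statement.
    conn.execute("INSERT INTO comments (author, body) VALUES (?, ?)", (author, body))


def render_comments(conn):
    rows = conn.execute("SELECT author, body FROM comments ORDER BY rowid").fetchall()
    # Escape on output so stored markup is displayed as text, not executed.
    return "\n".join(
        f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"
        for author, body in rows
    )


# Hypothetical table and hostile inputs, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")
save_comment(conn, "mallory", "'); DROP TABLE comments; --")
save_comment(conn, "eve", "<script>alert(1)</script>")
page = render_comments(conn)
```

Both hostile strings end up stored and displayed as ordinary text; the table survives and the script tag never reaches the browser as markup.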

In the modern data equation, there are data initiators, such as the people providing the raw map data, companies measuring overall warehouse shipments to distributors, someone typing in the last stock sale price (or agreeing to the sale price electronically), and so on. Their data is picked up by a data aggregator, which transforms data from disparate data initiators into a "single stream of consciousness." Think of Google taking all sorts of geotagged data and putting it onto maps, IMS compiling its pharmaceutical databases, or Yahoo! Finance collecting stock prices from around the world. The data consumer gets the data from the aggregator and performs the business-specific transformations, like the Web developer embedding the Google Map onto the site to provide directions, the incentive compensation software calculating the bonus for salespeople, or the day trader making purchasing decisions.
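The three roles can be sketched in a few lines. This is a toy illustration, not a description of how any real aggregator works; the two feed shapes, the Quote type, and the function names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    symbol: str
    price: float
    source: str  # which initiator supplied the record


def aggregate(exchange_a_rows, exchange_b_rows):
    """Aggregator: normalize two differently shaped feeds into one stream."""
    quotes = []
    # Hypothetical initiator A sends (symbol, price) pairs with prices as text.
    for symbol, price in exchange_a_rows:
        quotes.append(Quote(symbol.upper(), float(price), "exchange_a"))
    # Hypothetical initiator B sends dicts with prices in cents.
    for row in exchange_b_rows:
        quotes.append(Quote(row["ticker"].upper(), row["cents"] / 100, "exchange_b"))
    return quotes


def portfolio_value(quotes, holdings):
    """Consumer: apply business-specific logic to the aggregated stream."""
    latest = {q.symbol: q.price for q in quotes}  # later quotes override earlier ones
    return sum(latest[symbol] * shares for symbol, shares in holdings.items())
```

The consumer never sees the two original formats; it programs against Quote alone, while the source field preserves where each record came from.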

Developers can still add value, and companies can still make money, at every level of this ecosystem.

The initiators can be very software centric. Better software can make the data cleaner, more accurate, and more plentiful, which means that the initiator can make more money. After all, much of that data is manually created. Software with better usability or a wider user base or that is easier to hook up to an aggregator will rise to the top.

The aggregators have the boring work, but there is a great deal of money in it. The work is such a pain in the neck that the market is typically dominated by only a few of these companies. For the data consumer, the value added is that they need to program against only one data feed and possibly join the data tables to a source identification table. The data initiators are saved the hassle of trying to sell their data to thousands of customers; instead, they cut a deal with a few aggregators. Sure, they make a bit less money, but it is worth not having to manage thousands of small customers. The consumers pay more for the data, but it is still cheaper than paying a developer to integrate every source directly -- and it's more reliable.
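Joining a feed to a source identification table might look something like the following. The schema here is hypothetical (a sources table with a trusted flag and a prices table that references it), and SQLite stands in for whatever database the consumer actually uses.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sources (source_id INTEGER PRIMARY KEY, name TEXT, trusted INTEGER);
    CREATE TABLE prices (
        symbol TEXT,
        price REAL,
        source_id INTEGER REFERENCES sources(source_id)
    );
    INSERT INTO sources VALUES (1, 'Exchange A', 1), (2, 'User submissions', 0);
    INSERT INTO prices VALUES ('ACME', 12.50, 1), ('ACME', 99.00, 2), ('INIT', 3.10, 1);
""")


def prices_with_source(conn, trusted_only=False):
    """Return (symbol, price, source name) rows, optionally from trusted sources only."""
    query = """
        SELECT p.symbol, p.price, s.name
        FROM prices AS p
        JOIN sources AS s ON s.source_id = p.source_id
        WHERE s.trusted >= ?
        ORDER BY p.symbol, s.name
    """
    return conn.execute(query, (1 if trusted_only else 0,)).fetchall()
```

Keeping the source alongside each row means that questions of trust and scrubbing can still be answered after the data has landed neatly in a SQL table.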

The consumers' developers potentially have the most interesting work. I say potentially because, all too often, the software that gets put in front of users is barely more than a database browser that enforces a few business rules. There is a wide gap between business rules and business logic. There are billions of dollars to be made by selling software that promises to tame the data from dozens of aggregators (turning the enterprise into a meta-aggregator or a mega-aggregator) and perform business logic on it. This is what all of these reporting portals, business intelligence suites, analytic tools, ERP systems, CRM systems, and so on are about. Anyone who has been involved with these initiatives can tell you how difficult it is to make one successful. After all, the project is so large that by the time you are finished implementing it, all of the underlying requirements and logic have changed anyway.

My personal preference is to work on computational rather than transactional systems; however, the bulk of available work is transactional. By understanding the relationship between your application and its underlying sources of data, you are well on your way to writing a better application.