
Application Design: Classifying and representing data in service-oriented applications

Get a high-level look at the types of data that service-oriented architectures use, and see how that data, both inside and outside a service, can be represented. This article will help clarify matters as you begin to consider SOA in your organization.


Perhaps I'm just getting older, but in days gone by, the IT world was a simpler place. After all, when big iron ruled, the applications that the IT department developed shared a single data source, and all the clients for those applications shared the same code base and platform. There were no deployment issues, and it was easy to share data between applications since that data simply existed in an easily accessible file or table on the mainframe.

Even in the heady days of client-server computing, the fat client Windows applications that the IT folks built in PowerBuilder or Visual Basic shared a single relational database (the vaunted "enterprise data model") with service desktops within the organization. This architecture allowed the applications to share data through transactions that updated another application's data.

Well, the world is changing. The need to share data both within and between organizations in our ever-more-connected world, scale our applications out on commodity hardware, and support heterogeneous clients from Windows desktops to other Web sites, PDAs, and Tablet PCs has muddied this once simple picture.

How do we bring our applications into this brave new world? Enter service-oriented architecture (SOA). In its simplest form, SOA calls for the functionality of an organization (the functionality that used to be contained in individual applications) to be factored into a set of business services. These services communicate with consumers and other internal and external services by receiving service-request messages ("I want to know X") and sending service-response messages ("Here is the answer Y"). Not surprisingly, these messages are represented using XML and SOAP in order to be platform- and device-agnostic.

However, when architects and developers begin thinking about designing and implementing an SOA, questions about what data services should share, how that data is accessed and maintained, and how it is represented come to the forefront. In this article, I'll present one high-level look at the types of data that an SOA will use and how that data—both inside and outside a service—can be represented. This article will help you sharpen your thinking as you begin to consider SOA in your organization.

What kinds of data do I need to represent?
When building an SOA, there are essentially four kinds of data you'll need to be concerned with. Going from the outside to the inside of the service, these are: message data, lookup data, process data, and business data. In the following sections, I'll define these types and their characteristics and provide some pointers when dealing with this data.

Message data
The data that flows between services is known as message data. This data identifies the task or operation the consumer of the service wants to perform (the request) and the result the consumer receives from the service (the response). This is the only type of data that flows between services and therefore represents the public interface of the service. It allows SOAs to be platform-independent.

So, by definition, message data requires an open schema so that consumers can discover how to formulate a request and process a response. Message data is therefore relatively stable in its definition of the operations exposed by the service; but when changes occur, it must be versioned. In addition, message data is immutable in that once the message is written, it is never modified.

Services deal with message data by generating a unique identifier for each message as well as a timestamp, version identifier, conversation identifier, and sequence number within the conversation (along with any security tokens required, of course). This additional information allows services to use the timestamp to discard messages that are not delivered in a timely fashion and to ensure once-only processing of messages in the appropriate order. The service should store all request and response messages so that, when a message arrives more than once, it can return the previously generated and now cached response (which also improves performance, by the way). Message data is often used to move lookup data.
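To make the bookkeeping concrete, here is a minimal Python sketch of a message envelope and a service that rejects stale or out-of-order messages and replays the cached response for duplicates. The class names, field names, and rejection strings are my own assumptions for illustration; a real service would carry this information in SOAP headers rather than in Python objects.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen: once written, a message is never modified
class Message:
    conversation_id: str
    sequence: int
    body: str
    version: str = "1.0"
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


class Service:
    """Hypothetical service that processes each message at most once."""

    def __init__(self, max_age_seconds=300.0):
        self.max_age_seconds = max_age_seconds
        self.responses = {}      # message_id -> cached response
        self.next_sequence = {}  # conversation_id -> next expected sequence

    def handle(self, msg, now=None):
        now = time.time() if now is None else now
        # Duplicate delivery: return the stored response without reprocessing.
        if msg.message_id in self.responses:
            return self.responses[msg.message_id]
        # Stale message: discard it based on the timestamp.
        if now - msg.timestamp > self.max_age_seconds:
            return "rejected: expired"
        # Enforce ordering within the conversation (sequences start at 1).
        expected = self.next_sequence.get(msg.conversation_id, 1)
        if msg.sequence != expected:
            return "rejected: out of order"
        response = f"processed: {msg.body}"
        self.next_sequence[msg.conversation_id] = expected + 1
        self.responses[msg.message_id] = response
        return response
```

Sending the same Message object to handle twice returns the same response both times, and the conversation's expected sequence number advances only once.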

Lookup data
The second type of data that services deal with is lookup data. This type of data is used to pass parameters to operations in service requests or to interpret the data returned in a service response. As a result, a consumer may request lookup data and then use it to help formulate a request message. For example, a company that provides technical training may publish lookup data consisting of its training locations and vendors so that consumers can pass valid values to the service when requesting a course schedule. As you might expect, lookup data is relatively static but should be versioned when it changes, resulting in each version of the lookup data being immutable. Like message data, lookup data is used external to the service and so requires an open schema.

Services can deal with lookup data by identifying each item uniquely and stamping the item with a version identifier; for example, Quilogy - LocationCodes - v012004. In this way, when consumers make requests, they pass the version of the reference data used so that the service can create the proper response. Since lookup data eventually requires refreshing, typically at defined intervals, the service publishes the new version to interested subscribers in either a push (using e-mail, HTTP, or even DVD) or pull (using HTTP, FTP, e-mail) fashion.
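As a rough illustration of versioned lookup data, the following Python sketch keeps every published version immutable and validates a consumer's value against the version the consumer says it used. The LookupPublisher class and the location codes are hypothetical; only the version label format comes from the example above.

```python
class LookupPublisher:
    """Hypothetical holder of versioned, immutable lookup data."""

    def __init__(self):
        self._versions = {}  # version id -> tuple of codes (never changed)
        self.current = None  # most recently published version id

    def publish(self, version, codes):
        # A published version is immutable, so republishing is an error.
        if version in self._versions:
            raise ValueError(f"version already published: {version}")
        self._versions[version] = tuple(codes)
        self.current = version

    def get(self, version=None):
        # Consumers pull either a specific version or the current one.
        version = self.current if version is None else version
        return version, self._versions[version]

    def is_valid(self, version, code):
        # Validate a request value against the version the consumer used.
        return code in self._versions[version]


publisher = LookupPublisher()
publisher.publish("Quilogy - LocationCodes - v012004", ["KC", "STL"])
publisher.publish("Quilogy - LocationCodes - v022004", ["KC", "OMA"])
```

A request built against the January codes can still be validated after the February codes are published, because the older version remains available unchanged.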

Process data
Within a service, the first type of data you'll deal with is process data. Unlike message and lookup data, process data is private to the service and is entirely encapsulated within it, so it doesn't require an open schema. Process data represents the business process or function that is being performed by the service. Typically, these are long-running operations that are eventually completed. Examples of process data include shopping baskets, purchase orders, and invoices. Process data is client- or conversation-dependent, and so it has low concurrency requirements since it will only be accessed serially by a single client. Unlike lookup data, it is also updateable during the time the operation is active and typically becomes read-only once the operation has completed. It is subsequently referenced less frequently as time goes on.

Services process request messages and take actions that build up process data within the service for the conversation. Therefore, they must correlate the process data with the conversation identifier. A service can use a variety of techniques to manipulate process data. For example, during the time the process is active, the service can encapsulate the process data in an object and cache it in memory. Services must also be able to clean up aborted or abandoned processes.
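One way to picture this is a small in-memory store keyed by conversation identifier, with a sweep that discards processes nobody has touched for a while. This Python sketch is only an assumption about how such a cache might look; the ProcessStore name, the dictionary-based basket, and the timeout value are all hypothetical.

```python
import time


class ProcessStore:
    """Hypothetical in-memory cache of process data per conversation."""

    def __init__(self, idle_timeout=1800.0):
        self.idle_timeout = idle_timeout
        self._active = {}  # conversation_id -> (process data, last touched)

    def get_or_create(self, conversation_id, now=None):
        now = time.time() if now is None else now
        # Correlate the process data with the conversation identifier.
        data, _ = self._active.get(conversation_id, ({"items": []}, now))
        self._active[conversation_id] = (data, now)  # refresh last-touched
        return data

    def purge_abandoned(self, now=None):
        now = time.time() if now is None else now
        # Clean up processes that have been idle longer than the timeout.
        stale = [cid for cid, (_, touched) in self._active.items()
                 if now - touched > self.idle_timeout]
        for cid in stale:
            del self._active[cid]
        return stale
```

A production service would persist the data before evicting it rather than simply dropping it, but the correlation and cleanup steps would look much the same.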

Business data
The final type of data used in an SOA is business data. This is the kind of data most people think of when talking about applications—customer information, product inventory, and bank accounts, for instance. Like process data, business data is also private to the service and does not require an open schema. However, unlike process data, it lives longer than a single long-running operation and has high concurrency requirements since it may be changed by several operations simultaneously. As a result, business data is very volatile and transactional by nature.

Perhaps the most important way that services interact with business data is by adhering to the principle of a single owner. In other words, services take ownership of a specific part of the data for the organization. For example, one service owns customer data while another owns employee data. When data needs to be shared, the owning service publishes changes to other services, which then cache the data for internal use. Each publication of a service's data should, of course, be versioned with an incrementing identifier. Only the owning service updates the data. So if a nonowner requests a change, the owning service will make the change and republish the data to interested subscribers.
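The single-owner principle can be sketched in a few lines of Python. Here a hypothetical CustomerService owns customer data and an equally hypothetical OrderService keeps a versioned cache of it; the direct method calls stand in for the request and publication messages that real services would exchange.

```python
class CustomerService:
    """Hypothetical owner of customer data: the only writer."""

    def __init__(self):
        self._customers = {}
        self.version = 0        # incremented on every publication
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def request_change(self, customer_id, name):
        # The owner applies the change, then republishes to subscribers.
        self._customers[customer_id] = name
        self.version += 1
        for notify in self._subscribers:
            notify(self.version, dict(self._customers))  # copy per subscriber


class OrderService:
    """Hypothetical nonowner that caches published customer data."""

    def __init__(self, owner):
        self._owner = owner
        self.cached_version = 0
        self.customer_cache = {}
        owner.subscribe(self._on_publish)

    def _on_publish(self, version, snapshot):
        # Accept only publications newer than what is already cached.
        if version > self.cached_version:
            self.cached_version = version
            self.customer_cache = snapshot

    def rename_customer(self, customer_id, name):
        # A nonowner never writes directly; it asks the owner to change it.
        self._owner.request_change(customer_id, name)
```

Because each subscriber receives its own copy, nothing the order service does to its cache can alter the owner's data.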

How do I work with and store the types of data?
Once the types of data have been defined, you'll start to wonder what technologies you might use to work with and store the data. In short, the big three technologies (XML, objects, and SQL) form the core of how this will be done.

Message data
Because message data requires an open schema and must support heterogeneous consumers, it is best modeled as XML carried in SOAP messages. You can then publish the schema for the messages using WSDL. In addition, some of the features of message data that you'll want to implement (time stamping and security, for instance) are covered by the WS-* family of SOAP specifications, such as WS-Security, and implemented in toolkits such as the Web Services Enhancements (WSE) 2.0 technology from Microsoft.

Within a service, message data can be stored as XML in a relational database, alongside its identifying attributes, in order to implement idempotence. This arrangement works well because you'll likely want to query on the attributes, while the message's immutability means you'll rarely need to read and parse its content. In addition, storing the messages allows for auditing and future analysis. However, message data by definition has the shortest shelf life and so can be more readily archived if not used for analysis.
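The storage idea can be sketched with Python's built-in sqlite3 module standing in for the relational database. The table name, column names, and the handle_once helper are assumptions made for this illustration; the point is that the identifying attributes are ordinary queryable columns while the XML itself is stored as an opaque string.

```python
import sqlite3


def open_message_log():
    # In-memory database for the sketch; a real service would use a server.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE message_log (
               message_id      TEXT PRIMARY KEY,
               conversation_id TEXT NOT NULL,
               received_at     TEXT NOT NULL,
               request_xml     TEXT NOT NULL,
               response_xml    TEXT NOT NULL)"""
    )
    return conn


def handle_once(conn, message_id, conversation_id, received_at,
                request_xml, process):
    # Look up the message by its identifier; the XML is never parsed here.
    row = conn.execute(
        "SELECT response_xml FROM message_log WHERE message_id = ?",
        (message_id,),
    ).fetchone()
    if row is not None:
        return row[0]  # repeat delivery: replay the stored response
    response_xml = process(request_xml)
    with conn:  # commit the request and response together
        conn.execute(
            "INSERT INTO message_log VALUES (?, ?, ?, ?, ?)",
            (message_id, conversation_id, received_at,
             request_xml, response_xml),
        )
    return response_xml
```

The same table doubles as an audit trail, since it can be queried by conversation or by time of receipt without touching the message bodies.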

One other consideration when thinking about message data involves what happens when messages are sent between services. In an organization where the message structures used by multiple services evolved independently, it makes sense to create a canonical or standard schema into which a message can be translated before being consumed by a service. This avoids the problem where each service represents the same piece of lookup or business data in its own way, and it requires far fewer transformations to be built.
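A small Python sketch shows what translating into a canonical schema might look like, using the standard library's xml.etree.ElementTree module. The two source formats, their element names, and the canonical Customer element are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from each service's element names to canonical ones.
FIELD_MAPS = {
    "billing": {"CustNo": "CustomerId", "Nm": "Name"},
    "shipping": {"customer_id": "CustomerId", "full_name": "Name"},
}


def to_canonical(source, xml_text):
    """Translate a service-specific customer message to the canonical form."""
    mapping = FIELD_MAPS[source]
    original = ET.fromstring(xml_text)
    canonical = ET.Element("Customer")
    for child in original:
        # Copy only the elements the canonical schema knows about.
        if child.tag in mapping:
            ET.SubElement(canonical, mapping[child.tag]).text = child.text
    return ET.tostring(canonical, encoding="unicode")


billing_msg = "<Cust><CustNo>42</CustNo><Nm>Ann</Nm></Cust>"
shipping_msg = ("<recipient><customer_id>42</customer_id>"
                "<full_name>Ann</full_name></recipient>")
```

Both sample messages translate to the same canonical document, so a consuming service needs to understand only one representation. Adding a new service means writing one new mapping rather than one per service it talks to.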

Lookup data
Since lookup data is public and requires an open schema, it, too, should be modeled as XML. A good choice here is to publish the schema of the lookup data using WSDL while specifying version information. From a database perspective, to increase performance, the lookup data can be stored as XML in a relational table since it will be written only once and retrieved directly as XML. In addition, storing lookup data in a relational table allows for easy versioning. Internally, the service may also cache the current lookup data in memory to reduce the response time when lookup data is requested by a consumer.

Process data
Because process data is private to the service, it needn't be published with XSD or represented as XML. As a result, it is typically stored in normalized relational database tables and encapsulated within the service using an object-persistence layer such as the ObjectSpaces framework that will ship in Visual Studio .NET "Whidbey." This approach allows the service to manipulate the process data in memory as a full-fidelity object and take advantage of caching technology such as the ASP.NET caching engine. When the process completes or is in a wait state, the object can be serialized to the database using the object-persistence layer.

Since process data is updated serially during a single conversation, it can be accessed using optimistic concurrency. You'll want to store historical process data for analysis; however, the older it gets, the more likely you'll be able to archive it.
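Optimistic concurrency is commonly implemented with a version column that every update checks and increments. The Python sketch below uses sqlite3 as a stand-in database; the basket table, its columns, and the helper functions are hypothetical, and an object-persistence layer would normally generate the equivalent SQL for you.

```python
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE basket (
               conversation_id TEXT PRIMARY KEY,
               items           TEXT NOT NULL,
               row_version     INTEGER NOT NULL)"""
    )
    return conn


def load(conn, conversation_id):
    # Read the data together with the version it was read at; no lock held.
    return conn.execute(
        "SELECT items, row_version FROM basket WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()


def save(conn, conversation_id, items, expected_version):
    # The update succeeds only if nobody changed the row since it was read.
    with conn:
        cur = conn.execute(
            """UPDATE basket
                  SET items = ?, row_version = row_version + 1
                WHERE conversation_id = ? AND row_version = ?""",
            (items, conversation_id, expected_version),
        )
    return cur.rowcount == 1  # False signals a conflict to the caller
```

Because a single conversation updates its process data serially, the conflict branch should rarely fire, which is what makes the optimistic approach inexpensive here.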

Business data
Like process data, business data is private to the service and so is not represented as XML, except of course insofar as parts of it are returned in response messages. For these parts, canonical XSDs can be created and published so that other services can interpret the data owned by the service.

Business data is therefore stored in normalized relational tables and encapsulated via components managed by a transaction manager such as Component Services (COM+) in order to ensure pessimistic concurrency (locking) during the transaction. The components themselves provide stateless access to the data because of its high volatility and concurrency. Business data is also used in analysis coupled with process data.
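For contrast with process data, here is a minimal Python sketch of a pessimistic, transactional update. It uses sqlite3, whose BEGIN IMMEDIATE statement takes the database write lock at the start of the transaction, as a simplified stand-in for the locking a transaction manager such as COM+ would coordinate. The account table and the transfer function are hypothetical.

```python
import sqlite3


def open_accounts():
    # isolation_level=None lets the sketch issue BEGIN and COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    return conn


def transfer(conn, from_id, to_id, amount):
    """Stateless operation: everything it needs arrives as arguments."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
    try:
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE id = ?", (from_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?",
            (amount, from_id),
        )
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?",
            (amount, to_id),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo everything if any step fails
        raise
```

Either both balances change or neither does, and other writers must wait until the transaction ends.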

One model
Figure A summarizes the four types of data used in an SOA and how each might be represented and processed. I hope this discussion gives you a mental map on which you can place your organization's data as you begin to think about implementing an SOA.

Figure A
A conceptual model of an SOA

Dan Fox is a technical director for Quilogy in Overland Park, KS, where he evangelizes technology through writing and speaking at events such as Tech Ed and Developer Days.
