It’s all about the data. Understanding this statement and implementing controls accordingly is key to protecting sensitive data and keeping track of discoverable information. Organizations are faced with a growing number of channels through which information is processed, transmitted, and stored. The traditional methods of protecting and tracking data are no longer good enough. You must take additional steps to ensure data owners and security personnel know where the data are located and where they might be sent. At a high level, this is the function of content monitoring and filtering (CMF).


There are two main reasons why the wide distribution of data, and subsequent leakage, is becoming more prevalent. First, users are getting more and more access to databases and other data repositories so they can run queries and develop custom reports. On the surface, this is a big business benefit. Business users don’t have to wait in an IS development queue for reports, and developers are freed to work on more complex and strategic projects. But this means that large amounts of information are stored in spreadsheets and other documents across the enterprise. This makes discovery and security much harder because your company’s data could be on someone’s local drive, in a network share, on a USB drive or memory stick, or even on an iPod or other media player. You get the picture.

Also, once data is removed from secure repositories, it can be sent via any number of communication paths to who knows how many destinations. This brings us to the second primary data tracking challenge: a growing number of user-accessible communication channels.

Each day, more communication channels become available across various ports and from various online services. Just think about the different channels: instant messaging, peer-to-peer networks, remote access services such as GoToMyPC, FTP, online file transfer services, e-mail, HTTP, and online disk storage services such as XDrive.

Any attempt you make to block all existing and emerging channels will only result in channel providers and end-users finding new ways to circumvent your controls. Encryption might be a possible answer, but according to Rich Mogull, a Gartner analyst,

“Although encryption of sensitive information, including to the row or attribute levels in database tables should be an eventual control, the ability of businesses to achieve this level of encryption, and properly manage it can take years to accomplish.” (Gartner Research Article G00141630, 2006)

Even if an organization already has a sophisticated, well-managed encryption solution in place, the biggest potential weakness in data leakage prevention is still information in the hands of those who actually have permission to access it. The data leak risk associated with specific data depends on four characteristics of that data: accessibility, significance, copyability, and detectability.

Accessibility – The ease with which data is accessed is a big factor in whether it’s a significant leakage concern. Data easily retrieved, manipulated, and stored in low-trust repositories are excellent targets for cybercriminals, who can access the data through enterprise search solutions, data warehousing, ERP, read-only direct database access for query functionality, and other business intelligence systems.
Significance – As the value of information increases, so do the chances that a criminal will apply the effort necessary to get it. Information becomes more attractive to an outsider if it’s easy to sell and has a big market value, if it provides the attacker with some kind of competitive advantage, or if it has social or political significance that can be leveraged.

However, significance takes on a different meaning when applied to discovery challenges. Discoverable information finds its way to a variety of distributed repositories because of how users do their work or view network stability, including:

  • Mobile user requirements for information at point of service, care, etc.
  • Management expectations that force users to work at home, on home systems
  • User perceptions of network instability, the potential for data loss or inaccessibility
  • Storage limitations and the absence of a document retention process that users see as a reason to store documents and other information on local disk, CDs, etc.

Copyability – The easier it is to copy data, the harder it is to control.
Detectability – This measures an organization’s ability to monitor for and react to anomalous use or movement of potentially discoverable data. It further gauges the extent to which users are aware that the data is being monitored.
These four characteristics of data leakage risk, and the increased effort associated with ESI discovery, can be depicted in a variation of the standard information security risk formula, shown in Figure A.

Figure A: Data Leakage Risk Formula

This isn’t meant to be a perfect model of data leak risk. It’s simply a tool to help you visualize the relationships between various leak and data distribution considerations. Using this model, we see risk is reduced by decreasing accessibility, significance, or copyability. It can also be mitigated by increasing detectability.
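Since the figure itself isn’t reproduced here, one way to sketch the relationship the model describes is risk rising with accessibility, significance, and copyability and falling as detectability increases. This is a hypothetical formulation for illustration, not Gartner’s exact formula:

```python
def data_leak_risk(accessibility: float, significance: float,
                   copyability: float, detectability: float) -> float:
    """Hypothetical scoring of the relationship described above:
    risk grows with accessibility, significance, and copyability,
    and shrinks as detectability increases. Inputs are 1-10 ratings."""
    return (accessibility * significance * copyability) / detectability

# Decreasing copyability (e.g., blocking removable storage) and
# increasing detectability (monitoring) lower the score:
baseline = data_leak_risk(8, 9, 9, 2)   # 324.0
hardened = data_leak_risk(8, 9, 3, 6)   # 36.0
```

As with the figure, the point is the direction of each relationship, not the specific numbers.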

You can decrease accessibility by improving identity and access management processes and technology, or by encrypting sensitive data. You can limit the amount of information business users can access when running their queries by denormalizing information stored in databases and storing it in special-use data warehouse repositories.

You can address copyability by using administrative controls that are well-known to the user population. You can also implement technical controls to control or prevent the use of removable storage devices.

It’s pretty hard to reduce the significance of data to an attacker. If the data doesn’t have any value to an attacker, the time and resources you applied to protecting it are better utilized somewhere else. However, IT managers can and should address user challenges that cause employees to copy information to distributed and less manageable locations.

Finally, detectability is increased through the use of tools that detect sensitive data stored in low-trust repositories and, in general, monitor the use of that data anywhere on the network.

IT and security managers can manage data detectability and copyability characteristics with CMF.

Content Monitoring and Filtering (CMF)

CMF is not a preventive control. It’s a detective safeguard that determines whether your preventive and document management policies, standards, guidelines, processes and technology are working effectively.

CMF solutions typically perform two basic tasks on a network. First, they discover and classify sensitive data on local or network storage using user-defined rules. After discovery, business policies defined by management and configured into the CMF software determine whether the data are stored in a location with controls commensurate with their classification. If the controls fall short, the software can move the data to a secure location, send an alert to an administrator, or both.
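The discovery task can be sketched in a few lines: scan storage for content matching a classification rule, then flag anything sitting outside a location approved for that classification. The SSN pattern and the approved path here are hypothetical examples, not a vendor’s actual rule set:

```python
import re
from pathlib import Path

# Hypothetical discovery rule: a U.S. Social Security number pattern.
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical business policy: locations approved to hold matching data.
APPROVED_LOCATIONS = {Path("/secure/hr")}

def discover(root: Path) -> list[tuple[Path, str]]:
    """Scan files under root; flag sensitive files stored outside
    approved locations so they can be moved or reported."""
    findings = []
    for path in root.rglob("*.txt"):
        if SSN_RULE.search(path.read_text(errors="ignore")):
            if not any(loc in path.parents for loc in APPROVED_LOCATIONS):
                findings.append((path, "SSN outside approved location"))
    return findings
```

A real product would support many file types, many rules, and remediation actions; the structure, however, is the same: rule match first, then a policy check on the data’s location.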

Second, a CMF solution should be able to look inside packets to identify sensitive data in transit. If sensitive data are detected being copied or sent over a monitored communication channel, user-defined business rules determine whether to allow the data to continue to its destination, alert an administrator, or both.
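A minimal sketch of how such in-transit rules might be evaluated follows; the channel names, the card-number pattern, and the per-channel actions are hypothetical, not any vendor’s actual configuration:

```python
import re

# Hypothetical detection rule: a 16-digit payment card number.
CREDIT_CARD = re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b")

def inspect_payload(channel: str, payload: str) -> str:
    """Apply user-defined business rules to data observed in transit.
    Returns the action the CMF software would take: allow, alert, or block."""
    if CREDIT_CARD.search(payload):
        # Example policy: block card numbers over HTTP, alert on other channels.
        return "block" if channel == "http" else "alert"
    return "allow"

inspect_payload("http", "order 4111-1111-1111-1111")   # 'block'
inspect_payload("smtp", "status report, no card data") # 'allow'
```

In practice the inspection happens inside reassembled network sessions rather than on plain strings, but the decision logic follows this shape.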

Both tasks described above enable an organization to know where its information is located, where and how it’s traveling, and whether the controls protecting it comply with business policies.

How does CMF locate/detect sensitive data?

Anyone who has used filtering software knows that false positives are an ever-present problem. Tightening the rules used to identify text reduces errors, but it can also reduce filtering effectiveness.

But a true CMF product uses multiple methods for detecting target data:

Key terms and phrases – Using key terms and phrases alone will almost certainly cause a high number of false positives. In addition, tuning the rules to eliminate them can take months and result in a crippled filtering solution.
Key policies and filters – Vendors usually provide a significant number of policies and filters out of the box, including compliance sets for regulations like HIPAA. By themselves, policies and filters can result in a lower false-positive rate. Used in isolation, however, they can miss key information that terms-and-phrases filtering might detect.

The best solution is to use a layered approach in which terms, phrases, policies, and filters work together to provide effective monitoring with a low number of false positives. In Gartner’s April 2007 CMF Magic Quadrant narrative, the following additional requirements are identified:
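The layering idea can be illustrated with a toy check that flags a document only when a structured filter and a contextual key term both fire; either method alone over- or under-matches. The pattern and terms are hypothetical examples:

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # structured filter
CONTEXT_TERMS = {"ssn", "social security", "employee"}   # key terms

def is_sensitive(text: str) -> bool:
    """Layered check: flag text only when a structured pattern AND a
    contextual key term both appear, cutting the false positives that
    either layer produces when used alone."""
    lower = text.lower()
    return bool(SSN_PATTERN.search(text)) and any(t in lower for t in CONTEXT_TERMS)

is_sensitive("SSN on file: 078-05-1120")          # True
is_sensitive("part number 123-45-6789 in stock")  # False: no context term
```

A part number that happens to match the digit pattern no longer trips the filter, which is exactly the false positive that terms-only or filters-only matching would generate.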

  • Ability to perform content aware, deep packet inspection on outbound packets using a variety of communication channels.
  • Ability to extend beyond individual packet analysis to complete session analysis.
  • Ability to use linguistic analysis to supplement simple word matching, including advanced regular expressions and document fingerprinting. (“Linguistic analysis is the process of breaking down a document to extract the important concepts and meanings it contains.”)
  • Ability to detect content in accordance with policy-based rules.
  • Ability to monitor traffic for email (minimum requirement) and other common communication channels (e.g. HTTP, IM, and FTP). Analysis must be performed across multiple channels with results accessible via a single management interface.
  • Ability to block policy violations over email.
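Document fingerprinting, mentioned in the list above, can be sketched as hashing overlapping word sequences (shingles) from a protected document and checking outbound text against those hashes. This is a simplified illustration of the idea, not any vendor’s actual algorithm:

```python
import hashlib

def fingerprint(text: str, k: int = 5) -> set[str]:
    """Hash every k-word shingle of a protected document."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(len(words) - k + 1)
    }

def contains_fragment(outbound: str, protected: set[str], k: int = 5) -> bool:
    """Flag outbound text that shares any shingle with a protected document."""
    return bool(fingerprint(outbound, k) & protected)

secret = fingerprint("the merger closes on the first of june pending board approval")
contains_fragment("fyi the merger closes on the first of june", secret)  # True
```

Because matching happens on hashed fragments rather than whole files, even a partial excerpt pasted into an e-mail can be detected.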

The final word

Many organizations have delayed CMF implementation, usually because of cost. However, if you include ESI discovery savings and risk reduction in your ROI calculation, executive management will see the value.

Two CMF solutions given high marks by Gartner are Vontu, acquired by Symantec, and Websense.