Data Management

Indexing technology helps data aggregator optimize human resources

The Web represents a quantum leap in the availability of information, but managing and organizing reams of published material can be a substantial headache. Learn the solution one data aggregator found.


Gale Group, a data aggregator for libraries, schools, and businesses, adds some 30,000 items every day to its databases of articles from disparate sources, such as academic journals and reference book entries. But the idea of sending human indexers to deal with the tidal wave of articles gave the enterprise pause; it knew the staff simply wouldn’t be able to keep pace.

Greater availability of information wasn’t the only pressure on the Foster City, CA, enterprise, explained John Joyner, director of editorial automation and systems analysis for Gale Group. In July, the company partnered with another data aggregator, Ingenta, a deal that added some 5,000 periodicals. The company, whose competitors include Lexis-Nexis (though it’s also a partner), EBSCO, UMI, and ProQuest, clearly needed to find a way to automate the basic indexing and categorization of articles.

Realizing it needed technology to augment the staff indexing effort, the company decided to deploy Verity’s K2 Enterprise search-and-categorization software, an application that Joyner said gives his enterprise the best of what technology and human effort have to offer.

The indexing scenario
When an article comes into the Gale system, an indexer reads it and synthesizes the information, culling keywords that relate to the taxonomy of subject headings standardized by the Library of Congress. The initial part of this process is easy: traditionally the headline, deck, and first paragraph of any article contain the logical keywords used to index an article. Anyone typing in a keyword for a subsequent search gets a range of articles that are most likely to relate to what they’re searching for.

The human brain is necessary in the process, explained Joyner, as an article can relate to a concept without ever actually using a specific word to describe that concept. For example, an article like the one you’re reading is about a software installation, though that specific phrase might never actually be used.

For Gale, providing this kind of precision is a competitive advantage. “It’s how we differentiate ourselves from our competitors or a Web search engine,” Joyner said.

Yet there's an inescapable reality when it comes to labor and cost issues, and that’s where technology comes into play.

“We’re not under the illusion we can replace our indexers,” said Joyner, who started at the company as an indexer a decade ago. “We want K2 to approximate the indexing that our workers do.”

Gale’s indexers create what’s known as a “topic,” in Verity’s vernacular, for each article. The topic contains keywords and rules that apply to the article and are matched against search criteria for highest accuracy. A topic is a complex query consisting of many weighted subqueries that identifies whether an article is about a particular topic, Joyner explained.

K2’s query language lets users express many different parameters, he added, such as weighting a word highly, but not when it’s part of a particular phrase, or weighting a phrase higher if it’s closer to the beginning of the article. For instance, a user searching for articles about the petroleum or oil industries shouldn’t get parenting articles discussing petroleum jelly and baby oil.
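To make the idea of a weighted topic concrete, here is a minimal Python sketch. It is not Verity's query language, whose syntax this article does not show. The function names, the weights, the 50-word "early" window, and the boost factor are all hypothetical, chosen only to illustrate the two rules Joyner describes: ignore a word when it sits inside a particular phrase, and count a match more heavily when it appears near the start of the article.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def find_phrase(tokens, words):
    """Return every start index at which the word list occurs in tokens."""
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def score_topic(text, terms, excluded_phrases, early_window=50, early_boost=2.0):
    """Sum weighted term matches, skipping matches inside excluded phrases.

    terms maps a word or phrase to a hypothetical weight. A match that
    starts within the first early_window tokens is multiplied by early_boost.
    """
    tokens = tokenize(text)

    # Mark every token position covered by an excluded phrase.
    masked = set()
    for phrase in excluded_phrases:
        words = tokenize(phrase)
        for start in find_phrase(tokens, words):
            masked.update(range(start, start + len(words)))

    score = 0.0
    for term, weight in terms.items():
        words = tokenize(term)
        for start in find_phrase(tokens, words):
            if any(pos in masked for pos in range(start, start + len(words))):
                continue  # e.g. "oil" inside "baby oil" does not count
            boost = early_boost if start < early_window else 1.0
            score += weight * boost
    return score

# Hypothetical "oil industry" topic, for illustration only.
oil_terms = {"oil": 1.0, "petroleum": 1.0, "oil industry": 2.0}
excluded = ["petroleum jelly", "baby oil"]

business = "Oil industry profits rose as petroleum prices climbed."
parenting = "Soothe diaper rash with petroleum jelly or a little baby oil."

business_score = score_topic(business, oil_terms, excluded)
parenting_score = score_topic(parenting, oil_terms, excluded)
print(business_score, parenting_score)
```

In this toy setup the business sentence scores well above zero, while the parenting sentence scores nothing because every match for "oil" and "petroleum" falls inside an excluded phrase.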

“We found in our initial evaluation last year that an editor, in less than a day, could write a Verity topic that could rival something we get from manual indexing for certain topics,” Joyner said.

Additionally, Gale can use K2 for automatic indexing as the initial step for any article. Most categorization tools involve a statistical component that tallies the number of times a word appears in a story and how close that word appears to the article's beginning. That’s why search results show a percentage next to an article: it estimates how likely the article is to contain what the user needs.
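The statistical idea can be sketched in a few lines of Python. The formula below is an assumption made up for illustration, not the one K2 or any other product uses: each occurrence of the keyword counts for more the earlier it appears, and the total is squashed into a 0 to 100 range so it can be shown as a percentage. The function name `relevance_percent` is hypothetical.

```python
import re

def relevance_percent(text, keyword):
    """Return a 0-100 score from keyword frequency and position.

    Illustrative formula only: an occurrence at the very first token
    contributes 1.0, and the contribution falls linearly toward 0 at
    the end of the text. The sum is mapped to a percentage with
    raw / (raw + 1), so more and earlier matches push it toward 100.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    target = keyword.lower()
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    if not positions:
        return 0.0
    raw = sum(1.0 - i / len(tokens) for i in positions)
    return round(100.0 * raw / (raw + 1.0), 1)

early = relevance_percent("Oil prices rose today as markets reacted", "oil")
late = relevance_percent("Markets reacted today as prices rose for oil", "oil")
absent = relevance_percent("Markets reacted today", "oil")
print(early, late, absent)
```

The same word in the same number of occurrences scores higher when it opens the text than when it closes it, which is the behavior the paragraph above describes. A score of this kind is a ranking heuristic rather than a true probability.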

Unique feature sealed the deal
One of K2’s facets that intrigued Joyner was a statistical component called the logistic regression classifier (LRC), which can automatically translate its results into a Verity topic that editors can then treat like any other rule-based topic.
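The article does not describe how LRC works internally. As a loose illustration of how a regression-style classifier can produce per-word weights that an editor could carry over into a rule-based topic, here is a minimal sketch using textbook logistic regression trained by gradient descent. The training sentences, labels, function names, and hyperparameters are all invented for the example, and nothing here reflects Verity's actual implementation.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(docs, labels, epochs=300, lr=0.5):
    """Fit logistic regression on word-presence features (batch gradient descent)."""
    features = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*features))
    weights = {w: 0.0 for w in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = {w: 0.0 for w in vocab}
        grad_b = 0.0
        for words, y in zip(features, labels):
            z = bias + sum(weights[w] for w in words)
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of the topic
            err = p - y
            grad_b += err
            for w in words:
                grad_w[w] += err
        bias -= lr * grad_b / n
        for w in vocab:
            weights[w] -= lr * grad_w[w] / n
    return weights, bias

def predict(text, weights, bias):
    """Return the model's probability that the text belongs to the topic."""
    z = bias + sum(weights.get(w, 0.0) for w in set(tokenize(text)))
    return 1.0 / (1.0 + math.exp(-z))

def top_terms(weights, k):
    """Return the k highest-weighted words: candidate terms for a hand-edited topic."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Invented training set: 1 = about the oil industry, 0 = not.
docs = [
    "oil industry profits rise",
    "petroleum exports boost oil industry",
    "crude oil prices and petroleum refiners",
    "baby oil for infant skin care",
    "petroleum jelly soothes infant skin",
    "infant sleep and feeding tips",
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(docs, labels)
print(top_terms(weights, 3))
print(predict("industry profits", weights, bias))
print(predict("infant skin care", weights, bias))
```

Words that appear only in on-topic examples, such as "industry", end up with positive weights, and words that appear only in off-topic examples, such as "infant", end up with negative ones. A list of weighted words like this is the kind of output that could be reviewed and refined by a human indexer.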

By running K2 against the onslaught of new articles, Gale was able to offload some of the indexing that previously only a human could do.

“We get the best of both worlds,” said Joyner, noting that K2 can handle the initial simple process, followed by an indexer who adds more complicated aspects.

“If an article relates to a particularly arcane aspect of accounting law that relates to oil companies, it’s simple to ensure that the phrase ‘oil companies’ comes up in a search. It’s harder to ensure that the phrase ‘arcane accounting’ does,” Joyner added.

Initially, Gale Group will be using Verity in its Infotrac division, which aggregates newspapers and magazine articles, and then will be adding the capabilities in its Gale Research division, which publishes reference books. Ultimately the company wants to be able to offer intelligent searches that cross both divisions, so that a search on Albert Einstein would bring up not only periodical references but also biographical entries in reference books.

Though the K2 Enterprise software, which costs about $170,000, won’t enable the company to reduce headcount, it does make its human capital more efficient and enables the company to tackle even more indexing projects.

“We have to keep looking at ways to improve and keep up with the information explosion,” said Joyner, “as well as keep up in a market space where periodical databases competition is pretty stiff.”
