Editor's note: When this article published, the product name was Amplify; the product name is now OpenAmplify. This article has been updated to reflect the name change.
Hapax recently introduced OpenAmplify, a natural language processing (NLP) Web service that can parse documents and blocks of text to derive their meaning. NLP is a topic that has always gotten the geeky part of my brain pretty excited, so I was glad to talk to the folks at Hapax and to try out OpenAmplify for myself. First, I spoke with Hapax CEO Mark Redgrave and Hapax CIO Mike Petit; then, I put the OpenAmplify service through its paces.
OpenAmplify is a Web service, but unlike some of the Web services I have been dealing with recently (namely, Exchange Web Services), it is extraordinarily lightweight. A typical call to the OpenAmplify service requires only two pieces of information: your OpenAmplify license key and either the URL of the document to be processed or a block of URL-encoded text. You can call OpenAmplify as a REST service with a GET or POST, or you can make a SOAP request to it. When making your request, you can narrow the results to certain search terms and result sections. One thing that disappointed me a bit is that you cannot point OpenAmplify to a PDF or Word document; if you want to process text from a PDF or Word source, you need to extract the text yourself and pass it to OpenAmplify.
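To make the "two pieces of information" concrete, here is a minimal sketch of how such a GET request could be assembled. The endpoint and the parameter names (`apikey`, `sourceurl`, `inputtext`, `outputformat`) are placeholders invented for illustration, not the documented OpenAmplify API; consult the service's own documentation for the real ones.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/openamplify"


def build_request_url(api_key, *, source_url=None, text=None, output="json"):
    """Build a GET URL carrying the license key plus either a document URL or raw text."""
    # Exactly one of the two inputs must be supplied.
    if (source_url is None) == (text is None):
        raise ValueError("pass exactly one of source_url or text")
    params = {"apikey": api_key, "outputformat": output}
    if source_url is not None:
        params["sourceurl"] = source_url
    else:
        params["inputtext"] = text
    # urlencode takes care of URL-encoding the text block.
    return BASE_URL + "?" + urlencode(params)


print(build_request_url("MY-KEY", text="The brakes on this car are awful."))
```

The sketch only builds the URL; sending it with any HTTP client and parsing the response is left out because the response layout is not specified here.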
Understanding OpenAmplify's results
The results come back as a very minimalist XML document, DoubleClick's DART format, or JSON (your choice). The results are split into four major areas: topics, actions, demographics, and style. Each result is scored in two ways: a name (which represents a broad range of underlying numeric values) and a scalar modifier to it. Sometimes the scalar value "emphasizes" the named value. For example, "negative" and "-0.8" for the "polarity" on a topic means that the author is very negative about that topic (while a -0.01 would barely be negative). Other items use the scalar value to indicate certainty. The demographics may say "male," but a scalar value of "0.001" might indicate that the system is barely leaning towards "male" as opposed to "neutral." This combination gives OpenAmplify both a convenient, broad value for human consumption and a highly granular value for aggregate calculations and other numeric uses.
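As a rough illustration of how a name-plus-scalar pair might be consumed, the sketch below turns a pair into a short phrase and averages scalars across documents. The function name and the 0.5 and 0.1 cut-offs are my own arbitrary choices, not values OpenAmplify defines.

```python
from statistics import mean


def describe(label, scalar, strong=0.5, weak=0.1):
    """Turn a (name, scalar) pair into a short human-readable phrase.

    The strong/weak thresholds are arbitrary choices for this sketch.
    """
    magnitude = abs(scalar)
    if magnitude >= strong:
        return f"strongly {label}"
    if magnitude < weak:
        return f"barely {label}"
    return label


print(describe("negative", -0.8))   # strongly negative
print(describe("negative", -0.01))  # barely negative
print(describe("male", 0.001))      # barely male

# The scalar is what you would feed into aggregate calculations,
# e.g. the average polarity of one topic across several documents.
polarities = [-0.8, -0.01, 0.4]
print(mean(polarities))
```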
"Topics" tells you what items are discussed in the text, how much of the text focuses on them, whether the author is offering or asking for advice on the topic, and whether the author is positive or negative about them. Topics are organized into "domains" (e.g., "brakes" and "tires" would be in the "automobile" domain), "locations," and proper nouns. The "actions" results correspond to the verbs in the document and show their temporality (past, present, future), the "decisiveness" of the verbs, and whether the author is offering or asking for advice or information on the action. The "demographics" section has basic information about the author: age, gender, and education level. Mark and Mike described a snapshot of how these items are calculated, and they are really taking a lot of factors into account. The "style" section lets you know whether the document contains a lot of slang and how "flamboyant" the author is. "Flamboyance" is a measure of things such as the complexity of the sentence structure, the vocabulary used, and so on.
Putting OpenAmplify to the test
When I put OpenAmplify to work, I couldn't help but barrage it with a wide variety of items from my favorite author: myself. To be fair, I also passed it some other items, mostly postings from various bloggers regarding different topics, writing styles, and quality. For my own work, I felt flattered (to say the least) by OpenAmplify's results in "demographics" and "style." For other authors, I generally agreed with the demographics and style results, and the verifiable items (gender, age) were correct when they were not "neutral." Overall, the topics were spot-on as well.
The only flaws I noticed were that it occasionally did not get my attitude towards a topic correct, and sometimes it thought that I was not offering or requesting guidance when I was, particularly on "actions." I also noticed that text I thought was simple and normal was often given a high "flamboyance" rating and a high education rating, which has me extremely concerned about what is considered "average" writing quality.
As is to be expected from an application performing such computationally intense work, OpenAmplify is not at "ultra real-time" speed, but it is not "slow" either. It took well under the magic 10-second threshold to respond when I pointed it at a very lengthy paper of mine (12 printed pages) on a difficult topic. It made short work of blogs, news articles, and so on.
Praising OpenAmplify's documentation
OpenAmplify documentation is excellent. The "quick start" information on its site had me up and running in less than a minute. The full documentation did an excellent job of explaining what the various result items meant with examples that made sense.
What I liked best is that, while Hapax employs a ton of super-smart language experts, the API documentation would make sense to anyone -- even someone without a programming background! I feel that I could give the API docs to, say, an office manager and ask him to put a few documents through OpenAmplify, and he could tell me the "top ten topics" or the typical education level of those documents.
Running on Amazon's EC2 platform
OpenAmplify is Linux based and is written in C++, and it is run on Amazon's EC2 platform (to the best of my knowledge, this is the first time I have ever used an app running on that system). There is no need to "train" the system (unlike many other NLP systems out there) because OpenAmplify does not take a statistics-based approach. As a result, users are not able to upload or provide personal ontologies either.
Building a developer community
Hapax has some big plans around OpenAmplify. The company is researching "discourse analysis," which takes "snapshots" of a "conversation" and will be able to determine the "velocity" of the conversation. For example, it could determine if two sources are becoming increasingly hostile to one another.
Hapax is also working on authorship "signatures," which can identify authors based on things such as certain word usage patterns and favorite phrases (I know my "signature" already: lots of commas and semi-colons and sentences of Germanic length). Hapax is also working to build a community around OpenAmplify where developers can exchange ideas about how to use it.
Implementing security measures
One thing that concerns me is security; after all, OpenAmplify is running on a third-party cloud computing platform. Mark and Mike gave me some very reassuring information on the topic. There is no data persistence other than a short-term cache of the results, so if you ask for the same item twice, you get the same results. In addition, you can call the service via SSL to ensure that your text and results are encrypted in transit. If you're passing OpenAmplify a URL instead of the text directly, you can give it an HTTPS URL. These security measures also reduce the infrastructure complexity. Because there is no data persistence, there are no shared state issues, which lets them scale up and out very easily with a simple "round robin" load balancing scheme.
OpenAmplify is currently free for fewer than 1,000 "transactions" per day. Each request may comprise multiple transactions: asking for all sections of results counts as two transactions, large documents (more than 2.5K of "stripped" text) require an extra transaction, and providing a URL to the document (instead of the text itself) also adds a transaction. In other words, the "worst case scenario" is that a document costs four "transactions" to process. Developers needing more than 1,000 transactions per day can be put on a pay plan. You can visit the OpenAmplify Web site to sign up for an account.
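The transaction arithmetic can be sketched as a small calculator. It assumes that a request for a single result section costs one transaction (which is what makes the four-transaction worst case add up) and reads "2.5K" as roughly 2,500 characters of stripped text; both are my assumptions, not published pricing rules.

```python
LARGE_DOC_CHARS = 2500  # "2.5K" read as about 2,500 characters; an assumption


def transactions(stripped_length, all_sections, by_url):
    """Estimate how many transactions one request would consume."""
    # Assumption: a single-section request costs 1; all sections cost 2.
    cost = 2 if all_sections else 1
    if stripped_length > LARGE_DOC_CHARS:
        cost += 1  # large-document surcharge
    if by_url:
        cost += 1  # document fetched by URL instead of passed as text
    return cost


# Worst case: all sections, large document, fetched by URL.
print(transactions(3000, all_sections=True, by_url=True))    # 4
# Cheapest case: one section, short text passed directly.
print(transactions(500, all_sections=False, by_url=False))   # 1
```

At four transactions per document, the free allowance would cover at most about 250 worst-case documents.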
Where I see OpenAmplify being a really neat item is as a part of a greater whole. It lends itself to various mashup ideas quite well. For example, I could see it being very helpful in various intelligence and law enforcement scenarios when combined with a Web spider and possibly some other text processing tools. A company could write software that uses OpenAmplify to comb through documents for legal discovery such as determining "what they knew, and when" (I may pitch that one to my boss). Researchers could use it to comb through documents to find applicable references. And so on and so on.
While this is not a space that I am familiar with in terms of who the market players are, from what I have seen, OpenAmplify is a fine product. From the development perspective, it is very easy to work with, which counts for a lot. I think that the Hapax team has a lot to be proud of here.
J.Ja
Disclosure of Justin's industry affiliations: Justin James has a working arrangement with Microsoft to write an article for MSDN Magazine. He also has a contract with Spiceworks to write product buying guides.
Justin James is the Lead Architect for Conigent.