Enterprise Software

Hapax's OpenAmplify makes it easy to extract meaning from text

Justin James put Hapax's OpenAmplify, a lightweight natural language processing Web service, through its paces. Find out what he thinks of this relatively new product.

 

Editor's note: When this article was published, the product name was Amplify; the product name is now OpenAmplify. This article has been updated to reflect the name change.

Hapax recently introduced OpenAmplify, a natural language processing (NLP) Web service that can parse documents and blocks of text to derive their meaning. NLP is a topic that has always gotten the geeky part of my brain pretty excited, so I was glad to talk to the folks at Hapax and to try out OpenAmplify for myself. First, I spoke with Hapax CEO Mark Redgrave and Hapax CIO Mike Petit; then, I put the OpenAmplify service through its paces.

OpenAmplify is a Web service, but unlike some of the Web services I have been dealing with recently (namely, Exchange Web Services), it is extraordinarily lightweight. A typical call to the OpenAmplify service requires only two pieces of information: your OpenAmplify license key and either the URL of the document to be processed or a block of URL-encoded text. You can call OpenAmplify as a REST service with a GET/POST, or you can make a SOAP request to it. When making your request, you can narrow the results to certain search terms and result sections. One thing that disappointed me a bit is that you cannot point OpenAmplify at a PDF or Word document; if you want to process text from a PDF or Word source, you need to extract the text yourself and pass it to OpenAmplify.
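A REST call of that shape can be sketched in a few lines of Python. Note that the endpoint URL and the parameter names (`apiKey`, `inputText`, `sourceURL`, `outputFormat`) are my own guesses for illustration; check Hapax's documentation for the actual values:

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- the real one is in the OpenAmplify docs.
ENDPOINT = "https://api.openamplify.example/AmplifyThis"

def build_request(api_key, text=None, source_url=None, output_format="xml"):
    """Build the GET query string for an OpenAmplify-style call.

    Exactly one of `text` or `source_url` must be given, mirroring the
    "block of text or document URL" choice described above.
    """
    if (text is None) == (source_url is None):
        raise ValueError("pass exactly one of text or source_url")
    params = {"apiKey": api_key, "outputFormat": output_format}
    if text is not None:
        params["inputText"] = text      # urlencode handles the URL encoding
    else:
        params["sourceURL"] = source_url
    return ENDPOINT + "?" + urlencode(params)

url = build_request("MY-LICENSE-KEY", text="OpenAmplify makes NLP easy.")
```

The resulting URL could be fetched with any HTTP client; the point is simply how little a request carries: a key plus the text (or a pointer to it).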

Understanding OpenAmplify's results

The results come back as a very minimalist XML document, DoubleClick's DART format, or JSON (your choice). The results are split into four major areas: topics, actions, demographics, and style. Each result carries two versions of its score: a name (which represents a broad range of underlying numeric values) and a scalar modifier. Sometimes the scalar value "emphasizes" the named value. For example, "negative" and "-0.8" for the "polarity" on a topic mean that the author is very negative about that topic (while a -0.01 would be barely negative). Other items use the scalar value to indicate certainty: the demographics may say "male," but a scalar value of "0.001" might indicate that the system is barely leaning towards "male" as opposed to "neutral." This combination gives OpenAmplify both a convenient, broad value for human consumption and a highly granular value for aggregate calculations and other numeric uses.
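The name-plus-scalar pairing is easy to consume programmatically. Here is a rough sketch using a made-up JSON shape; only the pairing of a broad name with a granular value is taken from the description above, and the field names are hypothetical:

```python
import json

# Hypothetical result shape for illustration -- the real field names
# are in the OpenAmplify documentation.
sample = json.loads("""
{
  "Topics": [
    {"Name": "brakes", "Polarity": {"Name": "negative", "Value": -0.8}}
  ],
  "Demographics": {"Gender": {"Name": "male", "Value": 0.001}}
}
""")

def describe(pair):
    # The broad name is for humans; the scalar says how strongly it applies.
    return f"{pair['Name']} ({pair['Value']:+.3f})"

topic = sample["Topics"][0]
print(describe(topic["Polarity"]))                    # negative (-0.800)
print(describe(sample["Demographics"]["Gender"]))     # male (+0.001)
```

An application aggregating many documents would average the `Value` fields and ignore the names entirely; a dashboard for humans would do the opposite.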

"Topics" tells you what items are discussed in the text, how much of the text focuses on them, whether the author offers or asks for advice on the topic, and whether the author is positive or negative about them. Topics are organized into "domains" (e.g., "brakes" and "tires" would be in the "automobile" domain), "locations," and proper nouns. The "actions" results correspond to the verbs in the document and show their temporality (past, present, future), the "decisiveness" of the verbs, and whether the author is offering or asking for advice or information on the action. The "demographics" section has basic information about the author: age, gender, and education level. Mark and Mike gave me a snapshot of how these items are calculated, and they take a lot of factors into account. The "style" results let you know whether the document contains a lot of slang and how "flamboyant" the author is. "Flamboyance" is a measure of things such as the complexity of the sentence structure, the vocabulary used, and so on.

Putting OpenAmplify to the test

When I put OpenAmplify to work, I couldn't help but barrage it with a wide variety of items from my favorite author: myself. To be fair, I also passed it some other items, mostly postings from various bloggers regarding different topics, writing styles, and quality. For my own work, I felt flattered (to say the least) by OpenAmplify's results in "demographics" and "style." For other authors, I generally agreed with the demographics and style results, and the verifiable items (gender, age) were correct when they were not "neutral." Overall, the topics were spot-on as well.

The only flaws I noticed were that it occasionally did not get my attitude towards a topic right, and sometimes it thought that I was not offering or requesting guidance when I was, particularly on "actions." I also noticed that text I thought was simple and normal was often given a high "flamboyance" rating and a high education rating, which has me extremely concerned about what is considered "average" writing quality.

As is to be expected from an application performing such computationally intense work, OpenAmplify does not run at "ultra real-time" speed, but it is not "slow" either. It took well under the magic 10-second threshold to respond when I pointed it at a very lengthy paper of mine (12 printed pages) on a difficult topic. It made short work of blogs, news articles, and so on.

Praising OpenAmplify's documentation

OpenAmplify documentation is excellent. The "quick start" information on its site had me up and running in less than a minute. The full documentation did an excellent job of explaining what the various result items meant with examples that made sense.

What I liked best is that, while Hapax employs a ton of super-smart language experts, the API documentation would make sense to anyone -- even someone without a programming background! I feel that I could give the API docs to, say, an office manager and ask him to put a few documents through OpenAmplify, and he could tell me what the "top ten topics" or typical education level was of those documents.

Running on Amazon's EC2 platform

OpenAmplify is Linux-based, written in C++, and runs on Amazon's EC2 platform (to the best of my knowledge, this is the first time I have ever used an app running on that system). There is no need to "train" the system (unlike many other NLP systems out there) because OpenAmplify does not take a statistics-based approach. As a result, users are not able to upload or provide personal ontologies either.

Building a developer community

Hapax has some big plans around OpenAmplify. The company is researching "discourse analysis," which takes "snapshots" of a "conversation" and will be able to determine the "velocity" of the conversation. For example, it could determine if two sources are becoming increasingly hostile to one another.

Hapax is also working on authorship "signatures," which can identify authors based on things such as certain word usage patterns and favorite phrases (I know my "signature" already: lots of commas and semi-colons and sentences of Germanic length). Hapax is also working to build a community around OpenAmplify where developers can exchange ideas about how to use it.

Implementing security measures

One thing that concerns me is security; after all, OpenAmplify is running on a third-party cloud computing platform. Mark and Mike gave me some very reassuring information on the topic. There is no data persistence other than a short-term cache of the results, so if you ask for the same item twice, you get the same results. In addition, you can call the service via SSL to ensure that your text and results are encrypted in transit. If you're passing OpenAmplify a URL instead of the text directly, you can give it an HTTPS URL. These security measures also reduce the infrastructure complexity. Because there is no data persistence, there are no shared state issues, which lets them scale up and out very easily with a simple "round robin" load balancing scheme.

Pricing

OpenAmplify is currently free for up to 1,000 "transactions" per day; each request may comprise multiple transactions: asking for all sections of results counts as two transactions, large documents (more than 2.5K of "stripped" text) require an extra transaction, and providing a URL to the document (instead of the text itself) adds another. In other words, in the "worst case scenario," a document can cost up to four transactions to process. Developers needing more than 1,000 transactions per day can be put on a pay plan. You can visit the OpenAmplify Web site to sign up for an account.
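The transaction arithmetic works out like this (the thresholds are from the pricing description above; the function itself is my own illustration, not part of Hapax's API):

```python
# Sketch of OpenAmplify's per-request transaction cost as described
# in the pricing section; an illustration, not an official formula.

def transaction_cost(stripped_bytes, all_sections=False, via_url=False):
    cost = 1                       # base request
    if all_sections:
        cost += 1                  # all result sections = 2 transactions
    if stripped_bytes > 2500:      # "more than 2.5K of stripped text"
        cost += 1
    if via_url:
        cost += 1                  # letting the service fetch the URL
    return cost

# A short text block, one result section: 1 transaction.
print(transaction_cost(800))                                      # 1
# Worst case: large document, all sections, fetched by URL: 4.
print(transaction_cost(10_000, all_sections=True, via_url=True))  # 4
```

This is also why stripping the text yourself before sending it can matter: it avoids both the URL-fetch transaction and, for pages heavy with markup, possibly the large-document one.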

Conclusion

Where I see OpenAmplify being a really neat item is as a part of a greater whole. It lends itself to various mashup ideas quite well. For example, I could see it being very helpful in various intelligence and law enforcement scenarios when combined with a Web spider and possibly some other text processing tools. A company could write software that uses OpenAmplify to comb through documents for legal discovery such as determining "what they knew, and when" (I may pitch that one to my boss). Researchers could use it to comb through documents to find applicable references. And so on and so on.

While this is not a space that I am familiar with in terms of who the market players are, from what I have seen, OpenAmplify is a fine product. From the development perspective, it is very easy to work with, which counts for a lot. I think that the Hapax team has a lot to be proud of here.

J.Ja

Disclosure of Justin's industry affiliations: Justin James has a working arrangement with Microsoft to write an article for MSDN Magazine. He also has a contract with Spiceworks to write product buying guides.


About

Justin James is the Lead Architect for Conigent.

27 comments
solson

I must be missing something. I do not understand why the program would be useful. Is it just to tell you what you should be able to understand about content without having to read it yourself?

Murfski-19971052791951115876031193613182

I'd purely love to see the results Amplify would give if you put in the text of a political campaign speech -- or something like the US Income Tax Code. Probably come up with something like "Semantically null."

Justin James

Do you think that Amplify would make a good addition to a product you are working on, or give you any ideas for new products? J.Ja

Justin James

What makes Amplify useful is that it outputs XML, dart, or JSON. This means that an *application* can get awareness of a document's contents, and therefore leverage it. The fact that a human can read the output is practically irrelevant, except to test it like I was, and "eyeball" the results. J.Ja

Justin James

It actually handles "dry" text (like legalese) pretty well. I would definitely be interested in, say, a comparison of a speech from a verbally dextrous politician and one that is more "plain spoken". J.Ja

John.Graffio

There may be a panel of language experts working on this, but even if the base level system interprets meaning mathematically, there is nothing to say that an intermediate layer could be inserted that would tweak or flavor the output in a predetermined direction. And how would you know this was happening? This should be able to produce RDF output which is the basis of the semantic web. So now what if search engines started processing RDF documents on web sites? And what if I could "tweak" the "meaning" of the original document via modifying this RDF output? Now search engine results of extracted "meaning" may not match what was "meant" in the original document. So now SEO hackers can have another lever to manipulate search results, at least for a while. I think the basic importance of the product is well founded, but everyone should be aware of the fragility of what something "means" before they rely on the output of machine processing.

santeewelding

He ran it on us all and now he has a file...

herlizness

Is there a demo available anywhere on the web? I want to feed it the US Constitution and see what comes back.

M.W.H.

Monetizing great ideas today seems to revolve around selling something to third parties so I'm leery about what that might be with this service. Geeks can invent something cool with the best of intentions but they're not always good at figuring out how to sell it to someone without compromising their original integrity.

carl_iddings

What would happen if I passed a link to French text? Or Turkish?

stevezachjohnson

I think it can enhance your project since it makes it easy for you to extract meaning from text...

darpoke

HR departments everywhere ought to start using it to polish their CVs :-)

Murfski-19971052791951115876031193613182

Years ago there was a "grammar analyzer" called Grammatik, which would compare text to a set of standardized formats -- newspaper, children's book, legalese, etc. I ran it against a report done by an outside consulting firm; the resulting reply from the analyzer said "This is probably bureaucratese, but are you sure it was written in English?" This was around 1991 or so.

herlizness

> shouldn't be any more of a problem than FOX News already is ...

Justin James

Yup... my favorite is to put some of the more... politically charged... discussions through it, and keep a list of who scored "elementary school" and who got "post-graduate". :) J.Ja

Justin James

I don't know of any demos out there. It only took a few moments to register for a developer account, and it takes like 2 seconds to figure out the API. That being said, if you like, I'd be glad to spend a few moments throwing up a quick demo myself. Just let me know, and I'll go ahead with it. :) J.Ja

Justin James

They already have monetization plans, which were mentioned in the post. Folks needing more than 1,000 "transactions" pay for it. Based on that, I would imagine that unless they get a huge ratio of small users to big, paying users, they probably won't need to go down that route. Also, I am not really sure what they could sell or how they could sell it, other than saying "this user has processed these pieces of text or these Web pages." By stripping the text out of a Web page yourself to pass it in (which you should do anyway, since it saves a "transaction"), Amplify is not aware of the source of your data, which further protects your privacy. J.Ja

Justin James

I didn't see any flags or parameters to indicate language, so I would assume it kicks back garbage. J.Ja

darpoke

was that HR should be using it to polish *their* resumes. Sounds like if a piece of software can scan for keywords without being able to 'read between the lines' as it were, well... the software's probably cheaper to run, right? From following the thread above, it seems to me in the case of the vacancy requiring SQL, no matter what applicants are reviewed it's overwhelmingly likely that people with no SQL experience will be rejected. Why then are people with no experience in what it takes to code a given project responsible for hiring programmers? To put it another way, just because you're a tool does not make you the right tool for the job :-)

herlizness

> this information is not especially difficult to uncover if you structure your screening device properly ... but most corps don't ... and what they end up doing is filtering IN people who serendipitously rank high on the key words and filtering OUT those who don't. It's simply too gamey, too quirky, and not helpful to anyone. If I want a certain level of MSSQL experience, it's really not difficult for me to devise a structured set of questions that will show pretty reliably whether the candidate likely has those skills or likely does not. People who are responsible for hiring others with technical skill should know how to screen for that skill, and if they can't, the task should be handed over to someone who can. They should also be mindful of the fact that there are certain technical skills that can actually be acquired quite rapidly by a person with a brain. Sorry, I just think we're doing an absolutely terrible job of talent development, screening and continuing education.

Justin James

Yup, exactly right. I was working on a system that allowed people to apply for jobs online. The plan was to use this to pre-populate the forms, and then allow the applicant to tweak the values as needed. It would dramatically slash the time needed for a user to apply for a job through the system. J.Ja

john

I guess the real value is scanning 10,000 resumes for those with SQL Server experience, not necessarily those with three years of SQL Server experience. That would save a recruiter many, many hours of research.

Justin James

I cannot agree more. At the same time, I also have a bit of sympathy for the HR people. They are given zero training in their problem domains, as it were. They are a lot like a programmer who knows a language really well, but not the items being programmed. Sure, they can follow the spec competently, but you don't want them involved in actually writing the spec or straying from it. When HR people without adequate knowledge are asked to find IT talent, it's the same thing. They don't know what questions to ask, their BS-sniffers don't have the proper heuristics, and so on. I used to despise the HR folks, until my last job, where the company I worked for provided HR-related applications and services, and I got a lot more exposure to how the field works (I should say, "barely limps along"). In a nutshell, they are typically undertrained to do their jobs to begin with, and lack knowledge of the positions that they are hiring for. They are facing an uphill fight. J.Ja

Justin James

Yeah, that's how it is. Then again, does the computer do much worse than the typical HR/recruiting person? ;) J.Ja

herlizness

> right ... even though what you really did is a trivial prototype using XYZ in the final week of your three year tenure

Justin James

There's a company called "Resume Mirror" that has a very good product. It takes a resume in a variety of formats and converts it to the HR-XML format. It does a GREAT job at extracting data from the various sections, identifying it, and then quantifying it. For example, if you had a job for 3 years, and put in the job description that you used skill XYZ, it will say you have 3 years of experience with XYZ. The problem, of course, is that we know how HR folks and recruiters are. Instead of taking this information with a grain of salt, they assume that it is all 100% correct, and then proceed to make their usual buzzword-driven decisions from there. J.Ja

herlizness

sure, if you don't mind it'd be nice to try it out .... thanks ...