
OpenAmplify developer's diary - part three: Topic intention comparisons

Justin James explains how he used OpenAmplify to provide an approximation of the similarities of two documents.

 

In part two of this series, I discussed comparison of the author information ("Demographics" and "Style") that the OpenAmplify output provides. In part three, I am shooting for a much more ambitious target, "Topic Intentions." My goal is to be able to provide an approximation of how similar the two documents are in terms of what they discuss and how they discuss it.

Why do I want to do this? Well, my application, Rat Catcher, gives me what I call a "Semantic Match Score" (SM score). The SM score indicates how similar the contents of the two documents are. From my initial testing, the SM score is a great supplement to the existing percentage that shows the number of matched "phrases" in the documents. The SM score is useful because it helps the user find documents that may be a "creative rewording" of the original and, as a result, will not have a very high phrase match percentage.

In the future, I plan to take the SM score much further. To begin with, I would like to use a high SM score to trigger a "thesaurus comparison" of documents, in which individual phrases are broken down to root word stems. From there, variations can be created from a thesaurus and each variation looked for in the target document. Needless to say, this will be a computationally brutal exercise, so if the SM score can be used to narrow down the documents that are eligible for this treatment, I will be much happier.

My logic in this method is to do the following:

  1. Create a "Topic Intention" score for the "Top Topics" from 0 to 100. ("0" means "no Top Topics from the original document appear in the comparison document" and "100" indicates "all Top Topics in the original appear in the comparison document and match Polarity, Requesting Guidance, and Offering Guidance.")
  2. Replicate this logic for Proper Nouns.
  3. Replicate this logic for Locations, but instead of using the in-depth Topic Intention comparison, just check for the existence in each document.
  4. Combine the three scores into a composite score by adding them together and dividing by 3.
  5. Any errors result in immediate termination and a result of 0, for the sake of expediency.
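
Steps 4 and 5 can be sketched roughly as follows. This is only an illustration, not the Rat Catcher code: the class name `SemanticMatchSketch` and the helpers `CombineScores` and `SafeScore` are hypothetical names chosen for this example, and the clamp to the 0 to 100 range is my own assumption.

```csharp
using System;

public static class SemanticMatchSketch
{
    // Hypothetical helper: averages the three 0-100 component scores (step 4).
    public static float CombineScores(float topTopics, float properNouns, float locations)
    {
        float composite = (topTopics + properNouns + locations) / 3f;

        // Assumption: clamp defensively in case a component drifts outside 0-100.
        return Math.Max(0f, Math.Min(100f, composite));
    }

    // Hypothetical helper: any exception ends the comparison with a score of 0 (step 5).
    public static float SafeScore(Func<float> compute)
    {
        try
        {
            return compute();
        }
        catch (Exception)
        {
            return 0f;
        }
    }
}
```

For example, `SemanticMatchSketch.CombineScores(100f, 50f, 0f)` averages to 50, and wrapping a comparison in `SafeScore` turns a missing XML element (which would throw a `NullReferenceException`) into a score of 0.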

For now, let's look at the SM score, which is expressed as a float with a range of 0 to 100 (0 being "no match" and 100 meaning "perfect match").

Here is my method declaration:

private float CompareOpenAmplifyContent(XDocument Original, XDocument CompareTo)

The code to perform the comparison between the Top Topics and the code for the Proper Nouns is identical, other than the XML elements referenced, so I am only going to show how I compare the Top Topics:

// Collect the Top Topics elements from each document.
var topOriginalTopics =
    from topic in Original.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("TopTopics").Elements()
    select topic;

var topCompareToTopics =
    from topic in CompareTo.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("TopTopics").Elements()
    select topic;

float topTopicsResult = 0;

if (topOriginalTopics.Count() > 0 && topCompareToTopics.Count() > 0)
{
    foreach (var originalTopic in topOriginalTopics)
    {
        // Find the comparison topic with the same name, ignoring case and padding.
        XElement compareToTopic = null;
        foreach (var topic in topCompareToTopics)
        {
            if (topic.Element("Topic").Element("Name").Value.ToLower().Trim() == originalTopic.Element("Topic").Element("Name").Value.ToLower().Trim())
            {
                compareToTopic = topic;
                break;
            }
        }

        // No topic with this name in the comparison document; it adds nothing.
        if (compareToTopic == null)
        {
            continue;
        }

        // Each original topic is worth an equal share of the 100 points.
        // 100f keeps the division in floating point so the share is not truncated.
        topTopicsResult += CompareOpenAmplifyTopicIntentionResults(originalTopic, compareToTopic) *
            (100f / topOriginalTopics.Count());
    }

    // Round, then cap the score at 100.
    topTopicsResult = (float)Math.Min(Math.Round(topTopicsResult), 100);
}

// An original with no Top Topics counts as a full match.
if (topOriginalTopics.Count() == 0)
{
    topTopicsResult = 100;
}

I create a list of the Top Topics in each document. Next, I iterate through the list of original Top Topics and search the comparison Top Topics for a node with the same name; when I find one, I break out of the inner loop. If nothing was found, I continue to the next original topic (a bit of negative logic). Otherwise, I calculate the similarity of the two Topic Intention Results (the XML nodes that contain the details of an item within Topic Intentions), weight it by the number of Top Topics in the original document (so that a perfect match on every topic adds up to 100), and add it to the running score for the Top Topics. If the original has no Top Topics (unlikely), I give it a 100% match. Here is my code to compare Topic Intentions:

private float CompareOpenAmplifyTopicIntentionResults(XElement Original, XElement CompareTo)
{
    if (Original == null || CompareTo == null)
    {
        return 0;
    }

    // Count how many of the three Polarity names (Min, Mean, Max) agree.
    // Compare .Value: == on two XElement objects tests reference identity, not content.
    var matchedItems = 0;

    if (Original.Element("Polarity").Element("Min").Element("Name").Value == CompareTo.Element("Polarity").Element("Min").Element("Name").Value)
    {
        matchedItems++;
    }

    if (Original.Element("Polarity").Element("Mean").Element("Name").Value == CompareTo.Element("Polarity").Element("Mean").Element("Name").Value)
    {
        matchedItems++;
    }

    if (Original.Element("Polarity").Element("Max").Element("Name").Value == CompareTo.Element("Polarity").Element("Max").Element("Name").Value)
    {
        matchedItems++;
    }

    var polarityRating = (float)matchedItems / 3;

    // Offering Guidance and Requesting Guidance each score 1 on a match, 0 otherwise.
    var offeringGuidanceRating = 0;

    if (Original.Element("OfferingGuidance").Element("Name").Value == CompareTo.Element("OfferingGuidance").Element("Name").Value)
    {
        offeringGuidanceRating++;
    }

    var requestingGuidanceRating = 0;

    if (Original.Element("RequestingGuidance").Element("Name").Value == CompareTo.Element("RequestingGuidance").Element("Name").Value)
    {
        requestingGuidanceRating++;
    }

    // Average the three ratings and cap the result at 1.
    var result = Math.Min((polarityRating + offeringGuidanceRating + requestingGuidanceRating) / 3, 1f);

    return result;
}

As you can see, there is nothing particularly complex or exciting about this code; it's just doing a quick and dirty comparison on an element-by-element basis between the two Topic Intention nodes. If you look carefully, you will see that the Polarity rating has three components for the three Polarity results (Mean, Min and Max). I am kicking back the results as a value between 0 and 1.
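
To make the arithmetic concrete, here is a small, self-contained sketch that builds two sample nodes and scores them the same way. Treat everything in it as an assumption for illustration: the wrapper element name `TopicResult`, the class and helper names (`IntentionSketch`, `MakeResult`, `NameAt`, `Same`, `Compare`), and the sample values such as "Positive" and "Yes" are made up and are not taken from real OpenAmplify output. Unlike the method above, it treats a missing element as a mismatch rather than letting it throw.

```csharp
using System;
using System.Xml.Linq;

public static class IntentionSketch
{
    // Hypothetical builder for a node shaped the way the comparison code expects.
    public static XElement MakeResult(string min, string mean, string max,
                                      string offering, string requesting)
    {
        return new XElement("TopicResult",
            new XElement("Polarity",
                new XElement("Min", new XElement("Name", min)),
                new XElement("Mean", new XElement("Name", mean)),
                new XElement("Max", new XElement("Name", max))),
            new XElement("OfferingGuidance", new XElement("Name", offering)),
            new XElement("RequestingGuidance", new XElement("Name", requesting)));
    }

    // Walks the given path, then returns the text of the Name child (null if anything is missing).
    private static string NameAt(XElement node, params string[] path)
    {
        XElement current = node;
        foreach (var step in path)
        {
            if (current == null) return null;
            current = current.Element(step);
        }
        return current?.Element("Name")?.Value;
    }

    // True when both nodes have the same, non-missing Name text at the path.
    private static bool Same(XElement a, XElement b, params string[] path)
    {
        var left = NameAt(a, path);
        return left != null && left == NameAt(b, path);
    }

    // Returns a similarity between 0 and 1: the mean of the polarity,
    // offering-guidance, and requesting-guidance ratings.
    public static float Compare(XElement a, XElement b)
    {
        if (a == null || b == null) return 0f;

        int polarityMatches = 0;
        foreach (var part in new[] { "Min", "Mean", "Max" })
        {
            if (Same(a, b, "Polarity", part)) polarityMatches++;
        }

        float polarity = polarityMatches / 3f;
        float offering = Same(a, b, "OfferingGuidance") ? 1f : 0f;
        float requesting = Same(a, b, "RequestingGuidance") ? 1f : 0f;

        return (polarity + offering + requesting) / 3f;
    }
}
```

With `MakeResult("Negative", "Neutral", "Positive", "No", "No")` against `MakeResult("Negative", "Positive", "Positive", "No", "Yes")`, two of three Polarity values match (2/3), Offering Guidance matches (1), and Requesting Guidance does not (0), so the result is (2/3 + 1 + 0) / 3, or roughly 0.56.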

To perform the Locations comparison:

// Collect the Locations elements from each document.
var originalLocations =
    from topic in Original.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("Locations").Elements()
    select topic;

var compareToLocations =
    from topic in CompareTo.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("Locations").Elements()
    select topic;

float locationsResult = 0;

if (originalLocations.Count() > 0 && compareToLocations.Count() > 0)
{
    foreach (var originalTopic in originalLocations)
    {
        foreach (var topic in compareToLocations)
        {
            // Only check that the name exists in the comparison document.
            if (topic.Element("Result").Element("Name").Value.ToLower().Trim() == originalTopic.Element("Result").Element("Name").Value.ToLower().Trim())
            {
                // 100f keeps the division in floating point so the share is not truncated.
                locationsResult += 100f / originalLocations.Count();
                break;
            }
        }
    }
}

// An original with no Locations counts as a full match.
if (originalLocations.Count() == 0)
{
    locationsResult = 100;
}

Again, there is nothing terribly complex here. I am just looping through and checking to see how many items in the original document appear in the comparison.
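
The same "how many originals appear in the comparison" check can also be sketched with a `HashSet`, which avoids the nested loop. This is an alternative illustration rather than the code Rat Catcher uses; `LocationOverlapSketch`, `PercentFound`, and `Normalize` are hypothetical names, and the sketch assumes the location names have already been pulled out of the XML as plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocationOverlapSketch
{
    // Returns the percentage (0-100) of original names that also appear in compareTo.
    public static float PercentFound(IEnumerable<string> original, IEnumerable<string> compareTo)
    {
        var originalNames = original.Select(s => Normalize(s)).ToList();

        // Mirrors the article's rule: nothing to look for counts as a full match.
        if (originalNames.Count == 0)
        {
            return 100f;
        }

        // A set gives constant-time lookups instead of rescanning the list each time.
        var compareNames = new HashSet<string>(compareTo.Select(s => Normalize(s)));
        int found = originalNames.Where(name => compareNames.Contains(name)).Count();

        return 100f * found / originalNames.Count;
    }

    // Same normalization as the article's comparison: ignore case and padding.
    private static string Normalize(string s)
    {
        return s.Trim().ToLowerInvariant();
    }
}
```

For instance, if the original mentions Paris, London, Tokyo, and Rome, and the comparison document mentions " paris " and "ROME", two of the four original names are found, for a score of 50.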

In part four, I will dive into the SOAP interface to OpenAmplify, which will be of particular interest to Java and .NET developers, whose environments are very heavily geared towards SOAP interaction.

J.Ja

About

Justin James is the Lead Architect for Conigent.
