
OpenAmplify developer's diary - part two: Author comparisons

Justin James explains how he used results from OpenAmplify to compare author demographics and styles between two documents.

 

In part one of the Developer Diary, I reviewed what it takes to make a request to the OpenAmplify service. This week, I'll explain how I use the OpenAmplify results to compare author demographics and styles between two individual documents. The key to all of my work from here on is that when I retrieve the results from OpenAmplify, I load them into an XDocument instance, which allows me to query them with LINQ.
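As a minimal sketch of that step, the snippet below parses a response string into an XDocument and queries it with LINQ to XML. The "AmplifyReturn" and "Demographics" element names come from the comparison code later in this article; the root element name, the sample values and the AmplifyXmlSketch class are made up for illustration and are not the real OpenAmplify output.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AmplifyXmlSketch
{
    // Hypothetical, trimmed-down response. The root element name and the
    // values are illustrative only, not real OpenAmplify output.
    public const string SampleXml = @"
<SampleResponse>
  <AmplifyReturn>
    <Demographics>
      <Education><Name>College</Name></Education>
      <Language><Name>English</Name></Language>
    </Demographics>
  </AmplifyReturn>
</SampleResponse>";

    // Returns one "ElementName=Name" string per child of the Demographics section.
    public static string[] DescribeDemographics(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        var query = from node in doc.Root.Element("AmplifyReturn").Element("Demographics").Elements()
                    select node.Name.LocalName + "=" + (string)node.Element("Name");
        return query.ToArray();
    }
}
```

Calling AmplifyXmlSketch.DescribeDemographics(AmplifyXmlSketch.SampleXml) yields one entry per demographic component in document order.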

If you're not familiar with LINQ, one of the biggest misconceptions about it is that it's only for querying databases. In reality, LINQ is a query syntax built into VB.NET and C#, and various "providers" can be plugged in to let it query different data stores. While the SQL Server provider ("LINQ to SQL") is one of the most prominent, it's not the only one available. For the purposes of this project, we will be using LINQ to Objects (which operates on objects that implement the IEnumerable<T> interface) and LINQ to XML. If you would like more information on LINQ, I highly recommend the book Essential LINQ by Charlie Calvert and Dinesh Kulkarni.
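To show that no database is involved, here is a small LINQ to Objects sketch that queries a plain in-memory array. The LinqToObjectsSketch class and its NamesContaining method are hypothetical names used only for this example; the strings are the "Education" names that OpenAmplify can return.

```csharp
using System;
using System.Linq;

public static class LinqToObjectsSketch
{
    // The "Education" names OpenAmplify can return, held in an ordinary array.
    public static readonly string[] EducationNames =
    {
        "Undecided", "Pre-Secondary", "Secondary", "College", "Post Graduate"
    };

    // Filters and sorts the array with a LINQ query; no database provider is used.
    public static string[] NamesContaining(string fragment)
    {
        var query = from name in EducationNames
                    where name.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0
                    orderby name
                    select name;
        return query.ToArray();
    }
}
```

The same query syntax works against XML once the data is in an XDocument, which is what the rest of this article relies on.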

Now let's get started. First, I will take a look at the author portion of the OpenAmplify results to get an idea of how to properly compare them. For this exercise, what I am calling the "author information" is really the "Demographic Analysis" and "Style Analysis" sections of the OpenAmplify output. Combining the two sections, we are given six results:

  • Flamboyance - Does the text use a lot of uncommon words or less common structures and components of grammar?
  • Slang - Does the text use many words outside of "proper language"?
  • Age - The approximate age of the author and intended audience.
  • Gender - A guess at the author or audience gender.
  • Education - An approximation of the author or audience education level.
  • Language - What language the text is written in.

What kind of comparison am I looking for? In the case of my application (Rat Catcher), I want to present the user with a simple percentage, so my code will produce a float value between 0 and 100, with 100 being a perfect match and 0 being a complete mismatch. I don't see any reason to weight any one component more heavily than another, except for "Language." If the languages are not the same, there is no point in looking at anything else; I will return 0 in that case and be done with it. That being said, it's pretty simple to assign a weight to any of the components, if desired.
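For readers who do want weights, the following is one possible sketch rather than anything Rat Catcher actually does. The WeightedScoreSketch class, its Score method and the dictionary-based inputs are all assumptions made for this example. Equal weights reproduce the equal-credit scheme described above, and a "Language" mismatch still returns 0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WeightedScoreSketch
{
    // matches: component name -> whether the two documents agree on it.
    // weights: component name -> relative weight (hypothetical values).
    public static float Score(IDictionary<string, bool> matches, IDictionary<string, float> weights)
    {
        // "Language" is a critical match: a mismatch or a missing entry zeroes the score.
        bool languageMatches;
        if (!matches.TryGetValue("Language", out languageMatches) || !languageMatches)
        {
            return 0f;
        }

        float total = weights.Values.Sum();
        if (total <= 0f)
        {
            return 0f;
        }

        // Add up the weights of the components that matched.
        float earned = weights
            .Where(w => matches.ContainsKey(w.Key) && matches[w.Key])
            .Sum(w => w.Value);

        // Scale to a 0-100 percentage and guard against rounding past 100.
        return Math.Min(100f, 100f * earned / total);
    }
}
```

With all six components weighted equally, each match is worth one sixth of the total, so this reduces to the unweighted comparison.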

Each component in the results contains a "Name" element, which provides an English text value describing the result. For example, I can get the following textual results for the "Education" analysis:

  • Undecided
  • Pre-Secondary
  • Secondary
  • College
  • Post Graduate

I will compare these values for a hit, and give 1/6th of 100% credit (about 16.7%) to any element with a matching name.

In addition, all components contain a numeric value. Sometimes the number provides additional granularity; sometimes it does not. In this project, Rat Catcher compares a "clean" source document against documents found on the Internet, which are not "clean." The comparison documents include data such as navigation, advertisements and additional text that is not relevant to the comparison. I understand that an exact match is nearly impossible to begin with (unless the found document is a non-Web page document), so I will ignore the granularity.

Some applications (for example, a legal discovery tool looking to tie related documents found on a subpoenaed server to one another) may need to examine the scalar results. Those applications might also give "partial credit" for neighboring values. For example, if the "Education" result is "Secondary" in one document and "College" in the other, the comparison might award half the usual credit.
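Here is a rough sketch of what such partial credit could look like for "Education." It assumes the names form an ordered scale from "Pre-Secondary" to "Post Graduate" and that "Undecided" sits outside that scale; both are my assumptions, as are the PartialCreditSketch class and the EducationCredit method.

```csharp
using System;

public static class PartialCreditSketch
{
    // Education names treated as an ordered scale (an assumption).
    // "Undecided" is deliberately left out because it has no position on the scale.
    private static readonly string[] Levels =
    {
        "Pre-Secondary", "Secondary", "College", "Post Graduate"
    };

    // Full credit for the same level, half credit for neighboring levels, otherwise none.
    public static float EducationCredit(string original, string compareTo, float fullCredit)
    {
        int a = Array.FindIndex(Levels, l => string.Equals(l, original, StringComparison.OrdinalIgnoreCase));
        int b = Array.FindIndex(Levels, l => string.Equals(l, compareTo, StringComparison.OrdinalIgnoreCase));

        if (a < 0 || b < 0)
        {
            // Off-scale values such as "Undecided" only score on an exact name match.
            return string.Equals(original, compareTo, StringComparison.OrdinalIgnoreCase) ? fullCredit : 0f;
        }

        int distance = Math.Abs(a - b);
        if (distance == 0) return fullCredit;
        if (distance == 1) return fullCredit / 2;
        return 0f;
    }
}
```

A call such as PartialCreditSketch.EducationCredit("Secondary", "College", 100f / 6) would award half of the usual one-sixth share.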

For Rat Catcher's purposes, a simple comparison is more than adequate. Let's take a look at the code:

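// Requires: using System; using System.Linq; using System.Xml.Linq;
private float CompareOpenAmplifyAuthors(XDocument originalXml, XDocument compareToXml)
{
    float result = 0;

    // Select the children of the "Demographics" and "Styles" sections of the original document.
    var originalDemographicNodes = from node in originalXml.Root.Element("AmplifyReturn").Element("Demographics").Elements()
                                   select node;
    var originalStyleNodes = from node in originalXml.Root.Element("AmplifyReturn").Element("Styles").Elements()
                             select node;

    foreach (var node in originalDemographicNodes)
    {
        // Look up the element with the same name in the comparison document; it may be missing.
        var compareToNode = compareToXml.Root.Element("AmplifyReturn").Element("Demographics").Element(node.Name);

        if (compareToNode != null &&
            string.Equals((string)compareToNode.Element("Name"), (string)node.Element("Name"), StringComparison.OrdinalIgnoreCase))
        {
            result += 100f / 6;
        }
        else if (node.Name == "Language")
        {
            // A language mismatch makes every other component irrelevant.
            return 0;
        }
    }

    foreach (var node in originalStyleNodes)
    {
        var compareToNode = compareToXml.Root.Element("AmplifyReturn").Element("Styles").Element(node.Name);

        // Compare names case-insensitively, the same way as in the Demographics loop.
        if (compareToNode != null &&
            string.Equals((string)compareToNode.Element("Name"), (string)node.Element("Name"), StringComparison.OrdinalIgnoreCase))
        {
            result += 100f / 6;
        }
    }

    // Guard against floating-point rounding pushing the total past 100.
    return Math.Min(100f, result);
}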

As you can see, the method accepts two XDocument objects, originalXml and compareToXml, and initializes the result variable to 0. Next, I select the children of the "Demographics" and "Styles" sections of the original XML into separate enumerations. From there, I iterate over each enumeration and look up the node in the comparison XML with the same element name.

In each case, if the values of the "Name" sub-element match, I add 100/6 to the result. Note that in the "Demographics" comparison, if the "Name" sub-element does not match and the component is "Language," I immediately return 0, implementing my rule that "Language" is a critical match. Finally, I cap the result at 100 (to guard against rounding issues) and return it.

Is this the most elegant code possible? Not at all. While it's completely possible to express this in one to three LINQ statements, there are reasons why you do not want to:

  • Readability - It would be a very lengthy LINQ statement and any elegance advantage would be lost quickly.
  • No speed advantage - Often, elegant code is a lot faster than brute force code. In this case, I am comparing six values; how much speed will elegance give me?
  • Maintenance - All too often, elegant solutions become maintenance nightmares. There's a fine line between elegant and too clever. You don't want to discover three months into the project that you have a problem.
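For comparison, here is a sketch of what a more compact, query-centric version might look like. It is not the code Rat Catcher uses. The CompactComparisonSketch class is hypothetical, it assumes both documents contain the "Demographics" and "Styles" sections, and it compares names case-insensitively in both sections.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CompactComparisonSketch
{
    public static float Compare(XDocument originalXml, XDocument compareToXml)
    {
        XElement original = originalXml.Root.Element("AmplifyReturn");
        XElement compareTo = compareToXml.Root.Element("AmplifyReturn");

        // One row per component across both sections: its element name and
        // whether the "Name" values of the two documents agree.
        var rows = (from sectionName in new[] { "Demographics", "Styles" }
                    from node in original.Element(sectionName).Elements()
                    let other = compareTo.Element(sectionName).Element(node.Name)
                    select new
                    {
                        Component = node.Name.LocalName,
                        Matched = other != null && string.Equals(
                            (string)node.Element("Name"),
                            (string)other.Element("Name"),
                            StringComparison.OrdinalIgnoreCase)
                    }).ToList();

        // A language mismatch zeroes the score, as in the loop-based version.
        if (rows.Any(r => r.Component == "Language" && !r.Matched))
        {
            return 0f;
        }

        return Math.Min(100f, rows.Count(r => r.Matched) * 100f / 6);
    }
}
```

Even in this short form, the nested query and anonymous type take more effort to read than two plain loops, which is the readability trade-off described above.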

What I've discovered so far is that, with a little bit of LINQ, it is easy to do the lookups I need to make a good comparison between authors. Next week, I will move on to the real bear: comparing the topic intentions, which is what makes it possible to compare the meaning of two documents.

J.Ja

About Justin James

Justin James is the Lead Architect for Conigent.
