How to make complicated machine learning developer problems easier to solve

Commentary: AI/ML is more complicated than software development, but there are ways to make it more approachable.

We sometimes assume the best software companies will build the best artificial intelligence (AI), but as Andreessen Horowitz investors Martin Casado and Matt Bornstein argue in their post about AI economics, this isn't necessarily the case. In fact, "...a deep, practical understanding of the problem to be solved," and not necessarily of the software to be used, may hold the key. In some ways, this hearkens back to Gartner analyst Svetlana Sicular's contention: "Organizations already have people who know their own data better than mystical data scientists….Learning Hadoop is easier than learning the company's business."

Given Casado and Bornstein's contention that "AI development is a process of experimenting, much like chemistry or physics," and not a software development "process of building and engineering," how should companies approach AI to maximize their chances of success?

Taming the long tail of artificial intelligence and machine learning

Economist John Maynard Keynes once quipped, "In the long run we are all dead." When it comes to AI/machine learning (ML), however, it's more a matter of "in the long tail we are all hopelessly confused." Or, as Casado and Bornstein pointed out, incapable of easily taming data that refuses to homogenize:

Many of the difficulties in building efficient AI companies happen when facing long-tailed distributions of data….It's becoming clear that long-tailed distributions are also extremely common in machine learning, reflecting the state of the real world and typical data collection practices….

[C]urrent ML techniques are not well equipped to handle [long-tail distributions of data]. Supervised learning models tend to perform well on common inputs (i.e. the head of the distribution) but struggle where examples are sparse (the tail). Since the tail often makes up the majority of all inputs, ML developers end up in a loop--seemingly infinite, at times--collecting new data and retraining to account for edge cases. And ignoring the tail can be equally painful, resulting in missed customer opportunities, poor economics, and/or frustrated users.
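To make the "tail often makes up the majority" point concrete, here is a minimal sketch. It assumes a Zipf-like distribution in which the k-th most common input category has weight 1/k; that weighting, the function name tail_share, and the category counts are illustrative assumptions, not figures from Casado and Bornstein's post.

```python
def tail_share(n_categories, head_size, exponent=1.0):
    """Fraction of all inputs that fall outside the `head_size` most common
    categories, assuming the k-th most common category has weight 1 / k**exponent.
    The Zipf-like weighting is an illustrative assumption."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_categories + 1)]
    head = sum(weights[:head_size])
    return 1.0 - head / sum(weights)


# Hypothetical setup: 100,000 distinct input categories, of which a model
# handles the 100 most common ones well.
share = tail_share(n_categories=100_000, head_size=100)
print(f"Share of inputs outside the 100 most common categories: {share:.0%}")
```

Under these assumptions, a little over half of all inputs land in the tail, even though each individual tail category is rare. A steeper exponent shrinks the tail, which is one reason it pays to measure the distribution before deciding how to attack it.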

Unfortunately, the answer isn't to throw more computational horsepower or data at the problem. The very problem of disparate data across diverse customer inputs contributes to diseconomies of scale, whereby it may cost 10X more (in terms of data, infrastructure, and more) to generate a 2X improvement. In AI/ML, then, the return on each additional dollar can shrink even as we throw more money at the problem.

So what's an AI/ML engineer to do?

Detecting bots with a simplify and conquer approach

Though Casado and Bornstein delve into transfer learning and meta models to tackle the hardest of ML problems (local, rather than global, long-tail distributions of data), the most straightforward approach to data complexity seems to involve narrowing down the problem(s) to be solved and, hence, the data distribution. (Or, as they point out, start by determining whether a long-tailed data distribution is even involved: "If the problem can be described reasonably well with linear or polynomial constraints--the message was clear: don't use machine learning! And especially don't use deep learning.")
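One way to act on that advice is to check how well a plain straight-line fit explains the data before reaching for ML. The sketch below is an assumption-laden illustration, not a method from the post: the helper names linear_r2 and looks_linear and the 0.95 threshold are hypothetical, and it covers only a single input variable with non-constant values.

```python
def linear_r2(xs, ys):
    """R^2 of an ordinary least-squares line fitted to (xs, ys).
    Assumes at least two points and non-constant xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def looks_linear(xs, ys, threshold=0.95):
    """True when a straight line already explains most of the variance.
    The 0.95 cutoff is a hypothetical choice, not a recommendation."""
    return linear_r2(xs, ys) >= threshold


# A relationship a line captures, and one it does not.
xs_line = list(range(10))
ys_line = [3 * x + 2 for x in xs_line]
xs_curve = list(range(-5, 6))
ys_curve = [x * x for x in xs_curve]

print(looks_linear(xs_line, ys_line))
print(looks_linear(xs_curve, ys_curve))
```

If the check passes, a simple formula is probably enough. If it fails, a polynomial fit is the next thing to try before concluding that ML is needed.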

Rather than approaching a big problem like "bot detection" in a universal way, a company like Cloudflare has used a technique dubbed "componentizing" to simplify the task:

[Cloudflare's] goal was to process a massive set of log files to identify (and flag or block) non-human visitors to millions of websites. Treating this as a single task was ineffective at scale because the concept of a "bot" included hundreds of distinct subtypes--search crawlers, data scrapers, port scanners, etc--exhibiting unique behaviors. Using clustering techniques and experimenting with various levels of granularity, though, they ultimately found 6-7 categories of bots that could each be addressed with a unique supervised learning model. Their models are now running on a meaningful portion of the internet, providing real-time protection, with software-like gross margins.
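The general shape of that approach (cluster first, then train one supervised model per cluster) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Cloudflare's implementation: the two features, the labels, the data, and every function name are hypothetical, the clustering is a bare-bones k-means, and the per-cluster "model" is a simple nearest-class-mean classifier.

```python
from collections import defaultdict
from math import dist


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=20):
    """Bare-bones k-means, seeded with evenly spaced points so runs are repeatable."""
    step = max(1, len(points) // k)
    centers = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean_point(groups[i]) if groups[i] else centers[i] for i in range(k)]
    return centers


def fit_class_means(rows):
    """Tiny supervised model: one mean feature vector per label."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    return {label: mean_point(pts) for label, pts in by_label.items()}


def fit_componentized(rows, k):
    """Cluster the traffic, then fit a separate model inside each cluster."""
    centers = kmeans([features for features, _ in rows], k)
    buckets = defaultdict(list)
    for features, label in rows:
        buckets[nearest(features, centers)].append((features, label))
    models = {i: fit_class_means(bucket) for i, bucket in buckets.items()}
    return centers, models, fit_class_means(rows)  # last item: global fallback


def classify(centers, models, fallback, features):
    """Route the input to its cluster's model; use the global model if none exists."""
    model = models.get(nearest(features, centers), fallback)
    return min(model, key=lambda label: dist(features, model[label]))


# Hypothetical features: (requests per second, share of repeated URLs).
# In the low-rate family bots repeat URLs; in the high-rate family humans do,
# so the global class means for "bot" and "human" land at about the same point.
ROWS = [
    ((0.0, 0.0), "human"), ((0.2, 0.1), "human"),
    ((0.0, 1.0), "bot"), ((0.2, 0.9), "bot"),
    ((10.0, 1.0), "human"), ((10.2, 0.9), "human"),
    ((10.0, 0.0), "bot"), ((10.2, 0.1), "bot"),
]
CENTERS, MODELS, FALLBACK = fit_componentized(ROWS, k=2)
print(classify(CENTERS, MODELS, FALLBACK, (0.1, 0.8)))
print(classify(CENTERS, MODELS, FALLBACK, (10.1, 0.8)))
```

In this contrived data set a single model cannot tell the labels apart, while the two per-cluster models each face an easy problem. Real traffic would need real features, a sturdier clustering method, and experimentation with the number of clusters, much as the quoted passage describes.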

Similar techniques can help in other ways. For natural language processing, for example, narrowing down what users can enter can help "shorten the tail." It also helps to restrict the scope of the output, for instance to product suggestions ("others who bought X also bought Y"). In general, such approaches can give customers enough of the AI magic to satisfy them, without overwhelming the developer with cost and complexity.

All of this is easier if the company/developer clearly understands what they need to build, which depends on a strong sense of the organization's business (and, hence, data). The more clearly someone understands the business, the better they'll be able to break up complicated problems into approachable solutions. 

Disclosure: I work for AWS, but the views expressed herein are mine and don't necessarily reflect those of my employer.
