Machine learning can tell you which developer wrote what code, with 95% accuracy

New research identifies developers from code samples, and could help settle copyright disputes.

Machine learning can be used to de-anonymize specific code samples with 90-95% accuracy, according to new research from Drexel University's Rachel Greenstadt and George Washington University's Aylin Caliskan.

As noted in a report by Wired, the pair of researchers presented their findings at the 2018 DefCon security conference in Las Vegas. Using machine learning to find out who wrote a given piece of code could help with copyright disputes, but it could also have some privacy implications, the Wired report noted.

The tool works by using some of the principles of stylometry—the analytical method of studying one's linguistic style. Applied to code, the report noted, it can find the fingerprint a developer may have left behind in their work.

The researchers' algorithm looks at about 50 different features to match a given code sample to a particular developer. It first has to be trained on examples of that developer's code, the report mentioned.
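To make the idea concrete, here is a minimal sketch of code stylometry in Python. It is not the researchers' feature set or model: the six features, the nearest-centroid matching, the function names (`extract_features`, `train`, `predict`), and the sample authors "alice" and "bob" are all assumptions chosen for illustration.

```python
import math
import re
from collections import defaultdict


def extract_features(source):
    """Turn a code sample into a small vector of layout and lexical features.

    Hypothetical feature set for illustration only.
    """
    lines = source.splitlines() or [""]
    nonblank = [ln for ln in lines if ln.strip()]
    # Identifier-like tokens; language keywords are included for simplicity.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    n_lines = len(lines)
    n_nonblank = max(len(nonblank), 1)
    n_tokens = max(len(tokens), 1)
    return [
        sum(len(ln) for ln in lines) / n_lines,                   # mean line length
        sum(ln.startswith("\t") for ln in nonblank) / n_nonblank,  # share of tab-indented lines
        (n_lines - len(nonblank)) / n_lines,                      # share of blank lines
        sum(len(t) for t in tokens) / n_tokens,                   # mean token length
        sum("_" in t for t in tokens) / n_tokens,                 # share of tokens with underscores
        sum(ln.lstrip().startswith("#") for ln in lines) / n_lines,  # share of comment lines
    ]


def train(samples):
    """Build a model mapping each author to the mean feature vector of their samples."""
    grouped = defaultdict(list)
    for author, source in samples:
        grouped[author].append(extract_features(source))
    return {
        author: [sum(col) / len(col) for col in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict(model, source):
    """Return the author whose mean feature vector is closest to the sample's."""
    features = extract_features(source)
    return min(model, key=lambda author: math.dist(model[author], features))


# Made-up training data: two authors with visibly different habits.
TRAINING = [
    ("alice", "def compute_total(item_prices):\n    running_total = 0\n"
              "    for item_price in item_prices:\n        running_total += item_price\n"
              "    return running_total\n"),
    ("alice", "def find_largest(input_values):\n    largest_value = input_values[0]\n"
              "    for current_value in input_values:\n"
              "        if current_value > largest_value:\n"
              "            largest_value = current_value\n    return largest_value\n"),
    ("bob", "def sumAll(xs):\n\tt=0\n\tfor x in xs:\n\t\tt+=x\n\treturn t\n"),
    ("bob", "def maxOf(xs):\n\tm=xs[0]\n\tfor x in xs:\n\t\tif x>m: m=x\n\treturn m\n"),
]

model = train(TRAINING)
unknown = ("def count_even(input_values):\n    even_count = 0\n"
           "    for current_value in input_values:\n"
           "        if current_value % 2 == 0:\n            even_count += 1\n"
           "    return even_count\n")
print(predict(model, unknown))
```

The sketch also shows why training data matters: without earlier samples from a developer, there is no stored profile to compare a new sample against. A realistic system would use far more features, scale them so that no single one dominates the distance, and use a stronger classifier.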

The researchers have been working on this type of algorithm for a while. In 2017, they published a paper on their work (including input from other researchers) that showed how you only need small code fragments to de-anonymize code and match it to a developer.

The Wired report also noted that the researchers published another paper explaining how programmers could be de-anonymized using only their compiled binary code. In that instance, they were able to decompile the code back into C++ and then analyze it.

So, what are the implications of this work? As noted, it could be used in cases of suspected plagiarism or copyright infringement, or to tell who created a particular malware tool. However, the report noted, it could also be used to find out who is contributing to a given open source project, or who may be building a tool to circumvent particular government censorship or processes.

Interestingly, experienced developers are easier to spot than novice ones, mostly due to the unique style they pick up over the course of their careers. Code samples addressing difficult problems were also easier to de-anonymize (95% success rate) than those for easy problems (90% success rate), the report said. The researchers could also differentiate between developers from different countries.

The big takeaways for tech leaders:

  • Researchers have developed a machine learning algorithm that can de-anonymize code to figure out which developer wrote it.
  • More experienced developers are easier to de-anonymize than novices, due to their unique style.

By Conner Forrest

Conner Forrest is a Senior Editor for TechRepublic. He covers enterprise technology and is interested in the convergence of tech and culture.