URL parsing: A ticking time bomb of security exploits

The modern world would grind to a halt without URLs, but years of inconsistent parsing specifications have created an environment ripe for exploitation that puts countless businesses at risk.

A team of security researchers has discovered serious flaws in the way the modern internet parses URLs. Specifically, there are too many URL parsers with inconsistent rules, and that inconsistency has left the web easy for savvy attackers to exploit.

We don't need to look very hard to find URL parsing being manipulated in the wild to devastating effect: The late-2021 Log4j exploit is a perfect example, the researchers said in their report.

"Because of Log4j's popularity, millions of servers and applications were affected, forcing administrators to determine where Log4j may be in their environments and their exposure to proof-of-concept attacks in the wild," the report said. 

Without going too deeply into Log4j, the basics are that the exploit uses a malicious string that, when logged, triggers a Java lookup connecting the victim to the attacker's machine, which then delivers a payload.

The initial remedy for Log4j allowed Java lookups only to whitelisted sites. Attackers quickly found a way around the fix: By putting localhost in the malicious URL and separating it from the rest of the address with a # symbol, they were able to confuse the parsers and carry on attacking.
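To make the bypass concrete, here is a minimal Python sketch of the kind of disagreement described above. It is not the Log4j or Java code; the URL, the attacker host evil.example and both helper functions are hypothetical, chosen only to show how two parsers can read different hosts out of one string when a # is involved.

```python
from urllib.parse import urlsplit

# Hypothetical URL for illustration; evil.example stands in for an attacker host.
URL = "ldap://127.0.0.1#.evil.example:1389/a"


def host_rfc_style(url: str) -> str:
    """Host as seen by a parser that ends the authority at '/', '?' or '#'."""
    return urlsplit(url).hostname or ""


def host_naive(url: str) -> str:
    """Host as seen by a naive parser that only stops at the first '/'."""
    authority = url.split("://", 1)[1].split("/", 1)[0]
    return authority.rsplit(":", 1)[0]  # drop the trailing port, if any


validator_view = host_rfc_style(URL)  # what an allowlist check might see
fetcher_view = host_naive(URL)        # what a different component might see

print(validator_view)  # 127.0.0.1
print(fetcher_view)    # 127.0.0.1#.evil.example
```

If the allowlist check uses the first reading and the component that makes the connection uses the second, the check passes while the request can still go somewhere else.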

Log4j was serious; the fact that it relied on something as universal as URLs makes it even more so. To understand why URL parsing vulnerabilities are so dangerous, it helps to know exactly what URL parsing means, and the report does a good job of explaining just that.

Figure A: The five parts of a URL

Image: Claroty/Team82/Snyk

The color-coded URL in Figure A shows an address broken down into its five parts. In 1994, when URLs were first defined, systems for parsing URLs into components that software can act on were created too, and since then several newer requests for comments (RFCs) have further elaborated on URL standards.
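As a rough illustration of that breakdown, the sketch below splits a URL with Python's standard urllib.parse module. It assumes the five parts in Figure A correspond to the RFC 3986 components (scheme, authority, path, query and fragment); the example URL itself is made up.

```python
from urllib.parse import urlsplit

# Hypothetical example URL; component names follow RFC 3986 terminology.
parts = urlsplit("https://user@example.com:8080/docs/page?lang=en#intro")

print(parts.scheme)    # https
print(parts.netloc)    # user@example.com:8080  (the authority)
print(parts.path)      # /docs/page
print(parts.query)     # lang=en
print(parts.fragment)  # intro

# The authority can be broken down further.
print(parts.hostname)  # example.com
print(parts.port)      # 8080
```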

Unfortunately, not all parsers have kept up with newer standards. There are a lot of parsers in use, and many have different ideas of how to interpret a URL. Therein lies the problem.

URL parsing flaws: What researchers found

Researchers at Team82 and Snyk worked together to analyze 16 different URL parsing libraries and tools written in a variety of languages:

  1. urllib (Python)
  2. urllib3 (Python)
  3. rfc3986 (Python)
  4. httptools (Python)
  5. curl lib (cURL)
  6. Wget 
  7. Chrome (Browser)
  8. Uri (.NET)
  9. URL (Java)
  10. URI (Java)
  11. parse_url (PHP)
  12. url (NodeJS)
  13. url-parse (NodeJS) 
  14. net/url (Go)
  15. uri (Ruby)
  16. URI (Perl)

Their analyses of those parsers identified five different scenarios in which most URL parsers behave in unexpected ways:

  • Scheme confusion, in which the attacker uses a malformed URL scheme
  • Slash confusion, which involves using an unexpected number of slashes
  • Backslash confusion, which involves putting any backslashes (\) into a URL
  • URL-encoded data confusion, which involves URLs that contain URL-encoded data
  • Scheme mixup, which involves parsing a URL with a specific scheme (HTTP, HTTPS, etc.)
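The first four categories can be reproduced with a single parser. The sketch below uses Python's urllib.parse (one of the 16 libraries listed above) with made-up URLs; it shows only how this one parser reads each input, not how the others do, and it does not attempt to cover scheme mixup.

```python
from urllib.parse import unquote, urlsplit

# Scheme confusion: with no scheme, urlsplit finds no authority at all
# and puts the would-be host in the path.
no_scheme = urlsplit("example.com/path")
print(repr(no_scheme.netloc), no_scheme.path)      # '' example.com/path

# Slash confusion: one extra slash leaves the authority empty.
extra_slash = urlsplit("https:///example.com/path")
print(repr(extra_slash.netloc), extra_slash.path)  # '' /example.com/path

# Backslash confusion: urlsplit treats '\' as an ordinary character, so
# everything before '@' becomes userinfo and the host is evil.example.
backslash = urlsplit("https://example.com\\@evil.example/")
print(backslash.hostname)                          # evil.example

# URL-encoded data confusion: the path is returned undecoded, so a check
# on the raw path and a check on the decoded path see different strings.
encoded = urlsplit("https://example.com/%2e%2e/admin")
print(encoded.path)           # /%2e%2e/admin
print(unquote(encoded.path))  # /../admin
```

A parser that follows browser-style rules would typically treat the backslash in the third URL as a forward slash and report example.com as the host, which is exactly the sort of mismatch the researchers describe.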

Eight documented and patched vulnerabilities were identified in the course of the research, but the team said that unsupported versions of Flask still contain these vulnerabilities: You've been warned.

What you can do to avoid URL parsing attacks

It's a good idea to protect yourself proactively against vulnerabilities with the potential to wreak havoc on the scale of Log4j, but because URL parsers are such a low-level necessity, it might not be easy.

The report authors recommend starting by taking the time to identify the parsers used in your software and to understand how they behave differently, what sort of URLs they support and more. Additionally, never trust user-supplied URLs: Canonicalize and validate them first, and account for parser differences in the validation process.
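One way to read that advice is sketched below in Python. The function name, the allowed schemes and the host allowlist are all assumptions made for illustration, not something the report prescribes, and a real deployment would need rules that match its own parsers.

```python
from urllib.parse import urlsplit, urlunsplit

ALLOWED_SCHEMES = {"http", "https"}                 # assumption: web URLs only
ALLOWED_HOSTS = {"example.com", "api.example.com"}  # hypothetical allowlist


def canonicalize_and_validate(raw: str) -> str:
    """Return a rebuilt URL, or raise ValueError if the input looks unsafe."""
    # Reject characters that parsers are known to disagree about.
    if "\\" in raw or any(ch.isspace() or ord(ch) < 0x20 for ch in raw):
        raise ValueError("backslash, whitespace or control character in URL")

    parts = urlsplit(raw)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {scheme!r}")
    if parts.username is not None or parts.password is not None:
        raise ValueError("userinfo is not allowed")
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {host!r}")

    port = parts.port  # raises ValueError on a non-numeric or out-of-range port
    netloc = host if port is None else f"{host}:{port}"
    # Rebuild from the validated pieces (dropping the fragment) so that
    # downstream code never sees the raw input.
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


print(canonicalize_and_validate("HTTPS://Example.COM:443/a?b=1#frag"))
# https://example.com:443/a?b=1
```

Returning the rebuilt string, rather than a yes/no answer about the original, means every later component works from the same value that was checked.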

The report also has some general best practice tips for URL parsing that can help reduce the chance of falling victim to a parsing attack:

  • Use as few URL parsers as possible, or none at all. The report authors say "it is easily achievable in many cases."
  • If using microservices, parse the URL at the front end and send the parsed info across environments. 
  • Parsers involved with application business logic often behave differently. Understand those differences and how they affect additional systems.
  • Canonicalize before parsing. That way, even if a malicious URL is present, the known trusted one is what gets forwarded to the parser and beyond.
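The microservices tip above can be sketched as follows: The URL is parsed once at the edge, and only the resulting fields travel between services, so no downstream service has to re-parse the original string. The ParsedTarget message format, the helper names and the use of JSON are hypothetical choices for this example.

```python
import json
from dataclasses import asdict, dataclass
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


@dataclass(frozen=True)
class ParsedTarget:
    """Hypothetical message format for passing an already-parsed URL."""
    scheme: str
    host: str
    port: int
    path: str
    query: str


def parse_at_edge(url: str) -> ParsedTarget:
    """Parse once, at the front end, with a single parser."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError("unsupported or incomplete URL")
    port = parts.port if parts.port is not None else DEFAULT_PORTS[scheme]
    return ParsedTarget(scheme, parts.hostname, port, parts.path or "/", parts.query)


def to_message(target: ParsedTarget) -> str:
    """Serialize the parsed fields for another service."""
    return json.dumps(asdict(target))


def from_message(message: str) -> ParsedTarget:
    """Rebuild the parsed fields without touching a URL parser."""
    return ParsedTarget(**json.loads(message))


target = parse_at_edge("https://example.com/orders?id=7")
received = from_message(to_message(target))
print(received.host, received.port, received.path)  # example.com 443 /orders
```

Because the receiving side reads named fields instead of a URL string, a second parser with different rules never gets the chance to disagree with the first.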
