
Percent-Encoding, In Depth

form vs path vs query · the double-encoding trap.


You already know %20 is space. The deeper question is which %20 — because percent-encoding is not one rule but a family of overlapping rules, and using the wrong one is how authentication bypasses, path-traversal vulnerabilities, and silently-corrupted form data are born.

Three escape sets you meet every day

When you write a URL like https://example.com/search?q=hello world, the question "is the space allowed here?" has three answers depending on where the space appears.

1. The path

Inside a path component (the /... part), the reserved set is /?#. Most other ASCII characters are legal literally. Spaces are not — they must be %20. The plus sign is just a literal +.

GET /docs/quick%20start HTTP/1.1
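
One way to produce a path like this from raw segments is to encode each segment on its own and join the results with /. This is a minimal sketch; buildPath is a hypothetical helper name, not a library function. Note that encodeURIComponent escapes more than a path strictly requires (it would turn + into %2B, for instance), which is harmless over-escaping.

```javascript
// Hypothetical helper: percent-encode each path segment separately,
// then join with "/" so the separators themselves stay literal.
function buildPath(segments) {
  return "/" + segments.map((s) => encodeURIComponent(s)).join("/");
}

console.log(buildPath(["docs", "quick start"])); // "/docs/quick%20start"
console.log(buildPath(["files", "a/b"]));        // "/files/a%2Fb"
```

Encoding per segment keeps a slash inside a segment name (a/b) distinct from the slash that separates segments.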

2. The query string

Inside a query string, the intent is key=value pairs separated by &. To deliver "hello world" as a value:

  • Spec-strict (RFC 3986): use %20.
  • Form-style (HTML form submissions): use +.

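The two are not interchangeable: a form-style decoder reads + as a space, while a strict RFC 3986 decoder leaves it as a literal plus sign. Producer and consumer must agree on which convention is in use, and mismatches between them are a frequent source of bugs.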
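
To see how the two conventions diverge, the sketch below reads the same wire value with a form-style decoder (URLSearchParams) and with strict percent-decoding (decodeURIComponent). The parameter name q is only an illustration.

```javascript
// The same wire value, read by two different decoders.
const wire = "q=hello+world";

// Form-style decoder: "+" means space.
const formValue = new URLSearchParams(wire).get("q");

// Strict percent-decoding: "+" is an ordinary character.
const strictValue = decodeURIComponent(wire.split("=")[1]);

console.log(formValue);   // "hello world"
console.log(strictValue); // "hello+world"

// "%20" is read as a space by both decoders.
const formPercent = new URLSearchParams("q=hello%20world").get("q");
const strictPercent = decodeURIComponent("hello%20world");
console.log(formPercent, "|", strictPercent); // "hello world | hello world"
```

If the consumer's convention is unknown, %20 is the safer choice for a space because both decoders agree on it.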

3. application/x-www-form-urlencoded (the form body)

This is the encoding HTML forms produce when submitted with <form method=post enctype="application/x-www-form-urlencoded">. It's also what most APIs accept as request bodies for token endpoints (OAuth 2 demands it). The rules:

  • Space → +.
  • Reserved + non-ASCII → %XX.
  • Plus sign → %2B (because plain + is space).

POST /login HTTP/1.1
Content-Type: application/x-www-form-urlencoded

email=alice%40example.com&password=p%2Bword
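
A body like the one above does not need to be assembled by hand. As a small sketch, URLSearchParams applies the form rules when serialising, and reverses them when parsing; the field values are the ones from the example request.

```javascript
// Serialise form fields with the application/x-www-form-urlencoded rules.
const body = new URLSearchParams({
  email: "alice@example.com",
  password: "p+word",
}).toString();

console.log(body); // "email=alice%40example.com&password=p%2Bword"

// Parsing reverses the encoding, so the literal plus sign survives.
const parsed = new URLSearchParams(body);
console.log(parsed.get("password")); // "p+word"
```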

Side-by-side

Input   | RFC 3986 path | RFC 3986 query | form-urlencoded
(space) | %20           | %20            | +
+       | +             | +              | %2B
&       | & [1]         | %26            | %26
=       | =             | %3D            | %3D
/       | / (delimiter) | /              | %2F
é       | %C3%A9        | %C3%A9         | %C3%A9

[1] Reserved sub-delim, but commonly encoded inside values.

What all three share: any non-ASCII character is percent-encoded byte by byte from its UTF-8 representation. The differences come down to which delimiters need escaping in which context.

The "right" function for the job

JavaScript exposes three encoders, and picking the wrong one is a common cause of URL bugs.

// encodeURI — preserves URL delimiters. Use when wrapping a whole URL.
encodeURI("https://example.com/p?q=hello world&x=1");
// → "https://example.com/p?q=hello%20world&x=1"
//    note: '?', '=', '&' are PRESERVED — they're URL syntax.

// encodeURIComponent — escapes everything that has special meaning in URLs.
// Use for query VALUES.
encodeURIComponent("hello world&x=1");
// → "hello%20world%26x%3D1"

// URLSearchParams — form-style: '+' for space.
new URLSearchParams({q: "hello world"}).toString();
// → "q=hello+world"

Rule of thumb:

  • Building a query value? encodeURIComponent or URLSearchParams.
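  • Escaping a whole URL that still contains raw characters such as spaces? encodeURI. Don't run it over a URL that already contains %XX escapes (it re-encodes % as %25), and when a URL is itself a query value (e.g., a redirect target), use encodeURIComponent or URLSearchParams instead.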
  • Building a form body? URLSearchParams (or qs.stringify in Node).
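
When a whole URL has to travel as a single query value, its own ?, & and = must be escaped so they are not mistaken for the outer URL's syntax. The sketch below uses the URL and URLSearchParams built-ins; the /login endpoint and the next parameter are hypothetical names chosen for illustration.

```javascript
// A target URL that should be carried as ONE query value.
const target = "https://example.com/p?q=hello world&x=1";

// Hypothetical outer URL with a hypothetical "next" parameter.
const login = new URL("https://example.com/login");
login.searchParams.set("next", target);

console.log(login.toString());
// "https://example.com/login?next=https%3A%2F%2Fexample.com%2Fp%3Fq%3Dhello+world%26x%3D1"

// The receiver gets the original string back, delimiters included.
const roundTripped = new URL(login.toString()).searchParams.get("next");
console.log(roundTripped === target); // true
```

Letting searchParams do the escaping avoids hand-built concatenation, where an unescaped & in the inner URL would silently become a second outer parameter.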

The double-encoding trap

A defender writes a filter:

if (req.url.includes("../")) return res.status(403).end();

An attacker sends:

GET /static/..%252Fadmin/secret.txt HTTP/1.1

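The filter inspects the raw URL, which contains ..%252F and therefore no ../, so the request passes. The framework then decodes the URL once, and ..%2Fadmin arrives at the handler. The handler reads the path, sees ..%2F, and decodes it a second time (for example by calling decodeURIComponent itself, or by handing the path to middleware or a proxy that decodes again; filesystem APIs do not decode percent escapes on their own), so the OS is asked to open ../admin/secret.txt. The first decode turned %252F into %2F; the second decode turned %2F into /.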
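
The two decoding steps can be reproduced directly. This sketch applies decodeURIComponent twice to the attacker's path and runs the defender's includes("../") check at each stage.

```javascript
// The attacker's path as it appears on the wire.
const rawPath = "/static/..%252Fadmin/secret.txt";

const once = decodeURIComponent(rawPath); // first decode: %25 becomes %
const twice = decodeURIComponent(once);   // second decode: %2F becomes /

console.log(rawPath.includes("../")); // false, the filter sees nothing
console.log(once);                    // "/static/..%2Fadmin/secret.txt"
console.log(once.includes("../"));    // false, still nothing to catch
console.log(twice);                   // "/static/../admin/secret.txt"
console.log(twice.includes("../"));   // true, but the check already ran
```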

This is double encoding. The fix is two-fold:

  1. Validate after decoding fully. For the check, decode in a loop until the value is stable, then run path-traversal checks on that result.
  2. Don't decode inputs more than the protocol mandates. A web framework should decode percent-escapes once. Application code should not decode the value it actually uses a second time.
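
The first point can be sketched as a bounded decode loop followed by a segment check. fullyDecode and isTraversalFree are hypothetical helper names, the limit of five rounds is an arbitrary assumption, and a real handler would typically also resolve the path against its root directory before opening anything.

```javascript
// Hypothetical helper: decode repeatedly until the value stops changing.
// Returns null for malformed input or input that never stabilises.
function fullyDecode(value, maxRounds = 5) {
  let current = value;
  for (let round = 0; round < maxRounds; round++) {
    let next;
    try {
      next = decodeURIComponent(current);
    } catch {
      // Malformed escape. On the first round the input itself is invalid;
      // on later rounds a literal "%" was uncovered, so the value is stable.
      return round === 0 ? null : current;
    }
    if (next === current) return current;
    current = next;
  }
  return null; // still changing after maxRounds: treat as suspicious
}

// Hypothetical check: no ".." segment may appear in the fully decoded path.
function isTraversalFree(rawPath) {
  const stable = fullyDecode(rawPath);
  if (stable === null) return false;
  return !stable.split(/[\\/]/).includes("..");
}

console.log(isTraversalFree("/static/..%252Fadmin/secret.txt")); // false
console.log(isTraversalFree("/static/..%2Fadmin/secret.txt"));   // false
console.log(isTraversalFree("/docs/quick%20start"));             // true
```

The fully decoded value is used only for the check; the value passed on to the rest of the application is still the one the framework decoded once.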

Variants of this attack break login filters (adm%69n decodes to admin), SSRF mitigations (http%253A//169.254.169.254), and path filters (%2E%2E%2F for ../).

Common pitfalls

  • Encoding : and @. They have special meaning in userinfo@host:port parts, but in paths and queries they're legal literally. Over-escaping doesn't break anything but produces ugly URLs.
  • Encoding / inside path segments. %2F is not equivalent to / for path-routing. Some servers treat /files%2Fadmin and /files/admin differently. Apache's AllowEncodedSlashes is off by default for exactly this reason.
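  • Mixing encodings. A common bug: build a query string with URLSearchParams, then run a second encoder over it. Now the values are double-encoded: encodeURIComponent turns + into %2B and every % into %25, and even encodeURI, which leaves + alone, still turns % into %25. Pick one and stop.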
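  • Not normalising before comparing. Hello and Hell%6f decode to the same string. If you cache or compare URLs, normalise first: decode escapes of unreserved characters such as letters and digits, but leave escaped delimiters like %2F alone, since decoding those changes the URL's meaning.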
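
The mixing-encodings pitfall is easy to reproduce. In this sketch the parameter names q and tag are illustrative; the point is what a second encoder does to an already-encoded query string, and what the receiver then sees.

```javascript
// Step 1: a correctly encoded query string.
const query = new URLSearchParams({ q: "hello world", tag: "a&b" }).toString();
console.log(query); // "q=hello+world&tag=a%26b"

// Step 2 (the bug): running a second encoder over it.
const viaEncodeURI = encodeURI(query);
const viaComponent = encodeURIComponent(query);
console.log(viaEncodeURI); // "q=hello+world&tag=a%2526b"
console.log(viaComponent); // "q%3Dhello%2Bworld%26tag%3Da%2526b"

// The receiver decodes once and gets the wrong value for "tag".
const received = new URLSearchParams(viaEncodeURI).get("tag");
console.log(received); // "a%26b" instead of "a&b"
```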

Mental model

Percent-encoding is a way to carry bytes safely as text. It encodes one byte (0x00 to 0xFF) as three characters (%XX). Multi-byte characters (UTF-8) become multiple %XX escapes: é (UTF-8: C3 A9) becomes %C3%A9.
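
The byte-level mechanism can be spelled out in a few lines. This sketch escapes every byte, which no real encoder does (real encoders leave safe characters alone); percentEncodeAll is a hypothetical name used only to show the mechanism.

```javascript
// Hypothetical illustration: turn EVERY UTF-8 byte of the input into %XX.
function percentEncodeAll(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the string
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeAll("é")); // "%C3%A9" (two bytes, two escapes)
console.log(percentEncodeAll(" ")); // "%20"

// The built-in encoder produces the same escapes for the bytes it escapes.
console.log(encodeURIComponent("é")); // "%C3%A9"
```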

Three contexts, three escape sets, but one underlying mechanism. Pick the right context, never decode twice, and prefer the standard library encoder over hand-rolled escaping.

Tools in the wild

  • Browser- and Node-built-in form-urlencoded encoder/decoder. Always prefer it over hand-rolled escaping. (library)
  • urllib.parse: Python stdlib with quote/quote_plus/unquote/unquote_plus + urlencode. Three contexts, three functions. (library, free tier)
  • Pen-tester's tool for nested encoding/decoding, useful for hunting double-encoding bypasses. (service)
  • ZAP: Open-source web app scanner that automatically tries double-encoded payloads to bypass filters. (service, free tier)