HTML Escaping and XSS — Why You Must Encode Special Characters

HTML escaping is the process of replacing characters that "have a special meaning in HTML" — such as & < > " ' — with entities (character references) that look the same when displayed. Neglect it and a string a user typed is interpreted as tags or scripts, leading to the serious vulnerability known as XSS (cross-site scripting). This article lays out, accurately, what to convert and why, how XSS works and its types, and how to defend correctly according to the output context.

The bottom line first: "when you output a user-derived value to the page, escape it to match the place (context) you output it into" is the heart of XSS defense. Input validation alone is not enough, because the escaping needed differs across HTML body, attributes, JavaScript, and URLs. Modern frameworks auto-escape by default, so the first step is simply not to disable that.

1. What HTML escaping is — turning five special characters into entities

When the browser reads HTML, it treats certain characters as "symbols that express structure." For example, < is the signal "a tag begins here," and & is the signal "a character reference begins here." When you want to display those characters as themselves, you must replace them with entities. That is HTML escaping.

The representative characters targeted by HTML escaping are the following five. All code and symbols are shown in entity notation.

| Character | Name | Entity (named) | Numeric character reference |
| --- | --- | --- | --- |
| & | Ampersand | &amp; | &#38; |
| < | Less-than | &lt; | &#60; |
| > | Greater-than | &gt; | &#62; |
| " | Double quote | &quot; | &#34; |
| ' | Single quote | &#39; (or &apos;) | &#39; |

Mind the order of conversion: always convert & to &amp; first. If you convert < to &lt; before handling &, the & you just produced would be converted again (becoming something like &amp;lt;) and break.

For instance, suppose a user types <b>bold</b>. If you output it without escaping, the browser treats <b> as a tag and actually renders "bold" in bold. If you escape it and output &lt;b&gt;bold&lt;/b&gt;, the screen literally shows the text <b>bold</b>.
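
The conversion order can be seen in a few lines of Python. This is a minimal sketch for illustration only; the function names `escape_html` and `escape_html_wrong_order` are made up for this example, and real code would normally call a standard library escaper rather than a hand-written one.

```python
def escape_html(text: str) -> str:
    """Replace the five special characters with entities, handling & first."""
    return (
        text.replace("&", "&amp;")   # must come first
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
            .replace("'", "&#39;")
    )


def escape_html_wrong_order(text: str) -> str:
    """Deliberately broken: & is handled after <, so &lt; is escaped again."""
    return text.replace("<", "&lt;").replace("&", "&amp;")


print(escape_html("<b>bold</b>"))    # &lt;b&gt;bold&lt;/b&gt;
print(escape_html_wrong_order("<"))  # &amp;lt;
```

The second function shows the double-conversion problem: the & introduced by the first replacement is itself converted by the second.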

2. Why it is necessary — the danger of user input being interpreted as HTML

Most web apps display values that users enter (names, comments, search keywords, and so on). The problem is when that value is embedded directly into the HTML. The browser cannot distinguish "this is data" from "this is an instruction (a tag)," and it interprets the embedded string as part of the HTML.

Suppose the following string is posted into a comment field (shown in entity notation).

<script>alert(document.cookie)</script>

If this is output to the page without escaping, the browser executes <script> as a real script tag. As a result, the attacker's code runs in the browser of every other user who opens that page. This is the basic structure of XSS.

The key point is that "data and instructions travel through the same channel (HTML)." It helps to think of escaping as a mechanism that explicitly marks data as harmless text so it cannot be mistaken for an instruction. It is the same idea as placeholders (prepared statements) in SQL.
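
As a rough illustration of the difference, the sketch below builds the same comment markup twice, once by plain concatenation and once through Python's standard `html.escape`. The function names and the `comment` CSS class are assumptions made up for this example.

```python
import html

PAYLOAD = "<script>alert(document.cookie)</script>"


def render_comment_unsafe(comment: str) -> str:
    # Vulnerable: the comment is pasted into the markup unchanged.
    return '<p class="comment">' + comment + "</p>"


def render_comment_safe(comment: str) -> str:
    # html.escape converts & < > and, by default, both quote characters.
    return '<p class="comment">' + html.escape(comment) + "</p>"


print(render_comment_unsafe(PAYLOAD))  # contains a live <script> element
print(render_comment_safe(PAYLOAD))    # contains only inert text
```

In the safe version the browser receives &lt;script&gt; and displays it as text instead of starting a script element.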

3. How XSS (cross-site scripting) works and its types

XSS is a vulnerability in which an attacker's string is executed as script or HTML within a page. Once executed, it enables theft of cookies and access tokens, defacement of the page, forging of forms, and actions performed while impersonating the user. XSS is broadly divided into three types by how the string gets in.

| Type | Path of the attack string | Characteristics |
| --- | --- | --- |
| Reflected | Sent via a URL parameter, etc., and reflected straight back in the response | Requires luring the victim into following a trap link. Common in search results and error displays |
| Stored | Saved in a database, etc., and later served to other users | Occurs on bulletin boards and comment fields. Affects all viewers, so the impact is large |
| DOM-based | May never reach the server; JavaScript in the browser passes it to a dangerous operation | Caused by handling of innerHTML, location, etc. Cannot be prevented by server-side escaping alone |

Reflected and stored XSS are caused by a missed escape at the moment the server outputs the string. DOM-based XSS occurs when the HTML from the server is correct but JavaScript in the browser passes a user-derived value, unmodified, into something like innerHTML. In other words, "fix only the server side and you are safe" is not true.

4. Preventing it with escaping — defenses by output context

The essence of XSS defense is to "escape according to where you output the value." Even for the same user input, the dangerous characters and the escaping required differ depending on whether you place it in the HTML body, in an attribute value, or inside JavaScript.

HTML body (element content)

Convert at minimum & < >. This blocks the start of a tag.

Attribute values

Always quote attributes, and in addition to the three body characters, convert the enclosing quote (" or '). If you do not escape the quote, an attacker can close the attribute early and inject a new tag, as in "><script>... For example, a value such as x" onmouseover="alert(1) breaks out of the attribute and injects an event handler.
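
The attribute breakout can be reproduced with the value from the paragraph above. This is a sketch under assumptions: the `input` element and the function names are invented for illustration, and `html.escape` is Python's standard library escaper.

```python
import html

ATTACK = 'x" onmouseover="alert(1)'


def input_tag_unsafe(value: str) -> str:
    # Vulnerable: the double quote in the value closes the attribute early.
    return '<input type="text" value="' + value + '">'


def input_tag_safe(value: str) -> str:
    # quote=True (the default) also converts " and ', so the value
    # cannot terminate the quoted attribute.
    return '<input type="text" value="' + html.escape(value, quote=True) + '">'


print(input_tag_unsafe(ATTACK))  # <input type="text" value="x" onmouseover="alert(1)">
print(input_tag_safe(ATTACK))    # <input type="text" value="x&quot; onmouseover=&quot;alert(1)">
```

In the first output the browser sees a separate onmouseover attribute; in the second the whole string remains the value of value.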

JavaScript context

As a rule, avoid embedding user input directly inside <script>. If you must pass it, use a form that is safe as JavaScript (for example, encode it on the server with the equivalent of JSON.stringify, and additionally neutralize sequences such as </script>) rather than HTML escaping. Reusing HTML escaping as-is does not make it safe.
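
One possible shape for the server-side encoding described above is sketched here in Python. The helper name `to_script_literal` and the variable `userName` are hypothetical; the idea is to serialize with `json.dumps` and then replace the characters the HTML parser reacts to with JavaScript unicode escapes.

```python
import json


def to_script_literal(value: object) -> str:
    """Serialize a value as a JavaScript literal for use inside <script>."""
    literal = json.dumps(value)  # escapes quotes, backslashes, and non-ASCII
    # Hide characters the HTML parser reacts to, so "</script>" or "<!--"
    # in the data cannot end or alter the script element.
    return (
        literal.replace("<", "\\u003c")
               .replace(">", "\\u003e")
               .replace("&", "\\u0026")
    )


name = "</script><script>alert(1)</script>"
snippet = "<script>const userName = " + to_script_literal(name) + ";</script>"
print(snippet)
```

The JavaScript engine decodes \u003c back to < inside the string literal, so the program sees the original value, while the HTML parser never encounters a closing script tag in the data.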

URL context

When you place a user value into href or src, you need URL encoding plus scheme validation. A URL starting with javascript: executes script on click, so allow only permitted schemes such as http / https.
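
A scheme allowlist might look like the following Python sketch. The names `safe_href`, `link_tag`, and `search_url`, the `#` fallback, and the `example.com` address are all assumptions for illustration. For simplicity this version also rejects relative URLs, which a real application may want to permit.

```python
import html
from urllib.parse import quote, urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def safe_href(url: str, fallback: str = "#") -> str:
    """Return the URL if it is absolute http(s), otherwise a harmless fallback."""
    parts = urlsplit(url.strip())
    # urlsplit lowercases the scheme, so "JavaScript:" is caught as well.
    if parts.scheme in ALLOWED_SCHEMES and parts.netloc:
        return parts.geturl()
    return fallback


def link_tag(url: str, label: str) -> str:
    # Validate the scheme, then HTML-escape for the attribute and body contexts.
    return '<a href="' + html.escape(safe_href(url)) + '">' + html.escape(label) + "</a>"


def search_url(keyword: str) -> str:
    # URL-encode a user value before placing it in a query string.
    return "https://example.com/search?q=" + quote(keyword, safe="")


print(link_tag("javascript:alert(1)", "click me"))  # <a href="#">click me</a>
print(search_url("a b&c"))                          # https://example.com/search?q=a%20b%26c
```

Note that three separate steps are involved: URL encoding for the value placed in the query string, scheme validation for the URL as a whole, and HTML escaping for the attribute it is written into.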

Input validation and escaping are different things: input-time validation (format checks) is useful, but on its own it cannot prevent XSS. Escaping at output time is the direct defense. Use both, and always pass through output escaping as the "last line of defense."

5. Types of entities — named and numeric character references

The "character reference" that represents an escaped value comes in two families: named character references such as &lt;, and numeric character references such as &#60; (decimal) or &#x3C; (hexadecimal). Both display as the same character.

For either form, the correct way to write it is with a trailing semicolon ;. Note that &apos; (the named reference for the single quote) is defined in HTML5 and XML but was not part of HTML 4, so for compatibility reasons the numeric &#39; is often preferred in practice. Rather than implementing escaping yourself, it is safer and more reliable to use the standard functions of your language or framework.
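
As one example of such a standard function, Python ships `html.escape` and `html.unescape`. The short sketch below uses them to check that the different reference forms decode to the same character; the sample string is made up.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
assert html.unescape("&lt;") == html.unescape("&#60;") == html.unescape("&#x3C;") == "<"
assert html.unescape("&apos;") == html.unescape("&#39;") == "'"

# The standard escaper emits the hexadecimal reference &#x27; for the single quote.
escaped = html.escape("Tom & Jerry's <show>")
print(escaped)  # Tom &amp; Jerry&#x27;s &lt;show&gt;
```

Different libraries choose different but equivalent references for the single quote, which is harmless because the browser decodes them all to the same character.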

Numeric character references are turned back into the original character at the stage the browser interprets them. Because of this, attackers may try to slip past detection with expressions like &#60;script&#62;. Do not try to prevent XSS with your own blacklist (a scheme that rejects specific strings). The proper approach is to always pass output through escaping (or a sanitization library).
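
The weakness of a blacklist can be sketched as follows. The filter here is deliberately naive and hypothetical (`naive_filter_accepts` is not a real library function); it only illustrates why matching on specific strings is unreliable while output escaping is not.

```python
import html

BLACKLIST = ("<script",)


def naive_filter_accepts(text: str) -> bool:
    """A deliberately weak check: reject only if a blacklisted string appears."""
    lowered = text.lower()
    return not any(bad in lowered for bad in BLACKLIST)


encoded = "&#60;script&#62;alert(1)&#60;/script&#62;"

print(naive_filter_accepts(encoded))  # True: nothing on the blacklist matches
print(html.unescape(encoded))         # <script>alert(1)</script>
print(html.escape(encoded))           # &amp;#60;script&amp;#62;alert(1)&amp;#60;/script&amp;#62;
```

The filter lets the encoded form through, and any later step that decodes it yields a real tag. Escaping at output time converts the & itself, so the reference is displayed literally and never decoded into markup.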

6. In practice — framework auto-escaping and CSP

Modern template engines and UI frameworks (React, Vue, various server-side templates, and so on) HTML-escape by default when you embed a variable. Many XSS flaws arise when this auto-escaping is explicitly disabled by an "insert raw HTML" feature, such as dangerouslySetInnerHTML in React or v-html in Vue. The cardinal rule is to not use raw-HTML insertion (the equivalent of innerHTML) for untrusted values.
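
The behavior can be imitated with a toy renderer. This is not how any particular framework is implemented; `Raw` and `render` are hypothetical names, and the sketch only shows the pattern of escaping by default with an explicit opt-out.

```python
import html


class Raw(str):
    """Marks a string as trusted markup that should bypass auto-escaping."""


def render(template: str, **values: object) -> str:
    """Fill a trusted template, escaping every value unless it is wrapped in Raw."""
    prepared = {
        key: value if isinstance(value, Raw) else html.escape(str(value))
        for key, value in values.items()
    }
    return template.format(**prepared)


comment = "<img src=x onerror=alert(1)>"
print(render("<p>{body}</p>", body=comment))       # <p>&lt;img src=x onerror=alert(1)&gt;</p>
print(render("<p>{body}</p>", body=Raw(comment)))  # <p><img src=x onerror=alert(1)></p>
```

The second call is the equivalent of a raw-HTML feature: the opt-out is visible in the code, which makes it easy to search for and review, but it is only safe when the wrapped value is trusted.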

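In summary: XSS defense is built on "escaping according to the output context," layered with not disabling the framework's auto-escaping, not inserting untrusted values as raw HTML, and adding defense in depth with CSP (Content Security Policy), an HTTP response header that tells the browser which script sources it may execute, so that an injected script can be blocked even when an escape is missed.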
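
For the CSP layer, a nonce-based policy is one common shape. The Python sketch below only builds the header value; the helper name `build_csp_header` is an assumption, and the exact directives a site needs depend on what it loads.

```python
import secrets


def build_csp_header(nonce: str) -> tuple[str, str]:
    """Return a (name, value) pair for a nonce-based Content-Security-Policy."""
    policy = "; ".join([
        f"script-src 'nonce-{nonce}'",  # only scripts carrying this nonce may run
        "object-src 'none'",
        "base-uri 'none'",
    ])
    return ("Content-Security-Policy", policy)


nonce = secrets.token_urlsafe(16)  # fresh, unpredictable value for every response
header_name, header_value = build_csp_header(nonce)
print(header_name + ": " + header_value)
```

With such a policy, the page's own script elements carry a matching nonce attribute, and an injected script without it is refused by the browser. The nonce must be generated anew for each response for this to hold.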

Frequently Asked Questions (FAQ)

What is HTML escaping?

It is the process of replacing characters that have a special meaning in HTML with entities (character references) that look the same when displayed. Specifically, the five characters ampersand, less-than, greater-than, double quote, and single quote are converted into forms like &amp; &lt; &gt; &quot; &#39;. As a result, the string is not interpreted as a tag or attribute and is displayed purely as text.

What is XSS (cross-site scripting)?

It is a vulnerability in which a string supplied by an attacker is interpreted and executed as script or HTML within a web page. When unescaped user input is output to the page as-is, the attacker's script tag runs, enabling theft of cookies and tokens, defacement of the page, and actions performed while impersonating the user. It is broadly divided into three types: reflected, stored, and DOM-based.

Which characters should I convert?

In HTML body text, convert at minimum the three characters & (ampersand), < (less-than), and > (greater-than); inside attribute values, also convert " (double quote) and ' (single quote). Converting & first is the established rule. However, the escaping required for safety changes with the output context (HTML body, attribute, inside JavaScript, or inside a URL), so context-appropriate escaping is essential.
