"How many characters is this text?" is a deceptively simple question that has no single answer. For the same text, the result changes depending on whether you count bytes, code units, code points, or graphemes (the characters you see). This article walks through the basics of Unicode and UTF-8, all the way to the common programming pitfall of "an emoji counted as two characters," with verified worked examples.
1. The "character count" depends on how you count
For example, "a" is one character and one byte by anyone's count. But "あ" is one character yet three bytes in UTF-8, and although "😀" looks like a single character, JavaScript's "😀".length returns 2. None of these is wrong; they are simply counting different units. The starting point is to distinguish these four ways of counting.
- Byte count: the actual amount of data after encoding. It changes with the encoding (such as UTF-8).
- Code unit count: the number of minimal units of an encoding form. JavaScript's str.length is the number of UTF-16 code units.
- Code point count: the number of Unicode code points (U+XXXX).
- Grapheme cluster count: the number of units a person perceives as "one character."
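The four units above can be measured side by side. The following is a minimal sketch, assuming a runtime that provides the standard TextEncoder and Intl.Segmenter globals (modern browsers and recent Node.js); the function name countAll and the "en" locale are illustrative choices, not something prescribed by the article.

```javascript
// Count one string in the four units discussed above.
// countAll is a hypothetical helper name used only for this illustration.
function countAll(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    bytes: new TextEncoder().encode(str).length,    // UTF-8 bytes
    codeUnits: str.length,                          // UTF-16 code units
    codePoints: [...str].length,                    // Unicode code points
    graphemes: [...segmenter.segment(str)].length,  // user-perceived characters
  };
}

console.log(countAll("a"));  // { bytes: 1, codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(countAll("あ")); // { bytes: 3, codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(countAll("😀")); // { bytes: 4, codeUnits: 2, codePoints: 1, graphemes: 1 }
```

Returning all four numbers at once makes it harder to confuse them when a limit or a storage size has to be checked.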
2. Unicode basics — code points and planes
Unicode is a standard that aims to assign a unique number to every character used in the world's writing systems. That number is called a code point and is written as a hexadecimal value following U+. For example, "A" is U+0041, "あ" is U+3042, and "😀" is U+1F600.
Code points range from U+0000 to U+10FFFF and are divided into 17 "planes" of 65,536 (64K) code points each.
- BMP (Basic Multilingual Plane): U+0000–U+FFFF. Contains everyday characters such as Latin letters, hiragana, and most kanji.
- Supplementary Planes: U+10000–U+10FFFF. Many emoji, some kanji, historic scripts, and so on.
Whether a character "fits in the BMP or is in a supplementary plane" is an important boundary that directly affects UTF-16 surrogate pairs and the str.length discrepancy discussed below.
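To see which side of that boundary a character falls on, you can read its code point and derive the plane from it. This is a small sketch; describeCodePoint is a hypothetical helper name, and it looks only at the first code point of the string it is given.

```javascript
// Describe the first code point of a string: its U+ notation and its plane.
// describeCodePoint is a hypothetical helper name used for illustration.
function describeCodePoint(ch) {
  const cp = ch.codePointAt(0); // numeric code point, e.g. 0x1F600
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  const plane = cp >> 16;       // each plane holds 0x10000 code points
  return { notation: "U+" + hex, plane, inBMP: plane === 0 };
}

console.log(describeCodePoint("A"));  // { notation: 'U+0041', plane: 0, inBMP: true }
console.log(describeCodePoint("あ")); // { notation: 'U+3042', plane: 0, inBMP: true }
console.log(describeCodePoint("😀")); // { notation: 'U+1F600', plane: 1, inBMP: false }
```

Note that codePointAt(0) is used rather than charCodeAt(0): the latter would return only the first UTF-16 code unit for a supplementary-plane character.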
3. UTF-8 — a variable-length encoding
UTF-8 encodes a code point into a byte sequence using a variable length of 1 to 4 bytes, and it is the de facto standard of today's web. The number of bytes is determined by the size of the code point.
| Code point range | UTF-8 byte length | Examples |
|---|---|---|
| U+0000–U+007F | 1 byte | ASCII (A, digits, symbols) |
| U+0080–U+07FF | 2 bytes | Latin Extended, Greek, Cyrillic, etc. |
| U+0800–U+FFFF | 3 bytes | Most Japanese such as hiragana and kanji (あ) |
| U+10000–U+10FFFF | 4 bytes | Many emoji (😀), some kanji |
The advantages of UTF-8 are that it is backward compatible with ASCII (ASCII characters stay one byte) and that it has no byte-order (endianness) issues. On the other hand, because the number of bytes per character is not constant, you cannot assume that the character count equals the byte count; the two match only for pure ASCII text.
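The range table translates directly into code. The sketch below computes the UTF-8 size of a string from its code points and compares the result with what the standard TextEncoder produces. The names utf8ByteLength and utf8Size are hypothetical, and the input is assumed to be well-formed (no unpaired surrogates).

```javascript
// UTF-8 byte length of a single code point, following the range table.
function utf8ByteLength(cp) {
  if (cp <= 0x7f) return 1;   // U+0000 to U+007F
  if (cp <= 0x7ff) return 2;  // U+0080 to U+07FF
  if (cp <= 0xffff) return 3; // U+0800 to U+FFFF
  return 4;                   // U+10000 to U+10FFFF
}

// Sum over code points; for...of iterates a string by code point.
function utf8Size(str) {
  let total = 0;
  for (const ch of str) total += utf8ByteLength(ch.codePointAt(0));
  return total;
}

const sample = "Aあ😀";
console.log(utf8Size(sample));                        // 8 (1 + 3 + 4)
console.log(new TextEncoder().encode(sample).length); // 8
```

In real code, TextEncoder alone is enough; the hand-written version is here only to show where the numbers come from.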
4. Code units vs code points vs grapheme clusters
This is the part most often misunderstood in programming. JavaScript strings are represented internally as UTF-16, and str.length returns the number of UTF-16 code units. It is not necessarily the number of code points or graphemes.
Surrogate pairs (UTF-16's supplementary-plane representation)
UTF-16 represents BMP characters with one code unit (2 bytes), but characters in the supplementary planes (U+10000 and above) are represented by two code units, called a surrogate pair. That is why "😀".length === 2. When you want to count by code point, use the spread syntax or for...of.
- "😀".length → 2 (number of UTF-16 code units)
- [..."😀"].length → 1 (number of code points)
Combining characters, ZWJ, and grapheme clusters
Furthermore, several code points can combine into a single glyph. Examples include combining characters (e.g., representing é as e + combining accent U+0301) and family emoji that join emoji with a ZWJ (zero-width joiner, U+200D). To count these as "one character a person sees," use Intl.Segmenter.
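```javascript
// Count by grapheme
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("😀")].length; // 1
// Family emoji written with escapes so the invisible ZWJs (U+200D) survive copy and paste
[...seg.segment("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}")].length; // 1
```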
The family emoji is 👨 + ZWJ + 👩 + ZWJ + 👧: five code points (its UTF-16 .length is 8), yet it appears as one grapheme. For character limits and validation on social platforms, which unit you count by has a big impact on the user experience.
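For a character limit, the practical consequence is that text should be cut at grapheme boundaries rather than at an arbitrary code unit. Below is a minimal sketch of such a cut; truncateGraphemes is a hypothetical helper, and the "en" locale is an assumption.

```javascript
// Keep at most `limit` graphemes, never cutting inside a cluster.
// truncateGraphemes is a hypothetical helper name used for illustration.
function truncateGraphemes(str, limit) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === limit) break;
    out += segment;
    count += 1;
  }
  return out;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man, ZWJ, woman, ZWJ, girl
const text = family + "e\u0301" + "a";                    // family emoji, é, a

console.log(truncateGraphemes(text, 2) === family + "e\u0301"); // true
console.log(text.slice(0, 2) === "\u{1F468}");                  // true: slice() cuts the family apart
```

The last line shows the failure mode of slice(): it works on code units, so "the first two characters" turns out to be only the first member of the family.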
5. Verifying with worked examples
Here are representative strings counted in all four ways (the UTF-8 byte counts and the JavaScript values have been verified by actually running the code).
| String | Graphemes | Code points | UTF-16 length | UTF-8 bytes |
|---|---|---|---|---|
| a | 1 | 1 | 1 | 1 |
| あ | 1 | 1 | 1 | 3 |
| 漢 | 1 | 1 | 1 | 3 |
| 😀 | 1 | 1 | 2 | 4 |
| é (e+U+0301) | 1 | 2 | 2 | 3 |
| 👨‍👩‍👧 (👨+ZWJ+👩+ZWJ+👧) | 1 | 5 | 8 | 18 |
For example, "あ" is U+3042, which falls in the 3-byte range, so it is 1 character = 3 bytes. "😀" is U+1F600 in a supplementary plane, so it is 1 grapheme and 1 code point, but its .length is 2 and it takes 4 bytes in UTF-8. The family emoji is three emoji plus two ZWJs, five code points in total. The three emoji take 4 bytes × 3 = 12, and the two ZWJs (U+200D) take 3 bytes × 2 = 6, for a total of 18 bytes.
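The arithmetic for the family emoji can be reproduced by listing each code point with its own cost. This is a sketch under the same assumptions as before (a runtime with TextEncoder); breakdown is a hypothetical helper name, and the string is written with escapes so the ZWJs are explicit.

```javascript
// List each code point of a string with its UTF-16 and UTF-8 cost.
// breakdown is a hypothetical helper name used for illustration.
function breakdown(str) {
  const encoder = new TextEncoder();
  return [...str].map((ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    utf16Units: ch.length,
    utf8Bytes: encoder.encode(ch).length,
  }));
}

const rows = breakdown("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}");
for (const row of rows) console.log(row.codePoint, row.utf16Units, row.utf8Bytes);
// U+1F468 2 4
// U+200D 1 3
// U+1F469 2 4
// U+200D 1 3
// U+1F467 2 4

const totalUnits = rows.reduce((sum, r) => sum + r.utf16Units, 0);
const totalBytes = rows.reduce((sum, r) => sum + r.utf8Bytes, 0);
console.log(rows.length, totalUnits, totalBytes); // 5 8 18
```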
6. Line breaks, whitespace, and when you need byte counts
In practice, it is not only "what counts as one character" but also how you treat whitespace and line breaks that matters.
- Line breaks: depending on the environment, a line break is LF (1 byte) or CRLF (2 bytes). The result changes depending on whether you include line breaks in the character count, and on whether you count CRLF as 2 bytes.
- Whitespace: spaces, full-width spaces (U+3000), and tabs are all "characters." Define how to treat them when, for example, counting against a manuscript-style limit.
- Manuscript counts: Japanese "400-character" manuscript counts are typically based on graphemes, and the treatment of whitespace and line breaks should be fixed by a rule and applied consistently.
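One way to keep such a rule consistent is to encode it as explicit options on the counting function. The sketch below is one possible shape, not a standard: countWithRule and its option names are hypothetical, CRLF is normalized to a single line break first, and "whitespace" is taken to mean space, tab, and the full-width space only.

```javascript
// Count graphemes under an explicit rule for line breaks and whitespace.
// countWithRule and its option names are hypothetical.
function countWithRule(str, { countLineBreaks = true, countWhitespace = true } = {}) {
  let text = str.replace(/\r\n/g, "\n"); // treat CRLF as one line break
  if (!countLineBreaks) text = text.replace(/\n/g, "");
  if (!countWhitespace) text = text.replace(/[ \t\u3000]/g, ""); // space, tab, full-width space
  const segmenter = new Intl.Segmenter("ja", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const draft = "あ い\r\nう\u3000え"; // space, CRLF, and a full-width space

console.log(countWithRule(draft));                             // 7
console.log(countWithRule(draft, { countLineBreaks: false })); // 6
console.log(countWithRule(draft, { countLineBreaks: false, countWhitespace: false })); // 4
console.log(new TextEncoder().encode("\r\n").length);          // 2 (CRLF is still 2 bytes on disk)
```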
On the other hand, the situations where byte counts are required are equally clear.
| Situation | Unit to count | Why |
|---|---|---|
| Character limit on an input field (UI) | Graphemes | To match a person's "one character" by sight |
| Database column length (VARCHAR, etc.) | Bytes or characters | The unit of the limit depends on the DB and character set |
| Data transfer / file size | Bytes | It is the amount actually transferred and stored |
| Input to crypto / hashing | Bytes | The processing target is always a byte sequence |
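When the limit is in bytes, for example a byte-sized column or a payload cap, a plain slice() can cut a multi-byte character in half once the text is encoded. The following sketch trims a string to a UTF-8 byte budget while cutting only between code points. truncateToBytes is a hypothetical helper, and whether your database counts bytes or characters is something to confirm in its documentation.

```javascript
// Trim a string so its UTF-8 encoding fits in maxBytes,
// cutting only between code points.
// truncateToBytes is a hypothetical helper name used for illustration.
function truncateToBytes(str, maxBytes) {
  const encoder = new TextEncoder();
  let out = "";
  let used = 0;
  for (const ch of str) { // for...of walks the string by code point
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    out += ch;
    used += size;
  }
  return out;
}

console.log(truncateToBytes("あいう", 7)); // あい (6 bytes; う would bring it to 9)
console.log(truncateToBytes("a😀b", 4));   // a (the emoji would bring it to 5)
```

This keeps every code point intact but can still split a grapheme cluster, for instance leaving a dangling ZWJ. If that matters, iterate over Intl.Segmenter segments instead of code points.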
Frequently Asked Questions (FAQ)
Why is an emoji sometimes counted as two characters?
JavaScript's str.length returns the number of UTF-16 code units. A character in the supplementary planes (U+10000 and above), such as "😀", is represented by two code units (a surrogate pair), so length counts it as 2. [...str].length counts code points instead, which gives 1 for "😀" but still more than 1 for combining sequences and ZWJ emoji. To match the way people count characters by sight (graphemes), use Intl.Segmenter.
What is the difference between a character count and a byte count?
A character count is how many characters there are; a byte count is how many bytes are needed to store or transmit them. In UTF-8, ASCII is 1 byte, most Japanese characters are 3 bytes, and many emoji are 4 bytes, so the byte count varies from character to character. Byte counts matter for database column sizing and estimating data transfer.
What is a grapheme cluster?
It is the unit a person perceives as a single character. Combining characters or a ZWJ (zero-width joiner) can make several code points appear as one glyph; for example "👨👩👧" is made of several code points but is a single grapheme. In JavaScript you can split text into graphemes with Intl.Segmenter.