Counting Characters and UTF-8 — Characters, Bytes, and Graphemes

"How many characters is this text?" is a deceptively simple question that has no single answer. For the same text, the result changes depending on whether you count bytes, code units, code points, or graphemes (the characters you see). This article walks through the basics of Unicode and UTF-8, all the way to the common programming pitfall of "an emoji counted as two characters," with verified worked examples.

The bottom line up front: the "one character" a person sees = a grapheme cluster, the unit a program counts internally = a code unit / code point, and the amount needed to store or send it = bytes. Remembering that these three are different things will save you from a lot of confusion.

1. The "character count" depends on how you count

For example, "a" is one character and, in UTF-8, one byte however you count it. But "あ" is one character yet three bytes in UTF-8, and although "😀" looks like a single character, JavaScript's "😀".length returns 2. None of these is wrong; they are simply counting different units. The starting point is to distinguish these four ways of counting.

Byte count: the actual amount of data after encoding. It changes with the encoding (such as UTF-8).
Code unit count: the number of minimal units of an encoding form. JavaScript's str.length is the number of UTF-16 code units.
Code point count: the number of Unicode code points (U+XXXX).
Grapheme cluster count: the number of units a person perceives as "one character."
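
As a minimal sketch of the four units side by side, the helper below counts one string in each of them. The name countUnits and its locale parameter are hypothetical, chosen only for this illustration; it assumes a runtime that provides TextEncoder and Intl.Segmenter (current browsers and recent Node.js versions).

```javascript
// Hypothetical helper: counts one string in the four units listed above.
function countUnits(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return {
    bytes: new TextEncoder().encode(str).length,   // UTF-8 bytes
    codeUnits: str.length,                         // UTF-16 code units
    codePoints: [...str].length,                   // Unicode code points
    graphemes: [...segmenter.segment(str)].length, // grapheme clusters
  };
}

console.log(countUnits("a"));  // { bytes: 1, codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(countUnits("あ")); // { bytes: 3, codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(countUnits("😀")); // { bytes: 4, codeUnits: 2, codePoints: 1, graphemes: 1 }
```

Returning all four values from one call makes it harder to mix up units by accident when comparing results.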

2. Unicode basics — code points and planes

Unicode is a standard that assigns a unique number to every character in the world. That number is called a code point and is written as a hexadecimal value following U+. For example, "A" is U+0041, "あ" is U+3042, and "😀" is U+1F600.

Code points range from U+0000 to U+10FFFF and are divided into 17 "planes" of 65,536 (0x10000) code points each.

BMP (Basic Multilingual Plane): U+0000–U+FFFF. Contains everyday characters such as Latin letters, hiragana, and most kanji.
Supplementary Planes: U+10000–U+10FFFF. Many emoji, some kanji, historic scripts, and so on.

Whether a character "fits in the BMP or is in a supplementary plane" is an important boundary that directly affects UTF-16 surrogate pairs and the str.length discrepancy discussed below.
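
To make the U+ notation and the plane boundary concrete, here is a small sketch that reads the first code point of a string and works out which plane it belongs to. The function name describeCodePoint is an assumption made for this example; codePointAt is the standard String method.

```javascript
// Hypothetical helper: describes the first code point of a string.
function describeCodePoint(ch) {
  const cp = ch.codePointAt(0);
  // U+ notation uses at least four uppercase hexadecimal digits.
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  // Each plane holds 0x10000 code points, so the plane is the value above bit 16.
  const plane = cp >> 16;
  return { notation: `U+${hex}`, plane, inBmp: plane === 0 };
}

console.log(describeCodePoint("A"));  // { notation: 'U+0041', plane: 0, inBmp: true }
console.log(describeCodePoint("あ")); // { notation: 'U+3042', plane: 0, inBmp: true }
console.log(describeCodePoint("😀")); // { notation: 'U+1F600', plane: 1, inBmp: false }
```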

3. UTF-8 — a variable-length encoding

UTF-8 encodes a code point into a byte sequence using a variable length of 1 to 4 bytes, and it is the de facto standard of today's web. The number of bytes is determined by the size of the code point.

Code point range	UTF-8 byte length	Examples
`U+0000`–`U+007F`	1 byte	ASCII (`A`, digits, symbols)
`U+0080`–`U+07FF`	2 bytes	Latin Extended, Greek, Cyrillic, etc.
`U+0800`–`U+FFFF`	3 bytes	Most Japanese such as hiragana and kanji (`あ`)
`U+10000`–`U+10FFFF`	4 bytes	Many emoji (`😀`), some kanji

The advantages of UTF-8 are that it is backward compatible with ASCII (ASCII characters stay one byte) and that it has no byte-order (endianness) issues. On the other hand, because the number of bytes per character is not constant, the character count and the byte count match only for pure-ASCII text; for anything else you cannot assume "character count = byte count."
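
The range table above translates almost directly into code. The sketch below derives the UTF-8 length from each code point and compares the result with what TextEncoder actually produces. The function names are hypothetical; in real code, encoding with TextEncoder and reading the length is usually enough.

```javascript
// Hypothetical helper: UTF-8 length of one code point, following the range table.
function utf8LengthOfCodePoint(cp) {
  if (cp <= 0x7f) return 1;   // U+0000 to U+007F
  if (cp <= 0x7ff) return 2;  // U+0080 to U+07FF
  if (cp <= 0xffff) return 3; // U+0800 to U+FFFF
  return 4;                   // U+10000 to U+10FFFF
}

// Sums the per-code-point lengths; for...of iterates by code point, not code unit.
function utf8Length(str) {
  let total = 0;
  for (const ch of str) total += utf8LengthOfCodePoint(ch.codePointAt(0));
  return total;
}

const sample = "Aあ😀";
console.log(utf8Length(sample));                       // 8 (1 + 3 + 4)
console.log(new TextEncoder().encode(sample).length);  // 8
```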

4. Code units vs code points vs grapheme clusters

This is the part most often misunderstood in programming. JavaScript strings are represented internally as UTF-16, and str.length returns the number of UTF-16 code units. It is not necessarily the number of code points or graphemes.

Surrogate pairs (UTF-16's supplementary-plane representation)

UTF-16 represents BMP characters with one code unit (2 bytes), but characters in the supplementary planes (U+10000 and above) are represented by two code units — a surrogate pair. That is why "😀".length === 2. When you want to count by code point, use the spread syntax or for...of.

"😀".length → 2 (number of UTF-16 code units)
[..."😀"].length → 1 (number of code points)
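
To see where the two code units come from, the following sketch computes a surrogate pair by hand using the standard UTF-16 arithmetic and compares it with what the string actually stores. toSurrogatePair is a hypothetical name used only here.

```javascript
// Hypothetical helper: splits a supplementary-plane code point into its
// UTF-16 high and low surrogates.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError("not a supplementary-plane code point");
  }
  const offset = cp - 0x10000;              // 20-bit value
  const high = 0xd800 + (offset >> 10);     // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);    // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// The same two values are what charCodeAt reports for the string itself.
console.log("😀".charCodeAt(0) === high, "😀".charCodeAt(1) === low); // true true
```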

Combining characters, ZWJ, and grapheme clusters

Furthermore, several code points can combine into a single glyph. Examples include combining characters (e.g., representing é as e + combining accent U+0301) and family emoji that join emoji with a ZWJ (zero-width joiner, U+200D). To count these as "one character a person sees," use Intl.Segmenter.

// Count by grapheme
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("😀")].length;          // 1
[...seg.segment("👨‍👩‍👧")].length; // 1

👨‍👩‍👧 (a family emoji) is made of 👨 + ZWJ + 👩 + ZWJ + 👧 — five code points (in UTF-16 its .length is 8) — yet it appears as one grapheme. For character limits and validation on social platforms, which unit you count by has a big impact on the user experience.
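
One way to inspect what a grapheme is made of is to segment the text and then list the code points inside each segment. This is a sketch under the assumption that Intl.Segmenter is available; the helper name codePointsPerGrapheme is hypothetical, and the strings are written with escapes so that the invisible ZWJ is easy to spot.

```javascript
// Hypothetical helper: for each grapheme, list its code points in U+ notation.
function codePointsPerGrapheme(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(str)].map(({ segment }) =>
    [...segment].map(
      (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    )
  );
}

// e followed by the combining acute accent: one grapheme, two code points.
console.log(codePointsPerGrapheme("e\u0301"));
// [ [ 'U+0065', 'U+0301' ] ]

// Man, ZWJ, woman, ZWJ, girl: one grapheme, five code points.
console.log(codePointsPerGrapheme("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"));
// [ [ 'U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467' ] ]
```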

5. Verifying with worked examples

Here are representative characters lined up under the four ways of counting. The UTF-8 byte counts and the JavaScript values were checked by actually running the code.

String	Graphemes	Code points	UTF-16 `length`	UTF-8 bytes
`a`	1	1	1	1
`あ`	1	1	1	3
`漢`	1	1	1	3
`😀`	1	1	2	4
`é` (`e`+U+0301)	1	2	2	3
`👨‍👩‍👧`	1	5	8	18

For example, "あ" is U+3042, which falls in the 3-byte range, so it is 1 character = 3 bytes. "😀" is U+1F600 in a supplementary plane, so it is 1 grapheme and 1 code point but its length is 2 and it takes 4 bytes in UTF-8. "👨‍👩‍👧" is three emoji plus two ZWJs — five code points in total. Each emoji is 4 bytes × 3 = 12, and each ZWJ (U+200D) is 3 bytes × 2 = 6, for a total of 18 bytes.
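
The arithmetic for the family emoji can be reproduced with a per-code-point breakdown, as in this short sketch. The variable names are illustrative only.

```javascript
// The family emoji, written with escapes: man, ZWJ, woman, ZWJ, girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const encoder = new TextEncoder();

// One row per code point, with its size in UTF-16 code units and UTF-8 bytes.
const breakdown = [...family].map((ch) => ({
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase(),
  utf16Units: ch.length,
  utf8Bytes: encoder.encode(ch).length,
}));

const totalUnits = breakdown.reduce((sum, row) => sum + row.utf16Units, 0);
const totalBytes = breakdown.reduce((sum, row) => sum + row.utf8Bytes, 0);

console.table(breakdown);
console.log(breakdown.length, totalUnits, totalBytes); // 5 8 18
```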

6. Line breaks, whitespace, and when you need byte counts

In practice, it is not only "what counts as one character" but also how you treat whitespace and line breaks that matters.

Line breaks: depending on the environment, a line break is LF (1 byte) or CRLF (2 bytes). Decide up front whether line breaks are included in the character count, and remember that the same text takes one more byte per line break when it uses CRLF.
Whitespace: spaces, full-width spaces (U+3000), and tabs are all "characters." Define how to treat them when, for example, counting against a manuscript-style limit.
Manuscript counts: Japanese "400-character" manuscript counts are typically based on graphemes, and the treatment of whitespace and line breaks should be fixed by a rule and applied consistently.
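
A counting rule like the ones above is easiest to apply consistently when it is written down as options. The sketch below is one possible shape, not a standard: the function name, the option names, and their defaults are all assumptions made for this example. It normalizes CRLF to LF first so the result does not depend on the environment.

```javascript
// Hypothetical helper: counts graphemes under an explicit whitespace rule.
function countForManuscript(
  text,
  { countLineBreaks = false, countWhitespace = true } = {}
) {
  // Treat CRLF and a lone CR as a single LF so every line break counts the same.
  let normalized = text.replace(/\r\n?/g, "\n");
  if (!countLineBreaks) normalized = normalized.replace(/\n/g, "");
  // Space, tab, and full-width space (U+3000).
  if (!countWhitespace) normalized = normalized.replace(/[ \t\u3000]/g, "");
  const segmenter = new Intl.Segmenter("ja", { granularity: "grapheme" });
  return [...segmenter.segment(normalized)].length;
}

const text = "あ い\r\nう";
console.log(countForManuscript(text));                             // 4
console.log(countForManuscript(text, { countLineBreaks: true }));  // 5
console.log(countForManuscript(text, { countWhitespace: false })); // 3
```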

On the other hand, the situations where byte counts are required are equally clear.

Situation	Unit to count	Why
Character limit on an input field (UI)	Graphemes	To match a person's "one character" by sight
Database column length (`VARCHAR`, etc.)	Bytes or characters	The unit of the limit depends on the DB and character set
Data transfer / file size	Bytes	It is the amount actually transferred and stored
Input to crypto / hashing	Bytes	The processing target is always a byte sequence
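
When a limit is expressed in bytes but the text should still look intact to a person, one approach is to cut on grapheme boundaries while adding up UTF-8 bytes. The following is a minimal sketch under that assumption; truncateToBytes is a hypothetical name, and a real database may impose its own rules depending on its character set.

```javascript
// Hypothetical helper: keeps whole graphemes while staying within a UTF-8 byte budget.
function truncateToBytes(str, maxBytes, locale = "en") {
  const encoder = new TextEncoder();
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let result = "";
  let used = 0;
  for (const { segment } of segmenter.segment(str)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break; // the next grapheme would not fit
    result += segment;
    used += size;
  }
  return result;
}

console.log(truncateToBytes("aあ😀", 5)); // "aあ" (1 + 3 bytes; the 4-byte emoji does not fit)
console.log(truncateToBytes("aあ😀", 8)); // "aあ😀"
```

Cutting a raw byte array at the limit instead could split a multi-byte character in the middle and leave invalid UTF-8 at the end.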


Frequently Asked Questions (FAQ)

Why is an emoji sometimes counted as two characters?

JavaScript's str.length returns the number of UTF-16 code units. A character in the supplementary planes (U+10000 and above), such as "😀", is represented by two code units (a surrogate pair), so length counts it as 2. To count code points, use [...str].length; to match the way people count characters by sight (graphemes), use Intl.Segmenter, because [...str].length still counts a combining sequence or a ZWJ emoji as several.

What is the difference between a character count and a byte count?

A character count is how many characters there are; a byte count is how many bytes are needed to store or transmit them. In UTF-8, ASCII is 1 byte, most Japanese characters are 3 bytes, and most emoji are 4 bytes, so the byte count varies from one character to the next. Byte counts matter for database column sizing and estimating data transfer.

What is a grapheme cluster?

It is the unit a person perceives as a single character. Combining characters or a ZWJ (zero-width joiner) can make several code points appear as one glyph; for example "👨‍👩‍👧" is made of several code points but is a single grapheme. In JavaScript you can split text into graphemes with Intl.Segmenter.
