Text Processing Tips — Master Dedupe, Sort, and Trim

Reordering a list, removing duplicate lines, stripping extra whitespace — these "tidy up the text" tasks come up everywhere: organizing data, investigating logs, proofreading drafts. Each one is simple on its own, but if you get the criteria wrong — "what counts as the same?" and "in what order should it sort?" — the result changes. This article lays out the basic line-operation techniques along with the points where people commonly stumble.

The bottom line first: processing is most stable when you follow the order "trim to normalize the basis → sort/dedupe → finishing touches." If whitespace or invisible characters remain, lines that look identical are judged "different," and dedupe and sort will not work as expected.

1. What "text processing" means in daily work

Text processing means taking a collection of lines separated by newlines and applying operations such as reordering, deduplication, and whitespace removal to put it into a workable shape. For example, you need it in situations like these.

Removing duplicates from a list pasted out of a spreadsheet or form to get a unique list.
Reordering email addresses or keywords into A–Z order to make them easy to scan.
Stripping blank lines and trailing whitespace from logs or pasted drafts so diffs are easier to take.
Shuffling lines for a drawing or random display, or reversing them for a quick check.

The common theme in all of these is "how do you handle the line as a unit?" Let us start with reordering.

2. Sorting lines — ascending/descending and "lexical vs numeric"

Sorting comes in ascending (smallest first) and descending (largest first). Even more important is the difference between "sorting as characters (lexical)" and "sorting as numbers (numeric)."

Lexical order compares character by character from the front using character codes, so lines that contain numbers tend to come out in a counterintuitive order.

Input	Lexical (ascending)	Numeric (ascending)
2	10	2
10	100	10
100	2	100

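In lexical order, "10" and "100" come before "2." The leading character 1 is judged smaller than 2, and the number of digits is not taken into account. When the lines are numeric values such as quantities or amounts, choose numeric order. Multi-part version strings such as 1.10.2 are not single numbers, so plain numeric order does not handle them reliably; they need a version-aware comparison (GNU sort provides -V for this).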
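The difference is easy to reproduce. The following is a minimal Python sketch that sorts the same three lines both ways. It assumes every line holds a plain integer, since int() raises an error on anything else.

```python
lines = ["2", "10", "100"]

# Lexical: strings are compared character by character by code point.
lexical = sorted(lines)

# Numeric: each line is converted to an integer before comparison.
# Assumption: every line is a valid integer; int() raises ValueError otherwise.
numeric = sorted(lines, key=int)

print(lexical)  # ['10', '100', '2']
print(numeric)  # ['2', '10', '100']
```

Passing reverse=True to sorted() gives the descending variant of either order.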

Watch the locale, too. Even "lexical order" varies depending on whether it is a simple character-code order or an order aligned with human language sense. For instance, the handling of upper/lower case (how to order A and a) and the ordering of accented characters or Japanese kana and kanji differ depending on whether you use a locale-aware comparison (in JavaScript, localeCompare). It is not noticeable in simple alphabet-only cases, but for multilingual lists you should be conscious of the comparison method.
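As a small illustration, the Python sketch below contrasts plain code-point order with a case-insensitive key. It is only a stand-in for a real locale-aware comparison: str.casefold ignores case but applies no language-specific collation rules, which would need something like the locale module or an ICU-based library.

```python
words = ["apple", "Banana", "cherry"]

# Code-point order: every uppercase ASCII letter sorts before every lowercase one.
by_code_point = sorted(words)

# Case-insensitive order: compare casefolded copies, keep the original spelling.
ignoring_case = sorted(words, key=str.casefold)

print(by_code_point)  # ['Banana', 'apple', 'cherry']
print(ignoring_case)  # ['apple', 'Banana', 'cherry']
```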

3. Removing duplicate lines — the "exact match" criterion and order preservation

Deduplication keeps one line of identical content and removes the rest. The key is "what counts as the same line." Most tools use an exact match of the entire line, so even lines that look similar will remain if they differ as follows.

Differences in upper/lower case (Apple vs apple).
Presence of leading or trailing whitespace (foo vs foo ).
Full-width vs half-width differences (ABC vs ＡＢＣ).

To remove them as intended, it is reliable to first normalize the basis of comparison — by trimming whitespace with trim, unifying case, and so on — and only then dedupe.
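One way to normalize the basis of comparison is sketched below in Python. The function name comparison_key is hypothetical, and the choice of NFKC normalization is an assumption: it folds full-width letters to their half-width forms, but it also rewrites other compatibility characters, so it may be too aggressive for some data.

```python
import unicodedata


def comparison_key(line: str) -> str:
    """Return a normalized form of the line, used only for comparison."""
    # NFKC folds full-width letters and the full-width space to ASCII forms.
    folded = unicodedata.normalize("NFKC", line)
    # strip() removes leading/trailing whitespace; casefold() ignores case.
    return folded.strip().casefold()


print(comparison_key("Apple") == comparison_key("apple"))  # True
print(comparison_key("foo") == comparison_key("foo "))     # True
print(comparison_key("ABC") == comparison_key("ＡＢＣ"))     # True
```

Using the key only for comparison, rather than overwriting the lines, lets the output keep the original spelling of whichever line is retained.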

The other axis is how order is handled. There are broadly two approaches to deduplication.

Preserve order of appearance: keep the first occurrence and remove only later ones. Use this when you do not want to disturb the original arrangement.
Sort, then remove adjacent duplicates: reorder first, then collapse consecutive identical lines. The command-line sort | uniq works this way (see below).

uniq only removes "consecutive duplicates." Duplicates that are far apart remain, so the standard idiom is to bring them next to each other with sort first and then pipe to uniq. When you want to remove all duplicates while preserving order, use the "preserve order of appearance" mode instead of this approach.
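The two approaches can be sketched in Python as follows. The function names are hypothetical; dedupe_adjacent imitates the consecutive-only behavior described for uniq and is not a reimplementation of the command.

```python
from itertools import groupby


def dedupe_keep_first(lines):
    """Remove every repeated line, keeping the first occurrence in place."""
    # dict preserves insertion order, so keys come out in order of appearance.
    return list(dict.fromkeys(lines))


def dedupe_adjacent(lines):
    """Collapse only runs of consecutive identical lines."""
    return [line for line, _run in groupby(lines)]


lines = ["b", "a", "b", "b", "a"]
print(dedupe_keep_first(lines))        # ['b', 'a']
print(dedupe_adjacent(lines))          # ['b', 'a', 'b', 'a']
print(dedupe_adjacent(sorted(lines)))  # ['a', 'b']
```

The second call shows why sorting comes first in the adjacent approach: without it, the separated duplicates survive.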

4. Stripping blank lines and surrounding whitespace — trim and invisible characters

Trim is the operation that removes whitespace from the start and end of each line. Extra whitespace that sneaks in during copy and paste is a classic cause of dedupe and sort producing wrong results even though you cannot see it. It helps to include not only half-width spaces but also the following invisible characters in what trim targets.

Tab (\t): easily introduced when pasting from spreadsheet software.
Full-width space (U+3000): can appear unintentionally with Japanese input.
Carriage return (CR, \r): can remain at the end of a line due to differences in newline codes between Windows and other environments.

Removing blank lines is also a staple of text processing. Blank lines inserted as paragraph breaks, or a long run of blank lines at the end, become noise for line counting and diff comparison. If you trim each line first and then remove blank lines, lines that contain only whitespace are caught as well, and only the lines with actual content remain.
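A minimal Python sketch of "trim, then remove blank lines" follows. The function name is hypothetical. It relies on str.strip() with no argument, which removes all Unicode whitespace at both ends; characters that are not classified as whitespace would need separate handling.

```python
def trim_and_drop_blank(text: str) -> list[str]:
    """Trim each line, then discard the lines that end up empty."""
    # With no argument, str.strip() removes all Unicode whitespace at both
    # ends, which covers the tab, the full-width space (U+3000), and a
    # trailing CR left over from Windows line endings.
    trimmed = [line.strip() for line in text.split("\n")]
    return [line for line in trimmed if line]


raw = "\talpha\r\n\u3000beta  \r\n\r\n   \r\ngamma\r\n\n"
print(trim_and_drop_blank(raw))  # ['alpha', 'beta', 'gamma']
```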

5. Reversing, shuffling, and other reorderings

Beyond sorting, there are reorderings suited to different purposes.

Reverse: flip the line order top to bottom as is. Handy for quickly switching logs between newest-first and oldest-first, or turning a sorted result into descending order.
Shuffle: reorder lines randomly. Used for ordering candidates in a drawing, or for removing bias from test data. A fair shuffle uses an unbiased algorithm such as the Fisher–Yates method.
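For reference, the Fisher–Yates method can be sketched in Python as below. The function name and the sample names are made up for illustration. Python's standard random.shuffle already does this job; the loop is written out only to show the idea. The default generator is not cryptographically secure, so a drawing with real stakes might pass random.SystemRandom() as rng instead.

```python
import random


def fisher_yates_shuffle(lines, rng=None):
    """Return a shuffled copy of lines using the Fisher-Yates algorithm."""
    if rng is None:
        rng = random.Random()
    result = list(lines)  # copy, so the caller's list is left untouched
    # Walk from the last index down and swap each position with a randomly
    # chosen position at or before it. Every permutation is equally likely,
    # provided the random source is uniform.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result


candidates = ["Aoki", "Baba", "Chiba", "Doi"]
print(fisher_yates_shuffle(candidates))
```

Passing a seeded generator such as random.Random(42) makes the result repeatable, which helps when a shuffle of test data needs to be reproduced.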

These operations "leave the content unchanged and change only the arrangement." Combined with deduplication and trim, you can perform a whole chain of processing at once, such as "remove duplicates, then reverse" or "strip blank lines, then shuffle."
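Chaining can be sketched in Python by treating each operation as a function from a list of lines to a list of lines. All names here are hypothetical and the design is only one possible arrangement.

```python
def tidy(text, steps):
    """Split text into lines, apply each step in order, and join the result."""
    lines = text.split("\n")
    for step in steps:
        lines = step(lines)
    return "\n".join(lines)


def trim(lines):
    return [line.strip() for line in lines]


def drop_blank(lines):
    return [line for line in lines if line]


def dedupe(lines):
    # Order-preserving: the first occurrence of each line is kept.
    return list(dict.fromkeys(lines))


def reverse(lines):
    return lines[::-1]


raw = "beta \n\nalpha\nbeta\n  gamma\n"
print(tidy(raw, [trim, drop_blank, dedupe, reverse]))
# prints gamma, alpha, beta on separate lines
```

The order of the steps matters. With the same input, tidy(raw, [dedupe, trim, drop_blank]) keeps both "beta" lines, because "beta " and "beta" still differ at the moment dedupe runs.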

6. How it maps to the command line (sort / uniq) and browser tools

The operations covered so far have long been staples on the Unix-style command line (the terminal on macOS / Linux). Here are the representative mappings.

What you want	Command example	Notes
Ascending sort	`sort file.txt`	Lexical order. Numeric order is `sort -n`
Descending sort	`sort -r file.txt`	Sort in reverse
Sort + dedupe	`sort file.txt | uniq`	Only consecutive duplicates are removed, so `sort` is a prerequisite
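Reverse (flip lines)	`tac file.txt`	Flip the whole file top to bottom. `tac` is part of GNU coreutils; on macOS, `tail -r file.txt` is the usual equivalent
Shuffle	`shuf file.txt`	Reorder randomly. `shuf` is part of GNU coreutils and may need to be installed on macOS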

The command line is strong for large data and scripting, but it requires setting up an environment and memorizing options and pitfalls (that uniq only removes consecutive duplicates, that you must mind full-width spaces, and so on). For a quick bit of list tidying, a browser-based tool where you paste and press a button is convenient. It requires no installation, and when the tool processes the data entirely within your own browser, nothing has to leave your machine.


Frequently Asked Questions (FAQ)

How does dedupe decide what counts as the same line?

Most tools use an exact match of the entire line as the criterion. That means if there are differences in upper/lower case, leading or trailing whitespace, or full-width versus half-width characters, the lines are treated as different and are not removed. To remove duplicates as intended, it is reliable to first normalize the basis of comparison, for example by trimming surrounding whitespace or unifying case, and only then dedupe. Note also that there are two modes: one preserves the order of first appearance and removes only later occurrences, and the other sorts first and then removes consecutive duplicates.

What is the difference between lexical and numeric sorting?

Lexical (string) order compares character by character using character codes, so "2" ends up after "10": the leading "1" is compared with "2" and judged smaller. Numeric order interprets each line as a number and compares magnitudes, so 2, 10, and 100 line up as expected. Choose numeric order when the lines are numeric values such as quantities or amounts; choose lexical order when you want to sort as characters, such as ID strings or keywords. Multi-part version strings such as 1.10.2 fit neither case well and need a version-aware comparison.

What is trim?

Trim is the operation that removes whitespace characters from the start and end of each line. If you include not only half-width spaces but also tabs, full-width spaces, and stray carriage returns (CR) at the end of a line, you can resolve mismatches that are invisible to the eye. Extra whitespace and invisible characters introduced by copy and paste are a common cause of dedupe and sort producing unexpected results, so applying trim at the very start of your processing makes the later steps stable.
