Removing duplicate lines from a text file is a common task in Bash scripting. This article explores two efficient methods: using sort and uniq, and leveraging the power of awk.
Using sort and uniq
This approach combines two fundamental Unix utilities for a straightforward solution. sort arranges lines alphabetically, a prerequisite for uniq, which then eliminates consecutive duplicates. Note that because the input is sorted first, the original order of the lines is not preserved.
Here’s the command:
sort file.txt | uniq > file_unique.txt
This pipes the sorted output of file.txt to uniq, saving the unique lines to file_unique.txt. The original file remains untouched.
Example:
If file.txt contains:
apple
banana
apple
orange
banana
grape
file_unique.txt will contain:
apple
banana
grape
orange
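As a side note, sort can also perform the deduplication on its own via its -u flag, producing the same result as the pipeline above in a single command:
sort -u file.txt > file_unique.txt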
Using the awk Command
awk offers a more flexible and powerful solution, particularly useful when preserving the original order of lines is crucial. It employs an associative array to track encountered lines.
The command is remarkably concise:
awk '!seen[$0]++' file.txt > file_unique.txt
Let’s break it down:
$0 represents the entire current line.
seen[$0] accesses an element in the seen array, using the line as the key.
++ post-increments the value (initially 0).
! negates the result; the line is printed only if it's encountered for the first time (when seen[$0] is 0).
This method maintains the original order of lines.
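If the condensed expression feels cryptic, the same logic can be spelled out explicitly. The sketch below is functionally equivalent to the one-liner above, just more verbose:
awk '{ if (seen[$0] == 0) print $0; seen[$0]++ }' file.txt > file_unique.txt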
Example:
Using the same file.txt, the output in file_unique.txt will be:
apple
banana
orange
grape
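Both approaches write the result to a new file. If the goal is to replace file.txt itself, one common pattern is to write to a temporary file and move it over the original only if the command succeeds (the .tmp name below is just an illustrative choice):
awk '!seen[$0]++' file.txt > file.txt.tmp && mv file.txt.tmp file.txt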
Conclusion:
Both methods effectively remove duplicate lines. sort | uniq is simpler for basic scenarios, while awk provides superior flexibility and control, especially for preserving original order or handling more intricate duplicate removal needs.
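As one illustration of that flexibility, keying the array on a field rather than the whole line removes duplicates based on that field alone. For instance, assuming a whitespace-separated file where the first column serves as the key, the following keeps only the first line seen for each distinct value in column one:
awk '!seen[$1]++' file.txt > file_unique.txt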