
Efficiently Removing Duplicate Lines in Bash


Removing duplicate lines from a text file is a common task in Bash scripting. This article explores two efficient methods: using sort and uniq, and leveraging the power of awk.


Using sort and uniq

This approach combines two fundamental Unix utilities for a straightforward solution. sort arranges lines alphabetically, a prerequisite for uniq, which then eliminates consecutive duplicates. Note that the output is sorted, so the original order of the lines is not preserved.

Here’s the command:


sort file.txt | uniq > file_unique.txt

This pipes the sorted output of file.txt to uniq, saving the unique lines to file_unique.txt. The original file remains untouched.
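
As a convenient shorthand, sort also provides a -u option that removes duplicates in a single step, producing the same sorted output:


sort -u file.txt > file_unique.txt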

Example:

If file.txt contains:


apple
banana
apple
orange
banana
grape

file_unique.txt will contain:


apple
banana
grape
orange

Using the awk Command

awk offers a more flexible and powerful solution, particularly useful when preserving the original order of lines is crucial. It employs an associative array to track encountered lines.

The command is remarkably concise:


awk '!seen[$0]++' file.txt > file_unique.txt

Let’s break it down:

  • $0 represents the entire current line.
  • seen[$0] accesses an element in the seen array, using the line as the key.
  • ++ post-increments the value (initially 0).
  • ! negates the result, so the line is printed only the first time it is encountered, while seen[$0] is still 0 (an expanded equivalent is shown below).
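
For readers who prefer the explicit form, the one-liner can be written as an equivalent block:


awk '{ if (seen[$0] == 0) print; seen[$0]++ }' file.txt > file_unique.txt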

This method maintains the original order of lines.

Example:

Using the same file.txt, the output in file_unique.txt will be:


apple
banana
orange
grape
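
Because awk gives full control over the array key, the same pattern adapts to more specific needs. For example, duplicates could be matched case-insensitively with tolower, or detected on a single field rather than the whole line (the comma-delimited data.csv below is a hypothetical example):


awk '!seen[tolower($0)]++' file.txt > file_unique.txt
awk -F',' '!seen[$1]++' data.csv > data_unique.csv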

Conclusion:

Both methods effectively remove duplicate lines. sort | uniq is simpler when sorted output is acceptable, while awk offers greater flexibility and control, especially for preserving the original order or handling more intricate duplicate removal needs.
