Efficiently Counting Unique Lines in Linux Files

Counting unique lines in a file is a common task in Linux. This article presents two efficient command-line methods: using sort and uniq, and using awk.

Counting Unique Lines with sort and uniq

This method pipes the output of sort into uniq. Sorting places identical lines next to each other, which matters because uniq only compares adjacent lines: duplicates that are not consecutive are treated as separate entries. The -c option makes uniq prefix each output line with the number of times it occurred.
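To see why the sort step matters, the following minimal sketch runs uniq with and without sorting. The temporary file and the variable names are illustrative assumptions, not part of the original commands.

```shell
# Build a small sample file in a temporary location (illustrative data).
sample=$(mktemp)
printf '%s\n' apple banana apple orange banana apple > "$sample"

# Without sorting, no two identical lines are adjacent, so uniq removes nothing.
unsorted_count=$(uniq "$sample" | wc -l)

# After sorting, identical lines sit next to each other and collapse to one each.
sorted_count=$(sort "$sample" | uniq | wc -l)

echo "uniq alone:  $unsorted_count lines"
echo "sort | uniq: $sorted_count lines"

rm -f "$sample"
```

With this sample data, uniq alone still reports all six lines, while the sorted pipeline reports three.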

To count unique lines in file.txt:


sort file.txt | uniq -c

This displays each unique line with its count. To get the total number of unique lines, pipe the output to wc -l:


sort file.txt | uniq -c | wc -l

Example:

If file.txt contains:


apple
banana
apple
orange
banana
apple

sort file.txt | uniq -c outputs:


      3 apple
      2 banana
      1 orange

And sort file.txt | uniq -c | wc -l outputs:


3
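When only the total is needed, the per-line counts from -c are discarded by wc -l anyway. A shorter variant, sketched here with illustrative sample data and variable names, uses sort -u, which sorts and keeps a single copy of each distinct line.

```shell
# Build a small sample file in a temporary location (illustrative data).
sample=$(mktemp)
printf '%s\n' apple banana apple orange banana apple > "$sample"

# sort -u sorts the input and drops duplicate lines in one step,
# so wc -l counts one line per distinct value.
distinct_total=$(sort -u "$sample" | wc -l)
echo "distinct lines: $distinct_total"

rm -f "$sample"
```

Both pipelines should report the same total. The sort | uniq -c form is still the one to use when the individual counts are wanted as well.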

Counting Unique Lines with awk

awk offers a flexible solution, particularly useful for more complex scenarios. This method employs an associative array to track unique lines and their counts.

To count unique lines and display them with their counts:


awk '{count[$0]++} END {for (line in count) print count[line], line}' file.txt

This script increments the count for each line in the count array, using the line as the key. The END block iterates through the array, printing each line’s count and the line itself.
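Because awk does not guarantee the order in which for (line in count) visits the keys, the report can come out in any order. One possible way to make it deterministic, shown here as a sketch with illustrative sample data, is to pipe the result through sort. The extra sort step is an addition to the command above.

```shell
# Build a small sample file in a temporary location (illustrative data).
sample=$(mktemp)
printf '%s\n' apple banana apple orange banana apple > "$sample"

# Count each line in an associative array, then order the report by
# count (numeric, highest first) and by line text to break ties.
report=$(awk '{ count[$0]++ } END { for (line in count) print count[line], line }' "$sample" \
  | sort -k1,1nr -k2)

printf '%s\n' "$report"
# 3 apple
# 2 banana
# 1 orange

rm -f "$sample"
```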

To obtain only the total count of unique lines:


awk '{count[$0]++} END {print length(count)}' file.txt

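This uses length(count) to print the number of elements in the array, which equals the number of unique lines. Calling length on an array works in gawk and most current awk implementations, but it has not always been portable to older ones.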
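If the awk at hand does not accept an array argument to length, the total can be tracked with a plain counter instead. The sketch below is one such variant. The seen array, the n counter and the sample data are illustrative names, not part of the original command.

```shell
# Build a small sample file in a temporary location (illustrative data).
sample=$(mktemp)
printf '%s\n' apple banana apple orange banana apple > "$sample"

# seen[$0]++ evaluates to 0 (false) the first time a line appears, so the
# pattern !seen[$0]++ is true exactly once per distinct line and n counts those.
# Printing n + 0 forces a number, so an empty file yields 0 rather than a blank line.
distinct_total=$(awk '!seen[$0]++ { n++ } END { print n + 0 }' "$sample")
echo "distinct lines: $distinct_total"

rm -f "$sample"
```

This avoids length(count) entirely and uses only basic awk features.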

Example:

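Using the same file.txt, the first awk command reports the same counts as the sort | uniq -c method (3 for apple, 2 for banana, 1 for orange), but without the leading padding and in no guaranteed order, because awk does not define the order in which for (line in count) visits the array keys. The second awk command outputs 3, indicating three unique lines.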

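Choose the method that best suits your needs. sort and uniq are simpler for basic tasks and produce sorted output. awk reads the file in a single pass without sorting, at the cost of holding every distinct line in memory, and is easier to adapt to more complex scenarios.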
