Pandas is a powerful Python library for data manipulation and analysis. A common task involves extracting unique values from a DataFrame column and then sorting them. This article explores two efficient methods to accomplish this.
Table of Contents
- Extracting Unique Values with the unique() Method
- Extracting Unique Values with the drop_duplicates() Method
- Sorting Unique Values
Extracting Unique Values with the unique() Method
The unique() method provides a concise way to obtain unique values from a Pandas Series (a single column). For most dtypes it returns a NumPy array containing only the unique elements, in order of first appearance. The result is not sorted.
import pandas as pd
data = {'col1': ['A', 'B', 'A', 'C', 'B', 'D'],
        'col2': [1, 2, 1, 3, 2, 4]}
df = pd.DataFrame(data)
unique_values = df['col1'].unique()
print(unique_values) # Output: ['A' 'B' 'C' 'D']
This code creates a sample DataFrame and then calls unique() on the 'col1' column. The output is a NumPy array showing the unique values in order of first appearance.
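Because the sample data above happens to appear in alphabetical order, the first-appearance behavior is easy to miss. The following minimal sketch uses a hypothetical Series (values chosen only for illustration) to show that unique() keeps the order in which values first occur rather than sorting them.

```python
import pandas as pd

# Hypothetical data, deliberately not in alphabetical order
s = pd.Series(['C', 'A', 'C', 'B', 'A'])

unique_values = s.unique()

# Convert to a plain list for display; values stay in first-appearance order
print(list(unique_values))  # ['C', 'A', 'B']
```

If alphabetical order is needed, the result has to be sorted in a separate step.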
Extracting Unique Values with the drop_duplicates() Method
The drop_duplicates() method offers more flexibility, particularly when dealing with multiple columns. Its primary use is removing duplicate rows from a DataFrame, but it is also available on a Series, where it returns a new Series holding the first occurrence of each value along with its original index label.
import pandas as pd
data = {'col1': ['A', 'B', 'A', 'C', 'B', 'D'],
        'col2': [1, 2, 1, 3, 2, 4]}
df = pd.DataFrame(data)
unique_values = df['col1'].drop_duplicates().values
print(unique_values) # Output: ['A' 'B' 'C' 'D']
This example applies drop_duplicates() directly to the 'col1' Series. The .values attribute converts the result to a NumPy array; the pandas documentation recommends the to_numpy() method for this in new code. The unique values appear in order of first occurrence in the DataFrame.
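The multi-column flexibility mentioned above can be sketched as follows. Called on a DataFrame, drop_duplicates() returns the distinct rows, and its subset parameter restricts the comparison to selected columns. The frame and its values below are hypothetical, chosen so that some rows repeat.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical, rows 1 and 4 share col1 only
df = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'B'],
                   'col2': [1, 2, 1, 3, 5]})

# Distinct (col1, col2) combinations; original index labels are kept
unique_rows = df.drop_duplicates()
print(unique_rows.index.tolist())  # [0, 1, 3, 4]

# Compare on col1 only, keeping the first row seen for each value
first_per_col1 = df.drop_duplicates(subset=['col1'])
print(first_per_col1['col2'].tolist())  # [1, 2, 3]

# reset_index(drop=True) renumbers the result from 0
renumbered = first_per_col1.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Note that the index labels of the surviving rows are preserved unless they are reset explicitly, which matters if the result is later aligned with other data.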
Sorting Unique Values
Both methods above return unique values in order of first appearance, which is not necessarily sorted order. To sort, use NumPy's sort() function or Pandas' sort_values() method.
import pandas as pd
import numpy as np
data = {'col1': ['A', 'B', 'A', 'C', 'B', 'D'],
        'col2': [1, 2, 1, 3, 2, 4]}
df = pd.DataFrame(data)
# Using unique() and sort()
unique_values = np.sort(df['col1'].unique())
print(unique_values) # Output: ['A' 'B' 'C' 'D']
# Using drop_duplicates() and sort_values()
unique_values = df['col1'].drop_duplicates().sort_values().values
print(unique_values) # Output: ['A' 'B' 'C' 'D']
This showcases sorting using both approaches. np.sort() works on the NumPy array from unique(), while sort_values() is used on the Pandas Series from drop_duplicates(). Both yield a sorted array. For descending order with sort_values(), use ascending=False. np.sort() has no descending option, so reverse its result instead (for example with [::-1]).
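Since the sample column is already in alphabetical order after deduplication, the effect of sorting is not visible in the output above. As a minimal sketch, the hypothetical Series below is unsorted, and both ascending and descending results are shown. Reversing the result of np.sort() with a slice is one common way to get descending order from NumPy; it is a choice made for this illustration, not something the original example does.

```python
import numpy as np
import pandas as pd

# Hypothetical data, deliberately unsorted and with repeats
s = pd.Series([3, 1, 3, 2, 1])

# unique() gives first-appearance order (3, 1, 2); np.sort() sorts ascending
ascending = np.sort(s.unique())
print(ascending.tolist())  # [1, 2, 3]

# np.sort() only sorts ascending, so reverse the sorted array for descending
descending_np = np.sort(s.unique())[::-1]
print(descending_np.tolist())  # [3, 2, 1]

# The pandas route: sort_values() accepts ascending=False directly
descending_pd = s.drop_duplicates().sort_values(ascending=False).to_numpy()
print(descending_pd.tolist())  # [3, 2, 1]
```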
In summary, both unique() and drop_duplicates() efficiently extract unique values. unique() is the more concise option for a single column, while drop_duplicates() also handles combinations of several columns. Neither method sorts its result, so apply np.sort() or sort_values() afterward when you need a particular order.