Pandas is a powerful Python library for data manipulation and analysis. Filtering DataFrame rows based on column values is a fundamental task in data processing. This article explores various techniques to efficiently filter Pandas DataFrames, covering simple to complex scenarios.
Table of Contents
- Basic Filtering: Single Column, Single Condition
- Negation: Selecting Rows That Don’t Match a Condition
- Numerical Comparisons: Greater Than, Less Than, etc.
- Combining Conditions: AND and OR Operations
- Advanced Filtering: Multiple Columns and Complex Logic
- Efficient Filtering with isin()
Basic Filtering: Single Column, Single Condition
The simplest form of filtering involves selecting rows where a specific column matches a particular value. This is achieved using boolean indexing.
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
# Select rows where City is 'London'
london_residents = df[df['City'] == 'London']
print(london_residents)
This code creates a boolean mask (df['City'] == 'London') which is True where the ‘City’ column is ‘London’ and False otherwise. This mask is then used to index the DataFrame, selecting only the rows where the mask is True.
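To see what the mask actually contains, it can help to print it before applying it. The following is a minimal sketch that rebuilds the sample DataFrame so it runs on its own; the variable name mask is illustrative, and .loc is shown as an equivalent, more explicit way to apply the mask.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

# The comparison returns a boolean Series aligned with the DataFrame's index
mask = df['City'] == 'London'
print(mask)

# df[mask] and df.loc[mask] select the same rows
london_residents = df.loc[mask]
print(london_residents)
```

Storing the mask in a variable also makes it easy to reuse the same condition in several places.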
Negation: Selecting Rows That Don’t Match a Condition
To select rows where a column does not equal a specific value, write the condition with the != operator.
# Select rows where City is NOT 'London'
not_london_residents = df[df['City'] != 'London']
print(not_london_residents)
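An existing mask can also be inverted with the ~ operator, which flips every True to False and vice versa. The sketch below assumes the same sample data as above (rebuilt here so the block is self-contained) and uses illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

# Build the positive condition once, then invert it with ~
is_london = df['City'] == 'London'
not_london_residents = df[~is_london]
print(not_london_residents)
```

For a single equality test, != and ~ give the same rows; ~ becomes more useful when the condition being negated is a method call or a combination of several masks.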
Numerical Comparisons: Greater Than, Less Than, etc.
For numerical columns, use comparison operators (>, <, >=, <=) to filter based on value ranges.
# Select rows where Age is greater than 25
older_than_25 = df[df['Age'] > 25]
print(older_than_25)
# Select rows where Age is less than or equal to 25
younger_than_or_equal_to_25 = df[df['Age'] <= 25]
print(younger_than_or_equal_to_25)
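When both ends of a range matter, the Series method between() is one way to express it in a single call. This is a sketch on the same sample data; the bounds 23 and 28 are arbitrary values chosen for illustration, and both are included by default.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

# between() is inclusive on both ends by default: 23 <= Age <= 28
age_23_to_28 = df[df['Age'].between(23, 28)]
print(age_23_to_28)
```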
Combining Conditions: AND and OR Operations
Filter based on multiple column values by combining boolean conditions with the element-wise logical operators & (AND) and | (OR). Enclose each condition in parentheses, because & and | bind more tightly than comparison operators such as == and >.
# Select rows where City is 'London' AND Age is greater than 25
london_and_older = df[(df['City'] == 'London') & (df['Age'] > 25)]
print(london_and_older)
# Select rows where City is 'London' OR Age is greater than 25
london_or_older = df[(df['City'] == 'London') | (df['Age'] > 25)]
print(london_or_older)
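One way to keep combined conditions readable is to give each mask a name and combine the names afterwards. The sketch below does this on the same sample data; the variable names are illustrative, and ~ is used to negate one of the masks.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

# Each named mask is already a complete boolean Series,
# so no extra parentheses are needed when combining them
is_london = df['City'] == 'London'
is_older = df['Age'] > 25

# Older than 25 AND not living in London
older_outside_london = df[is_older & ~is_london]
print(older_outside_london)
```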
Advanced Filtering: Multiple Columns and Complex Logic
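For complex scenarios, the query() method offers improved readability. It accepts the condition as a string in which columns are referred to by name and the keywords and, or, and not can be used.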
# Using query() for better readability
complex_filter = df.query('(City == "London" and Age > 25) or (Age < 23)')
print(complex_filter)
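Filter values often live in Python variables rather than in the query string itself. Inside query(), a local variable can be referenced by prefixing its name with @. The sketch below assumes the same sample data; target_city and min_age are hypothetical names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

# Hypothetical filter parameters held in ordinary Python variables
target_city = 'London'
min_age = 25

# The @ prefix tells query() to look the name up in the surrounding scope
matches = df.query('City == @target_city and Age > @min_age')
print(matches)
```

This avoids building the query string by hand with string formatting, which is easy to get wrong when the values contain quotes.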
Efficient Filtering with isin()
The isin() method provides a concise way to check whether each value in a column appears in a list of allowed values, replacing a chain of == comparisons joined with |.
cities_to_include = ['London', 'Paris']
filtered_df = df[df['City'].isin(cities_to_include)]
print(filtered_df)
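The same method can be used to exclude a set of values by inverting its result with ~. This is a minimal sketch on the same sample data; cities_to_exclude is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
})

cities_to_exclude = ['London', 'Paris']

# isin() marks rows whose City is in the list; ~ keeps all the others
remaining_df = df[~df['City'].isin(cities_to_exclude)]
print(remaining_df)
```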
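This article demonstrated various techniques for filtering DataFrames in Pandas. The right method depends on the complexity of your filtering criteria: boolean indexing suits one or two conditions, query() keeps longer expressions readable, and isin() handles membership in a list of values. When combining conditions with & and |, remember to wrap each condition in parentheses to ensure accurate results.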