Stepwise regression is a powerful technique for selecting the most relevant predictor variables in a regression model. By iteratively adding or removing variables based on statistical significance, it helps build parsimonious models that are easier to interpret and less prone to overfitting. This article explores various methods for performing stepwise regression in Python, using popular libraries like statsmodels, scikit-learn (sklearn), and mlxtend.
Table of Contents
- Forward Selection
- Backward Elimination
- Stepwise Selection (Bidirectional)
- Stepwise Regression with statsmodels
- Stepwise Regression with sklearn
- Stepwise Regression with mlxtend
- Choosing the Right Method and Considerations
Forward Selection
Forward selection starts with an empty model and iteratively adds the predictor variable that most significantly improves the model fit. This improvement is typically assessed using metrics like the F-statistic, AIC (Akaike Information Criterion), or BIC (Bayesian Information Criterion). The process continues until adding another variable doesn’t provide a statistically significant improvement.
Backward Elimination
Backward elimination takes the opposite approach. It begins with a model containing all predictor variables and iteratively removes the least significant one. At each step, the variable with the highest p-value (equivalently, the lowest F-statistic) is removed, as long as that p-value exceeds the chosen significance threshold. This continues until every remaining variable contributes significantly to the model (a statsmodels sketch of this procedure appears after the forward selection example below).
Stepwise Selection (Bidirectional)
Stepwise selection combines the strengths of forward and backward selection. It starts either with an empty model (like forward selection) or a full model (like backward elimination) and, at each step, can both add and remove variables based on their significance, so a variable admitted early can be dropped later once other variables make it redundant. This flexibility can yield a better subset of predictors than either direction alone (mlxtend’s floating mode, shown later, implements this behavior).
Stepwise Regression with statsmodels
statsmodels doesn’t have a built-in stepwise regression function, but we can implement forward selection manually using its OLS (Ordinary Least Squares) model:
import pandas as pd
import statsmodels.api as sm

# Sample data: y grows roughly linearly with the predictors
data = {'y': [10, 12, 15, 18, 20, 22, 25, 28, 30, 32],
        'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'x2': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'x3': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}
df = pd.DataFrame(data)
def forward_selection(data, target, significance_level=0.05):
    """Greedy forward selection: repeatedly add the candidate predictor
    with the lowest p-value until no candidate is significant."""
    included = []
    while True:
        changed = False
        excluded = list(set(data.columns) - set(included) - {target})
        best_pvalue = 1.0
        best_feature = None
        for new_column in excluded:
            model = sm.OLS(data[target],
                           sm.add_constant(data[included + [new_column]])).fit()
            pvalue = model.pvalues[new_column]
            if pvalue < best_pvalue:
                best_pvalue = pvalue
                best_feature = new_column
        if best_pvalue < significance_level:
            included.append(best_feature)
            changed = True
        if not changed:
            break
    return included

selected_features = forward_selection(df, 'y')  # pass the target column name
print(f"Selected features: {selected_features}")
final_model = sm.OLS(df['y'], sm.add_constant(df[selected_features])).fit()
print(final_model.summary())
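Backward elimination can be implemented in the same style. The following is a minimal sketch under the same assumptions (it reuses df and sm from above): start from the full model and repeatedly drop the predictor with the highest non-significant p-value.
def backward_elimination(data, target, significance_level=0.05):
    included = [col for col in data.columns if col != target]
    while included:
        model = sm.OLS(data[target], sm.add_constant(data[included])).fit()
        pvalues = model.pvalues.drop('const')  # ignore the intercept
        worst_pvalue = pvalues.max()
        if worst_pvalue > significance_level:
            # Drop the least significant predictor and refit
            included.remove(pvalues.idxmax())
        else:
            break
    return included

print(f"Selected features: {backward_elimination(df, 'y')}")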
Stepwise Regression with sklearn
sklearn doesn’t implement p-value-based stepwise regression directly, but its Recursive Feature Elimination (RFE) offers a comparable backward-style approach: it iteratively removes the least important features according to the importance scores (here, coefficient magnitudes) of a base model such as Linear Regression:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Recursively eliminate features until only the requested number remains
model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)  # adjust as needed
rfe = rfe.fit(X_train, y_train)

selected_cols = X.columns[rfe.support_]
print(f"Selected features: {list(selected_cols)}")

final_model = LinearRegression().fit(X_train[selected_cols], y_train)
print(f"R-squared on test set: {final_model.score(X_test[selected_cols], y_test)}")
Stepwise Regression with mlxtend
mlxtend’s SequentialFeatureSelector provides a convenient way to perform forward, backward, or stepwise selection:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# forward=True, floating=False -> plain forward selection
sfs = SFS(model, k_features='best', forward=True, floating=False, scoring='r2', cv=5)
sfs = sfs.fit(X, y)

selected_cols = X.columns[list(sfs.k_feature_idx_)]
print(f"Selected features: {list(selected_cols)}")
final_model = LinearRegression().fit(X[selected_cols], y)
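Setting floating=True enables the bidirectional (stepwise) variant: after each forward addition, mlxtend reconsiders the already-selected features and drops any whose removal improves the cross-validated score. A minimal sketch:
# floating=True -> sequential forward floating selection (bidirectional)
sfs_stepwise = SFS(LinearRegression(), k_features='best', forward=True,
                   floating=True, scoring='r2', cv=5)
sfs_stepwise = sfs_stepwise.fit(X, y)
print(f"Selected features: {list(X.columns[list(sfs_stepwise.k_feature_idx_)])}")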
Choosing the Right Method and Considerations
The choice of stepwise regression method (forward, backward, or stepwise) and the library used depends on the specific dataset and goals. Remember to consider:
- Data size: Backward elimination might be computationally expensive with a large number of predictors.
- Multicollinearity: Stepwise methods can struggle with highly correlated predictors.
- Interpretability vs. predictive accuracy: A simpler model (fewer variables) might be easier to interpret, even if slightly less accurate.
- Cross-validation: Always validate your model using techniques like k-fold cross-validation to ensure generalizability.
Stepwise regression should be used cautiously. Because the selection is driven by the same data used to fit the model, it can produce unstable models whose chosen variables change under small perturbations of the data and that don’t generalize well. Consider exploring regularized alternatives such as LASSO, which performs feature selection by shrinking some coefficients exactly to zero, or Ridge regression, which controls overfitting by shrinking coefficients without removing variables.
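For comparison, here is a minimal LASSO sketch using sklearn’s LassoCV (reusing X and y from above): predictors whose coefficients are shrunk exactly to zero are effectively dropped from the model.
from sklearn.linear_model import LassoCV

# Cross-validation chooses the regularization strength alpha automatically
lasso = LassoCV(cv=5).fit(X, y)
for name, coef in zip(X.columns, lasso.coef_):
    status = "dropped" if coef == 0 else "kept"
    print(f"{name}: coefficient = {coef:.3f} ({status})")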