Mastering Stepwise Regression in Python: A Comprehensive Guide

Stepwise regression is a powerful technique for selecting the most relevant predictor variables in a regression model. By iteratively adding or removing variables based on statistical significance, it helps build parsimonious models that are easier to interpret and less prone to overfitting. This article explores various methods for performing stepwise regression in Python, using popular libraries like statsmodels, scikit-learn (sklearn), and mlxtend.

Forward Selection

Forward selection starts with an empty model and iteratively adds the predictor variable that most significantly improves the model fit. This improvement is typically assessed using metrics like the F-statistic, AIC (Akaike Information Criterion), or BIC (Bayesian Information Criterion). The process continues until adding another variable doesn’t provide a statistically significant improvement.
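For example, here is a minimal AIC-based sketch with statsmodels (the helper name and stopping rule are our own; it assumes a pandas DataFrame X of predictors and a Series y):

import numpy as np
import statsmodels.api as sm

def forward_select_aic(X, y):
    # Greedy forward selection: start from the intercept-only model and
    # add whichever predictor lowers AIC the most; stop when none helps.
    selected = []
    current_aic = sm.OLS(y, np.ones(len(y))).fit().aic
    while True:
        candidates = [c for c in X.columns if c not in selected]
        scores = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic
                  for c in candidates}
        if not scores:
            break
        best = min(scores, key=scores.get)
        if scores[best] >= current_aic:
            break
        selected.append(best)
        current_aic = scores[best]
    return selected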

Backward Elimination

Backward elimination takes the opposite approach. It begins with a model including all predictor variables and iteratively removes the least significant variable. The variable with the highest p-value (or lowest F-statistic) is removed at each step, provided its removal doesn’t significantly worsen the model fit. This continues until removing any remaining variable significantly degrades the model’s performance.
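A minimal p-value-based sketch with statsmodels (the helper name and threshold are illustrative; X is a DataFrame of predictors, y the response):

import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # Start from the full model and repeatedly drop the predictor with
    # the highest p-value until every remaining one is significant.
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop('const')  # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] > significance_level:
            features.remove(worst)
        else:
            break
    return features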

Stepwise Selection (Bidirectional)

Stepwise selection combines the strengths of forward and backward selection. It starts either with an empty model (like forward selection) or a full model (like backward elimination) and iteratively adds or removes variables based on their significance. This allows for a more flexible and potentially optimal subset of predictors.
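A bidirectional sketch combining the two ideas (the entry/stay thresholds follow common practice but are our own choice here):

import statsmodels.api as sm

def stepwise_selection(X, y, enter=0.05, stay=0.10):
    # Alternate a forward step (add the most significant candidate) with
    # a backward step (drop any included predictor whose p-value has
    # risen above `stay`); stop when neither step changes the set.
    selected = []
    while True:
        changed = False
        candidates = [c for c in X.columns if c not in selected]
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in candidates}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < enter:
                selected.append(best)
                changed = True
        if selected:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = model.pvalues.drop('const').idxmax()
            if model.pvalues[worst] > stay:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected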

Stepwise Regression with statsmodels

statsmodels doesn’t have a built-in stepwise regression function, but we can implement forward selection manually using its OLS (Ordinary Least Squares) model:


import pandas as pd
import statsmodels.api as sm

# Sample data
data = {'y': [10, 12, 15, 18, 20, 22, 25, 28, 30, 32],
        'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'x2': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'x3': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}
df = pd.DataFrame(data)

def forward_selection(data, target, significance_level=0.05):
    # `target` is the name of the response column in `data`.
    included = []
    while True:
        changed = False
        excluded = [col for col in data.columns if col not in included and col != target]
        best_pvalue = 1.0
        best_feature = None
        for new_column in excluded:
            # Fit a model with the current features plus one candidate
            model = sm.OLS(data[target], sm.add_constant(data[included + [new_column]])).fit()
            pvalue = model.pvalues[new_column]
            if pvalue < best_pvalue:
                best_pvalue = pvalue
                best_feature = new_column
        # Add the best candidate only if it is statistically significant
        if best_pvalue < significance_level:
            included.append(best_feature)
            changed = True
        if not changed:
            break
    return included

selected_features = forward_selection(df, 'y')
print(f"Selected features: {selected_features}")
final_model = sm.OLS(df['y'], sm.add_constant(df[selected_features])).fit()
print(final_model.summary())

Stepwise Regression with sklearn

sklearn doesn’t implement classical stepwise regression, but its Recursive Feature Elimination (RFE) offers a closely related backward-style approach. RFE iteratively removes the least important features based on the coefficients of a base model (like Linear Regression):


from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
rfe = RFE(model, n_features_to_select=2) # Adjust as needed
rfe = rfe.fit(X_train, y_train)

print(f"Selected features: {X.columns[rfe.support_]}")

final_model = LinearRegression().fit(X_train[X.columns[rfe.support_]], y_train)
print(f"R-squared on test set: {final_model.score(X_test[X.columns[rfe.support_]], y_test)}")

Stepwise Regression with mlxtend

mlxtend's SequentialFeatureSelector provides a convenient way to perform forward, backward, or stepwise selection:


from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# forward=True, floating=False gives plain forward selection
sfs = SFS(model, k_features='best', forward=True, floating=False, scoring='r2', cv=5)

sfs = sfs.fit(X, y)

print(f"Selected features: {list(sfs.k_feature_names_)}")

final_model = LinearRegression().fit(X[list(sfs.k_feature_names_)], y)
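Setting floating=True enables the bidirectional (floating) variant, which revisits earlier choices after each forward step:

# Sequential Forward Floating Selection: after each addition, SFS also
# tries removing previously selected features if that improves the score.
sffs = SFS(model, k_features='best', forward=True, floating=True, scoring='r2', cv=5)
sffs = sffs.fit(X, y)
print(f"Selected features: {list(sffs.k_feature_names_)}")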

Choosing the Right Method and Considerations

The choice of stepwise regression method (forward, backward, or stepwise) and the library used depends on the specific dataset and goals. Remember to consider:

  • Data size: Backward elimination might be computationally expensive with a large number of predictors.
  • Multicollinearity: Stepwise methods can struggle with highly correlated predictors.
  • Interpretability vs. predictive accuracy: A simpler model (fewer variables) might be easier to interpret, even if slightly less accurate.
  • Cross-validation: Always validate your model using techniques like k-fold cross-validation to ensure generalizability (see the sketch after this list).
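As a quick check, the selected subset can be scored with scikit-learn's cross_val_score (a minimal sketch reusing X, y, and selected_features from the examples above):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated R-squared for the selected feature subset
scores = cross_val_score(LinearRegression(), X[selected_features], y, cv=5, scoring='r2')
print(f"Mean cross-validated R-squared: {scores.mean():.3f}")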

Stepwise regression should be used cautiously: the repeated significance tests inflate false-positive rates, and the resulting models can be unstable and generalize poorly to new data. Consider exploring regularized alternatives such as LASSO, which performs feature selection by shrinking some coefficients exactly to zero, or Ridge regression, which handles overfitting by shrinking coefficients without eliminating them.
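For example, a minimal LASSO sketch using scikit-learn's LassoCV (reusing X and y from above; the cross-validation settings are our own choice):

from sklearn.linear_model import LassoCV

# LASSO can shrink coefficients exactly to zero, performing feature
# selection and regularization in a single step.
lasso = LassoCV(cv=5).fit(X, y)
print(dict(zip(X.columns, lasso.coef_)))

Features whose coefficients are driven to zero are effectively dropped from the model.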
