Data Science

Efficiently Importing and Combining Multiple CSV Files with Pandas

This tutorial demonstrates how to efficiently import multiple CSV files into a Pandas DataFrame in Python. We’ll cover the fundamentals of Pandas, reading single CSV files, importing multiple files, and finally, concatenating them into a single, unified DataFrame.

Table of Contents

  1. What is Pandas?
  2. Reading a Single CSV File
  3. Reading Multiple CSV Files
  4. Concatenating DataFrames
  5. Handling Potential Errors

1. What is Pandas?

Pandas is a cornerstone library in Python’s data science ecosystem. It provides high-performance, easy-to-use data structures and data analysis tools. The core data structure is the DataFrame, a two-dimensional labeled data structure similar to a spreadsheet or SQL table. Pandas simplifies working with structured data from various sources, including CSV files, Excel spreadsheets, and databases.
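To make the DataFrame idea concrete, here is a tiny example built from an in-memory dictionary (the column names and values are arbitrary placeholders):

```python
import pandas as pd

# A DataFrame is a labeled, two-dimensional table: named columns plus a row index.
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'score': [85, 92],
})

print(df.shape)            # (rows, columns)
print(df['score'].mean())  # column-wise operations work out of the box
```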

2. Reading a Single CSV File

Before tackling multiple files, let’s read a single CSV:


import pandas as pd

file_path = 'your_file.csv'  # Replace with your file path
df = pd.read_csv(file_path)
print(df.head())

This imports Pandas, specifies the file path, reads the CSV using pd.read_csv(), and displays the first five rows using df.head().

3. Reading Multiple CSV Files

To read multiple CSV files from a directory, we utilize the glob module:


import pandas as pd
import glob
import os

directory = 'path/to/your/csv/files/'  # Replace with your directory
csv_files = glob.glob(os.path.join(directory, '*.csv'))
dfs = []

for file in csv_files:
    try:
        df = pd.read_csv(file)
        dfs.append(df)
    except pd.errors.EmptyDataError:
        print(f"Warning: Skipping empty file: {file}")
    except pd.errors.ParserError:
        print(f"Warning: Skipping file with parsing errors: {file}")

print(f"Number of DataFrames read: {len(dfs)}")

This code finds all CSV files in the specified directory, reads each into a DataFrame, and appends it to a list. The try-except block handles potential errors like empty files or parsing errors, preventing the script from crashing.

4. Concatenating DataFrames

Finally, we combine the individual DataFrames:


if dfs:
    combined_df = pd.concat(dfs, ignore_index=True)
    print(combined_df.head())
    combined_df.to_csv('combined_data.csv', index=False)  # Optional: save to a new CSV
else:
    print("No DataFrames to combine.")

pd.concat(dfs, ignore_index=True) concatenates all DataFrames in the dfs list. ignore_index=True discards each file's original index and assigns a clean, continuous one. The guard matters because pd.concat raises a ValueError when given an empty list, which happens if every file was skipped. The optional to_csv() saves the result.
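As a quick sanity check, using two small in-memory DataFrames in place of ones read from disk, you can confirm that no rows are lost and that ignore_index=True yields a continuous index:

```python
import pandas as pd

# Two toy DataFrames standing in for the ones read from disk.
dfs = [
    pd.DataFrame({'a': [1, 2]}),
    pd.DataFrame({'a': [3]}),
]
combined = pd.concat(dfs, ignore_index=True)

# Row counts add up, and the index runs 0, 1, 2 with no repeats.
assert len(combined) == sum(len(df) for df in dfs)
assert list(combined.index) == [0, 1, 2]
```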

5. Handling Potential Errors

Robust scripts anticipate issues. The try-except block in Section 3 is a start, but consider also verifying that the directory exists before globbing, handling non-default delimiters (e.g. semicolon-separated files), and guarding against the case where no files are read successfully. These checks make your script more reliable and less prone to unexpected failures.
