This tutorial demonstrates how to efficiently import multiple CSV files into a Pandas DataFrame in Python. We’ll cover the fundamentals of Pandas, reading single CSV files, importing multiple files, and finally, concatenating them into a single, unified DataFrame.
Table of Contents
- What is Pandas?
- Reading a Single CSV File
- Reading Multiple CSV Files
- Concatenating DataFrames
- Handling Potential Errors
1. What is Pandas?
Pandas is a cornerstone library in Python’s data science ecosystem. It provides high-performance, easy-to-use data structures and data analysis tools. The core data structure is the DataFrame, a two-dimensional labeled data structure similar to a spreadsheet or SQL table. Pandas simplifies working with structured data from various sources, including CSV files, Excel spreadsheets, and databases.
2. Reading a Single CSV File
Before tackling multiple files, let’s read a single CSV:
```python
import pandas as pd

file_path = 'your_file.csv'  # Replace with your file path
df = pd.read_csv(file_path)
print(df.head())
```
This imports Pandas, specifies the file path, reads the CSV with `pd.read_csv()`, and displays the first five rows with `df.head()`.
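`pd.read_csv()` accepts many keyword arguments beyond the file path. A minimal sketch of three commonly useful ones, using an in-memory buffer and made-up data in place of a real file:

```python
import io

import pandas as pd

# A small in-memory CSV stands in for a file on disk (hypothetical data).
csv_text = "id;name;score\n1;alice;90\n2;bob;85\n"

# read_csv takes the same keyword arguments for a path or a buffer:
# sep sets the delimiter, usecols limits which columns are read,
# and dtype pins down column types explicitly.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                   # semicolon-delimited instead of the default comma
    usecols=["id", "score"],   # read only these two columns
    dtype={"id": "int64"},     # parse id as integer
)

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['id', 'score']
```

Setting `dtype` up front avoids surprises later, such as numeric IDs being silently parsed as floats when a file contains missing values.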
3. Reading Multiple CSV Files
To read multiple CSV files from a directory, use the `glob` module:
```python
import glob

import pandas as pd

directory = 'path/to/your/csv/files/'  # Replace with your directory
csv_files = glob.glob(directory + '*.csv')

dfs = []
for file in csv_files:
    try:
        df = pd.read_csv(file)
        dfs.append(df)
    except pd.errors.EmptyDataError:
        print(f"Warning: Skipping empty file: {file}")
    except pd.errors.ParserError:
        print(f"Warning: Skipping file with parsing errors: {file}")

print(f"Number of DataFrames read: {len(dfs)}")
```
This code finds all CSV files in the specified directory, reads each into a DataFrame, and appends it to a list. The `try`/`except` block handles empty or malformed files, preventing the script from crashing.
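As an alternative to `glob`, the standard-library `pathlib` offers the same pattern matching, and tagging each row with its source file makes the combined data easier to trace. A sketch using throwaway files in a temporary directory (the file names and column are illustrative):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build two small CSV files in a temporary directory as stand-in data.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.csv").write_text("x,y\n1,2\n")
(tmp / "b.csv").write_text("x,y\n3,4\n")

dfs = []
for file in sorted(tmp.glob("*.csv")):  # Path.glob replaces glob.glob
    df = pd.read_csv(file)
    df["source_file"] = file.name       # record which file each row came from
    dfs.append(df)

print(len(dfs))  # 2
```

Sorting the file list also makes the row order of the final concatenated DataFrame reproducible across runs.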
4. Concatenating DataFrames
Finally, we combine the individual DataFrames:
```python
combined_df = pd.concat(dfs, ignore_index=True)
print(combined_df.head())

combined_df.to_csv('combined_data.csv', index=False)  # Optional: save to a new CSV
```
`pd.concat(dfs, ignore_index=True)` concatenates all DataFrames in the `dfs` list; `ignore_index=True` resets the index so the result has a clean, continuous index. The optional `to_csv()` call saves the result to a new file.
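It is worth knowing what `pd.concat` does when the files do not share identical columns: the result keeps the union of all columns and fills the gaps with NaN. A small illustration with made-up frames:

```python
import pandas as pd

# Two frames with partially overlapping columns (hypothetical data).
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5], "c": [6]})

combined = pd.concat([df1, df2], ignore_index=True)

# The union of columns is kept; cells absent from a source frame become NaN,
# and ignore_index=True yields a fresh 0..n-1 index.
print(list(combined.columns))  # ['a', 'b', 'c']
print(list(combined.index))    # [0, 1, 2]
```

If your CSVs should all have the same schema, a quick column check before concatenating will catch a stray file early instead of producing silent NaN columns.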
5. Handling Potential Errors
Robust scripts anticipate problems. The error handling shown in the multiple-file section is a good start; also consider verifying that the directory exists before globbing, and handling other read failures such as incorrect delimiters or missing columns. These checks make your script more reliable and less prone to unexpected failures.
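One way to package those checks is a small helper function, sketched here as a hypothetical `load_csv_directory` that fails fast on a missing directory and skips unreadable files:

```python
import glob
import os

import pandas as pd

def load_csv_directory(directory):
    """Read every CSV in `directory`, skipping empty or unparsable files."""
    # Fail fast with a clear error instead of silently returning no files.
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory not found: {directory}")

    dfs = []
    for file in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        try:
            dfs.append(pd.read_csv(file))
        except pd.errors.EmptyDataError:
            print(f"Warning: skipping empty file: {file}")
        except pd.errors.ParserError:
            print(f"Warning: skipping unparsable file: {file}")
    return dfs
```

A caller can then decide how to react, for example treating an empty result list as an error rather than concatenating nothing.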