
Python Pandas and SQL form the foundation for data analysis, machine learning, and ETL pipelines. Handling large DataFrames and running complex database queries requires efficiency without sacrificing code clarity.
Embedding SQL queries in Pandas workflows accelerates filtering, aggregation, and joins while maintaining Python’s flexibility and result consistency.
This guide covers pandasql setup and Pandas’ native SQL methods, presents real-world DataFrame query examples, and outlines best practices for optimising analytics workflows and reporting.
Why Combine Python Pandas and SQL?
Pandas is a Python library built for data manipulation and analysis. It’s the go-to for slicing, dicing, and transforming tabular data. SQL (Structured Query Language), on the other hand, is the gold standard for querying relational databases: think MySQL, PostgreSQL, SQLite, and more.

Blending these two is a game-changer: you get SQL’s readable, declarative syntax for filtering, grouping, and joining, plus Pandas’ flexibility for everything else you want to do in Python.
The Bridge: pandasql and Native Pandas SQL Integration
pandasql enables the execution of SQL queries directly on Pandas DataFrames, eliminating the need to export data, provision a separate database, or adopt additional APIs; users simply write SQL statements, receive a resulting DataFrame, and proceed uninterrupted.
Installing pandasql
bash
pip install pandasql
Now you’re ready to blend SQL and Pandas like a pro.
Getting Started: Basic Usage
Let’s walk through a simple example. Suppose you’ve got a DataFrame:
python
import pandas as pd
import pandasql as psql
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
query = "SELECT * FROM df"
result = psql.sqldf(query, locals())
print(result)
This returns the full DataFrame (the same rows df.head() would show for this small example), but using SQL syntax. You can now filter, group, and join just as you would in a database.
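Joins work the same way. Here’s a quick sketch: the people DataFrame mirrors the one above, and scores is a made-up second table added purely for illustration.
python
import pandas as pd
import pandasql as psql

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
scores = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [88, 92]})  # hypothetical second table

# Filter and join the two DataFrames with ordinary SQL
query = """
SELECT p.Name, p.Age, s.Score
FROM people p
LEFT JOIN scores s ON p.Name = s.Name
WHERE p.Age >= 25
ORDER BY p.Age
"""
print(psql.sqldf(query, locals()))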
Real-World Data Analysis with Pandas and SQL
Let’s level up with a practical dataset. Imagine you’re analysing a car sales dataset with columns like brand, model, year, price, mileage, and more.
Loading and Exploring Data
python
import pandas as pd
import pandasql as ps
car_data = pd.read_csv("cars_datasets.csv")
print(car_data.head())
print(car_data.info())
print(car_data.isnull().sum())
You’ll see the column names, data types, and any missing values, all essential for quality data analysis.
Running SQL Queries on DataFrames
Top 10 Most Expensive Cars
python
# Helper: run any SQL query against the car_data DataFrame
def q(query):
    return ps.sqldf(query, {'car_data': car_data})
q("""
SELECT brand, model, year, price
FROM car_data
ORDER BY price DESC
LIMIT 10
""")
Average Price by Brand
python
q("""
SELECT brand, ROUND(AVG(price), 2) AS avg_price
FROM car_data
GROUP BY brand
ORDER BY avg_price DESC
""")
Cars Manufactured After 2015
python
q("""
SELECT *
FROM car_data
WHERE year > 2015
ORDER BY year DESC
""")
Total Cars by Brand
python
q("""
SELECT brand, COUNT(*) as total_listed
FROM car_data
GROUP BY brand
ORDER BY total_listed DESC
LIMIT 5
""")
Grouping by Condition
python
q("""
SELECT condition, ROUND(AVG(price), 2) AS avg_price, COUNT(*) as listings
FROM car_data
GROUP BY condition
ORDER BY avg_price DESC
""")
Average Mileage and Price by Brand
python
q("""
SELECT brand,
ROUND(AVG(mileage), 2) AS avg_mileage,
ROUND(AVG(price), 2) AS avg_price,
COUNT(*) AS total_listings
FROM car_data
GROUP BY brand
ORDER BY avg_price DESC
LIMIT 10
""")
Price per Mile
python
q("""
SELECT brand,
ROUND(AVG(price/mileage), 4) AS price_per_mile,
COUNT(*) AS total
FROM car_data
WHERE mileage > 0
GROUP BY brand
ORDER BY price_per_mile DESC
LIMIT 10
""")
Visualising Data by State
You can even use widgets and Plotly for interactive dashboards:
python
import plotly.express as px
import ipywidgets as widgets
state_dropdown = widgets.Dropdown(
    options=car_data['state'].unique().tolist(),
    value=car_data['state'].unique()[0],
    description='Select State:',
    layout=widgets.Layout(width='50%')
)

def plot_avg_price_state(state_selected):
    query = f"""
    SELECT brand, AVG(price) AS avg_price
    FROM car_data
    WHERE state = '{state_selected}'
    GROUP BY brand
    ORDER BY avg_price DESC
    """
    result = q(query)
    fig = px.bar(result, x='brand', y='avg_price', color='brand',
                 title=f"Average Car Price in {state_selected}")
    fig.show()

widgets.interact(plot_avg_price_state, state_selected=state_dropdown)
This makes your analysis interactive and visually appealing, perfect for dashboards or presentations.
Beyond pandasql: Native Pandas SQL Operations
While pandasql is ace for quick SQL-style queries, Pandas also supports direct SQL integration for working with actual databases (like SQLite, PostgreSQL, MySQL):
Example: Reading and Writing to SQL
python
import pandas as pd
import sqlite3
# Connect to SQLite database
conn = sqlite3.connect(":memory:")
# Create a table and insert data
conn.execute("CREATE TABLE Students (id INTEGER, Name TEXT, Marks REAL, Age INTEGER)")
conn.execute("INSERT INTO Students VALUES (1, 'Kiran', 80, 16), (2, 'Priya', 60, 14), (3, 'Naveen', 82, 15)")
# Read from SQL
df = pd.read_sql("SELECT * FROM Students", conn)
print(df)
# Write to SQL
df.to_sql("Students_Copy", conn, if_exists="replace", index=False)
This approach is perfect for ETL pipelines, reporting, and production data workflows.
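The same pattern extends beyond SQLite. Here’s a minimal sketch of pulling data from PostgreSQL via SQLAlchemy and writing a summary back; the connection string and the orders table (with region, amount, and created_at columns) are placeholders for illustration only.
python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; swap in your own host, credentials, and database
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# Read only what you need straight into a DataFrame
orders = pd.read_sql("SELECT * FROM orders WHERE created_at >= '2024-01-01'", engine)

# Summarise in Pandas, then write the result back as a reporting table
summary = orders.groupby("region", as_index=False)["amount"].sum()
summary.to_sql("orders_summary", engine, if_exists="replace", index=False)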
Advanced Use Cases: ETL, Machine Learning, and Dashboards
Combining SQL and Pandas isn’t just about querying; it’s about building smarter workflows, where SQL handles the heavy extraction and aggregation while Pandas handles transformation, feature engineering, and visualisation, as the sketch below shows.
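As an illustration, here’s a hedged sketch of a small extract-transform-load step. The cars.db file and its car_data table are hypothetical stand-ins for whatever database backs your pipeline.
python
import sqlite3
import pandas as pd

# Hypothetical SQLite file holding a car_data table (any read_sql-compatible database works)
conn = sqlite3.connect("cars.db")

# Extract: let SQL do the heavy aggregation inside the database
features = pd.read_sql(
    """
    SELECT brand,
           AVG(price)   AS avg_price,
           AVG(mileage) AS avg_mileage,
           COUNT(*)     AS listings
    FROM car_data
    GROUP BY brand
    """,
    conn,
)

# Transform: engineer an extra feature in Pandas
features["price_per_mile"] = features["avg_price"] / features["avg_mileage"]

# Load: persist the feature table for the next stage (reporting, ML, dashboards)
features.to_sql("brand_features", conn, if_exists="replace", index=False)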
Pandasql vs. Pure Pandas: When to Use What?
| Feature | pandasql (SQL) | Pure Pandas |
|---|---|---|
| Syntax | SQL (familiar to many) | Python (flexible, powerful) |
| Readability | High for complex queries | Can get verbose |
| Performance | Slower on very large datasets | Faster, optimised for Python |
| Joins/Grouping | Very intuitive | More code, but more options |
| Integration | Great for quick analysis | Best for production workflows |
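To make the trade-off concrete, here’s the same brand-level average computed both ways, reusing the car_data DataFrame loaded earlier:
python
import pandas as pd
import pandasql as ps

car_data = pd.read_csv("cars_datasets.csv")

# pandasql: declarative, reads like a database query
sql_result = ps.sqldf(
    "SELECT brand, AVG(price) AS avg_price FROM car_data GROUP BY brand",
    {"car_data": car_data},
)

# Pure Pandas: the same aggregation expressed with groupby
pandas_result = (
    car_data.groupby("brand", as_index=False)["price"]
    .mean()
    .rename(columns={"price": "avg_price"})
)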
Limitations and Best Practices
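Keep in mind that pandasql runs each query by loading the referenced DataFrames into an in-memory SQLite database, which is why it lags behind native Pandas on very large datasets. For big tables, a common workaround is to let the database do the filtering and stream results in chunks. A minimal sketch, assuming a large (hypothetical) listings table in a SQLite file:
python
import sqlite3
import pandas as pd

conn = sqlite3.connect("cars.db")  # placeholder file; the listings table is hypothetical

# Stream the table in chunks instead of loading it into memory all at once
partials = []
for chunk in pd.read_sql("SELECT brand, price FROM listings", conn, chunksize=50_000):
    partials.append(chunk.groupby("brand")["price"].sum())

# Combine the per-chunk sums into one result
brand_totals = pd.concat(partials).groupby(level=0).sum()
print(brand_totals.sort_values(ascending=False).head())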
Final Thoughts
The integrated use of Python Pandas and SQL represents an essential competency for data analysts, AI engineers, and research professionals. This methodology aligns relational database querying with Pandas’ powerful DataFrame operations, enhancing both efficiency and code clarity. By leveraging tools such as pandasql alongside Pandas’ native SQL integration, teams can execute exploratory data analysis (EDA), robust ETL workflows, and machine learning pipelines within a cohesive environment.
Adopting this dual approach ensures scalable, maintainable analytics processes and positions teams for long-term success.
Want to keep your AI and data skills sharp?
Explore more tutorials on LLMs, prompt engineering, RAG, and AI agent workflows. Stay tuned for more guides and hands-on examples from the AI MOJO.