It’s Time to Say Goodbye to Pandas
Pandas revolutionized data analysis in Python. It made working with tabular data intuitive and accessible. But pandas was designed in 2008 for a different era—single-threaded execution, eager evaluation, and datasets that fit comfortably in memory.
Today’s data is bigger. Today’s machines have more cores. And pandas can’t take advantage of either.
The problems with pandas:
- Single-threaded: Your 8-core machine sits mostly idle
- Eager evaluation: Every operation executes immediately, even if you’ll filter 90% away later
- No query optimization: Operations run in the order you write them, not the optimal order
- Memory inefficient: Loads everything into RAM, often making multiple copies
- Weak types: The dreaded object dtype, nullable integer awkwardness
Enter Polars and DuckDB—modern DataFrame libraries built for modern hardware. They offer:
- Lazy evaluation: Build a query plan, let the optimizer figure out the best execution
- Predicate pushdown: Filter data at the source before loading into memory
- Projection pushdown: Only read the columns you actually need
- Automatic parallelization: Use all your CPU cores by default
- Strong type systems: Know exactly what you’re working with
In this post, we’ll benchmark all three libraries on the Iowa Liquor Sales dataset—a real-world transactional dataset with millions of rows. Every operation is timed, so you can see exactly how much performance you’re leaving on the table with pandas.
Setup and Timing Infrastructure
First, let’s set up our environment and create a consistent timing mechanism. Every benchmark will use this same wrapper so results are directly comparable.
import time
import pandas as pd
import polars as pl
import duckdb
from contextlib import contextmanager
# Configure your data path here
DATA_PATH = "/Users/bswr/smithy/portfolio/data/artifacts/iowa_liquor_sales_monthly/"
# Timing results stored for final comparison
BENCHMARKS = {}
@contextmanager
def benchmark(name: str, library: str):
    """Context manager to time operations consistently."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    if name not in BENCHMARKS:
        BENCHMARKS[name] = {}
    BENCHMARKS[name][library] = elapsed
    print(f"{library}: {elapsed:.4f}s")

Type System Comparison
One of the most frustrating aspects of pandas is its loose type system. Strings become object, nullable integers require special dtypes, and type inference can silently produce unexpected results.
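Two toy examples, independent of our dataset, show the kind of silent coercion involved:

```python
import pandas as pd

# An integer column with one missing value is silently upcast to float64,
# because plain int64 has no representation for NaN.
s = pd.Series([1, 2, None])
print(s.dtype)  # float64

# Mixed content falls back to the catch-all 'object' dtype.
m = pd.Series([1, "two", 3.0])
print(m.dtype)  # object
```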
Let’s see how each library handles schema inference on our parquet file:
%%time
# Pandas: Notice the 'object' dtype for strings
print("=== Pandas Schema ===")
df_temp = pd.read_parquet(DATA_PATH)
print(df_temp.dtypes)
print(f"\nShape: {df_temp.shape}")
del df_temp

=== Pandas Schema ===
invoice_and_item_number object
date object
store_number int64
store_name object
address object
city object
zip_code object
store_location object
county_number float64
county object
category float64
category_name object
vendor_number float64
vendor_name object
item_number float64
item_description object
pack int64
bottle_volume_ml int64
state_bottle_cost object
state_bottle_retail object
bottles_sold int64
sale_dollars object
volume_sold_liters object
volume_sold_gallons object
dtype: object
Shape: (32629654, 24)
CPU times: user 33 s, sys: 8.24 s, total: 41.2 s
Wall time: 31.5 s
%%time
# Polars: Explicit types - Utf8 for strings, proper numeric types
print("=== Polars Schema ===")
print(pl.scan_parquet(DATA_PATH).collect_schema())

=== Polars Schema ===
Schema({'invoice_and_item_number': String, 'date': Date, 'store_number': Int64, 'store_name': String, 'address': String, 'city': String, 'zip_code': String, 'store_location': String, 'county_number': Int64, 'county': String, 'category': Int64, 'category_name': String, 'vendor_number': Int64, 'vendor_name': String, 'item_number': Int64, 'item_description': String, 'pack': Int64, 'bottle_volume_ml': Int64, 'state_bottle_cost': Decimal(precision=38, scale=9), 'state_bottle_retail': Decimal(precision=38, scale=9), 'bottles_sold': Int64, 'sale_dollars': Decimal(precision=38, scale=9), 'volume_sold_liters': Decimal(precision=38, scale=9), 'volume_sold_gallons': Decimal(precision=38, scale=9)})
CPU times: user 2.51 ms, sys: 8.79 ms, total: 11.3 ms
Wall time: 28.6 ms
%%time
# DuckDB: SQL-style types with DESCRIBE.
# Note: a bare path glob must end in .parquet (or be wrapped in read_parquet)
# so DuckDB treats it as parquet files rather than a table name.
print("=== DuckDB Schema ===")
print(duckdb.sql(f"DESCRIBE SELECT * FROM read_parquet('{DATA_PATH}*.parquet')"))
Benchmark #1: Full Data Read
Our first benchmark is simple: read the entire parquet file into memory. This establishes a baseline and shows how each library handles raw I/O.
print("=== Benchmark #1: Full Data Read ===\n")
with benchmark("full_read", "pandas"):
    df_pandas = pd.read_parquet(DATA_PATH)

with benchmark("full_read", "polars"):
    df_polars = pl.read_parquet(DATA_PATH)

with benchmark("full_read", "duckdb"):
    df_duckdb = duckdb.read_parquet(DATA_PATH + "*").df()

print(f"\nRows loaded: {len(df_pandas):,}")

=== Benchmark #1: Full Data Read ===
pandas: 27.9592s
polars: 2.6359s
Rows loaded: 32,629,654
duckdb: 59.7854s
Lazy Evaluation vs Eager Execution
Here’s where modern libraries shine. Pandas executes every operation immediately—even if you’re about to throw away 90% of the data. Polars and DuckDB build a query plan first, then optimize before executing.
Watch how “scanning” a parquet file is instant—no data is loaded until you call .collect() or execute the query:
# Polars lazy scan - builds query plan only (instant!)
start = time.perf_counter()
lf_polars = pl.scan_parquet(DATA_PATH)
print(f"Polars scan_parquet: {time.perf_counter() - start:.6f}s (no data loaded!)")
print(f"Type: {type(lf_polars).__name__}")
# DuckDB relation - also lazy
start = time.perf_counter()
rel_duckdb = duckdb.read_parquet(DATA_PATH+"*")
print(f"DuckDB read_parquet: {time.perf_counter() - start:.6f}s (lazy relation)")
print(f"Type: {type(rel_duckdb).__name__}")

Polars scan_parquet: 0.000509s (no data loaded!)
Type: LazyFrame
DuckDB read_parquet: 0.007091s (lazy relation)
Type: DuckDBPyRelation
# Polars shows you the query plan before execution
query = (
    pl.scan_parquet(DATA_PATH)
    .filter(pl.col("category_name") == "VODKA")
    .select(["store_name", "sale_dollars"])
    .group_by("store_name")
    .agg(pl.col("sale_dollars").sum())
)

print("=== Polars Query Plan ===")
print(query.explain())

=== Polars Query Plan ===
AGGREGATE
[col("sale_dollars").sum()] BY [col("store_name")] FROM
simple π 2/3 ["store_name", "sale_dollars"]
Parquet SCAN [/Users/bswr/smithy/portfolio/data/artifacts/iowa_liquor_sales_monthly/iowa_liquor_sales_2012-01.parquet, ... 165 other sources]
PROJECT 3/24 COLUMNS
SELECTION: [(col("category_name")) == (String(VODKA))]
Benchmark #2: Predicate Pushdown (Filtered Read)
Predicate pushdown means pushing filter conditions down to the data source. Instead of loading all data and then filtering, the filter is applied during the read—so only matching rows ever enter memory.
Parquet files store data in row groups with statistics (min/max values). Smart libraries skip entire row groups that can’t match your filter. Pandas doesn’t do this.
print("=== Benchmark #2: Filtered Read (Predicate Pushdown) ===\n")
# Pandas: Must load ALL data, then filter
with benchmark("filter_category", "pandas"):
    df = pd.read_parquet(DATA_PATH)
    result_pandas = df[df["category_name"] == "VODKA"]
    del df

# Polars: Pushes filter to parquet reader
with benchmark("filter_category", "polars"):
    result_polars = (
        pl.scan_parquet(DATA_PATH)
        .filter(pl.col("category_name") == "VODKA")
        .collect()
    )

# DuckDB: Same pushdown via SQL (the glob needs the .parquet extension)
with benchmark("filter_category", "duckdb"):
    result_duckdb = duckdb.sql(
        f"SELECT * FROM '{DATA_PATH}*.parquet' WHERE category_name = 'VODKA'"
    ).df()

print(f"\nFiltered rows: {len(result_pandas):,} (VODKA only)")

=== Benchmark #2: Filtered Read (Predicate Pushdown) ===
Benchmark #3: Projection Pushdown (Column Selection)
Projection pushdown means only reading the columns you need. Parquet is a columnar format—columns are stored separately on disk. If you only need 2 columns out of 20, why read the other 18?
Polars and DuckDB automatically detect which columns your query uses and only read those from disk.
print("=== Benchmark #3: Column Selection (Projection Pushdown) ===\n")
# Pandas: The 'columns' param lets the parquet engine read only those columns
with benchmark("select_columns", "pandas"):
    result_pandas = pd.read_parquet(DATA_PATH, columns=["store_name", "sale_dollars"])

# Polars: Only reads 2 columns from disk
with benchmark("select_columns", "polars"):
    result_polars = (
        pl.scan_parquet(DATA_PATH)
        .select(["store_name", "sale_dollars"])
        .collect()
    )

# DuckDB: Same projection pushdown (the glob needs the .parquet extension)
with benchmark("select_columns", "duckdb"):
    result_duckdb = duckdb.sql(
        f"SELECT store_name, sale_dollars FROM '{DATA_PATH}*.parquet'"
    ).df()

print(f"\nColumns selected: {list(result_pandas.columns)}")

Benchmark #4: Expressions vs Apply
The apply() function is pandas’ escape hatch for row-wise operations. It’s also a massive performance trap—it drops down to Python loops, bypassing all vectorization.
Even when pandas users write “proper” vectorized code, Polars and DuckDB still win because their expression engines are compiled and optimized.
print("=== Benchmark #4: Row Calculations ===\n")
# First, let's see how bad apply() really is (on a sample to not wait forever)
sample_size = 100_000
df_sample = df_pandas.head(sample_size).copy()
print(f"Testing on {sample_size:,} rows first (apply is SLOW)...\n")

with benchmark("row_calc_apply", "pandas"):
    df_sample["margin"] = df_sample.apply(
        lambda row: row["sale_dollars"] - row["state_bottle_cost"], axis=1
    )

print("Now vectorized operations on FULL dataset:\n")

# Pandas vectorized (the right way)
with benchmark("row_calc_vectorized", "pandas"):
    df_pandas["margin"] = df_pandas["sale_dollars"] - df_pandas["state_bottle_cost"]

# Polars expression
with benchmark("row_calc_vectorized", "polars"):
    df_polars = df_polars.with_columns(
        (pl.col("sale_dollars") - pl.col("state_bottle_cost")).alias("margin")
    )

# DuckDB SQL expression
with benchmark("row_calc_vectorized", "duckdb"):
    result = duckdb.sql("""
        SELECT *, sale_dollars - state_bottle_cost AS margin
        FROM df_polars
    """).df()

Benchmark #5: Window Functions (Parallel Column Operations)
Window functions compute aggregates over groups while keeping all rows. In pandas, you use .groupby().transform(). In Polars, you use .over().
The key difference: Polars executes multiple window operations in parallel—each column computation runs on a separate thread.
print("=== Benchmark #5: Window Functions ===\n")
# Pandas: Sequential window operations
with benchmark("window_ops", "pandas"):
    df_temp = df_pandas.copy()
    df_temp["store_total"] = df_temp.groupby("store_name")["sale_dollars"].transform("sum")
    df_temp["category_avg"] = df_temp.groupby("category_name")["bottles_sold"].transform("mean")
    del df_temp

# Polars: Parallel window operations with .over()
with benchmark("window_ops", "polars"):
    result_polars = df_polars.with_columns([
        pl.col("sale_dollars").sum().over("store_name").alias("store_total"),
        pl.col("bottles_sold").mean().over("category_name").alias("category_avg"),
    ])

# DuckDB: Window functions via SQL
with benchmark("window_ops", "duckdb"):
    result_duckdb = duckdb.sql("""
        SELECT *,
            SUM(sale_dollars) OVER (PARTITION BY store_name) AS store_total,
            AVG(bottles_sold) OVER (PARTITION BY category_name) AS category_avg
        FROM df_polars
    """).df()

Benchmark #6: Complex Pipeline (Method Chaining)
Real data work involves multi-step transformations. Let’s compare a typical pipeline: extract year from date, group by year and category, aggregate sales, and sort.
Notice how Polars method chaining reads naturally top-to-bottom, while pandas often requires intermediate variables or awkward patterns like .assign().
print("=== Benchmark #6: Complex Pipeline ===\n")
# Pandas: Method chaining gets awkward
with benchmark("complex_pipeline", "pandas"):
    result_pandas = (
        df_pandas
        .assign(year=df_pandas["date"].dt.year)
        .groupby(["year", "category_name"])
        .agg({"sale_dollars": "sum"})
        .reset_index()
        .sort_values("sale_dollars", ascending=False)
    )

# Polars: Natural chaining with lazy evaluation
with benchmark("complex_pipeline", "polars"):
    result_polars = (
        df_polars.lazy()
        .with_columns(pl.col("date").dt.year().alias("year"))
        .group_by(["year", "category_name"])
        .agg(pl.col("sale_dollars").sum())
        .sort("sale_dollars", descending=True)
        .collect()
    )

# DuckDB: SQL is naturally declarative
with benchmark("complex_pipeline", "duckdb"):
    result_duckdb = duckdb.sql("""
        SELECT
            YEAR(date) AS year,
            category_name,
            SUM(sale_dollars) AS sale_dollars
        FROM df_polars
        GROUP BY 1, 2
        ORDER BY sale_dollars DESC
    """).df()

print("\nTop 5 results:")
print(result_polars.head())

Benchmark #7: Group-By Aggregation
Group-by operations are the bread and butter of data analysis. Here we compute multiple aggregations per group—total sales, average bottles sold, and transaction count—grouped by store and category.
print("=== Benchmark #7: Group-By Aggregation ===\n")
# Pandas
with benchmark("groupby_agg", "pandas"):
    result_pandas = (
        df_pandas
        .groupby(["store_name", "category_name"])
        .agg(
            total_sales=("sale_dollars", "sum"),
            avg_bottles=("bottles_sold", "mean"),
            txn_count=("invoice_and_item_number", "count"),
        )
        .reset_index()
        .sort_values("total_sales", ascending=False)
    )

# Polars
with benchmark("groupby_agg", "polars"):
    result_polars = (
        df_polars.lazy()
        .group_by(["store_name", "category_name"])
        .agg([
            pl.col("sale_dollars").sum().alias("total_sales"),
            pl.col("bottles_sold").mean().alias("avg_bottles"),
            pl.col("invoice_and_item_number").count().alias("txn_count"),
        ])
        .sort("total_sales", descending=True)
        .collect()
    )

# DuckDB
with benchmark("groupby_agg", "duckdb"):
    result_duckdb = duckdb.sql("""
        SELECT
            store_name,
            category_name,
            SUM(sale_dollars) AS total_sales,
            AVG(bottles_sold) AS avg_bottles,
            COUNT(invoice_and_item_number) AS txn_count
        FROM df_polars
        GROUP BY 1, 2
        ORDER BY total_sales DESC
    """).df()

print(f"\nUnique store-category combinations: {len(result_pandas):,}")

Results Summary
Let’s compile all our benchmark results into a single comparison table, including speedup ratios relative to pandas:
print("=== BENCHMARK RESULTS ===\n")
results = []
for op, times in BENCHMARKS.items():
    row = {"operation": op}
    row["pandas (s)"] = f"{times.get('pandas', 0):.4f}"
    row["polars (s)"] = f"{times.get('polars', 0):.4f}"
    row["duckdb (s)"] = f"{times.get('duckdb', 0):.4f}"
    if "pandas" in times:
        pandas_time = times["pandas"]
        polars_time = times.get("polars", pandas_time)
        duckdb_time = times.get("duckdb", pandas_time)
        row["polars speedup"] = f"{pandas_time / polars_time:.1f}x"
        row["duckdb speedup"] = f"{pandas_time / duckdb_time:.1f}x"
    results.append(row)

summary_df = pd.DataFrame(results)
print(summary_df.to_markdown(index=False))

Feature Comparison
Beyond raw speed, here’s how the three libraries compare on key features:
| Feature | Pandas | Polars | DuckDB |
|---|---|---|---|
| Type System | Weak (object dtype) | Strong (explicit types) | Strong (SQL types) |
| Lazy Evaluation | No | Yes | Yes |
| Predicate Pushdown | No | Yes | Yes |
| Projection Pushdown | Partial | Yes | Yes |
| Parallel Execution | No | Yes (automatic) | Yes (automatic) |
| Method Chaining | Awkward | Natural | N/A (SQL) |
| Memory Efficiency | Poor | Excellent | Excellent |
| Query Optimization | No | Yes | Yes |
Conclusion
The benchmarks speak for themselves. Polars and DuckDB outperform pandas by 5-20x on most of these operations—not because they’re magic, but because they’re designed for modern hardware and modern data sizes. (The one loss, DuckDB’s full read, is largely the cost of materializing the result back into a pandas DataFrame via .df().)
When to Use What
Use Polars when:
- Building data pipelines in Python
- You want a DataFrame API (similar to pandas but better)
- You need method chaining and lazy evaluation
- Your data fits in memory (or can be streamed)
Use DuckDB when:
- You prefer SQL or need SQL compatibility
- You’re doing ad-hoc analysis
- You want to query parquet/CSV files directly without loading
- You need to integrate with existing SQL workflows
Use Pandas when:
- Working with legacy codebases
- Data is small (<100MB) and simplicity matters more than speed
- You need a specific pandas-only library integration
The Migration Path
The good news: migrating is easier than you think.
- Start with new projects - Use Polars or DuckDB for new work
- Identify bottlenecks - Profile your pandas code, find the slow parts
- Convert hot paths - Replace slow pandas operations with Polars/DuckDB
- Polars has pandas interop - df.to_pandas() and pl.from_pandas() make gradual migration painless
Stop leaving performance on the table. Your data deserves better than pandas.