Use polars in 2025
pandas has served me very well in my data analytics career. It was, and still is, the absolute powerhouse for (almost) everything you need for day-to-day analytics. Even when it occasionally falls short on some advanced analytical workload, there is a huge ecosystem you can easily plug into for a bit more complexity, like scipy, numpy and scikit-learn.
That being said, I haven't used pandas professionally for a year now, as the new kid in town has improved so many aspects of pandas that I didn't even know I disliked. Whether you're a veteran or a newbie in the field of data analytics, I would strongly suggest trying polars in your next analytical project.
What’s polars
I will keep this section short. polars is a data analysis library written in rust. It provides a DataFrame structure similar to the one pandas offers, but with much faster performance.
To use polars in your next project, check out this link.
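If you want a quick taste first, here is a minimal sketch; the data is made up for illustration:

import polars as pl

df = pl.DataFrame({
    'city': ['Berlin', 'Paris', 'Berlin'],
    'sales': [100, 250, 175],
})

# aggregate sales per city; the result is a new DataFrame
print(df.group_by('city').agg(pl.col('sales').sum()))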
Why polars, or What's wrong with pandas
Before I go into the details, I want to make it perfectly clear that I have huge respect for the team behind pandas; the product is absolutely amazing. But over the years there have been many changes and improvements in the industry, and therefore I believe polars is a better fit in 2025.
For me personally, there are three major reasons why polars is better:
- Speed
- API design
- Type system
Speed
There's not much to say here: pandas itself has never been known as the fastest framework on the planet, and polars is much faster in almost every operation. There are enough benchmarks out there to demonstrate this.
Performance-wise, polars benefits a lot from its rust core, its lazy evaluation optimization and the underlying Arrow memory model (which pandas also supports in its latest versions). Simply try it and you can really feel the difference.
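As a minimal sketch of the lazy side (the file and columns here are hypothetical): you declare the whole query up front, polars optimizes the plan, and nothing is executed until .collect() is called.

import polars as pl

lazy = (
    pl.scan_csv('./events.csv')        # builds a query plan, reads nothing yet
    .filter(pl.col('amount') > 0)      # the optimizer can push this into the scan
    .group_by('user_id')
    .agg(pl.col('amount').sum())
)
result = lazy.collect()                # the optimized plan runs here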
API design
The API design of pandas can sometimes be a bit confusing. Two parts cause me the most trouble when dealing with pandas:
- inplace=True/inplace=False
Almost every pandas API comes with an inplace parameter. This offers the flexibility to choose whether an operation returns a new DataFrame or modifies the current one directly. On paper this seems perfect, but it creates some ambiguities in your code base.
One ambiguity is that your IDE can never be sure of a function's return type. The source code can only declare the return type as pd.DataFrame | None, which completely breaks auto-completion; the best solution I have found is to manually annotate the type after every call that uses inplace.
Another ambiguity appears when you work with someone else on the same code base, as you may have different preferences for inplace=True or inplace=False. Or even worse, your colleague may not know how inplace changes the behavior at all. I have had quite a few debugging sessions that drilled down to nothing more than a mismatch between inplace and an assignment.
- .reset_index()
pandas provides an index feature for faster filtering and selection on rows. This can be helpful, but in many situations I found myself wishing the index could be operated on just like any other regular column, let alone the more complex situation where a multi-index comes out of a groupby operation. I tried to utilize the index properly, but after so many projects I just defaulted to running df.reset_index(inplace=True) after any operation that could potentially change the index. (A sketch of both pain points follows this list.)
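Here is a small pandas sketch of both pain points; the column names are made up for illustration:

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# inplace ambiguity: with inplace=True the call returns None, not a DataFrame
result = df.sort_values('value', inplace=True)
print(result)  # None -- easy to miss when reading the code

# groupby moves 'group' into the index, so it is no longer a regular column
agg = df.groupby('group').sum()
agg = agg.reset_index()  # bring 'group' back as a normal column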
In general, pandas provides flexibility around inplace operations and the index column, but I found that flexibility not worth the inconvenience. polars, on the other hand, has a more straightforward API design: there is no inplace action, every operation returns a new DataFrame, and there is no special index, so every column is treated the same. As a result, it relieves a lot of mental burden for me when I'm writing code.
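For comparison, the same aggregation in polars (same hypothetical columns); there is no inplace variant to choose and no index to reset:

import polars as pl

df = pl.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# every operation returns a new DataFrame; 'group' stays a regular column
agg = df.group_by('group').agg(pl.col('value').sum())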
Type system
The pandas type system is … a bit unclear at best. Numerical values use numpy's float/int types, string values are plain object, and datetime values use pandas' own datetime64 dtype. In most cases you need to do the type casting yourself, and there is no standardized method for null values: you have np.nan for numerical columns, None (or np.nan) for strings and pd.NaT for datetimes. All of this creates a mental burden around the type system.
polars, on the other hand, has a more out-of-the-box solution for column types, with a universal null value. An example below.
A csv file; note that there is a missing value in every column:
float_col,int_col,str_col,date_col
,2,a,2025-01-01
2.0,,b,2025-02-01
3.0,4,,2025-03-01
4.0,5,d,
Using pandas to read the csv file and trying to parse the dates:
pd_df = pd.read_csv('./sample_df.csv', parse_dates=True)
print("Pandas schema:")
print(pd_df.dtypes)
print(pd_df)
Note that while pandas does read the null values and uses nan to represent them, it fails to distinguish the types of str_col and date_col; both come back as object.
Pandas schema:
float_col float64
int_col float64
str_col object
date_col object
dtype: object
float_col int_col str_col date_col
0 NaN 2.0 a 2025-01-01
1 2.0 NaN b 2025-02-01
2 3.0 4.0 NaN 2025-03-01
3 4.0 5.0 d NaN
To properly parse the dates, you have to manually run pd.to_datetime to cast the types, which also changes the missing value from NaN to NaT. None of this is a deal breaker, but small things like this pile up over time and create mental burden.
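The manual cast looks something like this:

# cast date_col to datetime; its missing value changes from NaN to NaT
pd_df['date_col'] = pd.to_datetime(pd_df['date_col'])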
Using polars, you get the proper types and a universal null value on the first try.
pl_df = pl.read_csv('./sample_df.csv', try_parse_dates=True)
print("Polars schema:")
print(pl_df.schema)
print(pl_df)
Polars schema:
Schema({'float_col': Float64, 'int_col': Int64, 'str_col': String, 'date_col': Date})
shape: (4, 4)
┌───────────┬─────────┬─────────┬────────────┐
│ float_col ┆ int_col ┆ str_col ┆ date_col │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ str ┆ date │
╞═══════════╪═════════╪═════════╪════════════╡
│ null ┆ 2 ┆ a ┆ 2025-01-01 │
│ 2.0 ┆ null ┆ b ┆ 2025-02-01 │
│ 3.0 ┆ 4 ┆ null ┆ 2025-03-01 │
│ 4.0 ┆ 5 ┆ d ┆ null │
└───────────┴─────────┴─────────┴────────────┘
polars is not perfect
After saying so many good things about polars, I have to admit that polars is not perfect either. It is still a developing framework with tons of potential and room for improvement. Currently I have two big complaints about polars.
- The pl.col syntax is a bit too verbose. polars is very strict and requires developers to make an explicit effort to clarify whether a string should be treated as a column name or a literal value. I'm looking forward to a simpler solution than writing pl.col() and pl.lit() everywhere. (See the first sketch after this list.)
- The type system is sometimes too strict.
I ran into a weird bug in my last project where I was trying to make a year-over-year comparison over the same time period, for example comparing Jan-14th to Feb-14th for both 2024 and 2025.
My way around this was to create a helper numerical column with date_helper = pl.col('event_date').dt.month() * 100 + pl.col('event_date').dt.day(). By doing so I can simply run filter(pl.col('date_helper') >= 114) to select both years' data. (I'm sure there's a better way to achieve similar functionality, but this is how I did it.)
The same approach ran fine in pandas, but in polars I couldn't get the correct result. As it turns out, .dt.month() gives you a pl.Int8, which only holds values from -128 to 127, so my date_helper column overflows for every month other than January. The solution is to manually cast the parts to pl.Int32 or pl.Int16 to prevent the overflow; the second sketch below reproduces this.
This is a trade-off between a safe, efficient type system and ease of use. I hope polars can come up with a better balance between the two.
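To illustrate the first complaint, a minimal sketch of the explicit column-versus-literal distinction (hypothetical column names):

import polars as pl

df = pl.DataFrame({'status': ['new', 'done']})

# inside then/otherwise a bare string is read as a column name,
# so literal values have to be wrapped in pl.lit
df = df.with_columns(
    pl.when(pl.col('status') == 'done')
    .then(pl.lit('closed'))
    .otherwise(pl.col('status'))
    .alias('status')
)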
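And a small reproduction of the second complaint, assuming a hypothetical event_date column (the exact overflow behavior can depend on the polars version):

import datetime
import polars as pl

df = pl.DataFrame({'event_date': [datetime.date(2025, 2, 14)]})

# .dt.month() yields Int8; on the version I used, month * 100 stayed Int8
# and silently wrapped around for any month >= 2
bad = df.with_columns(
    (pl.col('event_date').dt.month() * 100
     + pl.col('event_date').dt.day()).alias('date_helper')
)

# casting to a wider integer first keeps the arithmetic safe
good = df.with_columns(
    (pl.col('event_date').dt.month().cast(pl.Int16) * 100
     + pl.col('event_date').dt.day()).alias('date_helper')
)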
While your opinion might vary, I would strongly suggest trying out polars in your next project.