Use polars in 2025
pandas has served me very well in my data analytics career. It was, and still is, the absolute powerhouse for (almost) everything you need in day-to-day analytics. Even when it occasionally falls short on some advanced analytical workload, there's a huge ecosystem you can easily plug into for a bit more complexity, like scipy, numpy and scikit-learn.
That being said, I haven't been using pandas professionally for a year now, as the new kid in town has improved so many aspects of pandas that I didn't know I disliked. Whether you're a veteran or a newbie in the field of data analytics, I would strongly suggest you try polars in your next analytical project.
What’s polars
I will keep this section short. polars is a data analysis module written in Rust. It provides a DataFrame structure similar to the one in pandas, but offers much faster performance.
To use polars in your next project, check out this link.
Why polars, or what's wrong with pandas
Before I go into the details, I want to make it perfectly clear that I have huge respect for the team behind pandas; the product is absolutely amazing. Over the years there have been many changes and improvements in the industry, and that is why I believe polars is a better fit in 2025.
For me personally, there are three major reasons why polars is better:
- Speed
- API design
- Type system
Speed
There's not much to say here. pandas has never been known as the fastest framework on the planet, and polars is much faster in almost every operation. There are plenty of benchmarks demonstrating this.
Performance-wise, polars benefits a lot from its Rust core, its lazy-evaluation optimizations, and the underlying Arrow memory model (which pandas also supports in its latest versions). Simply try it and you can really feel the difference.
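To give a feel for the lazy API, here is a minimal sketch; the file name and column names are made up for illustration. Nothing is read or computed until .collect(), which lets polars push the filter down into the scan:

import polars as pl

# Build a lazy query plan; no I/O or computation happens yet.
lazy = (
    pl.scan_csv("./events.csv")          # hypothetical file
      .filter(pl.col("amount") > 100)    # hypothetical column
      .select("user_id", "amount")       # hypothetical columns
)

# Only .collect() triggers the optimized execution.
df = lazy.collect()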
API design
The API design of pandas can sometimes be a bit confusing. Two parts cause me the most trouble when dealing with pandas:
- inplace=True/inplace=False

  Almost every pandas API comes with an inplace parameter. This offers the flexibility to choose whether an operation returns a new DataFrame or modifies the current one directly. On paper this seems perfect, but it creates some ambiguities in your code base.

  One ambiguity is that your IDE can't know the return type of the function for sure. The source code can only specify the return type as pd.DataFrame | None, which completely breaks auto-completion; the best solution I found is to manually specify the type after every function call that uses inplace.

  Another ambiguity happens when you work with someone else on the same code base, as you may have different preferences for inplace=True or inplace=False. Or, even worse, your colleague may not know how inplace changes the behavior. I have had quite a few debugging sessions that drilled down to a mismatch between inplace and assignment (a minimal sketch of this pitfall follows the list).

- .reset_index()

  pandas provides an index feature for faster filtering and selection on rows. This can be helpful, but in many situations I found myself wishing the index could be operated on just like any regular column, let alone the complex situations where a multi-index comes out of a groupby operation. I tried to use the index properly, but after many projects I just defaulted to running df.reset_index(inplace=True) after any operation that could potentially change the index.
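Here is that inplace/assignment mismatch in a minimal sketch (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

# inplace=False (the default) returns a new, sorted DataFrame.
sorted_df = df.sort_values("a")

# inplace=True mutates df and returns None -- assigning the result
# is exactly the mismatch that triggers long debugging sessions.
oops = df.sort_values("a", inplace=True)
print(oops)  # None, not a DataFrame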
In general, pandas provides flexibility in inplace operations and the index column, but I found the flexibility not worth the inconvenience. polars, on the other hand, has a more straightforward API design: there is no inplace action, and every operation returns a new DataFrame. There is no special index; every column is treated the same. As a result, it relieves a lot of mental burden for me when I'm writing code.
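For contrast, a short sketch of the same sort in polars: every method simply returns a new DataFrame, and there is no index to reset anywhere in the chain.

import polars as pl

df = pl.DataFrame({"a": [3, 1, 2]})

# No inplace flag: sort returns a new DataFrame, and chained
# operations never need a reset_index.
result = df.sort("a").filter(pl.col("a") >= 2)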
Type system
The pandas type system is … a bit unclear at best. Numerical values use the np.float/np.int types, string values are object, and datetime values use the datetime64 dtype. In most cases you need to do the type casting yourself, and there is no standardized representation of null values: you have np.nan for numbers, None for strings and pd.NaT for datetimes. All of this creates a mental burden around the type system.
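A quick sketch of the three different missing-value markers:

import numpy as np
import pandas as pd

# Three different "missing" markers depending on the dtype:
pd.Series([1.0, np.nan])                         # float64 -> NaN
pd.Series(["a", None])                           # object -> None
pd.Series([pd.Timestamp("2025-01-01"), pd.NaT])  # datetime64[ns] -> NaT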
polars, on the other hand, handles column types out of the box with a universal null value. An example is below.
The CSV file; note that there is a missing value in every column:
float_col,int_col,str_col,date_col
,2,a,2025-01-01
2.0,,b,2025-02-01
3.0,4,,2025-03-01
4.0,5,d,
Using pandas to read the CSV file and trying to parse the dates:
pd_df = pd.read_csv('./sample_df.csv', parse_dates=True)
print("Pandas schema:")
print(pd_df.dtypes)
print(pd_df)
Note that while pandas does read the null values and uses NaN to represent them, it fails to distinguish the types of str_col and date_col.
Pandas schema:
float_col float64
int_col float64
str_col object
date_col object
dtype: object
float_col int_col str_col date_col
0 NaN 2.0 a 2025-01-01
1 2.0 NaN b 2025-02-01
2 3.0 4.0 NaN 2025-03-01
3 4.0 5.0 d NaN
To properly parse the dates, you have to manually run pd.to_datetime to cast the types, which also changes the missing value from NaN to NaT. None of this is a deal breaker, but small things like this pile up over time and create mental burden.
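The manual cast looks roughly like this:

# Cast the column by hand; the missing value turns into NaT.
pd_df['date_col'] = pd.to_datetime(pd_df['date_col'])
print(pd_df.dtypes)  # date_col is now datetime64[ns]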
Using polars, you get the proper types and a universal null value on the first try.
pl_df = pl.read_csv('./sample_df.csv', try_parse_dates=True)
print("Polars schema:")
print(pl_df.schema)
print(pl_df)
Polars schema:
Schema({'float_col': Float64, 'int_col': Int64, 'str_col': String, 'date_col': Date})
shape: (4, 4)
┌───────────┬─────────┬─────────┬────────────┐
│ float_col ┆ int_col ┆ str_col ┆ date_col │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ str ┆ date │
╞═══════════╪═════════╪═════════╪════════════╡
│ null ┆ 2 ┆ a ┆ 2025-01-01 │
│ 2.0 ┆ null ┆ b ┆ 2025-02-01 │
│ 3.0 ┆ 4 ┆ null ┆ 2025-03-01 │
│ 4.0 ┆ 5 ┆ d ┆ null │
└───────────┴─────────┴─────────┴────────────┘
polars is not perfect
After saying so many good things about polars, I have to admit that polars is not perfect either; it is still a developing framework with tons of potential and room for improvement. Currently I have two big complaints about polars.
- pl.col is a bit too verbose.

  polars is very strict and requires developers to make an explicit effort to clarify whether a string should be treated as a column name or a literal value. I'm looking forward to a simpler solution than using pl.col() and pl.lit() everywhere.

- The type system is too strict sometimes.

  I ran into a weird bug in my last project where I was trying to make a year-over-year comparison over the same time period, for example comparing Jan-14th to Feb-14th for both 2024 and 2025.

  My way around this was to create a numerical helper column with date_helper = pl.col('event_date').dt.month() * 100 + pl.col('event_date').dt.day(). By doing so I can simply run filter(pl.col('date_helper') >= 107) to filter both years' data. (I'm sure there's a better way to achieve similar functionality, but this is how I do it.)

  The above method ran fine in pandas, but in polars I couldn't get the correct result. As it turns out, .dt.month() gives you a pl.Int8, which ranges from -128 to 127, so my date_helper column overflows for every month other than January. The solution is to manually cast these into pl.Int32 or pl.Int16 to prevent the overflow (see the sketch after this list).

  This is a trade-off between a safe, efficient type system and ease of use. I hope polars can come up with a better balance between the two.
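A minimal sketch of the workaround, assuming the Int8 behavior described above (the event_date values are made up for illustration):

import polars as pl
from datetime import date

df = pl.DataFrame({"event_date": [date(2024, 1, 14), date(2024, 2, 14),
                                  date(2025, 1, 14), date(2025, 2, 14)]})

# Cast month to Int32 before the arithmetic; otherwise the Int8
# result overflows for any month past January.
df = df.with_columns(
    (pl.col("event_date").dt.month().cast(pl.Int32) * 100
     + pl.col("event_date").dt.day()).alias("date_helper")
)

# pl.col references a column, pl.lit is a literal value -- polars
# makes you spell out which one you mean.
print(df.filter(pl.col("date_helper") >= pl.lit(107)))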
While your opinion might vary, I would strongly suggest you try out polars in your next project.