Pandas is the main driving force behind the ever more popular Exploratory Data Analysis (EDA) workloads [1], [2], [3], [4], but there are still no good optimization techniques for Pandas in this setting. EDA workloads are special because they are ad hoc and unrefined. The code written is diverse and may include anything from pure Python to arbitrary combinations of multiple libraries. Furthermore, EDA is also supposed to be quick! The code should be fast to write and fast to run. EDA is not the time to learn new APIs, and it is not the time to carefully optimize your code.
These characteristics make current solutions fall short in optimizing Pandas for the EDA setting: they either require learning new APIs or incur significant overheads even when running on powerful hardware [5]. We introduce Dias, a novel, extremely lightweight, and yet effective optimizer of Pandas code, tailored to EDA workloads.
Dias is a dynamic rewriter of Pandas code. It is not a replacement for Pandas, but rather a process that runs in the background of your Jupyter notebook, looking for code patterns it can rewrite into faster equivalents. Upon recognizing one, Dias rewrites the code automatically and correctly.
Dias has several advantages. First, it can offer substantial performance improvements, possibly reaching 100x or 1000x speedups. Second, Dias is extremely lightweight, incurring virtually no runtime or memory overheads; thus, it generally won't make your code slower than vanilla Pandas. Third, Dias inherently does not suffer from a lack of API support because it is not a replacement for Pandas: if it does not understand a piece of code, it leaves it untouched. Finally, you don't need to know anything about Dias' internals to understand the cause of an optimization, because the code Dias outputs is still standard Python/Pandas code. You can simply ask Dias to show you the rewritten version, and even copy-and-paste it into a new cell to experiment.
For a more detailed analysis of Dias, please take a look at our paper.
Just pip install dias and add the %%rewrite magic function to all of your cells. That's it! Let's see an example.
Suppose we have the following notebook, which populates a DataFrame with random data and then calls a function¹, with apply(), on every row.
import pandas as pd
import dias.rewriter
import numpy as np

# Build a 2.5M-row, 20-column DataFrame of random data.
rand_arr = np.random.rand(2_500_000, 20)
df = pd.DataFrame(rand_arr)
%%time
def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return (v/(v+m) * R) + (m/(m+v) * C)
_ = df.apply(weighted_rating, axis=1)

CPU times: user 9.97 s, sys: 90.9 ms, total: 10.1 s
Wall time: 10.1 s
The operation is quite slow, taking about 10 seconds.
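To build some intuition for why: apply() with axis=1 constructs a Series for every row and invokes the Python function once per row, so interpreter overhead dominates. Conceptually, the cell above behaves like the following pure-Python loop (our own simplified, illustrative sketch, not Pandas' actual implementation):

# A simplified sketch of what df.apply(weighted_rating, axis=1) does
# conceptually; this is NOT Pandas' actual implementation.
sample = df.head(10_000)        # small slice: the explicit loop is even slower
results = []
for idx in sample.index:
    row = sample.loc[idx]       # a fresh Series is built for every row
    results.append(weighted_rating(row))  # one Python call per row
slow = pd.Series(results, index=sample.index)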
Now, we will leave the cell untouched and just add the %%rewrite magic function.
%%time
%%rewrite
def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return (v/(v+m) * R) + (m/(m+v) * C)
_ = df.apply(weighted_rating, axis=1)

CPU times: user 59.6 ms, sys: 7.57 ms, total: 67.2 ms
Wall time: 66 ms
We can see that, just like that, we got a 153x speedup. And this experiment was done on a laptop. We can also ask Dias to show us what it did with %%rewrite verbose.
%%rewrite verbose
def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return (v/(v+m) * R) + (m/(m+v) * C)
_ = df.apply(weighted_rating, axis=1)

Dias prints the rewritten version of the cell:

def weighted_rating(x, m=50, C=5.6):
    v = x[0]
    R = x[9]
    return v / (v + m) * R + m / (m + v) * C
_ = weighted_rating(df)
Dias recognized that instead of calling the function individually for every row (which is what apply() does), it can simply apply the function directly to the whole DataFrame.
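To see why this rewrite is valid: when the function receives the whole DataFrame, x[0] and x[9] select entire columns, and the arithmetic broadcasts element-wise over all rows at once, replacing millions of Python-level calls with a handful of vectorized operations. Here is a minimal equivalence check you can run yourself (illustrative; not part of Dias):

# Illustrative equivalence check between the original and rewritten code.
import numpy as np

small = df.head(10_000)                            # keep the slow version quick
row_wise = small.apply(weighted_rating, axis=1)    # one Python call per row
vectorized = weighted_rating(small)                # x[0], x[9] are whole columns
assert np.allclose(row_wise, vectorized)           # same results, much faster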
This is basically all you need to know to use Dias. However, we do recommend that you take a look at our documentation and examples.
Given that EDA's popularity is rapidly growing, it should come as no surprise that the industrial and academic communities have devoted considerable effort to optimizing Pandas, usually by shipping Pandas replacements. Examples include Modin, Dask, and Koalas, which can, e.g., scale the workload out to multiple servers.
Unfortunately, these libraries can incur significant overheads. As we detail in the Dias paper, these include significant runtime and memory overheads², but also a cost in human effort, as the user has to learn new APIs. In our opinion, these overheads are acceptable when the data-preparation pipeline is fixed, because the amortized performance gain, especially when moving to huge datasets, justifies the trade-off.
At the time of EDA, however, this trade-off is probably not worth it: learning new APIs, requiring hardware far more powerful than consumer machines (or the limited resources of Kaggle and Google Colab), and incurring significant runtime overheads all impair the quick-and-dirty nature of EDA. This situation is what led us to a new research direction, looking for an alternative, lightweight optimization technique.
We should clarify that Dias is not a replacement for these frameworks. For example, Dias does not scale as the number of cores or the amount of memory increases, and it won't be able to load any dataset that Pandas cannot load. Dias and these other techniques/frameworks are simply intended for different settings (and they are also conceptually orthogonal).
Dias is an ongoing research project by the ADAPT group @ UIUC. You can help by sending us notebooks that you want to speed up, and we will do our best to make Dias speed them up automatically! Moreover, if you are aware of a pattern that can be rewritten to a faster version, please consider submitting a relevant issue.
We also welcome feedback from all backgrounds, including industry specialists, data analysts and academics. Please reach out to sb54@illinois.edu to share your opinion!
² For example, in the notebook adidas-retail-eda: Dias and Pandas consume less than 2GB of memory for this notebook, while Modin skyrockets its memory consumption to almost 90GB. At the same time, Modin is at least 4x slower than both Pandas and Dias.