Monthly Python Data Engineering, October 2024
Monthly news from the Python Data Engineering world.
Hi and welcome to this new issue of the newsletter,
This issue lands a bit later than usual, after October had already ended. I did my best to hold out while the rest of the family went through flu, fever, coughing, and the whole set of entertaining activities typically associated with autumn, but at the end of October I apparently got envious and ended up bedridden with fever myself. That's why you see this issue going out only now.
As promised last month, the second chapter of the How Data Platforms Work book has been published. This chapter covers compute nodes based on Key Columns, and as usual the implementation is available as part of the DataPyground project.
Key Highlight
This month Apache Arrow had an interesting new major release, with Arrow crossing the Rubicon: Pandas and NumPy are now both optional dependencies (Pandas has been optional for a while).
The life of PyArrow has always been tightly coupled with Pandas and NumPy, and originally one of the most common use cases for PyArrow was loading Parquet data into Pandas and NumPy. Nowadays Arrow is mature enough to have its own independent life, and as an interoperability format it is successful enough that it is no longer on Arrow's shoulders to implement support for all the libraries it has to interact with. Vice versa, it is up to the libraries themselves to support Arrow as a format. In this scenario it makes sense to make the NumPy and Pandas import/export capabilities something separate and optional, as the chances that a platform using Arrow doesn't involve NumPy at all are now much higher than in the past.
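As a minimal sketch of what this means in practice (the data is made up, and the point is only that the NumPy and Pandas conversions are the one place where those libraries are still needed):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Plain PyArrow works without NumPy or Pandas installed.
table = pa.table({"city": ["Turin", "Milan", "Rome"],
                  "population": [848_885, 1_371_498, 2_872_800]})
print(pc.sum(table["population"]).as_py())

# Converting to/from NumPy or Pandas still works, but now requires
# installing those libraries explicitly.
import numpy as np
arrow_arr = pa.array(np.array([1, 2, 3]))  # NumPy -> Arrow
numpy_arr = arrow_arr.to_numpy()           # Arrow -> NumPy

import pandas as pd
df = table.to_pandas()                     # Arrow -> Pandas
table_again = pa.Table.from_pandas(df)     # Pandas -> Arrow
```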
Want to suggest interesting libraries and frameworks for the newsletter?
Reply to the newsletter email at alessandromolina@substack.com
News
Arrow has seen a new major version released: version 18.0.0 introduces a ton of improvements across all language implementations, and some of them are very notable. PyArrow has been focusing on reducing its size over the last releases, and roughly from version 15 to version 18 we can see a 30-40% reduction in the size of the environment (even more when Conda is used, if you install only the pyarrow core package). Version 18.0.0 continues this trend by making NumPy an optional runtime dependency, thus avoiding the need to install NumPy at all unless you want to import/export data between NumPy and Arrow. Support for different devices (i.e. CPU vs GPU) also continues to improve, with objects failing more gracefully when performing operations not supported by the device and with helpers to copy data between different devices. A niche but interesting addition is support for providing Substrait Extended Expressions to Datasets, which lays the foundations for filtering and projecting Arrow datasets via a Substrait query.
Narwhals had around ten releases, the latest one being 1.13.1. This month most of the work seems to be aimed at consolidating what was implemented during previous months, with most changes being refinements and improvements to existing capabilities. But there is still space for some nice additions, like support for the Arrow PyCapsule protocol to import into Narwhals any dataframe entity compatible with it.
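A minimal sketch of the kind of workflow this enables (column names and values are made up; here a PyArrow table stands in for any PyCapsule-compatible dataframe):

```python
import narwhals as nw
import pyarrow as pa

# A PyArrow table, standing in for any dataframe object Narwhals can wrap.
native = pa.table({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Wrap it in a Narwhals DataFrame and use the Narwhals API,
# independently of the underlying backend.
df = nw.from_native(native)
result = df.with_columns((nw.col("price") * nw.col("qty")).alias("total"))

# Get back the native (PyArrow) object.
print(result.to_native())
```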
Polars released multiple versions, with 1.12.0 being the latest one. As usual, the various versions introduced a significant amount of improvements and changes. People working with financial data might appreciate the improved support for writing nested high-precision decimals, or the incredible speed-up of some functions, but I personally believe that the internal changes related to Flight IPC messages and the addition of an IPC sink for the streaming engine are going to open interesting use cases where Polars plays the pivot in complex data pipelines.
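To give an idea of what a streaming IPC sink looks like in practice, here is a minimal sketch (file names and columns are made up):

```python
import polars as pl

# Lazily scan a (hypothetical) large CSV file, filter it, and stream the
# result straight into an Arrow IPC file without materializing the whole
# dataframe in memory.
(
    pl.scan_csv("trades.csv")
    .filter(pl.col("price") > 100)
    .sink_ipc("filtered_trades.arrow")
)
```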
Spark released version 3.4.4, which is primarily a maintenance release made mostly of bug fixes, but there was still space for some welcome minor improvements in aggregations and group by.
GreatTables had a minor update with version 0.13.0, which adds the helpful capability of formatting column values as images, so that images can be included directly into tables.
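A minimal sketch of what this could look like, assuming the fmt_image formatter and made-up column and file names:

```python
import pandas as pd
from great_tables import GT

# A made-up dataframe where the "flag" column contains image file names.
df = pd.DataFrame({"country": ["Italy", "France"], "flag": ["it.png", "fr.png"]})

# Render the values of the "flag" column as images loaded from a local directory.
GT(df).fmt_image(columns="flag", path="flags/")
```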
PyScript had two releases this month, with the latest one being 2024.10.2. These releases bring interesting performance improvements, especially related to startup time and to the updated Pyodide core. An interesting addition for people playing around and running experiments is the new donkey feature, which makes it possible to quickly run commands in a console with minimal setup.
Dask had one release this month: 2024.10.0. The release has minor improvements and fixes, but for anyone working with Dask and Numba there is a major speed-up in cases where the two interact with each other.
Delta released version 0.21.0. This is mostly a bugfix release, but there are some notable improvements, like a significant reduction in the memory consumption of checkpoints.
CUDF released version 24.10.0 (plus the 24.10.01 bugfix release), which brings a significant set of changes. More work has happened to align the CUDF API with the behaviour of Pandas 2.0. CUDF is also moving forward in reducing its tight coupling with PyArrow, delegating to the Arrow IPC and PyCapsule specifications the role of interoperating with CUDF data instead of having dedicated support for PyArrow objects, and more of the implementation has been moved into pylibcudf. In terms of purely new features not much has happened apart, probably, from the support for named aggregations. There are also some quality-of-life improvements for developers working on Polars and CUDF integrations, with Polars now enabled by default in development environments, signaling that CUDF is investing heavily in Polars support.
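A minimal sketch of what pandas-style named aggregations look like, assuming CUDF mirrors the Pandas syntax (data is made up):

```python
import cudf

df = cudf.DataFrame({
    "store": ["a", "a", "b", "b"],
    "sales": [10, 20, 5, 15],
})

# Named aggregations: each output column gets an explicit name bound to a
# (column, aggregation) pair, instead of relying on auto-generated names.
summary = df.groupby("store").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
)
print(summary)
```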
Lance is approaching version 0.19.2, with 0.19.1 and 0.18.3 having been released recently. A lot of new features were added, and Lance is a fast-moving project. Some interesting additions include improved support for storing large binary objects and support for detached commits. GPU acceleration was also extended to some other supported operations, like product quantization.
Pantab 5.1.0 has been recently released. This is a minor release that adds the capability to choose which database version to use for Hyper files; by default pantab will use version 2 when creating them, but it's possible to pass a custom one.
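For context, this is roughly what writing a dataframe to a Hyper file with pantab looks like (data and file names are made up; the exact option for picking the database version is not shown here, check the pantab documentation for it):

```python
import pandas as pd
import pantab

df = pd.DataFrame({"animal": ["cat", "dog"], "legs": [4, 4]})

# Write the dataframe to a Tableau Hyper file.
# pantab 5.1.0 additionally lets you choose which Hyper database version
# to use when creating the file (version 2 by default).
pantab.frame_to_hyper(df, "example.hyper", table="animals")
```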