Monthly Python Data Engineering, August 2024

Monthly news from the Python Data Engineering world.

Aug 19, 2024

Hi,

I’m Alessandro, the author of Crafting Test-Driven Software with Python and Modern Python Standard Library Cookbook, and a long term Open Source Python developer working on frameworks and libraries for data intensive applications and this is the second issue of the Monthly Python Data Engineering newsletter.

Key Highlight

Datafusion Comet had its first release. There are many projects ongoing to accelerate Spark: From Intel sponsored Gluten, to RAPIDS Spark accelerator, to other projects that take a different approach by implementing spark compatibles APIs on top of other execution engines. Out of those, Comet is probably the one that has the chance to provide the best balance between performance improvements and ease of use. While it does not yet accelerate all parts of a Spark execution plan, it accelerates the most impactful steps and it’s fairly easy to setup. Once the team will start packaging binary releases it’s reasonable to expect it will become much more adopted.

News

Datafusion released version 40.0.0, the release is surely major, with a speed improvement of 2x over the TPC-H benchmark. This version also adds the ability to expand the SQL parser to add custom nodes and the ability to provide a custom Parquet access plan, this is very helpful when reading remote files as it allows to skip accessing those files that don’t match custom predicates.
Arrow has released version 17.0.0 in late July, this version introduces some important improvements. First of all ArrowDeviceArrayStream can now be imported and exported and is available in Python too, thus making easier to support interoperability with libraries that store data in GPU memory or other devices memory. Builtin support for OpenTelemetry was added (but manually compiling Arrow is required) and to improved interoperability with Machine Learning and “AI” libraries it’s now available an optimized conversion path from RecordBatches (tabular data) to multidimensional Tensors/Matrices, both row-major and column-major.
Narwhals is moving fast as usual and it has multiple releases. The releases mostly had a focus on adding support for Dask distributed dataframes. But there are other interesting additions like the support for the Arrow-PyCapsule protocol to import and export data.
Polars released version 1.5.0. The last set of releases has some interesting improvements for anyone working with dates and a few performance improvements when working with Parquet, like using memory mapping when possible. Like Narwhals, Polars too added full support for the Arrow-PyCapsule protocol to exchange data. This makes possible to use Polars with any library compatible with Arrow PyCapsule even if the library doesn’t support Polars itself.
Ibis released version 9.3.0, it’s primarily a bugfix release, but it has some interesting additions like the support for order_by in aggregations that are influenced by ordering like first or last.
Panel has released version 1.4.5. This release is mostly a bugfix release, with some performance improvements. Panel is one of the most interesting dashboards solution to me due to its moltitude of deploy options, including export to static HTML which makes very easy to share dashboards.
PyScript introduced a breaking change in 2024.8.2 by switching to asynchronous api by default. Code relying on a synchronous behavior will need to pass async=False explicit.y
Cython released version 3.0.11, as most Cython releases this is primarily a bugfix release. The most notable thing seems to be that the Cython team did some fixes to enable compilation on Python 3.13
Dask released version 2024.8.0 being the most recent one. The release has some efficency improvements, including a performance boost to the slice operation.
Deltalake version 0.19 has some major changes and improvements. The default writer engine is now the rust one, and the PyArrow is deprecated and will be removed in the future. Support for CDF in write, delete and merge operations was also added. CDF is practically a journal of changes to a table.
CUDF released version 24.08.02, the 24.08 series of releases has a lot of changes and new features. The team has been aligning to the Pandas 2.0 APIs, and in general the compatibility of Dataframe operations to pandas syntax is improving quickly. CUDF implementation is also being rationalized by moving pieces to pylibcudf, which is a Cython wrapper to libcudf on top of which CUDF is being built. cudf-polars is also being introduced with improvements, with features like support for CSV files being added. Support for Arrow in I/O was also deprecated.
Avro has released version 1.12.0, this release is mostly a bugfix and reliability improvements release. But there are some interesting additions like the support for compiling the Rust implementation for WASM.
Lance has seen a few releases this past month, with the latest one being the beta5 for version 0.17. The recent releases saw some nice improvements, like the coalescing of multiple reads to speed up random access, support for loading huggingface image datasets, and the continued effort of moving forward with V2 format.
LanceDB is approaching version 0.13, which is currently in beta stage. Full Text Search capabilities have been migrated from Tantivy to Lance-Index and HuggingFace compatible Transformers have been added. Also embedding functions are now supported on remote tables.