Monthly Python Data Engineering, April 2025

Monthly news from the Python Data Engineering world.

May 08, 2025

Hi and welcome to this new issue of the newsletter. This month I wanted to have the chance to announce a little project I have been working on during the past few months, so the newsletter ended up going out a little later than usual.
But I’m happy to say that I was finally able to publish Orbital on PyPI.
Orbital lets you run trained scikit-learn pipelines entirely in SQL. No Python runtime, no dependencies. It is a good fit for in-database ML in production or regulated environments, and it builds on some of the projects that are frequently announced in this newsletter.
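
To give a feel for what "a model running entirely in SQL" means, here is a minimal sketch of the general idea using only the Python standard library. It is not Orbital's API or implementation: the coefficients, the table name `flowers`, and the helper names `linear_model_to_sql` and `predict_in_database` are hypothetical and chosen purely for illustration. The point is that once a model is fitted, its parameters can be rendered as a plain SQL expression that the database evaluates on its own.

```python
import sqlite3

# Hypothetical parameters of an already-fitted linear regression.
# Illustrative values only, not produced by Orbital or scikit-learn.
COEFFICIENTS = {"sepal_length": 0.5, "sepal_width": -0.25}
INTERCEPT = 1.0


def linear_model_to_sql(coefficients, intercept, table):
    """Render a fitted linear model as a plain SELECT statement."""
    terms = [f"({weight!r} * {column})" for column, weight in coefficients.items()]
    expression = " + ".join([repr(intercept), *terms])
    return f"SELECT {expression} AS prediction FROM {table}"


def predict_in_database(rows):
    """Load rows into an in-memory SQLite table and score them with SQL only."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE flowers (sepal_length REAL, sepal_width REAL)")
    connection.executemany("INSERT INTO flowers VALUES (?, ?)", rows)
    query = linear_model_to_sql(COEFFICIENTS, INTERCEPT, "flowers")
    predictions = [row[0] for row in connection.execute(query)]
    connection.close()
    return predictions


if __name__ == "__main__":
    print(linear_model_to_sql(COEFFICIENTS, INTERCEPT, "flowers"))
    print(predict_in_database([(2.0, 4.0), (4.0, 2.0)]))
```

Python is only used here to build the query and load sample data. The prediction itself is computed by the database engine, which is what makes this style of deployment attractive where shipping a Python runtime is not an option.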

Want to suggest interesting libraries and frameworks for the newsletter?
Reply to the newsletter email at alessandromolina@substack.com

Want to know more about me and why I curate this newsletter?
Check out my personal website at https://alessandro.molina.fyi/

Key Highlight

This month, Polars, DataFusion, and Lance were particularly active with updates worth highlighting. Polars kept improving its streaming engine, with faster joins, better group-by performance, and more efficient rolling aggregations. It also added support for PyTorch tensors and extended its SQL capabilities. DataFusion introduced user-defined window functions and released a new Rust-based TPC-H data generator that's significantly faster than existing tools. Lance focused on I/O performance and observability, improving how it handles small files and adding traceability to execution. Overall, a strong month for projects focused on performance and efficiency.

News

Apache Arrow released version 20.0.0, a major milestone. This version stabilizes the C++ dataset and Substrait consumers, adds support for Arrow Flight SQL prepared statements in Java, and significantly improves the Python bindings and streaming I/O. It also drops support for C++11 and requires C++17 now, so prepare your build environments accordingly. Most changes in the RCs were related to test coverage and build fixes.

PyShiny version 1.4.0 adds full support for application bookmarking — a long-awaited feature that lets users persist and restore app state. This is especially useful for reproducible dashboards. It also improves interactivity in ui.Chat() and MarkdownStream(), deprecates old stream result methods, and improves layout rendering logic in Shiny Express apps.
Ibis version 10.5.0 brings new backend improvements and API polish. It now supports JSON via variant in the Databricks backend, basic map operations for DataFusion and RisingWave, and has switched Postgres map handling from HSTORE to JSONB. There are also updates to temporal casting and new support for user-defined scalar subqueries.
Narwhals had a busy month with seven releases from 1.34.0 through 1.38.0. Highlights include Spark Connect support, enum datatype improvements, and better group-by and fill-null handling across DuckDB, Pandas, and Spark backends. On the performance side, caching improvements and import optimizations make Narwhals leaner and faster. Several typing and compatibility fixes ensure smooth integration with pyarrow, cudf, and other DataFrame ecosystems.
Polars had multiple updates, with Rust 0.47.1 and Python versions 1.27.0 through 1.29.0. These include massive streaming engine performance boosts, like fast paths for inner joins, bitmap-based semi-joins, optimized rolling aggregations, and cache-aware group-by operations. A new rolling_kurtosis function landed, and new SQL string functions (SPLIT_PART, STRING_TO_ARRAY) were added. There’s even native support for PyTorch tensors.
Datashader version 0.18.0 drops Python 3.9 support and introduces a basic JIT implementation for simulation rendering, useful for scientific visualization pipelines.
Panel version 1.6.3 improves performance when rendering nested layouts and ESM-based components, which is great news for app developers dealing with dynamic content. There are also bug fixes for Tabulator headers, Modal events, and Markdown layout quirks.
HvPlot version 0.11.3 focused on improving documentation with a full API reference and tutorials based on the Diátaxis framework. It also patched some tooltip and OHLC axis formatting bugs and started migrating developer tooling to Pixi.
Cython released 3.1.0rc1 and rc2 as it prepares for a stable release. These candidates mostly include internal bug fixes and are part of the ongoing transition toward Python 3.12 compatibility.
Dash shipped two patch releases, 3.0.3 and 3.0.4, focused on stability. Fixes target props hashing issues, graph resizing glitches, and long callback cancellation logic.
Dask made two releases (2025.4.0 and 2025.4.1) that improve performance and correctness for graph construction and rolling window ops. These updates also ensure smoother integration with xarray and da.from_delayed, which benefits users working with large arrays or lazy computation pipelines.
cuDF version 25.04.00 comes with breaking changes, such as the removal of deprecated APIs and the enforcement of earlier deprecations, and adds full support for cudf-polars interoperability. Many fixes improve stability across joins, rolling operations, and Arrow interoperability. Developers working with GPU-accelerated DataFrames should test thoroughly before upgrading.
Lance had 16 versions released (0.25.3-beta.4 to 0.27.0-beta.5), with major performance upgrades like reading tiny files in a single IOP and reduced retry thrashing. New features include blob metadata configuration and tracing spans in execution nodes, which are useful in debugging and profiling large-scale vector databases.
LanceDB followed suit with 22 releases (0.19.0-beta.6 to 0.22.1-beta.3), adding table stats APIs, tag-based versioning, ColPali embedding support, and integration with the latest Lance releases. LanceDB is clearly moving fast as it tries to establish itself as a strong contender for vector search infrastructure.
Pantab version 5.2.1 fixed a nasty crash that could occur when reading Hyper files, which is a big win for data engineers using Tableau pipelines or managing Hyper extracts.
DuckDB version 1.2.2 is a large bugfix release. It resolves deadlocks with from_parquet, CSV edge cases, and filter pushdown bugs in Arrow and SQL. Developers using DuckDB in Python pipelines will appreciate the fix for PyArrow NaN filtering and the added safety in schema normalization logic.
LazyCSV released version 1.1.7, a minor patch update. This project is worth watching if you're ingesting CSV data into pandas-like environments and want faster initial reads.
Apache DataFusion published three blog posts this month:
- Comet 0.8.0 introduces an accelerator for Spark that uses DataFusion as a backend — expect performance boosts for large queries.
- User-defined Window Functions now allow implementing custom logic over partitions, which opens doors for analytical use cases.
- TPC-H Generator brings the fastest open-source TPC-H data generation tool (tpchgen-rs) — written in Rust, it's up to 14x faster than DuckDB’s.
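
For readers less familiar with the user-defined window functions mentioned above, here is a minimal pure-Python sketch of the semantics involved. It does not use DataFusion or its API. The helper `apply_window`, the example function `running_max_amount`, and the column names are all hypothetical. Rows are split into partitions, ordered within each partition, and a custom function is evaluated once per row with access to the whole ordered partition.

```python
from collections import defaultdict


def apply_window(rows, partition_key, order_key, window_fn):
    """Evaluate window_fn for every row over its ordered partition.

    window_fn receives the ordered partition and the index of the current
    row, and returns the value of the new window column for that row.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    results = []
    for partition in partitions.values():
        ordered = sorted(partition, key=lambda row: row[order_key])
        for index, row in enumerate(ordered):
            results.append({**row, "window_value": window_fn(ordered, index)})
    return results


def running_max_amount(ordered_partition, index):
    """Custom logic: largest 'amount' seen so far within the partition."""
    return max(row["amount"] for row in ordered_partition[: index + 1])


if __name__ == "__main__":
    payments = [
        {"user": "a", "day": 1, "amount": 5},
        {"user": "a", "day": 2, "amount": 3},
        {"user": "b", "day": 1, "amount": 7},
    ]
    for row in apply_window(payments, "user", "day", running_max_amount):
        print(row)
```

A real engine evaluates this kind of logic over columnar batches rather than Python dictionaries, but the contract is the same: the custom function sees one partition at a time and produces one output value per input row.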