Monthly Python Data Engineering, June 2025

Monthly news from the Python Data Engineering world.

Jul 09, 2025

Hi and welcome to this new issue of the newsletter!
As usual, I want to apologize to all of you for being so unreliable about the publishing date 😅
I publish this newsletter alone, in the spare time between work and family. Since that time is a bit unpredictable, it’s time to admit that I can’t commit to a specific day of the month.
My commitment to publishing a new issue every month remains unchanged, but expect issues to go out within the first 10 days of the month.
Thanks for your continued interest in and support for the initiative!

Want to know more about me and why I curate this newsletter?
Check out my personal website at https://alessandro.molina.fyi/

Want to suggest interesting libraries and frameworks for the newsletter?
Reply to the newsletter email at alessandromolina@substack.com

Key Highlight

This month felt like a big step forward for speed and flexibility in the Python data world. Polars finally ditched its old streaming engine, so streaming group-by operations are much faster now, and there’s native support for Iceberg positional deletes plus some nice fixes for tricky data types. Lance and LanceDB kept pushing on vector search and full-text search (FTS): new index types (IVF_HNSW_FLAT, IVF_SQ), an N-gram tokenizer for FTS, and much quicker upserts. Also, the Lance Dataset config API got a revamp, so it’s easier to tune things for your use case. On the Spark side, DataFusion Comet 0.9.0 really impressed me: you get native support for complex types in Parquet scans, much better shuffle acceleration, and built-in tracing to help debug performance issues. All in all, it’s a good month if you like your data infra fast and easy to tweak!

News

Apache DataFusion Comet released version 0.9.0, marking a significant milestone with 139 commits from 24 contributors over ~10 weeks. This Spark accelerator now supports complex types (structs, maps, arrays) in Parquet scans and delivers improved shuffle acceleration with broader native execution coverage. The release achieves ~97% test suite coverage across multiple modules with 24,000+ Spark SQL tests passing. Notable additions include support for new expressions like ArrayUnion, BitCount, MapValues, and ToPrettyString, plus built-in tracing capabilities to analyze performance and memory usage patterns.
Ibis shipped version 10.6.0 with substantial improvements across multiple backends. The release adds support for specific UUID value creation, BigQuery job configuration enhancements including job_id_prefix functionality, and significant DataFusion backend improvements with BitwiseNot, Clip, Greatest, Least operations, and support for anys/alls aggregations. Performance optimizations include improved ArrayIndex SQL generation for DuckDB and better rich rendering capabilities for data exploration workflows.
Narwhals had an active month with seven releases from v1.42.0 through v1.46.0. Key highlights include the addition of Expr.dt.offset_by() for temporal operations, support for quantile and ewm_mean in window contexts, and lazy conversion to Ibis via DataFrame.lazy("ibis"). Performance improvements include caching and reuse of Implementation._backend_version, while new string operations like zfill and date conversion with str.to_date expand data manipulation capabilities. The releases also simplified exceptions by consolidating error types into a consistent InvalidOperationError.
Polars released multiple versions, including Rust Polars 0.49.0 and 0.49.1 and Python Polars 1.31.0. The 0.49.0 release introduced a breaking change by removing the old streaming engine and added significant performance improvements for streaming group-by operations. New features include a native implementation of Iceberg positional deletes. The Python releases add support for DataType expressions and improve error handling for truncate operations that mix temporal units.
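If you have not met Iceberg positional deletes before: a position delete file lists pairs of data file path and 0-based row position, and readers drop those rows when scanning the data file. Here is a minimal sketch of that idea in plain Python. It is only a conceptual illustration, not how Polars implements it, and the names apply_positional_deletes, data_files, and deletes are hypothetical.

```python
from collections import defaultdict


def apply_positional_deletes(data_files, deletes):
    """Return the surviving rows per data file after applying positional deletes.

    data_files: dict mapping a data file path to its list of rows.
    deletes: iterable of (file_path, row_position) pairs, mirroring the
             file_path/pos columns of an Iceberg position delete file.
    """
    # Group deleted positions by the data file they refer to.
    deleted = defaultdict(set)
    for path, pos in deletes:
        deleted[path].add(pos)

    # Keep only rows whose position is not marked as deleted.
    return {
        path: [row for pos, row in enumerate(rows) if pos not in deleted[path]]
        for path, rows in data_files.items()
    }


# Hypothetical in-memory stand-ins for Parquet data files.
data_files = {
    "data/a.parquet": ["r0", "r1", "r2", "r3"],
    "data/b.parquet": ["s0", "s1"],
}
deletes = [("data/a.parquet", 1), ("data/a.parquet", 3), ("data/b.parquet", 0)]

surviving = apply_positional_deletes(data_files, deletes)
print(surviving)
```

The data files themselves are never rewritten. The deletes are applied at read time, which is why native reader support matters for performance.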
Pandas released version 2.3.1, a maintenance release focusing on improvements and fixes to the future string data type, which is a preview feature for the upcoming pandas 3.0. This release includes compatibility updates and stability improvements that data engineers should consider when planning migrations to pandas 3.0.
Panel shipped version 1.7.2 with important optimizations for React and ESM-based components. The release adds support for passing bytes and BytesIO objects to Audio and Video panes, header tooltips for Tabulator tables, and optimized layout calculations for ESM components. Infrastructure improvements include automatic Comm unblocking on WebSocket re-connect and better support for nested ReactComponents with Shadow DOM bypass capabilities.
HoloViews released version 1.21.0 with several data visualization enhancements. The release adds sample information on hover for rasterized/datashaded plots, introduces dendrogram plotting capabilities, and adds logarithmic support for Histogram operations. This version also deprecates several features including the streamz interface, autoloading RC file, and IPython magic, with removal planned for version 1.23.0. The minimum Python version requirement has been bumped to 3.10.
PyScript released versions 2025.7.2 and 2025.7.1. Version 2025.7.2 updated to MicroPython 1.26.0-preview-293, fixing a regression from 2025.7.1. The 2025.7.1 release introduced the new PyScript Bridge helper for importing Python modules from JavaScript, added a packages_cache = "passthrough" option for faster Pyodide bootstrap, and updated orchestration to bring back service-worker capabilities without requiring additional dependencies.
Cython released version 3.1.2, a maintenance release focusing on bug fixes and stability improvements for the Python-to-C compiler that's essential for high-performance data processing extensions.
Dash released versions 3.1.1 and 3.1.0. Version 3.1.0 introduced significant new features including support for async callbacks and page layouts (install with pip install dash[async]), the ability to pass allow_optional to Input and State components, and improved WSGI compliance for better deployment compatibility. Performance improvements include enhanced flatten_grouping operations and an 80% speedup in function operations, while bug fixes address query string parsing regressions and persistence storage issues.
Lance had extensive development activity with ten releases from v0.29.1-beta.2 through v0.31.1-beta.3. The v0.31.0 release included breaking changes to the Dataset configuration API and introduced IVF_HNSW_FLAT and IVF_SQ index implementations, an N-gram tokenizer for FTS, and support for large_string/large_binary in Lance format v2.1. Performance improvements include fast upsert operations with no indices and optimized k-means algorithms. The v0.30.0 release moved file metadata cache to bytes capacity and added auto-remap indexes functionality.
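To give an idea of why an N-gram tokenizer helps full-text search, here is a small plain-Python sketch of character n-gram indexing, which makes substring-style matches possible. It does not use the Lance API: the function names, the default n=3, and the lowercasing are assumptions made for illustration, and Lance's actual tokenizer settings may differ.

```python
def char_ngrams(text, n=3):
    """Split text into lowercase character n-grams, word by word."""
    tokens = []
    for word in text.lower().split():
        if len(word) < n:
            continue  # words shorter than n produce no n-grams in this sketch
        tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


def build_index(docs, n=3):
    """Map each n-gram to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query, n=3):
    """Return ids of documents containing every n-gram of the query."""
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))


docs = {1: "Lance columnar format", 2: "vector search engine", 3: "balance sheet"}
index = build_index(docs)
print(search(index, "lanc"))
```

The query "lanc" finds both "Lance" and "balance", something a word-level tokenizer would miss. In this sketch a hit only means that every n-gram of the query is present somewhere in the document, so a real engine would typically still verify or rank the candidates.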
LanceDB released ten versions spanning both Node/Rust and Python implementations, from Python LanceDB v0.23.1-beta.0 to Python LanceDB v0.24.1-beta.0. The v0.21.0/v0.24.0 releases switched the default FTS to native Lance FTS, adding support for prefix matching, must_not clauses, and various FTS features. Performance improvements include batched Ollama embed calls and better error handling for object store URI parsing.
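For a feel of what prefix matching and must_not clauses mean in a full-text query, here is a plain-Python sketch of the semantics. It is not the LanceDB query API: the matches function, its parameters, and the whitespace tokenization are hypothetical and chosen only for illustration.

```python
def matches(tokens, must=(), must_not=(), prefix=None):
    """Check one tokenized document against a simple boolean query."""
    token_set = set(tokens)
    # Every "must" term has to be present.
    if any(term not in token_set for term in must):
        return False
    # No "must_not" term may be present.
    if any(term in token_set for term in must_not):
        return False
    # If a prefix is given, at least one token has to start with it.
    if prefix is not None and not any(t.startswith(prefix) for t in tokens):
        return False
    return True


docs = {
    "a": "fast vector search in python",
    "b": "fast full text search in rust",
    "c": "slow vector scan",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

hits = sorted(
    doc_id
    for doc_id, tokens in tokenized.items()
    if matches(tokens, must=["search"], must_not=["rust"], prefix="vec")
)
print(hits)
```

Only document "a" survives: "b" is excluded by the must_not term and "c" lacks the required term "search".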
DuckDB released versions 1.3.2 and 1.3.1 as bug fix releases for the 1.3.0 "Ossivalis" release. Version 1.3.2 includes fixes for dynamic filter handling, CSV fuzzer issues, and improved join order computation. Both releases maintain backward compatibility with database files from v0.9 and include various stability improvements for data engineering workloads.
DataFusion Table Providers released versions 0.6.1 and 0.6.0. Version 0.6.1 added a read-only table provider for ClickHouse, while version 0.6.0 updated DataFusion to version 48 and datafusion-federation to 0.4.3, providing expanded connectivity options for DataFusion-based applications.
Pandera released version 0.25.0 with a major highlight: full support for Ibis table validation. This enables in-database validation across all Ibis backends including PostgreSQL, Snowflake, BigQuery, MySQL, and more, allowing data engineers to validate data at scale before fetching for downstream processing. The release also adds Polars pydantic integration with native JSON schema generation and various bug fixes including proper handling of the PANDERA_VALIDATION_ENABLED=False environment variable.