Monthly Python Data Engineering, September 2024
Monthly news from the Python Data Engineering world.
Hi and welcome to the third issue of the newsletter.
We all know how inspiring it can be to write code outside in the summer, while the wind gently caresses your skin and the sun shines bright. Summer has always been the time of year when I am most prolific on Open Source projects.
This year has been no different, and I decided to start a new initiative attached to this newsletter: a free book that explains how Data Platforms work internally, written explicitly for Python developers, both junior and more senior ones. The resulting book, How Data Platforms Work, will be published together with this newsletter, one chapter each month.
As many people prefer to learn by doing, I also decided to pair the book with the DataPyground project, a collaborative learning experience that builds a data platform in pure Python and without dependencies (apart from pyarrow as the in-memory data format), for whoever wants to play around with creating their own Data Platform. Anyone can contribute new pieces to DataPyground; the only constraint is that readability and documentation must be the main goal of whatever is implemented. The project is inspired by the literate programming concept, and ease of understanding of the algorithms matters more than performance (even though some compute engine operations already perform better than their pandas equivalents).
Key Highlight
This month Ibis removed its pandas backend. This is a major change, as Ibis has historically been tied to pandas. But pandas was also the reason why Ibis was heavy to install, and its eager execution model sometimes slowed Ibis down. The new default backend for dataframe operations is DuckDB, which offers faster execution and lighter dependencies.
Interestingly, Narwhals, which like Ibis offers a standardized interface to diverse compute engines, is interoperable with Ibis, so it’s possible to combine the two libraries when necessary.
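To make the difference between eager and deferred execution a bit more concrete, here is a toy sketch in plain Python. It is not how pandas, Ibis or DuckDB are implemented, and the `eager_pipeline` and `deferred_pipeline` names are made up for illustration. The eager version materializes a full intermediate list at every step, while the deferred version only chains the steps and does the work in a single pass when a result is requested.

```python
from itertools import islice

# Hypothetical sample data, only for illustration.
rows = [{"id": i, "amount": i * 10} for i in range(1_000)]


def eager_pipeline(data, limit):
    """Each step builds a complete intermediate list before the next one starts."""
    filtered = [r for r in data if r["amount"] % 20 == 0]  # full pass, full copy
    projected = [r["id"] for r in filtered]                 # another full pass
    return projected[:limit]                                # most of the work is thrown away


def deferred_pipeline(data, limit):
    """Steps are chained lazily; rows are pulled only until `limit` is reached."""
    filtered = (r for r in data if r["amount"] % 20 == 0)
    projected = (r["id"] for r in filtered)
    return list(islice(projected, limit))


print(eager_pipeline(rows, 3))
print(deferred_pipeline(rows, 3))
```

Both functions return the same rows; the deferred one simply stops reading as soon as it has enough of them, which is the kind of saving a query engine can exploit when it sees the whole pipeline before running it.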
Want to suggest interesting libraries and frameworks for the newsletter? Reply to the newsletter email at alessandromolina@substack.com
News
GreatTables, one of the most convenient libraries for displaying tabular data for Python users, has released versions 0.11 and 0.11.1. These are mostly bugfix releases: the formatting of HTML tables has been slightly changed, and the save method now supports saving tables with more recent Chrome versions.
Shiny for Python has released version 1.1. Most of the changes relate to ui.Chat, the conversational interface, which makes a lot of sense in these days of GPT and AI agents. Another interesting feature is the ability to easily use templates from GitHub repositories.
Ibis has released versions 9.4.0 and 9.5.0. These releases contain mostly minor improvements but introduce some interesting capabilities, like deferred literal definitions. Some people might also see a blocker resolved: it was previously not possible to join tables across different databases of the same backend.
Narwhals has been very prolific as usual, with 10 releases during the past month, the latest being version 1.8.2. Due to the frequent release cycle, individual releases are usually fairly small, but across the last 10 releases there are some notable improvements, like better support for interchanging data with Ibis and DuckDB via the Dataframe interchange protocol and the ability to take a sample of the data in a Dataframe. Also, when using CUDF it is now possible to export data in Apache Arrow format.
Polars released versions 1.7, 1.8 and 1.8.1. These versions include improvements to projection pushdown when reading Parquet data and to join support, notably the addition of the IEJoin algorithm for non-equi joins. Coming from version 1.6, we also have support for Altair as the plotting library in Dataframe.plot.
Pandas released version 2.2.3. This is mostly a bugfix release in the 2.2 series and not much has changed.
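For readers who have not met the non-equi joins mentioned in the Polars item: they match rows on an inequality rather than on equal keys. The sketch below is plain Python, not Polars code and not the IEJoin algorithm itself. It only shows what such a join computes, first with a nested loop and then with a sort plus binary search for a single `<` predicate. The function names and the sample data are hypothetical.

```python
from bisect import bisect_right


def naive_less_than_join(left, right):
    """Return all (l, r) pairs with l < r by comparing every pair: O(n * m)."""
    return [(l, r) for l in left for r in right if l < r]


def sorted_less_than_join(left, right):
    """Same result set, but sort `right` once and binary-search the first match."""
    right_sorted = sorted(right)
    pairs = []
    for l in left:
        start = bisect_right(right_sorted, l)  # first index whose value is > l
        pairs.extend((l, r) for r in right_sorted[start:])
    return pairs


# Hypothetical sample values, only for illustration.
bids = [5, 12, 9]
asks = [10, 4, 15]

print(sorted(naive_less_than_join(bids, asks)))
print(sorted(sorted_less_than_join(bids, asks)))
```

Sorting lets the second version skip the rows that cannot match. Algorithms such as IEJoin build on the same sort-based idea to handle joins with a pair of inequality predicates.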
Holoviz Panel released version 1.5.0. This version adds a major feature, the panel.custom utility, which makes it much easier to create custom components based on React as part of dashboards.
Plotly Dash had two releases in the 2.18 series. These releases add support for Typescript 5.5 and improve support for callbacks, especially in the error handling area.
Dask released versions 2024.8 and 2024.9. These versions don’t add any major new features but consolidate the implementation of existing ones and port Dask to more recent NumPy and PyArrow versions. Notable changes are in the area of chunking, which might lead to performance improvements in some distributed execution contexts.
Delta released versions 0.19 and 0.20. These versions add some interesting improvements, like support for Arrow ExtensionTypes. Updates to the log stores also now support conditional PUT, thus reducing unnecessary rewrites of unchanged data.
CUDF released version 24.08.03, which has a long list of changes. Multiple APIs have been aligned to the signature and behavior they have in Pandas 2.x. Arrow support for I/O has been deprecated, and CUDF is preparing for a future where Arrow is removed. The Parquet reader can now be used in chunked mode. Prefetching can be experimentally configured. Pandas datetime data can now be timezone aware. And last but not least, there is a long list of stability improvements and additions to cudf-polars.
Lance is getting ready for version 0.18.1, with a few betas being released. These versions improve support for GPU accelerated operations and add CreateIndex as a supported commit type. Version 0.17 was also released during the last month, adding many new features, including pushdown of offset and limit and phrase queries in full text search, together with various performance improvements.
LanceDB is also in a beta flow, getting ready for version 0.14. The most interesting additions relate to connecting to remote databases, with many APIs now supporting that. The release also builds on the previously released version 0.13, which added a few improvements, like the migration of full text search support from Tantivy to Lance-Index and a new API for manual hybrid queries (combining vector similarity searches with scalar filtering). Full Text Search support was also exposed on remote tables and async tables.
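As a rough picture of what combining vector similarity with scalar filtering means, here is a dependency-free sketch. It does not use LanceDB and says nothing about its API; the `filtered_search` helper, the record layout and the sample vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(records, query, predicate, k):
    """Keep rows that pass the scalar filter, then rank them by similarity."""
    candidates = [r for r in records if predicate(r)]
    candidates.sort(key=lambda r: cosine_similarity(r["vector"], query), reverse=True)
    return [r["id"] for r in candidates[:k]]


# Hypothetical records: an id, a scalar column and a tiny embedding.
records = [
    {"id": "a", "year": 2023, "vector": [1.0, 0.0]},
    {"id": "b", "year": 2024, "vector": [0.9, 0.1]},
    {"id": "c", "year": 2024, "vector": [0.0, 1.0]},
    {"id": "d", "year": 2024, "vector": [0.7, 0.7]},
]

print(filtered_search(records, query=[1.0, 0.0], predicate=lambda r: r["year"] == 2024, k=2))
```

Record "a" is the closest vector to the query but is excluded by the scalar filter, so the two best remaining matches are returned. A real engine would use an index instead of scanning and sorting every candidate.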