Monthly Python Data Engineering, October 2024
Monthly news from the Python Data Engineering world.
Hi and welcome to this new issue of the newsletter,
This issue lands a bit later than usual, after October had already ended. I did my best to hold out while the rest of the family went through flu, fever, coughing, and the whole set of entertaining activities typically associated with autumn, but at the end of October I apparently got envious and ended up bedridden with fever myself. That's why you see this issue going out only now.
As promised last month, the second chapter of the How Data Platforms Work book has been published. This chapter covers compute nodes based on Key Columns, and as usual the implementation is available as part of the DataPyground project.
Key Highlight
This month Apache Arrow had an interesting new major release, with Arrow crossing the Rubicon: Pandas and NumPy are now both optional dependencies (Pandas has been optional for a while).
The life of PyArrow has always been tightly coupled with Pandas and NumPy, and originally one of the most common use cases for PyArrow was loading Parquet data into Pandas and NumPy. Nowadays Arrow is mature enough to have its own independent life, and as an interoperability format it is successful enough that it is no longer on Arrow's shoulders to implement support for all the libraries it has to interact with. Vice versa, it is up to the libraries themselves to support Arrow as a format. In this scenario it makes sense to make the NumPy and Pandas import/export capabilities something separate and optional, as the chances that a platform using Arrow doesn't involve NumPy at all are now much higher than in the past.
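As a minimal sketch of what this means in practice (the data is made up, and the point is only that the NumPy and Pandas conversions are the one place where those libraries are still needed):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Plain PyArrow works without NumPy or Pandas installed.
table = pa.table({"city": ["Turin", "Milan", "Rome"],
                  "population": [848_885, 1_371_498, 2_872_800]})
print(pc.sum(table["population"]).as_py())

# Converting to/from NumPy or Pandas still works, but now requires
# installing those libraries explicitly.
import numpy as np
arrow_arr = pa.array(np.array([1, 2, 3]))  # NumPy -> Arrow
numpy_arr = arrow_arr.to_numpy()           # Arrow -> NumPy

import pandas as pd
df = table.to_pandas()                     # Arrow -> Pandas
table_again = pa.Table.from_pandas(df)     # Pandas -> Arrow
```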
Want to suggest interesting libraries and frameworks for the newsletter?
Reply to the newsletter email at alessandromolina@substack.com
News
Arrow has seen a new major version released: version 18.0.0 introduces a ton of improvements across all language implementations, and some of them are very notable. PyArrow has been focusing on reducing its size over the last releases, and roughly from version 15 to version 18 we can see a 30-40% reduction in the size of the environment (even more when Conda is used, if you install only the pyarrow core package). Version 18.0.0 continues this trend by making NumPy an optional runtime dependency, thus avoiding the need to install NumPy at all unless you want to import/export data between NumPy and Arrow. Support for different devices (i.e. CPU vs GPU) also continues to improve, with objects failing more gracefully when performing operations not supported by the device and with helpers to copy data between different devices. A niche but interesting addition is support for providing Substrait Extended Expressions to Datasets, which lays the foundations for filtering and projecting Arrow datasets via a Substrait query.
Narwhals had around ten releases, the latest one being 1.13.1. This month most of the work seems to be aimed at consolidating what was implemented during previous months, with most changes being refinements and improvements to existing capabilities. But there is still space for some nice additions, like support for the Arrow PyCapsule protocol to import into Narwhals any dataframe entity compatible with it.
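A minimal sketch of the kind of workflow this enables (column names and values are made up; here a PyArrow table stands in for any PyCapsule-compatible dataframe):

```python
import narwhals as nw
import pyarrow as pa

# A PyArrow table, standing in for any dataframe object Narwhals can wrap.
native = pa.table({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Wrap it in a Narwhals DataFrame and use the Narwhals API,
# independently of the underlying backend.
df = nw.from_native(native)
result = df.with_columns((nw.col("price") * nw.col("qty")).alias("total"))

# Get back the native (PyArrow) object.
print(result.to_native())
```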
Polars released multiple versions, with 1.12.0 being the latest one. As usual, the various versions introduced a significant amount of improvements and changes. People working with financial data might appreciate the improved support for writing nested high-precision decimals, or the incredible speed-up of some functions, but I personally believe that the internal changes related to Flight IPC messages and the addition of an IPC sink for the streaming engine are going to open interesting use cases where Polars plays the pivot in complex data pipelines.
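To give an idea of what a streaming IPC sink looks like in practice, here is a minimal sketch (file names and columns are made up):

```python
import polars as pl

# Lazily scan a (hypothetical) large CSV file, filter it, and stream the
# result straight into an Arrow IPC file without materializing the whole
# dataframe in memory.
(
    pl.scan_csv("trades.csv")
    .filter(pl.col("price") > 100)
    .sink_ipc("filtered_trades.arrow")
)
```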
Spark released version 3.4.4, which is primarily a maintenance release made mostly of bug fixes, but there was still space for some welcome minor improvements in aggregations and group by.
GreatTables had a minor update with version 0.13.0, which adds the helpful capability of formatting column values as images, so that images can be included directly into tables.
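A minimal sketch of what this could look like, assuming the fmt_image formatter and made-up column and file names:

```python
import pandas as pd
from great_tables import GT

# A made-up dataframe where the "flag" column contains image file names.
df = pd.DataFrame({"country": ["Italy", "France"], "flag": ["it.png", "fr.png"]})

# Render the values of the "flag" column as images loaded from a local directory.
GT(df).fmt_image(columns="flag", path="flags/")
```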
PyScript had two releases this month, with the latest one being 2024.10.2. These releases bring interesting performance improvements, especially related to startup time and to the updated Pyodide core. An interesting addition for people playing around and running experiments is the new donkey feature, which makes it possible to quickly run commands in a console with minimal setup.
Dask had one release this month: 2024.10.0. The release has minor improvements and fixes, but for anyone working with Dask and Numba there is a major speed-up in cases where the two interact with each other.
Delta released version 0.21.0. This is mostly a bugfix release, but there are some notable improvements, like a significant reduction in the memory consumption of checkpoints.
CUDF released version 24.10.0 (plus the 24.10.01 bugfix release), which brings a significant set of changes. More work has happened to align the CUDF API with the behaviour of Pandas 2.0. CUDF is also moving forward in reducing its tight coupling with PyArrow, delegating to the Arrow IPC and PyCapsule specifications the role of interoperating with CUDF data instead of having dedicated support for PyArrow objects, and more of the implementation has been moved into pylibcudf. In terms of purely new features not much has happened apart, probably, from the support for named aggregations. There are also some quality-of-life improvements for developers working on Polars and CUDF integrations, with Polars now enabled by default in development environments, signaling that CUDF is investing heavily in Polars support.
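A minimal sketch of what pandas-style named aggregations look like, assuming CUDF mirrors the Pandas syntax (data is made up):

```python
import cudf

df = cudf.DataFrame({
    "store": ["a", "a", "b", "b"],
    "sales": [10, 20, 5, 15],
})

# Named aggregations: each output column gets an explicit name bound to a
# (column, aggregation) pair, instead of relying on auto-generated names.
summary = df.groupby("store").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
)
print(summary)
```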
Lance is approaching version 0.19.2, with 0.19.1 and 0.18.3 having been released recently. A lot of new features were added, and Lance is a fast-moving project. Some interesting additions include improved support for storing large binary objects and support for detached commits. GPU acceleration was also extended to some other supported operations, like product quantization.
Pantab 5.1.0 has been recently released. This is a minor release that adds the capability to choose which database version to use for Hyper files; by default pantab will use version 2 when creating them, but it's possible to pass a custom one.
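For context, this is roughly what writing a dataframe to a Hyper file with pantab looks like (data and file names are made up; the exact option for picking the database version is not shown here, check the pantab documentation for it):

```python
import pandas as pd
import pantab

df = pd.DataFrame({"animal": ["cat", "dog"], "legs": [4, 4]})

# Write the dataframe to a Tableau Hyper file.
# pantab 5.1.0 additionally lets you choose which Hyper database version
# to use when creating the file (version 2 by default).
pantab.frame_to_hyper(df, "example.hyper", table="animals")
```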