PyCon Canada 2019

PySpark: avoiding common pitfalls and keeping your sanity

by Jonathan Rioux

Machine Learning & Data Science Tools, Testing, and Practices

For a Python developer, using PySpark can often feel foreign, like driving a race car in sandals. You see the power, yet it feels like you're fighting against the machine. This talk shares battle stories from using PySpark, from development to production, and shows how my many errors can lead to better code on your end. In no particular order, I'll discuss speeding up your development, avoiding 'friendly enemies', and testing your code. You'll learn how to avoid embarrassing mistakes by watching me make them, and you'll leave a more insightful PySpark developer.

Jonathan is the data science practice lead for EPAM Canada, a global engineering consultancy. He has worked in insurance, analytics, and data science for a little over a decade. He is passionate about programming languages and how they let us express increasingly complex ideas. Jonathan is the author of Data at scale with PySpark (Manning, scheduled for 2020).
