In this session we cover ways to optimize PySpark code. This includes descriptions of situations where slowness may occur, such as uneven partitions and skewed joins. To combat these issues, I explain repartitioning, coalescing, and broadcast joins, and show how to cache commonly used data sets in memory or on disk. Finally, I walk through the monitoring interface where you can check memory and CPU usage to make sure you are using the optimal cluster size.
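As a rough sketch of what these techniques look like in practice (the DataFrame names, file paths, partition counts, and join key below are hypothetical placeholders, not taken from the session):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

sales_df = spark.read.parquet("/data/sales")    # hypothetical large table
lookup_df = spark.read.parquet("/data/lookup")  # hypothetical small table

# Repartition: full shuffle that redistributes rows evenly,
# useful when a few partitions are much larger than the rest.
evened = sales_df.repartition(200)

# Coalesce: reduce the partition count without a full shuffle,
# e.g. to avoid writing many tiny output files.
compacted = evened.coalesce(20)

# Broadcast join: ship the small table to every executor so the
# large (possibly skewed) table is never shuffled on the join key.
joined = evened.join(broadcast(lookup_df), on="id", how="left")

# Cache a commonly reused result in memory, spilling to disk if it
# does not fit; count() forces the cache to materialize.
joined.persist(StorageLevel.MEMORY_AND_DISK)
joined.count()
```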
I also show how to use multiple languages within a single Databricks notebook, including SQL and R code.
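In a Databricks notebook, each cell can switch languages with a magic command such as %sql or %r. A minimal sketch of the pattern is below; the view name is a hypothetical placeholder, and the SQL and R cells are shown as comments since they would live in separate notebook cells:

```python
# Cell 1 (Python): register a DataFrame as a temporary view so that
# cells in other languages can query the same data.
df = spark.range(100)
df.createOrReplaceTempView("my_table")

# Cell 2 (SQL): start the cell with the %sql magic command.
# %sql
# SELECT COUNT(*) FROM my_table

# Cell 3 (R): start the cell with the %r magic command.
# %r
# library(SparkR)
# head(sql("SELECT * FROM my_table"))
```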
To gain access to the code, data, and course materials, visit https://kelseyemnett.com/2021/05/30/o....