In this session we cover ways to optimize PySpark code. We look at common causes of slowness, such as uneven partitions and skewed joins, and how to address them with repartitioning, coalescing, and broadcast joins. I also explain how to cache commonly used data sets in memory or on disk. Finally, I show the monitoring interface where you can track memory and CPU usage to make sure your cluster is optimally sized.
Lastly, I show how to use multiple languages, including SQL and R, inside a single Databricks notebook.
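In Databricks, each cell can switch languages with a magic command on its first line. A rough sketch of what this looks like (the `sales` table is a hypothetical example, not from the session):

```
# Cell 1 -- notebook's default language (Python)
df = spark.table("sales")

# Cell 2 -- switch to SQL with the %sql magic
%sql
SELECT region, SUM(amount) AS total FROM sales GROUP BY region

# Cell 3 -- switch to R with the %r magic
%r
library(SparkR)
head(sql("SELECT * FROM sales LIMIT 5"))
```

All cells share the same cluster, and tables registered in one language are visible from the others.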
To gain access to code, data, and course materials visit https://kelseyemnett.com/2021/05/30/o....
Video by Data Analysis Lab, published 30 May 2021.