A parallelized collection in Spark represents a distributed dataset of items that can be operated on in parallel across different nodes in the Spark cluster.
The example in the video shows how to use the SparkContext object to create a parallelized collection from a list of words.
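As an illustration of this step, here is a minimal PySpark sketch; the word list and the local SparkContext setup are assumptions for the example, not the exact code from the video:

from pyspark import SparkContext

# Create a local SparkContext (assumed setup; adjust master/app name as needed).
sc = SparkContext("local", "Lesson1-ParallelizedCollection")

# An illustrative list of words to distribute.
words = ["spark", "is", "a", "fast", "and", "general", "cluster", "computing", "engine"]

# parallelize() turns the local list into an RDD, a parallelized collection.
words_rdd = sc.parallelize(words)

print(words_rdd.collect())  # collect() brings the distributed data back to the driver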
Once the RDD (Resilient Distributed Dataset) has been created, you can interact with it through the various transformations and actions available in the Spark API.
The example in the video shows how to create a new RDD from the primary RDD by excluding words shorter than three characters. This is done with Spark's filter transformation and a Python lambda function.
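A short sketch of that filtering step, reusing words_rdd from the snippet above (the exact predicate in the video may differ; here words with fewer than three characters are dropped):

# filter() is a transformation: it returns a new RDD and leaves words_rdd unchanged.
long_words_rdd = words_rdd.filter(lambda word: len(word) >= 3)

print(long_words_rdd.collect())
# e.g. ['spark', 'fast', 'and', 'general', 'cluster', 'computing', 'engine']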
RDD datasets can be operated on in parallel. An important parameter when creating a parallelized collection is the number of partitions to split the dataset into. Spark executes one task for each partition; a common approach is to use two to four partitions per CPU in the cluster, although Spark attempts to set this number automatically.
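To make the partitioning point concrete, a small sketch (the dataset and the partition count of four are illustrative assumptions, continuing with the same SparkContext):

# Explicitly request 4 partitions when creating the RDD; Spark will run
# one task per partition when this dataset is processed.
numbers_rdd = sc.parallelize(range(1000), 4)

print(numbers_rdd.getNumPartitions())  # 4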
Environment used:
Python 3.5.3
Enthought Canopy
Spark 2.3.2
Prepared by Vytautas Bielinskas.