Coding a Paper - Ep. 2: Processing data to keep GPUs busy

Published: 01 February 2024
Channel: ChrisMcCormickAI
Views: 1,151
Likes: 48

It’s time to start coding our paper! Luckily we’ll tackle the most exciting part first, which is…processing our dataset? Although it’s not the most compelling part of building a model, selecting a dataset and setting up a data processing pipeline are necessary steps that enable us to quickly run and test a model.

Why build a pipeline? While pre-processing an entire dataset can eliminate some bottlenecks, the large datasets you encounter with LLMs can make that infeasible. Pre-processing also adds a time-consuming initial step that may turn out to be unnecessary! By building a data processing pipeline to prepare our data on the fly, we can go straight into training (and be prepared to handle bigger datasets!).
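To make "on the fly" concrete, here is a minimal sketch of a lazy pipeline built from plain Python generators. It is not the code from the notebook: the function names, the whitespace tokenizer, and the `min_chars` and `seq_len` parameters are all assumptions chosen for illustration. The point is only that each stage pulls one document at a time, so nothing has to be processed before training starts.

```python
from typing import Iterable, Iterator, List


def stream_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield documents one at a time (stand-in for a streamed dataset)."""
    for doc in docs:
        yield doc


def filter_by_length(docs: Iterable[str], min_chars: int) -> Iterator[str]:
    """Drop documents shorter than min_chars characters."""
    for doc in docs:
        if len(doc) >= min_chars:
            yield doc


def tokenize(docs: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical whitespace tokenizer; a real pipeline would use a subword tokenizer."""
    for doc in docs:
        yield doc.split()


def make_examples(token_lists: Iterable[List[str]], seq_len: int) -> Iterator[List[str]]:
    """Cut each tokenized document into non-overlapping chunks of seq_len tokens."""
    for tokens in token_lists:
        for start in range(0, len(tokens) - seq_len + 1, seq_len):
            yield tokens[start:start + seq_len]


def build_pipeline(docs: Iterable[str], min_chars: int = 20, seq_len: int = 4) -> Iterator[List[str]]:
    """Chain the stages; no work happens until the result is iterated."""
    streamed = stream_documents(docs)
    long_enough = filter_by_length(streamed, min_chars)
    tokenized = tokenize(long_enough)
    return make_examples(tokenized, seq_len)


if __name__ == "__main__":
    corpus = ["too short", "the quick brown fox jumps over the lazy dog again"]
    for example in build_pipeline(corpus):
        print(example)
```

Because every stage is a generator, the same chain works whether `docs` is a small list or an endless stream; only the examples actually requested by the training loop are ever computed.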

Building this pipeline is its own skill set, so we’ll talk through the core concepts and look at how to identify bottlenecks, apply optimizations, and keep our GPUs busy! Data processing interfaces aren’t always intuitive, but by understanding the hardware and requirements in a tool-agnostic way we can simplify our thinking.
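One tool-agnostic way to spot a data bottleneck is to time how long the loop waits for each batch versus how long the training step takes. The sketch below is an illustration under assumptions, not the method from the video: `profile_loop`, the fake loader, and the fake step are hypothetical names, and the sleeps stand in for real work. If most of the time goes to waiting on data, the GPU is probably sitting idle.

```python
import time
from typing import Any, Callable, Dict, Iterable


def profile_loop(
    batches: Iterable[Any],
    train_step: Callable[[Any], Any],
    max_steps: int = 50,
) -> Dict[str, float]:
    """Split wall-clock time into 'waiting for data' and 'running the step'."""
    data_time = 0.0
    step_time = 0.0
    steps = 0
    iterator = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent here is the model step
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        steps += 1
    total = data_time + step_time
    return {
        "steps": steps,
        "data_time": data_time,
        "step_time": step_time,
        "data_fraction": data_time / total if total > 0 else 0.0,
    }


def fake_slow_loader(num_batches: int, delay: float):
    """Hypothetical loader that takes `delay` seconds to produce each batch."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


def fake_train_step(batch: Any) -> None:
    """Hypothetical model step that takes about a millisecond."""
    time.sleep(0.001)


if __name__ == "__main__":
    stats = profile_loop(fake_slow_loader(10, 0.02), fake_train_step)
    print(stats)
```

One caveat when swapping in a real model: GPU work is typically launched asynchronously, so the step timer can under-report unless the device is synchronized before reading the clock (in PyTorch, for example, with `torch.cuda.synchronize()`).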


Links:
Link to Colab Notebook: https://colab.research.google.com/dri...
You can follow me on Twitter: / nickcdryan
Check out the membership site for a full course version of the series (coming soon) and lots of other NLP content and code! https://www.chrismccormick.ai/membership
Data processing pipeline diagram: https://drive.google.com/file/d/1r2uL...

Chapters:
00:00 intro: what's in this episode?
01:40 data processing pipeline
02:55 problems with the data processing pipeline
06:40 the big picture of data processing
08:08 making the GPU go fast
11:02 optimization options
15:55 our process: how to make training fast
17:32 a success story using our process
20:38 baseline data processing for memorizing transformers
21:34 our target dataset
22:56 streaming the dataset
24:13 memorizing transformers data pipeline
27:50 filtering the data by length
30:45 tokenization
33:14 processing our data
39:11 generating training examples
43:19 fake model and training loop

