For all tutorials: muratkarakaya.net
COLAB NOTEBOOK: https://colab.research.google.com/dri...
Controllable Text Generation Playlist: • Controllable Text Generation with Tra...
On Medium: / controllable-text-generation-in-deep-learn...
On Github: https://kmkarakaya.github.io/Deep-Lea...
Text Generation Playlist: • Text Generation in Deep Learning with...
LSTM Playlist: • All About LSTM
Sequence-to-Sequence Learning: / part-a-introduction-to-seq2seq-learning-a-...
00:00 Controllable Text Generation with Transformers tutorial series
06:03 What is text generation?
08:10 What is a prompt?
09:05 What is a Corpus?
10:55 What is a Token?
14:46 What is Text Tokenization?
18:54 What is a Language Model?
21:57 How does a Language Model generate text?
24:27 What is Word-based and Char-based Text Generation?
25:44 Which Level of Tokenization (Word or Character based) should be used?
29:07 What is Sampling?
41:46 What kinds of Language Models do exist in Artificial Neural Networks?
42:38 Which Language Model to use?
48:35 What are the Text Generation Types?
49:46 What is Controllable Text Generation?
52:02 Summary
53:34 Bye
In this series, we will focus on developing TensorFlow (TF) / Keras implementation of Controllable Text Generation from scratch.
Part A: Fundamentals of Controllable Text Generation:
A1 A Review of Text Generation
A2 An Introduction to Controllable Text Generation
Part B: A Tensorflow Data Pipeline for Word Level Controllable Text Generation
Part C: Sample Implementations of Controllable Text Generation with TensorFlow & Keras:
C1 Approach: Input Update + Language Model: LSTM
C2 Approach: Input Update + Language Model: Encoder-Decoder
C3 Approach: Input Update + Language Model: Transformer (GPT3)
What is text generation?
In the simplest form, you train a Deep Learning (DL) model to generate random but hopefully meaningful text.
Text generation is a subfield of natural language processing (NLP). It leverages knowledge in computational linguistics and artificial intelligence to automatically generate natural language texts, which can satisfy certain communicative requirements.
What is a prompt?
A prompt is the initial text input to the trained model, which the model then completes by generating suitable text.
What is a Corpus?
A corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts.
For example, the Large Movie Review corpus consists of 25,000 highly polar movie reviews for training and 25,000 for testing, and can be used to train a Language Model for sentiment analysis.
What is a Token?
In general, a token is a string of contiguous characters between two spaces, or between a space and punctuation marks.
A token can also be any number (an integer, or real).
All other symbols are tokens themselves, except for apostrophes and quotation marks inside a word (with no space), which in many cases mark contractions or quotations.
What is Text Tokenization?
Tokenization is a way of separating a piece of text into smaller units called tokens.
Basically for training a language model, we prepare the training data as follows:
we collect, clean, and structure the data
this data is called the corpus
we decide
the token size (word, character, or n-gram)
the maximum number of tokens in each sample
the number of distinct tokens in the dictionary (vocabulary size)
we tokenize the corpus into sequences of tokens, respecting the maximum length
At the end of the tokenization process, we have:
sequences of tokens as samples (inputs or outputs for the LM)
a vocabulary consisting of at most n of the most frequent tokens in the corpus
an index list representing each token in the vocabulary
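The steps above can be sketched in plain Python. This is a minimal, framework-free illustration (in practice you would use a tool such as Keras's TextVectorization layer); the function name and the toy corpus are made up for the example.

```python
from collections import Counter

def build_tokenizer(corpus, vocab_size, max_len):
    # Step 1: tokenize each sample into word-level tokens
    tokenized = [text.lower().split() for text in corpus]
    # Step 2: keep the most frequent tokens; index 0 is reserved for [UNK]
    counts = Counter(tok for sample in tokenized for tok in sample)
    vocab = ["[UNK]"] + [tok for tok, _ in counts.most_common(vocab_size - 1)]
    index = {tok: i for i, tok in enumerate(vocab)}
    # Step 3: map each sample to a fixed-length sequence of token indices
    # (truncate long samples, pad short ones with the [UNK]/pad index 0)
    sequences = []
    for sample in tokenized:
        ids = [index.get(tok, 0) for tok in sample][:max_len]
        ids += [0] * (max_len - len(ids))
        sequences.append(ids)
    return vocab, sequences

corpus = ["the movie was great", "the movie was boring"]
vocab, seqs = build_tokenizer(corpus, vocab_size=6, max_len=5)
```

The output gives exactly the three artifacts listed above: the token sequences (samples), the vocabulary, and the token-to-index mapping.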
What is a Language Model?
A language model is at the core of many NLP tasks and is simply a probability distribution over a sequence of words.
In this current context, the model trained to generate text is mostly called a Language Model (LM).
In a broader context, a statistical language model is a probability distribution over sequences of tokens (i.e., words or characters).
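To make "probability distribution over sequences of tokens" concrete, here is the simplest possible statistical LM, a bigram model that estimates P(next token | current token) by relative frequency. The tiny corpus is invented for illustration.

```python
from collections import Counter

# Count bigrams in a tiny corpus and estimate P(next | current)
# by relative frequency -- the simplest statistical language model.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def prob(current, nxt):
    # Conditional probability P(nxt | current)
    return bigrams[(current, nxt)] / unigrams[current]

# "the" occurs 3 times and is followed by "cat" twice, so
# P("cat" | "the") = 2/3
p = prob("the", "cat")
```

Neural LMs replace these frequency counts with a learned network, but the object they compute is the same conditional distribution.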
How does a Language Model generate text?
In general, we first train an LM and then use it to generate text (inference).
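The inference step is an autoregressive loop: starting from a prompt, the trained LM predicts a distribution over the next token, we pick a token, append it, and repeat. The sketch below uses a hypothetical stand-in function in place of a real trained model.

```python
def next_token_probs(tokens):
    # Stand-in for a trained LM: returns a toy next-token distribution
    # based only on the last token (a real model would use the whole
    # sequence and be learned from data).
    cycle = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return {cycle.get(tokens[-1], "the"): 1.0}

def generate(prompt, num_tokens):
    tokens = prompt.split()
    for _ in range(num_tokens):
        probs = next_token_probs(tokens)
        # Greedy decoding: always pick the most probable next token
        tokens.append(max(probs, key=probs.get))
    return " ".join(tokens)

text = generate("the", 4)  # -> "the cat sat on the"
```

Swapping the greedy `max` for a sampling step gives the stochastic decoding discussed under "What is Sampling?" below.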
What is Word-based and Char-based Text Generation?
We can set the token size at the word level or character level.
Below, you can see that we opt out of character-based tokenization.
Which Level of Tokenization (Word or Character based) should be used?
What is Sampling?
Sampling means randomly picking the next word according to its conditional probability distribution.
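A minimal sketch of sampling, including the common temperature parameter: instead of always taking the most probable token, we draw from the distribution, with temperature < 1 sharpening it toward the argmax and temperature > 1 flattening it toward uniform. The example distribution is invented.

```python
import math
import random

def sample(probs, temperature=1.0):
    # Rescale log-probabilities by the temperature, then draw a token
    # from the resulting distribution.
    tokens = list(probs)
    logits = [math.log(probs[t]) / temperature for t in tokens]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return random.choices(tokens, weights=weights, k=1)[0]

dist = {"cat": 0.6, "dog": 0.3, "mat": 0.1}
token = sample(dist, temperature=0.8)
```

At a very low temperature this behaves like greedy decoding (almost always "cat" here); at a high temperature all three tokens become nearly equally likely.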