Colab Notebook: https://colab.research.google.com/dri...
Text Generation Playlist: • Text Generation in Deep Learning with...
TensorFlow Input Pipeline Playlist: • TensorFlow Data Pipeline: How to Desi...
This tutorial is the second part of the "Text Generation in Deep Learning with Tensorflow & Keras" series. In this series, we have been covering all the topics related to Text Generation with sample implementations in Python. In this tutorial, we will focus on how to build an Efficient TensorFlow Data Pipeline for Character-Level Text Generation.
First, we will download a sample corpus (text file).
After opening the file and reading it line-by-line, we will convert it to a single line of text.
Then, we will split the text into input character sequence (X) and output character (y).
Using the tf.data API and the Keras TextVectorization layer, we will
preprocess the text,
convert the characters into integer representation,
prepare the training dataset,
optimize the data pipeline.
Thus, in the end, we will be ready to train a Language Model for character-level text generation.
If you would like to learn more about Deep Learning with practical coding examples, please subscribe to Murat Karakaya Akademi YouTube Channel or follow my blog on Medium. Do not forget to turn on notifications so that you will be notified when new parts are uploaded.
You can access this Colab Notebook using the link given in the video description below.
What is a Data Pipeline? A data pipeline is an automated process that involves extracting, transforming, combining, validating, and loading data for further analysis and visualization.
It speeds up the end-to-end process by eliminating errors and reducing bottlenecks and latency.
It can process multiple data streams at once.
In short, it is an absolute necessity for today’s data-driven solutions.
If you are not familiar with data pipelines, you can check my tutorials in English or Turkish.
Why a TensorFlow Data Pipeline?
The tf.data API enables us
to build complex input pipelines from simple, reusable pieces.
to handle large amounts of data, read from different data formats, and perform complex transformations.
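For instance, here is a minimal sketch of such a pipeline built from simple, reusable pieces; the toy lines are placeholders, not the tutorial's corpus:

import tensorflow as tf

# Toy lines standing in for a real corpus file.
lines = ["Hello World", "tf.data builds input pipelines", "from simple pieces"]

dataset = tf.data.Dataset.from_tensor_slices(lines)  # one element per line
dataset = dataset.map(tf.strings.lower)              # a simple, reusable transformation
dataset = dataset.batch(2)                           # group elements into batches

for batch in dataset:
    print(batch.numpy())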
What can be done in a Text Data Pipeline?
The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.
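As an illustrative sketch (the vocabulary and sequences below are made up), such a pipeline might map tokens to integer ids with a lookup table and pad the variable-length sequences into batches:

import tensorflow as tf

# Made-up vocabulary and variable-length token sequences.
vocab = ["the", "cat", "sat"]
lookup = tf.keras.layers.StringLookup(vocabulary=vocab)  # token -> integer id

sequences = [["the", "cat"], ["the", "cat", "sat"], ["sat"]]
dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.string),
)
dataset = dataset.map(lookup)                             # tokens -> integer ids
dataset = dataset.padded_batch(2, padded_shapes=(None,))  # pad to a common length

for batch in dataset:
    print(batch.numpy())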
What will we do in this Text Data pipeline?
We will create a data pipeline to prepare training data for a character-level text generator.
Thus, in the pipeline, we will
open & load the corpus (text file)
convert the text into a sequence of characters
remove unwanted characters such as punctuation, HTML tags, white spaces, etc.
generate input (X) and output (y) pairs as character sequences
concatenate input (X) and output (y) into train data
cache, prefetch, and batch the train data for performance (sketched below)
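As a preview of the last step, here is a minimal sketch of caching, shuffling, batching, and prefetching; the random (X, y) tensors stand in for the character data we will build below, and the buffer and batch sizes are illustrative assumptions:

import tensorflow as tf

# Random (X, y) pairs standing in for the character data built below:
# 1,000 input sequences of length 20 and their next-character ids.
X = tf.random.uniform((1000, 20), maxval=30, dtype=tf.int32)
y = tf.random.uniform((1000,), maxval=30, dtype=tf.int32)
train_data = tf.data.Dataset.from_tensor_slices((X, y))

train_data = (
    train_data
    .cache()                     # keep elements in memory after the first pass
    .shuffle(1000)               # assumption: buffer as large as the dataset
    .batch(64)                   # assumption: batch size of 64
    .prefetch(tf.data.AUTOTUNE)  # overlap data preparation with training
)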
LOAD TEXT DATA LINE BY LINE
Since our aim is to prepare a training dataset for a character-level text generator, we need to convert the line-by-line text into char-by-char text.
Therefore, we first combine all the lines into a single text. Earlier in this series, we mentioned that we can train a language model and generate new text by using two different units:
character level
word level
That is, you can split your text into a sequence of characters or a sequence of words.
In this tutorial, we will focus on character-level tokenization.
If you would like to learn how to create word-level tokenization, please take a look at Part C.
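A minimal sketch of the loading step, assuming a placeholder corpus (the classic Nietzsche text used in many Keras examples, not necessarily the file used in this notebook):

import tensorflow as tf

# Placeholder corpus: download a sample text file.
path = tf.keras.utils.get_file(
    "nietzsche.txt",
    origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt",
)

# Read the file line by line, then join all lines into a single text.
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]
text = " ".join(lines)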
Check the size of the corpus
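Assuming the text variable from the loading sketch above, a quick check might be:

print(f"Corpus length: {len(text)} characters")
print(f"Distinct characters: {len(set(text))}")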
The first sequence (input_chars) is the input data (X) to the model, which will receive fixed-size (maxlen) character sequences. The corresponding output (y) is the single character that follows each input sequence.
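A minimal sketch of this splitting, assuming the text variable from above; maxlen, the stride, and the next_char name are illustrative assumptions:

maxlen = 20  # assumption: fixed length of each input sequence
step = 3     # assumption: stride between consecutive samples

input_chars = []  # X: fixed-size character sequences
next_char = []    # y (hypothetical name): the character that follows each sequence

for i in range(0, len(text) - maxlen, step):
    input_chars.append(text[i : i + maxlen])
    next_char.append(text[i + maxlen])

print(input_chars[0], "->", next_char[0])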
We need to process these datasets before feeding them into a model.
Here, we will use the Keras preprocessing layer "TextVectorization".
Why do we use Keras preprocessing layer?
Because:
The Keras preprocessing layers API allows developers to build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.
With Keras preprocessing layers, we can build and export models that are truly end-to-end: models that accept raw images or raw structured data as input; models that handle feature normalization or feature value indexing on their own.
In the next part, we will create the end-to-end Text Generation model and we will see the benefits of using Keras preprocessing layers.
What are the preprocessing steps?
The processing of each sample contains the following steps:
standardize each sample (usually lowercasing + punctuation stripping),
split each sample into substrings (usually words; in our case, characters),
index tokens (associate a unique int value with each token),
transform each sample using this index, either into a vector of ints or a dense float vector.
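Here is a sketch of a character-level TextVectorization layer covering these four steps; the standardize and split callables follow the pattern shown in the TensorFlow documentation, and the sentences passed to adapt() are placeholders for the real corpus:

import re
import string
import tensorflow as tf

# Standardize: lowercase and strip punctuation.
def custom_standardization(input_text):
    lowercase = tf.strings.lower(input_text)
    return tf.strings.regex_replace(
        lowercase, "[%s]" % re.escape(string.punctuation), ""
    )

# Split: break each sample into characters instead of words.
def char_split(input_text):
    return tf.strings.unicode_split(input_text, "UTF-8")

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    split=char_split,
    output_mode="int",
    output_sequence_length=20,  # assumption: matches maxlen above
)

# Index: build the character vocabulary from (placeholder) samples.
vectorize_layer.adapt(["Hello world!", "TensorFlow data pipeline."])
print(vectorize_layer.get_vocabulary())

# Transform: map a sample into a vector of ints.
print(vectorize_layer(["hello"]))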