How to Build & Use TensorFlow Data Pipeline for Image Processing

Published: 01 January 1970
Channel: Murat Karakaya Akademi
4,342 views · 75 likes

For all tutorials: muratkarakaya.net
Colab Notebook: https://colab.research.google.com/dri...
TensorFlow Input Pipelines Playlist:    • TensorFlow Data Pipeline: How to Desi...  
In this tutorial, we will focus on how to Build Efficient TensorFlow Data Pipelines for Image Datasets in Deep Learning with Tensorflow & Keras.

First, we will review the tf.data library. Then, we will download a sample image and label files. After gathering all the image file paths in the directories, we will merge the file names with labels to create the train and test datasets. Using tf.data.Dataset methods, we will learn how to map, prefetch, cache, and batch the datasets correctly so that the input pipeline is efficient in terms of time and performance. We will also discuss how the map, prefetch, cache, and batch functions affect the performance of the tf.data.Dataset input pipeline.

Moreover, we will see how to use the TensorBoard add-on "TF Profiler" to monitor the performance and bottlenecks of the tf.data input pipeline.
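As a brief sketch, TF Profiler traces can be collected through the Keras TensorBoard callback; the log directory name and batch range below are assumptions, not taken from the notebook:

```python
import tensorflow as tf

# Profile batches 2-5 of the first epoch; the collected traces (including
# the tf.data input pipeline timeline) appear under TensorBoard's
# "Profile" tab. "logs" is a hypothetical log directory.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs", profile_batch=(2, 5))

# The callback would then be passed to training, e.g.:
# model.fit(train_ds, epochs=5, callbacks=[tb_callback])
```

Launching `tensorboard --logdir logs` afterwards shows the profiling results.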

If you would like to learn more about Deep Learning with practical coding examples, please subscribe to my YouTube Channel or follow my blog on Medium. Do not forget to turn on Notifications so that you will be notified when new parts are uploaded.

You can access this Colab Notebook using the link given in the video description below.

Now we can gather the image file names and paths by traversing the images/ folders. There are two options to load the file list from an image directory using the tf.data.Dataset module.

Option 1: Create the file list dataset with Dataset.from_tensor_slices(). We can use the Path and glob methods in the pathlib module to browse the image folder and compile all the image file names and paths. In most cases, we want to test the input pipeline with a small portion of the data to verify the performance and accuracy of the tf.data pipeline design. Here, we select a very small set of images and observe the effects of the map, prefetch, cache, and batch methods. We can then supply the image file list to tf.data.Dataset.from_tensor_slices() to create a dataset of file paths.

Option 2: Create the file list dataset with Dataset.list_files(). As stated in the tf.data.Dataset documentation:

list_files(file_pattern, shuffle=None, seed=None) is used for creating a dataset of all files matching one or more glob patterns.

"If your filenames have already been globbed, use Dataset.from_tensor_slices(filenames) instead, as re-globbing every filename with list_files may result in poor performance with remote storage systems."

Since we have already globbed the file names and paths above, we will not use this second approach.


Note also that with this method we cannot specify the number of selected samples.

map(map_func) method

Maps map_func across the elements of this dataset. As stated in the official documentation:
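The two loading options above can be sketched as follows; the images/ folder and file names here are dummies created on the fly so the snippet is self-contained:

```python
import pathlib
import tempfile
import tensorflow as tf

# Create a throwaway images/ folder with a few empty .jpg files.
root = pathlib.Path(tempfile.mkdtemp())
(root / "images").mkdir()
for i in range(4):
    (root / "images" / f"img_{i}.jpg").write_bytes(b"")

# Option 1: glob with pathlib, then Dataset.from_tensor_slices().
file_paths = [str(p) for p in (root / "images").glob("*.jpg")]
small_sample = file_paths[:2]  # test the pipeline on a small subset first
ds_slices = tf.data.Dataset.from_tensor_slices(small_sample)

# Option 2: Dataset.list_files() re-globs the pattern itself,
# and the number of selected samples cannot be limited here.
ds_listed = tf.data.Dataset.list_files(
    str(root / "images" / "*.jpg"), shuffle=False)

print(ds_slices.cardinality().numpy())  # 2
```

Option 1 is preferable here because the file names are already globbed, so the subset size stays under our control.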

"To use Python code inside of the function you have a few options:

1) Rely on AutoGraph to convert Python code into an equivalent graph computation. The downside of this approach is that AutoGraph can convert some but not all Python code.

2) Use tf.py_function, which allows you to write arbitrary Python code but will generally result in worse performance than 1)."

Here, I use tf.py_function because, unfortunately, AutoGraph did not handle the combine_images_labels() function properly. Furthermore, as noted in the official documentation:

"Performance can often be improved by setting num_parallel_calls so that map will use multiple threads to process elements. If deterministic order isn't required, it can also improve performance to set deterministic=False."

Since reading files from the hard disk takes much time, we apply these suggested settings.

As discussed in Better performance with the tf.data API, there are several ways to increase the performance of input data pipelines in tf.data.
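A minimal sketch of mapping with tf.py_function and parallel calls is shown below; the combine_images_labels() body here is a hypothetical stand-in (a plain dict lookup on the file name), not the notebook's actual implementation:

```python
import tensorflow as tf

# Hypothetical label table standing in for the downloaded label file.
labels = {"cat.jpg": 0, "dog.jpg": 1}

def combine_images_labels(path):
    # Ordinary Python (bytes decoding, dict lookup) that AutoGraph
    # cannot trace, hence the tf.py_function wrapper below.
    name = path.numpy().decode("utf-8").split("/")[-1]
    return name, labels[name]

def tf_combine(path):
    name, label = tf.py_function(
        combine_images_labels, [path], (tf.string, tf.int64))
    return name, label

ds = tf.data.Dataset.from_tensor_slices(["images/cat.jpg", "images/dog.jpg"])
ds = ds.map(
    tf_combine,
    num_parallel_calls=tf.data.AUTOTUNE,  # use multiple threads
    deterministic=False)                  # order not required here

for name, label in ds:
    print(name.numpy(), label.numpy())
```

In the real pipeline, the mapped function would additionally read and decode the image file instead of returning the file name.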

By increasing the performance of the input pipeline, we can decrease the overall train and test processing time.

The general strategy is to overlap the input pipeline steps (reading file paths, loading and processing image data, filtering label info, converting them to image data and label tuples, etc.) with batch computation during the train or test phase. Here, I summarize the most relevant methods very briefly:

Prefetching: Prefetching overlaps the preprocessing and model execution of a training step.
Caching: The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. This saves some operations (such as file opening and data reading) from being executed during each epoch.
Batching: Combines consecutive elements of the dataset into batches.

The exact order of these transformations depends on several factors. For more details, please refer to Better performance with the tf.data API.
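The transformations above can be chained in one typical ordering; this toy pipeline uses a numeric range instead of image files, so the map step stands in for the decode-and-label work:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(8)
ds = ds.map(lambda x: x * 2,                  # per-element preprocessing
            num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.cache()                               # reuse mapped elements across epochs
ds = ds.batch(4)                              # combine elements into batches
ds = ds.prefetch(tf.data.AUTOTUNE)            # overlap input with model execution

print([b.numpy().tolist() for b in ds])       # [[0, 2, 4, 6], [8, 10, 12, 14]]
```

Caching before batching keeps the cached elements batch-size independent, while prefetch is typically applied last so the whole preceding pipeline overlaps with training.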
