Optimize deep learning training, Part I: Choosing the right data loading framework and file format

Haneul Kim
7 min read · May 28, 2023
Picture by Haneul Kim

Working through a textbook and training models on sample datasets, which are usually small enough to fit in your computer's or server's memory, is straightforward and all is well. When it comes to implementing this in your business with large datasets, there are many considerations.

Three metrics we will consider:

  • Speed: making sure the model is trained with up-to-date data.
  • Memory consumption: as the number of rows and columns grows enormous, how are you going to split the data and feed it for training so that it trains without an out-of-memory error?
  • Data storage: is your dataset well compressed and split into files of optimal size, for efficient storage cost and minimal I/O?

In Part I of this series we are going to compare different file formats and data loading frameworks on a local machine.

File formats:

  • Parquet
  • TFRecords

Data loading frameworks:

  • tf.data.TFRecordDataset API
  • Petastorm

Here is a brief overview before the comparison. For an in-depth understanding, I recommend reading the documentation and the references mentioned at the end.

Parquet

  • Hybrid-based storage file format designed for efficient storage and retrieval.
  • Provides efficient data compression and encoding for handling complex data in bulk.
  • Uses the record shredding and assembly algorithm behind the scenes, explained in the Dremel paper [2].
  • Supports both projection (selecting a subset of columns) and row predicates (filtering rows), as in the example below.
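For instance, with pandas and the pyarrow engine you can exploit both projection and predicates when reading parquet; the file path and column names here are purely illustrative:

```python
import pandas as pd

# Projection: read only the needed columns.
# Predicate: filter rows at read time (pushed down by the pyarrow engine).
df = pd.read_parquet(
    "train_parquet/train_000.parquet",
    columns=["age", "price", "label"],
    filters=[("sales_channel_id", "=", 2)],
)
```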

The picture below, from [1], visualizes the differences between row-, column-, and hybrid-based file formats very well.

[Figure from [1]: row-based vs. column-based vs. hybrid-based storage layouts]

Hybrid-based file formats store row_group_size rows for each column. For example, with row_group_size of 2 we store ‘a’, ‘b’ from the first column, then 1, 2 from the second column, and so on.

TFRecords

  • Contains a sequence of records in the protocol buffer message format tf.train.Example, which represents data as a dictionary mapping {feature_name: value, …}. Each value is a feature converted into a tf.train.Feature of the correct type: tf.train.BytesList, tf.train.FloatList, or tf.train.Int64List.
  • Simply put, each row (record) is converted into a tf.train.Example and then written to a TFRecord file. Note that other serializable formats can be used as long as they can be deserialized in TensorFlow.
  • Main advantage: it can store various data types such as images, video, audio, text, etc.
  • Protocol buffers are a language- and platform-independent serialization format developed by Google.

Grasping the concept of TFRecord can be challenging at first, but once you see the code examples all the dots will connect, so hold on.

tf.data.TFRecordDataset API

  • Creates a tf.data.Dataset that streams TFRecord files from disk, so batches of data are processed and trained on without out-of-memory errors.
  • Reads each record in a TFRecord file as bytes, which therefore need to be deserialized.

Petastorm

  • A data access library developed by Uber that streams parquet files into ML frameworks.
  • Integrates easily with parquet and PySpark DataFrames. Most companies preprocess data with big data tools such as Spark, so preprocessed data is often already in parquet or PySpark DataFrame form, which makes Petastorm an attractive candidate.
  • Supports multiple features required for efficient training, such as selective column reads, parallelism strategies (threads, processes), n-grams, row predicates, shuffling, caching, and more.

Now we will use the H&M dataset [4] to train a simple DNN model. Note that since this post compares efficiency, model performance is not considered, and only a few columns are selected for demonstration purposes.

All code can be found on GitHub.

Computer spec: 16 GB RAM, 6 cores, 12 logical processors.

The preprocessed dataset looks as follows:

  • train_df shape = (25_430_659, 7)
  • test_df shape = (6_357_655, 7)

The pre-processed training data will be stored in each file format in three different ways:

  1. One big file.
  2. Optimal size (close to 128 MB): note that the right size depends on your server spec and dataset. This is also a naive approach; to be more precise you should look at the row groups of each parquet file as well.
  3. A large number of small files.
Comparing the resulting files on disk, we can see that the parquet files are stored more compactly.

Splitting the data into n parquet files using pandas is straightforward; chunk_size was calculated by hand.
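A minimal sketch of that split, assuming the preprocessed data is in the pandas DataFrame train_df (the output directory and number of files are illustrative):

```python
import math
import pandas as pd

def write_parquet_shards(df: pd.DataFrame, out_dir: str, n_files: int) -> None:
    # Split the DataFrame into n roughly equal chunks and write one parquet file per chunk.
    chunk_size = math.ceil(len(df) / n_files)
    for i in range(n_files):
        chunk = df.iloc[i * chunk_size:(i + 1) * chunk_size]
        chunk.to_parquet(f"{out_dir}/train_{i:03d}.parquet", index=False)

# e.g. pick n_files so that each file lands close to the ~128 MB target
write_parquet_shards(train_df, "train_parquet", n_files=24)
```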

Generating TFRecord files takes a bit more work:
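Here is a minimal sketch of that conversion; the column names and helper functions are illustrative, not the exact code from the repo:

```python
import tensorflow as tf

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def row_to_example(row) -> tf.train.Example:
    # Map each column of the row to a tf.train.Feature (column names are illustrative).
    feature = {
        "age": _float_feature(row["age"]),
        "price": _float_feature(row["price"]),
        "sales_channel_id": _int64_feature(row["sales_channel_id"]),
        "label": _int64_feature(row["label"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# train_df is the preprocessed pandas DataFrame from above.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for _, row in train_df.iterrows():              # 1. iterate the DataFrame
        example = row_to_example(row)                # 2-3. Feature -> Example
        writer.write(example.SerializeToString())    # 4. serialize and write
```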

As you can see, we are:

  1. iterating over the pandas DataFrame,
  2. converting each column value into a tf.train.Feature message,
  3. wrapping the feature_name: tf.train.Feature mapping in a tf.train.Example, which is a standardized format for storing/reading the data, and
  4. serializing the tf.train.Example and writing it to a TFRecord file.

Not only does this require extra code, it also takes a long time to write: for me it took 37m43s to write train_df into a single TFRecord file, whereas writing to parquet took less than a minute. So I would not recommend writing a custom for-loop based conversion, especially if you are running it in production. Luckily, there are several ways to speed up the process:

  • multi-processing
  • when working with Spark DataFrames, there are third-party packages like spark-tensorflow-connector or spark-tfrecord [8] that write PySpark DataFrames into TFRecord files. I have not tested their efficiency yet; if you try, please let me know :)

Similar to parquet, TFRecord files have a recommended size [5]:

In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB+ and ideally 100 MB+) so that you can benefit from I/O prefetching.

Therefore we've written the TFRecord data in the same three ways: a single file, optimally sized files, and many small files.

We will use the following model to train on the different file formats and file sizes.
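A stand-in for that model, assuming a single tensor of numeric input features; the actual architecture and loss in the repo may differ:

```python
import tensorflow as tf

NUM_FEATURES = 6  # 7 columns minus the target; set this to the number of feature columns you feed

def build_model() -> tf.keras.Model:
    # A simple DNN used only to benchmark the input pipelines,
    # not tuned for predictive performance.
    inputs = tf.keras.Input(shape=(NUM_FEATURES,), name="features")
    x = tf.keras.layers.Dense(64, activation="relu")(inputs)
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model()
```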

Training with petastorm

First, I will explain how Petastorm streams parquet files, using the optimally sharded ones (the optimally sized files). Code for training on the single file and on the many small files is on GitHub.

imports:
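Assuming a reasonably recent petastorm release, the imports look like this:

```python
from petastorm import make_reader, make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
```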

  • make_reader(): for reading parquet datasets created with petastorm. It streams individual rows, which you can then group with .batch().
  • make_batch_reader(): for reading plain parquet files. It streams batches of data, where each row group is considered one batch.
  • make_petastorm_dataset(): takes a reader instance and uses it to create a tf.data.Dataset.

The kwargs dictionaries hold the parameters for make_batch_reader() [3]:

  • reader_pool_type: multi-threading or multi-processing; the default is multi-threading.
  • schema_fields: this is projection; it lets petastorm retrieve only the columns we need.
  • workers_count: number of threads/processes.
  • shuffle_rows: shuffles within each row group.
  • shuffle_row_groups: shuffles by randomly selecting row groups from all parquet files.
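As an illustration, such a kwargs dictionary might look like this; the column names are placeholders, and the exact parameter availability (e.g. shuffle_rows) depends on your petastorm version:

```python
# Illustrative subset; list all of your feature columns here and keep the
# model's NUM_FEATURES equal to len(FEATURE_COLS).
FEATURE_COLS = ["age", "price", "sales_channel_id"]
LABEL_COL = "label"

petastorm_kwargs = {
    "reader_pool_type": "thread",                 # or "process"
    "schema_fields": FEATURE_COLS + [LABEL_COL],  # projection: read only these columns
    "workers_count": 10,                          # number of threads/processes
    "shuffle_row_groups": True,                   # shuffle row groups across files
    "shuffle_rows": True,                         # shuffle within each row group (version dependent)
}
```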

One tip when working with petastorm: read the code on GitHub, as there aren't a lot of up-to-date resources. The good news is that it is written in Python and therefore easy to follow.

  • parquet_path should be a full path, either a single string or a list of multiple file paths.
  • Since we are using make_batch_reader(), data is streamed in batches of one row group each, so we need to unbatch() first in order to re-batch into the size we want (see the sketch below). I've tried various ways to get rid of unbatch() but no luck yet; again, please tell me if you've found the answer, as I've even posted on Stack Overflow [10].
  • parse_dataset(): each element obtained while iterating the dataset is a view of petastorm's inferred schema, with each feature as an eager tensor accessed via getattr or .col_nm. The best approach would be to replace the for-loop with slicing, but the order of the features seems to change each time.
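Putting that together, a minimal sketch of the petastorm input pipeline; the paths, batch size, and parse logic are illustrative, and petastorm_kwargs, FEATURE_COLS, LABEL_COL, and model come from the snippets above:

```python
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

def parse_dataset(row):
    # Each element is a view of petastorm's inferred schema; features are
    # accessed with getattr. All columns are assumed numeric here.
    features = tf.stack(
        [tf.cast(getattr(row, col), tf.float32) for col in FEATURE_COLS], axis=-1
    )
    label = tf.cast(getattr(row, LABEL_COL), tf.float32)
    return features, label

# Full paths (with a file:// scheme), either one string or a list of files.
parquet_paths = [f"file:///data/train_parquet/train_{i:03d}.parquet" for i in range(24)]

with make_batch_reader(parquet_paths, **petastorm_kwargs) as reader:
    dataset = (
        make_petastorm_dataset(reader)
        .unbatch()                                   # data arrives one row group per batch
        .map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(1024)
        .prefetch(tf.data.AUTOTUNE)
    )
    # make_batch_reader defaults to a single pass over the data (num_epochs=1);
    # pass num_epochs in the kwargs or loop over epochs for longer training.
    model.fit(dataset, epochs=1)
```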

Training from TFRecords

Similar to petastorm, we need to parse the serialized examples written in the TFRecord files, and in order to do that we need to provide a feature description.

  • parse_tfrecord(): deserializes each tf.train.Example and shapes the input into a tuple of a dictionary (for the features) and a tf.Tensor for the target. Note that since this does not provide projection, we need to pop unwanted features, so it is often better to save the tf.train.Example with only the necessary features in the first place.

Also, since we are feeding a tuple of a dictionary and a tensor, we need to make slight changes to the input layer, as sketched below.
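A sketch of the parsing function and of the dictionary-style input layers; the feature names and dtypes are illustrative and must match whatever was written to the TFRecord files:

```python
import tensorflow as tf

# Must mirror the schema used when the records were written.
feature_description = {
    "age": tf.io.FixedLenFeature([1], tf.float32),
    "price": tf.io.FixedLenFeature([1], tf.float32),
    "sales_channel_id": tf.io.FixedLenFeature([1], tf.int64),
    "label": tf.io.FixedLenFeature([1], tf.int64),
}

def parse_tfrecord(serialized_example):
    parsed = tf.io.parse_single_example(serialized_example, feature_description)
    label = tf.cast(parsed.pop("label"), tf.float32)   # no projection: pop what you don't feed
    features = {name: tf.cast(value, tf.float32) for name, value in parsed.items()}
    return features, label                             # (dict of features, target) tuple

# Because each element is a dict, the model needs one named Input per feature.
inputs = {
    name: tf.keras.Input(shape=(1,), name=name)
    for name in ["age", "price", "sales_channel_id"]
}
x = tf.keras.layers.Concatenate()(list(inputs.values()))
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```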

Now we are finally ready to train the model.

  • We have added .shuffle() because shuffling is not supported right off the bat. However, similar to petastorm, you could shard the TFRecord files and then read them with .list_files().interleave(), which reads multiple files in parallel and shuffles at the file level automatically (see the sketch below).
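A sketch of that file-level shuffling, reusing parse_tfrecord and the dict-input model above; the file pattern, buffer sizes, and batch size are illustrative:

```python
files = tf.data.Dataset.list_files("train_tfrecords/*.tfrecord", shuffle=True)
dataset = (
    files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,                      # number of files read in parallel
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .map(parse_tfrecord, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)                         # record-level shuffle buffer
    .batch(1024)
    .prefetch(tf.data.AUTOTUNE)
)

model.fit(dataset, epochs=3)
```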

Below is a summary of training on Parquet and TFRecord with different numbers of files, for 3 epochs.

Conclusion

We've compared different file formats and data loading frameworks in a local environment. Data pipelines are an important part of the machine learning life cycle, and a deep understanding of them will allow you to scale machine learning training efficiently.

Based on experiment:

  1. Parquet seems to be more compact and easier to write than TFRecord files. If you are working only with tabular data, parquet seems to be a great choice.
  2. Sharding the files does not seem to make a difference. However, this could be because I tested in a local environment, where retrieving from internal file storage isn't expensive. If the data were stored in S3, then sharding into an optimal number of files of optimal size would likely provide large benefits.

Next, considering only the optimally sized parquet and TFRecord files, we will use the TensorFlow Profiler to see where our data pipelines could be improved, try to improve performance, and see which data loading framework scales better.


Haneul Kim

Data Scientist passionate about helping the environment.