Benchmarking FloydHub instances

This post compares all the CPU and GPU instances offered by FloydHub, so that you can choose the right instance type for your training job.

Benchmark

For our benchmark we decided to use the same tests as the TensorFlow project. The TensorFlow benchmark process is explained here. The AlexNet model was tested using the ImageNet data set for this benchmark. We are planning to add results from other models like InceptionV3 and ResNet-50 soon. You can find the FloydHub project with the benchmark runs here and the GitHub repo here.

Update (12/14/2017): We have now added benchmark results for InceptionV3, ResNet-50, ResNet-152 and VGG-16.

Update (08/08/2018): Added benchmark results with Mixed Precision Training for GPU2.

Sneak preview of results

Here are the quick results from benchmarking before we go into the details. You can compare the performance of each instance by the number of images it can process per second.

| Instance | Batch Size | Images/second |
|---|---|---|
| CPU | 32 | 9 |
| CPU2 | 64 | 34 |
| GPU | 512 | 662 |
| GPU2 | 1024 | 4700 |
| GPU2 (MXP) | 2048 | 7200 |

Continue reading to learn more about the setup.

Performance Tuning

We also followed various performance optimizations recommended in the TensorFlow best practices guide.

Data Format

For our benchmarks we use the NHWC format for the CPU instances and NCHW for the GPU instances to take advantage of cuDNN. This again follows the TensorFlow recommendations.
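To make the two layouts concrete, here is a minimal pure-Python sketch of how the same element is addressed in NHWC and NCHW memory order. The helper names and the example shape are illustrative assumptions and are not taken from the benchmark code.

```python
def nhwc_to_nchw_shape(shape):
    """Reorder a (batch, height, width, channels) shape to (batch, channels, height, width)."""
    n, h, w, c = shape
    return (n, c, h, w)


def flat_offset_nhwc(shape, n, h, w, c):
    """Offset of element (n, h, w, c) in a row-major NHWC buffer."""
    _, height, width, channels = shape
    return ((n * height + h) * width + w) * channels + c


def flat_offset_nchw(shape, n, c, h, w):
    """Offset of element (n, c, h, w) in a row-major NCHW buffer."""
    _, channels, height, width = shape
    return ((n * channels + c) * height + h) * width + w


# Hypothetical batch of 32 RGB images at 224x224 (shape chosen for illustration).
nhwc_shape = (32, 224, 224, 3)
nchw_shape = nhwc_to_nchw_shape(nhwc_shape)
print(nchw_shape)

# Distance in memory between channel 0 and channel 1 of the same pixel.
# NHWC: the channel values of one pixel sit next to each other.
nhwc_stride = flat_offset_nhwc(nhwc_shape, 0, 0, 0, 1) - flat_offset_nhwc(nhwc_shape, 0, 0, 0, 0)
# NCHW: each channel is one contiguous plane of height * width values.
nchw_stride = flat_offset_nchw(nchw_shape, 0, 1, 0, 0) - flat_offset_nchw(nchw_shape, 0, 0, 0, 0)
print(nhwc_stride, nchw_stride)
```

The only difference between the two formats is the order of the axes, so the data is the same but the memory access pattern seen by the convolution kernels is not.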

Fused Ops

Using fused batch norm can result in a 12-30% speedup. In our code we enabled this by setting:

bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')

You can learn more about this in the TensorFlow documentation.

Mixed precision training

Another optimization we applied is mixed precision training (labeled MXP in the results tables). This technique lowers the required resources by using lower-precision arithmetic. It decreases the required amount of memory, which enables training of larger models or training with larger minibatches. It also shortens training and inference time because half precision halves the number of bytes accessed, reducing the time spent in memory-limited layers. This is an experimental feature, but it gives a significant speed boost.
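The memory argument can be checked with a little arithmetic. The sketch below uses only the Python standard library; the tensor shape is a made-up example, not a layer from any of the benchmarked models.

```python
import struct

# Bytes per element for IEEE 754 single ("f") and half ("e") precision.
FP32_BYTES = struct.calcsize("f")
FP16_BYTES = struct.calcsize("e")


def tensor_bytes(shape, bytes_per_element):
    """Memory needed to store a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def to_half_and_back(value):
    """Round a Python float to half precision and return the rounded value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical activation tensor: batch 128, 64 channels, 112x112 (NCHW).
shape = (128, 64, 112, 112)
fp32_mib = tensor_bytes(shape, FP32_BYTES) / 2**20
fp16_mib = tensor_bytes(shape, FP16_BYTES) / 2**20
print(f"float32: {fp32_mib:.0f} MiB, float16: {fp16_mib:.0f} MiB")

# Half precision has far fewer significand bits than float32, so values
# are rounded more coarsely when stored in it.
print(to_half_and_back(0.1))
```

Storing the same tensor in half precision takes half the bytes, which is where both the memory saving and the reduced memory traffic come from. The rounding shown in the last line is the trade-off, and it is one reason lower precision is combined with full precision rather than used for everything.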

Building Tensorflow from source

To optimize performance, it is recommended to build and install TensorFlow from source. All the FloydHub environments are built from source and optimized for the specific instance type.

Instances

For this post we will be comparing 2 CPU instances (CPU, CPU2) and 2 GPU instances (GPU, GPU2):

| Instance | CPU Cores | Memory | GPU Type | GPU Memory |
|---|---|---|---|---|
| CPU | 2 | 8 GB | - | - |
| CPU2 | 8 | 32 GB | - | - |
| GPU | 4 | 64 GB | NVIDIA Tesla K80 | 12 GB dedicated memory |
| GPU2 | 8 | 64 GB | NVIDIA Tesla V100 | 16 GB dedicated memory |

Data

We used synthetic data for all the tests. Synthetic data removes disk I/O as a variable, and historically the numbers for real data have been very close to those for synthetic data. We are also planning to run another round of benchmark tests with real data soon.

Synthetic data is randomly generated using tf.truncated_normal, then normalized to the range [-1, 1] and set to the same shape as the data expected from ImageNet.

Sample image generated this way:

[Image: sample synthetic image]
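For readers without TensorFlow at hand, the generation step can be approximated in plain Python. This is only a sketch: the re-draw rule mirrors the documented behavior of tf.truncated_normal (values more than two standard deviations from the mean are re-picked), while the scaling step, the function names, and the tiny image size are assumptions made for illustration.

```python
import random


def truncated_normal(mean, stddev, rng):
    """Draw from a normal distribution, re-drawing values beyond 2 standard deviations."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            return value


def synthetic_image(height, width, channels, rng):
    """Build one H x W x C image as nested lists with values in [-1, 1]."""
    stddev = 1.0
    # Assumption: dividing by the truncation bound (2 * stddev) maps values to [-1, 1].
    scale = 2 * stddev
    return [
        [
            [truncated_normal(0.0, stddev, rng) / scale for _ in range(channels)]
            for _ in range(width)
        ]
        for _ in range(height)
    ]


rng = random.Random(0)  # fixed seed so runs are repeatable
image = synthetic_image(8, 8, 3, rng)  # tiny stand-in for an ImageNet-sized image
flat = [value for row in image for pixel in row for value in pixel]
print(len(image), len(image[0]), len(image[0][0]))
print(min(flat) >= -1.0 and max(flat) <= 1.0)
```

Because the data is generated in memory, the input pipeline never touches the disk, which is what lets the benchmark isolate compute throughput.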

Results

AlexNet

Below are the results of testing AlexNet with the synthetic data. As you can see, the batch sizes are different: for each instance we picked the batch size that utilizes as much of the available memory as possible.

| Instance | Batch Size | Images/second |
|---|---|---|
| CPU | 32 | 9 |
| CPU2 | 64 | 34 |
| GPU | 512 | 662 |
| GPU2 | 1024 | 4700 |
| GPU2 (MXP) | 2048 | 7200 |

The GPU instances are orders of magnitude faster than the CPU instances, and the new GPU2 instance delivers about 7x the throughput of the standard GPU instance.
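The ratios quoted here follow directly from the AlexNet table. A short sketch of the calculation (the dictionary and function names are illustrative):

```python
# Images/second copied from the AlexNet results table.
alexnet_throughput = {
    "CPU": 9,
    "CPU2": 34,
    "GPU": 662,
    "GPU2": 4700,
    "GPU2 (MXP)": 7200,
}


def speedup(results, instance, baseline):
    """How many times more images per second `instance` processes than `baseline`."""
    return results[instance] / results[baseline]


for name in alexnet_throughput:
    print(f"{name:<11} {speedup(alexnet_throughput, name, 'CPU'):7.1f}x vs CPU")

gpu2_vs_gpu = speedup(alexnet_throughput, "GPU2", "GPU")
print(f"GPU2 vs GPU: {gpu2_vs_gpu:.1f}x")
```

Keep in mind that each instance ran with a different batch size, so these ratios compare the best configuration found per instance rather than identical workloads.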

[Figure: AlexNet benchmark results]

Note:

Full training over ImageNet (about 100 epochs on the 1.2 million image dataset) takes about 7-8 hours on GPU2 and more than 2 days on GPU.
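The estimate in this note can be reproduced from the measured throughput. The sketch below assumes an ImageNet training set of roughly 1.2 million images and ignores evaluation, checkpointing, and data loading overhead, so treat the result as a lower bound; the function name is illustrative.

```python
def training_time_hours(images_per_second, epochs, images_per_epoch):
    """Estimated wall-clock hours to process epochs * images_per_epoch images."""
    total_images = epochs * images_per_epoch
    return total_images / images_per_second / 3600


IMAGENET_TRAIN_IMAGES = 1_200_000  # assumption: roughly 1.2 million training images

# AlexNet throughput (images/second) for the two GPU instances.
for instance, throughput in [("GPU", 662), ("GPU2", 4700)]:
    hours = training_time_hours(throughput, epochs=100, images_per_epoch=IMAGENET_TRAIN_IMAGES)
    print(f"{instance}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

The same function can be reused for the other models by substituting their images/second figures and epoch counts.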

InceptionV3

Here are the benchmark numbers for InceptionV3.

| Instance | Batch Size | Images/second |
|---|---|---|
| CPU | 4 | 1 |
| CPU2 | 16 | 2.8 |
| GPU | 64 | 31 |
| GPU2 | 128 | 253 |
| GPU2 (MXP) | 256 | 495 |

[Figure: InceptionV3 benchmark results]

Note:

Full training over ImageNet (about 100 epochs on the 1.2 million image dataset) takes about 5 days on GPU2 and more than 45 days on GPU.

ResNet-50

| Instance | Batch Size | Images/second |
|---|---|---|
| CPU | 4 | 1.4 |
| CPU2 | 8 | 2.7 |
| GPU | 64 | 52 |
| GPU2 | 128 | 387 |
| GPU2 (MXP) | 256 | 717 |

[Figure: ResNet-50 benchmark results]

Note:

Full training over ImageNet (about 90 epochs on the 1.2 million image dataset) takes about 3 days on GPU2 and about 24 days on GPU.

ResNet-152

| Instance | Batch Size | Images/second |
|---|---|---|
| CPU | 1 | 0.5 |
| CPU2 | 2 | 1.5 |
| GPU | 32 | 20 |
| GPU2 | 64 | 152 |
| GPU2 (MXP) | 128 | 330 |

[Figure: ResNet-152 benchmark results]

Note:

Full training over ImageNet (about 100 epochs on the 1.2 million image dataset) takes about 9 days on GPU2 and about 70 days on GPU.

VGG-16

| Instance | Batch Size | Images/second |
|---|---|---|
| CPU | 2 | 0.5 |
| CPU2 | 4 | 1.5 |
| GPU | 32 | 36 |
| GPU2 | 64 | 234 |
| GPU2 (MXP) | 128 | 420 |

[Figure: VGG-16 benchmark results]

Note:

Full training over ImageNet (about 74 epochs on the 1.2 million image dataset) takes about 5 days on GPU2 and about 38 days on GPU.
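To see how much mixed precision training helps on each model, the GPU2 rows of the results tables can be compared side by side. This is a small sketch with illustrative names; the numbers are copied from the tables above.

```python
# (GPU2 images/second, GPU2 with mixed precision images/second) per model.
gpu2_results = {
    "AlexNet": (4700, 7200),
    "InceptionV3": (253, 495),
    "ResNet-50": (387, 717),
    "ResNet-152": (152, 330),
    "VGG-16": (234, 420),
}


def mxp_gain(full_precision_rate, mixed_precision_rate):
    """Throughput ratio of the mixed precision run to the full precision run."""
    return mixed_precision_rate / full_precision_rate


gains = {model: mxp_gain(*rates) for model, rates in gpu2_results.items()}
for model, gain in gains.items():
    print(f"{model:<12} {gain:.2f}x")
```

The mixed precision runs also used twice the batch size of the plain GPU2 runs, so the ratio reflects both the lower-precision arithmetic and the larger batches that the freed memory allows.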

Up Next

We are planning to keep this post as a living document. When we run new benchmarks on FloydHub, we will update the results here. We hope you can always refer to this page when deciding which FloydHub instance to choose for your next project.