Benchmarking FloydHub instances
This post compares all the CPU and GPU instances offered by FloydHub, so that you can choose the right instance type for your training job.
Benchmark
For our benchmark we decided to use the same tests as used by the Tensorflow project. The Tensorflow benchmark process is explained here. AlexNet model was tested using the ImageNet data set for this benchmark. We are planning to add results from other models like InceptionV3 and ResNet-50 soon. You can find the FloydHub project with the benchmark runs here and the Github repo here.
Update (12/14/2017): We have now added benchmark results for InceptionV3, Resnet-50, Resnet-152 and VGG-16.
Update (08/08/2018): Added benchmark results with Mixed Precision Training for GPU2.
Sneak preview of results
Here is the quick results from benchmarking before we go in to the details. You can compare the performance of each instance by the number of images it
can process per second.
Instance | Batch Size | Images/second |
---|---|---|
CPU | 32 | 9 |
CPU2 | 64 | 34 |
GPU | 512 | 662 |
GPU2 | 1024 | 4700 |
GPU2 (MXP) | 2048 | 7200 |
Continue reading to learn more about the setup.
Performance Tuning
We also followed various performance optimizations recommended in the Tensorflow best practices guide
Data Format
For our benchmarks we will be using the NHWC
format for the CPU instances and NCHW
for the GPU instances to take advantage of cuDNN. This is again as recommended by Tensorflow.
Fused Ops
Using fused batch norm can result in 12-30% speed up. In our code we enabled this by setting:
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
You can learn about this in Tensorflow documentation.
Mixed precision training
Another optimization we did is to use Mixed precision training: this techinque lowers the required resources by using lower-precision arithmetic. This decreases the required amount of memory and enables training of larger models or training with larger minibatches. It also shortens the training or inference time because half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers. This is an experimental feature but gives significant speed boost.
Building Tensorflow from source
To optimize the performance it is recommneded to build and install Tensorflow from source. All the FloydHub environments are
built from source and optimized for the specific instance type.
Instances
For this post we will be comparing 2 CPU instances (CPU, CPU2) and 2 GPU instances (GPU, GPU2):
Instance | CPU Cores | Memory | GPU Type | GPU Memory |
---|---|---|---|---|
CPU | 2 | 8 GB | - | - |
CPU2 | 8 | 32 GB | - | - |
GPU | 4 | 64 GB | Nvidia Tesla K-80 | 12 GB dedicated Memory |
GPU2 | 8 | 64 GB | Nvidia Tesla V100 | 16 GB dedicated Memory |
Data
We used synthetic data for all the tests. Synthetic data removes disk I/O as a variable and historically the numbers for real data is very close to synthetic data. But we are also planning to run another round of benchmark tests with real data soon.
Synthetic data is randomly generated by using tf.truncated_normal then normalized in range [-1,1] and set to the same shape as the data expected ImageNet.
Sample image generated this way:
Results
Alexnet
Below are the results of testing Alexnet with the synthetic data. As you can see the batch sizes are different - we picked the batch size the utilizes the memory available in each instance as much as possible.
Instance | Batch Size | Images/second |
---|---|---|
CPU | 32 | 9 |
CPU2 | 64 | 34 |
GPU | 512 | 662 |
GPU2 | 1024 | 4700 |
GPU2 (MXP) | 2048 | 7200 |
The GPU instances are orders of magnitude faster than CPU instances. And the new GPU2 instance performs about 7x the standard GPU instance.
Note:
Full training over Imagenet (about 100 epochs on 1.2 images dataset) takes about 7-8h on GPU2 and more than 2 days on GPU.
INCEPTION V3
Here are the benchmark numbers for InceptionV3.
Instance | Batch Size | Images/second |
---|---|---|
CPU | 4 | 1 |
CPU2 | 16 | 2.8 |
GPU | 64 | 31 |
GPU2 | 128 | 253 |
GPU2 (MXP) | 256 | 495 |
Note:
Full training over Imagenet (about 100 epochs on 1.2 images dataset) takes about 5 days on GPU2 and more than 45 days on GPU.
Resnet-50
Instance | Batch Size | Images/second |
---|---|---|
CPU | 4 | 1.4 |
CPU2 | 8 | 2.7 |
GPU | 64 | 52 |
GPU2 | 128 | 387 |
GPU2 (MXP) | 256 | 717 |
Note:
A full training over Imagenet (about 90 epochs on 1.2 images dataset) takes 3 days on GPU2 and about 24 days on GPU.
Resnet-152
Instance | Batch Size | Images/second |
---|---|---|
CPU | 1 | 0.5 |
CPU2 | 2 | 1.5 |
GPU | 32 | 20 |
GPU2 | 64 | 152 |
GPU2 (MXP) | 128 | 330 |
Note:
A full training over Imagenet (about 100 epochs on 1.2 images dataset) takes 9 days on GPU2 and about 70 days on GPU.
VGG-16
Instance | Batch Size | Images/second |
---|---|---|
CPU | 2 | 0.5 |
CPU2 | 4 | 1.5 |
GPU | 32 | 36 |
GPU2 | 64 | 234 |
GPU2 (MXP) | 128 | 420 |
Note:
A full training over Imagenet (about 74 epochs on 1.2 images dataset) takes 5 days on GPU2 and about 38 days on GPU.
Up Next
We are planning to keep this post as a live document. When we run new benchmarks on FloydHub we will update the results here. We are hoping you can always refer to this page when making a decision about which FloydHub instance to choose for your next project.