Archive for March, 2018

Transfer Learning sample from Microsoft AI framework – CNTK

March 27, 2018

Transfer learning is a great way to save on resources by transferring what an existing model has already learned. More info here:

Transfer Learning is a useful technique when, for instance, you know you need to classify incoming images into different categories, but you do not have enough data to train a Deep Neural Network (DNN) from scratch. Training DNNs takes a lot of data, all of it labeled, and often you will not have that kind of data on hand. If your problem is similar to one for which a network has already been trained, though, you can use Transfer Learning to modify that network to your problem with a fraction of the labeled images (we are talking tens instead of thousands).

What is Transfer Learning?

With Transfer Learning, we use an existing trained model and adapt it to our own problem. We are essentially building upon the features and concepts that were learned during the training of the base model. With a Convolutional DNN (ResNet_18 in this case), we are using the features learned from ImageNet data and cutting off the final classification layer, replacing it with a new dense layer that will predict the class labels of our new domain.

The input to the old and the new prediction layer is the same, we simply reuse the trained features. Then we train this modified network, either only the new weights of the new prediction layer or all weights of the entire network.

This can be used, for instance, when we have a small set of images that are in a similar domain to an existing trained model. Training a Deep Neural Network from scratch requires tens of thousands of images, but training one that has already learned features in the domain you are adapting it to requires far fewer.
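The recipe can be sketched without any framework at all. Below is a minimal NumPy illustration of the same idea: a frozen feature extractor (here just a fixed random projection standing in for the trained ResNet_18 features) with a new dense softmax layer on top, which is the only thing we train. All names, sizes, and data here are made up for illustration; this is not the CNTK sample's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_feat, n_classes = 8, 4, 2

# Frozen "base model" weights: a stand-in for the learned ResNet_18 features.
W_base = rng.normal(size=(n_in, n_feat))

def frozen_features(x):
    # The base network with its classification head cut off; never updated.
    return np.tanh(x @ W_base)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# A small labeled dataset in the "new domain" (synthetic, for illustration).
X = rng.normal(size=(64, n_in))
y = (frozen_features(X)[:, 0] > 0).astype(int)
Y = np.eye(n_classes)[y]                  # one-hot targets

# New dense prediction layer: the only weights we train.
W_new = np.zeros((n_feat, n_classes))
b_new = np.zeros(n_classes)

lr = 0.5
for _ in range(200):                      # plain gradient descent
    F = frozen_features(X)                # base features stay fixed
    P = softmax(F @ W_new + b_new)
    W_new -= lr * F.T @ (P - Y) / len(X)  # cross-entropy gradient
    b_new -= lr * (P - Y).mean(axis=0)

acc = (softmax(frozen_features(X) @ W_new + b_new).argmax(axis=1) == y).mean()
```

Because only the small dense layer is trained, a few dozen labeled samples are enough; unfreezing the base weights as well (the second option above) would correspond to also updating W_base during the loop.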

In our case, this means adapting a network trained on ImageNet images (dogs, cats, birds, etc.) to flowers, or sheep/wolves. However, Transfer Learning has also been successfully used to adapt existing neural models for translation, speech synthesis, and many other domains – it is a convenient way to bootstrap your learning process.

Here is an example:



CNTK ships with a transfer learning sample. I ran it on a machine with an Nvidia GeForce GTX 960M GPU.

Here is what happens when I start this sample with Python:

First, activate the CNTK environment:

C:\CNTK-Samples-2-4\Examples\Image\TransferLearning>conda activate cntk-py36

(cntk-py36) C:\CNTK-Samples-2-4\Examples\Image\TransferLearning>


(cntk-py36) C:\CNTK-Samples-2-4\Examples\Image\TransferLearning>python
Traceback (most recent call last):
File "", line 9, in <module>
import cntk as C
ModuleNotFoundError: No module named 'cntk'


This gets resolved by picking the Python interpreter from the Anaconda installation:


(cntk-py36) C:\Users\Dell\Downloads\ethereum-mining-windows>where python

(cntk-py36) C:\CNTK-Samples-2-4\Examples\Image\TransferLearning>C:\Users\Dell\Anaconda3\python.exe


Selected GPU[0] GeForce GTX 960M as the process wide default device.
Build info:

Built time: Jan 31 2018 14:57:35
 Last modified date: Wed Jan 31 01:10:27 2018
 Build type: Release
 Build target: GPU
 With 1bit-SGD: no
 With ASGD: yes
 Math lib: mkl
 CUDA version: 9.0.0
 CUDNN version: 7.0.5
 Build Branch: HEAD
 Build SHA1: a70455c7abe76596853f8e6a77a4d6de1e3ba76e
 MPI distribution: Microsoft MPI
 MPI version: 7.0.12437.6
Training transfer learning model for 20 epochs (epoch_size = 6149).
Training 15949478 parameters in 68 parameter tensors.
CUDA failure 2: out of memory ; GPU=0 ; hostname=DESKTOP-IA3HLGI ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * AsMultipleOf(numElements, 2))
Traceback (most recent call last):
 File "", line 217, in <module>
 max_epochs, freeze=freeze_weights)
 File "", line 130, in train_model
 trainer.train_minibatch(data) # update model with it
 File "C:\Users\Dell\Anaconda3\lib\site-packages\cntk\train\", line 181, in train_minibatch
 arguments, device)
 File "C:\Users\Dell\Anaconda3\lib\site-packages\cntk\", line 2975, in train_minibatch_overload_for_minibatchdata
 return _cntk_py.Trainer_train_minibatch_overload_for_minibatchdata(self, *args)
RuntimeError: CUDA failure 2: out of memory ; GPU=0 ; hostname=DESKTOP-IA3HLGI ; expr=cudaMalloc((void**) &deviceBufferPtr, sizeof(AllocatedElemType) * AsMultipleOf(numElements, 2))

 > Microsoft::MSR::CNTK::CudaTimer:: Stop
 - Microsoft::MSR::CNTK::CudaTimer:: Stop (x2)
 - Microsoft::MSR::CNTK::GPUMatrix<float>:: Resize
 - Microsoft::MSR::CNTK::Matrix<float>:: Resize
 - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>
 - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: shared_from_this (x3)
 - CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
 - CNTK:: CreateTrainer
 - CNTK::Trainer:: TotalNumberOfUnitsSeen
 - CNTK::Trainer:: TrainMinibatch (x2)
 - PyInit__cntk_py (x2)

I run into a problem with memory. My graphics card has 2 GB of dedicated memory, as you can see below:

Apparently that isn’t enough. The next question is: does CUDA use dedicated GPU memory or shared system memory? My graphics card has 3.9 GB of the latter:

So my total graphics memory is 5.9 GB.


It turns out that I am running out of dedicated GPU memory, not shared system memory.

I couldn’t find any GPU memory setting, but reducing the minibatch size did the trick. On line 41 of the sample’s source code, I changed the variable from ‘mb_size = 50’ to ‘mb_size = 30’. Your mileage may vary based on your GPU, so do experiment.
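For a rough sense of why this helps: the 15,949,478 parameters reported in the log only need a couple hundred megabytes (weights, gradients, and an optimizer copy), so most of the 2 GB goes to per-image activations, which grow linearly with the minibatch size. A back-of-the-envelope sketch (the per-image activation cost below is a made-up placeholder, not a measured number):

```python
bytes_per_float = 4
n_params = 15_949_478  # from the log: "Training 15949478 parameters"

# Weights + gradients + an optimizer state copy: roughly three float
# copies of the parameters.
param_mem_mb = 3 * n_params * bytes_per_float / 2**20

def activation_mem_mb(mb_size, per_image_mb=30.0):
    # per_image_mb is an illustrative placeholder; the point is that
    # activation memory scales linearly with the minibatch size.
    return mb_size * per_image_mb

for mb in (50, 30):
    print(f"mb_size={mb}: ~{param_mem_mb + activation_mem_mb(mb):.0f} MB total")
```

Dropping mb_size from 50 to 30 removes roughly 40% of the activation memory, which was evidently enough to fit training within 2 GB of dedicated memory here.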