Understanding Quantization with PyTorch: Post-Training Quantization and Quantization-Aware Training Explained

Quantization is the process of reducing the numerical precision of a neural network’s parameters and/or activations, typically from 32-bit floating point to 8-bit integers. This can significantly reduce model size and improve inference performance on hardware with limited computational resources, such as mobile devices and embedded systems.
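At its core, most quantization schemes map real values to integers through a scale factor and a zero point. As a rough illustration (not code from the post), here is a minimal sketch of affine (asymmetric) quantization of a tensor to int8 and back, assuming a single scale and zero point for the whole tensor; the variable names are illustrative only:

```python
import torch

# Minimal sketch of affine (asymmetric) quantization: one scale and zero point
# for the whole tensor. Illustrative only, not the exact scheme from the post.
x = torch.randn(4) * 3.0                      # example float32 tensor

qmin, qmax = -128, 127                        # int8 range
scale = (x.max() - x.min()) / (qmax - qmin)   # float range mapped onto 256 levels
zero_point = int(qmin - torch.round(x.min() / scale))

x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
x_dq = (x_q.to(torch.float32) - zero_point) * scale   # dequantize

print(x)      # original float values
print(x_q)    # int8 codes
print(x_dq)   # reconstruction: close to x, but with rounding error
```

Comparing `x` with `x_dq` shows the rounding error that quantization introduces; the rest of the post is about keeping that error from hurting model accuracy.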

Post-Training Quantization

Post-training quantization is applied to a model after it has already been trained, so no retraining is required: the model’s floating-point weights and biases are converted to lower-precision integers, such as 8-bit or 16-bit integers. PyTorch provides tools for post-training quantization, making it easy to apply the technique to existing models.
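As a rough illustration of how little code this takes, the sketch below applies PyTorch’s dynamic post-training quantization to a toy two-layer model; the model is just an illustrative stand-in, and static quantization with a calibration pass is the other common post-training option:

```python
import torch
import torch.nn as nn

# Minimal sketch of post-training (dynamic) quantization.
# The toy model here is illustrative only.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

# Replace the Linear layers with int8 dynamically quantized versions.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)  # same interface, smaller int8 weights
```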

Quantization-Aware Training

Quantization-aware training simulates quantization while the model is being trained, so the model learns representations that remain accurate once its weights and activations are actually converted to integers. This generally yields better accuracy after quantization than the post-training approach. PyTorch supports quantization-aware training through tools such as the “torch.quantization” module, making it possible to train models with quantization in mind from the beginning.
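The sketch below outlines the usual eager-mode workflow under some simplifying assumptions: a toy model (illustrative only) wraps its computation in QuantStub/DeQuantStub, is prepared for QAT so fake quantization runs in the forward pass, is trained as normal, and is finally converted to a real int8 model:

```python
import torch
import torch.nn as nn

# Minimal sketch of quantization-aware training; the toy model
# and the single training step are illustrative only.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fake-quantize inputs
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()  # back to float outputs

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = ToyModel()
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # insert fake-quant observers

# Normal training loop runs here, with fake quantization in the forward pass.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# After training, convert the fake-quantized modules to real int8 modules.
model.eval()
quantized_model = torch.quantization.convert(model)
```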

Overall, quantization is an important technique for optimizing neural network models for deployment on resource-constrained hardware. With PyTorch’s support for both post-training quantization and quantization-aware training, it is easy to apply quantization to your models and achieve better performance on a wide range of devices.

21 Comments
@myaseena
10 months ago

Really high quality exposition. Also thanks for providing the slides.

@ankush4617
10 months ago

I keep hearing about quantization so much; this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!!

I’m hoping that you will have a video on Mixtral MoE soon 😊

@007Paulius
10 months ago

Thanks

@tubercn
10 months ago

Thanks, Great video🐱‍🏍🐱‍🏍
But I have a question: since we'll dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two dequantizations in a row.

@manishsharma2211
10 months ago

beautiful again, thanks for sharing these

@tryit-wv8ui
10 months ago

Umar, I just wanna thank you for all your stuff. You gave me back the desire to master deep learning. If one day you can make a video on what LangChain is (even though it is not in the deep learning domain), it would be amazing! Thank you again!

@elieelezra2734
10 months ago

Umar, thanks for all your content. I have stepped up a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does it mean you need to dequantize the prediction? If so, you do not talk about it, right? Can I have your email to get more details please?

@venkateshr6127
10 months ago

Could you please make a video on how to build tokenizers for languages other than English?

@zendr0
10 months ago

If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you Umar. Thank you for all your hard work ❤

@swiftmindai
10 months ago

I noticed a small correction needs to be done at timestamp @28:53 [slide: Low precision matrix multiplication]. In the first line, the dot products between each row of X with each column of Y [instead of Y, it should be W – the weight matrix].

@dzvsow2643
10 months ago

Assalamu alaykum, Brother.
Thanks for your videos!
I have been working on game development with pygame for a while and I just want to start deep learning in Python, so could you make a roadmap video?! Thank you again

@HeyFaheem
10 months ago

You are a hidden gem, my brother

@Erosis
10 months ago

You're making all of my lecture materials pointless! (But keep up the great work!)

@bluecup25
10 months ago

Thank you, super clear

@user-hd7xp1qg3j
10 months ago

One request: could you explain mixture of experts? I bet you can break down the explanation well.

@Sonn0Suyy
10 months ago

Thank you for sharing.

@krystofjakubek9376
10 months ago

Great video!

Just a clarification: on modern processors, floating-point operations are NOT slower than integer operations. It very much depends on the exact processor, and even then the difference is usually extremely small compared to the other overheads of executing the code.

HOWEVER, the reduction in size from a 32-bit float to an 8-bit integer does itself make the operations a lot faster. The cause is twofold:
1) Modern CPUs and GPUs are typically memory bound, so simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to shrink by 4x as well.
2) Pretty much all machine learning code is vectorized. This means the processor, instead of executing each instruction on a single number, grabs N numbers and executes the instruction on all of them at once (SIMD instructions).
However, most processors don't fix N; instead they fix the total number of bits all N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32 bits to 8 bits we can do 4x more operations at once! This is likely what is meant by the operations being faster.
Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).

@aminamoudjar4561
10 months ago

Very helpful, thank you so much

@user-kg9zs1xh3u
10 months ago

Very good

@AbdennacerAyeb
10 months ago

Keep Going. This is perfect. Thank you for the effort you are making