Quantization explained with PyTorch
Quantization is the process of reducing the precision of a neural network’s parameters and/or activations. This can significantly reduce model size and improve inference performance on hardware with limited computational resources, such as mobile devices and embedded systems.
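To make "reducing precision" concrete, here is a minimal sketch of asymmetric (affine) quantization of a float tensor to 8-bit integers. The helper names and the simple min/max calibration are illustrative choices, not a specific PyTorch API:

```python
import torch

def affine_quantize_int8(x: torch.Tensor):
    # Asymmetric (affine) quantization: map the observed [x_min, x_max] onto the int8 range.
    qmin, qmax = -128, 127
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    # q = round(x / scale + zero_point), clamped to [qmin, qmax]
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    # x ≈ (q - zero_point) * scale
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(4, 4)
q, scale, zp = affine_quantize_int8(x)
print((x - dequantize(q, scale, zp)).abs().max())  # maximum quantization error
```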
Post-Training Quantization
Post-training quantization is a technique where an already-trained model is quantized without any further training. This is done by converting the model’s floating-point weights and biases to lower-precision integers, such as 8-bit or 16-bit values. PyTorch provides tools for post-training quantization, making it easy to apply this technique to existing models.
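As a rough sketch of what this can look like in code, the snippet below applies PyTorch's eager-mode dynamic post-training quantization to the linear layers of a toy model (the model itself is made up for illustration):

```python
import torch
import torch.nn as nn

# Toy model used only for illustration.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic post-training quantization: nn.Linear weights are stored as int8,
# while activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)  # same interface, smaller/faster linear layers
```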
Quantization-Aware Training
Quantization-aware training is a technique where the model is trained with the knowledge that it will later be quantized: quantization is simulated (fake-quantized) during the forward pass, so the model learns quantization-friendly representations and retains better accuracy after quantization. PyTorch provides support for quantization-aware training through tools such as the “torch.quantization” module, making it possible to train models with quantization in mind from the beginning.
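Below is a minimal eager-mode sketch of the quantization-aware training workflow, assuming a toy model wrapped with QuantStub/DeQuantStub and the "fbgemm" (x86) backend; the actual training loop is omitted:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    # Toy model for illustration: QuantStub/DeQuantStub mark where tensors
    # enter and leave the quantized region of the network.
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyModel()
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules so training "sees" quantization error.
torch.quantization.prepare_qat(model, inplace=True)

# ... run your usual training loop on the prepared model here ...

# After training, convert to an actual int8 model for inference.
model.eval()
quantized_model = torch.quantization.convert(model)
```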
Overall, quantization is an important technique for optimizing neural network models for deployment on resource-constrained hardware. With PyTorch’s support for both post-training quantization and quantization-aware training, it is easy to apply quantization to your models and achieve better performance on a wide range of devices.
Really high-quality exposition. Also, thanks for providing the slides.
I keep hearing about quantization so much, yet this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!!
I’m hoping that you will have a video on Mixtral MoE soon 😊
Thanks
Thanks, great video 🐱🏍🐱🏍
But I have a question: since we already dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two dequantization steps in a row.
Beautiful again, thanks for sharing these.
Umar, I just wanna thank you for all your stuff. You gave me back the desire to master deep learning. If one day you could make a video on what LangChain is (even though it is not in the deep learning domain), it would be amazing! Thank you again!
Umar, thanks for all your content. I've stepped up a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does that mean you need to dequantize the prediction? If yes, you do not talk about it, right? Can I have your email to get more details, please?
Could you please make a video on how to build tokenizers for languages other than English?
If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤
I noticed a small correction needs to be done at timestamp @28:53 [slide: Low precision matrix multiplication]. In the first line, the dot products are between each row of X and each column of Y [instead of Y, it should be W – the weight matrix].
Assalamu alaikum, brother.
Thanks for your videos!
I have been working on game development using pygame for a while and I just want to start deep learning in Python, so could you make a roadmap video? Thank you again
You are a hidden gem, my brother
You're making all of my lecture materials pointless! (But keep up the great work!)
Thank you, super clear
One request: could you explain mixture of experts? I bet you can break down the explanation well.
Thank you for sharing.
Great video!
Just a clarification: on modern processors floating point operations are NOT slower than integer operations. It very much depends on the exact processor and even then the difference is usually extremely small compared to the other overheads of executing the code.
HOWEVER, the reduction in size from 32-bit float to 8-bit integer does itself make the operations a lot faster. The cause is twofold:
1) Modern CPUs and GPUs are typically memory-bound, so, simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to arrive to shrink by 4x as well.
2) Pretty much all machine learning code is vectorized. This means that instead of executing each instruction on a single number, the processor grabs N numbers and executes the instruction on all of them at once (SIMD instructions).
However, most processors don't fix N directly; instead they fix the total number of bits all N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32-bit to 8-bit numbers we can process 4x more values per instruction! This is likely what you mean by the operations being faster.
Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).
Very helpful, thank you so much
Very good
Keep going. This is perfect. Thank you for the effort you are making.