Quantization explained with PyTorch
Quantization is the process of reducing the precision of a neural network’s parameters and/or activations. This can significantly reduce model size and improve inference performance on hardware with limited computational resources, such as mobile devices and embedded systems.
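To make "reducing precision" concrete, here is a minimal sketch of asymmetric (affine) quantization of a float tensor to 8-bit integers. The helper names and the simple min/max calibration are illustrative choices, not a specific PyTorch API:

```python
import torch

def affine_quantize_int8(x: torch.Tensor):
    # Asymmetric (affine) quantization: map the observed [x_min, x_max] onto the int8 range.
    qmin, qmax = -128, 127
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    # q = round(x / scale + zero_point), clamped to [qmin, qmax]
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    # x ≈ (q - zero_point) * scale
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(4, 4)
q, scale, zp = affine_quantize_int8(x)
print((x - dequantize(q, scale, zp)).abs().max())  # maximum quantization error
```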
Post-Training Quantization
Post-training quantization is a technique where an already-trained model is quantized without any further training. This is done by converting the model’s floating-point weights and biases to lower-precision integers, such as 8-bit or 16-bit values. PyTorch provides tools for post-training quantization, making it easy to apply this technique to existing models.
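As a rough sketch of what this can look like in code, the snippet below applies PyTorch's eager-mode dynamic post-training quantization to the linear layers of a toy model (the model itself is made up for illustration):

```python
import torch
import torch.nn as nn

# Toy model used only for illustration.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic post-training quantization: nn.Linear weights are stored as int8,
# while activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)  # same interface, smaller/faster linear layers
```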
Quantization-Aware Training
Quantization-aware training is a technique where the model is trained with the knowledge that it will later be quantized: quantization is simulated (fake-quantized) during the forward pass, so the model learns quantization-friendly representations and retains better accuracy after quantization. PyTorch provides support for quantization-aware training through tools such as the “torch.quantization” module, making it possible to train models with quantization in mind from the beginning.
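Below is a minimal eager-mode sketch of the quantization-aware training workflow, assuming a toy model wrapped with QuantStub/DeQuantStub and the "fbgemm" (x86) backend; the actual training loop is omitted:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    # Toy model for illustration: QuantStub/DeQuantStub mark where tensors
    # enter and leave the quantized region of the network.
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyModel()
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules so training "sees" quantization error.
torch.quantization.prepare_qat(model, inplace=True)

# ... run your usual training loop on the prepared model here ...

# After training, convert to an actual int8 model for inference.
model.eval()
quantized_model = torch.quantization.convert(model)
```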
Overall, quantization is an important technique for optimizing neural network models for deployment on resource-constrained hardware. With PyTorch’s support for both post-training quantization and quantization-aware training, it is easy to apply quantization to your models and achieve better performance on a wide range of devices.
Really high-quality exposition. Also, thanks for providing the slides.
I keep hearing about quantization so much, yet this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!!
I’m hoping that you will have a video on Mixtral MoE soon 😊
Thanks
Thanks, great video 🐱🏍🐱🏍
But I have a question: since we already dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two dequantization steps in a row.
Beautiful again, thanks for sharing these.
Umar, I just wanna thank you for all your stuff. You gave me back the desire to master deep learning. If one day you could make a video on what LangChain is (even though it is not in the deep learning domain), it would be amazing! Thank you again!
Umar, thanks for all your content. I've stepped up a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does that mean you need to dequantize the prediction? If yes, you do not talk about it, right? Can I have your email to get more details, please?
Could you please make a video on how to build tokenizers for languages other than English?
If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤
I noticed a small correction needs to be done at timestamp @28:53 [slide: Low precision matrix multiplication]. In the first line, the dot products are between each row of X and each column of Y [instead of Y, it should be W – the weight matrix].
Assalamu alaikum, brother.
Thanks for your videos!
I have been working on game development using pygame for a while and I just want to start deep learning in Python, so could you make a roadmap video? Thank you again
You are a hidden gem, my brother
You're making all of my lecture materials pointless! (But keep up the great work!)
Thank you, super clear
One request: could you explain mixture of experts? I bet you can break down the explanation well.
Thank you for sharing.
Great video!
Just a clarification: on modern processors floating point operations are NOT slower than integer operations. It very much depends on the exact processor and even then the difference is usually extremely small compared to the other overheads of executing the code.
HOWEVER, the reduction in size from 32-bit float to 8-bit integer does itself make the operations a lot faster. The cause is twofold:
1) Modern CPUs and GPUs are typically memory-bound, so, simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to arrive to shrink by 4x as well.
2) Pretty much all machine learning code is vectorized. This means that instead of executing each instruction on a single number, the processor grabs N numbers and executes the instruction on all of them at once (SIMD instructions).
However, most processors don't fix N directly; instead they fix the total number of bits all N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32-bit to 8-bit numbers we can process 4x more values per instruction! This is likely what you mean by the operations being faster.
Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).
Very helpful, thank you so much
Very good
Keep going. This is perfect. Thank you for the effort you are making.