GPT-Fast – blazingly fast inference with PyTorch (w/ Horace He)
Are you looking for a way to speed up inference for your PyTorch models? Look no further than gpt-fast, an open-source project from the PyTorch team, presented here by Horace He, that demonstrates blazingly fast LLM inference using nothing but native PyTorch.
Rather than a separate library, gpt-fast is a small, readable codebase that stacks a handful of native PyTorch optimizations on top of a standard transformer, delivering large improvements in inference speed with minimal impact on model quality.
Key Features of GPT-Fast:
- torch.compile with CUDA graphs (mode="reduce-overhead") to remove per-token CPU and kernel-launch overhead (a minimal sketch follows this list)
- int8 and int4 weight-only quantization to cut the bytes read per generated token
- Speculative decoding and tensor parallelism for further speedups
- Plain, hackable PyTorch code that is easy to read and adapt to existing projects
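To make the torch.compile piece concrete, here is a minimal, hypothetical sketch (TinyDecoder is a stand-in model, not code from the gpt-fast repo): a toy per-token decode step compiled with mode="reduce-overhead", which captures the step in a CUDA graph and removes most per-token launch overhead.

import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    # Stand-in model; gpt-fast uses a full Llama-style transformer with a static KV cache.
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, token):
        return self.proj(self.emb(token))

model = TinyDecoder().cuda().eval()

# Compile the decode step once; later calls reuse the compiled, CUDA-graph-captured step.
decode_one_token = torch.compile(model, mode="reduce-overhead", fullgraph=True)

token = torch.tensor([[1]], device="cuda")
with torch.no_grad():
    for _ in range(16):
        logits = decode_one_token(token)   # single fused step per token
        token = logits.argmax(dim=-1)      # greedy next-token selection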
How to Get Started with GPT-Fast:
gpt-fast is not a pip package; it is a repository you clone from GitHub:
git clone https://github.com/pytorch-labs/gpt-fast
From there, follow the repository's README to download model weights and run the included generation script. Because everything is plain PyTorch, the same techniques can be ported directly into your own inference code.
Because the optimizations are written in plain PyTorch, you can adopt them piecemeal or all at once, making gpt-fast a useful reference for anyone optimizing PyTorch models for speed and efficiency.
Give GPT-Fast a try today and experience the difference it can make in your PyTorch projects!
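For the weight-only quantization piece, here is an equally minimal, hypothetical sketch (Int8WeightOnlyLinear is illustrative, not gpt-fast's actual quantization code): weights are stored in int8 with a per-output-channel scale and dequantized on the fly, which roughly halves the bytes read per token compared to fp16; that is the quantity that matters when decoding is memory-bandwidth bound.

import torch
import torch.nn as nn

class Int8WeightOnlyLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()                          # [out, in]
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # per-output-channel scale
        self.register_buffer("w_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)

    def forward(self, x):
        # Dequantize, then matmul in the activation dtype.
        w = self.w_int8.to(x.dtype) * self.scale.to(x.dtype)
        return x @ w.t()

lin = nn.Linear(4096, 4096, bias=False)
qlin = Int8WeightOnlyLinear(lin)
x = torch.randn(1, 4096)
print(qlin(x).shape)   # torch.Size([1, 4096])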
Super interesting talk!! Do you guys have any idea how the compilation-generated decoding kernel compares against custom kernels like Flash-Decoding or Flash-Decoding++?
But CTranslate2, as I understand it, is still faster?
Wow! It was very educational and practical!
I liked the graphics in the presentation!
Great job by both of you!
Thanks!
One thing that was not super clear to me: are we loading the next weight matrix (assuming there is enough SRAM) while the previous matmul+activation is being computed?
Wow, this presentation was excellent. Straight to the point. No over-complicating, no over-simplifying, no trying to sound smart by obscuring simple things. Thank you Horace!
I didn't understand one of the points made. On a couple of occasions Horace mentions that we are loading all the weights (into the registers, I assume) with every token – that's also what the diagram shows at https://youtu.be/18YupYsH5vY?t=1972 . Is that what's happening? Can the registers hold all the model weights at once? If that were the case, why do we need to load them every time instead of leaving them untouched? I hope that's not too stupid of a question.
How does PPL look at int4 quants? Also, given GPTQ, how high is the tps with gpt-fast?
awesome talks, can Triton target TPUs?
Is there any Discord for this channel's community?
Re: the questions about why gpt-fast is faster than the CUDA version: kernel fusion. Merging multiple kernels into one is faster than several hand-written ones.
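(A quick, hypothetical illustration of that fusion point, not taken from the talk: torch.compile turns a chain of pointwise ops into one generated Triton kernel, so the intermediate tensors never round-trip through global memory.)

import torch

def bias_gelu_scale(x, bias, scale):
    # Three pointwise ops; eager mode launches separate kernels and materializes
    # intermediates, while the compiled version fuses them into one kernel.
    return torch.nn.functional.gelu(x + bias) * scale

fused = torch.compile(bias_gelu_scale)

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = fused(x, bias, 0.5)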
Speculative decoding is a major factor, right? If so, it's not a very fair comparison…
Horace He joined us to walk us through what one can do with native PyTorch when it comes to accelerating inference!