FastAPI is a popular web framework for building APIs in Python that is known for its fast performance and scalability. When building APIs that utilize Hugging Face ML models, it is important to optimize the performance to handle concurrent users efficiently.
In this tutorial, we will discuss some strategies for optimizing FastAPI for concurrent users when running Hugging Face ML models.
- Use asynchronous programming: FastAPI is built on top of Starlette, a modern ASGI framework, which makes it easy to write asynchronous code using Python’s async/await syntax. By writing async endpoints that await non-blocking calls, you can handle many concurrent users efficiently without blocking the event loop. This is especially important when serving ML models, where each request can take a noticeable amount of time.
Here’s an example of using an async function to handle requests in FastAPI:
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def predict(text: str):
    # predict_model is a placeholder for an async, non-blocking inference call
    result = await predict_model(text)
    return {"prediction": result}
- Utilize background tasks: FastAPI provides support for running background tasks using the BackgroundTasks class. Tasks added this way run after the response has been sent (synchronous tasks are executed in the threadpool), so you can offload work the client does not need to wait for, such as refreshing a model or logging predictions, without blocking the request.
Here’s an example of using background tasks in FastAPI:
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

def load_model():
    # Load or refresh the model here; background tasks run after the
    # response has been sent, so this request must not depend on the result
    ...

@app.get("/")
async def predict(text: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(load_model)
    result = make_prediction(text)  # placeholder for your inference call
    return {"prediction": result}
- Use caching: If your ML model predictions are computationally expensive and the same inputs recur, you can cache the results to avoid recomputing them for every request. Third-party libraries such as cachetools or fastapi_cache make this straightforward and can help improve response times for concurrent users.
Here’s an example of caching predictions in a FastAPI endpoint with cachetools (a simple in-process TTL cache):
from cachetools import TTLCache
from fastapi import FastAPI

app = FastAPI()

# In-process cache holding up to 1024 results, each expiring after one hour
cache = TTLCache(maxsize=1024, ttl=3600)

@app.get("/")
async def predict(text: str):
    # Serve a cached prediction if we already have one for this input
    if (result := cache.get(text)) is not None:
        return {"prediction": result}
    result = make_prediction(text)  # placeholder for your inference call
    cache[text] = result
    return {"prediction": result}
- Scale horizontally: If your API receives a large number of concurrent users, you may need to scale horizontally by running multiple instances of your FastAPI application behind a load balancer. This allows you to distribute the load across multiple processes or servers and handle more concurrent users effectively, as in the sketch below.
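Here’s a minimal sketch of running several worker processes with Uvicorn; it assumes your FastAPI app is defined in main.py as app, and in production you would typically place a load balancer or reverse proxy (such as nginx) in front of one or more such instances:
# launch.py - start several worker processes serving the same app
import uvicorn

if __name__ == "__main__":
    # With workers > 1 the app must be passed as an import string
    uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=4)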
- Use efficient ML models: When working with Hugging Face ML models, consider using more efficient models or model architectures that are optimized for inference speed, such as DistilBERT or MobileBERT. This can help reduce the computational overhead and improve response times for concurrent users.
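For instance, a distilled model can be loaded through the transformers pipeline API. This is a sketch assuming a sentiment-analysis task and the distilbert-base-uncased-finetuned-sst-2-english checkpoint:
from transformers import pipeline

# Load a distilled model once at startup; it is smaller and faster at
# inference than its full-size BERT counterpart
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def make_prediction(text: str):
    # Returns e.g. {"label": "POSITIVE", "score": 0.99}
    return classifier(text)[0]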
By following these strategies for optimizing FastAPI for concurrent users when running Hugging Face ML models, you can build a high-performance API that can handle a large number of requests efficiently.
Yes, programmers should be aware that async/await runs on a single thread and is meant for I/O-bound operations. If you call a blocking I/O operation or a CPU-bound operation inside an async endpoint, the event loop blocks and no other request can be processed until it finishes.
In general, any blocking operation (code that does not complete quickly) should be transformed into a non-blocking one. The easiest case is blocking I/O, for which many async libraries exist (both for local filesystem access and for network calls). For CPU-bound work, you can use multiprocessing, wrapped in an executor that plugs into the async/await machinery, which keeps the adaptation relatively simple:
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def run_cpu_bound_task(x):
    loop = asyncio.get_running_loop()
    # Run the CPU-bound work (cpu_bound_task is your blocking function)
    # in a separate process so the event loop stays responsive
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_bound_task, x)
    return result
FastAPI runs plain "def" endpoints in a threadpool, so it is effectively multi-threaded for them. If you change your endpoints from "async def" to normal "def", then while you are running inference (the Hugging Face API call), the get-stats endpoint should still return instantly.
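As a quick sketch of that suggestion (make_prediction stands in for a blocking Hugging Face call, and /stats is just an illustrative second endpoint):
from fastapi import FastAPI

app = FastAPI()

# Declared with plain `def`, so FastAPI runs it in a worker thread
# and the event loop stays free while inference is in progress
@app.get("/")
def predict(text: str):
    result = make_prediction(text)  # blocking Hugging Face call (placeholder)
    return {"prediction": result}

# This endpoint keeps responding even while predictions are running
@app.get("/stats")
async def stats():
    return {"status": "ok"}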
Hello, what about running another Python subprocess that extracts the data and waiting for its response? That shouldn't block the current thread. Or is it a bad idea?
Great video – how do you scale this to handle 500 requests per second with only 4 workers?
these fastapi + ml videos are great!