FastAPI is a popular web framework for building APIs in Python that is known for its fast performance and scalability. When building APIs that utilize Hugging Face ML models, it is important to optimize the performance to handle concurrent users efficiently.
In this tutorial, we will discuss some strategies for optimizing FastAPI for concurrent users when running Hugging Face ML models.
- Use asynchronous programming: FastAPI is built on top of Starlette, a modern ASGI framework, which makes it easy to write asynchronous code using Python’s async/await syntax. By writing async endpoints that await non-blocking calls, you can handle many concurrent users efficiently without blocking the event loop. This is especially important when serving ML models, where each request can take a noticeable amount of time.
Here’s an example of using an async function to handle requests in FastAPI:
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def predict(text: str):
    # predict_model is a placeholder for an async, non-blocking inference call
    result = await predict_model(text)
    return {"prediction": result}
- Utilize background tasks: FastAPI provides support for running background tasks using the BackgroundTasks class. Tasks added this way run after the response has been sent (synchronous tasks are executed in the threadpool), so you can offload work the client does not need to wait for, such as refreshing a model or logging predictions, without blocking the request.
Here’s an example of using background tasks in FastAPI:
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

def load_model():
    # Load or refresh the model here; background tasks run after the
    # response has been sent, so this request must not depend on the result
    ...

@app.get("/")
async def predict(text: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(load_model)
    result = make_prediction(text)  # placeholder for your inference call
    return {"prediction": result}
- Use caching: If your ML model predictions are computationally expensive and the same inputs recur, you can cache the results to avoid recomputing them for every request. Third-party libraries such as cachetools or fastapi_cache make this straightforward and can help improve response times for concurrent users.
Here’s an example of caching predictions in a FastAPI endpoint with cachetools (a simple in-process TTL cache):
from cachetools import TTLCache
from fastapi import FastAPI

app = FastAPI()

# In-process cache holding up to 1024 results, each expiring after one hour
cache = TTLCache(maxsize=1024, ttl=3600)

@app.get("/")
async def predict(text: str):
    # Serve a cached prediction if we already have one for this input
    if (result := cache.get(text)) is not None:
        return {"prediction": result}
    result = make_prediction(text)  # placeholder for your inference call
    cache[text] = result
    return {"prediction": result}
- Scale horizontally: If your API receives a large number of concurrent users, you may need to scale horizontally by running multiple instances of your FastAPI application behind a load balancer. This allows you to distribute the load across multiple processes or servers and handle more concurrent users effectively, as in the sketch below.
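Here’s a minimal sketch of running several worker processes with Uvicorn; it assumes your FastAPI app is defined in main.py as app, and in production you would typically place a load balancer or reverse proxy (such as nginx) in front of one or more such instances:
# launch.py - start several worker processes serving the same app
import uvicorn

if __name__ == "__main__":
    # With workers > 1 the app must be passed as an import string
    uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=4)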
- Use efficient ML models: When working with Hugging Face ML models, consider using more efficient models or model architectures that are optimized for inference speed, such as DistilBERT or MobileBERT. This can help reduce the computational overhead and improve response times for concurrent users.
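For instance, a distilled model can be loaded through the transformers pipeline API. This is a sketch assuming a sentiment-analysis task and the distilbert-base-uncased-finetuned-sst-2-english checkpoint:
from transformers import pipeline

# Load a distilled model once at startup; it is smaller and faster at
# inference than its full-size BERT counterpart
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def make_prediction(text: str):
    # Returns e.g. {"label": "POSITIVE", "score": 0.99}
    return classifier(text)[0]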
By following these strategies for optimizing FastAPI for concurrent users when running Hugging Face ML models, you can build a high-performance API that can handle a large number of requests efficiently.
Yes, programmers should be aware that async/await runs on a single thread and is meant for I/O-bound operations. If you call a blocking I/O operation or a CPU-bound operation inside an async endpoint, the event loop blocks and no other request can be processed until it finishes.
In general, any blocking operation (code that does not complete quickly) should be transformed into a non-blocking one. The easiest case is blocking I/O, for which many async libraries exist (both for local filesystem access and for network calls). For CPU-bound work, you can use multiprocessing, wrapped in an executor that plugs into the async/await machinery, which keeps the adaptation relatively simple:
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def run_cpu_bound_task(x):
    loop = asyncio.get_running_loop()
    # Run the CPU-bound work (cpu_bound_task is your blocking function)
    # in a separate process so the event loop stays responsive
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_bound_task, x)
    return result
FastAPI runs plain "def" endpoints in a threadpool, so it is effectively multi-threaded for them. If you change your endpoints from "async def" to normal "def", then while you are running inference (the Hugging Face API call), the get-stats endpoint should still return instantly.
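As a quick sketch of that suggestion (make_prediction stands in for a blocking Hugging Face call, and /stats is just an illustrative second endpoint):
from fastapi import FastAPI

app = FastAPI()

# Declared with plain `def`, so FastAPI runs it in a worker thread
# and the event loop stays free while inference is in progress
@app.get("/")
def predict(text: str):
    result = make_prediction(text)  # blocking Hugging Face call (placeholder)
    return {"prediction": result}

# This endpoint keeps responding even while predictions are running
@app.get("/stats")
async def stats():
    return {"status": "ok"}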
Hello, what about running another Python subprocess that extracts the data and waiting for its response? That shouldn't block the current thread. Or is it a bad idea?
Great video – how do you scale this to handle 500 requests per second with only 4 workers?
these fastapi + ml videos are great!