In this tutorial, we will build a web scraping application using Python, NoSQL, and FastAPI. We will use FastAPI to run a simple web server that scrapes a website on a schedule and saves the results to a NoSQL database.
Prerequisites
To follow along with this tutorial, you will need the following tools installed on your machine:
- Python 3.x
- FastAPI
- MongoDB (or any other NoSQL database of your choice)
- Requests library
- BeautifulSoup (beautifulsoup4)
- APScheduler
Step 1: Set up the environment
First, create a new directory for your project and create a virtual environment inside it. You can do this by running the following commands:
mkdir web_scraping_app
cd web_scraping_app
python3 -m venv venv
Step 2: Install dependencies
Activate the virtual environment by running:
source venv/bin/activate
Then, install the required libraries using pip:
pip install fastapi uvicorn pymongo requests beautifulsoup4 apscheduler
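If you prefer to track dependencies in a file, the same packages can go into a requirements.txt and be installed with pip install -r requirements.txt:

# requirements.txt -- same packages as the pip command above
fastapi
uvicorn
pymongo
requests
beautifulsoup4
apscheduler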
Step 3: Create a scraper module
Create a new Python module called scraper.py and add the following code to it:
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Add your scraping logic here; as a placeholder we capture the page title
    data = {"url": url, "title": soup.title.string if soup.title else None}
    return data
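To sanity-check the scraper on its own before wiring up the server, you can call it from a Python shell or a small script; the URL below is just a placeholder:

from scraper import scrape_website

# Placeholder URL -- swap in the site you actually want to scrape
print(scrape_website("https://example.com"))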
Step 4: Set up the FastAPI app
Create a new Python module called main.py and add the following code to it:
from fastapi import FastAPI
from apscheduler.schedulers.background import BackgroundScheduler
from scraper import scrape_website

# URL to scrape on a schedule -- replace this placeholder with your target site
URL = "https://example.com"

app = FastAPI()
scheduler = BackgroundScheduler()

@app.on_event("startup")
def start_scheduler():
    scheduler.add_job(scrape_website, "interval", hours=1, args=[URL])
    scheduler.start()

@app.on_event("shutdown")
def stop_scheduler():
    scheduler.shutdown()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Step 5: Connect to the NoSQL database
Modify the scraper.py module to save the scraped data to a NoSQL database. For example, if you are using MongoDB, you can add the following code to the module:
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["web_scraping_app"]
collection = db["scraped_data"]

def save_to_database(data):
    collection.insert_one(data)
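Note that the scheduled job only calls scrape_website, so the data is persisted only if scrape_website itself calls save_to_database. A minimal way to wire the two together, reusing the placeholder logic from Step 3, looks like this:

def scrape_website(url):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Placeholder scraping logic -- replace with whatever fields you need
    data = {"url": url, "title": soup.title.string if soup.title else None}
    save_to_database(data)  # persist each run's result to MongoDB
    return data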
Step 6: Run the application
Start the FastAPI server by running:
python main.py
Your web scraping application will now run on a schedule, scraping the website every hour and saving the data to the NoSQL database.
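To verify that documents are actually landing in MongoDB, you could also add a small read endpoint to main.py. This is only a sketch: the route name and the 10-document limit are arbitrary choices, and it reuses the collection object defined in scraper.py in Step 5:

from scraper import collection  # MongoDB collection created in Step 5

@app.get("/scraped-data")
def read_scraped_data():
    # Return the 10 most recently inserted documents, hiding Mongo's internal _id
    return list(collection.find({}, {"_id": 0}).sort("_id", -1).limit(10))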
Conclusion
In this tutorial, we have seen how to build a web scraping application using Python, NoSQL, and FastAPI. You can customize this application by adding more complex scraping logic, using a different NoSQL database, or scheduling the scraping at different intervals. Happy coding!