Tutorial on Web Scraping with Python, NoSQL, and FastAPI with Scheduled Execution


In this tutorial, we will cover how to build a web scraping application using Python, NoSQL, and FastAPI. We will use FastAPI to create a simple web server that scrapes a website on a schedule and saves the data to a NoSQL database.

Prerequisites

To follow along with this tutorial, you will need the following tools installed on your machine:

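  • Python 3.x
  • FastAPI and Uvicorn (the ASGI server used to run the app)
  • MongoDB (or any other NoSQL database of your choice) and the PyMongo driver
  • Requests library
  • Beautiful Soup (the beautifulsoup4 package, imported as bs4)
  • APScheduler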

Step 1: Set up the environment

First, create a new directory for your project and create a virtual environment inside it. You can do this by running the following commands:

mkdir web_scraping_app
cd web_scraping_app
python3 -m venv venv

Step 2: Install dependencies

Activate the virtual environment. On macOS or Linux, run:

source venv/bin/activate

On Windows, run venv\Scripts\activate instead.

Then, install the required libraries using pip:

pip install fastapi uvicorn pymongo requests beautifulsoup4 apscheduler
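
To confirm the installation worked before writing any application code, a small check like the one below can help. It is a minimal sketch: the helper name missing_modules and the list REQUIRED_MODULES are made up for this example. It lists import names rather than pip names, which matters for Beautiful Soup because scraper.py imports it as bs4.

```python
from importlib.util import find_spec

# Import names (not pip names) used by this tutorial's modules.
# "bs4" is the import name of the beautifulsoup4 package.
REQUIRED_MODULES = ["fastapi", "uvicorn", "pymongo", "requests", "apscheduler", "bs4"]


def missing_modules(names):
    """Return the top-level modules in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

Run it with the virtual environment active. Any name it reports as missing needs to be installed before the later steps will work.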

Step 3: Create a scraper module

Create a new Python module called scraper.py and add the following code to it:

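import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    # Fetch the page; the timeout keeps a scheduled job from hanging forever
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Add your scraping logic here; as a placeholder, record the page title
    data = {
        "url": url,
        "title": soup.title.string if soup.title else None,
    }

    return data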
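
What the scraping logic looks like depends on the target page. As one illustration, the sketch below pulls the page title and all link targets out of an HTML string. To keep it runnable without extra installs, it uses the standard library's html.parser in place of BeautifulSoup. With BeautifulSoup, the same fields would typically come from soup.title.string and soup.find_all("a", href=True). The class name TitleAndLinks, the function extract_fields, and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href="..."> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_fields(html):
    """Return a dict that is ready to be stored as one database document."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


sample = (
    "<html><head><title> Demo page </title></head>"
    "<body><a href='/a'>A</a><a>no href</a><a href='/b'>B</a></body></html>"
)
print(extract_fields(sample))
```

Returning a plain dictionary keeps the result easy to store, since document databases such as MongoDB accept dictionaries directly.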

Step 4: Set up the FastAPI app

Create a new Python module called main.py and add the following code to it:

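from fastapi import FastAPI
from apscheduler.schedulers.background import BackgroundScheduler
from scraper import scrape_website

# Page to scrape; replace this placeholder with your target URL
URL = "https://example.com"

app = FastAPI()

scheduler = BackgroundScheduler()

@app.on_event("startup")
def start_scheduler():
    # Register the hourly job, then start the scheduler along with the app
    scheduler.add_job(scrape_website, "interval", hours=1, args=[URL])
    scheduler.start()

@app.on_event("shutdown")
def stop_scheduler():
    # Stop the scheduler's background thread when the server exits
    scheduler.shutdown()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)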
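
The scheduled job needs a target URL and an interval. Rather than hardcoding them in main.py, one option is to read them from environment variables. The sketch below assumes two variable names, SCRAPE_URL and SCRAPE_INTERVAL_HOURS, which are invented for this example, as are the function name load_settings and the placeholder default URL.

```python
import os


def load_settings(env=None):
    """Read the scrape target and interval, falling back to defaults."""
    env = os.environ if env is None else env
    url = env.get("SCRAPE_URL", "https://example.com")  # placeholder default
    hours = int(env.get("SCRAPE_INTERVAL_HOURS", "1"))
    if hours < 1:
        raise ValueError("SCRAPE_INTERVAL_HOURS must be at least 1")
    return {"url": url, "hours": hours}


settings = load_settings({"SCRAPE_URL": "https://example.org/news",
                          "SCRAPE_INTERVAL_HOURS": "6"})
print(settings)
```

The returned values can then be passed to the scheduler, for example scheduler.add_job(scrape_website, "interval", hours=settings["hours"], args=[settings["url"]]).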

Step 5: Connect to the NoSQL database

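Modify the scraper.py module to save the scraped data to a NoSQL database. For example, if you are using MongoDB, you can add the following code to the module, then call save_to_database(data) inside scrape_website just before the return statement so that each scheduled run stores its result: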

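import pymongo

# Connect to a MongoDB server running locally on the default port
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["web_scraping_app"]
collection = db["scraped_data"]

def save_to_database(data):
    # data must be a dict; insert_one adds an "_id" key to it if none is present
    collection.insert_one(data)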
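
One way to tie scraping and saving together is a small job function that the scheduler calls. The sketch below is an assumption about how you might wire it, not part of the modules above: run_scrape_job is an invented name, and the scrape and save callables are passed in so the example runs without a network connection or a database. It also adds a scraped_at timestamp and logs failures instead of letting them pass unnoticed.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("scrape_job")


def run_scrape_job(url, scrape, save):
    """Scrape `url`, timestamp the result, and store it; return True on success."""
    try:
        data = scrape(url)
        record = dict(data)  # copy so the scraper's dict is left untouched
        record["scraped_at"] = datetime.now(timezone.utc)
        save(record)
        return True
    except Exception:
        # Log the traceback; the next scheduled run gets a fresh attempt
        logger.exception("Scrape of %s failed", url)
        return False


# Stand-ins for scrape_website and save_to_database
stored = []
ok = run_scrape_job("https://example.com", lambda u: {"url": u}, stored.append)
print(ok, len(stored))
```

In the application, the job could be registered as scheduler.add_job(run_scrape_job, "interval", hours=1, args=[url, scrape_website, save_to_database]), where url is the page you want to scrape.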

Step 6: Run the application

Start the FastAPI server by running:

python main.py

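The server now listens on port 8000, and the scheduler runs scrape_website every hour. With an interval trigger, the first run happens one hour after startup, not immediately. Once scrape_website calls save_to_database, each run stores its result in the NoSQL database.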
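
The timing of an interval job can be worked out by hand. The sketch below assumes APScheduler's default behavior for an interval trigger with no start date, where the first run is one full interval after the job is added. The function name upcoming_runs and the start time are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def upcoming_runs(started_at, interval, count):
    """Run times of an interval job whose first run is one interval after start."""
    return [started_at + interval * n for n in range(1, count + 1)]


start = datetime(2024, 1, 1, 9, 0)  # arbitrary example start time
for run in upcoming_runs(start, timedelta(hours=1), 3):
    print(run.isoformat())
```

If you would rather scrape as soon as the server starts, add_job accepts a next_run_time argument, and passing datetime.now() should schedule the first run right away.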

Conclusion

In this tutorial, we have seen how to build a web scraping application using Python, NoSQL, and FastAPI. You can customize this application by adding more complex scraping logic, using a different NoSQL database, or scheduling the scraping at different intervals. Happy coding!

@CodingEntrepreneurs
1 month ago

00:00:00 Welcome
00:00:58 Demo
00:12:58 Overview & Requirements
00:15:22 Project Setup
00:19:26 Start the Python & Cassandra Integration
00:25:11 Configure Python cassandra-driver
00:30:13 Your First Cassandra Model
00:36:08 Create Data using our Cassandra Model
00:43:20 Adding a New Column to an Existing Model
00:46:26 Using UUID1 as Primary Key
00:55:19 Using Jupyter with Cassandra Models
01:06:35 Using Pydantic for Data Validation and Cleaning
01:14:22 FastAPI & Environment Variables
01:21:26 FastAPI + Cassandra & Pydantic
01:36:01 Convert Cassandra UUID Field to Pydantic Datetime Str
01:44:50 Endpoint to Ingest Data for FastAPI & AstraDB
01:56:41 Celery, Redis & Basic Task Offload
02:13:18 Integrate Cassandra Driver with Celery
02:23:47 Running Periodic Tasks
02:35:24 Basic Scraping with Selenium
02:45:35 Selenium & JavaScript Endless Scrolling
02:52:52 requests-html & Parsing Data
03:11:34 Implement the Scrape Client Parser
03:27:20 Putting it all together
03:36:58 Thank you

@thecodingchronicles
1 month ago

@CodingEntrepreneurs is there an easy way to fix the issue on Mac M1 with how creds are extracted from the zip into a temporary folder?

@sahil_singhai
1 month ago

Hey, how do I delete all that old data from the DB?

@3wcdev878
1 month ago

This is pure gold dude, also breaking it up on phases.

@adesegunkoiki897
1 month ago

Please, how do I drop a table if I made an error?

@jorgezapata8540
1 month ago

awesome

@mhmdjavadsbrn4585
1 month ago

Very, very good.
Thanks.

@konstantink3869
1 month ago

I really like how you split such a long tutorial for short parts. Especially, this music transition really draws my attention back. Love this beat, amazing job!

@tz2014
1 month ago

It's amazing, I learned a lot.

@hayat_soft_skills
1 month ago

Awesome! it's my wish to learn more from you. Thanks!

@MrMardok
1 month ago

Justin, for a real web scraping project (a price comparison site), which would you prefer to use: this configuration or a Django REST API + React frontend?

@edinhofilho22
1 month ago

I had some issues with Apple's new architecture, like some dependencies and packages on different architecture.
But the tutorial is insane, thanks

@vincentle8395
1 month ago

It would be superb if you could provide an alternative to Celery, because Celery does not seem to work on Windows. I had to stop following the video because I couldn't make it work. But thanks a lot, the other things helped me big time.

@bumblebee2012able
1 month ago

Hi, could you please recommend an online or PDF book for learning FastAPI along with Cassandra? Thanks so much.

@codetrap3036
1 month ago

Hey Justin, I'm from India and I recently started watching your videos. Your videos are amazing. Love from India. 🙂

@nateriver8261
1 month ago

Hey! Thank you so much! Can you show an example of how to scrape AJAX pagination with Selenium?

@nateriver8261
1 month ago

Amazing tutorial, except the music between chapters ))

@official.letsfeellove
1 month ago

Hello sir, I really love your videos. I can't solve the Django REST framework email verification and password reset problem. Can you please help us with it?

@sukumarsanu7169
1 month ago

Is it really necessary to clone the git repo?

@asalvats
1 month ago

About the UUID1 datetime: that day 'started' the Gregorian (current) calendar. BTW, thanks for this great tutorial!