How To Build a Django URL Data Scraper

In this tutorial, we’ll walk through the process of creating a Django web application for scraping data from a given URL. We’ll use Python, Django, and BeautifulSoup to build a simple yet effective web scraper. The application will allow users to input a URL, scrape data from the provided link, and display information such as the title, paragraphs, and extracted URLs.

Prerequisites

Before we begin, ensure that you have Python, Django, BeautifulSoup, and Requests installed.

Installation

  1. Install Python: If you don’t have Python installed, download and install it from python.org.
  2. Install Django: Install Django using pip, the Python package manager.
    • pip install django
  3. Install BeautifulSoup: Install BeautifulSoup for HTML parsing.
    • pip install beautifulsoup4
  4. Install Requests: Install Requests for fetching pages over HTTP (it is imported in views.py below).
    • pip install requests

Project Setup

Let’s start by setting up our Django project and app:

  1. Create a Django Project:
    • django-admin startproject url_scraper
    • cd url_scraper
  2. Create a Django App:
    • python manage.py startapp scraper_app
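One step that is easy to miss: with Django’s default template settings (APP_DIRS=True), the templates inside scraper_app/templates are only discovered once the app is registered. Add scraper_app to INSTALLED_APPS in url_scraper/settings.py (shown here as an excerpt of the default settings a new project generates):

```python
# url_scraper/settings.py (excerpt)
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'scraper_app',  # register our app so Django finds its templates
]
```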

Writing the Code

scraper_app/forms.py

In the forms.py file inside the scraper_app folder, we define a simple form for user input.

# scraper_app/forms.py
from django import forms

class ScrapeForm(forms.Form):
    link = forms.URLField(label='Enter URL', required=True)

scraper_app/views.py

In the views.py file, we implement the logic for scraping data from the provided URL.

# scraper_app/views.py
from django.shortcuts import render
from .forms import ScrapeForm
from bs4 import BeautifulSoup
import requests

def scrape_data(request):
    if request.method == 'POST':
        form = ScrapeForm(request.POST)
        if form.is_valid():
            link = form.cleaned_data['link']
            # Verify TLS certificates (the default) and avoid hanging on slow hosts
            response = requests.get(link, timeout=10)

            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')

                # Extract data from the HTML using BeautifulSoup
                urls = [a['href'] for a in soup.find_all('a', href=True)]

                # Remove empty and None URLs
                urls = [url for url in urls if url]

                scraped_data = {
                    # Guard against pages with no <title> tag
                    'title': soup.title.text if soup.title else 'No title found',
                    'paragraphs': [p.text for p in soup.find_all('p')],
                    'urls': urls,
                    'url_count': len(urls),
                }

                return render(request, 'scraped_data.html', {'data': scraped_data, 'form': form})
            else:
                return render(request, 'scraped_data.html', {'error': f'Error: {response.status_code}', 'form': form})
    else:
        form = ScrapeForm()

    return render(request, 'scrape_form.html', {'form': form})
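To see what the extraction logic in scrape_data produces without running the full Django app, here is a standalone sketch of the same BeautifulSoup calls run against an inline HTML snippet (the snippet and its contents are made up for illustration). Note how the empty href is kept by find_all but dropped by the filter step:

```python
# Standalone version of the extraction logic used in scrape_data,
# run against an inline HTML snippet instead of a live request.
from bs4 import BeautifulSoup

html = """
<html><head><title>Example Page</title></head>
<body>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="https://example.com/a">A</a>
  <a href="">empty link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.text if soup.title else ""          # page <title>
paragraphs = [p.text for p in soup.find_all("p")]      # all <p> contents
urls = [a["href"] for a in soup.find_all("a", href=True)]
urls = [u for u in urls if u]                          # drop empty hrefs

print(title)
print(paragraphs)
print(urls)
```

Running this prints the title, both paragraphs, and a single URL: the empty href on the second anchor is filtered out, which is exactly why the view keeps that filtering step.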

scraper_app/templates/scrape_form.html

In the templates folder, create a file named scrape_form.html for the input form.

<!-- scraper_app/templates/scrape_form.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Django URL Data Scraper</title>
</head>
<body>
    <h1>Django URL Data Scraper</h1>
    <form method="post" action="{% url 'scrape_data' %}">
        {% csrf_token %}
        {{ form }}
        <button type="submit">Scrape Data</button>
    </form>
</body>
</html>

scraper_app/templates/scraped_data.html

Create another file named scraped_data.html in the templates folder for displaying the scraped data.

<!-- scraper_app/templates/scraped_data.html -->

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Scraped Data</title>

    <!-- Adding a playful and vibrant style -->
    <style>
        body {
            font-family: 'Comic Sans MS', cursive, sans-serif;
            background-color: #ffe6e6;
            color: #333;
            margin: 20px;
            text-align: center;
        }

        header {
            background-color: #ff4500;
            padding: 10px;
            margin-bottom: 20px;
        }

        h1 {
            color: #fff;
            margin-bottom: 20px;
        }

        h2 {
            color: #1e90ff;
            border-bottom: 2px solid #1e90ff;
            padding-bottom: 10px;
            margin-top: 20px;
        }

        ul {
            list-style-type: none;
            padding: 0;
        }

        li {
            margin-bottom: 10px;
            font-size: 18px;
        }

        a {
            color: #4caf50;
            text-decoration: none;
            font-weight: bold;
        }

        form {
            margin-top: 20px;
            display: flex;
            justify-content: center;
        }

        input {
            padding: 10px;
            font-size: 16px;
        }

        button {
            padding: 10px 20px;
            background-color: #ff4500;
            color: #fff;
            border: none;
            cursor: pointer;
            border-radius: 5px;
            font-size: 16px;
        }
    </style>
</head>
<body>
    <header>
        <h1>{{ data.title }}</h1>
        <form method="post" action="{% url 'scrape_data' %}">
            {% csrf_token %}
            {{ form }}
            <button type="submit">Scrape Again</button>
        </form>
    </header>

    {% if error %}
        <p>{{ error }}</p>
    {% endif %}

    {% if data.paragraphs %}
        <h2>Paragraphs:</h2>
        <ul>
            {% for paragraph in data.paragraphs %}
                <li>{{ paragraph }}</li>
            {% endfor %}
        </ul>
    {% endif %}

    {% if data.urls %}
        <h2>Extracted URLs ({{ data.url_count }} found):</h2>
        <ul>
            {% for url in data.urls %}
                <li><a href="{{ url }}" target="_blank">{{ url }}</a></li>
            {% endfor %}
        </ul>
    {% endif %}
</body>
</html>

url_scraper/urls.py

Finally, configure the project’s URLs in the urls.py file.

# url_scraper/urls.py
from django.contrib import admin
from django.urls import path
from scraper_app.views import scrape_data

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', scrape_data, name='scrape_data'),
]

Running the Application

  1. Run the development server:
    • python manage.py runserver
  2. Access the application at http://127.0.0.1:8000.

This tutorial guides you through the process of building a Django URL Data Scraper, providing a foundation for web scraping projects. Customize and expand upon this project based on your specific requirements. Happy coding!

