Python Django Project – Learn to Build your own News Aggregator Web App

Once you have worked through the Django tutorials, it’s time to put that knowledge into practice and showcase it. In this Python Django project, you will learn to build your own news aggregator web application by integrating Django with other technologies.

Some prerequisites are important first, though.

Prerequisite

You need to have some basic knowledge of these libraries:

  • Django Framework
  • BeautifulSoup
  • requests module

What is a News Aggregator?

It is a web application that aggregates data (news articles) from multiple websites and presents it in one location.

For many people, a news aggregator is an important start to the day.

There are many publications and news sites online, and they publish their content on multiple platforms. Now, imagine opening 10-20 news sites every day: that is a lot of time spent just gathering information. And information is everything in today’s world.

It can give you leverage over those who don’t have it. Now, is there a way we can make it easier? Yes!

A news aggregator makes this task easier. In a news aggregator, you can select the websites you want to follow, and the aggregator collects the articles for you. Information from various websites is then just a click away.

Otherwise, this task takes up too much time in our schedule.

About the Django Project

A news aggregator is a combination of web crawlers and web applications. Both of these technologies have their implementation in Python. That makes it easier for us.

So, our news aggregator will work in 3 steps:

  1. It scrapes the web for the articles. (In this Django project, we are scraping a website called The Onion, theonion.com.)
  2. Then it stores the article’s images, links, and title.
  3. The stored objects in the database are served to the client. The client gets information in a nice template.

So, that’s how our web app will work.
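The three steps can be sketched in plain Python before any Django code is written. This is only an illustration under stated assumptions: the names Article, scrape, store and serve are hypothetical, the "website" is a hard-coded list, and the "database" is an in-memory list standing in for the Django model introduced later.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    url: str
    image: str


# Stand-in for a real website (assumption: invented data, not real articles).
FAKE_SITE = [
    {"title": "First story", "url": "https://example.com/1", "image": "https://example.com/1.jpg"},
    {"title": "Second story", "url": "https://example.com/2", "image": "https://example.com/2.jpg"},
]


def scrape(site):
    """Step 1: turn raw page data into Article records."""
    return [Article(item["title"], item["url"], item["image"]) for item in site]


def store(database, articles):
    """Step 2: save each article (here, by appending to a list)."""
    database.extend(articles)


def serve(database):
    """Step 3: return the stored articles, newest first, ready for a template."""
    return database[::-1]


database = []
store(database, scrape(FAKE_SITE))
for article in serve(database):
    print(article.title, "->", article.url)
```

In the real project, each of these stand-ins is replaced by requests and BeautifulSoup (step 1), a Django model (step 2), and a Django view with a template (step 3).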

You can find the complete source code of this Django project in this GitHub repository:

News Aggregator Files

This is a screenshot of the page.

news aggregator interface - django project

This might not look like much, but there are lots of things we need to do before we get to this page.

Also, check out the homepage of The Onion website before proceeding.

theonion website page - django project

So, let’s get started.

Steps to Build Django Project on News Aggregator App

Before starting, we will need to install some of the libraries. We will install the requests and BeautifulSoup libraries. You can install them using pip.

pip install bs4
pip install requests

Now, we will make a new Python Django project named DataFlair_NewsAggregator. Then we will make a new application named news.

Commands:

django-admin startproject DataFlair_NewsAggregator

Move into the folder where manage.py is present.

python manage.py startapp news

Writing Models

We will be storing the urls and articles in our database. For that, we will need the model.

In news/models.py, create these models.

Code:

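from django.db import models


class Headline(models.Model):
    # Title of the article
    title = models.CharField(max_length=200)
    # URL of the article image; optional, so it may be empty
    image = models.URLField(null=True, blank=True)
    # URL of the original article
    url = models.TextField()

    def __str__(self):
        # Human-readable name shown in the admin and the shell
        return self.title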

Our models will be able to store three things:

  1. Title of the article
  2. URL of the origin or source
  3. URL of the article image

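We are using simple model fields for that purpose. The image field can be blank. The __str__() method returns the string representation of the object. These are simple Django concepts. For Django to create the table, the news app must be listed in INSTALLED_APPS in DataFlair_NewsAggregator/settings.py, and the migrations must be applied by running python manage.py makemigrations followed by python manage.py migrate.

Now, let’s start with the steps for the web crawler.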

Step 1 – Scrape the website

We will scrape the website to get the articles. Web scraping means extracting meaningful data from websites. In this case, we will extract the articles from The Onion website.

To scrape the website, we will use the BeautifulSoup and requests libraries (the bs4 and requests packages), which are commonly used for web crawling.

Open news/views.py file.

First, import these libraries before using them.

Code:

import requests
from django.shortcuts import render, redirect
from bs4 import BeautifulSoup as BSoup
from news.models import Headline

Our first view function is scrape().

Code:

def scrape(request):
    # A session lets us set headers once and reuse them for every request
    session = requests.Session()
    session.headers = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
    url = "https://www.theonion.com/"

    # verify=False skips TLS certificate checks; avoid this in production
    content = session.get(url, verify=False).content
    soup = BSoup(content, "html.parser")
    # Each matching <div> wraps one article
    News = soup.find_all('div', {"class": "curation-module__item"})
    for article in News:
        main = article.find_all('a')[0]
        link = main['href']
        # srcset lists several image sizes; pick one by its position
        image_src = str(main.find('img')['srcset']).split(" ")[-4]
        title = main['title']
        # Store the scraped values as a new Headline row
        new_headline = Headline()
        new_headline.title = title
        new_headline.url = link
        new_headline.image = image_src
        new_headline.save()
    return redirect("../")

This view function uses the requests and bs4 modules along with Django’s shortcuts.

We have imported the Headline model from news.models, along with the other libraries.

The view function scrape() scrapes the news articles from the URL “theonion.com”. It relies on a few requests settings that keep errors from stopping the program. In the code above, verify=False tells requests to skip TLS certificate verification, which avoids certificate errors but weakens security, so it is best kept to experiments like this one.

The first variable is the session object of the requests module. A session is used to make the connection to the server; it is an abstraction provided by the requests library.

The session has a headers attribute holding the HTTP headers that our function sends when it requests the webpage. The scraper acts like a normal HTTP client to the news site. The User-Agent key is important here.

This HTTP header tells the server about the client. We are using the Googlebot User-Agent for that purpose, so when our client requests anything, the server sees the request as coming from Googlebot. You can also configure it to look like a browser User-Agent.

That won’t affect our use case though. After that, we introduce the content variable, where we store the webpage (the response body) returned by the server. Now, BeautifulSoup comes in.

BeautifulSoup is a library that can extract data from HTML web pages. We create a soup object by passing the HTML page, along with the name of the HTML parser to use.

The parser turns the HTML into a BeautifulSoup object, through which we can access the HTML elements and their text.

The News variable holds all the <div> elements of a particular class. We found this class by inspecting the webpage of The Onion, and we select the elements that hold the information we need.

div class - django project

As you can see from this image, by inspecting the element, we find a common class. The rest is just extracting information from that element.

We get three elements of this class, which means three articles are present on the page. These articles share a very general structure. Now, we will extract the information we need: the title, the link, and the image link.

Using a for loop, we iterate over the matched elements. In the loop, the main variable holds the anchor tag that links to the original webpage. Since each returned <div> has only one <a> tag, we get most of our work done here.

The <a> tag contains the title and href of the original link. We can access the href of the <a> tag by writing main['href'].

Similarly, we can extract the title with main['title']. Remember that main is the BeautifulSoup object for the <a> tag.
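To try these lookups without touching the network, here is a minimal sketch that runs the same BeautifulSoup calls against a hard-coded HTML string. The markup is invented for illustration and only imitates the structure described above; the real page may look different or change over time.

```python
from bs4 import BeautifulSoup

# Invented sample markup (assumption): two article blocks and one unrelated <div>.
SAMPLE_HTML = """
<div class="curation-module__item">
  <a href="https://example.com/story-1" title="Story one">Read story one</a>
</div>
<div class="curation-module__item">
  <a href="https://example.com/story-2" title="Story two">Read story two</a>
</div>
<div class="unrelated">Not an article</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every <div> whose class attribute includes the given class.
items = soup.find_all("div", {"class": "curation-module__item"})

results = []
for item in items:
    main = item.find_all("a")[0]  # the first (and here only) <a> in the block
    results.append((main["title"], main["href"]))  # attributes are read like dict keys

print(results)
```

Running a scraper against a saved or hand-written snippet like this is a convenient way to check the selectors before pointing it at a live site.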

Then we find the image URL. To get image_src, we find the <img> inside main. This follows from the webpage layout, not from any syntax rule.

This is simply how the website has built its page. We are finding the elements and accessing them appropriately. You need to have the basics of BeautifulSoup and HTML clear.

So, once we get the image, we extract the srcset attribute from the same.

img srcset - python django project

The srcset attribute contains various sizes of the image, as we can see in the screenshot. We need to extract a size that is big enough for us, so we select the one with a width of 800.

The srcset value is a single string holding each image source and its width. As you can see in the code, we call split() on the string to get a list, which we can then index. We use index [-4], which gives us the URL of the 800-width image. That URL is stored as a string in the image_src variable.
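The fixed index [-4] only works while the site keeps the same number and order of image sizes. As a hedged alternative, the sketch below pairs each URL with its width and looks the image up by width instead. The srcset value is invented for illustration, the helper names parse_srcset and pick_image are hypothetical, and the code assumes every entry has the form "URL WIDTHw" separated by spaces.

```python
def parse_srcset(srcset):
    """Return a list of (url, width) pairs from a srcset string.

    Assumes tokens alternate between a URL and a descriptor such as "800w,".
    Splitting on whitespace (not commas) keeps URLs that contain commas intact.
    """
    tokens = srcset.split()
    pairs = []
    for url, descriptor in zip(tokens[0::2], tokens[1::2]):
        width = int(descriptor.rstrip(",").rstrip("w"))
        pairs.append((url, width))
    return pairs


def pick_image(srcset, wanted_width=800):
    """Return the URL with exactly the wanted width, or None if it is absent."""
    for url, width in parse_srcset(srcset):
        if width == wanted_width:
            return url
    return None


# Invented example value with four sizes.
srcset = (
    "https://example.com/img-80.jpg 80w, "
    "https://example.com/img-470.jpg 470w, "
    "https://example.com/img-800.jpg 800w, "
    "https://example.com/img-1600.jpg 1600w"
)

by_index = srcset.split(" ")[-4]    # the positional approach used in the view
by_width = pick_image(srcset, 800)  # the lookup-by-width approach
print(by_index)
print(by_width)
```

With this particular four-entry string both approaches return the same URL; the difference only shows when the site adds, removes, or reorders sizes.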

Step 2 – Store the data in the database

We have made our model Headline for this purpose. Now we will be performing the standard storing procedure. We create a new Headline() object. There we fill the corresponding fields.

Code:

new_headline = Headline()
new_headline.title = title
new_headline.url = link
new_headline.image = image_src
new_headline.save()

This is the standard code for storing an object in the database.

Step 3 – Serve the stored database objects

This step is very easy too. We create a new view function for this purpose, news_list(). The code goes in the news/views.py file.

Code:

def news_list(request):
    headlines = Headline.objects.all()[::-1]
    context = {
        'object_list': headlines,
    }
    return render(request, "news/home.html", context)

This is simple Django code. We fetch all the Headline objects from the database. Since we want the latest items on top, we reverse the list. Then we pass the list in a context, which is given to home.html in the news/templates/news folder.

Writing Templates

Here is the code for home.html. In this template, we are using bootstrap and HTML. The code in the home.html:

<!DOCTYPE html>
<html>
<head>
    <title>DataFlair News Aggregator</title>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
</head>
<body>
    <div class="jumbotron text-center">
        <h1>DataFlair News Aggregator</h1>
        <a href="{% url 'scrape' %}" class="btn btn-success">Get my morning news</a>
    </div>

    <!-- One card per stored headline -->
    <div class="card-columns" style="padding: 10px; margin: 20px;">
        {% for object in object_list %}
        <div class="card" style="width: 18rem; border: 5px black solid;">
            <img class="card-img-top" src="{{ object.image }}">
            <div class="card-body">
                <a href="{{ object.url }}"><h5 class="card-title">{{ object.title }}</h5></a>
            </div>
        </div>
        {% endfor %}
    </div>
    <script src="https://code.jquery.com/jquery-3.3.1.min.js" integrity="sha256-FgpCb/KJQlLNfOu91ta32o/NMZxltwRo8QtmkMRdAu8=" crossorigin="anonymous"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js" integrity="sha384-ApNbgh9B+Y1QKtv3Rn7W3mgPxhU9K/ScQsAP7hUibX39j7fakFPskvXusvfa0b4Q" crossorigin="anonymous"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js" integrity="sha384-JZR6Spejh4U02d8jOt6vLEHfe/JQGiRRSQQxSfFWpi1MquVdAyjUar5+76PVCmYl" crossorigin="anonymous"></script>
</body>
</html>

Basic knowledge of Bootstrap and HTML helps here. It’s a simple Django template.

At line 10, we provide a link to the scrape view function. Once we define our URLs, you will have a clearer picture.

Then at line 15, our news logic begins. A for loop prints the news objects one by one.

Configuring urls.py

Lastly, we configure our URLs. Create a file news/urls.py and paste this code inside it.

Code:

from django.urls import path
from news.views import scrape, news_list
urlpatterns = [
  path('scrape/', scrape, name="scrape"),
  path('', news_list, name="home"),
]

Then we also need to connect this to the main urls.py. Open the DataFlair_NewsAggregator/urls.py file and paste this code inside it, or update the existing code to match.

Code:

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include("news.urls")),
]

This is the normal Django code to connect urls.

So, our Python example project is complete. Let’s run it and see the homepage. Start the development server with python manage.py runserver and open the site in your browser; the homepage is served by the news_list view.

news_list - django project

Output:

news aggregator page

You can click on the links. That will take you to the original article page.
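One caveat: if a site uses relative links in its href attributes, storing them unchanged makes the browser resolve them against your own server (for example localhost:8000) instead of the news site. A possible fix, sketched here with the standard library's urllib.parse.urljoin, is to resolve each scraped link against the page URL before saving it. The helper name absolute_link and the example paths are made up for illustration.

```python
from urllib.parse import urljoin

# The page that was scraped; relative links are resolved against it.
BASE_URL = "https://www.theonion.com/"


def absolute_link(href, base=BASE_URL):
    """Return an absolute URL for href; already-absolute links pass through unchanged."""
    return urljoin(base, href)


relative = absolute_link("/some-article-123")                # hypothetical relative path
absolute = absolute_link("https://example.com/full-link")    # already absolute
protocol_relative = absolute_link("//cdn.example.com/a.jpg") # inherits the base scheme

print(relative)
print(absolute)
print(protocol_relative)
```

In the scrape view, this would mean passing the scraped href through such a helper before assigning it to the url field.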

Now, you can configure this to gather articles from your favorite websites. Be wary of blocks, though. In many cases, bots are not legally allowed to scrape content. So, web scraping comes with its own costs.
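One common courtesy check before scraping a site is to read its robots.txt file. The sketch below uses the standard library's urllib.robotparser on an invented robots.txt, so it runs offline; the user agent name MyNewsBot and the rules are assumptions for illustration. Passing this check does not settle the legal question, which depends on the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content (assumption): everything allowed except /private/.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch answers: may this user agent request this URL?
allowed_news = parser.can_fetch("MyNewsBot", "https://example.com/news/")
allowed_private = parser.can_fetch("MyNewsBot", "https://example.com/private/page")

print(allowed_news, allowed_private)
```

Against a live site, the same parser can load the real file with set_url() followed by read() instead of parse().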

But, for our purpose, we now know some very cool basics. We also have a very interesting project to showcase. You can enhance this Django application as much as you can.

Summary

We have successfully completed our first project in Django, combining web scraping with Django. This integration is as easy as invoking a function in Python.

You can make some more projects in Django using the same concepts. Django lets you integrate machine learning too.

How was your experience working on the Django project? Do share in the comment section.

30 Responses

  1. Ekwoge Blaise says:

    I personally enjoyed the project given that I am a beginner in django, this has boosted my morals and confidence level. I would love to learn even more and get even more projects to get me going as I grow up my career. Thanks again for this tutorial

  2. SP says:

    how I replace the news articles with another different news, I am not able to change the news articles? plz explain..

    • DataFlair Team says:

      To refer to any other news website you can change the url in this line: url = "https://www.theonion.com/"

      • Oscar says:

        That did not work. I even used invalid URLs and it keeps bringing the same old news.

        I noticed db.sqlite3 had a record in news_list table which I suspect is the source and not the URL.

  3. Tino says:

    Hi, thank you for this article it is very informative. I’m struggling with my models, where dol put this code:
    new_headline = Headline()
    new_headline.title = title
    new_headline.url = link
    new_headline.image = image_src
    new_headline.save()

  4. Yash Rai says:

    no need to put it any where, it is already present in views.py

  5. Khushbu says:

    I am getting an error which says ‘Headline’ has no ‘objects’ member in views.py
    Please guide me

    • DataFlair Team says:

      Check your Headline class in models.py file, Its name should strictly match with the one used in views.py file.

  6. cristine says:

    I am getting the very old content of the onion site which is being shown in blog not the latest one.How to change that?

  7. Josh says:

    Thanks for the tutorial, very helpful. However I am only getting the news sites as shown when this tutorial was made, it doesnt give the news on the website so therefor isnt actually giving the morning news…

    Do you know of any fixes? Thanks!

    • DataFlair Team says:

      Change the url specified in the views.py file and after that, you might need to change some more lines in views.py to properly scrape from the new website. To change this, you need to observe the HTML code of the new website. To view the HTML code of any website, right-click anywhere on its home page and then select Inspect from the list displayed.

  8. Preksha Rai says:

    Same here.
    I am also not able to get the news after clicking on get morning news. Did you resolve it?

    • DataFlair Team says:

      Change the url specified in the views.py file and after that, you might need to change some more lines in views.py to properly scrape from the new website. To change this, you need to observe the HTML code of the new website. To view the HTML code of any website, right-click anywhere on its home page and then select Inspect from the list displayed.

  9. mani kumar says:

    i need analysis phase and functional and non-functional requirements

  10. Saif says:

    Do you have a hospital aggretor based on this?

  11. Sakshi says:

    i cannot redirect to the original news link as the url which I’m getting after scrapping is not complete and after adding the base url in front of it ,it is not working it turns like localhost:8000/url_which_I’m_getting after scrapping instead of the original link to the news page. Can you please help me with this?

  12. Zafar Imam says:

    How to provide tutorial material, soft copy or hardcopy

  13. Manoj says:

    This project is about collecting the news articles from various websites at one place but here we are only collecting the articles from single website

  14. Safa Imam says:

    I personally enjoyed the project but I am getting this error [The view topProd.views.scrape didn’t return an HttpResponse object. It returned None instead]

  15. Saad says:

    Thank you for this tutorial! I was looking for a news aggregator project written in python and this one is a good example. I have a question. Basically a News Aggregator scrapes the web. Does it require a proxy as long as it scrapes only link, title and image?

  16. Zecil says:

    Can I collect news articles from multiple websites at the same time in this program?
    Also I’m facing difficulty collecting feeds from another website and I know that I have to change the requirement in views.py but I don’t know exactly what to observer in the html code while inspecting.

  17. Aish Patil says:

    please give me full code

  18. tharicq says:

    from which file can i run this project?

  19. Tejeswar says:

    Hello,I am getting an error that “django.db.utils.DatabaseError: file is not a database”.can you tell me how can i resolve it

  20. Oscar says:

    Several people asked the same question without a solid answer. This isn’t even scrapping new content, it seems like it’s reading from the Sqlite3 DB. Anyways, thanks for sharing.
