Personalized News Summarizer


Staying up to date with the latest news is necessary in today’s world, but it is also becoming increasingly difficult. There are many news sources, often with difficult-to-navigate websites and with clickbait titles and descriptions that make it hard to determine which articles are worth reading.

The Personalized News Summarizer tries to solve this problem with a simple web page and a clean UI that displays the latest news articles from a variety of sources, along with a summary of each article. The app is hosted on Google Cloud Platform (GCP) and can be accessed here (if my GCP credits have not expired yet).

Getting the News Articles

The first step is to get the news articles. There are many news APIs available, often with a free tier that allows us to retrieve a small number of articles per day, which is sufficient for our use case. I tried several of these APIs.

Finally, I settled on the NewsData API, which provides real-time news articles from a variety of sources and allows us to filter the articles by country, category, and language. We use the API to query for the latest articles based on the categories and number of articles specified by the user. The following code snippet shows how we query the API for the latest articles.

class NewsDataAPI(NewsAPI):
    
    ...

    def news(self, news_query: NewsQuery, **kwargs) -> list[Article]:
        # Map the user-facing category to the name the API expects
        category = self.CATEGORIES[news_query.category]['name']

        headers = {'X-ACCESS-KEY': self.api_key}
        params = {'category': category, 'language': self.language, 'country': self.country}
        # A timeout keeps the function from hanging on a stalled connection
        r = requests.get(self.news_url, headers=headers, params=params, timeout=10)
        r.raise_for_status()
        return [self._parse_article(a) for a in r.json()['results']]
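
The `_parse_article` helper is elided above. A minimal sketch of what such a helper might look like follows. It assumes the response items carry the fields `title`, `link`, `description`, `content`, `pubDate`, `image_url`, and `source_id`, and that `pubDate` uses a `YYYY-MM-DD HH:MM:SS` format; these are assumptions to check against the API documentation. The `Article` dataclass here is hypothetical, with attribute names borrowed from the fields stored in MongoDB later in this post.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Article:
    # Hypothetical container; attribute names mirror the stored document fields.
    api_src: str
    url: str
    title: str
    source: str
    time: Optional[datetime]
    description: Optional[str]
    img_url: Optional[str]
    text: Optional[str]


def parse_article(raw: dict) -> Article:
    """Map one raw result dict to an Article (field names are assumptions)."""
    pub_date = raw.get('pubDate')
    return Article(
        api_src='newsdata',
        url=raw['link'],
        title=raw['title'],
        source=raw.get('source_id', 'unknown'),
        # Assumed timestamp format; missing dates are kept as None
        time=datetime.strptime(pub_date, '%Y-%m-%d %H:%M:%S') if pub_date else None,
        description=raw.get('description'),
        img_url=raw.get('image_url'),
        text=raw.get('content'),
    )


# Made-up sample item, for illustration only
sample = {
    'title': 'Example headline',
    'link': 'https://example.com/story',
    'pubDate': '2023-05-01 08:30:00',
    'source_id': 'example',
}
article = parse_article(sample)
```

Optional fields fall back to `None` so that a sparse response does not raise a `KeyError`; only `title` and `link` are treated as required.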

These articles are then stored in a MongoDB database hosted on MongoDB Atlas. We use the PyMongo library to connect to the database and store the articles in a collection. The article-fetching function runs on GCP as a Cloud Function. Every morning, a Cloud Scheduler job publishes a message to a Pub/Sub topic, which triggers the function to fetch the latest articles.

The Cloud Function reads the user’s preferences from the cloud event that triggered it, while the MongoDB credentials are exposed to the function as environment variables through Secret Manager, as shown below.

@functions_framework.cloud_event
def fetch_news(cloud_event):
    request_b64 = cloud_event.data['message']['data']
    request_str = base64.b64decode(request_b64).decode('utf-8')
    request_json = json.loads(request_str)

    # generate data
    data = generate_data({
        'num_articles' : request_json.get('num_articles', 2),
        'categories' : request_json.get('categories', 'General').split(',')
    })
    
    # write data to MongoDB
    with get_client('db_info.json', 'conn_info.json') as client:
        db = client['news']
        collection = db['articles']

        for category, articles in data.items():
            documents = []
            for article in articles:
                documents.append({
                    'category' : category,
                    'api_src' : article.api_src,
                    'url' : article.url,
                    'title' : article.title,
                    'source' : article.source,
                    'time' : article.time,
                    'description' : article.description,
                    'img_url' : article.img_url,
                    'article_text' : article.text
                })
            # insert_many raises on an empty list, so skip empty categories
            if documents:
                collection.insert_many(documents)
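
To test the payload handling locally without deploying, the decoding step can be exercised on its own. The sketch below mirrors the base64 and JSON decoding in `fetch_news`, including its defaults; `encode_payload` and `decode_preferences` are hypothetical helper names introduced only for this illustration.

```python
import base64
import json


def encode_payload(preferences: dict) -> dict:
    """Build the `data` dict of a Pub/Sub-triggered cloud event (test helper)."""
    raw = json.dumps(preferences).encode('utf-8')
    return {'message': {'data': base64.b64encode(raw).decode('ascii')}}


def decode_preferences(event_data: dict) -> dict:
    """Decode the message and apply the same defaults as fetch_news."""
    request_str = base64.b64decode(event_data['message']['data']).decode('utf-8')
    request_json = json.loads(request_str)
    return {
        'num_articles': request_json.get('num_articles', 2),
        'categories': request_json.get('categories', 'General').split(','),
    }


# Round trip: what the scheduler would send, and what the function would see
event_data = encode_payload({'num_articles': 3, 'categories': 'Business,Science'})
prefs = decode_preferences(event_data)
```

Note that the categories travel as a single comma-separated string, so the message body configured in the scheduler job would look like `{"num_articles": 3, "categories": "Business,Science"}`.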

Summarizing the Articles

The next step is to summarize the articles. For this, we use the BART model from HuggingFace, which is fine-tuned on the CNN/DailyMail dataset to be able to summarize news articles. While the model is free to download and host, it becomes costly to host it on GCP, so we use the HuggingFace inference API to summarize the articles.
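
A call to the inference API can be sketched roughly as follows. This is not the project's `NewsSummarizer` class; it is a minimal standalone illustration that assumes the `facebook/bart-large-cnn` checkpoint, the `api-inference.huggingface.co` endpoint, and a response shaped like `[{"summary_text": "..."}]`. The function names are hypothetical, and the standard library's `urllib` is used so the sketch has no third-party dependencies.

```python
import json
import urllib.request

# Assumed endpoint and model id: a BART checkpoint fine-tuned on CNN/DailyMail
API_URL = 'https://api-inference.huggingface.co/models/facebook/bart-large-cnn'


def build_request(text: str, api_token: str) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    payload = {
        'inputs': text,
        # Ask the API to wait for the model to load instead of failing fast
        'options': {'wait_for_model': True},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Authorization': f'Bearer {api_token}',
            'Content-Type': 'application/json',
        },
        method='POST',
    )


def extract_summary(response_json) -> str:
    """Pull the summary out of the assumed response shape."""
    if (isinstance(response_json, list) and response_json
            and 'summary_text' in response_json[0]):
        return response_json[0]['summary_text']
    raise ValueError(f'Unexpected response: {response_json!r}')


def summarize_text(text: str, api_token: str, timeout: float = 60) -> str:
    """Send the article text to the API and return its summary."""
    request = build_request(text, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_summary(json.load(response))
```

Keeping request construction and response parsing in separate functions makes both easy to test without network access or an API token.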

Again, we create a Cloud Function that is triggered by a Cloud Scheduler job every minute for an hour in the morning after the articles have been fetched. Each function call retrieves an unsummarized article from the database, summarizes it using the BART model, and updates the article in the database with the summary.

@functions_framework.cloud_event
def summarize(cloud_event):

    summarizer = NewsSummarizer(**summarizer_api_params)

    with get_client('db_info.json', 'conn_info.json') as client:
        collection = client['news']['articles']

        article = collection.find_one({'summary' : {'$exists':False}})

        if article is None:
            print('No articles to summarize')
            return
        
        article_text = article['article_text']
        summary = summarizer(article_text)

        update_result = collection.update_one(
            {'_id':article['_id']}, 
            {"$set": {'summary':summary}},
            upsert=False)
        
        assert update_result.matched_count == 1, 'No document found'
        assert update_result.modified_count == 1, 'No document modified'

    return f'Summarized article with id {article["_id"]}'
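
Because each invocation summarizes at most one article and the scheduler fires once a minute for an hour, at most 60 articles can be summarized per morning. A small sanity check along these lines can confirm that the user's settings fit in that window; the helper names are hypothetical, and the sketch assumes every invocation succeeds.

```python
def summaries_needed(num_categories: int, num_articles: int) -> int:
    """Articles fetched per day: num_articles for each requested category."""
    return num_categories * num_articles


def fits_in_window(num_categories: int, num_articles: int,
                   calls_per_window: int = 60) -> bool:
    """True if one-article-per-call summarization can cover the daily fetch."""
    return summaries_needed(num_categories, num_articles) <= calls_per_window


# Five categories with ten articles each need 50 calls, which fits in 60
ok = fits_in_window(5, 10)
# Seven categories with ten articles each need 70 calls, which does not
too_many = fits_in_window(7, 10)
```

Articles that miss the window are not lost: they still match the `$exists: False` filter and would be picked up by the next morning's runs.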

Creating the Web Page

Finally, we create a simple static web page that displays the summarized articles. We first create a Flask app that connects to the MongoDB database and retrieves the latest articles to display on the web page. We then use the Frozen-Flask library to convert the Flask app into a static web page, which is served from a Cloud Storage bucket on GCP.

@app.route('/')
def homepage():
    with open('site_args.json', 'r') as f:
        args = json.load(f)

    date = datetime.now().strftime('%B %d, %Y')

    with get_client('db_info.json', 'conn_info.json') as client:
        collection = client['news']['articles']
        docs = collection.find({'time' : {'$gt' : datetime.now() - timedelta(days=1)}})
        article_holders = {}
        for doc in docs:
            category = doc['category']
            article_holder = article_holders.get(category, ArticleHolder(category, []))
            article_holder.articles.append(Article(
                title=doc['title'],
                source=doc['source'],
                # %l is a glibc-only extension; %I with lstrip('0') is portable
                time_str=doc['time'].strftime('%I:%M %p on %b %d, %Y').lstrip('0'),
                url=doc['url'],
                summary=doc.get('summary', 'Summary not available'),
                img_url=doc['img_url']
            ))
            article_holders[category] = article_holder

    article_holders = list(article_holders.values())

    bucket_name = args.get('bucket_name')
    return render_template('index.html',
        curr_date = date,
        article_holders = article_holders,
        stylesheet_url = get_url(bucket_name, 'static/main.css'),
        favicon_url = get_url(bucket_name, 'static/favicon.ico'))
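
The `get_url` helper is not shown above. One plausible version, assuming the bucket's objects are publicly readable and served from `https://storage.googleapis.com/<bucket>/<object>`, might look like the sketch below; the fallback to a site-relative path when no bucket is configured is also an assumption, added to make local previews work.

```python
from typing import Optional
from urllib.parse import quote


def get_url(bucket_name: Optional[str], path: str) -> str:
    """Return a URL for a static asset (hypothetical implementation)."""
    clean_path = path.lstrip('/')
    if not bucket_name:
        # No bucket configured: fall back to a site-relative path
        return '/' + clean_path
    # quote() escapes spaces and similar characters but keeps '/' separators
    return f'https://storage.googleapis.com/{bucket_name}/{quote(clean_path)}'


stylesheet_url = get_url('my-news-bucket', 'static/main.css')
local_url = get_url(None, 'static/main.css')
```

Absolute URLs are useful here because the frozen page is served directly from the bucket rather than through Flask's own static route.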

If the user wants to read the full article, they can click on the article title, which will redirect them to the original article on the news website.