World News Search Engine

1. Introduction

This project gathers news reports published by media outlets in different countries and shows them to users ranked by relevance to their query. Users can choose which country's news to search.

2. Data

All the data are collected from the NEWS API, which provides thousands of news articles from different countries. It also comes with detailed documentation that explains how to use the API.

Saving Cache
Making requests
Saving them to SQL
Screenshots for SQL
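
The caching and storage steps listed above can be sketched roughly as follows. This is only a minimal illustration, not the project's actual code: the function names `cached_fetch` and `save_articles`, the JSON cache file, the SQLite table layout, and the article fields `url`, `title`, and `description` are all assumptions made for the example. The network request itself is passed in as a callable so that the sketch stays independent of any particular HTTP library.

```python
import json
import sqlite3
from pathlib import Path


def cached_fetch(key, fetch, cache_path):
    """Return the cached response for `key`; call `fetch()` only on a cache miss.

    `fetch` is a hypothetical zero-argument callable that performs the real
    API request and returns JSON-serializable data.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key not in cache:
        cache[key] = fetch()
        path.write_text(json.dumps(cache))
    return cache[key]


def save_articles(conn, country, articles):
    """Store articles in an assumed `articles` table, skipping duplicate URLs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, country TEXT, title TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
        [(a["url"], country, a.get("title"), a.get("description")) for a in articles],
    )
    conn.commit()
```

Keying the cache by request (for example, by country code) means repeated runs reuse the stored response instead of calling the API again, and the `url` primary key keeps the same article from being stored twice.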

3. Methods

Ranking Functions

After processing the data, I need to rank the news articles against the queries that users enter into the system. The steps for tokenizing and building the corpus are almost the same for every ranking function. I tried the classic Okapi BM25 function first, and it proved to be the best of all my choices. I also tried pivoted length normalization and some ranking functions of my own, all of which proved to be less accurate than Okapi BM25.
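
For reference, Okapi BM25 can be sketched in a few lines. The version below is an illustration under assumptions, not the project's implementation: the tokenizer is a simple lowercase regex split, the parameters `k1=1.2` and `b=0.75` are common defaults rather than values taken from this project, and the IDF uses the non-negative variant `ln((N - df + 0.5) / (df + 0.5) + 1)`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices sorted from highest to lowest BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so empty corpora of empty strings do not divide by zero.
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(toks) / avg_len)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
                score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scored.append((score, idx))
    # Sort by descending score; ties keep the original document order.
    return [idx for _, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Calling `bm25_rank("france football", titles)` on a list of title strings returns the positions of the titles in ranked order, so the top result is `titles[ranking[0]]`.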


Query Choice

My database stores both a title and a description for each article, so at first I was unsure which text to use for ranking. Using only the titles, using only the descriptions, and using both all seemed reasonable to me. After trying all three, I found that using only the titles gives the best results. My guess is that authors like to put an interesting detail of the story in the description to attract readers, and because such details are not the main theme of the news, they make the results confusing.
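
The three options differ only in which fields are joined into the text that gets indexed. A minimal sketch, assuming each article is a dictionary with `title` and `description` keys (the helper name `document_text` is hypothetical):

```python
def document_text(article, fields=("title",)):
    """Join the chosen fields into one string; missing or None fields are skipped."""
    return " ".join(article.get(f) or "" for f in fields).strip()


# The three configurations compared in this section.
TITLE_ONLY = ("title",)
DESCRIPTION_ONLY = ("description",)
TITLE_AND_DESCRIPTION = ("title", "description")
```

Keeping the field choice as a parameter makes it easy to build the corpus three times and evaluate each variant with the same ranking function.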

Sample tests and Ideal Results

Another difficulty was generating sample tests and producing ideal results. Similar to the process in homework 3 part a, I manually wrote a file containing 20 distinct queries. I then went through the results to find the 10 most relevant documents for each query. The queries serve as the sample tests, and the selected documents serve as the ideal results against which the different ranking functions are compared.

4. Results and Discussions


I calculated MAP and NDCG for each method I used. I tried 4 different ranking functions and 3 different choices of document text (titles only, descriptions only, and both).
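
Both metrics can be computed directly from a ranked list and the set of ideal documents for a query. The sketch below assumes binary relevance (a document is either in the ideal set or not) and a cutoff of 10 for NDCG. These are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def average_precision(ranked, relevant):
    """Average of precision values taken at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k=10):
    """NDCG with binary gains: DCG of the ranking divided by DCG of a perfect ranking."""
    dcg = sum(
        1 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, with the ranking `["a", "b", "c"]` and ideal set `{"a", "c"}`, the average precision is (1/1 + 2/3) / 2 = 5/6, because the two relevant documents are found at ranks 1 and 3.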


5. What’s Next

In the future, I plan to deploy the website on AWS. In addition, all the news in the current version is stored offline. I plan to make the system query the API directly so that users are offered the latest news.