The main idea of this project is to gather news reports published by different media outlets in different countries and present the articles to users ranked by relevance. Users can choose which countries' news to query.
There are many news search engines, but most of them are based on news from a specific outlet or country (such as the New York Times news app). What I did instead was gather news from all over the world to give users a broader view of the news.
All the data are collected from News API, which provides thousands of articles from different countries and offers detailed documentation for using its API.
Here is the code for obtaining data from News API:
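Since the original snippet is not reproduced here, the following is a minimal sketch of fetching articles from News API's top-headlines endpoint. `YOUR_API_KEY` is a placeholder, and the helper names (`fetch_headlines`, `parse_articles`) are my own, not necessarily those used in the project:

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your own News API key
ENDPOINT = "https://newsapi.org/v2/top-headlines"

def parse_articles(payload):
    """Keep only the fields the ranker needs from a News API JSON response."""
    return [{"title": a.get("title") or "", "description": a.get("description") or ""}
            for a in payload.get("articles", [])]

def fetch_headlines(country, page_size=100):
    """Fetch top headlines for one country code (e.g. 'us', 'gb')."""
    params = urllib.parse.urlencode(
        {"country": country, "pageSize": page_size, "apiKey": API_KEY})
    with urllib.request.urlopen(f"{ENDPOINT}?{params}") as resp:
        return parse_articles(json.load(resp))

# gather news from several countries into one collection
# articles = []
# for country in ["us", "gb", "de", "fr"]:
#     articles.extend(fetch_headlines(country))
```

The fetching loop is commented out because it needs a valid API key and network access; `parse_articles` keeps only the title and description fields, which the ranking step uses later.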
After processing the data, I need to rank the news against the queries that users enter into the system. The steps for tokenizing and building the corpus are almost the same across ranking functions. I tried the classic Okapi BM25 function first, and it proved to be the best of all my choices. I also tried pivoted length normalization and some ranking functions of my own, all of which proved less accurate than Okapi BM25.
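The tokenize-then-score pipeline can be sketched as a small BM25 implementation. This is a minimal sketch, not the project's exact code; `k1` and `b` are the usual default parameters, and the simple regex tokenizer is an assumption:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank docs (raw text strings, e.g. article titles) for a query
    with Okapi BM25; returns document indices, best match first."""
    corpus = [tokenize(d) for d in docs]
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    # document frequency of each term
    df = Counter(t for d in corpus for t in set(d))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # +1 inside the log keeps idf non-negative for common terms
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return sorted(range(N), key=lambda i: scores[i], reverse=True)
```

Because only the scoring formula changes between BM25, pivoted normalization, and custom variants, the tokenization and corpus statistics above can be shared across all of them.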
Since I have both descriptions and titles in my database, at first I had no idea which fields to use for ranking: using only titles, only descriptions, or both seemed equally sensible. After trying all three, I found that using only titles gave the best results. I suspect this is because authors tend to write descriptions that highlight the most attention-grabbing parts of the news, which can make results confusing since those parts are not the main theme of the article.
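The three field choices can be expressed as a single switch over the stored articles. This is an illustrative sketch; the function name and `mode` values are my own:

```python
def build_documents(articles, mode="title"):
    """Choose which fields feed the ranker.
    articles: list of dicts with 'title' and 'description' keys."""
    if mode == "title":            # titles only: the best-performing choice
        return [a["title"] for a in articles]
    if mode == "description":      # descriptions only
        return [a["description"] for a in articles]
    # both fields concatenated into one text per article
    return [a["title"] + " " + a["description"] for a in articles]
```

Keeping field selection separate from scoring makes it easy to rerun the same ranking function over all three representations and compare the metrics.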
Sample tests and Ideal Results
Another difficulty was generating sample tests and producing ideal results. Similar to the process in homework 3 part A, I manually made a file containing 20 entirely different queries, then went through the results to find the 10 most relevant documents for each query. The queries serve as the sample tests, and the hand-picked results serve as the ideal results against which the different ranking functions are compared.
4. Results and Discussions
I calculated MAP and NDCG for each method I used. I tried 4 different ranking functions and 3 different choices of which fields to rank on.
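With binary relevance judgments (a document is either in the hand-picked ideal set or not), MAP and NDCG can be computed as below. This is a standard formulation, not necessarily the exact code used in the project:

```python
import math

def average_precision(ranked, relevant):
    """AP for one query: ranked = returned doc ids in order,
    relevant = set of doc ids judged relevant."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i   # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def ndcg(ranked, relevant, k=10):
    """Binary-relevance NDCG@k for one query."""
    dcg = sum(1 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    # ideal DCG: all relevant docs placed at the top of the ranking
    ideal = sum(1 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def mean(xs):
    return sum(xs) / len(xs)

# MAP over the query set is the mean of the per-query AP values:
# map_score = mean(average_precision(run[q], ideal[q]) for q in queries)
```

Averaging the per-query AP and NDCG values over the 20 queries gives one MAP and one NDCG number per ranking function, which is what the tables below compare.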
The results for the ranking functions:
The first sample stands for the first ten queries and the second sample for the second ten. It seems I did not find a better function than BM25: although I tried many new functions of my own, none of them exceeded the results of BM25.
The results for comparing the documents:
5. What’s Next
In the future, I plan to deploy the website on AWS. Also, at the current stage all the news is offline; I plan to make the system query the API directly so it can offer users the latest news.