Thousands of news articles are published daily, presenting us with new events or providing updates on events reported earlier. Decision and policy makers in companies and government alike need to be aware of what is going on in order to make informed decisions. It is impossible to keep track of the vast amount of information coming in daily, especially when relevant information stretches over long periods of time.
The NewsReader project team at VU Amsterdam set out to develop an architecture capable of processing as many daily news items as possible, as quickly as possible. The team aims to extract information about events automatically and to link today’s news streams to information collected from earlier news articles and complementary resources such as encyclopedias or company profiles. To this end, they apply state-of-the-art language technology to the daily stream of incoming news. This provides the basis for a ‘history recorder’ that reconstructs complete story lines, points out forgotten details and may even reveal new links between current situations and events that happened in the past.
Current technology cannot keep up with this volume of information. Processing one article typically takes about 6 minutes on a single standard machine; with approximately 1 million new English-language articles arriving per day (from LexisNexis), on top of a historical backlog of millions of articles, this produces an enormous amount of data. One of the main challenges in the project is therefore scaling up the linguistic processing and making maximal use of the available computational resources to manage the daily stream of incoming information.
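To put those figures in perspective, a back-of-the-envelope calculation using only the numbers quoted above (the resulting machine count is purely illustrative, not a figure from the project) shows how much parallelism such a workload implies:

```python
# Back-of-the-envelope estimate of the parallelism needed to keep up with
# the daily news stream, using only the figures quoted in the text above.

MINUTES_PER_ARTICLE = 6          # ~6 minutes per article on one standard machine
ARTICLES_PER_DAY = 1_000_000     # ~1 million new English articles per day
MINUTES_PER_DAY = 24 * 60

# Total single-machine compute time generated by one day of news.
total_minutes = MINUTES_PER_ARTICLE * ARTICLES_PER_DAY          # 6,000,000 machine-minutes

# Machines that would have to run around the clock just to stay level with
# the incoming stream (ignoring overhead and the historical backlog).
machines_needed = total_minutes / MINUTES_PER_DAY               # ~4,167 machines

print(f"{total_minutes:,} machine-minutes per day "
      f"≈ {machines_needed:,.0f} machines running 24/7")
```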
Part of this challenge is optimising the linguistic pipeline, which consists of 18 modules that all interact with one another in different ways. It is equally important to choose computational resources that can keep up with the articles coming in every day. Finally, a smart storage solution is needed so that information extracted from earlier news articles can easily be combined with new incoming data.
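The sketch below is not the project’s actual code; it only illustrates the general pattern behind such a pipeline, in which each module reads the annotations produced by earlier modules and adds its own layer. The module names and document representation are hypothetical placeholders.

```python
# Illustrative sketch (not the project's actual code) of a linguistic pipeline
# in which each module enriches the annotations produced by the previous one.

from typing import Callable, Dict, List

Document = Dict[str, object]          # raw text plus accumulated annotation layers
Module = Callable[[Document], Document]

def tokenize(doc: Document) -> Document:
    doc["tokens"] = doc["text"].split()               # placeholder tokenizer
    return doc

def tag_entities(doc: Document) -> Document:
    # A real module would call a named-entity recognizer here.
    doc["entities"] = [t for t in doc["tokens"] if t.istitle()]
    return doc

def run_pipeline(doc: Document, modules: List[Module]) -> Document:
    # Each module reads the layers written by its predecessors and adds its own,
    # which is why module ordering and inter-module dependencies matter.
    for module in modules:
        doc = module(doc)
    return doc

article = {"text": "Acme Corp announced a merger with Globex on Monday."}
result = run_pipeline(article, [tokenize, tag_entities])
print(result["entities"])
```

With 18 interacting modules, the cost of each stage and the dependencies between stages determine where the pipeline can be parallelised and where it cannot.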
In collaboration with the Netherlands eScience Center and the Dutch Research & Education network SURF, project applicant Piek Vossen and his colleagues explored the best possible approach. Hadoop is extremely effective at processing large data sets whose size is known in advance. The eScience Center and SURF specialists therefore moved the project to Hadoop, running on several virtual machines via the HPC Cloud. They also shared their knowledge within the research group: two members of the research team are now able to work independently with the Hadoop architecture.
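One common way to run an existing per-document pipeline on Hadoop without rewriting it is Hadoop Streaming, where a script reads records from standard input and writes key/value pairs to standard output. The mapper below is a hedged sketch of that pattern only; the process_article function and the input layout are assumptions for illustration, not the project’s actual setup.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: each input line is treated as one
# article reference, a per-article processing step is invoked, and the result
# is emitted as a tab-separated key/value pair, as Hadoop Streaming expects.

import sys

def process_article(article_id: str) -> str:
    # Placeholder: in practice this would run the full linguistic pipeline
    # on the article identified by article_id and return its annotations.
    return f"processed:{article_id}"

for line in sys.stdin:
    article_id = line.strip()
    if not article_id:
        continue
    print(f"{article_id}\t{process_article(article_id)}")
```

Hadoop then takes care of splitting the input over the cluster’s virtual machines, so the same per-article processing runs on many nodes in parallel.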
The analysis of 3.5 million articles requires approximately one terabyte of storage for all results and source documents; the local storage capacity of VU Amsterdam is used for this.
This is the first time the team has been able to work with such large amounts of data, and the bottlenecks at the different levels of the project have now been properly identified for the first time. The project was supported by the Enlighten Your Research programs EYR4 and EYR-Global, which were organized with SURFnet, SURFsara, NLeSC and international R&E networks such as Internet2. The collaboration with Enlighten Your Research has given the team more insight into what is possible and what still needs to be done. Vossen: “I think we now know what it would take to make this a reality. Does this mean we can handle the volume of news? Not yet, but we are almost there. I don’t think we have made the most of it yet, but it won’t take long before we do.”