The steady growth of the World Wide Web raises challenges for the preservation of meaningful Web data. The tools currently used by Web archivists blindly crawl and store the Web pages they encounter, disregarding both the kind of Web site being accessed (which leads to suboptimal crawling strategies) and whatever structured content the pages contain (which results in page-level archives whose content is hard to exploit). In this PhD work, we focus on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible by Web browsers; examples include Web forums, social networks, and geolocation services. We claim that the best strategy for crawling these applications is to make the Web crawler aware of the kind of application being processed, allowing it both to refine the list of URLs to crawl and to annotate the archive with information about the structure of the crawled content. To this end, we add adaptive characteristics to an archiving Web crawler: the ability to identify when a Web page belongs to a given kind of Web application, and to apply the appropriate crawling and content-extraction methodology.
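The application-aware strategy described above can be sketched as a dispatch over detected application kinds: a generic page-level handler as fallback, plus specialized handlers that refine the URL list and annotate extracted objects. This is a minimal illustration, not the actual system; all class names, detection patterns, and the forum example are hypothetical.

```python
# Sketch of an application-aware archiving crawler (illustrative only).
import re
from urllib.parse import urljoin

class WebAppHandler:
    """Base handler: generic page-level crawling with no structural annotation."""
    def matches(self, url, html):
        raise NotImplementedError
    def extract_urls(self, url, html):
        # Default: follow every hyperlink found on the page.
        return [urljoin(url, m) for m in re.findall(r'href="([^"]+)"', html)]
    def extract_objects(self, url, html):
        # Default: archive the whole page as one unannotated object.
        return [{"type": "page", "url": url}]

class ForumHandler(WebAppHandler):
    """Hypothetical specialization for Web forums: keep only thread/board
    links and annotate the archive with individual posts."""
    def matches(self, url, html):
        return "viewtopic" in url or '<div class="post"' in html
    def extract_urls(self, url, html):
        return [u for u in super().extract_urls(url, html)
                if "viewtopic" in u or "viewforum" in u]
    def extract_objects(self, url, html):
        posts = re.findall(r'<div class="post">(.*?)</div>', html, re.S)
        return [{"type": "post", "url": url, "content": p} for p in posts]

def dispatch(handlers, url, html):
    """Pick the first specialized handler that recognizes the application
    kind; fall back to blind page-level crawling otherwise."""
    for h in handlers:
        if h.matches(url, html):
            return h
    return WebAppHandler()
```

The point of the dispatch step is that specialization changes both outputs at once: the refined URL list steers the crawl, while the extracted objects carry the structural annotations into the archive.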

PhD Topic Description

The purpose of this PhD is to develop intelligent and adaptive models, methods, algorithms, and tools to make the content acquisition process in Web archiving more effective and efficient. The objective is to leverage existing work (in particular within the ARCOMEM consortium) on the extraction and analysis of events, entities, topics, opinions, perceptions, etc., to select and prioritize the sources to be crawled. This will be combined with techniques that go beyond traditional page-level crawling, allowing object-level, goal-driven crawling of the Web. The focus is on unsupervised methods that can scale to the whole Web. In particular, the following aspects will be covered:

  • assessing the relevance, importance, and coverage of available content with respect to a Web archiving task;
  • combining evidence to select or prioritize the crawling process;
  • accessing content at the level of objects inside Web pages, including content hidden behind deep-Web forms or Web 2.0 applications.

Depending on the content to be archived (social networks, structured Web, deep Web, etc.), different solutions for assessing relevance, prioritizing the crawling, and extracting Web objects can be proposed.