This project focuses on three aims: enhancing existing document representations with representations of semantic content at a subdocument granularity; representing documents from multiple perspectives; and leveraging the attention and domain knowledge of end users to improve document representations. This will improve the retrievability of documents in future searches. The research will provide one of the first systematic studies of the potential added value of interpretation and labeling efforts contributed by a community of end users. In addition to evaluating the results of this particular approach to improving information retrieval, the work will contribute needed tools and techniques for evaluating incrementally added indexing that will be applicable to other research on such indexing. The project builds a successful, interdisciplinary collaboration with the national public healthcare portal in Denmark, with strong participation from the government partner as well as physicians in Denmark who help to design the studies.
The goal of this project was to improve information retrieval in domain-specific websites. Domain-specific websites contain information about a particular domain (e.g., cancer diseases and treatments), and their users tend to have more specific or focused information needs. As an example, compare cancer.org (the American Cancer Society’s website), a domain-specific website focused on cancer, with a more general site such as YouTube (a site hosting videos on a huge array of topics). The research team included Dr. Marianne Lykke, Professor, Department of Communication and Psychology, Aalborg University in Aalborg, Denmark, as co-project director. With a focus on domain-specific websites, we studied whether end-user tagging could be used to improve information retrieval and whether a detailed, session-oriented click log of (anonymous) user sessions could be used to make recommendations to users. We partnered with the Danish Cancer Society in Copenhagen, Denmark, for this project. The Danish Cancer Society agreed to implement end-user tagging and the capture of a detailed click log to support our research. These features were available on their public website (cancer.dk) from November 2011 through early 2013. We were able to conduct user studies of tagging on the live cancer.dk site and to analyze in detail the click logs captured from the site while log capture was operational. The intellectual merit of this project stems from our contributions to the library science literature and the digital library literature. With regard to end-user tagging, an early study (conducted once the collaborative team had designed the tagging features for cancer.dk) confirmed that users understood what tags are, understood the proposed tagging features on the cancer.dk site, and could imagine tagging pages as well as browsing through tags to see pages tagged by other users.
They also said they wanted to tag pages that they would like to find again (which means that the tags might be for personal use as opposed to helping someone else find information). Later studies in the project uncovered that users use tags for a wide variety of purposes, including requesting that more information be placed on the site, indicating their opinion of the subject or the content on the page, and explaining the content. The users also indicated that they would browse existing tags for various reasons: to search for information, to retrieve known content (perhaps from their own tags), and to find explanations of the content. Note that, operationally, the Danish Cancer Society observed that it took some effort to eliminate inappropriate tags (i.e., spam). We also explored whether tags (generally) add content that is not already present on the page. Using a public website for tags (delicious.com), we extracted all tags for the American Cancer Society’s website (cancer.org) and a public recipe site (recipes.com). We compared the page content with the tags, and we compared the distance between pages based on the link structure with the distance between pages that share a tag (where distance was measured as number of clicks). We found that at least 90% of the time, tagging pages with a common tag brought them closer together (in terms of number of clicks) than the links in the site structure did. We also found that tags add information that is not available on the page in about two-thirds of the cases for cancer.org and about one-third of the cases for the recipe site (where the tags tend to repeat words that appear on the page). In our analysis of the click log, we discovered that it is possible to use clustering techniques from machine learning to group together user sessions that visited (some of) the same pages. We discovered that the clusters were strongly related to the question that the user was trying to answer.
We also found that some commonly used distance measures are unsuitable for this application. The strong relationship between the clusters and the users' questions indicates that mining the clusters to recommend pages to new users holds promise. The broader impact of this work stems in part from the use of a live, production website for our experiments and the use of real users (in Denmark) who were either cancer patients or closely associated with cancer patients in our user tests. Since this project contributed novel methods for making recommendations to users of domain-specific websites, the potential impact of the work is large. Millions of people worldwide use the Internet and visit domain-specific websites every day; our research to improve information retrieval in domain-specific sites may make their searching or browsing faster or more effective.
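As an illustration of the session grouping described above, the following is a minimal sketch only, not the method used in the project. The report does not name the distance measure or the clustering algorithm, so the Jaccard distance, the single-linkage grouping, the 0.7 threshold, the function names, and the session and page identifiers below are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sessions count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_sessions(sessions: dict, threshold: float = 0.7) -> list:
    """Group sessions by single linkage.

    Any two sessions whose Jaccard distance is below `threshold` end up in
    the same cluster. The threshold value is an assumption, not a figure
    taken from the report.
    """
    # Union-find structure: every session starts as its own cluster.
    parent = {sid: sid for sid in sessions}

    def find(sid):
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]  # path compression
            sid = parent[sid]
        return sid

    # Merge the clusters of every pair of sufficiently similar sessions.
    for s, t in combinations(sessions, 2):
        if jaccard_distance(sessions[s], sessions[t]) < threshold:
            parent[find(s)] = find(t)

    clusters = {}
    for sid in sessions:
        clusters.setdefault(find(sid), []).append(sid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical sessions: each maps a session id to the set of pages visited.
sessions = {
    "s1": {"/breast-cancer", "/breast-cancer/symptoms", "/breast-cancer/treatment"},
    "s2": {"/breast-cancer/symptoms", "/breast-cancer/treatment"},
    "s3": {"/skin-cancer", "/skin-cancer/prevention"},
    "s4": {"/skin-cancer/prevention", "/sun-safety"},
}

clusters = cluster_sessions(sessions)
print(clusters)  # [['s1', 's2'], ['s3', 's4']]
```

In this toy input the two clusters correspond to two different information needs, which mirrors the observation that clusters relate to the question a user is trying to answer. A recommender built on such clusters could suggest to a new session the pages most often visited by other sessions in its cluster; whether a set-overlap measure like this one is suitable for real click logs is exactly the kind of question the project's findings on distance measures address.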