CLiKS: Content-Based Literature Search Using Knowledge and Structure

This project aims to develop new retrieval models and algorithms for searching and browsing scientific literature. Two important recent developments in the production of scientific literature form the concrete motivation for this project: (1) semantically rich document structuring standards, and (2) increasingly rich keyword annotations that capture domain knowledge. The driving question underlying this proposal is: How can we use these two developments to improve access to scientific literature? To address this question, we propose to use rich probabilistic retrieval models that capture the relation between document content, document structure, and document-level annotations.

To provide focused access, we aim to return semantically defined XML elements, with query models informed by available domain knowledge. We will contrast generative, language-modeling-based approaches with discriminative approaches, which may allow for better optimization and estimation methods.
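
As a minimal sketch of the generative side of this comparison, the following Python scores XML elements by query likelihood under a mixture of element, document, and collection unigram models. The function names, the mixture weights `lam_e` and `lam_d`, and the toy data are illustrative assumptions and not the project's actual models; a query model informed by domain knowledge would replace the plain term list used here.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def score_element(query, element, document, collection, lam_e=0.5, lam_d=0.3):
    """Log query likelihood under a linear mixture of three language models.

    lam_e and lam_d are hypothetical weights for the element and document
    models; the remaining mass (1 - lam_e - lam_d) goes to the collection.
    """
    p_e = unigram_model(element)
    p_d = unigram_model(document)
    p_c = unigram_model(collection)
    lam_c = 1.0 - lam_e - lam_d
    score = 0.0
    for term in query:
        p = (lam_e * p_e.get(term, 0.0)
             + lam_d * p_d.get(term, 0.0)
             + lam_c * p_c.get(term, 0.0))
        if p == 0.0:
            # Term unseen even in the collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


def rank_elements(query, elements, collection):
    """Rank the named elements of one document, best score first."""
    document = [tok for tokens in elements.values() for tok in tokens]
    scored = [(name, score_element(query, tokens, document, collection))
              for name, tokens in elements.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy document with two semantically defined elements (invented data).
elements = {
    "abstract": ["probabilistic", "retrieval", "models", "for", "literature"],
    "acknowledgements": ["we", "thank", "the", "funding", "agency"],
}
collection = [tok for toks in elements.values() for tok in toks] + [
    "models", "of", "protein", "folding"]
ranked = rank_elements(["retrieval", "models"], elements, collection)
```

Smoothing the element model with its parent document and the collection lets short elements still receive a finite score for query terms they do not contain, which is one common motivation for mixture models in element retrieval. A discriminative alternative would instead learn the combination from relevance judgments.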

Evaluation will use standard benchmarks provided by INEX and TREC. In addition, a richly marked-up and annotated corpus provided by a leading scientific publisher will be used for a system-centered comparison and for a user study on the benefits of semantically oriented versus layout-oriented markup.