Citation-based Extraction of the Core Contents of Biomedical Literature
Biomedical literature is an essential resource for biomedical research. In this project, we plan to develop a technique, CECC, that extracts the core content of a given biomedical article r. The core content comprises the research goals, research methods, and main findings of r. Extracting the core content is motivated by the need to retrieve biomedical articles that are highly related to r: a candidate article d is said to be highly related to r if d and r share similar core contents. Retrieving highly related articles is essential for the readers, authors, and reviewers of r, as well as for text mining systems and human curators that aim to analyze specific biomedical evidence published in the literature. Technically, CECC is based on the observation that a biomedical article r often contains many citations, which the author(s) of r carefully selected to illustrate and highlight the main contributions of r. Therefore, the citation passages in r (the passages in which the author(s) discuss the citations) can collectively indicate the core content of r more completely and precisely. We expect that inter-article similarity measured on the core contents can improve the retrieval of highly related biomedical articles. The project's contributions are of technical significance to the information retrieval community and of practical significance to the retrieval and mining of biomedical evidence already published in the literature.
Keywords: core content, citation, citation passage, highly related biomedical articles, article retrieval.
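
The idea outlined in the abstract can be illustrated with a minimal Python sketch. It is not the CECC technique itself, whose extraction and weighting steps are not specified here. The sketch assumes that a citation passage is any sentence containing a citation marker, that the core content can be approximated by a bag of words over those sentences, and that inter-article similarity is the cosine between two such bags. The names CITATION_MARKER, citation_passages, core_content_vector, and cosine are hypothetical, and the marker patterns cover only two common citation styles.

```python
import math
import re
from collections import Counter

# Assumed citation-marker styles: numeric "[3]", "[3, 4]", "[3-5]" and
# author-year "(Smith et al., 2010)". Real articles use more styles.
CITATION_MARKER = re.compile(
    r"\[\d+(?:\s*[,-]\s*\d+)*\]"
    r"|\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)"
)


def citation_passages(text):
    """Return the sentences of `text` that contain a citation marker."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CITATION_MARKER.search(s)]


def core_content_vector(text):
    """Approximate the core content as term counts over citation passages."""
    tokens = []
    for passage in citation_passages(text):
        # Drop the markers themselves so author names and numbers are not counted.
        cleaned = CITATION_MARKER.sub(" ", passage)
        tokens.extend(re.findall(r"[a-z]+", cleaned.lower()))
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    r = ("Gene X regulates apoptosis [1]. We thank our colleagues. "
         "Apoptosis is impaired in tumors (Smith et al., 2010).")
    d = "Apoptosis in tumors depends on gene X [7]. Funding was provided by Y."
    print(citation_passages(r))
    print(cosine(core_content_vector(r), core_content_vector(d)))
```

In this sketch, sentences without citation markers (acknowledgements, funding statements) do not contribute to the similarity between r and d. A practical version would at least need stop-word removal, term weighting, and a more robust sentence splitter.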