Context Recognition for Document Filtering and Classification

 

Rey-Long Liu

 

ABSTRACT

Text hierarchies are a popular way to organize information for browsing, retrieval, and dissemination. In practice, large amounts of information may arrive at any time, but only a small subset of it belongs in any category of a given hierarchy. Performing document filtering (DF) in the course of document classification (DC) is therefore important. Integrated DF and DC is an essential basis for developing an information center that classifies suitable documents into suitable categories, reducing information overload while promoting information visibility. The challenge lies in estimating the degree of acceptance (DOA) of each input document with respect to each category. Improper estimates may mislead the system into deriving a strategy that classifies more unsuitable documents into unsuitable categories and fewer suitable documents into suitable categories. To tackle this challenge, this project plans to develop a framework, ICenter, that builds on our previous experience in data mining, text mining, and document classification. ICenter aims to make DOA estimates more appropriate by considering the context of discussion (COD) of each document and category. It employs profile mining and COD thresholding to achieve COD recognition. Experiments on real-world data will also be conducted to empirically evaluate the contribution of COD recognition to integrated DF and DC. We expect that, through COD recognition, ICenter will perform not only significantly better but also more stably across different environments, including different training data, text hierarchies, and test data. The research results are of both theoretical and practical significance. They may serve as an essential basis for developing an information center for a user community that organizes and shares a hierarchy of textual information.

Keywords: Text hierarchy, document filtering, document classification, COD recognition, profile mining, COD thresholding

 
