Dynamic Construction of Category Profiles for Document Filtering and Classification
Information is often represented as text and classified into multiple categories for efficient browsing, retrieval, and dissemination. In such an information space, each category typically contains several documents about a specific topic. Many documents may be entered at any time, but only a small portion of them may belong to any of the categories. Unfortunately, traditional classifiers are often trained to process only the documents within a given information space (i.e., to distinguish a category from the other categories in that space), rather than the much larger number of documents outside it (i.e., to distinguish a category from all categories outside the information space). In this project, we explore how the performance of various classifiers in document filtering (DF) may be improved by employing more suitable features to distinguish relevant documents from non-relevant documents for each category. We plan to develop a novel approach, DP4FC (Dynamic Profiling for Filtering Classification), to serve as a preprocessor for various classifiers. Upon receiving a document, DP4FC dynamically creates the profile of each category and accordingly decides either to filter out the document or to pass it to the classifier, which makes the final decision. DP4FC and the underlying classifier thus complement each other in classifying a document into a category: DP4FC is trained to measure how similar the document is to the basic contents of the category, while the underlying classifier is trained to measure how similar the document is to the discriminative contents of the category (with respect to all the categories in the information space). The contributions of DP4FC could be more significant when the categories are more closely related (e.g., computer networks and computer games). In that case, the underlying classifier may be competent only at distinguishing among the categories using features that are specific to each category (e.g., "network" and "game"); it may not be competent at recognizing the common background of the categories using features that they share (e.g., "computer"). Theoretical analysis and empirical experiments will be conducted to evaluate DP4FC under different circumstances. The contributions of the research project are of both theoretical and practical significance to automatic information processing and management.
Keywords: Document Filtering, Document Classification, Feature Selection, Category Profile
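
The filter-then-classify arrangement described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how DP4FC builds its profiles or measures similarity, so everything below is an assumption made for illustration only: the profile is a plain term-frequency vector over a category's documents, similarity is cosine similarity, and the names build_profile, filter_then_classify, and threshold are hypothetical. It is not the DP4FC method itself.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def build_profile(category_docs):
    """Hypothetical profile: summed term frequencies over a category's documents."""
    profile = Counter()
    for doc in category_docs:
        profile.update(tokenize(doc))
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_then_classify(doc, category_docs, classifier, threshold=0.3):
    """Stage 1: profile-based filter. Stage 2: the underlying classifier.

    Returns False if the document is filtered out; otherwise returns the
    classifier's own accept/reject decision. The threshold is illustrative.
    """
    profile = build_profile(category_docs)  # built when the document arrives
    similarity = cosine(Counter(tokenize(doc)), profile)
    if similarity < threshold:
        return False  # too far from the category's basic contents
    return bool(classifier(doc))


# Toy data: a "computer networks" category and a classifier that keys only
# on the discriminative term "network" (both hypothetical).
NETWORK_DOCS = [
    "computer network routing protocol",
    "computer network packet switching",
]


def keyword_classifier(doc):
    return "network" in tokenize(doc)
```

With this toy data, "computer network latency" passes the filter and is accepted by the classifier, whereas "network of friends at the party" contains the discriminative term "network" but shares too little of the category's basic content (such as "computer") and is filtered out before the classifier sees it.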