| Abstract: |
This talk presents our recent work in two areas: large-scale text categorization and adaptive filtering. The first part analyzes the scaling problem in automated text categorization with very large taxonomies via hierarchical decomposition, and evaluates Support Vector Machines and k-nearest neighbor classifiers on the full domain of Yahoo! categories (132,199 categories in both the training and test sets). The second part introduces the research challenges in semi-supervised learning for classification with non-stationary topics or events, where training examples are extremely sparse at the start and relevance feedback arrives incrementally on biased samples during the filtering process. Our cross-benchmark evaluation with regularized logistic regression and Rocchio-style classifiers identifies the state-of-the-art solutions: using relevance feedback on 0.04% to 0.6% of the documents yielded a 54% cost (error) reduction and a 21% utility increase, compared to the best system without relevance feedback. |