| Abstract: |
Rare category detection refers to the problem of identifying examples
from the minority classes with as few label requests as possible, given an unlabeled,
unbalanced data set. It is an open challenge in machine learning, and has a
wealth of applications, such as financial fraud detection, network intrusion
detection, astronomy, and spam image detection. In this thesis, we plan to
address this problem from four perspectives: (1) initial class label
discovery for various data types, (2) dealing with prior information about
the data set, (3) feature selection for rare category detection, and (4)
rare category classification.
Our recent work focuses on the first two perspectives, i.e., rare category
detection for data with feature representation and for graph data when different
amounts of prior information are available. For data with feature
representation, given enough prior information about the data set, we
proposed nearest-neighbor-based methods, which essentially perform local
density differential sampling. They are proven to be effective both
theoretically and experimentally. On the other hand, when no prior
information about the data set is available, we proposed a
density-based method, which makes use of specially designed exponential
families. For graph data, we designed two algorithms that take advantage of
the global similarity between two examples. Given the same amount of
information, the first algorithm performs better than state-of-the-art
techniques; whereas given much less information, the second algorithm is
comparable with state-of-the-art techniques.
Future work includes three directions. First, for data in a high-dimensional
feature space, we will select the features that are most relevant to the
minority classes. Second, following rare category detection, we will design
effective methods for rare category classification that take into account
the fact that the minority classes form compact clusters in the feature
space. Third, we will adapt existing rare category detection methods to work
on stream data. The goal is to identify emerging trends as soon as
possible.
Thesis Committee:
Jaime Carbonell (Chair),
John Lafferty,
Larry Wasserman,
Foster Provost, NYU.
Online document: www.cs.cmu.edu/~jingruih/thesis/jingrui_proposal.pdf |