| Abstract: |
Incremental hierarchical text document clustering algorithms are
important for organizing documents generated from streaming on-line
sources, such as newswire and blogs. However, this is a relatively
unexplored area in the text
document clustering literature. Popular incremental hierarchical
clustering algorithms, namely Cobweb and Classit, have
not been applied to text document data. We discuss why, in their current
form, these algorithms are not suitable for text clustering and
propose an alternative formulation. This includes changes to the
underlying distributional assumption of the algorithm
in order to conform with the empirical data. Both the original Classit
algorithm and our proposed algorithm are evaluated using Reuters
newswire articles and the Ohsumed dataset, and the gain from using a
more appropriate distribution is demonstrated. |