| Abstract: |
Kernel methods, such as support vector machines (SVMs), have been successfully used in various aspects of machine learning problems, such as classification, regression, and ranking. Many of them are formulated as quadratic programming (QP) problems, which take
O(m^3) time and O(m^2) space complexities, where m is the training set size. It is, thus, computationally infeasible on massive data sets. A major research challenge is thus on scaling it up for use on very large data sets.
In this talk, I will first show that the classification SVM can be equivalently formulated as a minimum enclosing ball (MEB) problem in computational geometry. By adopting an efficient approximate MEB algorithm, I obtain provably optimal solutions based on the idea of core-sets. The resultant Core Vector Machine (CVM) algorithm has just linear time complexity and a space complexity that is even independent of m for a fixed approximation factor (1+epsilon)^2. By further generalizing to a center-constrained minimum enclosing ball problem, the generalized CVM algorithm can be used in a variety of learning problems including regression, ranking, and semi-supervised learning.
Experiments on massive real-world data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle very large data sets. For example, the CVM produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz
Pentium-4 PC.
Finally, I introduce the enclosing ball (EB) problem where the ball's radius is fixed and thus does not have to be minimized. I develop efficient (1+epsilon)-approximation algorithms that are simple to implement and do not require any sophisticated numerical solver. For the Gaussian kernel in particular, a suitable choice of this (fixed) radius is easy to determine, and the center obtained from the (1+epsilon)-approximation of this EB problem is close to the center of the corresponding MEB. Experimental results show that the proposed algorithm has accuracies comparable to the other large-scale SVM implementations, but can handle very large data sets and is even faster than the CVM in general. |