Thursday, April 19, 4:30 p.m.
Where: 6115 Gates and Hillman Centers
Computer Science and Engineering
University of California, San Deigo
Machine Learning/Google Seminar
Annotation on the Cheap
We consider the task of labeling an entire data set, given the ability to
query the labels of specific points within it.
Specifically, we define the "finite population annotation" (FPA) problem
as follows. The input consists of: a set of n data points, each of which
has an associated label that is initially missing; and parameters
delta, epsilon. The algorithm is given the ability to ask for any of the
missing labels, and has to produce a set of n labels such that with
probability at least 1-delta, at most an epsilon fraction of them are
incorrect. The goal is to meet this statistical requirement without
asking for very many labels.
This problem is motivated by scenarios in e-discovery and active learning.
We describe three algorithms that solve it by exploiting different types
of structure in the data. In each case, we characterize the types of data
for which the algorithm requires few labels, and we give experimental
illustrations of its behavior.
This is joint work with David Lewis and Matus Telgarsky.
Sanjoy Dasgupta develops algorithms for the statistical analysis of high-dimensional data. Such data is now widespread, in domains ranging from environmental modeling to genomics to web search.
Prior to joining the UCSD Jacobs School in 2002, professor Dasgupta was a senior member of the technical staff at AT&T Labs-Research, where his work focused on algorithms for data mining, with applications to speech recognition and to the analysis of business data. Professor Dasgupta received a Ph.D. in Computer Science in 2000 from UC Berkeley and a B.A. in Computer Science from Harvard in 1993. He is a member of the editorial boards of the Journal of Machine Learning Research, the Journal of Artificial Intelligence Research, and the Machine Learning Journal.
Faculty Host: Steve Hanneke