| Abstract: |
It is often convenient to make certain assumptions during the learning
process. Unfortunately, algorithms built on these assumptions can
break down if the assumptions do not hold across both train and
test data. Relatedly, we can do better at various tasks (like named
entity recognition) by exploiting the richer relationships found in
real-world complex systems. By exploiting these kinds of
non-conventional regularities we can more easily address problems
previously unapproachable, like transfer learning. In the transfer
learning setting, the distribution of data is allowed to vary between
the training and test domains; that is, the independent and
identically distributed (i.i.d.) assumption linking train and test
examples is severed. Without this link between the train and test
data, traditional learning is difficult.

In this thesis we propose finding learning techniques that can still
succeed even in situations where i.i.d. and other common assumptions
are allowed to fail. Specifically, we seek out and exploit
regularities in the problems we encounter and document which specific
assumptions we can drop, and under what circumstances, while still
completing our learning task. We further investigate different
methods for dropping, or relaxing, some of these restrictive
assumptions so that we may bring more resources (from unlabeled
auxiliary data, to known dependencies and other regularities) to bear
on the problem, thus both producing better answers to existing
problems and beginning to address problems previously
unanswerable, such as those in the transfer learning setting.

Thus we propose that learned classifiers and extractors can be made
more robust to shifts between the train and test data by using data
(both labeled and unlabeled) from related domains and tasks, and by
exploiting stable regularities and complex relationships between
different aspects of that data. We present preliminary results
supporting this claim, drawn from the problem domain of protein name
extraction in biological publications, and propose areas of continuing
and future investigation.

Committee:
William W. Cohen (chair),
Tom M. Mitchell,
Noah A. Smith,
ChengXiang Zhai (UIUC).
An electronic copy of this proposal is available on-line at: http://www.cs.cmu.edu/~aarnold/thesis/aarnold_proposal.pdf
|