| Abstract: |
The primary difference between propositional (attribute-value) and relational data is the existence of relations, or links, between entities. Graphs, relational databases, sets of tensors, and first-order knowledge bases are all examples of relational encodings. Because of the relations between entities, standard statistical assumptions, such as independence of entities, are violated. Moreover, these correlations should not be ignored, as they provide a source of
information that can significantly improve the accuracy of common machine learning tasks (e.g., prediction, clustering) over propositional alternatives. A current limitation of relational models is that learning and inference are often substantially more expensive than in propositional alternatives. One of our objectives is the development of models that account for uncertainty in relational data while scaling to very large data sets, which often cannot fit in main
memory. To that end, we propose representing relational data as a set of tensors, one per relation, whose dimensions index different entity types in the data set. Each tensor has a low-dimensional approximation, and tensors that involve the same entity type share the low-dimensional factor for that entity type. For the case of matrices, we refer to this model as collective matrix factorization.
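The matrix case can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the proposal: the two relations (a users-by-movies matrix `X` and a movies-by-genres matrix `Y`), the squared-error loss with L2 regularization, plain gradient descent, and the hypothetical function name `collective_mf`. The only point is that the movie factor `V` appears in both approximations, so both relations inform it.

```python
import numpy as np


def collective_mf(X, Y, rank=2, lr=0.01, reg=0.1, n_iters=2000, seed=0):
    """Jointly factor X (users x movies) and Y (movies x genres).

    Illustrative model (an assumption, not the proposal's algorithm):
        X ~ U @ V.T   and   Y ~ V @ Z.T
    where V, the movie factor, is shared by both relations.
    """
    rng = np.random.default_rng(seed)
    n_users, n_movies = X.shape
    n_genres = Y.shape[1]

    # Small random initialization of one factor per entity type.
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_movies, rank))
    Z = 0.1 * rng.standard_normal((n_genres, rank))

    for _ in range(n_iters):
        Rx = U @ V.T - X  # residual of the users x movies relation
        Ry = V @ Z.T - Y  # residual of the movies x genres relation

        # Gradients of 0.5 * (||Rx||^2 + ||Ry||^2) plus the L2 penalty.
        grad_U = Rx @ V + reg * U
        grad_V = Rx.T @ U + Ry @ Z + reg * V  # V receives signal from both relations
        grad_Z = Ry.T @ V + reg * Z

        U -= lr * grad_U
        V -= lr * grad_V
        Z -= lr * grad_Z

    return U, V, Z
```

Calling `collective_mf(X, Y)` on two matrices whose movie dimensions agree returns one factor per entity type; `U @ V.T` and `V @ Z.T` then give the low-rank reconstructions of the two relations. The step size and iteration count are arbitrary choices suited to small dense examples, not to the large data sets the proposal targets.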
While existing techniques for relational learning assume a batch of data, we propose exploring extensions to active and mixed-initiative learning, where the learning algorithm can query its environment (typically a human user) about relationships between entities, the creation of new predicates, and relationships between predicates themselves. We believe that the expressiveness of relational representations will allow for more efficient interaction between the
learner and its environment, as well as leading to better predictive models for relational data. Efficiency refers not only to computational efficiency, but also to the efficiency of data collection in active learning scenarios. To support the claim that our models are efficient, we propose exploring three problems: predicting users' ratings of movies with side information, topic models for text using fMRI images of neural activation on words, and mixed-initiative tagging of e-mail and other information used by personal information managers (e.g., tasks from to-do lists, recently edited files, and calendar entries).
Thesis Committee: Geoffrey Gordon (Chair), Christos Faloutsos, Tom Mitchell, Pedro Domingos (Univ. of Washington).
The proposal document is found at: http://www.cs.cmu.edu/~ajit/pubs/proposal.pdf
|