When: Thursday, May 15, 9:00 a.m.
Where: 3002 Newell-Simon Hall
Jae Dong Kim
LTI PhD Thesis Proposal
Abstract: In general, corpus-based machine translation systems prefer longer units because they naturally convey local context and local reordering. Our lexical Example-Based Machine Translation (EBMT) system also uses long matches of the input, so it benefits from preserved local context and reordering. However, its translation score calculation, which is used to find a target phrase given a match, was based on heuristics, and we needed a more mathematically principled model.
On the other hand, analyzing sentences into chunks instead of N-gram phrases may help a translation system in several ways. Because there are fewer translation units per sentence, there is less distortion (reordering) to be reckoned with, and hence less noise is to be expected from the mathematical modeling techniques. Another advantage is that we can, to some degree, systematically translate tokens that exist on only one side and therefore have no direct translation. For example, when translating an English sentence into Korean, word-to-word translation systems cannot produce a Korean nominative case marker unless human experts supply rules, or unless the systems "hallucinate" markers and use language modeling to guess whether the case marker should in fact be present.
In this proposal, we show how our new phrasal aligner, SPA, improved the system, and we discuss what remains to be investigated for SPA and for a chunk-based system. For the chunk-based system, we propose methods for chunk alignment using SPA and other SMT techniques, for chunk generalization with chunk labels, and for using chunks as the basic translation unit in conjunction with a word-based system as a back-off model.