Phrase-Based Statistical Machine Translation (PBSMT)

Abstract

Given a source-language sentence s to be translated into a target-language sentence t, Phrase-Based Statistical Machine Translation (PBSMT) searches for the sentence t that maximizes the posterior probability P(t|s). By Bayes' rule, and because P(s) is constant for a given input, this is equivalent to maximizing the product of the language model P(t) and the translation model P(s|t). During decoding, the input sentence is segmented into a sequence of phrases, each of which is translated into a target-language phrase. Phrases can be any contiguous word sequence and need not be phrases in any syntactic theory; they allow these models to learn local reorderings, translations of short idioms, and insertions and deletions that are sensitive to local context. Reordering of the output phrases is modeled by a relative distortion probability distribution. To calibrate the output length, a factor (word cost) is applied for each generated output word, in addition to the language model. The translation model is thus a phrase translation model, which can be extracted from the parallel corpus either by word alignment followed by appropriate refinements or directly by using a phrase-based joint probability model.
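The scoring described above can be illustrated with a minimal Python sketch. It is not a decoder: it only scores a given segmentation in log space by combining phrase translation probabilities, a relative distortion penalty, a language model, and a per-word cost. The phrase table, the bigram language model, and the values of ALPHA and WORD_COST are hypothetical toy numbers, and the exponential distortion form ALPHA ** |start - prev_end - 1| is one common choice assumed here, not something the text specifies.

```python
import math

# Hypothetical toy parameters; none of these values come from the text.
PHRASE_TABLE = {  # P(source phrase | target phrase)
    ("das haus", "the house"): 0.8,
    ("ist klein", "is small"): 0.7,
}
BIGRAM_LM = {  # P(word | previous word)
    ("<s>", "the"): 0.5,
    ("the", "house"): 0.4,
    ("house", "is"): 0.3,
    ("is", "small"): 0.2,
}
LM_FLOOR = 1e-4   # assumed probability for unseen bigrams
ALPHA = 0.5       # assumed distortion base, 0 < ALPHA <= 1
WORD_COST = 1.1   # assumed per-word factor; values above 1 favour longer output


def log_lm(target_words):
    """Log probability of the target word sequence under the toy bigram model."""
    prev, total = "<s>", 0.0
    for word in target_words:
        total += math.log(BIGRAM_LM.get((prev, word), LM_FLOOR))
        prev = word
    return total


def score(phrase_pairs):
    """Log score of one hypothesis.

    phrase_pairs lists (source_phrase, target_phrase, src_start, src_end)
    in target order; source positions are 1-based and inclusive.
    """
    total, prev_end, target_words = 0.0, 0, []
    for src, tgt, start, end in phrase_pairs:
        total += math.log(PHRASE_TABLE[(src, tgt)])             # translation model
        total += abs(start - prev_end - 1) * math.log(ALPHA)    # relative distortion
        prev_end = end
        target_words.extend(tgt.split())
    total += log_lm(target_words)                               # language model
    total += len(target_words) * math.log(WORD_COST)            # word cost
    return total


# Two hypotheses over the same segmentation of "das haus ist klein".
monotone = [("das haus", "the house", 1, 2), ("ist klein", "is small", 3, 4)]
reordered = [("ist klein", "is small", 3, 4), ("das haus", "the house", 1, 2)]
best = max([monotone, reordered], key=score)
print(" ".join(tgt for _, tgt, _, _ in best))
```

A real decoder would search over segmentations and phrase choices, typically with beam search, rather than compare two hand-built hypotheses.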

Phrase-based SMT models usually work well when the domain is fixed, the training and test data match, and a large amount of training data is available. However, standard SMT models tend to perform much better on morphologically simple languages. Highly inflected languages with a large number of potential word forms, as well as language pairs with large-scale word order differences, are more problematic, particularly when training data is sparse.

The syntax-based statistical translation model accepts a parse tree as input; that is, the input sentence is first processed by a syntactic parser. Three operations are applied to the nodes of the parse tree: reordering child nodes, inserting extra words at each node, and translating leaf words. The output is a string, not a parse tree, so parsing is needed only on the input side. The reorder operation is intended to model translation between languages with different word orders, such as SVO and SOV languages. The word-insertion operation is intended to capture differences in how languages mark syntactic case.
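The three node operations can be sketched as a recursive tree-to-string transformation. The sketch below is deterministic, whereas the model itself assigns probabilities to each choice. The Node class, the three lookup tables, and the toy English-to-Japanese entries are all hypothetical and chosen only to show an SVO input becoming an SOV output string with inserted case particles.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    word: Optional[str] = None  # set on leaf nodes only


# Hypothetical toy tables; a trained model would hold probabilities instead.
REORDER = {("VP", ("VB", "NP")): (1, 0)}   # (label, child labels) -> new child order
INSERT = {                                  # (parent label, label) -> (side, token)
    ("S", "NP"): ("right", "wa"),
    ("VP", "NP"): ("right", "o"),
}
TRANSLATE = {"he": "kare", "reads": "yomu", "books": "hon"}


def transform(node, parent_label=None):
    """Return the output word list for the subtree rooted at node."""
    if node.word is not None:
        # Translate operation: applies to leaf words.
        words = [TRANSLATE.get(node.word, node.word)]
    else:
        # Reorder operation: permute the children, defaulting to input order.
        child_labels = tuple(child.label for child in node.children)
        order = REORDER.get((node.label, child_labels), tuple(range(len(child_labels))))
        words = []
        for index in order:
            words.extend(transform(node.children[index], node.label))
    # Insert operation: optionally add a word to the left or right of the node.
    insertion = INSERT.get((parent_label, node.label))
    if insertion is not None:
        side, token = insertion
        words = [token] + words if side == "left" else words + [token]
    return words


# Parse tree for "he reads books" (SVO order).
tree = Node("S", [
    Node("NP", [Node("PRP", word="he")]),
    Node("VP", [
        Node("VB", word="reads"),
        Node("NP", [Node("NNS", word="books")]),
    ]),
])
print(" ".join(transform(tree)))
```

Because the result is a flat word list, no target-side tree is built, which matches the point that parsing is needed only on the input side.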

Besides statistical MT, another corpus-based MT approach that has contributed greatly to the field is Example-Based Machine Translation (EBMT). From its very inception, EBMT has made use of a range of sub-sentential data, both phrasal and lexical, to perform translations. With the advent of phrase-based SMT systems, the line between EBMT and SMT has become significantly blurred.