Tuesday, February 11, 2014

Setting up Moses

Moses is an open source toolkit for statistical machine translation: it uses parallel corpora to train translation models, and it supports both phrase-based and factored translation.
You can find more about Moses here: http://www.statmt.org/moses/

Moses needs g++ and the Boost C++ libraries as prerequisites.
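On Ubuntu, those prerequisites can be pulled in through apt. A small sketch; the exact package list here is an assumption, and the Moses installation page lists further optional packages:

```shell
# Compiler toolchain plus zlib, which the Moses build expects.
# Package names assume Ubuntu 12.04; check the Moses docs for optional extras.
sudo apt-get install build-essential g++ zlib1g-dev
```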

I started off by cloning the Moses source from GitHub and downloading Boost from SourceForge.
The latest versions of Moses and Boost are 2.1 and 1.55 respectively.
Moses can be found here: https://github.com/moses-smt/mosesdecoder
Boost can be found here: http://www.boost.org/users/download/
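Fetching both is a one-liner each. The SourceForge mirror URL below is an assumption; the boost.org download page links the current mirrors:

```shell
# Clone the Moses decoder source from GitHub.
git clone https://github.com/moses-smt/mosesdecoder.git
# Fetch a Boost release tarball (1.53 is the version that ended up working for me,
# as explained below; the exact mirror URL is an assumption).
wget http://downloads.sourceforge.net/project/boost/boost/1.53.0/boost_1_53_0.tar.gz
```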

I first unpacked Boost and tried to compile it. But, as I found after three wasted hours, there is an incompatibility between Ubuntu 12.04 and Boost 1.55, so I downloaded 1.53, which worked fine for me. After downloading, move the archive to a folder of your choice and run the following commands. The -j<number> flag sets the number of parallel build jobs; the number of CPU cores is a sensible value.
tar zxvf boost_1_53_0.tar.gz
cd boost_1_53_0/
./bootstrap.sh
./b2 -j8 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install || echo FAILURE
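If you are not sure what number to pass to -j, you can ask the system directly. A small sketch; nproc is part of GNU coreutils on Ubuntu:

```shell
# Ask the kernel how many CPU cores are available and use that as the job count.
JOBS=$(nproc)
echo "will pass -j${JOBS} to b2"
```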

Next comes Moses. Unlike Boost, which caused all that trouble, Moses was quite smooth to set up. From inside the mosesdecoder directory, run:

./bjam --with-boost=~/rajkiran/boost_1_53_0 -j8 

The --with-boost argument is necessary to tell the build where the compiled Boost libraries reside. After you see those two "SUCCESS" statements, try the sample model.

 
 cd ~/mosesdecoder
 wget http://www.statmt.org/moses/download/sample-models.tgz
 tar xzf sample-models.tgz
 cd ~/mosesdecoder/sample-models
 ~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

That will translate "das ist ein kleines haus" to "this is a small house". Yes, now you have permission to jump for joy. However, our real work begins here.
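The decoder simply reads one tokenized, lowercased sentence per line on stdin, so you can also feed it ad-hoc input instead of a file (this assumes you are still inside the sample-models directory):

```shell
# Translate a single sentence with the toy phrase-based model;
# the translation goes to stdout instead of a file this time.
echo "das ist ein kleines haus" | ~/mosesdecoder/bin/moses -f phrase-model/moses.ini
```

This should print the same "this is a small house" translation as the file-based run.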


P.S.: All of the above can also be found on the official website. But what the hell, I just want to write about it.

In the next post, I will talk about the additional dependencies, GIZA++ and IRSTLM, which are used for word alignment and language model generation respectively.

In the post following that, let us look at our parallel corpus and how we are going to proceed with translation. Hopefully, there won't be many obstacles. Machine Translation is a fantastic domain!

Why the blog?

Machine Translation is a fascinating concept. Its practical implications are visible and useful across all strata of society. That is why it became the topic of my Final Year Project.

Hi, I am Rajkiran, currently in the final year of my Bachelor's degree in Computer Science and Engineering at College of Engineering, Guindy. I am working on an MT project from Tamizh to English.

We had initially set out to approach this problem using UNL (Universal Networking Language). The UNL approach to MT falls under the interlingua category, where the source language text is first converted into a language-independent form (UNL, in this case) before being translated into the target language text. Like all MT approaches, it has its pros and cons. One of its biggest advantages is that code becomes reusable, since enconversion (Source->UNL) and deconversion (UNL->Target) are decoupled. But, for various reasons, we had to drop that approach and turn back to good old statistical MT.

And, there begins my journey of Tamizh to English MT using Moses.