Tuesday, February 11, 2014

Setting up Moses

Moses is an open source toolkit for statistical machine translation: you train it on parallel corpora and it learns the models needed to translate. It supports both phrase-based and factored translation.
You can find more about Moses here: http://www.statmt.org/moses/

Moses needs g++ and the Boost C++ libraries for a start.
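
On Ubuntu, something along these lines should pull in the compiler and the usual build tools (package names are off the top of my head, so double-check against the official installation page):

sudo apt-get install build-essential git-core zlib1g-dev libbz2-dev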

I started off by cloning the source of Moses and Boost from git and sourceforge respectively.
The latest versions of Moses and Boost are 2.1 and 1.55 respectively.
Moses can be found here: https://github.com/moses-smt/mosesdecoder
Boost can be found here: http://www.boost.org/users/download/
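
For reference, this is roughly how I fetched the two (grab the Boost tarball from the download page linked above; the exact URL depends on the version and mirror you pick):

git clone https://github.com/moses-smt/mosesdecoder.git
# then download the boost_1_xx_0.tar.gz tarball from the Boost download page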

I started off by unpacking Boost and compiling it. But as I found after three wasted hours, there was an incompatibility between Ubuntu 12.04 and Boost 1.55, so I downloaded 1.53, which worked fine for me. After downloading, move it to the folder of your choice and run the following commands. The -j<number> flag sets the number of parallel build jobs, roughly the number of cores on your CPU.
tar zxvf boost_1_53_0.tar.gz
cd boost_1_53_0/
./bootstrap.sh
./b2 -j8 --prefix=$PWD --libdir=$PWD/lib64 --layout=tagged link=static threading=multi,single install || echo FAILURE
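
If the build went through cleanly (and you didn't see that FAILURE echo), the compiled libraries should now be sitting under lib64. A quick sanity check, assuming the prefix used above:

ls lib64/ | grep libboost
# should list the libboost_* archives, e.g. libboost_thread-mt.a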

Next comes Moses. Apart from the trouble the Boost libraries caused, Moses itself was quite smooth to set up. From inside the mosesdecoder directory, run:

./bjam --with-boost=~/rajkiran/boost_1_53_0 -j8 

The --with-boost argument is needed to tell the build where the compiled Boost libraries reside. After you see those two "SUCCESS" statements, try the sample model.

 
 cd ~/mosesdecoder
 wget http://www.statmt.org/moses/download/sample-models.tgz
 tar xzf sample-models.tgz
 cd ~/mosesdecoder/sample-models
 ~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

That will translate "Das ist ein kleines haus" to "This is a small house". Yes, you now have permission to jump for joy. However, our real work begins here.
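
If you want to see it with your own eyes, just print the output file (assuming you ran the decoder from the sample-models directory as above):

cat out
# should print something like: this is a small house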


P.S.: All of the above can also be found on the official website. But what the hell, I just want to write about it.

In the next post, I will talk about the additional dependencies, GIZA++ and IRSTLM, which will be used for word alignment and language model generation respectively.

In the post after that, let's look at our parallel corpus and how we are going to proceed with the translation. Hopefully, there won't be many obstacles. Machine Translation is a fantastic domain!
