Mahmoud said on June 4th, 2011 at 12:58 am :

Great tutorial! I needed this!
Now I wonder if there is a wrapper in ruby so that i can use it in a rails application :)

May 29, 2011 | Using LibSVM for text categorization on Mac OS X

There are very few practical tutorials for going from raw text to classification using LibSVM, and using Weka can be somewhat of a compromise. It also turns out there’s nothing quite as frustrating as having to hunt down one dependency after another, so I’ll try to highlight the path I took in order to make it easier.

Let’s get started!

Sally is a tool that helps transform documents into vectors. Before you even download it, you’ll need libconfig, pkgconfig, gnuplot and libarchive. The easiest way to get them all is to use MacPorts. Once you’ve installed MacPorts, run

sudo port install pkgconfig
sudo port install libconfig-hr
sudo port install libarchive
sudo port install gnuplot    # optional, takes a while

If you run into issues with libconfig, you can also install an earlier version: go to the MacPorts trunk, open the revision you want, download the Original File, and run sudo port install in the download folder.

Once that’s done, get sally and then cd into the sally folder and run

./configure --enable-libarchive
make
make check
sudo make install

Now you need to make sure your text files are in the right format. The easiest way is to

  1. Append .classname at the end of each file. So if you had two classes, use .class1, .class2 to distinguish between them. (Of course you can use different names).
  2. Have all your files in one folder
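If your documents currently sit in one folder per class, a small script can do the renaming and flattening for you. This is only a sketch: the function name collect_labeled_files and the folder-per-class starting layout are assumptions I made for illustration, not something Sally requires.

```python
import shutil
from pathlib import Path


def collect_labeled_files(source_root, target_dir):
    """Copy source_root/<classname>/<file> to target_dir/<file>.<classname>.

    Assumes (hypothetically) one sub-folder per class under source_root.
    Returns the list of paths written.
    """
    source_root = Path(source_root)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for class_dir in sorted(p for p in source_root.iterdir() if p.is_dir()):
        for doc in sorted(p for p in class_dir.iterdir() if p.is_file()):
            # Append the class name as the final suffix: a.txt -> a.txt.class1
            dest = target_dir / f"{doc.name}.{class_dir.name}"
            if dest.exists():
                raise FileExistsError(f"{dest} already exists; refusing to overwrite")
            shutil.copy2(doc, dest)
            written.append(dest)
    return written
```

Calling collect_labeled_files("raw", "data") would leave everything in a single "data" folder with the class name as the last extension, which is the layout described in the two steps above. Keep the target folder outside the source folder so it is not picked up as a class.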

Let’s say all your text files are in a folder called “data” (original, I know). Since my config loader was a bit screwy, I passed the options on the command line instead:

sally --input_format dir --chunk_size 128 --ngram_len 1 --ngram_delim "%0a%0d%20%22.,:;?" --vect_embed tfidf --vect_norm none --output_format libsvm data data.libsvm

This should output data.libsvm. From there, we now head over to libsvm.
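If you want to sanity-check data.libsvm before training, it helps to know the layout: each line is a label followed by space-separated index:value pairs for the non-zero features. Here is a minimal parser sketch, assuming that standard layout and assuming that anything after a # on a line is a comment that can be ignored. The names parse_libsvm_line and load_libsvm are made up for illustration.

```python
def parse_libsvm_line(line):
    """Parse one 'label idx:val idx:val ...' line into (label, {idx: val})."""
    # Assumption: text after '#' is a trailing comment and can be dropped.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None  # blank or comment-only line
    label, *pairs = body.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def load_libsvm(path):
    """Return a list of (label, features) tuples for every data line in path."""
    with open(path, encoding="utf-8") as handle:
        parsed = (parse_libsvm_line(line) for line in handle)
        return [row for row in parsed if row is not None]
```

Something like len(load_libsvm("data.libsvm")) should then match the number of documents in your data folder, which is a quick way to confirm nothing was skipped.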

As a sidenote, we could run a linear classifier using LibLinear, which cuts down SVM computation significantly on large data sets. The sally tutorial follows this path by running

train -v 5 -c 100 data.libsvm
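In that command, -v 5 asks for 5-fold cross-validation and -c 100 sets the cost parameter C. To make the cross-validation part concrete, here is a rough sketch of how a data set gets split into folds. It illustrates the idea only and is not LibLinear's actual splitting code; k_fold_indices and its seed parameter are hypothetical names of my own.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each document lands in the test set exactly once, and the reported accuracy is computed over those held-out predictions rather than on data the model was trained on.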

Going down the path of libsvm: it turns out getting gnuplot is a bit of a hassle and not always necessary. If you do decide to use it, install gnuplot using MacPorts and replace your tools/easy.py with the following file.

The crucial part of successful classification is getting the right features. Otherwise, classification using SVMs can take forever!

More coming soon!

This entry was posted on Sunday, May 29th, 2011 at 2:46 am, EST under the category of Coding.