Japanese Natural Language Processing With MeCab

If you want to do natural language processing in Japanese, MeCab may be your best choice. In this article, let's see how to use it.

Basic Usage

You can follow the documentation to install it. Because I need to use it from Python, the mecab-python3 package needs to be installed too.

If the installation goes well, we can follow the examples to do Japanese NLP.

>>> import MeCab
>>> wakati = MeCab.Tagger("-Owakati")
>>> wakati.parse("pythonが大好きです").split()
['python', 'が', '大好き', 'です']

>>> tagger = MeCab.Tagger()
>>> print(tagger.parse("pythonが大好きです"))
python  python  python  python  名詞-普通名詞-一般
が      ガ      ガ      が      助詞-格助詞
大好き  ダイスキ        ダイスキ        大好き  形状詞-一般
です    デス    デス    です    助動詞  助動詞-デス     終止形-一般
EOS
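To work with the tagged output programmatically, the printed text can be split into tokens. The following is a minimal sketch, assuming each line is a surface form followed by a tab and the feature columns, ending with an EOS line. The helper name parse_mecab_output is hypothetical, and the exact feature layout depends on the dictionary in use.

```python
def parse_mecab_output(text):
    """Split MeCab's default text output into (surface, features) pairs.

    Assumes one token per line, with the surface form separated from the
    feature columns by a tab, and a final "EOS" line.
    """
    tokens = []
    for line in text.splitlines():
        if line == "EOS":
            break  # end-of-sentence marker, nothing follows for this sentence
        if not line:
            continue  # skip blank lines
        surface, _, features = line.partition("\t")
        tokens.append((surface, features))
    return tokens


# Sample text shaped like the output above (tab-separated columns).
sample = "python\tpython\tpython\tpython\t名詞-普通名詞-一般\nが\tガ\tガ\tが\t助詞-格助詞\nEOS\n"
for surface, features in parse_mecab_output(sample):
    print(surface, "|", features)
```

With a real tagger, the same function could be applied to the string returned by tagger.parse(...).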

Custom Dictionary

Showing you how to use MeCab is not the main point, though. The point is how to customize the dictionary data that MeCab uses.

According to the official MeCab documentation, to build a custom user dictionary we need to prepare the data first.

The data format looks like this.

表層形,左文脈ID,右文脈ID,コスト,品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用型,活用形,原形,読み,発音

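In English, the fields are: surface form, left context ID, right context ID, cost, part of speech, part-of-speech subcategories 1 to 3, conjugation type, conjugation form, base form, reading, and pronunciation. All of these attributes need to be defined for each entry. For example, we can define a custom noun (名詞) like this.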

工藤,1223,1223,6058,名詞,固有名詞,人名,名,*,*,くどう,クドウ,クドウ

OK, now we have our custom data. Store it in a CSV file.
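If the entries are generated from Python, the CSV file can be written with the standard csv module. This is a minimal sketch: the helper names entries_to_csv and write_user_dict and the file name user_dict.csv are assumptions made for illustration, and the encoding has to match whatever is later passed to mecab-dict-index with -f.

```python
import csv
import io

# The example entry from above, one list item per field.
KUDO_ENTRY = ["工藤", 1223, 1223, 6058, "名詞", "固有名詞", "人名", "名",
              "*", "*", "くどう", "クドウ", "クドウ"]


def entries_to_csv(entries):
    """Render dictionary entries as CSV text, one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(entries)
    return buffer.getvalue()


def write_user_dict(path, entries, encoding="utf-8"):
    """Write the entries to a CSV file in the given encoding."""
    # newline="" keeps the "\n" line endings exactly as rendered.
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(entries_to_csv(entries))
```

Calling write_user_dict("user_dict.csv", [KUDO_ENTRY], encoding="shift_jis") would write a shift_jis file, while the default writes UTF-8.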

Then we need to compile the custom data into a binary dictionary. There are two ways to do it. The first is to compile the custom data together with the system dictionary. The second is to compile only the custom data. As you can imagine, the former is slower than the latter.

The documentation shows how to do this on Linux, but I need to do it on Windows, so I will show the Windows steps below.

Now let's look at the first method, compiling the custom data together with the system dictionary.

If MeCab is installed properly, there is an executable at C:\Program Files (x86)\MeCab\bin\mecab-dict-index. This is the tool used to compile dictionaries. Another directory to note is C:\Program Files (x86)\MeCab\dic\ipadic. This is the folder containing the system dictionary data.

Now put the CSV file containing the custom user data into the ipadic folder, then run the command below. The paths are quoted because they contain spaces.

"C:\Program Files (x86)\MeCab\bin\mecab-dict-index" -f shift_jis -t utf-8 -d "C:\Program Files (x86)\MeCab\dic\ipadic" -o "C:\Program Files (x86)\MeCab\dic\ipadic"

The parameters mean the following:

  • -f encoding of source data files
  • -t encoding of target dictionary files
  • -d source system dictionary folder
  • -o output system dictionary folder

Pay attention to the -f encoding here. Because the custom user data is compiled together with the existing system data, all source files must share the same encoding. In my MeCab installation, the system dictionary CSV files are all shift_jis encoded, so to keep things simple, the custom user data should be shift_jis encoded too.

If the command runs successfully, four files should be created:

matrix.bin
sys.dic
char.bin
unk.dic
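A quick way to confirm the build from Python is to check that these four files exist. This is only a sketch: the function name missing_dictionary_files is hypothetical, and the file list is simply the one given above.

```python
from pathlib import Path

# The files the article expects after a successful build.
EXPECTED_FILES = ("matrix.bin", "sys.dic", "char.bin", "unk.dic")


def missing_dictionary_files(dic_dir):
    """Return the expected dictionary files that are absent from dic_dir."""
    dic_dir = Path(dic_dir)
    return [name for name in EXPECTED_FILES if not (dic_dir / name).is_file()]
```

An empty list means all four files are present in the output folder.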

Note that this writes files into the Program Files folder on the C drive, so be sure you have administrator privileges. Because I need to run this command from Python, I took a simple approach: I copied the bin folder and the ipadic folder into my code base, so I can write data freely.

With this new dictionary generated, the custom user data should take effect the next time MeCab loads the dictionary.

One more point: if you copied the bin folder and ipadic folder into your code base as I did, you need to specify the dictionary path manually, like this.
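Running the compile step from Python could look roughly like the sketch below. It assumes the bin and ipadic folders were copied next to the script. The function names and the relative paths are illustrative, not part of MeCab. Passing the arguments as a list to subprocess.run sidesteps quoting problems with paths that contain spaces.

```python
import subprocess


def build_sysdic_command(indexer, dic_dir, source_encoding="shift_jis",
                         target_encoding="utf-8"):
    """Return the argument list that rebuilds the system dictionary in place."""
    return [
        str(indexer),
        "-f", source_encoding,   # encoding of the source CSV files
        "-t", target_encoding,   # encoding of the compiled dictionary
        "-d", str(dic_dir),      # source system dictionary folder
        "-o", str(dic_dir),      # output folder (same folder here)
    ]


def rebuild_system_dictionary(indexer, dic_dir, **encodings):
    """Run mecab-dict-index; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_sysdic_command(indexer, dic_dir, **encodings), check=True)
```

For example, rebuild_system_dictionary("./bin/mecab-dict-index", "./ipadic") would run the same command as above against the copied folders.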

import MeCab

tagger = MeCab.Tagger("-d ./ipadic")
# ...

OK, that was the first method. Now let's look at the second method, compiling the custom user data independently.

Just like before, we need the mecab-dict-index tool and the ipadic folder. Run the command below to compile.

"C:\Program Files (x86)\MeCab\bin\mecab-dict-index" -f utf-8 -t utf-8 -d "C:\Program Files (x86)\MeCab\dic\ipadic" -u user.dic user_dict.csv

As you can see, instead of specifying the output folder with -o, we use -u to specify the output user dictionary file, and put the path of the custom user data CSV file at the end.

If the command runs successfully, a file named user.dic should be created. We can pass its path to MeCab, and MeCab will load it.
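The user dictionary build can also be driven from Python. The sketch below mirrors the command above. The function names are hypothetical, and the paths are whatever applies to your setup.

```python
import subprocess


def build_userdic_command(indexer, dic_dir, user_dic, user_csv,
                          source_encoding="utf-8", target_encoding="utf-8"):
    """Return the argument list that compiles a standalone user dictionary."""
    return [
        str(indexer),
        "-f", source_encoding,   # encoding of the user CSV file
        "-t", target_encoding,   # encoding of the compiled dictionary
        "-d", str(dic_dir),      # system dictionary folder used as reference
        "-u", str(user_dic),     # output user dictionary file
        str(user_csv),           # input CSV with the custom entries
    ]


def compile_user_dictionary(indexer, dic_dir, user_dic, user_csv, **encodings):
    """Run mecab-dict-index; raises CalledProcessError on a non-zero exit."""
    cmd = build_userdic_command(indexer, dic_dir, user_dic, user_csv, **encodings)
    subprocess.run(cmd, check=True)
```

Because only the user CSV is compiled, this step can be rerun whenever the custom entries change, without touching the system dictionary.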

import MeCab

# it will load system dictionary under the hood
# use -d to specify system dictionary path if the path is not the default value
tagger = MeCab.Tagger("-u ./user.dic")
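To check that a custom entry is actually picked up, one option is to tokenize a sentence and see whether the word comes out as a single token. The helper below is a hypothetical sketch: it accepts any object with a parse method that returns space-separated tokens, such as a tagger created with MeCab.Tagger("-Owakati -u ./user.dic"). Whether MeCab chooses the custom entry also depends on its cost, so a False result does not necessarily mean the dictionary failed to load.

```python
def contains_token(wakati_tagger, sentence, word):
    """Return True if `word` appears as one whole token in the segmentation.

    `wakati_tagger` is assumed to return space-separated tokens from parse(),
    as a tagger created with the -Owakati option does.
    """
    tokens = wakati_tagger.parse(sentence).split()
    return word in tokens
```

With a real tagger, contains_token(tagger, "工藤さんが来ました", "工藤") reports whether 工藤 was kept together as one token.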

OK, that's all for this article. Compiling custom data into the system dictionary is slower, but it gives us a single, complete new dictionary. Compiling it independently is quicker, but two dictionaries have to be loaded. Choose according to your situation.