If you want to do natural language processing in Japanese, MeCab may be your best choice. In this article, let's see how it can be used.
If the installation goes well, we can follow the examples below to do Japanese NLP.
    import MeCab

    wakati = MeCab.Tagger("-Owakati")
    wakati.parse("pythonが大好きです").split()
    # ['python', 'が', '大好き', 'です']

    tagger = MeCab.Tagger()
    print(tagger.parse("pythonが大好きです"))
    # python python python python 名詞-普通名詞-一般
    # が ガ ガ が 助詞-格助詞
    # 大好き ダイスキ ダイスキ 大好き 形状詞-一般
    # です デス デス です 助動詞 助動詞-デス 終止形-一般
    # EOS
Showing how to use MeCab is not the point, though. The point is how to customize the dictionary data that MeCab uses.
According to the official MeCab documentation, to customize the user dictionary we need to prepare the data first.
The data format goes like this.
These are all the Japanese language attributes that need to be defined. For example, we can define a custom 名詞 (noun) like this.
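As a hedged illustration of such an entry, the sketch below builds one csv row in the 13-field layout used by IPADIC-style dictionaries (surface form, left context ID, right context ID, cost, part of speech, three part-of-speech subcategories, conjugation type, conjugation form, base form, reading, pronunciation). The sample values follow the pattern of the person-name example in the MeCab documentation; the context IDs and cost depend on the dictionary you compile against, so treat them as placeholders. The names `FIELDS` and `to_csv_line` are hypothetical helpers, not part of MeCab.

```python
import csv
import io

# Assumed field order for an IPADIC-style user dictionary row.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
]

# Illustrative values only: the IDs and cost are placeholders that
# depend on the system dictionary in use.
entry = {
    "surface": "工藤", "left_id": "1223", "right_id": "1223", "cost": "6058",
    "pos": "名詞", "pos_sub1": "固有名詞", "pos_sub2": "人名", "pos_sub3": "名",
    "conj_type": "*", "conj_form": "*",
    "base_form": "くどう", "reading": "クドウ", "pronunciation": "クドウ",
}


def to_csv_line(row):
    """Serialize one entry dict into a single csv line in FIELDS order."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow([row[name] for name in FIELDS])
    return buf.getvalue().rstrip("\n")


line = to_csv_line(entry)
print(line)
```

Using the `csv` module rather than joining strings by hand keeps the row valid if a surface form ever contains a comma.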
OK, now that we have the custom data, store it in a csv file.
Then we need to compile the custom data into binary dictionary form. There are 2 ways to do it. The first is to compile the custom data together with the system dictionary. The second is to compile only the custom data. As you can imagine, the former is slower than the latter.
The documentation shows how to do this on Linux, but I need to do it on Windows, so I will show the Windows steps below.
Now let's look at the first method, compiling the custom data together with the system dictionary.
If MeCab is installed properly, you can find an executable file at C:\Program Files (x86)\MeCab\bin\mecab-dict-index. This is the tool used to compile dictionaries. Another directory to note is C:\Program Files (x86)\MeCab\dic\ipadic. This is the folder containing the system dictionary data.
Now put the csv file that contains the custom user data into the ipadic folder. Then run the command below.
"C:\Program Files (x86)\MeCab\bin\mecab-dict-index" -f shift_jis -t utf-8 -d "C:\Program Files (x86)\MeCab\dic\ipadic" -o "C:\Program Files (x86)\MeCab\dic\ipadic"
The parameters mean the following:
-f: encoding of source data files
-t: encoding of target dictionary files
-d: source system dictionary folder
-o: output system dictionary folder
The -f encoding deserves attention here. Because the custom user data is compiled together with the existing system data, all source files must share the same encoding. In my MeCab installation, the system dictionary csv files are all shift_jis encoded, so to keep things easy, the custom user data should use shift_jis too.
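If the custom data was written in UTF-8, it has to be converted before this step. Here is a minimal sketch of that conversion using only Python's built-in codecs; the function name `reencode` and the sample row are hypothetical, and it assumes every character in the data can be represented in shift_jis.

```python
def reencode(data: bytes, source_encoding: str = "utf-8",
             target_encoding: str = "shift_jis") -> bytes:
    """Decode bytes from one encoding and encode them in another.

    Raises UnicodeEncodeError if a character has no shift_jis form,
    which is safer than silently dropping it from the dictionary.
    """
    return data.decode(source_encoding).encode(target_encoding)


# Illustrative row only; the IDs and cost are placeholders.
utf8_row = "工藤,1223,1223,6058,名詞,固有名詞,人名,名,*,*,くどう,クドウ,クドウ\n".encode("utf-8")
sjis_row = reencode(utf8_row)

# The text is unchanged; only the byte representation differs.
print(sjis_row.decode("shift_jis") == utf8_row.decode("utf-8"))
```

For a whole file, the same idea applies: read the bytes, pass them through `reencode`, and write the result to the csv file placed in the ipadic folder.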
If the command runs successfully, 4 files should be created:
matrix.bin sys.dic char.bin unk.dic
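A quick way to confirm the result from a script is to check that those four files exist. This is a minimal sketch; `missing_dictionary_files` is a hypothetical helper, and the `ipadic` path assumes the dictionary folder sits next to the script.

```python
from pathlib import Path

# The four files the compile step is expected to produce.
EXPECTED_FILES = ("matrix.bin", "sys.dic", "char.bin", "unk.dic")


def missing_dictionary_files(dic_dir):
    """Return the expected dictionary files that are absent from dic_dir."""
    dic_dir = Path(dic_dir)
    return [name for name in EXPECTED_FILES if not (dic_dir / name).is_file()]


missing = missing_dictionary_files("ipadic")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All compiled dictionary files are present.")
```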
Note that this writes files into the Program Files folder on the C drive, so be sure you have administrator privileges. Because I need to run this command from Python, I took a simple approach: I copied the bin folder and the ipadic folder into my code base, so I can write any data freely.
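Running the compile step from Python could look like the sketch below. It assumes the bin and ipadic folders were copied next to the script and that the executable is named `mecab-dict-index.exe` there; `build_system_dict_command` is a hypothetical helper. Passing the command as a list to `subprocess.run` avoids quoting problems with paths that contain spaces.

```python
import subprocess
from pathlib import Path


def build_system_dict_command(tool, dic_dir,
                              source_encoding="shift_jis",
                              target_encoding="utf-8"):
    """Build the argument list for recompiling the system dictionary in place."""
    return [
        str(tool),
        "-f", source_encoding,   # encoding of the source csv files
        "-t", target_encoding,   # encoding of the compiled dictionary
        "-d", str(dic_dir),      # folder holding the source dictionary data
        "-o", str(dic_dir),      # folder receiving the compiled files
    ]


# Assumed locations: folders copied into the code base.
tool = Path("bin") / "mecab-dict-index.exe"
dic_dir = Path("ipadic")

cmd = build_system_dict_command(tool, dic_dir)
print(cmd)

# Only run when the tool is actually present.
if tool.exists():
    subprocess.run(cmd, check=True)
```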
With this new dictionary generated, the custom user data should take effect the next time MeCab loads the dictionary.
Another point: if you copied the bin folder and the ipadic folder into your code base as I did, then you need to specify the dictionary path manually like this.
    import MeCab

    tagger = MeCab.Tagger("-d ./ipadic")
    # ...
OK, that is the first method. Now let's see the second method, compiling the custom user data independently.
Just like before, we need the mecab-dict-index tool and the ipadic folder. Then run the command below to compile.
"C:\Program Files (x86)\MeCab\bin\mecab-dict-index" -f utf-8 -t utf-8 -d "C:\Program Files (x86)\MeCab\dic\ipadic" -u user.dic user_dict.csv
As you can see, instead of specifying the output folder path with -o, we use -u to specify the output user dictionary file path, and put the custom user data csv file path at the end.
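The same command can be assembled from Python. The sketch below only shows the argument order (output file after -u, input csv last); `build_user_dict_command` is a hypothetical helper and the paths are assumptions about where the files live.

```python
import subprocess
from pathlib import Path


def build_user_dict_command(tool, system_dic_dir, user_csv,
                            user_dic="user.dic", encoding="utf-8"):
    """Build the argument list for compiling only the user dictionary."""
    return [
        str(tool),
        "-f", encoding,             # encoding of the user csv file
        "-t", encoding,             # encoding of the compiled user dictionary
        "-d", str(system_dic_dir),  # system dictionary folder used as reference
        "-u", str(user_dic),        # output user dictionary file
        str(user_csv),              # input csv goes last
    ]


tool = Path(r"C:\Program Files (x86)\MeCab\bin\mecab-dict-index.exe")
cmd = build_user_dict_command(tool, r"C:\Program Files (x86)\MeCab\dic\ipadic",
                              "user_dict.csv")
print(cmd)

# Only run when the tool is actually installed at that path.
if tool.exists():
    subprocess.run(cmd, check=True)
```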
If the command runs successfully, a file named user.dic should be created. We can pass its path to MeCab, and MeCab will load it.
    import MeCab

    # it will load the system dictionary under the hood
    # use -d to specify the system dictionary path if it is not the default value
    tagger = MeCab.Tagger("-u ./user.dic")
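When both a non-default system dictionary and a user dictionary are in play, the two options can be combined in one argument string. A minimal sketch, with `tagger_options` as a hypothetical helper and the paths assumed to be relative to the script and free of spaces:

```python
import os


def tagger_options(user_dic, system_dic=None):
    """Build the option string passed to MeCab.Tagger."""
    parts = []
    if system_dic is not None:
        parts += ["-d", str(system_dic)]  # non-default system dictionary
    parts += ["-u", str(user_dic)]        # compiled user dictionary
    return " ".join(parts)


options = tagger_options("./user.dic", system_dic="./ipadic")
print(options)

try:
    import MeCab
except ImportError:
    MeCab = None

# Only create the tagger when MeCab and both dictionaries are available.
if MeCab is not None and os.path.isfile("./user.dic") and os.path.isdir("./ipadic"):
    tagger = MeCab.Tagger(options)
    print(tagger.parse("pythonが大好きです"))
```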
OK, that's all for this article. Compiling the custom data into the system dictionary is slower, but it gives us a single, complete new dictionary. Compiling it independently is quicker, but then we have to load 2 dictionaries. Choose according to your actual situation.