MEH with Other Languages

MEH has recently been updated to handle virtually any language that your computer can properly display and interpret. To analyze texts in a language that is not listed in the Language dropdown menu, you only need to complete two steps.

1.   Using the Text Encoding of Input File(s) dropdown menu (located under the Input/Output File Settings tab), choose the proper encoding for your text files. For example, if you want to analyze Arabic texts, you will likely want to use the UTF-8 or UTF-16 encoding option. If you’re not sure about the encoding of your texts, please check out the ExamineTXT and TranscodeTXT programs, located in the toolbox.

2.   Under the Lemmatization tab, choose your desired input language. If MEH includes a built-in stop list and conversion list for your language, you can use those as well. If your desired language is not currently included in the software, feel free to send me an e-mail to see if it’s something that can be easily added.
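If you would like a quick sanity check on the encoding of a text file before choosing an option in step 1, the following is a minimal Python sketch. It is not part of MEH and is not how ExamineTXT or TranscodeTXT work; the function names are hypothetical, and the approach (look for a byte order mark, then test whether the bytes are valid UTF-8) is only a rough heuristic that cannot identify legacy single-byte encodings.

```python
import codecs
from pathlib import Path


def guess_encoding(raw):
    """Guess the encoding of a byte string using its byte order mark (BOM),
    falling back to a strict UTF-8 decode. Returns None if nothing matches.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with
    # the same two bytes as the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


def guess_file_encoding(path):
    """Read a file as raw bytes and apply guess_encoding to its contents."""
    return guess_encoding(Path(path).read_bytes())


def transcode_to_utf8(src, dst, source_encoding):
    """Rewrite a text file as UTF-8, given its known source encoding."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

A result of None means the file is neither BOM-marked nor valid UTF-8, in which case a dedicated tool is the safer way to identify the encoding.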

That’s it! Note, however, that MEH does not currently include default stop lists or default conversions for most of the world’s languages (there are a lot of them!). Additionally, automatic lemmatization is not available for languages that are not included in the menu. For any language not included in MEH by default, you will need to create your own stop list and conversions (the latter can be used to lemmatize manually). I recommend ranks.nl as a good starting point for stop lists.

If you would like me to add a default stop list and a default conversion list for a specific language, please send me an e-mail; I would be happy to add it.


Note that if your language does not have a lemmatizer built into MEH, you may also consider doing lemmatization manually via the “Conversions” box in the software. For example, you could create a conversions list from the lemmatization lists provided here:

https://github.com/michmech/lemmatization-lists

…thus performing lemmatization without actually relying on a trained lemmatizer.
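As a rough illustration of how such a list could be turned into conversions, here is a minimal Python sketch. It rests on two assumptions that you should verify before relying on it: that each line of the downloaded list holds a lemma, a tab, and then an inflected form, and that the Conversions box accepts entries written as form^lemma. The function name is hypothetical.

```python
def build_conversions(lines):
    """Turn 'lemma<TAB>form' lines into 'form^lemma' conversion entries.

    Both the input layout and the output format are assumptions made for
    illustration; check them against the actual list and against MEH.
    """
    conversions = []
    seen_forms = set()
    for line in lines:
        # Drop a possible byte order mark and surrounding whitespace.
        line = line.lstrip("\ufeff").strip()
        if "\t" not in line:
            continue  # skip blank or malformed lines
        lemma, form = (part.strip() for part in line.split("\t", 1))
        # Skip empty forms, identity mappings, and repeated forms
        # (only the first lemma seen for a given form is kept).
        if not form or form == lemma or form in seen_forms:
            continue
        seen_forms.add(form)
        conversions.append(f"{form}^{lemma}")
    return conversions
```

The resulting entries could then be written to a text file, one per line, and pasted into the Conversions box. Where a form maps to more than one lemma, this sketch simply keeps the first, so ambiguous forms deserve a manual review.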


For languages that do not separate words in the same way as English, for example, you may need to tokenize your texts prior to analysis with MEH. I currently have tokenization programs available for both Chinese and Korean. Support for other languages may be added in the future at user request. For now, the Chinese and Korean tokenizers (ZhToken and KoToken, respectively) can be downloaded from here:

https://toolbox.ryanb.cc
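To see why tokenizing first matters, consider the small Python sketch below. It assumes, purely for illustration, that words are found by splitting on whitespace; this is not a description of how MEH or ZhToken work internally, and the segmented Chinese sentence was split by hand.

```python
def whitespace_tokens(text):
    """Split text into tokens wherever whitespace occurs."""
    return text.split()


english = "I like studying languages"
chinese_raw = "我喜欢学习语言"           # no spaces between words
chinese_segmented = "我 喜欢 学习 语言"   # hand-segmented for illustration

english_tokens = whitespace_tokens(english)
raw_tokens = whitespace_tokens(chinese_raw)
segmented_tokens = whitespace_tokens(chinese_segmented)

print(len(english_tokens), len(raw_tokens), len(segmented_tokens))
```

Without prior tokenization the whole Chinese sentence is treated as a single token, whereas the segmented version yields one token per word.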