Getting Your Samples into TXT Files

The Meaning Extraction Helper is set up to work directly with .txt files rather than with texts packed into other types of documents, such as CSV spreadsheets. Why not just have it analyze any kind of file?

The main reason is to make the process easier to troubleshoot when you’re analyzing your text samples. The world of character encodings is surprisingly complicated and difficult to navigate, and I have found that most people are not aware of such complexities. When we combine the issues that people have with the various file formats (CSV vs. XLSX vs. TSV, etc.) with the problems that come up when people don’t know how their files are encoded, troubleshooting can be extremely difficult, especially via e-mail. Once language samples are contained in individual .txt files, it becomes far easier to figure out exactly which encoding should be used, spot specific problems, and more.

While I understand that getting texts out of spreadsheets can be daunting, I have provided some scripts that make this process much, much easier. Here, you can find a couple of scripts for different languages (Python, R) that will make getting texts out of your CSV files and into TXT files a snap. Simply take whatever spreadsheet format your texts are currently in, save it as a CSV file, then use the script that is most convenient / appropriate for your needs:
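For a sense of what this kind of conversion involves, here is a minimal Python sketch. It is not one of the provided scripts, only an illustration under a few assumptions: the CSV has a header row, the texts live in a single named column, and the input encoding is known. The function name `csv_to_txt_files`, its parameters, and the `sample_00001.txt` naming scheme are all hypothetical.

```python
import csv
from pathlib import Path


def csv_to_txt_files(csv_path, out_dir, text_column, encoding="utf-8"):
    """Write the text in each CSV row to its own .txt file.

    Every name here (function, parameters, output file names) is an
    illustrative assumption, not part of the Meaning Extraction Helper.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # newline="" lets the csv module handle line breaks inside quoted cells.
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        for row_number, row in enumerate(reader, start=1):
            text = row[text_column]
            # Zero-padded row numbers keep the files in spreadsheet order.
            out_path = out_dir / f"sample_{row_number:05d}.txt"
            # Always write UTF-8 so every output file has a known encoding.
            out_path.write_text(text, encoding="utf-8")
            written.append(out_path)
    return written
```

A call might look like `csv_to_txt_files("samples.csv", "txt_out", "text")`. If the CSV was exported from a spreadsheet program that adds a byte order mark, passing `encoding="utf-8-sig"` may be needed so the first column name is read correctly.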