Understanding Output

Term Frequency Files

Since version 1.2.0, these files are a crucial part of how MEH functions. They provide basic information about each text file that you are processing: namely, the word count (calculated after conversions and lemmatization, but before stop-word removal) and term frequency information for each document.

Important: Do not alter or move these files while your text is processing; doing so may cause your results to be incomplete or incorrect. It is also important to save these files if you intend to reuse your frequency list after all of your text has been initially processed. I would recommend using a compression program, such as 7-Zip for Windows or Keka for Mac, to store these files at little cost in hard drive space.

Frequency Table

By default, MEH will output an N-gram frequency table after performing the initial analysis. This file contains a complete list of all N-grams in your data, with their frequencies counted in two different ways. The first is reflected in a column labeled “Total_Frequency”: the overall frequency of each N-gram across the entire dataset. The second is reflected in a column labeled “Included_Frequency”: the frequency of each N-gram, excluding observations that fall below your specified minimum word count. This table reflects your data after accounting for any conversions, stop words, and lemmatization features that you have selected.

Your frequency table will also include both the raw number and the percentage of observations in which each N-gram appears. Remember that your total number of observations will vary depending upon your choices for segmentation and minimum word count cutoffs, which will in turn alter your observation percentages. It is possible to get 0% in this column for a given N-gram: this happens when an N-gram is detected during frequency analysis but does not occur in any file/segment that meets your minimum WC cutoff value. This area of the frequency list also includes the IDF, or “inverse document frequency”. These values can be paired with your Document Term Matrix (DTM) output file to easily create a dataset in TF-IDF format, which may be useful for various procedures.
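If you would like to verify the IDF values yourself, one common formulation is the natural log of the total number of documents divided by the number of documents containing the term. Below is a minimal sketch in R; note that this is a generic formulation, and MEH’s exact calculation (e.g., log base or smoothing) may differ.

# A common IDF formulation: idf = ln(N / df), where N is the total
# number of documents and df is the number of documents containing
# the term. MEH's exact log base/smoothing may differ.
idf <- function(n_docs, doc_freq) {
  log(n_docs / doc_freq)
}

idf(500, 25)   # a term appearing in 25 of 500 documents: ~3.00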

Finally, the frequency table will also include metadata about the values used when it was created, as well as the final number of observations included. Since the observation percentages are partially determined by values that you have selected (such as your desired observation size), it is important to retain this information for reporting purposes and for reusing your frequency table (see “Use Existing Frequency List” under Options).

External Dictionary File

This feature is currently only available when using MEH to search for 1-grams.

When this option is selected, MEH will create an external content coding dictionary file that may be used as a custom dictionary with RIOT Scan (this file can also be easily edited for use with LIWC 2007). The dictionary is derived from the text that you are processing and accounts for conversions and lemmatization, if those features are being used. Because the dictionary is derived from the text being processed, it may not reflect all possible words that you would like to include for a given category. For example, in this phrase:

Timothy was working on his work. It worked!

…the words “work”, “working”, and “worked” can all be lemmatized and converted into the word “work”, and the custom dictionary will reflect this fact. However, since the word “works” does not appear in this phrase, MEH does not know that it also belongs in this dictionary category. As such, you may want to inspect your dictionary file after it is created and make any additions that you feel are necessary.
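Concretely, the “work” category generated from the phrase above would cover only the observed forms. A hypothetical sketch of the finished entry, after your manual inspection, might look like this (the exact file layout depends on the program you use the dictionary with):

work
worked
working
works     (added manually; this form never appeared in the text)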

Important: The “Build Dictionary” feature of MEH is considerably slower than other features of the software. This is because MEH currently goes through every file and tries every combination of possibilities in your data in order to build the most complete dictionary possible. Making this feature faster and more flexible is on the “to do” list.

Verbose Output

The verbose output generated by MEH is similar to the output created by standard content coding software, such as LIWC or RIOT Scan. Observations are numbered and accompanied by filenames, along with the segment numbers of each file (where applicable). The DictPercent variable reflects the total percentage of each observation that was captured by the searched-for N-grams (specified by the user; see the options page for more information). Additionally, values in each column represent the frequency of each N-gram as a percentage of each observation’s word count (remember that word count is calculated after conversions and lemmatization have been applied).
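As a concrete example: if the N-gram “work” appears 3 times in a 60-word observation, its value in that row would be 5.0. The sketch below reproduces this calculation in R; the function and variable names are placeholders, not MEH internals.

# Verbose scores are N-gram counts expressed as a percentage of each
# observation's word count (WC after conversions/lemmatization).
verbose_score <- function(ngram_count, word_count) {
  100 * ngram_count / word_count
}

verbose_score(3, 60)   # 5: "work" appears 3 times in a 60-word text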

Of note, the columns are pre-sorted (from highest to lowest) according to the percentage of observations containing the corresponding N-gram.

Binary Output

The binary output is identical to the verbose output; however, scores for each N-gram are converted into simple presence/absence scores. Values of 1 and 0 signify the corresponding N-gram’s presence and absence, respectively, for a given observation. As per standard recommendations (e.g., Chung & Pennebaker, 2008), the binary output is often preferred over the verbose output for the meaning extraction method.
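If you have verbose output on hand, this conversion is also easy to reproduce yourself. A minimal R sketch, using hypothetical scores:

# Hypothetical verbose scores for three observations and two N-grams
verbose_scores <- data.frame(work = c(5.0, 0.0, 2.5),
                             cat  = c(0.0, 1.2, 0.0))

# Convert percentage scores to presence (1) / absence (0)
binary_scores <- as.data.frame(ifelse(verbose_scores > 0, 1, 0))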

DTM (Document Term Matrix) Output

The document term matrix output is similar to the binary and verbose outputs; however, it provides the raw counts for each N-gram per observation. This output file can easily be used for procedures such as Latent Dirichlet Allocation (LDA) via the “topicmodels” package in R.

If you are new to LDA, or you simply need an R script that makes LDA easy to use with MEH’s DTM output, I have written one that you may freely use. It can be downloaded here.
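If you would simply like to see the shape of that workflow, here is a minimal sketch using the “topicmodels” package. The file name, metadata columns, and number of topics are all placeholder assumptions; adjust them to match your own DTM output.

# Minimal LDA sketch for a DTM stored as a CSV. Assumes the first two
# columns are metadata (e.g., filename and segment) and the remaining
# columns are raw N-gram counts; adjust to your actual output.
library(topicmodels)
library(slam)

dtm_raw <- read.csv("MEH_DTM_output.csv", check.names = FALSE)
dtm <- as.matrix(dtm_raw[, -(1:2)])        # drop metadata columns
dtm <- dtm[rowSums(dtm) > 0, ]             # LDA requires non-empty rows

lda_model <- LDA(as.simple_triplet_matrix(dtm), k = 10,
                 control = list(seed = 1234))
terms(lda_model, 10)                       # top 10 terms per topic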

Term Frequency – Inverse Document Frequency (TF-IDF) Output

This output is derived from a combination of the DTM output and the inverse document frequency (IDF) data that is generated with the frequency list. The same values can be calculated manually from other MEH output; however, it’s always nice to have software do the work for us. If you are unfamiliar with the concept behind TF-IDF, a great primer page is located here.
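If you ever need to reproduce this file manually, the calculation is just each term frequency multiplied by the corresponding IDF value. Below is a minimal sketch with toy data; note that some TF-IDF formulations first normalize counts by document length, and the convention shown here (raw counts times IDF) is only one variant.

# Toy DTM: 2 observations x 3 N-grams, raw counts
dtm <- matrix(c(3, 0, 1,
                0, 2, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("cat", "hat", "work")))

# Toy IDF: ln(N / df) computed from this 2-document example
idf <- log(2 / c(1, 1, 2))

# Multiply each column of the DTM by its N-gram's IDF value
tfidf <- sweep(dtm, 2, idf, `*`)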

Edge Matrices / Node Edge List Output

These output files are experimental in their current form; use them with caution, as they have not been thoroughly examined and tested. The matrices and node/edge lists contain raw co-occurrence counts between the different N-grams. The node and edge lists are specifically designed to be loaded into Gephi, a network analysis tool. When two co-occurring N-grams appear different numbers of times, the smaller value is used; for example, if the word “CAT” is used 3 times and the word “HAT” is used 5 times, this is treated as 3 co-occurrences.
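The “smaller value” rule is easy to reproduce from a counts matrix. Here is a minimal R sketch; the function itself, and the decision to sum per-observation minima across observations, are my assumptions rather than confirmed MEH internals.

# Raw co-occurrence between two N-grams within an observation is the
# smaller of their two counts (e.g., CAT x3 and HAT x5 counts as 3);
# here those per-observation minima are summed across observations.
cooccurrence_matrix <- function(dtm) {
  n <- ncol(dtm)
  out <- matrix(0, n, n, dimnames = list(colnames(dtm), colnames(dtm)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- sum(pmin(dtm[, i], dtm[, j]))
    }
  }
  out
}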

Note: I’ve included an R script that automatically processes network data files and extracts data for each network into a single .csv file, which can be used to directly compare the structure of different networks. See the R script here: MEH Network Analysis.R