Options

Conversions:

Conversions take place prior to lemmatization when text is being processed; this feature is intended to augment and assist lemmatization, and it also allows you to customize text replacement. The “Conversions” field may be used to fix common misspellings (e.g., “hieght” to “height”; “teh” to “the”), convert “textisms” (e.g., “bf” to “boyfriend”), and so on. The conversions feature also allows for wildcards (*).
The proper format for conversion is:

(original)^(converted form)

Example:

bf^boyfriend

This will replace all occurrences of the word “bf” with “boyfriend” before the text is analyzed. Note that the original and converted forms need not be single words (e.g., “MEH is awesome” to “This software is adequate”). For more advanced uses and a deeper explanation of the conversion engine, please refer to the “Advanced Conversions” page.
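To illustrate the general idea (a hypothetical sketch, not MEH’s actual conversion engine), a rule of the form original^converted can be thought of as a whole-word find-and-replace, with * matching any remaining characters within a word:

import re

def apply_conversion(text, rule):
    # A rule looks like "bf^boyfriend" or "happ*^posEmotion"
    original, converted = rule.split("^")
    # Treat "*" as "any remaining word characters"; match whole words only
    pattern = r"\b" + re.escape(original).replace(r"\*", r"\w*") + r"\b"
    return re.sub(pattern, converted, text, flags=re.IGNORECASE)

print(apply_conversion("my bf is happy", "bf^boyfriend"))        # my boyfriend is happy
print(apply_conversion("the happiest day", "happ*^posEmotion"))  # the posEmotion day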

Stop Words:

You may read more about using stop words here. The only point of note here is that “QQNUMBERQQ” is used to represent all numeric values in the text. This will show up as “#” in the frequency analysis.
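Conceptually, the placeholder works along these lines (a sketch, not MEH’s code): every numeric token is collapsed into the single token QQNUMBERQQ before counting, so all numbers are tallied together and reported as “#”.

import re

def mask_numbers(text):
    # Replace any standalone numeric value with the placeholder token
    return re.sub(r"\b\d+(?:[.,]\d+)?\b", "QQNUMBERQQ", text)

print(mask_numbers("I ran 5 miles in 42.5 minutes"))
# I ran QQNUMBERQQ miles in QQNUMBERQQ minutes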

Dictionary Words:

“Dictionary Words” are user-specified words to include in your results, even if they are low base-rate words. The Dictionary Words feature is multipurpose and can be used in several ways depending on your intent. For example, if you want to search a corpus of text for specific words in specific observations to look for differences, Dictionary Mode works nicely for this. Another example: if you scan one corpus and obtain its most common words, you can copy and paste the words from that frequency list into the Dictionary List to search a new corpus for the same words.

An additional use of Dictionary Words is to code text into categories, similar to software like LIWC. For example, if you would like to code words into positive and negative emotion categories, you could do this coding in the Extra Conversions box:

happy^posEmotion
happ*^posEmotion
surpris*^posEmotion

sad^negEmotion
depress*^negEmotion
furious^negEmotion

Following this, you could simply include “posEmotion” and “negEmotion” in your dictionary list. This will result in output that codes for all of the words specified in your conversion list, aggregated into the categories specified in your dictionary list. Note that if you use this feature for content coding, you will want to use conversions that result in strings that don’t otherwise appear in your files. For example, if you convert all positive emotion words to “PosEmoCoded”, you are probably safe. If you convert all positive emotion words to the word “up”, you are going to have a bad time.
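As a rough sketch of what this workflow accomplishes (hypothetical code, not MEH itself): the conversions map individual words onto category labels, and the dictionary list then simply counts those labels in each observation.

# Hypothetical illustration of conversion-based category coding
conversions = {"happy": "posEmotion", "thrilled": "posEmotion",
               "sad": "negEmotion", "furious": "negEmotion"}
dictionary_words = ["posEmotion", "negEmotion"]

text = "I was happy then sad then furious"
tokens = [conversions.get(w, w) for w in text.lower().split()]

counts = {cat: tokens.count(cat) for cat in dictionary_words}
print(counts)   # {'posEmotion': 1, 'negEmotion': 2}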

Note: If you input words into the Dictionary Words box, MEH will automatically adjust the N-grams and [N-X]-grams to correspond with your list right before analyses. This is done so that multi-word phrases, etc., in your dictionary list are not missed due to an incorrect setting.

Search Files for N-grams:

This option allows you to choose the “window” of words, in terms of N-grams, for which you would like to scan. The default is 1-grams, and this is strongly recommended for the meaning extraction method. You may specify up to 5-grams, which can be useful for frequency analyses. Be advised that with each increase in N, the time that it takes to construct a frequency table grows exponentially.

Include [N-X]-grams:

If you are scanning for N-grams that are larger than 1 (e.g., 2-grams, 3-grams, etc.), you may also want to scan for “sub”-grams. In other words, if you are scanning for 3-grams, you may also be interested in the 2-grams and 1-grams that are present. Selecting this option will tell MEH to include these [N-X]-grams in your frequency list and other output.
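For readers unfamiliar with the terminology, the sketch below (not MEH’s code) shows what N-grams and their [N-X]-gram “sub”-grams look like for a tokenized observation:

def ngrams(tokens, n):
    # All contiguous sequences of n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the meaning extraction method".split()
print(ngrams(tokens, 3))   # ['the meaning extraction', 'meaning extraction method']

# Including [N-X]-grams means also collecting every smaller window:
all_grams = [g for n in range(3, 0, -1) for g in ngrams(tokens, n)]
print(all_grams)           # 3-grams, then 2-grams, then 1-grams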

Minimum Obs. Word Count

This field excludes observations whose word counts fall below the specified minimum from the content analysis portion of the meaning extraction process. Observation word counts are determined after conversions and lemmatization have been applied. Note: this value will not impact the frequency analysis, only the content analysis portion of the process.
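A minimal sketch of the idea (hypothetical code; in MEH the count is taken after conversions and lemmatization):

def meets_minimum(observation_text, min_word_count):
    # Count words after conversions/lemmatization have already been applied
    return len(observation_text.split()) >= min_word_count

observations = ["too short", "this observation easily clears the threshold"]
kept = [obs for obs in observations if meets_minimum(obs, 4)]
print(kept)   # ['this observation easily clears the threshold']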

Minimum Obs. Percentage

Following the frequency analysis, this number specifies the percentage of observations in which an N-gram must appear for it to be included in subsequent steps (such as rescanning and dictionary construction). Your observation percentages will change if you use segmentation, since segmentation and minimum word counts will change your final number of observations (see Understanding Output).

This value can be used to follow standard guidelines (e.g., 5% and above) that have been recommended for performing the meaning extraction method.
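In other words, the retention rule is roughly the following (a sketch, not MEH’s code), where doc_counts maps each n-gram to the number of observations containing it:

def retained_ngrams(doc_counts, total_observations, min_percentage):
    # Keep an n-gram only if it appears in at least min_percentage of observations
    threshold = total_observations * (min_percentage / 100.0)
    return [g for g, n_obs in doc_counts.items() if n_obs >= threshold]

doc_counts = {"work": 120, "family": 80, "quixotic": 3}
print(retained_ngrams(doc_counts, total_observations=1000, min_percentage=5.0))
# ['work', 'family']  -- "quixotic" appears in only 0.3% of observations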


Use Existing T-D Freqs / Freq List:

This option allows you to start from a term-frequency folder and (if desired) a frequency list that has already been generated by MEH. This option may be useful if you need to restart or rerun an analysis for which you have already obtained frequency data, allowing you to expedite the process. It is important that your frequency list contains the metadata that is provided by MEH 1.0.5 and later versions (see Understanding Output). You will also need to let MEH know where your Term Frequency folder is located. This folder is used in further steps.

Freq. List Generation Options

For most users, it will be completely unnecessary to change these options. However, if the “Building Combined Frequency List” phase of your analysis is running particularly slowly, you should consider reading the details of these options to find better settings.

Newer versions of MEH have been designed with larger datasets in mind, both in terms of breadth (i.e., an extremely large number of files) and depth (i.e., files that contain a lot of text). After creating Term Frequency files, MEH needs to recombine this information in order to figure out how many times each word appears, and in what percentage of observations. To achieve this relatively quickly, MEH makes several passes over your Term Frequency information, combining it as efficiently as it can and writing temporary information to your hard drive. Each pass combines a certain amount of data and, after several passes, the data is reduced to the final frequency list.

In this section of the options, you will see a slider that reads “Items to Incorporate per Pass”. If you know that your text files are very large, you may consider setting this to a very low value, as this will result in faster data combination. If you know that you have an extremely large number of very small files, you should consider setting this to a higher value. A related option, “Decrement Value After Each Pass”, helps the later passes combine information more quickly.

An alternative to these options is to have MEH “Dynamically Adjust These Values”. This option uses a simple algorithm to automatically decide how much data to combine at any given point, in an attempt to improve the efficiency of the frequency list generation phase. It also allows you to specify how “strict” the algorithm will be; greater strictness requires more passes but can make the overall process faster. Preliminary benchmarking suggests that the speed of this stage of analysis can be improved by up to approximately 20% with this option. However, the degree to which this algorithm can be efficiently applied to various datasets is currently untested.
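The following sketch illustrates the general idea of pass-based merging (hypothetical code; MEH’s internal implementation and file formats are not documented here). Chunks of term-frequency data are combined a few at a time, intermediate results are held over for the next pass, and the passes repeat until a single combined list remains. Here, items_per_pass loosely corresponds to “Items to Incorporate per Pass” and decrement to “Decrement Value After Each Pass”.

from collections import Counter

def merge_in_passes(chunks, items_per_pass, decrement=0):
    # chunks: a list of Counters, each holding term frequencies for a subset of files
    while len(chunks) > 1:
        merged = []
        for i in range(0, len(chunks), items_per_pass):
            combined = Counter()
            for chunk in chunks[i:i + items_per_pass]:
                combined.update(chunk)   # in MEH, intermediate results go to the hard drive
            merged.append(combined)
        chunks = merged
        items_per_pass = max(2, items_per_pass - decrement)
    return chunks[0]

freqs = merge_in_passes([Counter({"work": 3}), Counter({"work": 1, "family": 2}),
                         Counter({"family": 1})], items_per_pass=2)
print(freqs)   # Counter({'work': 4, 'family': 3})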

Prune Low Baserate N-Grams

This option can help to minimize the amount of time that it takes to build your frequency list and, additionally, can help keep your frequency list size manageable. This is particularly true for very large datasets (either in terms of the number of files or the sheer amount of words in each file).

This option works by selectively omitting low-frequency n-grams while MEH combines information from all of your text files. You will need to specify two things:

  1. The pass number at which you would like to start pruning low baserate n-grams
  2. The minimum frequency an n-gram must have (at your specified pass) in order to be retained

This option, if selected, kicks in during the “Building Combined Frequency List” phase, when MEH is merging together information from all of your files. For example, suppose that it takes 20 complete passes over your files to build your entire frequency list. You can tell MEH that, starting with the 3rd pass, you want to omit any n-gram that has not occurred at least 5 times in any given subset of your data. Once MEH hits the 3rd pass, it will start to prune n-grams that occur below this threshold in each chunk of data that it is combining.

Note that using this option will give you less accurate numbers for low base-rate n-grams, but it will not impact moderately common or highly common n-grams. Essentially, if you want a complete and comprehensive frequency list, you should not use this option. However, if you do not intend to use very uncommon words and do not mind that uncommon words are somewhat under-represented, then this option is ideal. This is particularly true for very large datasets, as the “Building Combined Frequency List” phase can take an extremely long time given the massive number of n-grams that appear only a few times in your entire corpus, consistent with Zipf’s Law.
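Conceptually, the pruning rule slots into the merging passes like this (a hypothetical sketch continuing the example above, not MEH’s actual code):

from collections import Counter

def prune_chunk(chunk, pass_number, prune_start_pass, min_frequency):
    # From prune_start_pass onward, drop n-grams below min_frequency in this chunk
    if pass_number < prune_start_pass:
        return chunk
    return Counter({gram: n for gram, n in chunk.items() if n >= min_frequency})

chunk = Counter({"the meaning": 40, "meaning extraction": 12, "quixotic notion": 2})
print(prune_chunk(chunk, pass_number=3, prune_start_pass=3, min_frequency=5))
# Counter({'the meaning': 40, 'meaning extraction': 12})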


Select Text Encoding

The encoding option allows you to select how you would like your text files to be read, as well as how you would like your output files to be written. For most cases, you will likely want to use your system’s default encoding (which MEH detects and selects by default). If you are experiencing odd characters or broken words in your output, this is likely caused by a mismatch between your selected encoding and the encoding used for the text files. In these cases, you may want to examine your text files and select the appropriate encoding.
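If you are unsure what a mismatch looks like, the short sketch below (hypothetical code; MEH handles this through the interface) shows how the same bytes read correctly with the right encoding but produce garbled characters with the wrong one:

text = "café résumé"
raw = text.encode("utf-8")

print(raw.decode("utf-8"))     # café résumé   -- encodings match
print(raw.decode("latin-1"))   # cafÃ© rÃ©sumÃ©  -- mismatch produces odd characters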

Lemmatization:

Lemmatization using the LemmaGen engine is recommended for the meaning extraction method. You may read more about lemmatization here. Note: For various reasons, such as part-of-speech ambiguity in the absence of context, some words are not converted to potential lemmas during lemmatization. See the “Conversions” feature for additional details.

Output in European Format

This option ensures that the primary output files can be read as proper .csv files on computers whose region is set to a European format. These files have a slightly different internal format, ensuring that columns are properly delimited on computers set to such a region.


“Big Data” Settings

MEH is in many ways designed with your “big data” needs in mind. The two options included for big data are “on the fly” folder indexing and subfolder scanning.

Subfolder scanning allows MEH to crawl through all subfolders of the primary folder that you have specified, find all .txt files in the directory tree, and include them in your analyses. This is particularly useful if you are working with a large dataset, where it is usually best to divide observations into folders of between 10,000 and 25,000 files each. As an example, a single folder containing 22,000,000 text files is effectively beyond the current capabilities of the Windows operating system; you will not be able to peruse a folder that full, let alone analyze it effectively or rapidly. However, if you divide these text files into 800+ folders of 25,000 files each, both MEH and you will be able to quickly and easily access each folder and set of files without putting undue strain on the Windows OS.

Indexing folders “on the fly” can also save a considerable amount of time if you are working with an extremely large number of text files. For example, suppose that you have a dataset with 500,000 text samples, divided into folders of 25,000 text samples each (i.e., 20 folders, each containing 25,000 files). Normally, MEH would go through each folder to determine the total number of files, create a temporary index of all .txt files, and then proceed with the analysis. Indexing folders “on the fly”, however, starts the analysis procedures immediately and indexes each folder of text samples in turn as it proceeds, rather than all folders at once. This option is highly recommended for situations such as this, as indexing files ahead of time is not a useful exercise when there are so many of them. Note that you might see MEH pause for some time while performing analyses as it moves from location to location; this is normal, and MEH is working correctly.
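A rough sketch of both behaviors (hypothetical Python, not MEH’s code): recursive subfolder scanning walks the whole directory tree for .txt files, while on-the-fly indexing defers listing a folder’s contents until that folder is actually reached.

import os

def index_all_up_front(root):
    # Subfolder scanning: collect every .txt file in the directory tree before starting
    files = []
    for dirpath, _, filenames in os.walk(root):
        files.extend(os.path.join(dirpath, f) for f in filenames if f.endswith(".txt"))
    return files

def index_on_the_fly(root):
    # "On the fly": yield files one folder at a time, so analysis can begin immediately
    for dirpath, _, filenames in os.walk(root):
        for f in sorted(filenames):
            if f.endswith(".txt"):
                yield os.path.join(dirpath, f)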

Number of Processor Cores to Use

If you have a multi-core processor in your computer, or you have multiple processors, MEH can take advantage of your hardware to process multiple files at the same time. This results in much faster analyses and can be extremely useful for large datasets. If you notice considerable slowdown or freezing of the interface while processing texts, you may need to lower the number of cores being used by MEH.
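As a hypothetical illustration of the general approach (not MEH’s implementation), a pool of worker processes can handle files in parallel, with the pool size corresponding to the number of cores you allow MEH to use:

from concurrent.futures import ProcessPoolExecutor

def count_words(path):
    # Placeholder per-file work: count words in one text file
    with open(path, encoding="utf-8") as f:
        return path, len(f.read().split())

def process_files(paths, cores=4):
    # Lowering 'cores' reduces load if the interface becomes sluggish
    with ProcessPoolExecutor(max_workers=cores) as pool:
        return dict(pool.map(count_words, paths))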

Skip files that cannot be opened/read

This option is included for people working with large, messy datasets. For example, a dataset might include a series of corrupted text files that were improperly written to disk. By checking this option, MEH will not get stuck in a “retry/cancel” state when it encounters a file that it cannot read; instead, it will attempt to open the file and, if this fails, treat the file as empty.
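The behavior amounts to something like the following sketch (hypothetical code, not MEH’s):

def read_or_empty(path, encoding="utf-8"):
    # If the file is unreadable (corrupted, locked, etc.), treat it as empty text
    try:
        with open(path, encoding=encoding, errors="strict") as f:
            return f.read()
    except (OSError, UnicodeDecodeError):
        return ""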