Segmentation Options

Split Files into Equally Sized Segments:

This option splits every incoming file into the same number of segments. For example, you can use it to split each of your text files into 10 parts. The segments within each file will be approximately the same size.
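As a rough illustration of what an equal split looks like, the following Python sketch divides a text into a fixed number of segments by word count. The function name split_into_n_segments and the whitespace-based word splitting are assumptions made for this example; they are not taken from the software itself.

```python
def split_into_n_segments(text, n_segments):
    """Split text into n_segments pieces of near-equal word count.

    Hypothetical helper for illustration only. Words are assumed to be
    whitespace-delimited.
    """
    words = text.split()
    # Every segment gets `base` words; the first `extra` segments get one more.
    base, extra = divmod(len(words), n_segments)
    segments = []
    start = 0
    for i in range(n_segments):
        size = base + (1 if i < extra else 0)
        segments.append(" ".join(words[start:start + size]))
        start += size
    return segments


if __name__ == "__main__":
    for segment in split_into_n_segments("a b c d e f g", 3):
        print(segment)
```

Any leftover words are spread across the first few segments, so segment sizes differ by at most one word.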

Desired Segment Size:

This field lets you specify the desired maximum segment size, in words, to use when processing text. Use the default value of zero to leave text unsegmented. Your text files will be parsed so that segments are as close to your target size as possible.

This option may also be thought of as a word count normalization tool, as well as an upper limit on word count. It is especially useful when the files that you would like to process vary in word count. The Meaning Extraction Method works best when word counts across observations are relatively homogeneous. When using this option, no segment will exceed the limit that you place in this field. For example, a target segment size of 150 will parse files as follows:

An 80-word observation remains at 80 words.
A 300-word observation becomes two 150-word segments.
A 500-word observation becomes 4 segments, each containing approximately 125 words.

Note: If an observation is segmented, its segments will never fall below 50% of the limit specified by this option. Observations that already fall below the target segment size specified by the user will remain unsegmented at their original word count.
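The examples and the 50% note above are consistent with a simple rule: use the smallest number of segments that keeps every segment at or under the limit, then divide the words as evenly as possible. The Python sketch below shows that rule. It is an assumption inferred from the examples, not a description of the software's internal code, and the name segment_sizes is hypothetical.

```python
def segment_sizes(word_count, limit):
    """Return the word count of each segment for one observation.

    Hypothetical helper: assumes the smallest number of segments that keeps
    every segment at or below `limit`, with words spread evenly.
    """
    # A limit of zero means "do not segment"; short observations stay whole.
    if limit <= 0 or word_count <= limit:
        return [word_count]
    # Ceiling division without floating point.
    n_segments = -(-word_count // limit)
    base, extra = divmod(word_count, n_segments)
    return [base + (1 if i < extra else 0) for i in range(n_segments)]


if __name__ == "__main__":
    for count in (80, 300, 500):
        print(count, segment_sizes(count, 150))
```

Under this rule, an observation split into k segments has more than (k - 1) times the limit in words, so each segment holds more than (k - 1) / k of the limit, which is at least half the limit for any k of 2 or more.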

Think of engaging in the Meaning Extraction Method as a bit like tuning an oscilloscope. You have two knobs that you are trying to tweak to find the “best” possible theme solution. You will want to tune your “wave amplitude” knob (i.e., the segment size / word count normalization) and then try to find the right “wave frequency” (i.e., various PCA solutions). Turn your “amplitude” knob to a good spot, then try adjusting the frequency. If you are getting a “noisy” signal, then try changing the amplitude, then adjust your frequency some more.

Segment Text with Regular Expression:

This allows you to enter a regular expression that will be used to determine where to segment texts. For example, if you want to split your text files by paragraph, you might use the regular expression \r\n for newline splits. Every time a match is found for your expression, a split will be placed at that location. This option is also useful for building semantic network data, be it at the paragraph level (\r\n), sentence level, etc. It is likewise useful for other topic modeling methods, such as LDA, if you are looking for a specific level of analysis.
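To preview how an expression will divide a text before using it here, you can try it with a general-purpose regular expression engine. The sketch below uses Python's re module. The function name split_on_pattern, the sentence-level pattern, and the choice to drop empty segments are assumptions made for this example, and regex syntax can differ slightly between engines.

```python
import re


def split_on_pattern(text, pattern):
    """Split text wherever `pattern` matches and return the pieces.

    Hypothetical helper for illustration. Segments that are empty or
    whitespace-only are dropped, which is an assumption of this sketch.
    """
    return [segment for segment in re.split(pattern, text) if segment.strip()]


if __name__ == "__main__":
    # Paragraph-level split on Windows-style line endings.
    print(split_on_pattern("First paragraph.\r\nSecond paragraph.", r"\r\n"))
    # One possible sentence-level pattern: whitespace after ., ! or ?
    print(split_on_pattern("One. Two! Three?", r"(?<=[.!?])\s+"))
```

Files saved with Unix-style line endings contain \n rather than \r\n, so a pattern such as \r?\n will match both kinds of line break.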