Normalizes free-form text in selected columns of a given table. For example, the input text "Emergence of Scaling in Random Networks" becomes "emerg|scale|random|network", where "|" is the character chosen to separate the individual items of the list.

The example passes through four normalization steps, applied in this order:

  1. Lowercase: The example text becomes "emergence of scaling in random networks".
  2. Tokenize: The text blob is split into a list of individual words. The example text becomes "emergence|of|scaling|in|random|networks".
  3. Stem: Common or low-content prefixes and suffixes are removed to identify the core concept. The example text becomes "emerg|of|scale|in|random|network".
  4. Stopword: Low-content tokens like "of" and "in" are removed (see the complete stopword list). The example text becomes "emerg|scale|random|network".
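
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm's actual implementation: the names normalize, toy_stem, and STOPWORDS are hypothetical, the stopword set is a tiny stand-in for the complete stopword list, the tokenizer simply splits on anything that is not a letter or digit, and toy_stem is a deliberately minimal suffix stripper that covers the example rather than a full stemming algorithm.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the real algorithm uses a much longer list.
STOPWORDS = {"a", "an", "and", "in", "of", "the"}

VOWELS = set("aeiou")


def toy_stem(token):
    """Toy suffix stripper (an assumption for this sketch, not a real stemmer)."""
    if token.endswith("ence") and len(token) > 6:
        return token[:-4]                      # "emergence" -> "emerg"
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]                      # "scaling" -> "scal"
        c1, v, c2 = stem[-3:]
        # Restore a silent "e" after consonant-vowel-consonant: "scal" -> "scale".
        if c1 not in VOWELS and v in VOWELS and c2 not in VOWELS | set("wxy"):
            stem += "e"
        return stem
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]                      # "networks" -> "network"
    return token


def normalize(text, stem=toy_stem, stopwords=STOPWORDS, sep="|"):
    """Lowercase, tokenize, stem, and remove stopwords, in that order."""
    lowered = text.lower()                            # 1. Lowercase
    tokens = re.findall(r"[a-z0-9]+", lowered)        # 2. Tokenize
    stems = [stem(t) for t in tokens]                 # 3. Stem
    kept = [s for s in stems if s not in stopwords]   # 4. Stopword
    return sep.join(kept)


print(normalize("Emergence of Scaling in Random Networks"))
# prints: emerg|scale|random|network
```

Passing the stemmer and the stopword set as parameters keeps the four-step pipeline separate from the choice of stemming rules, so a full stemmer or a longer stopword list could be substituted without changing normalize itself.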

This algorithm can prepare the text in a table for Burst Detection.

See Also