This burst detection algorithm is an implementation of Jon Kleinberg's "Bursty and Hierarchical Structure in Streams". A burst is a period of increased activity. Bursts are found by minimizing a cost function over a set of possible states (one non-bursting state and several increasingly bursty states) with increasing event frequencies, where moving up a level is costly and moving down a level is free. The algorithm is useful for analyzing text streams (such as emails, corpora, or publications) when you want to know how active the stream is over a period of time.
- Gamma is the value that state transition costs are proportional to. A higher Gamma value results in higher transition costs. Use this parameter to control how easily the automaton can change states.
- Density scaling determines how much 'more bursty' each level is than the previous one. The higher the scaling value, the more active (bursty) the events must be at each level.
- Bursting States determines how many bursting states there will be, beyond the non-bursting state. A value of i bursting states corresponds to i+1 automaton states.
- Date Column is the name of the column containing the date/time at which the events or topics occur.
- Date Format specifies how the date column will be interpreted as a date/time. See http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html for details.
- Batch Burst Length Unit specifies how to divide the date range into burstable units.
- Batch Burst Length specifies the number of burstable units per burstable period. For example, a length of 10 with a unit of years generates bursts by decade.
- Text Column is the name of the column whose values (tokens separated by a delimiter) are used to compute the burst results.
- Text Separator delimits the tokens in the text column. When constructing your tables, do not use a separator that appears as all or part of any token.
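To make the roles of Gamma, density scaling, and the number of bursting states concrete, here is a minimal Python sketch of the cost minimization for one word over a series of batches. It is an illustration only, not the tool's Java code. It assumes the batched form of the model as commonly described for Kleinberg's paper: a binomial fit cost per batch and an upward transition cost of `gamma * ln(n)` per level, where `n` is the number of batches. The function name, parameter names, and the cap on the rate below 1 are all hypothetical choices made for this sketch.

```python
import math


def detect_burst_states(r, d, gamma=1.0, scaling=2.0, bursting_states=1):
    """Return the minimum-cost state per batch (0 = not bursting).

    r[t] = number of records in batch t that contain the word,
    d[t] = total number of records in batch t.
    All names here are hypothetical; this is a sketch, not the tool's code.
    """
    n = len(r)
    k = bursting_states + 1              # i bursting states -> i + 1 automaton states
    if n == 0 or sum(r) == 0:
        return [0] * n                   # the word never occurs, so nothing can burst
    p0 = sum(r) / sum(d)                 # baseline rate of the word
    # Each level is `scaling` times denser than the one below (assumed cap below 1).
    p = [min(p0 * scaling ** i, 0.9999) for i in range(k)]

    def fit_cost(i, t):
        # Negative log-likelihood of r[t] hits in d[t] records at rate p[i].
        log_binom = (math.lgamma(d[t] + 1) - math.lgamma(r[t] + 1)
                     - math.lgamma(d[t] - r[t] + 1))
        return -(log_binom + r[t] * math.log(p[i])
                 + (d[t] - r[t]) * math.log(1 - p[i]))

    def transition_cost(i, j):
        # Going up is costly (proportional to gamma); going down is free.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # Viterbi-style dynamic program; the automaton starts in state 0.
    cost = [0.0] + [math.inf] * (k - 1)
    back = []
    for t in range(n):
        new_cost, pointers = [], []
        for j in range(k):
            best = min(range(k), key=lambda i: cost[i] + transition_cost(i, j))
            new_cost.append(cost[best] + transition_cost(best, j) + fit_cost(j, t))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the cheapest final state.
    state = min(range(k), key=lambda j: cost[j])
    states = [state]
    for t in range(n - 1, 0, -1):
        state = back[t][state]
        states.append(state)
    return states[::-1]
```

Calling `detect_burst_states([1, 1, 1, 9, 9, 1, 1, 1], [10] * 8)` returns one state per batch, and a run of non-zero states marks a burst. In this sketch, raising `gamma` makes the automaton more reluctant to enter a burst, and raising `scaling` requires a sharper jump in frequency before a higher level becomes the cheaper explanation.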
Burst detection is particularly useful for examining the trends in collections of texts or communities of conversation. Even words that are used comparatively little, but that change in frequency of usage over time, stand out, unlike in burst detection algorithms based on thresholds.
Since we focus on scholarly data, the data are distributed into batches (usually yearly batches) before the burst computation starts. The burst detection algorithm was re-implemented in Java based on the original C implementation. Please see Kleinberg [pg. 14]. Missing years are replaced with empty batches so that the batches are continuous by year. There will be no bursts for these empty batches. Users can change the scaling factor for the batches to month, day, or hour, or even to a number of years per batch. The batching implementation ignores date fields that are finer than the user-selected scaling factor. For example, if the day scaling factor is selected, the batching algorithm drops the hour and minute fields from the date value.
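The yearly batching described above can be sketched in Python as follows. This is a hypothetical illustration rather than the Java implementation. The function name, the `(date_string, text)` record shape, and the choice to count each token at most once per record are assumptions. Note also that `date_format` uses Python `strptime` codes here, not Java SimpleDateFormat patterns.

```python
from collections import Counter
from datetime import datetime


def batch_by_year(records, date_format="%Y-%m-%d", separator="|"):
    """Group (date_string, text) records into contiguous yearly batches.

    Returns a dict mapping each year to a Counter of token occurrences.
    Hypothetical helper for illustration only.
    """
    counts = {}
    for date_string, text in records:
        # Only the year is kept; month, day, and time fields are dropped.
        year = datetime.strptime(date_string, date_format).year
        # Assumption: a token counts at most once per record.
        tokens = {tok.strip() for tok in text.split(separator) if tok.strip()}
        counts.setdefault(year, Counter()).update(tokens)
    if not counts:
        return {}
    first, last = min(counts), max(counts)
    # Years without records become empty batches so the series has no gaps.
    return {year: counts.get(year, Counter()) for year in range(first, last + 1)}
```

For records dated 2001 and 2004 only, the result also contains empty batches for 2002 and 2003, which keeps the batches continuous by year.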
Please read the Description section before continuing. This burst algorithm is a text-based burst detection that provides burst results in a hierarchical structure. However, it can also simply detect whether bursts exist if the bursting states parameter is set to 1.
Notes: The default parameters are typically good choices, but more sophisticated models can be fitted by tweaking them in various ways. You might not want to include records with an empty Text Column in the input file if you don't want them counted in the burst result. This can happen when you have a set of records from 1970 - 2011 but there is no text information for 1970 - 1980. Since this is a lack of information rather than an absence of bursts, you might want to remove the records from 1970 - 1980 to focus on a better period.
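As a small illustration of the clean-up suggested in the notes, the following Python sketch drops records whose text field is empty before the data are passed to burst detection. The row format (a list of dictionaries) and the column name `"Text"` are assumptions made for this example. Use whatever column you select as the Text Column.

```python
def drop_empty_text(rows, text_column="Text"):
    """Keep only rows whose text column holds non-blank text.

    `rows` is assumed to be a list of dicts, one per record (hypothetical format).
    """
    return [row for row in rows if (row.get(text_column) or "").strip()]
```

Applied to a table where the 1970 - 1980 records have no text, this leaves only the later records, so the empty period is not treated as a stretch of records without bursts.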
J. Kleinberg. Bursty and Hierarchical Structure in Streams. Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2002.