Text Tokenisation and Filtering

When searching for content based on search terms entered by the user, Confluence splits the text of the content into tokens, and then filters and modifies those tokens according to the following rules.

Tokenisation

Confluence uses Lucene's Standard Tokenizer. This splits the text into tokens as follows:

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by white space is considered part of a token.
Splits words at hyphens, unless the token contains a number. In that case, the whole token is interpreted as a product number and is not split.
Recognises email addresses and internet host names as one token.

For example, the string 'foo-bar5' contains a number, so it is kept as the single token 'foo-bar5' rather than being split into 'foo' and 'bar5'. As a result, a search for 'bar5' or 'bar*' will not find it.
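
The rules above can be approximated with a short Python sketch. This is not Lucene's Standard Tokenizer, only a rough illustration of the three rules as described here. The names tokenize and EMAIL are hypothetical, and the simple regular expression for email addresses is an assumption.

```python
import re

# Assumption: a deliberately simple pattern for email addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def tokenize(text):
    """Rough approximation of the tokenisation rules described above."""
    tokens = []
    for chunk in text.split():
        # Drop punctuation at the edges, e.g. a trailing comma or a
        # sentence-ending dot (a dot that is followed by white space).
        chunk = chunk.strip(".,;:!?\"'()[]{}")
        if not chunk:
            continue
        # Email addresses are kept as one token.
        if EMAIL.fullmatch(chunk):
            tokens.append(chunk)
            continue
        # Split at punctuation, but keep inner dots (host names),
        # hyphens (handled next) and apostrophes.
        for part in re.split(r"[^\w.\-']+", chunk):
            if not part:
                continue
            has_digit = any(c.isdigit() for c in part)
            if "-" in part and not has_digit:
                # No number in the token: split at the hyphens.
                tokens.extend(p for p in part.split("-") if p)
            else:
                # A number is present: treat as a product number.
                tokens.append(part)
    return tokens


print(tokenize("foo-bar5"))  # ['foo-bar5']
print(tokenize("foo-bar"))   # ['foo', 'bar']
print(tokenize("Mail admin@example.com, or visit www.example.com."))
```

In this sketch, 'foo-bar5' survives as one token, which is why a query for 'bar5' has nothing to match against.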

Filtering

Confluence then:

Removes "'s" from the ends of words.
Removes the dots from acronyms, e.g. I.B.M. becomes IBM.
Converts everything to lower case.
Removes common words like 'the' and 'or'.
Converts words to their stems. For example, 'fishing' and 'fishes' both become 'fish'.
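
The filtering steps above can be sketched in Python as a chain applied to each token in the order listed. This is an illustration only: the stop-word set is a small assumed subset, naive_stem is a toy suffix stripper rather than the stemmer Confluence actually uses, and the function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative subset of common words.
STOP_WORDS = {"the", "or", "a", "an", "and", "of"}

# Letters separated by dots, e.g. "I.B.M." or "I.B.M".
ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}[A-Za-z]?")


def naive_stem(word):
    """Toy stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_tokens(tokens):
    """Apply the filtering steps in the order listed above."""
    result = []
    for token in tokens:
        # 1. Remove "'s" from the end of the word.
        if token.endswith("'s"):
            token = token[:-2]
        # 2. Remove the dots from acronyms.
        if ACRONYM.fullmatch(token):
            token = token.replace(".", "")
        # 3. Convert to lower case.
        token = token.lower()
        # 4. Drop common words.
        if token in STOP_WORDS:
            continue
        # 5. Reduce the word to its stem.
        result.append(naive_stem(token))
    return result


print(filter_tokens(["The", "I.B.M.", "Confluence's", "fishing", "fishes", "or"]))
# ['ibm', 'confluence', 'fish', 'fish']
```

Because 'fishing' and 'fishes' both reduce to 'fish', a search for either form can match content containing the other.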
