Text Tokenization and Filtering

Confluence splits the text of content into tokens, and then filters and modifies those tokens according to the following rules.

Tokenization

Confluence uses the Lucene Standard Tokenizer, which splits the text into tokens as follows:

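It splits words at punctuation characters and removes the punctuation. However, a dot that is not followed by whitespace is treated as part of the token.
It splits words at hyphens, unless the token contains a number. In that case the whole token is interpreted as a product number and is not split.
It recognizes email addresses and internet hostnames as a single token.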

Note that because of the hyphen rule, the string 'foo-bar5' is not split into 'foo' and 'bar5', so a search for 'bar5' or 'bar*' will not find it.
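
The tokenization rules above can be approximated with a short sketch. This is not Lucene's implementation: the function name tokenize, the EMAIL pattern, and the set of stripped punctuation characters are assumptions made for illustration, and edge cases are handled more crudely than in the real tokenizer.

```python
import re

# Hypothetical, simplified email pattern (an assumption, not Lucene's grammar).
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def tokenize(text):
    """Rough approximation of the three rules described above."""
    tokens = []
    for chunk in text.split():
        # Punctuation at the edges of a chunk is dropped. A trailing dot is
        # followed by whitespace (or the end of the text), so it goes too.
        chunk = chunk.strip(".,;:!?\"'()[]{}<>-")
        if not chunk:
            continue
        if EMAIL.match(chunk):
            # Email addresses stay in one piece.
            tokens.append(chunk)
        elif "-" in chunk and any(c.isdigit() for c in chunk):
            # A hyphenated chunk containing a digit is kept whole,
            # like a product number.
            tokens.append(chunk)
        else:
            # Split at remaining punctuation, but keep inner dots so that
            # hostnames such as www.example.com survive as one token.
            for piece in re.split(r"[^\w.]+", chunk):
                piece = piece.strip(".")
                if piece:
                    tokens.append(piece)
    return tokens

print(tokenize("foo-bar"))    # ['foo', 'bar']
print(tokenize("foo-bar5"))   # ['foo-bar5']
```

In this sketch 'foo-bar5' stays a single token, which is why a search for 'bar5' has nothing to match against.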

Filtering

Confluence then removes "'s" from the ends of words and removes the dots from acronyms, so that, for example, I.B.M. becomes IBM. Everything is converted to lower case, and common words like 'the' and 'or' are removed. Finally, words are stemmed, so that 'fishing' and 'fishes', for example, both become 'fish'.
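
The filter chain can be sketched in the same order as it is described above. The names filter_tokens, naive_stem and STOP_WORDS are hypothetical, the stop-word list is a small illustrative sample, and the suffix stripper is a toy stand-in rather than the stemmer Confluence actually uses.

```python
import re

# Illustrative sample only; the real stop-word list is not given here.
STOP_WORDS = {"the", "or", "a", "an", "and", "of"}

def naive_stem(word):
    """Toy stemmer: strips a few common suffixes (assumption, not Porter)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def filter_tokens(tokens):
    result = []
    for token in tokens:
        # 1. Remove a trailing possessive 's.
        if token.endswith("'s"):
            token = token[:-2]
        # 2. Remove the dots from acronyms (single letters separated by dots).
        if re.fullmatch(r"(?:[A-Za-z]\.)+[A-Za-z]?", token):
            token = token.replace(".", "")
        # 3. Convert to lower case.
        token = token.lower()
        # 4. Drop common words.
        if token in STOP_WORDS:
            continue
        # 5. Stem what is left.
        result.append(naive_stem(token))
    return result

print(filter_tokens(["The", "I.B.M.", "engineer's", "fishing", "or", "fishes"]))
# ['ibm', 'engineer', 'fish', 'fish']
```

Because the same filters are applied to both indexed text and search terms, 'fishing' and 'fishes' end up as the same token 'fish' in this sketch.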
