History log of /dovecot/src/lib-fts/fts-tokenizer-generic-private.h
Revision Date Author Comments
3a54211bd6c4dc3f8687c16020770551cf83a548 17-Aug-2015 Teemu Huovila <teemu.huovila@dovecot.fi>

lib-fts: Add Unicode TR29 rule WB5a setting to tokenizer. Splits prefixing contracted words from the base word, e.g. "l'homme" -> "l" "homme". Together with a language-specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the TR29 algorithm. Enable with "fts_tokenizer_generic = algorithm=tr29 wb5a=yes".
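
The splitting described above can be illustrated with a small Python sketch. It is not Dovecot's C implementation: the function names, the prefix length limit and the set of letters that may follow the apostrophe (vowels plus "h", chosen so that the example from the commit message splits) are all assumptions made for illustration.

```python
# Illustrative sketch only; names and limits are hypothetical, not Dovecot's.
APOSTROPHE = "'"
MAX_PREFIX_LEN = 3  # assumed cut-off for what counts as a contraction prefix
# Assumed set of letters that may start the base word after the apostrophe.
FOLLOWERS = set("aeiouh" + "àâéèêëîïôùûü")


def split_prefix_contraction(word):
    """Split "l'homme" into ["l", "homme"]; leave other words untouched."""
    prefix, sep, rest = word.partition(APOSTROPHE)
    if sep and rest and 0 < len(prefix) <= MAX_PREFIX_LEN and rest[0] in FOLLOWERS:
        return [prefix, rest]
    return [word]


def tokenize_wb5a(text, stopwords=frozenset()):
    """Whitespace-split the text, split contractions, then drop stopwords."""
    tokens = []
    for word in text.split():
        for token in split_prefix_contraction(word.lower()):
            if token not in stopwords:
                tokens.append(token)
    return tokens
```

With a hypothetical French stopword list containing "l" and "d", `tokenize_wb5a("l'homme d'accord", {"l", "d"})` keeps only the base words, while an English word such as "don't" is left unsplit because "t" is not in the assumed follower set.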

b6b06530d654f0436bfbaefc1e988d53fff0cbee 01-Jun-2015 Timo Sirainen <tss@iki.fi>

lib-fts: tokenizers - Fixed removal of trailing character in truncated tokens. If the token is truncated, we don't want to remove the trailing character since it's not actually there. Also we don't want to remove trailing apostrophes from a truncated word, because they're not actually at the end of the (untruncated) token there. This doesn't make a big difference, but it's slightly more correct.

b15ff9096eab230fa041996d9340b96ac7343c0d 01-Jun-2015 Timo Sirainen <tss@iki.fi>

lib-fts: Optimization for tr29 - we don't need to track last_size explicitly

65a2c8fef977bcf4625fdb5e2f524b42667cb501 01-Jun-2015 Teemu Huovila <teemu.huovila@dovecot.fi>

lib-fts: Change TR29 tokenizer to break at full stop (and others). Diverge from the TR29 rules and always break at MidNumLet letters. This fixes tokenizing first.last@domain.tld email addresses.
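
A rough Python sketch of the difference this change makes. It is a simplification, not the Dovecot code: only the full stop is treated as a MidNumLet character, `str.isalnum()` stands in for the real Unicode word-break properties, and the function name and flag are hypothetical.

```python
# Illustrative sketch only; not the TR29 algorithm or Dovecot's tokenizer.
MID_NUM_LET = {"."}  # assumed subset of the MidNumLet class, for illustration


def tokenize_midnumlet(text, break_at_midnumlet=True):
    """Split text into alphanumeric tokens.

    With break_at_midnumlet=False, a MidNumLet character between two
    alphanumeric characters is kept inside the token, which approximates
    the standard TR29 behaviour. With True, it always ends the token.
    """
    tokens, current = [], []
    for i, ch in enumerate(text):
        joins = (
            not break_at_midnumlet
            and ch in MID_NUM_LET
            and current
            and i + 1 < len(text)
            and text[i + 1].isalnum()
        )
        if ch.isalnum() or joins:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

Under these assumptions, `tokenize_midnumlet("first.last@domain.tld")` yields the four parts of the address separately, whereas passing `break_at_midnumlet=False` keeps "first.last" and "domain.tld" as single tokens.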

0c5854b6891c59c1c3f443569bc823d7db571582 21-May-2015 Teemu Huovila <teemu.huovila@dovecot.fi>

lib-fts: Fix simple tokenizer apostrophe handling. Apostrophes and quotation marks are now treated as word breaks, except U+0027 between non-wordbreak characters. The characters U+2019 and U+FF07 are transformed to U+0027 before processing.
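
The apostrophe rule above can be sketched in Python as follows. This is an approximation for illustration, not the Dovecot implementation: "non-wordbreak character" is assumed to mean anything for which `str.isalnum()` is true, and the function name is hypothetical.

```python
# Illustrative sketch only; the word-character test is an assumption.
def simple_tokenize(text):
    """Tokenize text, keeping U+0027 only when it sits inside a word."""
    # Map the typographic and fullwidth apostrophes to U+0027 first.
    text = text.replace("\u2019", "'").replace("\uff07", "'")
    tokens, current = [], []
    for i, ch in enumerate(text):
        inner_apostrophe = (
            ch == "'"
            and current                      # a word character precedes it
            and i + 1 < len(text)
            and text[i + 1].isalnum()        # and a word character follows
        )
        if ch.isalnum() or inner_apostrophe:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

For example, `simple_tokenize("it\u2019s 'quoted' rock'n'roll")` keeps the apostrophes inside "it's" and "rock'n'roll" but treats the quotes around "quoted" as word breaks.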

2bb1ef0b669901fb91ff961e7fb074439ef769ab 09-May-2015 Timo Sirainen <tss@iki.fi>

lib-fts: Minor code cleanups

34c7e8b10f94e9b76bd5b64b146c0c7e1a65e0f9 09-May-2015 Timo Sirainen <tss@iki.fi>

lib-fts: fts-tokenizer-generic-private.h had content that didn't really belong there.

c865b0e9c65fd77f7b2ab6f8616d3def5501ecb3 20-Apr-2015 Timo Sirainen <tss@iki.fi>

Initial import for lib-fts. Parts of what this code does were already implemented internally by fts-lucene. lib-fts is intended to be usable for all the FTS backends. The APIs are still going to change a bit, but hopefully not after the v2.2.17 release. Mostly written by Teemu Huovila.