3a54211bd6c4dc3f8687c16020770551cf83a548 |
|
17-Aug-2015 |
Teemu Huovila <teemu.huovila@dovecot.fi> |
lib-fts: Add Unicode TR29 rule WB5a setting to tokenizer.
Splits a contracted prefix from its base word.
E.g. "l'homme" -> "l" "homme". Combined with a language-specific stopword list,
unnecessary contractions can thus be filtered out.
This is disabled by default and only works with the TR29 algorithm.
Enable with "fts_tokenizer_generic = algorithm=tr29 wb5a=yes" |
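The split described in the commit above can be sketched roughly as follows. This is an illustration only, not the lib-fts code: the function names, the character set after the apostrophe (vowels plus "h") and the sample stopword list are all assumptions.

```python
# Hypothetical sketch of a WB5a-style split; not Dovecot's implementation.
# Assumption: the break happens when the apostrophe is followed by a vowel or "h".
BREAK_AFTER_APOSTROPHE = set("aeiouhàâéèêëîïôùûü")


def split_contraction(token):
    """Split "l'homme" into ["l", "homme"]; leave other tokens unchanged."""
    head, sep, tail = token.partition("'")
    if sep and head and tail and tail[0].lower() in BREAK_AFTER_APOSTROPHE:
        return [head, tail]
    return [token]


def tokenize_and_filter(tokens, stopwords):
    """Apply the split, then drop anything found in the stopword list."""
    result = []
    for token in tokens:
        for part in split_contraction(token):
            if part.lower() not in stopwords:
                result.append(part)
    return result


if __name__ == "__main__":
    # Hypothetical French stopword list containing the contracted article.
    print(tokenize_and_filter(["l'homme", "don't"], {"l"}))
```

With the contracted article in the stopword list, only the base word survives, which is the filtering effect the commit message describes.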
b6b06530d654f0436bfbaefc1e988d53fff0cbee |
|
01-Jun-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: tokenizers - Fixed removal of trailing character in truncated tokens.
If the token is truncated, we don't want to remove the trailing character
since it's not actually there.
Also we don't want to remove trailing apostrophes from a truncated word,
because they're not actually at the end of the (untruncated) token there.
This doesn't make a big difference, but it's slightly more correct. |
b15ff9096eab230fa041996d9340b96ac7343c0d |
|
01-Jun-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: Optimization for tr29 - we don't need to track last_size explicitly |
65a2c8fef977bcf4625fdb5e2f524b42667cb501 |
|
01-Jun-2015 |
Teemu Huovila <teemu.huovila@dovecot.fi> |
lib-fts: Change TR29 tokenizer to break at full stop (and others).
Diverge from the TR29 rules and always break at MidNumLet characters.
This fixes tokenizing first.last@domain.tld email addresses. |
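As a rough illustration of always breaking at MidNumLet characters, the sketch below splits on a small subset of that class. The function name and the chosen subset (full stop and a few of its variants) are assumptions, and it says nothing about how the real tokenizer treats "@".

```python
import re

# Hypothetical sketch; not the lib-fts TR29 code.
# Assumption: a small subset of MidNumLet characters (full stop variants).
MIDNUMLET_SUBSET = ".\u2024\ufe52\uff0e"
_BREAK_RE = re.compile("[" + re.escape(MIDNUMLET_SUBSET) + "]")


def split_at_midnumlet(text):
    """Break unconditionally at every character in MIDNUMLET_SUBSET."""
    return [part for part in _BREAK_RE.split(text) if part]


if __name__ == "__main__":
    print(split_at_midnumlet("first.last"))
```

Under the unmodified TR29 rules a full stop between two letters would not cause a break, so "first.last" would stay a single token; breaking unconditionally yields the two parts separately.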
0c5854b6891c59c1c3f443569bc823d7db571582 |
|
21-May-2015 |
Teemu Huovila <teemu.huovila@dovecot.fi> |
lib-fts: Fix simple tokenizer apostrophe handling.
Apostrophes and quotation marks are now treated as word breaks,
except U+0027 between non-wordbreak characters. The characters
U+2019 and U+FF07 are transformed to U+0027 before processing. |
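The apostrophe handling described above might look roughly like this. It is a simplified sketch, not the simple tokenizer itself: the function name is hypothetical, and treating every non-alphanumeric character as a word break is an assumption made to keep the example short.

```python
# Hypothetical sketch; not the lib-fts simple tokenizer.
# U+2019 and U+FF07 are mapped to U+0027 before any other processing.
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\uff07": "'"})


def simple_tokenize(text):
    text = text.translate(APOSTROPHE_MAP)
    tokens, current = [], []
    for i, ch in enumerate(text):
        # Keep U+0027 only when it sits between two word characters.
        inner_apostrophe = (
            ch == "'"
            and 0 < i < len(text) - 1
            and text[i - 1].isalnum()
            and text[i + 1].isalnum()
        )
        if ch.isalnum() or inner_apostrophe:
            current.append(ch)
        elif current:
            # Assumption: anything else acts as a word break.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(simple_tokenize("don\u2019t 'quoted' it"))
```

The typographic apostrophe in the first word is normalized and kept because it sits between letters, while the quotation-style apostrophes around the second word act as breaks.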
2bb1ef0b669901fb91ff961e7fb074439ef769ab |
|
09-May-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: Minor code cleanups |
34c7e8b10f94e9b76bd5b64b146c0c7e1a65e0f9 |
|
09-May-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: fts-tokenizer-generic-private.h had content that didn't really belong there. |
c865b0e9c65fd77f7b2ab6f8616d3def5501ecb3 |
|
20-Apr-2015 |
Timo Sirainen <tss@iki.fi> |
Initial import for lib-fts.
Part of what this code does was already implemented internally by
fts-lucene. lib-fts is intended to be usable for all the FTS backends. The
APIs are still going to change a bit, but hopefully not after v2.2.17
release.
Mostly written by Teemu Huovila. |