3ec8b0d282d46d1f698b1f2aa27922cb8f26cb97 |
|
17-Nov-2015 |
Teemu Huovila <teemu.huovila@dovecot.fi> |
lib-fts: Add Norwegian.
Norwegian has two main dialects, Bokmal(nb) and Nynorsk(nn). They
are detected separately by libexttextcat, but the stemmer only
knows Norwegian. Thus they are treated as a single language,
Norwegian (no). This might also make more sense in everyday
use of mixed writing style Norwegian.
Caveat: The default normalizer filter does not modify U+00F8
(Latin Small Letter O with Stroke). In some configurations it
might be desirable to rewrite it to e.g. o. Same goes for the
upper case version. This can be done by passing a modified "id"
setting to the normalizer filter. |
19ed8f08b23d6ed204e6b27e5d1c0c6fe6bb11dd |
|
15-Nov-2015 |
Phil Carmody <phil@dovecot.fi> |
various - remove 8-bit characters from literal strings in test cases
C has a portable way of expressing characters not in the basic character
set, namely \xNN escaping. Otherwise, the interpretation of the raw utf-8
is implentation dependent. This has the benefit of making some tests'
expected output more obvious, such as "=c3=a4" matching "\xC3\xA4", even
if it hinders the readability of some natural-language-based tests.
Signed-off-by: Phil Carmody <phil@dovecot.fi> |
3e786e2a411dc973a2359bc213fcf827e6c314d2 |
|
22-May-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: ICU normalization changes some characters to spaces - remove them.
We don't really want to add spaces to our index. It would be nice if the
words between spaces were actually split to different tokens, but that's
more of the fts-tokenizer's job and at filter stage that's probably not
wanted anymore. |
bf698b98d3a3a1eced66cc682c449f23bf2b67d0 |
|
16-May-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: Rewrite ICU handling functions.
Some of the changes:
- Use buffers instead of allocating everything from data stack.
- Optimistically attempt to write the data directly to the buffers without
first calculating their size. Grow the buffer if it doesn't fit first.
- Use u_strFromUTF8Lenient() instead of u_strFromUTF8(). Our input is
already supposed to be valid UTF-8, although we don't check if all code
points are valid, while u_strFromUTF8() does check them and return failures.
We don't really care about if code points are valid or not and
u_strFromUTF8Lenient() passes through everything.
Added unit tests to make sure all the functions work as intended and all the
UTF-8 input passes through them successfully. |