3e786e2a411dc973a2359bc213fcf827e6c314d2 |
|
22-May-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: ICU normalization changes some characters to spaces - remove them.
We don't really want to add spaces to our index. It would be nice if the
words between spaces were actually split into separate tokens, but that's
more of the fts-tokenizer's job, and at the filter stage that's probably not
wanted anymore. |
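As a rough illustration of the change above, the sketch below strips spaces from an already-normalized string in place. The helper name `remove_spaces` is hypothetical and this is not the actual lib-fts code; it only shows the idea of dropping spaces rather than indexing them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not Dovecot's implementation): remove every ASCII
   space from a NUL-terminated string in place and return the new length. */
static size_t remove_spaces(char *s)
{
    size_t r, w = 0;

    assert(s != NULL);
    for (r = 0; s[r] != '\0'; r++) {
        /* Copy only non-space bytes; the write index never passes the read index. */
        if (s[r] != ' ')
            s[w++] = s[r];
    }
    s[w] = '\0';
    return w;
}
```

Note that the words on either side of a removed space end up joined into one token, which matches the commit's remark that splitting is left to the tokenizer.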
bf698b98d3a3a1eced66cc682c449f23bf2b67d0 |
|
16-May-2015 |
Timo Sirainen <tss@iki.fi> |
lib-fts: Rewrite ICU handling functions.
Some of the changes:
- Use buffers instead of allocating everything from data stack.
- Optimistically attempt to write the data directly to the buffers without
first calculating their size. Grow the buffer if the data doesn't fit on the
first attempt.
- Use u_strFromUTF8Lenient() instead of u_strFromUTF8(). Our input is
already supposed to be valid UTF-8, although we don't check whether all code
points are valid, while u_strFromUTF8() does check them and returns failures.
We don't really care whether code points are valid, and
u_strFromUTF8Lenient() passes everything through.
Added unit tests to make sure all the functions work as intended and all the
UTF-8 input passes through them successfully. |
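The "write optimistically, grow if it doesn't fit" approach from the list above can be sketched roughly as follows. This is a minimal illustration, not the lib-fts code: `struct grow_buf`, `convert()` and `convert_to_buf()` are hypothetical names, and `snprintf()` stands in for an ICU conversion call. The assumption is that the real call, like `snprintf()`, reports the length it needed when the destination is too small.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; Dovecot has its own buffer API, not shown here. */
struct grow_buf {
    char *data;
    size_t size; /* allocated bytes */
};

/* Stand-in for a conversion call: writes at most dst_size bytes and returns
   the full length needed (excluding the NUL), as snprintf() does. */
static size_t convert(char *dst, size_t dst_size, const char *src)
{
    int n = snprintf(dst, dst_size, "[%s]", src);

    assert(n >= 0);
    return (size_t)n;
}

/* Try writing into the existing buffer first; only if the result did not
   fit, grow to the reported size and convert again.
   Returns 0 on success, -1 on allocation failure. */
static int convert_to_buf(struct grow_buf *buf, const char *src)
{
    size_t need = convert(buf->data, buf->size, src);

    if (need + 1 > buf->size) {
        char *p = realloc(buf->data, need + 1);

        if (p == NULL)
            return -1;
        buf->data = p;
        buf->size = need + 1;
        (void)convert(buf->data, buf->size, src);
    }
    return 0;
}
```

When the buffer is already large enough, which is the common case once it has grown a few times, the conversion runs only once and no separate size-calculation pass is needed.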