Sunday, June 01, 2025

Re: case-insensitive grep with accented letters

On 6/1/25 11:10 AM, Ingo Schwarze wrote:
> Hello,
>
> Stuart Henderson wrote on Sat, May 31, 2025 at 10:45:17AM -0000:
>> On 2025-05-31, rsykora@disroot.org <rsykora@disroot.org> wrote:
>>> I was surprised to learn that 'grep -i' does not
>>> really work for accented letters
>> OpenBSD base doesn't support LC_COLLATE.
> Everything that sthen@ said is correct.
>
> Let me add that supporting LC_COLLATE is not even a long-term goal.
>
> LC_COLLATE is among the most complicated aspects of locales.
> The collation order depends on the language, and for some
> languages, there is even more than one collation order that
> is commonly used. We certainly do not want to poison our libc
> with that amount of complexity.
>
> That said, implementing 'grep -i' for non-ASCII characters does not
> strictly require LC_COLLATE support (unlike, for example, sort(1),
> which might). What *is* needed is working towlower(3) support
> in libc, which is controlled by LC_CTYPE, and which we do have (and
> it is reasonably up to date because our libc Unicode support follows
> Perl, currently at Unicode Version 15.0.0, released in September
> 2022).
>
> For example, towlower(U+017D) works for me and returns U+017E.
>
> Your desire requires wide-character support in both regexec(3)
> and grep(1) such that (1) U+017D can be recognized as a character
> rather than being treated as two bytes and (2) towlower can
> transform it to U+017E and (3) the result can then be compared
> to the command line argument in a wchar_t to wchar_t comparison.
> These are multiple tasks of significant difficulty and size.
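
The three steps above can be illustrated in isolation, outside of
regexec(3) and grep(1). What follows is only a minimal sketch under
stated assumptions, not a proposal for either program: the names
use_utf8_ctype() and mb_caseeq() are made up for this example, the
locale names tried are guesses that vary between systems, and the
fixed buffer size is arbitrary.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

#define MAXW 256	/* arbitrary limit for this sketch */

/*
 * Hypothetical helper: select a UTF-8 LC_CTYPE.  Which locale names
 * exist is system dependent.  Returns 1 on success, 0 otherwise.
 */
static int
use_utf8_ctype(void)
{
	static const char *const names[] = { "", "C.UTF-8", "en_US.UTF-8" };
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (setlocale(LC_CTYPE, names[i]) != NULL &&
		    strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: case-insensitive equality of two multibyte
 * strings.  Returns 1 if equal, 0 if different, -1 on an invalid
 * byte sequence or if an input does not fit into MAXW characters.
 */
static int
mb_caseeq(const char *a, const char *b)
{
	wchar_t wa[MAXW], wb[MAXW];
	size_t na, nb, i;

	/* (1) recognize characters instead of bytes */
	na = mbstowcs(wa, a, MAXW);
	nb = mbstowcs(wb, b, MAXW);
	if (na == (size_t)-1 || nb == (size_t)-1 || na == MAXW || nb == MAXW)
		return -1;
	if (na != nb)
		return 0;
	/* (2) fold with towlower(3), (3) compare wchar_t to wchar_t */
	for (i = 0; i < na; i++)
		if (towlower((wint_t)wa[i]) != towlower((wint_t)wb[i]))
			return 0;
	return 1;
}
```

After a successful use_utf8_ctype(), calling mb_caseeq() on the UTF-8
encodings of U+017D and U+017E should report them as equal, provided
the libc case tables cover that character. Doing the same inside a
regular expression matcher is the hard part, and this sketch does not
attempt it.
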
>
> Maybe, as a partial solution, it would even be possible to improve
> *only* grep(1) while leaving the (even more scary) regexec(3)
> alone, i.e. have grep(1), when called with -i, convert both
> the command line arguments and every input line to lower case
> with towlower(3), then pass both to the narrow-character regexec(3),
> which should work for your use case. It would not work for other
> use cases though; for example, /./ still wouldn't match an accented
> character.
>
> Yours,
> Ingo
>
(Obviously) mapping Unicode to ASCII is complex and a pain.
The NIH National Library of Medicine has Java tools whose distribution
conditions -might- be acceptable:
https://lhncbc.nlm.nih.gov/LSG/Projects/lvg/current/web/termsAndConditions.html
The documentation describes the various mappings one might use.

geoff steckel
