Sunday, June 01, 2025

Re: case-insensitive grep with accented letters

On Sat, May 31, 2025 at 10:45:17AM -0000, Stuart Henderson wrote:
> ggrep does in this instance, but I don't know how reliable that is.

I had already forgotten about a problem I encountered with GNU grep
under Linux a long time ago, while writing a shell script to process
mbox files. Some of the messages in my mbox files were encoded in
ISO Latin-1 (Spanish), but my locale was UTF-8. As a result, a grep
command in a pipe at the end of my script printed the message "binary
file matches" and dropped from the output every line containing invalid
UTF-8 sequences, treating them as garbage from a binary file. This is
what still happens under Linux (\xed is Latin-1 iacute):

$ printf '\xedHello\n' > test
$ grep Hello test
grep: test: binary file matches
$ LANG=C grep Hello test
�Hello

I mention this as a practical example of the trade-offs of using
wide-character functions.
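
For anyone hitting the same thing, here is a rough sketch of three
possible workarounds, not a definitive recipe. It assumes GNU grep, that
an iconv utility is installed, and that the input really is ISO-8859-1.
The file name "test" is the one from the example above; \355 is the
octal spelling of \xed, used here only because not every printf accepts
\x escapes.

```shell
# Recreate the sample: one Latin-1 line starting with byte 0xED (iacute).
printf '\355Hello\n' > test

# 1. Byte-oriented matching for this one command. LC_ALL takes
#    precedence over LANG and LC_CTYPE, so it works even when one of
#    those is set to a UTF-8 locale.
LC_ALL=C grep Hello test

# 2. Keep the UTF-8 locale but tell grep to treat the file as text.
#    The matching line is printed with its raw, invalid byte intact.
grep -a Hello test

# 3. Convert to UTF-8 first (assumes the input is ISO-8859-1), so grep
#    only ever sees valid sequences and accented letters stay letters.
iconv -f ISO-8859-1 -t UTF-8 test | grep Hello
```

Only the third variant keeps case-insensitive matching of accented
letters meaningful, since in the C locale the byte 0xED is not a letter
with an uppercase counterpart.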



--
Walter
