Wednesday, November 02, 2022

Re: pcre2: newline any => anycrlf

YASUOKA Masahiko writes:

> Hello,
>
> Currently pcre2 is configured with "--enable-newline-is-any". With
> that option, the library treats 0x85 as a newline char. But in UTF-8,
> 0x85 appears as a byte in at least some common kanji. So pcre2
> cannot properly handle text that includes such chars.
>
> Since --enable-newline-is-any conflicts with using UTF-8, I think we
> should change it to --enable-newline-is-anycrlf to avoid the
> conflict.
>
> https://github.com/PCRE2Project/pcre2/blob/pcre2-10.37/src/pcre2_internal.h#L663
> 657 /* In ASCII/Unicode, linefeed is '\n' and we equate this to NL for
> 658 compatibility. NEL is the Unicode newline character; make sure it is
> 659 a positive value. */
> 660
> 661 #define CHAR_LF '\n'
> 662 #define CHAR_NL CHAR_LF
> -> 663 #define CHAR_NEL ((unsigned char)'\x85')
> 664 #define CHAR_ESC '\033'
> 665 #define CHAR_DEL '\177'
> 666 #define CHAR_NBSP ((unsigned char)'\xa0')
>
> \u8005 is "\xe8\x80\x85" in UTF-8, which includes "\x85".
> https://glyphwiki.org/wiki/u8005
>
> test code in php:
>
> <?php
> $test = "\u{8005} hogehoge";
> if (preg_match("/^(.+)$/m", $test, $match)) {
>     print("result: " . str_ends_with($match[1], "hoge") .
>         " (should be 1)\n");
> }
> ?>
>
> ok?

Good counterexample. Seems like a better default than I initially
thought.

ok namn@

>
>
> Specify --enable-newline-is-anycrlf instead of --enable-newline-is-any,
> which doesn't work properly with UTF-8 text. The latter option treats
> 0x85, which is used for some kanji in UTF-8, as a newline char.
>
> Index: devel/pcre2/Makefile
> ===================================================================
> RCS file: /cvs/ports/devel/pcre2/Makefile,v
> retrieving revision 1.16
> diff -u -p -r1.16 Makefile
> --- devel/pcre2/Makefile 11 Mar 2022 18:52:29 -0000 1.16
> +++ devel/pcre2/Makefile 2 Nov 2022 14:02:31 -0000
> @@ -9,6 +9,8 @@ SHARED_LIBS += pcre2-posix
>
> CATEGORIES = devel
>
> +REVISION = 0
> +
> MASTER_SITES = https://ftp.pcre.org/pub/pcre/ \
> ${MASTER_SITE_SOURCEFORGE:=pcre/} \
> http://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ \
> @@ -27,7 +29,7 @@ LIB_DEPENDS = archivers/bzip2
> CONFIGURE_STYLE = gnu
> CONFIGURE_ARGS = --enable-pcre2-16 \
> --enable-pcre2-32 \
> - --enable-newline-is-any \
> + --enable-newline-is-anycrlf \
> --enable-pcre2grep-libz \
> --enable-pcre2grep-libbz2 \
> --enable-pcre2test-libreadline
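
For readers who want to check the byte-level conflict discussed in this thread without building pcre2, here is a minimal Python sketch. It is only a simplified byte-wise model, not PCRE2's actual matching code. The names ANYCRLF_BYTES, ANY_BYTES and newline_offsets are made up for illustration, and the byte sets are an assumption based on the CHAR_NEL definition quoted above and on the usual description of the "any" and "anycrlf" newline conventions.

```python
# Simplified model: which single bytes count as a line break when a
# pattern is matched byte by byte (no UTF mode).
# Assumption: "anycrlf" recognizes only CR and LF; "any" additionally
# recognizes VT, FF and NEL (0x85, CHAR_NEL in pcre2_internal.h).
ANYCRLF_BYTES = frozenset({0x0A, 0x0D})
ANY_BYTES = ANYCRLF_BYTES | frozenset({0x0B, 0x0C, 0x85})


def newline_offsets(data: bytes, newline_bytes: frozenset) -> list:
    """Return the offsets of bytes that the given convention treats as newlines."""
    return [i for i, b in enumerate(data) if b in newline_bytes]


# Same subject string as the PHP test case in the thread.
text = "\u8005 hogehoge".encode("utf-8")

if __name__ == "__main__":
    print(text.hex(" "))                         # e8 80 85 20 68 6f 67 65 68 6f 67 65
    print(newline_offsets(text, ANY_BYTES))      # [2]
    print(newline_offsets(text, ANYCRLF_BYTES))  # []
```

Under this model, the third byte of the encoded kanji is taken for a line break with "any", which is consistent with /^(.+)$/m in the PHP test stopping short of "hoge". With "anycrlf" the subject contains no line break at all.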
