Tuesday, October 05, 2021

Re: Spleen with russian (maybe more) cyrillic symbols

On 05/10/2021, ropers <ropers@gmail.com> wrote:
> This does relate to a question I've been thinking about for a while,
> so even if actually offering diffs for that is still way above my pay
> grade, I will offer these thoughts:
>
> * Of ASCII's 128 characters, only 95 are actually printable (ASCII
> sticks 2 thru 7 minus 0x7F DEL).[0]
> * In principle, the console is capable of supporting 256 glyphs.
> * With traditional Extended ASCII (EASCII) character sets, more than
> 95 characters were (still are) printable, but code assuming the use of
> ISO 8859-1 is deprecated and no longer portable in this age of UTF-8,
> and for EASCII sticks 8 thru F, there no longer is a direct
> correspondence between code points and code units at all.
> * Even if framebuffer console drivers could hypothetically be altered
> to allow the use of more than 256 glyphs, I completely agree with Ingo
> that that would be a fairly terrible idea for various reasons. While
> the 256-glyphs limitation does stem from VGA console drivers
> permitting no more than 256 text mode glyphs (or 512 with hacks), it
> would be best to not totally break framebuffer and vga console
> compatibility, but to stay within those limits.
> * With "extremely minimalistic UTF-8 support", up to 161 "spots" might
> be available.
> * There are 1,112,064 legal Unicode character code points (0x11 *
> 0x10000 - 0x800, i.e. seventeen 65,536-character planes minus the
> 2,048 code points from U+D800 thru U+DFFF that are reserved for UTF-16
> surrogates). Of those, 137,468 are private use, and 66 are
> non-characters. If we also subtract the 95 printable ASCII
> characters, that leaves 974,435 characters that might compete for
> those 161 spots.
> * There is an extremely strong argument for accommodating all
> characters from ISO 8859-1 in any future minimalistic UTF-8 console
> support. The non-breaking space and soft hyphen could use the same
> glyphs as space and hyphen-minus, respectively. This means that to
> maintain maximum backwards compatibility and UTF-8
> forward-portability, 94 of those 161 spots would have to be taken,
> leaving 67.
> * There might also be a strong argument for accommodating all the
> characters from ISO 8859-15 (so an additional 8) and Windows-1252,[1]
> which despite no Unix pedigree is a common superset of ISO 8859-1,
> with EASCII sticks A thru F being identical to ISO 8859-1. ISO
> 8859-15 differs from ISO 8859-1 in that it includes 8 characters in
> sticks A/B that Windows-1252 encodes in sticks 8/9. However, with
> UTF-8, code units and code points no longer match outside of sticks
> 0-7, so UTF-8 implementers of ISO 8859-1 and Windows-1252 backwards
> compatibility get ISO 8859-15 support for free. Besides those 8,
> Windows-1252 support would consume an additional 19 characters, so
> we'd have to subtract 27 from those 27 remaining spots, leaving 40.

s/27 remaining/67 remaining

> * 32 of those spots are from the C0 control codes from ASCII sticks
> 0/1. While Bemer et al. did originally propose alternatively
> printable glyphs for those normally unprintable characters, their
> glyphs were never commonly used. If "maximum printability" is a
> criterion, Unicode does define so-called "Control Pictures" for them
> (U+2400 thru U+241F).[2] It conceivably could be useful to have e.g.
> a console-based hex editor render something printable for most code
> units. However, attempts at Control Pictures inclusion would bump
> against the technical limitation that the Control Pictures glyphs are
> already barely legible in X11/xterm. Could genuinely useful Control
> Pictures glyphs even be defined if one has just 8x16, 8x14, 8x10 or
> 8x8 pixels to play with, as may be the case on the console? It seems
> doubtful. Perhaps those sparse spots and precious pixels are better
> spent on something else, like Cyrillic for example.
> * The once-common DOS code page 437 has 31 alternatively printable
> glyphs for sticks 0/1. Of these, only the bullet point, section sign
> and paragraph mark (pilcrow) can be found in the ISO 8859/Windows-1252
> family. There is no compelling reason --like what's mentioned in
> footnote [1]-- that could motivate the inclusion of its stick 0/1
> glyphs, DOS having largely gone the way of the dodo. Also, full CP437
> support would require many more glyphs, support for just this subset

s/support for/so support for

> of that old code page, never common in Unix-land, would seem wasteful.
> That still leaves 40 spots that could potentially be used.
> * The question is, which of the 974,435 candidates deserve one of
> those 40 spots. With a look at a relevant map[3], Arabic, Cyrillic,
> and Indic abugidas might have particularly strong claims. Arabic has
> 28 letters, but many contextual variants (though no case). Cyrillic,
> or more specifically the Russian alphabet, has 33 letters and it does
> have case, so 40 spots might limit any support to UPPER CASE ONLY, or
> should I say ЦРРЕЯ СА5Е ОИГУ. I do not feel I know enough about Indic
> abugidas to say something intelligent.
> * The question of what subset of Unicode to settle on for minimalistic
> 256-glyphs-only UTF-8 support might be bigger than OpenBSD. Other
> Unix-like OSes might ask themselves the same question. Is this
> something that ought to be standardised across Unix-land or something
> OpenBSD would want to decide on its own?
> * I mentioned "512 with hacks" above, but I do not know enough to say
> whether it would be viable, clean and VGA-compatible to blow past that
> 256 boundary. If yes, then an additional 256 spots might comfortably
> allow for the inclusion of many more of the above.
> * Either way, even if no code is created at this time, just having a
> roadmap and knowing which glyphs ought to make the cut might be
> desirable. It would also be possible to already make the font(s) once
> that is known. Code that actually uses such a font to implement
> minimalistic UTF-8 support (for the console) need not arrive at the
> same time.
> * On the other hand, if extending our minimum character set to cover
> Windows-1252 and ISO 8859-15 and especially deciding upon the use of
> the last 40 spots cannot be settled yet, then it might be fine to
> leave that for later. The existing ISO 8859-1 fonts could actually be
> useable by a minimalistic UTF-8 support implementation, if developed.
> Again, once such an implementation has code points properly divorced
> from code units, it could absolutely source its glyphs from those
> fonts.
> That would only leave the small issue of UTF-8 compliance by
> everything else in base and ports...
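
The counts in the list above can be re-derived mechanically. What follows is only a rough sketch in Python, not OpenBSD code: it assumes the code page tables bundled with Python's standard codecs (latin_1, iso8859_15, cp1252) are a fair stand-in for the real ones, and every variable and function name is made up for illustration.

```python
# Re-derive the glyph budget from the list above.
# Assumption: Python's bundled codecs stand in for the real code pages.
# All names below are illustrative only.

def decodable(encoding, first, last):
    """Return the set of characters that bytes first..last decode to."""
    chars = set()
    for byte in range(first, last + 1):
        try:
            chars.add(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this code page
    return chars

# Unicode-wide counts.
code_points = 0x11 * 0x10000 - 0x800      # 17 planes minus UTF-16 surrogates
private_use = 0x1900 + 2 * 0xFFFE         # U+E000..U+F8FF plus planes 15 and 16
noncharacters = 32 + 2 * 0x11             # U+FDD0..U+FDEF plus 2 per plane
printable_ascii = len(range(0x20, 0x7F))  # sticks 2 thru 7 minus DEL
candidates = code_points - private_use - noncharacters - printable_ascii

# Code page counts.
latin1_high = decodable("latin_1", 0xA0, 0xFF)
latin9_high = decodable("iso8859_15", 0xA0, 0xFF)
cp1252_c1 = decodable("cp1252", 0x80, 0x9F)

latin9_extra = latin9_high - latin1_high  # in ISO 8859-15, not in ISO 8859-1
cp1252_extra = cp1252_c1 - latin9_high    # Windows-1252 only

# Slot budget for a 256-glyph console font.
slots = 256 - printable_ascii
latin1_glyphs = len(latin1_high) - 2      # NBSP and soft hyphen reuse glyphs
after_latin1 = slots - latin1_glyphs
remaining = after_latin1 - len(latin9_extra) - len(cp1252_extra)

print(candidates, slots, after_latin1, remaining)
```

The bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Python's cp1252 table, which is why only 27 of the 32 positions in sticks 8/9 count.
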
>
> I hope that was useful and worth the verbiage.
>
> Thanks for your time,
> Ian
>
> (Ian Ropers)
>
> Footnotes:
> [0] Yes, they're properly called sticks. 8 sticks of 16 characters in
> ASCII; 16 sticks of 16 characters in EASCII. See Bob Bemer's Inside
> ASCII.
> [1] Per enwp.org/CP1252, Windows-1252 text mislabelled as ISO-8859-1
> is still very common (online), and "[m]ost modern web browsers and
> e-mail clients treat the media type charset ISO-8859-1 as Windows-1252
> to accommodate such mislabeling. This is now standard behavior in the
> HTML5 specification, which requires that documents advertised as
> ISO-8859-1 actually be parsed with the Windows-1252 encoding."
> [2] https://en.wikipedia.org/wiki/Control_Pictures
> [3] https://en.wikipedia.org/wiki/File:Writing_systems_worldwide.png
>
>
> On 05/10/2021, Ingo Schwarze <schwarze@usta.de> wrote:
>> Hi Slava,
>>
>> Slava Voronzoff wrote on Tue, Oct 05, 2021 at 03:01:26PM +0300:
>>
>>> I'm working right now on adding Cyrillic to the Spleen font. How can I later
>>> add it to the OpenBSD kernel and ports? Pull request to main font on github
>>> (Hi, Frederic) or patch here?
>>
>> You cannot add it to the kernel because the kernel does not support
>> UTF-8, but only US-ASCII, and US-ASCII contains no code points for
>> Cyrillic letters.
>>
>> Full UTF-8 support is definitely not wanted in the kernel. Adding
>> extremely minimalistic UTF-8 support to the kernel is not completely
>> out of the question, but some developers are likely to feel sceptical even
>> about that. Consequently, trying to pursue a project of adding anything
>> related to UTF-8 to the kernel is likely to end in frustration if the
>> person trying that does not have a significant amount of experience with
>> getting OpenBSD kernel patches committed.
>>
>> I'm sorry that i know absolutely nothing about fonts in ports, maybe
>> someone else can answer that part of the question.
>>
>> Yours,
>> Ingo
>>
>>
>
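
To illustrate what "code points divorced from code units" could look like with a 256-glyph font, here is a hedged sketch in Python. It is not a proposal for the kernel and says nothing about how wscons would actually do it. The slot layout, the function names and the choice of "?" as the fallback glyph are all assumptions made up for this example.

```python
# Sketch: map UTF-8 input to at most 256 glyph slots.
# Assumptions (all hypothetical): the slot layout, the "?" fallback,
# and the function names. Decoding relies on Python's own UTF-8 codec.

REPLACEMENT_SLOT = ord("?")

def build_glyph_table(extra_code_points):
    """Return a dict from code point to glyph slot (0..255).

    Printable ASCII keeps its byte value as slot number. Extra code
    points fill the high half first, then the C0 slots, then DEL.
    Extras beyond the 161 free slots are silently dropped.
    """
    table = {cp: cp for cp in range(0x20, 0x7F)}
    free = list(range(0x80, 0x100)) + list(range(0x00, 0x20)) + [0x7F]
    for slot, cp in zip(free, extra_code_points):
        table[cp] = slot
    return table

def render_slots(utf8_bytes, table):
    """Decode UTF-8 code units to code points, then look up glyph slots."""
    text = utf8_bytes.decode("utf-8", errors="replace")
    return [table.get(ord(ch), REPLACEMENT_SLOT) for ch in text]

# Example: ISO 8859-1 repertoire, with NBSP and soft hyphen as aliases.
latin1_extras = [cp for cp in range(0xA1, 0x100) if cp != 0xAD]
table = build_glyph_table(latin1_extras)
table[0xA0] = ord(" ")   # non-breaking space shares the space glyph
table[0xAD] = ord("-")   # soft hyphen shares the hyphen-minus glyph

e_acute = "\u00e9".encode("utf-8")   # two code units, one code point
print(len(e_acute), render_slots(e_acute, table))
```

The point of the sketch is only that the two-byte sequence for U+00E9 ends up as a single glyph slot, and that anything without a slot (Cyrillic, in this particular table) falls back to the replacement glyph rather than to garbage.
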
