Tuesday, October 05, 2021

Re: Spleen with russian (maybe more) cyrillic symbols

This does relate to a question I've been thinking about for a while,
so even if actually offering diffs for that is still way above my pay
grade, I will offer these thoughts:

* Of ASCII's 128 characters, only 95 are actually printable (ASCII
sticks 2 thru 7 minus 0x7F DEL).[0]
* In principle, the console is capable of supporting 256 glyphs.
* With traditional Extended ASCII (EASCII) character sets, more than
95 characters were (still are) printable, but code assuming the use of
ISO 8859-1 is deprecated and no longer portable in this age of UTF-8,
and for EASCII sticks 8 thru F, there no longer is a direct
correspondence between code points and code units at all.
* Even if framebuffer console drivers could hypothetically be altered
to allow the use of more than 256 glyphs, I completely agree with Ingo
that that would be a fairly terrible idea for various reasons. While
the 256-glyphs limitation does stem from VGA console drivers
permitting no more than 256 text mode glyphs (or 512 with hacks), it
would be best to not totally break framebuffer and VGA console
compatibility, but to stay within those limits.
* With "extremely minimalistic UTF-8 support", up to 161 "spots" (256
minus the 95 printable ASCII characters) might be available.
* There are 1,112,064 legal Unicode character code points (0x11 *
0x10000 - 0x800, i.e. seventeen 65,536-character planes minus the
2,048 code points from U+D800 thru U+DFFF that are reserved for UTF-16
surrogates). Of those, 137,468 are private use, and 66 are
non-characters. If we also subtract the 95 printable ASCII
characters, that leaves 974,435 characters that might compete for
those 161 spots.
* There is an extremely strong argument for accommodating all
characters from ISO 8859-1 in any future minimalistic UTF-8 console
support. The non-breaking space and soft hyphen could use the same
glyphs as space and hyphen-minus, respectively. This means that to
maintain maximum backwards compatibility and UTF-8
forward-portability, 94 of those 161 spots would have to be taken,
leaving 67.
* There might also be a strong argument for accommodating all the
characters from ISO 8859-15 (so an additional 8) and Windows-1252,[1]
which despite no Unix pedigree is a common superset of ISO 8859-1,
with EASCII sticks A thru F being identical to ISO 8859-1. ISO
8859-15 differs from ISO 8859-1 in that it includes 8 characters in
sticks A/B that Windows-1252 encodes in sticks 8/9. However, with
UTF-8, code units and code points no longer match outside of sticks
0-7, so UTF-8 implementers of ISO 8859-1 and Windows-1252 backwards
compatibility get ISO 8859-15 support for free. Besides those 8,
Windows-1252 support would consume an additional 19 characters, so
we'd have to subtract 27 from those 67 remaining spots, leaving 40.
* 32 of those spots are from the C0 control codes from ASCII sticks
0/1. While Bemer et al. did originally propose alternatively
printable glyphs for those normally unprintable characters, their
glyphs were never commonly used. If "maximum printability" is a
criterion, Unicode does define so-called "Control Pictures" for them
(U+2400 thru U+241F).[2] It conceivably could be useful to have e.g.
a console-based hex editor render something printable for most code
units. However, attempts at Control Pictures inclusion would bump
against the technical limitation that the Control Pictures glyphs are
already barely legible in X11/xterm. Could genuinely useful Control
Pictures glyphs even be defined if one has just 8x16, 8x14, 8x10 or
8x8 pixels to play with, as may be the case on the console? It seems
doubtful. Perhaps those scarce spots and precious pixels are better
spent on something else, like Cyrillic for example.
* The once-common DOS code page 437 has 31 alternatively printable
glyphs for sticks 0/1. Of these, only the bullet point, section sign
and paragraph mark (pilcrow) can be found in the ISO 8859/Windows-1252
family. There is no compelling reason --like what's mentioned in
footnote [1]-- that could motivate the inclusion of its stick 0/1
glyphs, DOS having largely gone the way of the dodo. Also, since full
CP437 support would require many more glyphs, support for just this
subset of that old code page, never common in Unix-land, would seem
wasteful.
That still leaves 40 spots that could potentially be used.
* The question is, which of the 974,435 candidates deserve one of
those 40 spots. With a look at a relevant map[3], Arabic, Cyrillic,
and Indic abugidas might have particularly strong claims. Arabic has
28 letters, but many contextual variants (though no case). Cyrillic,
or more specifically the Russian alphabet, has 33 letters and it does
have case, so 40 spots might limit any support to UPPER CASE ONLY, or
should I say ЦРРЕЯ СА5Е ОИГУ. I do not feel I know enough about Indic
abugidas to say something intelligent.
* The question of what subset of Unicode to settle on for minimalistic
256-glyphs-only UTF-8 support might be bigger than OpenBSD. Other
Unix-like OSes might ask themselves the same question. Is this
something that ought to be standardised across Unix-land or something
OpenBSD would want to decide on its own?
* I mentioned "512 with hacks" above, but I do not know enough to say
whether it would be viable, clean and VGA-compatible to blow past that
256-glyph boundary. If yes, then an additional 256 spots might
comfortably allow for the inclusion of many more of the above.
* Either way, even if no code is created at this time, just having a
roadmap and knowing which glyphs ought to make the cut might be
desirable. It would also be possible to already make the font(s) once
that is known. Code that actually uses such a font to implement
minimalistic UTF-8 support (for the console) need not arrive at the
same time.
* On the other hand, if extending our minimum character set to cover
Windows-1252 and ISO 8859-15 and especially deciding upon the use of
the last 40 spots cannot be settled yet, then it might be fine to
leave that for later. The existing ISO 8859-1 fonts could actually be
useable by a minimalistic UTF-8 support implementation, if developed.
Again, once such an implementation has code points properly divorced
from code units, it could absolutely source its glyphs from those
fonts.
That would only leave the small issue of UTF-8 compliance by
everything else in base and ports...
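
As a sanity check on the numbers in the list above, the glyph budget
can be recomputed in a few lines of Python. This is only a sketch of
the arithmetic; the variable names are mine and hypothetical, and
nothing here reflects actual console or kernel code.

```python
# Glyph budget for a 256-glyph console font, following the list above.
# All names are illustrative; this only re-does the arithmetic.

CONSOLE_GLYPHS = 256
PRINTABLE_ASCII = 0x7F - 0x20        # 0x20..0x7E: sticks 2 thru 7 minus DEL

# Legal code points: seventeen planes minus the UTF-16 surrogates.
legal = 0x11 * 0x10000 - 0x800
# Private use: U+E000..U+F8FF, plus planes 15 and 16, each of which
# ends in two non-characters that are not private use.
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0x10000 - 2)
noncharacters = 66
candidates = legal - private_use - noncharacters - PRINTABLE_ASCII

spots = CONSOLE_GLYPHS - PRINTABLE_ASCII   # spots left after ASCII
latin1 = 96 - 2                            # sticks A thru F; NBSP and SHY reuse glyphs
after_latin1 = spots - latin1
cp1252_extra = 8 + 19                      # the 8 shared with ISO 8859-15, plus 19 more
after_cp1252 = after_latin1 - cp1252_extra

print(legal, private_use, candidates)
print(spots, after_latin1, after_cp1252)
```

The second line of output should read 161, 67 and 40, matching the
figures used above.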

I hope that was useful and worth the verbiage.

Thanks for your time,
Ian

(Ian Ropers)

Footnotes:
[0] Yes, they're properly called sticks. 8 sticks of 16 characters in
ASCII; 16 sticks of 16 characters in EASCII. See Bob Bemer's Inside
ASCII.
[1] Per enwp.org/CP1252, Windows-1252 text mislabelled as ISO-8859-1
is still very common (online), and "[m]ost modern web browsers and
e-mail clients treat the media type charset ISO-8859-1 as Windows-1252
to accommodate such mislabeling. This is now standard behavior in the
HTML5 specification, which requires that documents advertised as
ISO-8859-1 actually be parsed with the Windows-1252 encoding."
[2] https://en.wikipedia.org/wiki/Control_Pictures
[3] https://en.wikipedia.org/wiki/File:Writing_systems_worldwide.png
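
The claims about ISO 8859-15 and Windows-1252 in the list above can be
checked against the codecs bundled with Python. A minimal sketch,
assuming Python's "iso8859_1", "iso8859_15" and "cp1252" codecs follow
the respective standards; the variable names are hypothetical.

```python
# Sketch only: relies on Python's bundled codecs, not on any console code.
high = bytes(range(0xA0, 0x100))             # EASCII sticks A thru F
latin1 = high.decode("iso8859_1")
latin9 = high.decode("iso8859_15")
# Characters ISO 8859-15 puts where ISO 8859-1 has something else.
latin9_only = {new for old, new in zip(latin1, latin9) if old != new}

# Characters Windows-1252 assigns in sticks 8/9; some bytes are unassigned.
cp1252_sticks_89 = set()
for byte in range(0x80, 0xA0):
    try:
        cp1252_sticks_89.add(bytes([byte]).decode("cp1252"))
    except UnicodeDecodeError:
        pass                                 # unassigned in Windows-1252

print(len(latin9_only), sorted(f"U+{ord(c):04X}" for c in latin9_only))
print(len(cp1252_sticks_89), len(cp1252_sticks_89 - latin9_only))

# Code point versus code units: one byte in ISO 8859-1, two in UTF-8.
e_acute = "\u00e9"
print(hex(ord(e_acute)), e_acute.encode("iso8859_1"), e_acute.encode("utf-8"))
```

This should report 8 characters specific to ISO 8859-15, all of which
Windows-1252 also encodes, and 27 characters in Windows-1252 sticks
8/9, 19 of which are not in ISO 8859-15.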


On 05/10/2021, Ingo Schwarze <schwarze@usta.de> wrote:
> Hi Slava,
>
> Slava Voronzoff wrote on Tue, Oct 05, 2021 at 03:01:26PM +0300:
>
>> I'm working right now on adding cyrillic to Spleen font. How can I later
>> add it to OpenBSD kernel and ports? Pull request to main font on github
>> (Hi, Frederic) or patch here?
>
> You cannot add it to the kernel because the kernel does not support
> UTF-8, but only US-ASCII, and US-ASCII contains no code points for
> cyrillic letters.
>
> Full UTF-8 support is definitely not wanted in the kernel. Adding
> extremely minimalistic UTF-8 support to the kernel is not completely
> out of the question, but some developers are likely to feel sceptic even
> about that. Consequently, trying to pursue a project of adding anything
> related to UTF-8 to the kernel is likely to end in frustration if the
> person trying that does not have a significant amount of experience with
> getting OpenBSD kernel patches committed.
>
> I'm sorry that i know absolutely nothing about fonts in ports, maybe
> someone else can answer that part of the question.
>
> Yours,
> Ingo
>
>
