Re: [PATCH]console:UTF-8 mode compatibility fixes

From: Adam TlaÅka
Date: Sat Feb 18 2006 - 11:01:06 EST


UÅytkownik Andrew Morton napisaÅ:

Adam Tla/lka <atlka@xxxxxxxxx> wrote:


This patch applies to 2.6.15.3 kernel sources to drivers/char/vt.c file.
It should work with other versions too.

Changed console behaviour so in UTF-8 mode vt100 alternate character
sequences work as described in terminfo/termcap linux terminal definition.
Programs can use vt100 control seqences - smacs, rmacs and acsc characters
in UTF-8 mode in the same way as in normal mode so one definition is always
valid - current behaviour make these seqences not working in UTF-8 mode.

Added reporting malformed UTF-8 seqences as replacement glyphs.
I think that terminal should always display something rather then ignoring
these kind of data as it does now. Also it sticks to Unicode standards
saying that every wrong byte should be reported. It is more human readable
too in case of Latin subsets including ASCII chars.

...

- } else if (vc->vc_utf) {
+ } else if (vc->vc_utf && !vc->vc_disp_ctrl) {
/* Combine UTF-8 into Unicode */
- /* Incomplete characters silently ignored */
+ /* Malformed sequence represented as replacement glyphs */
+rescan_last_byte:
if(c > 0x7f) {

...

+ if (vc->vc_npar) {
+ c = orig;
+ goto rescan_last_byte;
+ }

...

+ }
+ vc->vc_utf_count = 0;
+ c = orig;
+ goto rescan_last_byte;
+ }
continue;
}


I spent some time trying to work out why this cannot cause an infinite loop
and gave up. Can you explain?

1. this code is executed only if vc_utf_count != 0
which means uncompleted UTF-8 sequence, because in case of proper UTF-8 sequence or normal mode vc_utf_count == 0 in these places of code.

2. vc_npar is not used while completing UTF-seqence so I used it as a counter of scanned sequence continuation bytes, it is set to 0 if begin of UTF-8 seqence is detected and vc_utf_count set to number of continuation bytes

3. when you can't display replacement glyph bad sequence is ignored as previously so vc_utf_count and vc_npar is zeroed in case of malformed UTF-8 seqence and there is no loop - anyway replacement glyph
should always be defined IMHO or I must change this code because
it seems not to be correct to use c as tc as a last resort because in this case c means byte value which malformed scanned seqence so it is not valuable for us. Maybe the better way is to use "?" char as a last
resort instead of c value. What do you think?
Maybe I should remember all bytes of the UTF-sequence to use their values as a last resort char in case of malformed sequence and 0xfffd
not defined?

Regards
--
Adam TlaÅka mailto:atlka@xxxxxxxxx ^v^ ^v^ ^v^
System & Network Administration Group - - - ~~~~~~
Computer Center, GdaÅsk University of Technology, Poland
PGP public key: finger atlka@xxxxxxxxxxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/