Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)

From: Timothy Miller (miller@techsource.com)
Date: Thu Apr 10 2003 - 17:08:45 EST


I took the liberty of reading the FAQ (yeah, I saw 9.16) and joining the
list after reading an interesting recent discussion on i18n of kernel
messages. In short, the primary maintainers of the kernel don't want
it, and I agree with them.

HOWEVER, the discussion inspired me to think about ways of reducing some
of the unfortunate but necessary bloat caused by keeping all of those
strings in RAM. Naturally, any way to do this must be absolutely
painless, so I came up with the following set of restrictions:

- Absolutely no requirement to change existing strings, unless you feel
like it
- Must be easy to use
- Must actually shrink the kernel
- The impact on the way kernel messages appear should be minimized

To be brief, the idea I came up with was to identify the 128 most common
words in kernel messages and replace them with single character values
above 127 which printk would decode on the way out. Once the list was
determined, there would be a header file people could use, at their
leisure, to make substitutions. So, for instance, instead of having this:

    printk("invalid: ...");

We would have this:

    #define MSG_INVALID "\200"
    ...
    printk(MSG_INVALID ": ...");
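
As a rough illustration of the decoding step, here is a minimal userspace
sketch. It is not code from any kernel tree: the table contents, the
names msg_tokens and msg_expand, and the MSG_FAILED and MSG_DEVICE macros
are all hypothetical, and it assumes byte values 0x80 through 0xFF are
reserved for tokens.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token table: entry i is the word for byte value 0x80 + i.
 * Only three illustrative entries are filled in; the remaining slots are
 * NULL and would be assigned from the frequency analysis. */
static const char *const msg_tokens[128] = {
    "invalid",  /* "\200" */
    "failed",   /* "\201" */
    "device",   /* "\202" */
};

#define MSG_INVALID "\200"
#define MSG_FAILED  "\201"
#define MSG_DEVICE  "\202"

/* Expand token bytes in src into dst, which has room for dstlen bytes.
 * dst is always NUL-terminated when dstlen > 0; output that does not fit
 * is silently truncated. Bytes >= 0x80 with no table entry are copied
 * through unchanged. Returns the number of bytes written, excluding the
 * terminating NUL. */
static size_t msg_expand(char *dst, size_t dstlen, const char *src)
{
    size_t out = 0;

    if (dstlen == 0)
        return 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        const char *word = (c >= 0x80) ? msg_tokens[c - 0x80] : NULL;

        if (word != NULL) {
            size_t n = strlen(word);

            if (n > dstlen - 1 - out)
                n = dstlen - 1 - out;   /* clip to the remaining space */
            memcpy(dst + out, word, n);
            out += n;
        } else if (out < dstlen - 1) {
            dst[out++] = (char)c;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Calling msg_expand(buf, sizeof buf, MSG_INVALID ": %d") leaves
"invalid: %d" in buf. One design choice this sketch assumes: the
expansion is applied to the format string before the arguments are
formatted, so that bytes above 127 arriving through %s arguments are not
mistaken for tokens.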

To judge the practicality of this, I used 'strings' on an uncompressed
kernel image (2.4.20, IIRC) and then ran it through this:

    tr '[:lower:]' '[:upper:]' | tr '[:blank:]' '\n' | sort | uniq -c | tr ' ' 0

This gave me a list of all words found in the kernel along with their
counts. Then I ran it through a positively awful little C program which
I wrote to determine not the 128 most frequent words, but rather the 128
that would result in the maximum shrinkage (maximize count * (length-1)).
The results of that run are given below. The upshot is that this approach
might save up to 62424 bytes of kernel space, which is only about 3% of
the kernel image size I got the strings from, but nearly 27% of the total
output I got from 'strings'. Is it worth it? Maybe not yet, but then
again, there may be an even more intelligent approach to this compression
that we could use, hopefully one which wouldn't require any more effort
to use.
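
The selection rule above (maximize count * (length-1)) can be sketched as
follows. This is not the original program, which is not shown here; the
names word_count, word_saving and pick_tokens are hypothetical, and the
sketch ignores the space taken by the token table itself.

```c
#include <stdlib.h>
#include <string.h>

/* One line of the "count word" list produced by the pipeline. */
struct word_count {
    const char *word;
    unsigned long count;
};

/* Bytes saved if every occurrence of the word becomes a one-byte token. */
static unsigned long word_saving(const struct word_count *w)
{
    size_t len = strlen(w->word);

    return len > 1 ? w->count * (unsigned long)(len - 1) : 0;
}

/* qsort comparator: largest saving first. */
static int by_saving_desc(const void *a, const void *b)
{
    unsigned long sa = word_saving(a);
    unsigned long sb = word_saving(b);

    return (sa < sb) - (sa > sb);
}

/* Sort words[] by saving, largest first, and return the total saving
 * from giving tokens to the best `slots` entries (128 in the proposal). */
static unsigned long pick_tokens(struct word_count *words, size_t n,
                                 size_t slots)
{
    unsigned long total = 0;
    size_t i;

    qsort(words, n, sizeof *words, by_saving_desc);
    for (i = 0; i < n && i < slots; i++)
        total += word_saving(&words[i]);
    return total;
}
```

For example, with counts taken from the list below, CONTROLLER (290
occurrences, 10 characters) is worth 290 * 9 = 2610 bytes, which beats
the far more frequent CO (1665 occurrences, 1665 bytes). That is why
ranking by raw frequency alone would pick a worse set.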

Here are the results:

   count string
-------- --------
      37 GIGABIT
     102 BLOCK
      62 NULL
     871 [^_]
      26 INTERFACE
      23 MICROSYSTEMS
      75 RAGE
     338 SE
     226 TECH
     113 DEVICE
     214 <3>
     838 PC
      19 <3>INIT_MODULE:
      35 REGISTER
      41 <3>EXT3-FS
     656 UWVS
      57 NETWORK
      32 SUPPORT
      97 COMPUTER
     878 [^_
     137 NET
     198 MODE
     534 INC
      33 INTERNATIONAL
      59 CARDBUS
     203 TECHNO
     119 TECHNOLOGY
      46 CORP.
      31 EXT2-FS
     290 CONTROLLER
      64 ASSERTION
      83 DATA/FAX
     249 DATA
      60 KERNEL:
     304 CONTROL
      33 INVALID
     322 %D
     486 PCI
     185 INC.
      61 ERROR
      80 PORT
     154 IDE
      74 INODE
     102 <4>
      88 KERNEL
      52 ELECTRONICS
      44 <3>EXT3
     117 FAILED
      70 AUDIO
      83 HOST
      27 SEMICONDUCTOR
      50 CHIPS
      63 DEVFS
     117 ETHERNET
     299 ID
     291 COM
      46 CANNOT
      24 TRANSACTION
     238 TO
      79 TECHNOLOGIES
      63 %08X
      98 D$$
      37 PROCESS
     288 CORP
      56 DATA/FAX/VOICE
      39 COMMUNICATIONS
      44 10/100
      38 SERIAL
     146 CORPORATION
     236 TEC
     107 MICRO
      26 MICROSYSTEM
      95 ADAPTER
     324 NO
      50 POWER
     121 56K
      27 ACCELERATOR
      33 RESEARCH
      21 INTEGRATED
     271 PRO
      19 TECHNOLOGIES,
     237 LT
      43 CHIPSET
      28 NETWORKS
     317 L$
      40 <3>EXT3-FS:
    1665 CO
     192 BRIDGE
      13 MICROELECTRONICS
     157 JOURNAL
     147 FOR
      91 9D$
      18 CYBERSERIAL
      54 CYBER
      56 MEMORY
      34 DATA/FAX/VOICE/SPKP
      49 SMART
     207 LTD
     137 TCP
      57 CACHE
     407 T$
     160 <6>
      26 GRAPHICS
     888 D$
     140 SYSTEMS
     249 AT
       6 JOURNAL->J_COMMITTING_TRANSACTION
     142 MODEM
      32 CHANNEL
     131 %S:
     394 %S
      14 COMMIT_TRANSACTION
      63 FILE
      28 SMARTDAA)
      67 CHIP
      30 WINMODEM
     113 NOT
     139 ETH
     331 DEV
     197 FO
      52 VIDEO
      73 ELECTRONIC
      67 EXT3
      99 CARD
    1336 IN
     222 SYSTEM
     197 AD
      53 COMMUNICATION
Total reduction: 62424

Comments?

NOTE: I realize that some of those words probably aren't actually
"strings" in the kernel. This is a feasibility test, not a suggested list.



