Character Repertoires Implementor's Guide

Printer Working Group Draft, January 13, 2003

Editors: Elliott Bradshaw, Oak Technology Imaging Group

Abstract

When sending a job to a printer, a print client (PC or other device) needs to make sure the printer has the ability to print the characters in the job. On PCs and similar devices, clients traditionally use font downloading to supply characters which the printer may not have. On smaller devices such as PDAs and set-top boxes, this will often not be an option.

This document provides guidance for implementors of printers and printing clients, including summaries and references to existing standards, recommended practices, and recommendations for future standards.

Status of this Document
Revision History
General Approach
Primary References
Unicode
Terminology
Internet Charsets
Microsoft Codepages
Discussion of Referenced Character Sets
Named Character Repertoires
Determining A Printer's Supported Repertoires
Determining a Printer's Supported Charsets
Recommendations for the Printer Implementor
Recommendations for the Client Implementor
Recommendations for Standards Work
Issues
Acknowledgements
References

Status of this Document

This document is informative only. It has not been reviewed by PWG Members nor approved. It is not a stable document and may not be cited as a normative reference from another document.

Public discussion of Character Repertoires takes place on the mailing list: cr@pwg.org (archive). To subscribe send an email to majordomo@pwg.org with the words subscribe cr in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to one of the editors listed above or on the mailing list.

Revision History

January 13, 2003:

Rewritten based on initial review.

General Approach

There is very little new material in this document. Rather, it is an attempt to summarize a complex subject, provide a conceptual framework, and bring together references so that a non-specialist can quickly find what is needed for managing printable characters. [The present author often feels, while surfing the web, that he is rediscovering what was well-known in a different time and place.]

A second goal is to clarify areas where more standards work is needed.

We assume the reader has some familiarity with Internet technologies such as Unicode, MIME, and XML. Older technologies are used only as needed for specific applications, and can usually be mapped into or associated with corresponding Internet technologies. This approach has two principal advantages:

"Forward-looking" applications such as XHTML-Print can be built without knowledge of legacy technologies
The Internet technologies are well documented and widely understood, thus providing a reliable basis for common understanding

Primary References

We take technical material from these general areas:

[Unicode-principles] The most widely adopted, world-wide character encoding standard.
Internet and World Wide Web specifications rely heavily on the MIME registry for "charsets," located at [IANA-charsets]. This registry defines the character encodings used in most Internet applications.
Microsoft published "codepages" for various national regions. These codepages are online at [Microsoft-codepages], and include mappings to Unicode. These are used by most Windows applications.

Useful background reading can be found at:

[Lunde], which provides useful history and comparisons of various Asian character sets.
[XML-Japanese], which summarizes internet issues and references for Japanese characters. Some of this material is also relevant for other Asian languages.

Unicode

In this document we rely heavily on the Unicode scheme for organizing characters. Much of the following material is excerpted from [Unicode-principles].

Unicode is a widely-adopted, worldwide character encoding standard. For each character it defines:

A number, known as a code point, usually written in hexadecimal preceded by "U+"
A name

Some examples include:

U+0041 "LATIN CAPITAL LETTER A"
U+0180 "LATIN SMALL LETTER B WITH STROKE"
U+0436 "CYRILLIC SMALL LETTER ZHE"
U+0624 "ARABIC LETTER WAW WITH HAMZA ABOVE"
U+0A1B "GURMUKHI LETTER CHA"
U+2733 "EIGHT SPOKED ASTERISK"
U+30A4 "KATAKANA LETTER I"
U+3204 "PARENTHESIZED HANGUL MIEUM"

The actual appearance of the character on paper or screen is called a "glyph", and varies based on device, font, etc. Unicode does not define glyphs, although it does give examples.

Character encodings define how these numeric values are represented in bits. Unicode defines three encodings:

UTF-8. Each character is represented in 1-4 bytes. The Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII.
UTF-16. Each character is represented in 1 or 2 16-bit words.
UTF-32. Each character is represented in 1 32-bit word.
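The relationship between code points, names, and the three encodings can be checked with a short Python sketch. The helper names describe and encoded_sizes are illustrative only and do not come from any standard; the sketch relies on Python's built-in unicodedata module and codecs.

```python
import unicodedata

def describe(ch):
    """Format a character as 'U+XXXX NAME' using the Unicode character database."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 16-bit words, UTF-32 32-bit words) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_words = len(ch.encode("utf-16-be")) // 2
    utf32_words = len(ch.encode("utf-32-be")) // 4
    return (utf8_bytes, utf16_words, utf32_words)

# An ASCII letter, a Cyrillic letter, a Katakana letter, and a character
# outside the 16-bit range (which needs two UTF-16 words).
for ch in ("A", "\u0436", "\u30a4", "\U00010400"):
    print(describe(ch), encoded_sizes(ch))
```

The big-endian codec variants are used here only so that no byte order mark is counted in the sizes.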

In order to print successfully, a client needs to know both what characters (code points) are available and what encodings can be used.

There are many character sets that are not based on Unicode, and several of these are important for printing existing documents. Fortunately, nearly all have published mappings into Unicode. Therefore, knowing what Unicode characters are available, a client can deduce which characters are available from an alternate character set. In addition, the client needs to find out whether the printer can accept characters encoded in the alternate character set, or whether the client must map them to Unicode.
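As a rough illustration of that deduction, the following Python sketch derives the set of Unicode code points reachable from a single-byte charset by decoding every byte value. It assumes Python's bundled codecs as the mapping tables (a client would normally use the published mappings it has agreed on with the printer), and the function name repertoire_for is hypothetical. It is not suitable for multi-byte charsets such as Shift-JIS.

```python
import unicodedata
from typing import Set

def repertoire_for(charset: str) -> Set[int]:
    """Return the Unicode code points of the graphic characters in a single-byte charset."""
    points = set()
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte value is unassigned (or incomplete) in the charset
        # Keep only single characters that are not control codes (category "Cc").
        if len(ch) == 1 and unicodedata.category(ch) != "Cc":
            points.add(ord(ch))
    return points

latin1 = repertoire_for("iso-8859-1")
cyrillic = repertoire_for("iso-8859-5")

# U+0436 CYRILLIC SMALL LETTER ZHE is in the Latin/Cyrillic repertoire
# but not in Latin alphabet No. 1.
print(0x0436 in cyrillic, 0x0436 in latin1)
```

Given the printer's available Unicode characters as a similar set, a client can test whether a whole repertoire is covered with an ordinary subset comparison.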

Of course a printer may be configured to accept other character sets, but not those based on Unicode. However, such a printer is outside the scope of this Guide.

In summary, before printing a job a client needs to determine this information about the printer:

Which character encoding schemes are available (e.g. Unicode UTF-8, Shift-JIS)
Within each scheme, which characters are available

It may also be that some characters are conditionally available, e.g. only when certain fonts are selected. This topic is reserved for future work, and is not considered in this Guide. In fact, one recommendation is that a printer implement a system default font that can be used to render its full character set, and that this font be used as a fall-through to handle missing characters in other fonts.

Terminology

charset: A method of converting a sequence of octets into a sequence of characters. This is how the term is used in the MIME registry. See [RFC-2278] and [XML-Japanese] for discussions of the complexities of this term.

We use the term "repertoire" in two ways:

repertoire: (1) The complete set of characters defined in a given named character set, such as ISO 8859-1; (2) The subset of characters defined in Unicode 3.2 that is needed for an exact mapping to a smaller character set, such as ISO 8859-1.

[Issue: should we use the term "character collection" instead?]

Primarily, for purposes of this document we focus on the second definition. We rely on Unicode for the definition of characters, and on various repertoires to tell which Unicode characters are actually present.

Examples of "charsets":

UTF-8 (all the characters of Unicode, encoded in a specific way)
ISO 8859-1, which gives a specific coded value for each character
Shift-JIS

Examples of "repertoires":

Unicode itself
"The Unicode characters that map to ISO-8859-1"
"The Unicode characters that map to Shift-JIS"

Internet Charsets

Historically, the Internet community created standards for charsets based on the need to agree on coding schemes for email using MIME. These MIME definitions have been incorporated into HTTP, XML, and most other web-based specifications.

The (long) IANA registry of charsets is available at [IANA-charsets]. Every registered charset contains at least:

a primary name
reference to an RFC or other publicly available specification
if feasible, a mapping to Unicode

In some cases, an alternate "preferred MIME name" is given. In those cases that is the name we use.

In MIME and HTTP headers, the charset is indicated with the "charset" parameter [Issue: verify this].
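For illustration, the charset parameter can be read from a Content-Type header value with Python's standard email package, as in the sketch below. The function name charset_from_content_type and the fallback value are assumptions made for this example; the appropriate default when the parameter is absent depends on the protocol in use.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "us-ascii") -> str:
    """Return the lower-cased charset parameter of a Content-Type value.

    'default' is a hypothetical fallback used when no charset parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)

print(charset_from_content_type("application/xhtml+xml; charset=UTF-8"))
print(charset_from_content_type('text/plain; charset="Shift_JIS"'))
print(charset_from_content_type("text/plain"))
```

Charset names are returned in lower case, so they should be compared case-insensitively against registry names.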

In XML, the charset may be indicated with a text declaration containing an encoding declaration (see [XML] Section 4.3), e.g.:
<?xml encoding='UTF-8'?>

Printing languages based on XML may therefore use an XML text declaration to choose a non-Unicode charset, if this charset is supported by the printer.

Microsoft Codepages

As a practical matter one can't ignore the influence of Microsoft on printing applications. Microsoft has converted to a Unicode-centric approach to their codepages, and each of their codepages is based on a published standard. However, in some cases Microsoft has added [Issue: and changed/removed?] characters.

Discussion of Referenced Character Sets

Latin

[ISO-8859] defines various Latin-based alphabets (each up to 256 characters in size), while [Unicode-8859] is a set of mappings from ISO codes to Unicode code points.

In the XHTML community, [XHTML-chars] defines a number of pre-defined character entities, in these groups:

Latin 1 (96 entries)
Special Characters (33 entries)
Mathematical, Greek, and Symbolic (124 entries)

for a total of 253 entries.

Microsoft has registered these codepages with IANA:

In addition, as part of their OpenType specification, Microsoft defines the WGL4.0 character set, which is expressed in terms of Unicode (see [WGL4.0-desc] and [WGL4.0-data]). It has 652 characters, containing many of the characters from the ISO Latin sets, as well as quite a few symbols.

Thai is handled as a Latin alphabet, using ISO-8859-11. [Issue: There is apparently some controversy about this.]

You can compare the ISO-8859, XHTML, and Microsoft repertoires side by side at [PWG-Latin-table].

Asian (CJK)

Normative references to Asian character encoding definitions are given in [IANA-charsets]. In general, mapping these to Unicode is difficult, due to ambiguity in some of the characters (see [XML-Japanese] for discussion of this).

If a printer implements a specific Asian charset, we recommend that it do both of these:

Implement all the characters implied by the various common mappings to Unicode.
Implement the original (non-Unicode) character encoding, as defined in the originating charset.

If a client has text in an Asian charset (e.g. Shift-JIS), it should use that charset directly if the printer supports it. Otherwise, it should use one of the common mappings to convert to Unicode. This Guide does not define which of the common mappings is the preferred one.
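The client-side choice described above might be sketched as follows. This is only an illustration: the function name prepare_job_text and the list of printer charsets are hypothetical, and Python's built-in shift_jis codec implements just one of the common mappings to Unicode, which this Guide does not designate as preferred.

```python
from typing import Iterable, Tuple

def prepare_job_text(data: bytes, source_charset: str,
                     printer_charsets: Iterable[str]) -> Tuple[str, bytes]:
    """Return (charset to declare, bytes to send) for text held in source_charset."""
    supported = {name.lower() for name in printer_charsets}
    if source_charset.lower() in supported:
        # The printer accepts the original charset: send the bytes unmapped.
        return source_charset, data
    # Otherwise map to Unicode using the codec's mapping and send UTF-8.
    text = data.decode(source_charset)
    return "UTF-8", text.encode("utf-8")

job = "\u65e5\u672c\u8a9e".encode("shift_jis")
print(prepare_job_text(job, "Shift_JIS", ["UTF-8", "Shift_JIS"])[0])
print(prepare_job_text(job, "Shift_JIS", ["UTF-8"])[0])
```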

Specific CJK repertoires are:

Chinese PRC (charset: GB_2312-80). There are 7,445 characters. Mapping to Unicode is given by [RFC-1345].
Chinese Taiwan (charset: Big5). There are 13,494 characters. Mapping?
Japanese (charset: Shift_JIS). There are four common, slightly different mappings to Unicode, as described in [XML-Japanese].
Korean (charset: KS_C_5601-1987). Same as KS_X_1001:1992. There are 8,224 characters. Mapping to Unicode is given by [RFC-1345].

Another source of mappings is the Unihan database published by the Unicode Consortium [Unihan]. However, it is not easy to determine exactly which Unihan tag to use in these various cases.

Microsoft publishes their CJK codepages, with Unicode mappings:

Named Character Repertoires

The PWG will define a standard set of repertoire names to be used for printing capabilities. The draft version of this list is:

PWG Character Repertoire	Based on IANA Charset	Description	Reference Location
ISO-8859-1	ISO-8859-1	Latin alphabet No. 1	[RFC-1345]
ISO-8859-2	ISO-8859-2	Latin alphabet No. 2	[RFC-1345]
ISO-8859-3	ISO-8859-3	Latin alphabet No. 3	[RFC-1345]
ISO-8859-4	ISO-8859-4	Latin alphabet No. 4	[RFC-1345]
ISO-8859-5	ISO-8859-5	Latin/Cyrillic alphabet	[RFC-1345]
ISO-8859-6	ISO-8859-6	Latin/Arabic alphabet	[RFC-1345]
ISO-8859-7	ISO-8859-7	Latin/Greek alphabet	[RFC-1345]
ISO-8859-8	ISO-8859-8	Latin/Hebrew alphabet	[RFC-1345]
ISO-8859-9	ISO-8859-9	Latin alphabet No. 5	[RFC-1345]
ISO-8859-10	ISO-8859-10	Latin alphabet No. 6	[RFC-1345]
ISO-8859-13	ISO-8859-13	Latin alphabet No. 7	http://www.iana.org/assignments/charset-reg/iso-8859-13
ISO-8859-14	ISO-8859-14	Latin alphabet No. 8	http://www.iana.org/assignments/charset-reg/iso-8859-14
ISO-8859-15	ISO-8859-15	Latin alphabet No. 9	http://www.iana.org/assignments/charset-reg/ISO-8859-15
ISO-8859-16	ISO-8859-16	Latin alphabet No. 10	??? Could use http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT
GB_2312-80	GB_2312-80	Chinese (People’s Republic of China)	[RFC-1345]
Shift_JIS	Shift_JIS	Japanese	[JIS X 0201] and [JIS X 0208]
KS_C_5601-1987	KS_C_5601-1987	Korean	[RFC-1345]
Big5	Big5	Chinese (Taiwan)	[Big5]
TIS-620	TIS-620	Thai	[TIS-620]

Note that the XHTML predefined character entities are not shown in this table. They should be supported implicitly by any printer processing an XHTML-based language.

[Issues:

-how should we handle Microsoft code pages? Should a printer reference them directly? Should a printer add in characters from "similar" MS codepages, e.g. from windows-1251 when doing Latin/Cyrillic?

-how should we handle characters in WGL4.0? A few of these are symbols that don't show up in ISO-8859.

]

Determining A Printer's Supported Repertoires

Capability Queries for Supported Repertoires

Various protocols provide a way for a client to find out information about a printer's capabilities. These protocols should be extended to define how the client can learn what repertoires are available in a printer.

The fundamental semantic unit for getting this capability is an attribute named "repertoires-supported" on the Printer object. The value is a comma-separated string containing the PWG names of the supported repertoires, including any implicitly-supported repertoires as listed below. Various protocols may map these names to other forms of representation. For example, the Bluetooth Basic Printing Profile uses bits in a bitmap, while the Printer MIB uses string names with no punctuation.
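A client might split the attribute value into repertoire names as in the following sketch. The function name parse_repertoires_supported is hypothetical, and tolerating whitespace around the commas is an assumption; the exact syntax of the value has not yet been specified.

```python
from typing import List

def parse_repertoires_supported(value: str) -> List[str]:
    """Split a comma-separated repertoires-supported value into PWG repertoire names."""
    # Whitespace around names and empty entries are dropped (an assumption).
    return [name.strip() for name in value.split(",") if name.strip()]

names = parse_repertoires_supported("ISO-8859-1, ISO-8859-2,Shift_JIS")
print(names)
print("Shift_JIS" in names)
```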

In addition, a protocol may provide a mechanism for discovering particular charsets that may be sent directly. The repertoires-supported attribute does not necessarily reflect characters available in non-Unicode charsets.

Queries associating available repertoires with fonts, charsets, PDLs, etc. are reserved for future study.

Implicitly-Supported Repertoires

If a printer uses a protocol that supports a repertoire capability query, the client should use it. When that is not possible, a client may make the following assumptions:

Every printer supports Latin 1
Every printer that supports Enhanced Layout (as used in XHTML-Print, UPnP, Bluetooth BPP, etc.) supports 8859-1 through -5, -7, -9, -10, -13, -14, and -15. This includes Greek and Cyrillic but excludes right-to-left alphabets.
If a printer is described as supporting Arabic, it supports 8859-6 and implements right-to-left printing.
If a printer is described as supporting Hebrew, it supports 8859-8 and implements right-to-left printing.
If a printer is described as supporting Chinese (PRC), Chinese (Taiwan), Korean, Japanese, or Thai, it supports the corresponding repertoire from the above table.
If a printer supports XHTML-based languages, it supports the characters defined by the XHTML predefined character entities (including the ones that don't appear in a named repertoire).
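The assumptions listed above can be expressed as a small lookup, sketched here in Python. The function name, its parameters, and the language labels used as dictionary keys are all hypothetical; the repertoire names are taken from the table of named character repertoires. The XHTML predefined character entities are not a named repertoire and so are not represented.

```python
from typing import Iterable, List

# Repertoires assumed for any printer that supports Enhanced Layout.
ENHANCED_LAYOUT = ["ISO-8859-%d" % n for n in (1, 2, 3, 4, 5, 7, 9, 10, 13, 14, 15)]

# Repertoire implied when a printer is described as supporting a language.
LANGUAGE_REPERTOIRES = {
    "Arabic": "ISO-8859-6",
    "Hebrew": "ISO-8859-8",
    "Chinese (PRC)": "GB_2312-80",
    "Chinese (Taiwan)": "Big5",
    "Korean": "KS_C_5601-1987",
    "Japanese": "Shift_JIS",
    "Thai": "TIS-620",
}

def implicit_repertoires(enhanced_layout: bool = False,
                         languages: Iterable[str] = ()) -> List[str]:
    """Return the repertoires a client may assume when no capability query is possible."""
    names = ["ISO-8859-1"]  # every printer supports Latin 1
    if enhanced_layout:
        names += [n for n in ENHANCED_LAYOUT if n not in names]
    for language in languages:
        repertoire = LANGUAGE_REPERTOIRES.get(language)
        if repertoire and repertoire not in names:
            names.append(repertoire)
    return names

print(implicit_repertoires(enhanced_layout=True, languages=["Japanese"]))
```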

Determining a Printer's Supported Charsets

Most printing languages define a default charset. Languages based on XHTML specify that a printer must support UTF-8 (an encoding of Unicode), in addition to any other charsets it chooses to accept.

Based on the repertoires defined above, a printer may always use the Unicode codepoints corresponding to those repertoires. However, most of these repertoires originated with some non-Unicode encoding, and there may be problems mapping to Unicode.

A printer may choose to implement the original, non-Unicode charset based on the repertoires listed above. This is not likely to be useful for Latin codings, but may be especially useful for Shift-JIS.

[Issue: how should a client learn which charsets are available?]

Recommendations for the Printer Implementor

Always implement Unicode UTF-8, in addition to any other character encoding schemes.
Implement characters described by the rules in "Implicitly-Supported Repertoires," above.
Make supported characters available in all fonts, using a system font fall-through if needed.
Print a recognizable "missing character" symbol (for example an empty rectangle) for any character not supported.
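The last recommendation can be illustrated with a minimal sketch that substitutes a placeholder for unsupported characters before rendering. The choice of U+25A1 WHITE SQUARE as the placeholder and the function name with_missing_glyphs are assumptions for this example; a real printer would typically make the substitution at the glyph level, and would need to have the placeholder itself available.

```python
from typing import Set

MISSING = "\u25a1"  # WHITE SQUARE, used here as the "missing character" symbol

def with_missing_glyphs(text: str, supported: Set[int]) -> str:
    """Replace every character whose code point is not supported with the placeholder."""
    return "".join(ch if ord(ch) in supported else MISSING for ch in text)

# Hypothetical printer that supports only the printable ASCII range.
ascii_only = set(range(0x20, 0x7F))
print(with_missing_glyphs("caf\u00e9", ascii_only))
```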

Recommendations for the Client Implementor

If the printer provides a query mechanism to obtain supported repertoires and charsets, use it to find out what the printer can handle.
Otherwise, follow the guidelines in "Implicitly-Supported Repertoires," above.
If the source document is not in Unicode, decide whether or not to map it to Unicode. Usually, if the printer can handle the original charset it is best to send it unmapped.
If the document contains characters that won't print, decide whether to alert the user, map them to some other characters, let the printer handle them, etc.
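For the last item, a client first needs to know which characters will not print. The sketch below lists them in a form suitable for alerting the user; the function name unprintable_characters is hypothetical, and the set of supported code points is assumed to have been built from the printer's reported or implicitly supported repertoires.

```python
import unicodedata
from typing import List, Set

def unprintable_characters(text: str, supported: Set[int]) -> List[str]:
    """Return 'U+XXXX NAME' for each distinct character the printer cannot print."""
    # Note: line breaks and other controls are reported too unless they are
    # included in 'supported'.
    missing = sorted({ch for ch in text if ord(ch) not in supported})
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in missing]

# Hypothetical printer limited to the graphic range of Latin alphabet No. 1.
latin1_range = set(range(0x20, 0x100))
print(unprintable_characters("Zhe: \u0436", latin1_range))
```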

Recommendations for Standards Work

This section is directed at the Printer Working Group, with suggestions for standards that need to be developed.

Adopt a standard set of character repertoire names.
Define the rules for implicitly supported repertoires.
Define the semantics of a query mechanism to determine which repertoires and charsets are available in a printer.
Agree on and publish normative references for mapping between other schemes and Unicode.

Issues

How do we reference ISO-8859? Is there a version online, or does every reader need to buy it from ISO? If so, we should list exactly what they need to buy.
What about other ISO-8859 components?
- 8859-11: Latin/Thai
- 8859-12: does not exist
- 8859-16: Latin alphabet No. 10

Acknowledgements

This Guide was prepared by the PWG Character-Repertoires Working Group, with input and assistance from:

Rod Acosta, Agfa Monotype
Jun Fujisawa, Canon
Jim Bigelow, Hewlett-Packard
Ira McDonald, High North
Paul Tykodi, Intermate US
Mark Robb, Lexmark
Don Wright, Lexmark

We also thank the authors of the original material cited in the references.

References

[Big5]: "Chinese for Taiwan Multi-byte set. PCL Symbol Set Id: 18T", but where is this?
[BPP]: "Bluetooth Basic Printing Profile", Bluetooth SIG, October 5, 2001. Available at: http://www.bluetooth.com/pdf/Basic_Printing_Profile_0_95a.pdf.
[IANA-charsets]: http://www.iana.org/assignments/character-sets.
[ISO-8859]: ...purchase each alphabet online at http://www.iso.org.
[JIS X 0201]: Japanese Industrial Standards Committee. 7-bit and 8-bit coded character sets for information interchange, JIS X 0201:1997, Japanese Standards Association, 1997.
[JIS X 0208]: Japanese Industrial Standards Committee. 7-bit and 8-bit double byte coded KANJI sets for information interchange, JIS X 0208:1997, Japanese Standards Association, 1997.
[Lunde]: CJKV Information Processing, Ken Lunde. O'Reilly Press, 1999.
[Microsoft-codepages]: http://www.microsoft.com/globaldev/reference/cphome.asp.
[PWG-Latin-table]: ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/CRsummary.html.
[RFC-1345]: Character Mnemonics and Character Sets, Jun, 1992. ftp://ftp.rfc-editor.org/in-notes/rfc1345.txt.
[RFC-2278]: IANA Charset Registration Procedures. ftp://ftp.rfc-editor.org/in-notes/rfc2278.txt.
[TIS-620]: ???. maybe http://www.nectec.or.th/it-standards/std620/std620.htm (in Thai)
[Unicode-8859]: Mapping tables from 8859 alphabets to Unicode. http://www.unicode.org/Public/MAPPINGS/ISO8859/
[Unicode-principles]: The Unicode® Standard: A Technical Introduction. http://www.unicode.org/standard/principles.html
[Unihan]: Asian property database for Unicode; includes mappings from other alphabets. A very large file; zip form available at http://www.unicode.org/Public/UNIDATA/Unihan.zip.
[WGL4.0-data]: Unicode values for WGL4.0. http://www.microsoft.com/typography/OTSPEC/WGL4.htm.
[WGL4.0-desc]: Description of Microsoft's character set standard which "includes characters required by Western, Central, and Eastern European writing systems, as well as characters required by Greek and Turkish." http://www.microsoft.com/typography/unicode/cscp.htm
[XHTML-chars]: Predefined character entities in XHTML. http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
[XML]: Extensible Markup Language (XML) 1.0 (Second Edition), October, 2000. http://www.w3.org/TR/REC-xml.
[XML-Japanese]: XML Japanese Profile, April, 2000. http://www.w3.org/TR/japanese-xml/.