Copyright 2002-2003 Printer Working Group, All Rights Reserved.
When sending a job to a printer, a print client (PC or other device) needs to make sure the printer can print the characters in the job. On PCs and similar devices, clients traditionally use font downloading to supply characters that the printer may not have. On smaller devices, including PDAs, set-top boxes, etc., this will often not be an option.
This document provides guidance for implementors of printers and printing clients, including summaries and references to existing standards, recommended practices, and recommendations for future standards.
This document is informative only. It has been neither reviewed nor approved by PWG Members. It is not a stable document and may not be cited as a normative reference from another document.
Public discussion of Character Repertoires takes place on the mailing list: cr@pwg.org (archive). To subscribe send an email to majordomo@pwg.org with the words subscribe cr in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to one of the editors listed above or on the mailing list.
January 13, 2003:
There is very little new material in this document. Rather, it is an attempt to summarize a complex subject, provide a conceptual framework, and bring together references so that a non-specialist can quickly find what is needed for managing printable characters. [The present author often feels, while surfing the web, that he is rediscovering what was well-known in a different time and place.]
A second goal is to clarify areas where more standards work is needed.
We assume the reader has some familiarity with Internet technologies such as Unicode, MIME, and XML. Older technologies are used only as needed for specific applications, and can usually be mapped into or associated with corresponding Internet technologies. This approach has two principal advantages:
We take technical material from these general areas:
Useful background reading can be found at:
In this document we rely heavily on the Unicode scheme for organizing characters. Much of the following material is excerpted from [Unicode-principles].
Unicode is a widely-adopted, worldwide character encoding standard. For each character it defines:
Some examples include:
The actual appearance of the character on paper or screen is called a "glyph", and varies based on device, font, etc. Unicode does not define glyphs, although it does give examples.
Character encodings define how these numeric values are represented in bits. Unicode defines three encodings:
In order to print successfully, a client needs to know both what characters (code points) are available and what encodings can be used.
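The distinction between a code point and its encodings can be illustrated with a minimal Python sketch. It assumes the three encodings referred to above are UTF-8, UTF-16, and UTF-32, and uses big-endian byte order for the last two; the function name `describe` is purely illustrative.

```python
import unicodedata


def describe(ch):
    """Return the code point, Unicode name, and encoded bytes (hex) of one character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch),
        # The same code point yields a different byte sequence in each encoding.
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }


info = describe("\u20ac")  # EURO SIGN
for key, value in info.items():
    print(key, value)
```

A printer that has the character but does not accept the encoding used by the client will still fail to print it, which is why both pieces of information are needed.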
There are many character sets that are not based on Unicode, and several of these are important for printing existing documents. Fortunately, nearly all have published mappings into Unicode. Therefore, knowing what Unicode characters are available, a client can deduce which characters are available from an alternate character set. In addition, the client needs to find out whether the printer can accept characters encoded in the alternate character set, or whether the client must map them to Unicode.
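As a rough sketch of this deduction, the following Python fragment derives the Unicode characters of a single-byte charset from Python's built-in codec tables (an assumption; a real client would use whichever published mapping it trusts) and reports which of them a printer lacks. The printer's character set shown here is hypothetical.

```python
import unicodedata


def charset_repertoire(charset):
    """Unicode characters reachable from a single-byte charset, excluding control codes."""
    chars = set()
    for byte in range(0x100):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the charset
        if unicodedata.category(ch) != "Cc":
            chars.add(ch)
    return chars


def missing_characters(charset, printer_chars):
    """Characters of the charset that the printer cannot render."""
    return charset_repertoire(charset) - printer_chars


# Hypothetical printer that advertises only printable ASCII.
printer_chars = {chr(cp) for cp in range(0x20, 0x7F)}
missing = missing_characters("iso-8859-1", printer_chars)
print(len(missing), "characters of ISO-8859-1 are unavailable")
```

An empty result would mean that every character of the alternate charset can be printed, provided the client also knows how to deliver it (directly or mapped to Unicode).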
Of course a printer may be configured to accept other character sets, but not those based on Unicode. However, such a printer is outside the scope of this Guide.
In summary, before printing a job a client needs to determine this information about the printer:
It may also be that some characters are conditionally available, e.g. only when certain fonts are selected. This topic is reserved for future work, and is not considered in this Guide. In fact, one recommendation is that a printer implement a system default font that can be used to render its full character set, and that this font be used as a fall-through to handle missing characters in other fonts.
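The fall-through recommendation might look like the following Python sketch. The font names and coverage sets are hypothetical, and a real printer would consult its font tables rather than in-memory sets.

```python
def pick_font(ch, selected_font, fonts, default_font="system-default"):
    """Return the font used to render ch, or None if no font covers it.

    fonts maps a font name to the set of characters that font can render
    (a hypothetical structure, used here only for illustration).
    """
    if ch in fonts.get(selected_font, set()):
        return selected_font
    if ch in fonts.get(default_font, set()):
        return default_font  # fall through to the system default font
    return None


fonts = {
    "serif": set("ABCabc"),
    "system-default": set("ABCabc\u20ac\u00e9"),
}
print(pick_font("A", "serif", fonts))
print(pick_font("\u20ac", "serif", fonts))
```

With this arrangement the printer's advertised character set can be that of the default font, regardless of which font a job selects.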
We use the term "repertoire" in two ways:
[Issue: should we use the term "character collection" instead?]
For purposes of this document we focus primarily on the second definition. We rely on Unicode for the definition of characters, and on various repertoires to tell which Unicode characters are actually present.
Examples of "charsets":
Examples of "repertoires":
Historically, the Internet community created standards for charsets based on the need to agree on coding schemes for email using MIME. These MIME definitions have been incorporated into HTTP, XML, and most other web-based specifications.
The IANA registry (long) of charsets is available at [IANA-charsets]. Every registered charset contains at least:
In some cases, an alternate "preferred MIME name" is given. In those cases that is the name we use.
In MIME and HTTP headers, the charset is indicated with the "charset" parameter [Issue: verify this].
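For illustration, a Content-Type header value carrying a "charset" parameter can be read with Python's standard `email` package, as sketched below. The helper name and the `us-ascii` fallback are assumptions made for this example, not requirements of this Guide.

```python
from email.message import Message


def charset_from_content_type(header_value, default="us-ascii"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter in lower case,
    # or the fallback when the parameter is absent.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type('text/plain; charset="ISO-8859-1"'))
print(charset_from_content_type("text/plain"))
```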
In XML, the charset may be indicated with a text declaration containing an encoding declaration (see [XML] Section 4.3), e.g.:
<?xml encoding='UTF-8'?>
Printing languages based on XML may therefore use an XML text declaration to choose a non-Unicode charset, if this charset is supported by the printer.
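A receiver might read the declared encoding and compare it with its supported charsets roughly as in the Python sketch below. It assumes the job starts with an ASCII-compatible declaration (so UTF-16 input is not handled), falls back to UTF-8 when no declaration is present, and compares names case-insensitively; the function names and the printer's charset list are hypothetical.

```python
import re

# Matches an XML or text declaration with an optional version and an encoding.
_DECL = re.compile(
    r"""^<\?xml(?:\s+version\s*=\s*(['"])[^'"]*\1)?"""
    r"""\s+encoding\s*=\s*(['"])([A-Za-z][A-Za-z0-9._-]*)\2"""
)


def declared_encoding(data, default="UTF-8"):
    """Return the encoding named at the start of an XML byte stream."""
    head = data[:200].decode("ascii", errors="replace")
    match = _DECL.match(head)
    return match.group(3) if match else default


def printer_accepts(encoding, printer_charsets):
    """Case-insensitive check against the printer's supported charsets."""
    return encoding.lower() in {name.lower() for name in printer_charsets}


job = b"<?xml version='1.0' encoding='Shift_JIS'?><html/>"
print(declared_encoding(job))
print(printer_accepts(declared_encoding(job), ["UTF-8", "Shift_JIS"]))
```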
As a practical matter one can't ignore the influence of Microsoft on printing applications. Microsoft has converted to a Unicode-centric approach to their codepages, and each of their codepages is based on a published standard. However, in some cases Microsoft has added [Issue: and changed/removed?] characters.
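One way to see such additions is to compare a codepage with the standard it is based on. The Python sketch below uses Python's own codec tables for windows-1252 and ISO-8859-1 as the source of mappings, which is an assumption made for illustration; the published mapping files are the authoritative reference.

```python
def defined_chars(charset):
    """All Unicode characters that a single-byte charset assigns to some byte."""
    chars = set()
    for byte in range(0x100):
        try:
            chars.add(bytes([byte]).decode(charset))
        except UnicodeDecodeError:
            pass  # unassigned byte
    return chars


# Characters present in the Microsoft codepage but not in the ISO standard.
added = defined_chars("windows-1252") - defined_chars("iso-8859-1")
for ch in sorted(added):
    print("U+%04X" % ord(ch))
```

The same comparison can be run in the other direction to look for characters that a codepage changes or omits.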
[ISO-8859] defines various Latin-based alphabets (each up to 256 characters in size), while [Unicode-8859] is a set of mappings from ISO codes to Unicode code points.
In the XHTML community, [XHTML-chars] defines a number of pre-defined character entities, in these groups:
for a total of 253 entries.
Microsoft has registered these codepages with IANA:
In addition, as part of their OpenType specification, Microsoft defines the WGL4.0 character set, which is expressed in terms of Unicode (see [WGL4.0-desc] and [WGL4.0-data]). It has 652 characters, containing many of the characters from the ISO Latin sets, as well as quite a few symbols.
Thai is handled as a Latin alphabet, using ISO-8859-11. [Issue: There is apparently some controversy about this.]
You can compare the ISO-8859, XHTML, and Microsoft repertoires side by side at [PWG-Latin-table].
Normative references to Asian character encoding definitions are given in [IANA-charsets]. In general, mapping these to Unicode is difficult, due to ambiguity in some of the characters (see [XML-Japanese] for discussion of this).
If a printer implements a specific Asian charset, we recommend that it do both of these:
If a client has text in an Asian charset (e.g. Shift-JIS), it should use that charset directly if the printer supports it. Otherwise, it should use one of the common mappings to convert to Unicode. This Guide does not define which of the common mappings is the preferred one.
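The client-side decision described above can be sketched in Python as follows. The conversion uses Python's built-in codec for the source charset, which is only one of the common mappings and is not endorsed by this Guide; the function name and the printer charset lists are hypothetical.

```python
def prepare_job_text(raw, source_charset, printer_charsets):
    """Return (charset, payload) for sending text to the printer.

    The text is sent unchanged when the printer accepts its charset;
    otherwise it is converted to Unicode and sent as UTF-8.
    """
    supported = {name.lower() for name in printer_charsets}
    if source_charset.lower() in supported:
        return source_charset, raw
    # Assumption: Python's mapping for the source charset is acceptable.
    text = raw.decode(source_charset)
    return "UTF-8", text.encode("utf-8")


raw = "\u65e5\u672c\u8a9e".encode("shift_jis")
print(prepare_job_text(raw, "Shift_JIS", ["UTF-8", "Shift_JIS"])[0])
print(prepare_job_text(raw, "Shift_JIS", ["UTF-8"])[0])
```

Sending the original bytes when possible avoids the mapping ambiguities mentioned above, since no conversion takes place on the client.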
Specific CJK repertoires are:
Another source of mappings is the Unihan database published by the Unicode Consortium [Unihan]. However, it is not easy to determine exactly which Unihan tag to use in these various cases.
Microsoft publishes their CJK codepages, with Unicode mappings:
The PWG will define a standard set of repertoire names to be used for printing capabilities. The draft version of this list is:
PWG Character Repertoire | Based on IANA Charset | Description | Reference Location |
ISO-8859-1 | ISO-8859-1 | Latin alphabet No. 1 | [RFC-1345] |
ISO-8859-2 | ISO-8859-2 | Latin alphabet No. 2 | [RFC-1345] |
ISO-8859-3 | ISO-8859-3 | Latin alphabet No. 3 | [RFC-1345] |
ISO-8859-4 | ISO-8859-4 | Latin alphabet No. 4 | [RFC-1345] |
ISO-8859-5 | ISO-8859-5 | Latin/Cyrillic alphabet | [RFC-1345] |
ISO-8859-6 | ISO-8859-6 | Latin/Arabic alphabet | [RFC-1345] |
ISO-8859-7 | ISO-8859-7 | Latin/Greek alphabet | [RFC-1345] |
ISO-8859-8 | ISO-8859-8 | Latin/Hebrew alphabet | [RFC-1345] |
ISO-8859-9 | ISO-8859-9 | Latin alphabet No. 5 | [RFC-1345] |
ISO-8859-10 | ISO-8859-10 | Latin alphabet No. 6 | [RFC-1345] |
ISO-8859-13 | ISO-8859-13 | Latin alphabet No. 7 | http://www.iana.org/assignments/charset-reg/iso-8859-13 |
ISO-8859-14 | ISO-8859-14 | Latin alphabet No. 8 | http://www.iana.org/assignments/charset-reg/iso-8859-14 |
ISO-8859-15 | ISO-8859-15 | Latin alphabet No. 9 | http://www.iana.org/assignments/charset-reg/ISO-8859-15 |
ISO-8859-16 | ISO-8859-16 | Latin alphabet No. 10 | ??? Could use http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT |
GB_2312-80 | GB_2312-80 | Chinese (People’s Republic of China) | [RFC-1345] |
Shift_JIS | Shift_JIS | Japanese | [JIS X 0201] and [JIS X 0208] |
KS_C_5601-1987 | KS_C_5601-1987 | Korean | [RFC-1345] |
Big5 | Big5 | Chinese (Taiwan) | [Big5] |
TIS-620 | TIS-620 | Thai | [TIS-620] |
Note that the XHTML predefined character entities are not shown in this table. They should be supported implicitly by any printer processing an XHTML-based language.
[Issues:
-how should we handle Microsoft code pages? Should a printer reference them directly? Should a printer add in characters from "similar" MS codepages, e.g. from windows-1251 when doing Latin/Cyrillic?
-how should we handle characters in WGL4.0? A few of these are symbols that don't show up in ISO-8859.
]
Various protocols provide a way for a client to find out information about a printer's capabilities. These protocols should be extended to define how the client can learn what repertoires are available in a printer.
The fundamental semantic unit for getting this capability is an attribute named "repertoires-supported" on the Printer object. The value is a comma-separated string containing the PWG names of the supported repertoires, including any implicitly-supported repertoires as listed below. Various protocols may map these names to other forms of representation. For example, the Bluetooth Basic Printing Profile uses bits in a bitmap, while the Printer MIB uses string names with no punctuation.
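A client might handle the comma-separated value as in the Python sketch below. The attribute value shown is made up, and tolerating surrounding whitespace and comparing names case-insensitively are assumptions, since this Guide does not yet define those details.

```python
def parse_repertoires_supported(value):
    """Split a repertoires-supported value into a list of repertoire names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def supports(value, repertoire):
    """True if the named repertoire appears in the attribute value."""
    names = {name.lower() for name in parse_repertoires_supported(value)}
    return repertoire.lower() in names


# Hypothetical attribute value returned by a printer.
value = "ISO-8859-1, ISO-8859-15,Shift_JIS"
print(parse_repertoires_supported(value))
print(supports(value, "Shift_JIS"))
```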
In addition, a protocol may provide a mechanism for discovering particular charsets that may be sent directly. The repertoires-supported attribute does not necessarily reflect characters available in non-Unicode charsets.
Queries associating available repertoires with fonts, charsets, PDLs, etc. are reserved for future study.
If a printer uses a protocol that supports a repertoire capability query, the client should use it. When that is not possible, a client may make the following assumptions:
Most printing languages define a default charset. Languages based on XHTML specify that a printer must support UTF-8 (an encoding of Unicode) in addition to any other charsets it accepts.
Based on the repertoires defined above, a printer may always use the Unicode codepoints corresponding to those repertoires. However, most of these repertoires originated with some non-Unicode encoding, and there may be problems mapping to Unicode.
A printer may choose to implement the original, non-Unicode charset based on the repertoires listed above. This is not likely to be useful for Latin codings, but may be especially useful for Shift-JIS.
[Issue: how should a client learn which charsets are available?]
This section is directed at the Printer Working Group, with suggestions for standards that need to be developed.
This Guide was prepared by the PWG Character-Repertoires Working Group, with input and assistance from:
We also thank the authors of the original material cited in the references.