Copyright 2001-2002 Printer Working Group, All Rights Reserved.
When sending a job to a printer, a print client (PC or other device) needs to make sure the printer can print the characters in the job. On PCs and similar devices, clients traditionally use font downloading to supply characters that the printer may not have. On smaller devices, including PDAs, set-top boxes, etc., this will often not be an option.
This document provides guidance for implementors of printers and printing clients, including summaries and references to existing standards, recommended practices, and recommendations for future standards.
This document is informative only. It has not been reviewed by PWG Members nor approved. It is not a stable document and may not be cited as a normative reference from another document.
Public discussion of Character Repertoires takes place on the mailing list: cr@pwg.org (archive). To subscribe send an email to majordomo@pwg.org with the words subscribe cr in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to one of the editors listed above or on the mailing list.
A list of current PWG Standards and other technical documents can be found at http://www.pwg.org/standards.html.
There is very little new material in this document. Rather, it is an attempt to summarize a complex subject, provide a conceptual framework, and bring together references so that a non-specialist can quickly find what is needed for managing printable characters. [The present author often feels, while surfing the web, that he is rediscovering what was well-known in a different time and place.]
A second goal is to clarify areas where more standards work is needed.
We assume the reader has some familiarity with Internet technologies such as Unicode, MIME, and XML. Older technologies are used only as needed for specific applications, and can usually be mapped into or associated with corresponding Internet technologies. This approach has two principal advantages:
The term "character set" is confusing. It is most often used to mean a specific set of (abstract) characters, each represented in a specific way (almost always as a series of octets). Knowing the character set, you know what bits to expect for each character. This is also the meaning of the term "charset" as used in MIME and XML.
Unicode adds a layer of abstraction, defining each character as an integer, and allowing for multiple coding schemes to represent each integer as a set of octets. In this sense Unicode is not a "character set", but might perhaps be called a "set of abstract characters."
In this document we preserve this distinction. We are primarily concerned with the set of abstract characters supported in a printer, relying on the (perhaps erroneous) assumption that multiple encodings would nevertheless access the same character data.
We adopt the term "repertoire" to mean a specified subset of Unicode characters, without regard to how they are encoded. When we use the term "character set", we mean a specific encoding, not necessarily Unicode.
Examples of "character sets":
Examples of "repertoires":
In this document we rely heavily on the Unicode scheme for organizing characters. Much of the following material is excerpted from [UC-Principles].
Unicode is a widely-adopted, worldwide character encoding standard. For each character it defines:
Some examples include:
The actual appearance of the character on paper or screen is called a "glyph", and varies based on device, font, etc. Unicode does not define glyphs, although it does give examples.
Character encodings define how these numeric values are represented in bits. Unicode defines three encodings:
To print successfully, a client needs to know both which characters (code points) are available and which encodings can be used.
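The distinction between a code point and its encodings can be illustrated with a minimal Python sketch. It assumes the three encodings in question are UTF-8, UTF-16, and UTF-32, and uses Python's built-in codecs; the choice of EURO SIGN as the sample character is arbitrary.

```python
# One abstract character, one code point, three different octet sequences.
ch = "\u20ac"          # EURO SIGN, chosen only as an example
code_point = ord(ch)   # the integer that Unicode assigns to the character

# Big-endian codec variants are used so that no byte-order mark is prepended.
encodings = ("utf-8", "utf-16-be", "utf-32-be")
octets = {name: ch.encode(name) for name in encodings}

print(f"U+{code_point:04X}")
for name, data in octets.items():
    print(f"{name:10} {data.hex()}")
```

The code point is the same in every case; only the octets on the wire differ, which is why a client has to learn about repertoires and encodings separately.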
There are many character sets that are not based on Unicode, and several of these are important for printing existing documents. Fortunately, nearly all have published mappings into Unicode. Therefore, knowing what Unicode characters are available, a client can deduce which characters are available from an alternate character set. In addition, the client needs to find out whether the printer can accept characters encoded in the alternate character set, or whether the client must map them to Unicode.
Of course a printer may be configured to accept other character sets, but not those based on Unicode. However, such a printer is outside the scope of this Guide.
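As a rough sketch of how a client might use published mappings to reason about a non-Unicode character set, the Python fragment below derives the Unicode repertoire of a single-byte charset from Python's bundled codec tables and checks a string against a printer's repertoire. The function names and the "printer" that supports only ISO-8859-1 are hypothetical, and Python's codecs are assumed to be an adequate stand-in for the published mapping tables.

```python
def charset_repertoire(charset):
    """Return the set of Unicode code points reachable from a single-byte charset."""
    repertoire = set()
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this octet is unassigned in the charset
        repertoire.add(ord(ch))
    return repertoire


def missing_characters(text, printer_repertoire):
    """Return the characters of text that the printer cannot render."""
    return {ch for ch in text if ord(ch) not in printer_repertoire}


# Hypothetical printer that advertises only the ISO-8859-1 repertoire.
printer = charset_repertoire("iso-8859-1")

# Characters of ISO-8859-2 that such a printer would not have.
latin2 = charset_repertoire("iso-8859-2")
unavailable_from_latin2 = latin2 - printer

print(sorted(missing_characters("Łódź", printer)))
```

This only answers the repertoire question. Whether the printer accepts octets in the alternate character set directly, or requires the client to transcode to a Unicode encoding, still has to be discovered separately.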
In summary, before printing a job a client needs to determine this information about the printer:
It may also be that some characters are conditionally available, e.g. only when certain fonts are selected. This topic is reserved for future work, and is not considered in this Guide. In fact, one recommendation is that a printer implement a system default font that can be used to render its full character set, and that this font be used as a fall-through to handle missing characters in other fonts.
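The fall-through recommendation can be sketched as follows. This is an illustration only: the font names, the coverage table, and the `choose_font` function are hypothetical and are not part of any PWG specification.

```python
def choose_font(ch, selected_font, default_font, coverage):
    """Pick the font used to render ch.

    coverage maps a font name to the set of code points that font can render.
    The selected font is tried first, then the system default font.
    """
    cp = ord(ch)
    if cp in coverage[selected_font]:
        return selected_font
    if cp in coverage[default_font]:
        return default_font
    return None  # the character is outside the printer's repertoire


# Hypothetical coverage data: the default font covers the full repertoire.
coverage = {
    "ExampleSerif": set(range(0x20, 0x7F)),
    "SystemDefault": set(range(0x20, 0x7F)) | set(range(0xA0, 0x100)),
}

for ch in "Aé":
    print(ch, choose_font(ch, "ExampleSerif", "SystemDefault", coverage))
```

With such a default font in place, the printer's advertised repertoire no longer depends on which font a job happens to select.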
The IANA registry (long) of character sets is available at http://www.iana.org/assignments/character-sets. Every registered charset contains at least:
In some cases, an alternate "preferred MIME name" is given. In those cases that is the name we use.
Part of the PWG's mission should be to identify a short list of character sets preferred for use in printing applications. Wherever possible we use the IANA name for a character set.
[ISO-8859] defines various Latin-based alphabets (each up to 256 characters in size), while [Unicode-8859] is a set of mappings from ISO codes to Unicode code points.
Microsoft publishes a number of single- and multi-byte code pages, at http://www.microsoft.com/globaldev/reference/cphome.asp. These are all defined in terms of Unicode.
As part of their OpenType specification, Microsoft defines the WGL4.0 character set, which is expressed in terms of Unicode. It has 652 characters, containing many of the characters from the ISO Latin sets, as well as quite a few symbols.
[XHTML-Chars] defines a number of pre-defined character entities, in these groups:
For a total of 253 entries.
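A client could build the XHTML entity repertoire programmatically. The sketch below assumes that Python's `html.entities.name2codepoint` table, which covers the HTML 4.01 entities, is an acceptable starting point; XHTML additionally defines `&apos;`, which is added by hand. The variable names are illustrative.

```python
from html.entities import name2codepoint

# Start from the HTML 4.01 entity table shipped with Python.
xhtml_entities = dict(name2codepoint)

# XHTML also defines &apos; (U+0027); add it if the table lacks it.
xhtml_entities.setdefault("apos", 0x27)

# The repertoire is the set of code points, independent of entity names.
xhtml_repertoire = set(xhtml_entities.values())

print(len(xhtml_entities), "entities,", len(xhtml_repertoire), "code points")
```

The printed entity count can be compared with the total quoted above as a sanity check on the table being used.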
You can compare the ISO-8859, Microsoft, and XHTML repertoires side by side here.
These are the relevant fields in the [Unihan] database:
For Thai, use 8859-11, which is equivalent to TIS 620-2533 (1990) with the addition of 0xA0 NO-BREAK SPACE.
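The relationship between the two Thai character sets can be checked with a short Python sketch. It assumes that Python's bundled `tis_620` and `iso8859_11` codecs faithfully reflect the respective standards; `decode_or_none` is a helper introduced here for illustration.

```python
def decode_or_none(byte, charset):
    """Decode one octet in the given charset, or return None if it is unassigned."""
    try:
        return bytes([byte]).decode(charset)
    except UnicodeDecodeError:
        return None


# Collect every octet whose interpretation differs between the two charsets.
differences = {}
for byte in range(256):
    tis = decode_or_none(byte, "tis_620")
    iso = decode_or_none(byte, "iso8859_11")
    if tis != iso:
        differences[byte] = (tis, iso)

for byte, (tis, iso) in differences.items():
    print(f"0x{byte:02X}: tis_620={tis!r} iso8859_11={iso!r}")
```

If the codecs match the standards, the only octet reported should be 0xA0.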
The PWG will define a standard set of repertoire names to be used for printing capabilities. The draft version of this list is:
PWG Character Repertoire | Based on IANA Charset | Description | Reference Location |
ISO-8859-1 | ISO-8859-1 | Latin alphabet No. 1 | RFC1345 |
ISO-8859-2 | ISO-8859-2 | Latin alphabet No. 2 | RFC1345 |
ISO-8859-3 | ISO-8859-3 | Latin alphabet No. 3 | RFC1345 |
ISO-8859-4 | ISO-8859-4 | Latin alphabet No. 4 | RFC1345 |
ISO-8859-5 | ISO-8859-5 | Latin/Cyrillic alphabet | RFC1345 |
ISO-8859-6 | ISO-8859-6 | Latin/Arabic alphabet | RFC1345 |
ISO-8859-7 | ISO-8859-7 | Latin/Greek alphabet | RFC1345 |
ISO-8859-8 | ISO-8859-8 | Latin/Hebrew alphabet | RFC1345 |
ISO-8859-9 | ISO-8859-9 | Latin alphabet No. 5 | RFC1345 |
ISO-8859-10 | ISO-8859-10 | Latin alphabet No. 6 | RFC1345 |
ISO-8859-13 | ISO-8859-13 | Latin alphabet No. 7 | http://www.iana.org/assignments/charset-reg/iso-8859-13 |
ISO-8859-14 | ISO-8859-14 | Latin alphabet No. 8 | http://www.iana.org/assignments/charset-reg/iso-8859-14 |
ISO-8859-15 | ISO-8859-15 | Latin alphabet No. 9 | http://www.iana.org/assignments/charset-reg/ISO-8859-15 |
ISO-8859-16 | ISO-8859-16 | Latin alphabet No. 10 | ??? Could use http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT |
GB_2312-80 | GB_2312-80 | Chinese (People’s Republic of China) | RFC1345 |
Shift_JIS | Shift_JIS | Japanese | "Appendix 1 of JIS X0208:1997," but where is this? Unicode Unihan database has entries ("Jis1") for JIS X 0212-1990 |
KS_C_5601-1987 | KS_C_5601-1987 | Korean | RFC1345 |
Big5 | Big5 | Chinese (Taiwan) | "Chinese for Taiwan Multi-byte set. PCL Symbol Set Id: 18T", but where is this? |
TIS-620 | TIS-620 | Thai | ???. maybe http://www.nectec.or.th/it-standards/std620/std620.htm (in Thai) |
XHTML | | | http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities |
||
<to be specified> | Microsoft symbols |
The PWG will develop recommendations for built-in repertoires, based on the advertised service level of the printer. (These service levels are intended to align with those for XHTML-Print, Bluetooth, and UPnP printing.) If a client knows that a printer implements one of these service levels, it may assume the presence of the given repertoires.
A draft version of this list is:
Print Service Level | Built-in Repertoires |
Basic | ISO-8859-1 |
Enhanced | TBD |
Various protocols provide a way for a client to find out information about a printer's capabilities. These protocols should be extended to define how the client can learn what repertoires are available in a printer. Note that this query, if implemented, should always include the built-in repertoires for the service level offered by the printer.
The fundamental semantic unit for getting this capability is an attribute named "repertoires-supported" on the Printer object. The value is a comma-separated string containing the PWG names of the supported repertoires. Various protocols may map these names to other forms of representation. For example, the Bluetooth Basic Printing Profile uses bits in a bitmap, while the Printer MIB uses string names with no punctuation.
In addition, a protocol may provide a mechanism for discovering particular character sets that may be sent directly. The repertoires-supported attribute does not necessarily reflect characters available in non-Unicode character sets.
Queries associating available repertoires with fonts, charsets, PDLs, etc. are reserved for future study.
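As a minimal sketch of how a client might consume the "repertoires-supported" attribute, the Python fragment below splits the comma-separated value and merges in the built-in repertoires for a known service level. The function names are hypothetical, tolerance of whitespace around commas is an assumption, and the `BUILT_IN` table reflects only the draft list above, in which "Enhanced" is still to be determined.

```python
# Draft built-in repertoires per service level; "Enhanced" is TBD in the draft.
BUILT_IN = {"Basic": ["ISO-8859-1"]}


def parse_repertoires_supported(value):
    """Split a comma-separated repertoires-supported value into PWG names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def effective_repertoires(value, service_level=None):
    """Return the advertised repertoires plus the built-ins for the service level."""
    names = parse_repertoires_supported(value)
    for built_in in BUILT_IN.get(service_level, []):
        if built_in not in names:
            names.append(built_in)
    return names


print(effective_repertoires("ISO-8859-2, Shift_JIS", "Basic"))
```

A well-behaved printer would already list its built-in repertoires in the attribute, so the merge step is only a defensive measure on the client side.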
This section is directed at the Printer Working Group, with suggestions for standards that need to be developed.