Microsoft Typography | Developer... | OpenType specification | OpenType tables | The cmap table


cmap - Character To Glyph Index Mapping Table

This table defines the mapping of character codes to the glyph index values used in the font. It may contain more than one subtable, in order to support more than one character encoding scheme. Character codes that do not correspond to any glyph in the font should be mapped to glyph index 0. The glyph at this location must be a special glyph representing a missing character, commonly known as .notdef.

The table header indicates the character encodings for which subtables are present. Each subtable is in one of seven possible formats and begins with a format code indicating the format used.

The platform ID and platform-specific encoding ID in the header entry (and, in the case of the Macintosh platform, the language field in the subtable itself) are used to specify a particular 'cmap' encoding. The header entries must be sorted first by platform ID, then by platform-specific encoding ID, and then by the language field in the corresponding subtable. Each platform ID, platform-specific encoding ID, and subtable language combination may appear only once in the 'cmap' table.

When building a Unicode font for Windows, the platform ID should be 3 and the encoding ID should be 1. When building a symbol font for Windows, the platform ID should be 3 and the encoding ID should be 0. When building a font that will be used on the Macintosh, the platform ID should be 1 and the encoding ID should be 0.

All Microsoft Unicode encodings (Platform ID = 3, Encoding ID = 1) must provide at least a Format 4 'cmap' subtable. If the font is meant to support supplementary Unicode characters, it will additionally need a Format 12 subtable with a platform encoding ID 10. The contents of the Format 12 subtable need to be a superset of the contents of the Format 4 subtable. Microsoft strongly recommends using a Unicode 'cmap' for all fonts. However, some other encodings that appear in current fonts follow:

Microsoft Encodings
Platform ID   Encoding ID   Description
3             0             Symbol
3             1             Unicode
3             2             ShiftJIS
3             3             PRC
3             4             Big5
3             5             Wansung
3             6             Johab
3             7             Reserved
3             8             Reserved
3             9             Reserved
3             10            UCS-4

The Character To Glyph Index Mapping Table is organized as follows:

cmap Header
Type     Name        Description
USHORT   version     Table version number (0).
USHORT   numTables   Number of encoding tables that follow.

The cmap table header is followed by an array of encoding records that specify the particular encoding and the offset to the subtable for that encoding. The number of encoding records is numTables. An encoding record entry looks like:

Encoding Record
Type     Name         Description
USHORT   platformID   Platform ID.
USHORT   encodingID   Platform-specific encoding ID.
ULONG    offset       Byte offset from beginning of table to the subtable for this encoding.
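
As a rough illustration of the header and encoding record layout above, the following sketch scans the encoding records for a given platform ID and encoding ID. The names read_u16, read_u32 and cmap_find_subtable are hypothetical helpers invented for this example, not part of any specification or library. The sketch assumes the whole 'cmap' table is in memory and that multi-byte values are stored most significant byte first, as elsewhere in OpenType.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: read big-endian 16-bit and 32-bit values. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: returns the byte offset (from the start of the
 * 'cmap' table) of the subtable for the given platform/encoding pair,
 * or 0 if no such record exists or the table is truncated. 0 can serve
 * as "not found" because no subtable can start inside the header. */
static uint32_t cmap_find_subtable(const uint8_t *cmap, size_t cmap_len,
                                   uint16_t platformID, uint16_t encodingID)
{
    if (cmap_len < 4)
        return 0;
    uint16_t numTables = read_u16(cmap + 2);
    if (cmap_len < 4 + (size_t)numTables * 8)
        return 0;
    for (uint16_t i = 0; i < numTables; i++) {
        const uint8_t *rec = cmap + 4 + (size_t)i * 8;  /* 8 bytes per record */
        if (read_u16(rec) == platformID && read_u16(rec + 2) == encodingID) {
            uint32_t offset = read_u32(rec + 4);
            return (offset < cmap_len) ? offset : 0;
        }
    }
    return 0;
}
```

A caller looking for the Windows Unicode encoding would pass platform ID 3 and encoding ID 1. Because the records are sorted, a binary search would also work; a linear scan is used here only to keep the sketch short.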

OTF Windows NT compatibility mapping:

If a platform ID 4 (custom), encoding ID 0-255 (OTF Windows NT compatibility mapping) 'cmap' encoding is present in an OpenType font with CFF outlines, then the OTF font driver in Windows NT will: (a) superimpose the glyphs encoded at character codes 0-255 in the encoding on the corresponding Windows ANSI (code page 1252) Unicode values in the Unicode encoding it reports to the system; (b) add Windows ANSI (CharSet 0) to the list of CharSets supported by the font; and (c) consider the value of the encoding ID to be a Windows CharSet value and add it to the list of CharSets supported by the font. Note: This 'cmap' encoding must use a Format 0 or Format 6 subtable, and the encoding must be identical to the CFF's encoding.

This 'cmap' encoding is not required. It provides a compatibility mechanism for non-Unicode applications that use the font as if it were Windows ANSI encoded. Non-Windows ANSI Type 1 fonts, such as Cyrillic and Central European fonts, that Adobe shipped in the past had "0" (Windows ANSI) recorded in the CharSet field of the .PFM file; ATM for Windows 9x ignores the CharSet altogether. Adobe provides this compatibility 'cmap' encoding in every OTF converted from a Type 1 font in which the Encoding is not StandardEncoding.

Note on the language field in 'cmap' subtables:

This field must be set to zero for all cmap subtables whose platform IDs are other than Macintosh (platform ID 1). For cmap subtables whose platform IDs are Macintosh, set this field to the Macintosh language ID of the cmap subtable plus one, or to zero if the cmap subtable is not language-specific. For example, a Mac OS Turkish cmap subtable must set this field to 18, since the Macintosh language ID for Turkish is 17. A Mac OS Roman cmap subtable must set this field to 0, since Mac OS Roman is not a language-specific encoding.
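
The rule above is small enough to state as a one-line computation. The function name cmap_language_field and its languageSpecific flag are hypothetical, introduced only for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: value to store in a subtable's language field.
 * platformID 1 is Macintosh; every other platform uses 0.
 * languageSpecific is non-zero when the Macintosh subtable is tied to
 * one language; macLanguageID is that language's Macintosh language ID. */
static uint16_t cmap_language_field(uint16_t platformID, int languageSpecific,
                                    uint16_t macLanguageID)
{
    if (platformID != 1 || !languageSpecific)
        return 0;
    return (uint16_t)(macLanguageID + 1);
}
```

With the Turkish example from the text, cmap_language_field(1, 1, 17) gives 18.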


Format 0: Byte encoding table

This is the Apple standard character to glyph index mapping table.

Type     Name                Description
USHORT   format              Format number is set to 0.
USHORT   length              This is the length in bytes of the subtable.
USHORT   language            Please see "Note on the language field in 'cmap' subtables" in this document.
BYTE     glyphIdArray[256]   An array that maps character codes to glyph index values.

This is a simple 1 to 1 mapping of character codes to glyph indices. The glyph set is limited to 256. Note that if this format is used to index into a larger glyph set, only the first 256 glyphs will be accessible.
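
A Format 0 lookup is a direct array index. The sketch below assumes subtable points at the first byte of a complete Format 0 subtable; the function name cmap0_lookup is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: map a character code through a Format 0 subtable.
 * Layout: format (2 bytes), length (2), language (2), glyphIdArray[256]. */
static uint16_t cmap0_lookup(const uint8_t *subtable, uint32_t charCode)
{
    if (charCode > 255)
        return 0;                      /* not representable: missing glyph */
    return subtable[6 + charCode];     /* glyphIdArray starts at byte 6 */
}
```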


Format 2: High-byte mapping through table

This subtable is useful for the national character code standards used for Japanese, Chinese, and Korean characters. These code standards use a mixed 8/16-bit encoding, in which certain byte values signal the first byte of a 2-byte character (but these values are also legal as the second byte of a 2-byte character).

In addition, even for the 2-byte characters, the mapping of character codes to glyph index values depends heavily on the first byte. Consequently, the table begins with an array that maps the first byte to a 4-word subHeader. For 2-byte character codes, the subHeader is used to map the second byte's value through a subArray, as described below. When processing mixed 8/16-bit text, subHeader 0 is special: it is used for single-byte character codes. When subHeader zero is used, a second byte is not needed; the single byte value is mapped through the subArray.

Type             Name                 Description
USHORT           format               Format number is set to 2.
USHORT           length               This is the length in bytes of the subtable.
USHORT           language             Please see "Note on the language field in 'cmap' subtables" in this document.
USHORT           subHeaderKeys[256]   Array that maps high bytes to subHeaders: value is subHeader index * 8.
4 words struct   subHeaders[ ]        Variable-length array of subHeader structures.
USHORT           glyphIndexArray[ ]   Variable-length array containing subarrays used for mapping the low byte of 2-byte characters.

A subHeader is structured as follows:

Type     Name            Description
USHORT   firstCode       First valid low byte for this subHeader.
USHORT   entryCount      Number of valid low bytes for this subHeader.
SHORT    idDelta         See text below.
USHORT   idRangeOffset   See text below.

The firstCode and entryCount values specify a subrange that begins at firstCode and has a length equal to the value of entryCount. This subrange stays within the 0-255 range of the byte being mapped. Bytes outside of this subrange are mapped to glyph index 0 (missing glyph). The offset of the byte within this subrange is then used as an index into a corresponding subarray of glyphIndexArray. This subarray is also of length entryCount. The value of the idRangeOffset is the number of bytes past the actual location of the idRangeOffset word where the glyphIndexArray element corresponding to firstCode appears.

Finally, if the value obtained from the subarray is not 0 (which indicates the missing glyph), you should add idDelta to it in order to get the glyphIndex. The value idDelta permits the same subarray to be used for several different subheaders. The idDelta arithmetic is modulo 65536.
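
The Format 2 procedure described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the names read_u16 and cmap2_lookup are hypothetical, the subtable is assumed to be well formed (no bounds checks are made against its length field), and idDelta is read as an unsigned value so that ordinary 16-bit wraparound gives the modulo 65536 arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a big-endian 16-bit value. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: map the character at the start of text (n bytes
 * available) through a Format 2 subtable. Stores the number of bytes
 * consumed (1 or 2, or 0 if n is 0) in *used and returns the glyph index. */
static uint16_t cmap2_lookup(const uint8_t *st, const uint8_t *text,
                             size_t n, size_t *used)
{
    if (n == 0) { *used = 0; return 0; }

    uint8_t first = text[0];
    uint16_t key = read_u16(st + 6 + 2 * first);   /* subHeader index * 8 */
    const uint8_t *sh = st + 6 + 512 + key;        /* subHeaders follow the keys */
    uint8_t code;

    if (key == 0) {                 /* subHeader 0: single-byte character */
        code = first;
        *used = 1;
    } else {                        /* otherwise map the second byte */
        if (n < 2) { *used = 1; return 0; }
        code = text[1];
        *used = 2;
    }

    uint16_t firstCode     = read_u16(sh);
    uint16_t entryCount    = read_u16(sh + 2);
    uint16_t idDelta       = read_u16(sh + 4);
    uint16_t idRangeOffset = read_u16(sh + 6);

    if (code < firstCode || code >= firstCode + entryCount)
        return 0;                   /* outside the subrange: missing glyph */

    /* idRangeOffset counts bytes from the idRangeOffset word itself. */
    const uint8_t *entry = sh + 6 + idRangeOffset + 2 * (code - firstCode);
    uint16_t glyph = read_u16(entry);
    if (glyph == 0)
        return 0;
    return (uint16_t)(glyph + idDelta);            /* modulo 65536 */
}
```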


Format 4: Segment mapping to delta values

This is the Microsoft standard character to glyph index mapping table for fonts that support Unicode ranges other than the range [U+D800 - U+DFFF] (defined as the Surrogates Area in Unicode v 3.0), which is used for UCS-4 characters. If a font supports this character range (and therefore supports UCS-4 characters), a subtable in this format with platform-specific encoding ID 1 is still needed, in addition to a subtable in format 12 with platform-specific encoding ID 10. See the details on format 12 below for fonts that support UCS-4 characters on Windows.

This format is used when the character codes for the characters represented by a font fall into several contiguous ranges, possibly with holes in some or all of the ranges (that is, some of the codes in a range may not have a representation in the font). The format-dependent data is divided into three parts, which must occur in the following order:

  1. A four-word header gives parameters for an optimized search of the segment list;
  2. Four parallel arrays describe the segments (one segment for each contiguous range of codes);
  3. A variable-length array of glyph IDs (unsigned words).

Type     Name                      Description
USHORT   format                    Format number is set to 4.
USHORT   length                    This is the length in bytes of the subtable.
USHORT   language                  Please see "Note on the language field in 'cmap' subtables" in this document.
USHORT   segCountX2                2 x segCount.
USHORT   searchRange               2 x (2**floor(log2(segCount)))
USHORT   entrySelector             log2(searchRange/2)
USHORT   rangeShift                2 x segCount - searchRange
USHORT   endCount[segCount]        End characterCode for each segment, last=0xFFFF.
USHORT   reservedPad               Set to 0.
USHORT   startCount[segCount]      Start character code for each segment.
SHORT    idDelta[segCount]         Delta for all character codes in segment.
USHORT   idRangeOffset[segCount]   Offsets into glyphIdArray or 0
USHORT   glyphIdArray[ ]           Glyph index array (arbitrary length)

The number of segments is specified by segCount, which is not explicitly in the header; however, all of the header parameters are derived from it. The searchRange value is twice the largest power of 2 that is less than or equal to segCount. For example, if segCount=39, we have the following:

segCountX2      78
searchRange     64   (2 * largest power of 2 <= 39)
entrySelector   5    log2(32)
rangeShift      14   2 x 39 - 64
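
The derivation of the four header parameters from segCount can be sketched as below. The struct and function names are hypothetical, and the sketch assumes 1 <= segCount <= 32767 so that segCountX2 fits in a USHORT.

```c
#include <stdint.h>

/* Hypothetical container for the four derived Format 4 header fields. */
typedef struct {
    uint16_t segCountX2;
    uint16_t searchRange;
    uint16_t entrySelector;
    uint16_t rangeShift;
} Cmap4SearchParams;

/* Hypothetical helper: compute the search parameters from segCount. */
static Cmap4SearchParams cmap4_search_params(uint16_t segCount)
{
    Cmap4SearchParams p = {0, 0, 0, 0};
    uint16_t pow2 = 1;      /* largest power of 2 <= segCount */
    uint16_t log2v = 0;     /* floor(log2(segCount)) */

    while ((uint32_t)pow2 * 2 <= segCount) {
        pow2 = (uint16_t)(pow2 * 2);
        log2v++;
    }
    p.segCountX2    = (uint16_t)(segCount * 2);
    p.searchRange   = (uint16_t)(pow2 * 2);
    p.entrySelector = log2v;               /* equals log2(searchRange/2) */
    p.rangeShift    = (uint16_t)(p.segCountX2 - p.searchRange);
    return p;
}
```

For segCount = 39 this yields 78, 64, 5 and 14, matching the values listed above.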

Each segment is described by a startCode and endCode (the corresponding entries of the startCount and endCount arrays in the table above), along with an idDelta and an idRangeOffset, which are used for mapping the character codes in the segment. The segments are sorted in order of increasing endCode values, and the segment values are specified in four parallel arrays. You search for the first endCode that is greater than or equal to the character code you want to map. If the corresponding startCode is less than or equal to the character code, then you use the corresponding idDelta and idRangeOffset to map the character code to a glyph index (otherwise, the missingGlyph is returned). For the search to terminate, the final endCode value must be 0xFFFF. This segment need not contain any valid mappings. (It can just map the single character code 0xFFFF to missingGlyph). However, the segment must be present.

If the idRangeOffset value for the segment is not 0, the mapping of character codes relies on glyphIdArray. The character code offset from startCode is added to the idRangeOffset value. This sum is used as an offset from the current location within idRangeOffset itself to index out the correct glyphIdArray value. This obscure indexing trick works because glyphIdArray immediately follows idRangeOffset in the font file. The C expression that yields the glyph index is:

*(idRangeOffset[i]/2 + (c - startCount[i]) + &idRangeOffset[i])

The value c is the character code in question, and i is the segment index in which c appears. If the value obtained from the indexing operation is not 0 (which indicates missingGlyph), idDelta[i] is added to it to get the glyph index. The idDelta arithmetic is modulo 65536.

If the idRangeOffset is 0, the idDelta value is added directly to the character code (i.e. idDelta[i] + c) to get the corresponding glyph index. Again, the idDelta arithmetic is modulo 65536.
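
Putting the segment search and both mapping cases together, a Format 4 lookup might be sketched as follows. The names read_u16 and cmap4_lookup are hypothetical. The sketch assumes a well-formed subtable and makes no bounds checks. It uses a plain linear scan instead of the binary search that searchRange, entrySelector and rangeShift are meant to support. idDelta is read as an unsigned value so that 16-bit wraparound provides the modulo 65536 arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: read a big-endian 16-bit value. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: map character code c through a Format 4 subtable. */
static uint16_t cmap4_lookup(const uint8_t *st, uint32_t c)
{
    if (c > 0xFFFF)
        return 0;                              /* Format 4 covers 16-bit codes only */

    uint16_t segCount = read_u16(st + 6) / 2;  /* from segCountX2 */
    const uint8_t *endCount      = st + 14;
    const uint8_t *startCount    = endCount + 2 * segCount + 2;  /* skip reservedPad */
    const uint8_t *idDelta       = startCount + 2 * segCount;
    const uint8_t *idRangeOffset = idDelta + 2 * segCount;

    for (uint16_t i = 0; i < segCount; i++) {
        if (read_u16(endCount + 2 * i) < c)
            continue;                          /* not yet the first endCode >= c */

        uint16_t start = read_u16(startCount + 2 * i);
        if (start > c)
            return 0;                          /* c falls in a gap: missing glyph */

        uint16_t delta = read_u16(idDelta + 2 * i);
        uint16_t rangeOffset = read_u16(idRangeOffset + 2 * i);
        if (rangeOffset == 0)
            return (uint16_t)(c + delta);      /* modulo 65536 */

        /* Byte-wise form of the pointer expression in the text: start at
         * idRangeOffset[i], move rangeOffset bytes, then (c - start) words. */
        const uint8_t *slot = idRangeOffset + 2 * i + rangeOffset + 2 * (c - start);
        uint16_t glyph = read_u16(slot);
        return glyph ? (uint16_t)(glyph + delta) : 0;
    }
    return 0;
}
```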

As an example, the variant part of the table to map characters 10-20, 30-90, and 153-480 onto a contiguous range of glyph indices may look like this:

segCountX2:      8
searchRange:     8
entrySelector:   2
rangeShift:      0
endCode:         20    90    480   0xFFFF
reservedPad:     0
startCode:       10    30    153   0xFFFF
idDelta:         -9    -18   -80   1
idRangeOffset:   0     0     0     0

This table performs the following mappings:

10 -> 10 - 9 = 1
20 -> 20 - 9 = 11
30 -> 30 - 18 = 12
90 -> 90 - 18 = 72
153 -> 153 - 80 = 73
480 -> 480 - 80 = 400
0xFFFF -> (0xFFFF + 1) modulo 65536 = 0

Note that the delta values could be reworked so as to reorder the segments.


Format 6: Trimmed table mapping

Type     Name                       Description
USHORT   format                     Format number is set to 6.
USHORT   length                     This is the length in bytes of the subtable.
USHORT   language                   Please see "Note on the language field in 'cmap' subtables" in this document.
USHORT   firstCode                  First character code of subrange.
USHORT   entryCount                 Number of character codes in subrange.
USHORT   glyphIdArray[entryCount]   Array of glyph index values for character codes in the range.

The firstCode and entryCount values specify a subrange (beginning at firstCode, length = entryCount) within the range of possible character codes. Codes outside of this subrange are mapped to glyph index 0. The offset of the code (from the first code) within this subrange is used as an index into the glyphIdArray, which provides the glyph index value.
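
A Format 6 lookup reduces to a range check followed by an array index. In this sketch the names read_u16 and cmap6_lookup are hypothetical, and the subtable is assumed to be well formed.

```c
#include <stdint.h>

/* Hypothetical helper: read a big-endian 16-bit value. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: map character code c through a Format 6 subtable.
 * Layout: format, length, language, firstCode, entryCount (2 bytes each),
 * then glyphIdArray[entryCount] starting at byte 10. */
static uint16_t cmap6_lookup(const uint8_t *st, uint32_t c)
{
    uint16_t firstCode  = read_u16(st + 6);
    uint16_t entryCount = read_u16(st + 8);

    if (c < firstCode || c - firstCode >= entryCount)
        return 0;                              /* outside the subrange */
    return read_u16(st + 10 + 2 * (c - firstCode));
}
```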



Supporting 4-byte character codes

While the four existing 'cmap' subtable formats have served us well, the introduction of the Surrogates Area in Unicode 2.0 has stressed them past the point of utility. This section specifies three formats (8, 10, and 12) that directly support 4-byte character codes. A major change introduced with these three formats is a more purely 32-bit orientation. The 'cmap' table version number continues to be 0x0000 for fonts that make use of these formats.


Format 8: mixed 16-bit and 32-bit coverage

Format 8 is a bit like format 2, in that it provides for mixed-length character codes. If a font contains characters from the Unicode Surrogates Area (U+D800-U+DFFF), which are UCS-4 characters, it is likely to include regular 16-bit Unicode characters as well. We therefore need a format to map a mixture of 16-bit and 32-bit character codes, just as format 2 allows a mixture of 8-bit and 16-bit codes. A simplifying assumption is made: namely, that no 32-bit character code shares its first 16 bits with any 16-bit character code. This means that whether a particular 16-bit value is a standalone character code or the start of a 32-bit character code can be determined by looking at the 16-bit value directly, with no further information required.

Here's the format 8 subtable format:

Type     Name         Description
USHORT   format       Subtable format; set to 8.
USHORT   reserved     Reserved; set to 0
ULONG    length       Byte length of this subtable (including the header)
ULONG    language     Please see "Note on the language field in 'cmap' subtables" in this document.
BYTE     is32[8192]   Tightly packed array of bits (8K bytes total) indicating whether the particular 16-bit (index) value is the start of a 32-bit character code
ULONG    nGroups      Number of groupings which follow

Here follow the individual groups. Each group has the following format:

Type    Name            Description
ULONG   startCharCode   First character code in this group; note that if this group is for one or more 16-bit character codes (which is determined from the is32 array), this 32-bit value will have the high 16-bits set to zero
ULONG   endCharCode     Last character code in this group; same condition as listed above for the startCharCode
ULONG   startGlyphID    Glyph index corresponding to the starting character code

A few notes here. The endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode be there explicitly saves the necessity of an addition per group. Groups must be sorted by increasing startCharCode. A group's endCharCode must be less than the startCharCode of the following group, if any.

To determine if a particular word (cp) is the first half of a 32-bit code point, one can use an expression such as ( is32[ cp / 8 ] & ( 1 << ( 7 - ( cp % 8 ) ) ) ). If this is non-zero, then the word is the first half of a 32-bit code point.
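
Wrapped in a function, the bit test above looks like this. The function name cmap8_is_start_of_32bit is hypothetical; is32 is assumed to point at the 8192-byte packed array from the subtable.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the 16-bit word cp is the first half
 * of a 32-bit code point according to the packed is32 bit array, else 0.
 * Bit 7 (the most significant bit) of each byte holds the lowest cp. */
static int cmap8_is_start_of_32bit(const uint8_t *is32, uint16_t cp)
{
    return (is32[cp / 8] & (1 << (7 - (cp % 8)))) != 0;
}
```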

0 is not a special value for the high word of a 32-bit code point. A font may not have both a glyph for the code point 0x0000 and glyphs for code points with a high word of 0x0000.

The presence of the packed array of bits indicating whether a particular 16-bit value is the start of a 32-bit character code is useful even when the font contains no glyphs for a particular 16-bit start value. This is because the system software often needs to know how many bytes ahead the next character begins, even if the current character maps to the missing glyph. By including this information explicitly in this table, no "secret" knowledge needs to be encoded into the OS.

Although this format might work advantageously on some platforms for non-Unicode encodings, Microsoft does not support it for Unicode encoded UCS-4 characters.


Format 10: Trimmed array

Format 10 is a bit like format 6, in that it defines a trimmed array for a tight range of 32-bit character codes:

Type     Name            Description
USHORT   format          Subtable format; set to 10.
USHORT   reserved        Reserved; set to 0
ULONG    length          Byte length of this subtable (including the header)
ULONG    language        Please see "Note on the language field in 'cmap' subtables" in this document.
ULONG    startCharCode   First character code covered
ULONG    numChars        Number of character codes covered
USHORT   glyphs[]        Array of glyph indices for the character codes covered

This format is not supported by Microsoft.


Format 12: Segmented coverage

This is the Microsoft standard character to glyph index mapping table for fonts supporting the UCS-4 characters in the Unicode Surrogates Area (U+D800 - U+DFFF). It is a bit like format 4, in that it defines segments for sparse representation in 4-byte character space. Here's the subtable format:

Type     Name       Description
USHORT   format     Subtable format; set to 12.
USHORT   reserved   Reserved; set to 0
ULONG    length     Byte length of this subtable (including the header)
ULONG    language   Please see "Note on the language field in 'cmap' subtables" in this document.
ULONG    nGroups    Number of groupings which follow

Fonts providing Unicode-encoded UCS-4 character support for Windows 2000 and later need a format 4 subtable with platform ID 3, platform-specific encoding ID 1 and, in addition, a format 12 subtable with platform ID 3, platform-specific encoding ID 10. The content of the format 12 subtable must be a superset of the content of the format 4 subtable. The format 4 subtable must be present in the cmap table for backward compatibility.

Here follow the individual groups, each of which has the following format:

Type    Name            Description
ULONG   startCharCode   First character code in this group
ULONG   endCharCode     Last character code in this group
ULONG   startGlyphID    Glyph index corresponding to the starting character code

Groups must be sorted by increasing startCharCode. A group's endCharCode must be less than the startCharCode of the following group, if any. The endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode be there explicitly saves the necessity of an addition per group.
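
Because the groups are sorted and do not overlap, a Format 12 lookup can use a binary search over them. The sketch below is one way to do that, with hypothetical names read_u32 and cmap12_lookup, and it assumes a well-formed subtable with no bounds checks. It also assumes that character codes after startCharCode within a group map to consecutive glyph indices after startGlyphID.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a big-endian 32-bit value. */
static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: map character code c through a Format 12 subtable.
 * Layout: format (2), reserved (2), length (4), language (4), nGroups (4),
 * then nGroups records of 12 bytes each starting at byte 16. */
static uint32_t cmap12_lookup(const uint8_t *st, uint32_t c)
{
    uint32_t lo = 0;
    uint32_t hi = read_u32(st + 12);           /* nGroups */

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        const uint8_t *group = st + 16 + (size_t)mid * 12;
        uint32_t startCharCode = read_u32(group);
        uint32_t endCharCode   = read_u32(group + 4);

        if (c < startCharCode)
            hi = mid;
        else if (c > endCharCode)
            lo = mid + 1;
        else
            return read_u32(group + 8) + (c - startCharCode);
    }
    return 0;                                  /* no group covers c */
}
```

Comparing against endCharCode directly, with no addition per group, is the saving that the text attributes to storing an end code instead of a count.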



this page was last updated 21 March 2002
© 2001 Microsoft Corporation. All rights reserved. Terms of use.