Tech Stuff - Character Sets

This page summarises what, at face value, seems a remarkably simple concept - character representation. Turns out it's more like a nightmare. The column marked Relationship tries to define the relationships between the various standards.

Name	Standard	Aliases	Description	Relationship
ASCII	ANSI X3.4-1986 ISO 646 ITU-T T.50	US-ASCII IA5 IRA5 ISO 646	ASCII is normally stored in an 8 bit field but only uses the 7 bit values 00 to 7F (0 to 127 decimal). What is generically called ASCII is normally US-ASCII, but various national definitions exist which typically have only two printable differences.	ASCII is the same as International Reference Alphabet No. 5 (IRA5), previously called International Alphabet No. 5 (IA5) and defined in ITU-T T.50, and the same as ISO 646. It has the same character values as the first 128 entries in ISO 8859-1 (Latin-1), ISO 8859-15 (Latin-9) and CP1252. The first 128 characters in Unicode and ISO 10646 (UCS) are the same but the character encoding is different.
IA5	ITU-T T.50	IRA5 ASCII ISO 646	International Alphabet No. 5 (ISO 646) now renamed International Reference Alphabet No. 5 (IRA5).
IRA5	ITU-T T.50	IA5 ISO 646 ASCII	International Reference Alphabet No. 5 (IRA5) (was International Alphabet No. 5 - IA5) and is the ITU equivalent of ASCII and ISO 646. IRA5 is encoded as an 8 bit field but only uses the 7 bits 00 to 7F (0 to 127 decimal).	IRA5 is almost the same as ISO 646 and ASCII (typically two - national/international variant - differences). The character values are the same as the first 128 entries in ISO 8859-1 (Latin-1), ISO 8859-15 (Latin-9) and CP1252. The first 128 characters in Unicode and ISO 10646 (UCS) are the same but the character encoding is different.
ISO 646	ISO 646	IA5 IRA5 ASCII	ISO 646 is encoded as an 8 bit field but only uses the 7 bits 00 to 7F (0 to 127 decimal).	ISO 646 is the same as IRA5 (IA5) and ASCII. The character values are the same as the first 128 entries in ISO 8859-1 (Latin-1), ISO 8859-15 (Latin-9) and CP1252. The first 128 characters in Unicode and ISO 10646 (UCS) are the same but the character encoding is different.
ISO 8859-1	ISO 8859-1	Latin-1	ISO 8859-1 is part of a large family (ISO 8859-1 to 8859-16) and is encoded as an 8 bit field which uses all 8 bits 00 to FF (0 to 255 decimal).	The first 128 character values are the same as IRA5, ISO 646, ASCII, ISO 8859-15 (Latin-9) and CP1252. The first 128 characters in Unicode and ISO 10646 (UCS) are the same but the character encoding is different.
ISO 8859-15	ISO 8859-15	Latin-9	ISO 8859-15 is part of a large family (ISO 8859-1 to 8859-16) and is encoded as an 8 bit field which uses all 8 bits 00 to FF (0 to 255 decimal). It differs from 8859-1 by 8 changes including the euro symbol.	The first 128 character values are the same as IRA5, ISO 646, ASCII, ISO 8859-1 (Latin-1) and CP1252. The first 128 characters in Unicode and ISO 10646 (UCS) are the same but the character encoding is different.
ISO 10646	ISO 10646	UCS	ISO 10646 (Universal Character Set) is designed to be the replacement for all previous character sets by providing a single family of standards for the encoding of all possible characters and symbols in all written languages. It has two implementations UCS-2 (a 16 bit encoding) and UCS-4 (a 32 bit encoding).	The first 128 characters (but not the encoding) in ISO 10646 are the same as ASCII, IA5, IRA5 and ISO 646, 8859-1 and 8859-15. Unicode from version 1.1 is the same as ISO 10646.
Unicode	Unicode Consortium	-	Unicode (version 3.0 at the time of writing).	From version 1.1 it is fully compatible with ISO 10646.
CP1252	Microsoft	code page 1252	Microsoft's version of ISO 8859-1. There are 27 differences from 8859-1 (it includes the euro) - all in range x80 - x9F. 8 bit encoding.	The first 128 character values are the same as IRA5, ISO 646, ASCII, ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9). The first 128 characters in Unicode and ISO 10646 (UCS) are the same but the character encoding is different.
Transformations
These formats define how the underlying code set of Unicode/ISO 10646 is sent over the wire. They are not character sets.
UTF-7	RFC 2152	-	UCS Transformation Format-7. Defines how ISO 10646 (UCS) is transformed for non-MIME email data communications. May use from 1 to 9 octets for a single ISO 10646/Unicode character.
UTF-8	RFC 3629	UTF-2 FSS-UTF	UCS Transformation Format-8. Defines how ISO 10646 (UCS) is transformed for MIME enabled data communications. Uses from 1 to 4 octets for a single ISO 10646/Unicode character (RFC 3629 limits the range to U+10FFFF; the earlier RFC 2279 definition allowed up to 6 octets).
UTF-16	RFC 2781	-	UCS Transformation Format-16. Defines how ISO 10646 (UCS) is transformed for data communications. Uses one or two 16 bit units (2 or 4 octets) for a single ISO 10646/Unicode character: characters up to FFFF are encoded as in UCS-2 and characters above FFFF are encoded as a pair of 16 bit surrogate values.
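
As a rough illustration of the transformation formats listed above, the following Python sketch encodes a few sample characters and reports how many octets each format produces. The sample characters and the helper name octets are choices made for this example only, and the counts reflect the behaviour of Python's built-in codecs rather than anything stated in the RFCs themselves.

```python
# Sample characters: one ASCII, one Latin-1, one elsewhere in the
# 16 bit range (the euro), and one above FFFF (needs a surrogate pair
# in UTF-16).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def octets(ch: str, encoding: str) -> int:
    """Return the number of octets used to encode a single character."""
    return len(ch.encode(encoding))


if __name__ == "__main__":
    for ch in SAMPLES:
        print(
            f"U+{ord(ch):04X}",
            "utf-7:", octets(ch, "utf-7"),
            "utf-8:", octets(ch, "utf-8"),
            # The -be variant is used so no byte order mark is counted.
            "utf-16:", octets(ch, "utf-16-be"),
        )
```

For the characters 00 to 7F the UTF-8 octets are identical to the ASCII octets, whereas UTF-16 always uses at least 2 octets per character.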

ISO 8859 Family

ISO 8859-1   Latin alphabet No. 1     West European
ISO 8859-2   Latin alphabet No. 2     Central and East European
ISO 8859-3   Latin alphabet No. 3     South European, Maltese & Esperanto
ISO 8859-4   Latin alphabet No. 4     North European
ISO 8859-5   Latin/Cyrillic alphabet  Slavic languages
ISO 8859-6   Latin/Arabic alphabet    Arabic
ISO 8859-7   Latin/Greek alphabet     modern Greek
ISO 8859-8   Latin/Hebrew alphabet    Hebrew and Yiddish
ISO 8859-9   Latin alphabet No. 5     Turkish
ISO 8859-10  Latin alphabet No. 6     Nordic (Sámi, Inuit, Icelandic)
ISO 8859-11  Latin/Thai alphabet      Thai
ISO 8859-12  (has not been defined)
ISO 8859-13  Latin alphabet No. 7     Baltic Rim
ISO 8859-14  Latin alphabet No. 8     Celtic
ISO 8859-15  Latin alphabet No. 9     adds euro to -1 (8 changes)
ISO 8859-16  Latin alphabet No. 10    South-Eastern Europe
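
The relationships described on this page (a shared first 128 values, 8 changes between ISO 8859-1 and ISO 8859-15, and CP1252 differing from ISO 8859-1 only in the range 80 to 9F) can be checked with a short Python sketch. It relies on Python's built-in codecs as a stand-in for the standards themselves, and the helper name table is made up for this example.

```python
def table(encoding: str) -> list:
    """Map each octet value 00-FF to its character, or None if undefined."""
    out = []
    for value in range(256):
        try:
            out.append(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            out.append(None)  # the codec leaves this value unassigned
    return out


ascii_ = table("ascii")
latin1 = table("iso-8859-1")
latin9 = table("iso-8859-15")
cp1252 = table("cp1252")

# True when the first 128 values agree across all four tables.
same_low_half = (
    ascii_[:128] == latin1[:128] == latin9[:128] == cp1252[:128]
)

# Octet values where Latin-9 assigns a different character to Latin-1.
diff_latin9 = [v for v in range(256) if latin1[v] != latin9[v]]

# Octet values where CP1252 assigns a (defined) character that differs
# from Latin-1; values Python's codec leaves undefined are skipped.
diff_cp1252 = [
    v for v in range(256)
    if cp1252[v] is not None and cp1252[v] != latin1[v]
]

if __name__ == "__main__":
    print("first 128 identical:", same_low_half)
    print("Latin-9 changes:", [f"{v:02X}" for v in diff_latin9])
    print("CP1252 changes:", len(diff_cp1252))
    print("euro in Latin-9 at:", f"{latin9.index(chr(0x20AC)):02X}")
    print("euro in CP1252 at:", f"{cp1252.index(chr(0x20AC)):02X}")
```

Note that the euro symbol sits at a different octet value in ISO 8859-15 and in CP1252, so text containing it cannot be exchanged between the two without conversion.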
