UCS-2 Encoding Form


Unicode characters are normally encoded as unsigned 16-bit integers. An appropriate type definition in C/C++ is:

typedef unsigned short UCS2;

The UCS-2 encoding form is the appropriate form to use for internal processing. Depending on your interchange requirements, the UCS-2 form may also be appropriate for interchange. In the absence of other information, the UCS-2 form should be assumed to apply.

Wide Character versus Multibyte Character Form

The UCS-2 form of encoding Unicode characters should not be confused with current practice associated with double-byte character set (DBCS) systems. One distinction has to do with machine-dependent byte ordering. For example, whether the following assertion holds depends upon the byte ordering of the machine on which it runs:

unsigned char   dbstr[] = { 0x80, 0x81, 0x00 };
UCS2            ucstr[] = { 0x8081, 0x0000 };
assert ( ((UCS2 *)dbstr)[0] == ucstr[0] );
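
As a concrete illustration, a minimal self-contained version of this test might look as follows. Whether the assertion holds depends entirely on the byte order of the machine on which it runs; like the fragment above, this sketch assumes the character array is suitably aligned for a 16-bit access.

#include <stdio.h>

typedef unsigned short UCS2;

int main ( void )
{
  // dbstr holds the two bytes 0x80 0x81; ucstr holds the single
  // 16-bit value 0x8081.  On a big-endian machine the two compare
  // equal; on a little-endian machine the bytes read back as 0x8180.
  unsigned char   dbstr[] = { 0x80, 0x81, 0x00 };
  UCS2            ucstr[] = { 0x8081, 0x0000 };

  if ( ((UCS2 *) dbstr)[0] == ucstr[0] )
    printf ( "assertion holds (big-endian byte order)\n" );
  else
    printf ( "assertion fails (little-endian byte order)\n" );
  return 0;
}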

In general, double byte enabled systems make certain assumptions that do not hold in a UCS-2 Unicode encoding system. For example, the following code will work for a typical DBCS system but not for a UCS-2 Unicode system.

//
// Count number of characters (not bytes) in
// DBCS string S which is of length SLEN (in bytes,
// excluding terminator).
//
int CountChars ( const unsigned char *s, int slen )
{
  unsigned char ch;
  int n = 0;
  if ( ! s ) return n;
  while ( ( ch = *s ) != 0 ) {
    if ( ! ( ch & 0x80 ) )
      s += 1;   // single byte char
    else
      s += 2;   // double byte char
    n++;
  }
  return n;
}

In a UCS-2 Unicode system, one cannot legally interpret individual bytes that constitute only a portion of a Unicode character; rather, the entire 16-bit integral value must be tested. In the above case, the end of a Unicode string would be signalled with a 16-bit NULL, i.e., 0x0000; however, a single 8-bit NULL, 0x00, may appear in either the lower or upper 8 bits of a single UCS-2 Unicode character code. Consequently, the following code may return 0 or 1, depending on whether the machine is big-endian or little-endian, respectively. In neither case is the correct answer (2) returned.

const UCS2 ucstr[] =
{
  (UCS2) 'a',
  (UCS2) 'b',
  (UCS2) '\0'
};
int n = CountChars ( (const unsigned char *) ucstr, 2 * sizeof (UCS2) );

Implementing the above function for a UCS-2 Unicode string is much simpler than in the DBCS case:

// Count number of characters (not bytes) in UCS-2 string S,
// whose length SLEN is given in bytes (excluding terminator).
int CountChars ( const UCS2 * s, int slen )
{
  return s ? ( slen / 2 ) : 0;
}
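
Applied to the ucstr example shown earlier (with the byte length supplied via sizeof), this version returns the correct count:

int n = CountChars ( ucstr, 2 * sizeof (UCS2) );   // n == 2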

Also notice that the size parameter, which specifies the size of the string in bytes, is not useful in the DBCS case. For a DBCS system, there may be a mixture of 1-byte and 2-byte characters; thus the entire string must be scanned. In the Unicode case, the number of character codes is determined immediately from the size, since each code element is two bytes in length. The consequences of this may be considerable: in the DBCS case one has an O(N) algorithm for determining the number of characters in a string, whereas in the Unicode case one has an O(1) algorithm.
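
If the byte length of a string is not known in advance, the character count can still be obtained by scanning for the 16-bit terminator. The scan is of course O(N), like strlen, but unlike the DBCS case it examines each code element as a whole 16-bit value and needs no lead-byte test. A minimal sketch (the function name is illustrative only):

//
// Count number of characters in a 0x0000-terminated UCS-2
// string whose byte length is not known in advance.
//
int CountCharsZ ( const UCS2 * s )
{
  int n = 0;
  if ( ! s ) return n;
  while ( *s++ != 0 )   // test the full 16-bit code element
    n++;
  return n;
}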


Copyright © 1994 Unicode, Inc.