Unicode characters are normally encoded using 16-bit unsigned integral values. An appropriate type definition in the C/C++ language is:
typedef unsigned short UCS2;
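Since C and C++ guarantee only that unsigned short is at least 16 bits wide, a compile-time check can flag platforms on which this type would not be exactly 16 bits. The following is a minimal sketch, assuming <limits.h> is available; the check itself is an illustration and not part of the definition above:

#include <limits.h>

// Fail the build if unsigned short is wider than 16 bits on this platform.
#if USHRT_MAX != 0xFFFF
#error "UCS2 requires a 16-bit unsigned short on this platform"
#endif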
The UCS-2 encoding form is the appropriate form to use for internal processing. Depending on your interchange requirements, the UCS-2 form may also be appropriate for interchange. In the absence of other information, the UCS-2 form should be assumed to apply.
The UCS-2 form of encoding Unicode characters should not be confused with current practice associated with double-byte character set (DBCS) systems. The distinction has to do with machine-dependent byte ordering. For example, the following assertion may or may not hold, depending upon the byte ordering of the machine on which it runs:
unsigned char dbstr[] = { 0x80, 0x81, 0x00 };
UCS2          ucstr[] = { 0x8081, 0x0000 };

assert ( ((UCS2 *) dbstr)[0] == ucstr[0] );  // holds on big-endian machines only
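For illustration only, the dependency can be made explicit with a small runtime probe of the machine's byte order. This sketch (the helper name IsBigEndian is ours, not part of any standard API) reports whether the assertion above would hold:

#include <stdio.h>

typedef unsigned short UCS2;

// Returns 1 on big-endian machines, 0 on little-endian machines.
static int IsBigEndian ( void )
{
    UCS2 probe = 0x8081;
    return ((const unsigned char *) &probe)[0] == 0x80;
}

int main ( void )
{
    printf ( "The assertion %s hold on this machine.\n",
             IsBigEndian () ? "would" : "would not" );
    return 0;
}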
In general, double-byte-enabled systems make certain assumptions that do not hold in a UCS-2 Unicode encoding system. For example, the following code will work for a typical DBCS system but not for a UCS-2 Unicode system.
//
// Count number of characters (not bytes) in
// DBCS string S which is of length SLEN (in bytes,
// excluding terminator).
//
int CountChars ( const unsigned char *s, int slen )
{
    unsigned char ch;
    int n = 0;

    if ( ! s )
        return n;

    while ( ( ch = *s ) != 0 ) {
        if ( ! ( ch & 0x80 ) )
            s += 1;   // single byte char
        else
            s += 2;   // double byte char
        n++;
    }

    return n;
}
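As a usage sketch, the byte values below are merely illustrative of a DBCS string containing one double-byte character followed by one single-byte character:

const unsigned char dbstr[] = { 0x80, 0x81, 'a', 0x00 };

int n = CountChars ( dbstr, 3 );  // yields 2: one double-byte char, one single-byte char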
In a UCS-2 Unicode system, one cannot legally interpret individual bytes that constitute only a portion of a Unicode character; rather, the entire 16-bit integral value must be tested. In the case above, the end of a Unicode string would be signalled with a 16-bit NULL, i.e., 0x0000; however, a single 8-bit NULL, 0x00, may appear in either the lower or upper 8 bits of a single UCS-2 Unicode character code. Consequently, the following code may return 0 or 1, depending on whether the machine is big-endian or little-endian, respectively. In neither case would the correct answer (2) be returned.
const UCS2 ucstr[] = { (UCS2) 'a', (UCS2) 'b', (UCS2) '\0' };

int n = CountChars ( (const unsigned char *) ucstr, 2 * sizeof (UCS2) );
Implementing the above function for a UCS-2 Unicode string is much easier than for the DBCS case:
int CountChars ( const UCS2 *s, int slen )
{
    return s ? ( slen / 2 ) : 0;  // slen is in bytes; each UCS-2 code element occupies 2 bytes
}
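If the byte length is not known in advance, the count can instead be obtained by scanning for the 16-bit terminator, testing whole UCS-2 code elements rather than individual bytes. The following is a minimal sketch (the name CountCharsZ is ours):

// Count UCS-2 code elements in a string terminated by a 16-bit NULL (0x0000).
int CountCharsZ ( const UCS2 *s )
{
    int n = 0;

    if ( ! s )
        return n;

    while ( *s++ != 0x0000 )
        n++;

    return n;
}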
Also notice that the size parameter, which specifies the size of the string in bytes, is not useful in the DBCS case. For a DBCS system, there may be a mixture of 1-byte and 2-byte characters; thus the entire string must be scanned. In the Unicode case, the number of character codes is immediately determined from the size, since each code element is 2 bytes in length. The consequences of this may be considerable: in the DBCS case, one has an O(N) algorithm for determining the number of characters in a string, whereas in the Unicode case one has an O(1) algorithm.