Balancer>Mish, you're not quite right. Already, per the standard, UTF-8 encodes 2^31 characters, and within that same standard it extends to an arbitrary number of characters. Its counterpart, classic thirty-two-bit Unicode, is extremely awkward in every respect, starting with how it has to be processed and ending with the fact that setting aside 4 bytes for every character is an outrage.
Rom, by what standard does UTF-8 encode 2^31 characters? RFC 3629 shows 21 bits and refers to the very book I cited, which also gives only 21 bits. Besides, UTF-8 is not infinitely extensible. One of UTF-8's properties is that in a byte stream it is easy to tell a data byte from a lead byte, and the lead byte indicates the length of the whole code sequence. And the number of bits available there is limited.
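To make the bit budget concrete, here is a minimal sketch (Python, my own illustration, not code from the RFC) of how the lead byte fixes the sequence length and why four bytes top out at 21 payload bits:

```python
# Sketch of RFC 3629 UTF-8: the lead byte encodes the total sequence length,
# and a 4-byte sequence carries at most 21 payload bits (hence U+10FFFF max).
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:          # 1 byte:  0xxxxxxx                      (7 bits)
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx             (11 bits)
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx    (16 bits)
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    if cp <= 0x10FFFF:     # 4 bytes: 11110xxx then three 10xxxxxx  (21 bits)
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("beyond U+10FFFF: not representable in RFC 3629 UTF-8")

assert utf8_encode(0x10FFFF) == "\U0010FFFF".encode("utf-8")
```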
UTF, any of them, is an encoding form. UTF plus something else is a scheme. That is, it is a way of representing, in a stream or in a machine, the abstract number assigned to a code point (the character itself). These are different things. As for implementations, yes, there are various representations, 32-bit or 16-bit. Besides, Unicode is not just about encoding; there is a lot else to it.
As for 4 bytes per character: that only holds while you are working with simple texts; once you get to complex ones, everything will be multi-byte in UTF-8 too.
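Easy to check with a couple of sample strings (Python, my own examples; the exact counts of course depend on the text):

```python
# How many bytes each encoding form actually spends on ASCII, Cyrillic and
# CJK sample strings (sample text is mine, purely illustrative).
for text in ("test", "тест", "測試"):
    print(text,
          "utf-8:",  len(text.encode("utf-8")),
          "utf-16:", len(text.encode("utf-16-le")),
          "utf-32:", len(text.encode("utf-32-le")))
# "test" -> 4 / 8 / 16 bytes, "тест" -> 8 / 8 / 16, "測試" -> 6 / 4 / 8
```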
Damn, this thread needs to be moved to the computer forum.
2.4 Code Points and Characters
On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters.
The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF₁₆, comprising 1,114,112 code points available for assigning the repertoire of abstract characters. Of course, there are constraints on how the codespace is organized, and particular areas of the codespace have been set aside for encoding of certain kinds of abstract characters or for other uses in the standard. For more on the allocation of the Unicode codespace, see Section 2.8, Unicode Allocation.
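The count follows straight from the range; a quick sanity check (Python, just arithmetic):

```python
# 0x10FFFF is the last code point, so the codespace holds 0x10FFFF + 1 values.
assert 0x10FFFF + 1 == 1_114_112
assert chr(0x10FFFF) == "\U0010FFFF"   # Python's chr() stops exactly there
```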
2.5 Encoding Forms
Computers handle numbers not simply as abstract mathematical objects, but as combinations of fixed-size units like bytes and 32-bit words. A character encoding model must take this fact into account when determining how to associate numbers with the characters. Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are correspondingly named UTF-8, UTF-16, and UTF-32. (The “UTF” is a carryover from earlier terminology meaning Unicode (or UCS) Transformation Format.) Each of these three encoding forms is an equally legitimate mechanism for representing Unicode characters; each has advantages in different environments.
All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; they are thus fully interoperable for implementations that may choose different encoding forms for various reasons. Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.
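A quick round-trip check of that losslessness claim (Python's standard codecs used here as a stand-in for "an implementation"; the sample string is mine):

```python
# Round-trip a string through all three encoding forms; no data is lost.
s = "Unicode: Ж, 測, \U0001F600"
for form in ("utf-8", "utf-16-le", "utf-32-le"):
    assert s.encode(form).decode(form) == s
```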
2.6 Encoding Schemes
The discussion of Unicode encoding forms in the previous section was concerned with the machine representation of Unicode code units. Each code unit is represented in a computer simply as a numeric data type; just as for other numeric types, the exact way the bits are laid out internally is irrelevant to most processing. However, interchange of textual data, particularly between computers of different architectural types, requires consideration of the exact ordering of the bits and bytes involved in numeric representation. Integral data, including character data, is serialized for open interchange into well-defined sequences of bytes. This process of byte serialization allows all applications to correctly interpret exchanged data and to accurately reconstruct numeric values (and thereby character values) from it. In the Unicode Standard, the specifications of the distinct types of byte serializations to be used with Unicode data are known as Unicode encoding schemes.
Modern computer architectures differ in ordering in terms of whether the most significant byte or the least significant byte of a large numeric data type comes first in internal representation. These sequences are known as “big-endian” and “little-endian” orders, respectively. For the Unicode 16- and 32-bit encoding forms (UTF-16 and UTF-32), the specification of a byte serialization must take into account the big-endian or little-endian architecture of the system on which the data is represented, so that when the data is byte-serialized for interchange it will be well defined.
A character encoding scheme consists of a specified character encoding form plus a specification of how the code units are serialized into bytes. The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little-endian data in some of the Unicode encoding schemes. (See the “Byte Order Mark” subsection in Section 15.9, Specials.)
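For example (Python again; note that the plain "utf-16" codec here writes the BOM and the machine's native byte order, which is little-endian on most desktops):

```python
# Encoding scheme = encoding form + byte serialization (+ optional BOM).
# Python's "utf-16" codec is the scheme with a BOM; -le/-be are the BOM-less ones.
s = "A"
print(s.encode("utf-16"))     # b'\xff\xfeA\x00' on a little-endian machine (BOM first)
print(s.encode("utf-16-le"))  # b'A\x00'
print(s.encode("utf-16-be"))  # b'\x00A'
```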
Balancer>And all those 4-byte wchar things are a perversion. These days char is char, and the programmer should not have to know anything about its "length" [»]
That is exactly why one should not talk about length at all. It is an abstract number, without a size.