Sunday, December 15, 2013

What is the difference between code points and code units in Java?

While reading Core Java, Volume I, I came across the following:
The length method yields the number of code units required for a given string in the UTF-16 encoding. For example:

String greeting = "Hello";
int n = greeting.length(); // is 5.



To get the true length, that is, the number of code points, call

int cpCount = greeting.codePointCount(0, greeting.length());


After running the code, I found that both results are 5, so I do not see what the difference is.

------ Solution ------------------------------------ --------
In general you do not need to think about it: you can simply treat the length in code units and the length in code points as the same.

The two differ only for supplementary characters, that is, characters whose code points lie in the range U+10000 to U+10FFFF.

In Java, a char is a single UTF-16 code unit, so one char can only represent the basic Unicode characters U+0000 to U+FFFF (the BMP, Basic Multilingual Plane). To represent characters in the range U+10000 to U+10FFFF, Java uses surrogate characters: high surrogates fall in the range U+D800 to U+DBFF, and low surrogates in the range U+DC00 to U+DFFF. For example, U+10400 requires two chars (U+D801, U+DC00), so its length is one code point but two code units.

For example:

public class Main {

    public static void main(String[] args) {
        // Character.toChars returns the UTF-16 code units for a code point
        char[] chs = Character.toChars(0x10400);
        System.out.printf("U+10400 high surrogate: %04x%n", (int) chs[0]);
        System.out.printf("U+10400 low surrogate: %04x%n", (int) chs[1]);
        String str = new String(chs);
        System.out.println("Code unit length: " + str.length());
        System.out.println("Code point count: " + str.codePointCount(0, str.length()));
    }
}
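
The surrogate pair quoted for U+10400 can also be worked out by hand. Here is a minimal sketch, assuming the standard UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). The class name SurrogateSketch and the method toSurrogates are made up for illustration; only the Character methods come from the JDK.

```java
final class SurrogateSketch {

    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x10400);
        char[] library = Character.toChars(0x10400);
        System.out.printf("manual:  %04x %04x%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04x %04x%n", (int) library[0], (int) library[1]);
        // Character.toCodePoint combines a surrogate pair back into one code point
        System.out.printf("recombined: U+%X%n", Character.toCodePoint(manual[0], manual[1]));
    }
}
```

Both lines should show the same pair, d801 dc00, which matches the values given above.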

------ Solution ------------------------------------- -------
To put it more simply:
an occupation
A code unit represents the encoded position of a Unicode character.
A code point represents the number of units used in a given encoding format: with UTF-16, a character below U+FFFF needs one code point, and a character above U+FFFF needs two code points.
------ Solution ------------------------------- -------------
For a more detailed description, see this article:

Supplementary Characters in the Java Platform
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/index_zh_CN.html
------ Solution -------------------------------- ------------


Oh, sorry, I wrote the descriptions of code unit and code point the wrong way round in my earlier reply.

I wrote that reply at midnight and must have been half asleep, haha.
------ For reference only --------------------- ------------------
Bump.
Bumping my own thread.
------ For reference only -------------------------------- -------
Brother Dragon, still up this late?
------ For reference only --------------------- ------------------
I still do not understand it; I will stop thinking about it.
------ For reference only -------------------------------------- -
I am also learning Java. As I understand it, a code point is the value that the Unicode coding scheme assigns to a character, and it corresponds to a character in the string (such as a letter or a Chinese character);
a UTF-16 code unit is the basic unit used to implement the Unicode encoding: a basic character takes one code unit (16 bits), while a supplementary character takes two.
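
To see that correspondence in a string that mixes both kinds of character, here is a minimal sketch that walks a string one code point at a time instead of one char at a time. The class name CodePointWalk and the method countCodePoints are made up for illustration; codePointAt, Character.charCount and codePointCount are standard JDK methods.

```java
final class CodePointWalk {

    // Hypothetical helper: counts code points by advancing the index by
    // one code unit for basic characters and two for supplementary ones.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "A", then U+10400 (a supplementary character), then "B"
        String mixed = "A" + new String(Character.toChars(0x10400)) + "B";

        System.out.println("code units:  " + mixed.length());
        System.out.println("code points: " + countCodePoints(mixed));

        // Print each code point together with the number of chars it occupies
        for (int i = 0; i < mixed.length(); ) {
            int cp = mixed.codePointAt(i);
            System.out.printf("U+%04X uses %d code unit(s)%n", cp, Character.charCount(cp));
            i += Character.charCount(cp);
        }
    }
}
```

For this string the code unit count is 4 and the code point count is 3, and the manual count should agree with mixed.codePointCount(0, mixed.length()).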
