The length method yields the number of code units required for a given string in the UTF-16 encoding. For example:
String greeting = "Hello";
int n = greeting.length(); // is 5
To get the true length, that is, the number of code points, call
int cpCount = greeting.codePointCount(0, greeting.length());
After running this code I found that both results are 5, and I don't understand what the difference is.
------ Solution ------------------------------------ --------
In general you don't need to think about the code unit length; you can simply assume it is the same as the code point count.
The two differ only for supplementary characters, that is, characters whose code points are in the range U+10000 to U+10FFFF.
In Java, a char is a UTF-16 code unit, so a single char can only represent the basic Unicode characters U+0000 to U+FFFF (the BMP, Basic Multilingual Plane). To represent the characters U+10000 to U+10FFFF, Java uses surrogate characters: high surrogates are in the range U+D800 to U+DBFF, and low surrogates are in the range U+DC00 to U+DFFF. For example, U+10400 needs two chars (U+D801, U+DC00), so its code point count is 1 and its code unit length is 2.
For example:
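public class Main {
    public static void main(String[] args) {
        // Encode the supplementary code point U+10400 as a UTF-16 surrogate pair.
        char[] chs = Character.toChars(0x10400);
        System.out.printf("U+10400 high surrogate: %04x%n", (int) chs[0]);
        System.out.printf("U+10400 low surrogate: %04x%n", (int) chs[1]);
        String str = new String(chs);
        // length() counts chars (code units); codePointCount() counts characters.
        System.out.println("Code unit length: " + str.length());
        System.out.println("Code point count: " + str.codePointCount(0, str.length()));
    }
}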
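As a rough sketch of where U+D801 and U+DC00 come from, the surrogate pair can also be computed by hand with the standard UTF-16 arithmetic. The class name SurrogateMath and the helper toSurrogates are made up for this illustration; the Character methods used for the cross-check are the JDK's own (Java 7 or later).

```java
public class SurrogateMath {
    // Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
    // into its UTF-16 high and low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                 // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]); // d801 dc00

        // Cross-check against the JDK's own helpers; each line prints true.
        System.out.println(pair[0] == Character.highSurrogate(0x10400));
        System.out.println(pair[1] == Character.lowSurrogate(0x10400));
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x10400);
    }
}
```

In real code Character.toChars already does this, so the manual version is only useful for seeing how the two ranges are filled.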
------ Solution ------------------------------------- -------
To put it simply:
an occupation
A code unit represents the encoding position of a Unicode character.
A code point indicates the count in the specified encoding format: with UTF-16, a character at or below U+FFFF needs one code point, and a character above U+FFFF needs two code points to represent it.
------ Solution ------------------------------- -------------
For a more detailed description, see this article:
Supplementary Characters in the Java Platform
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/index_zh_CN.html
------ Solution -------------------------------- ------------
Oh, sorry, I wrote the descriptions of code unit and code point the wrong way round just now.
I posted that reply at midnight and must have been half asleep, haha.
------ For reference only -------------------------------- -------
Brother Long, you're still up this late.
------ For reference only --------------------- ------------------
I still don't understand it, so I'll stop thinking about it.
------ For reference only -------------------------------------- -
I'm also learning Java. As I understand it, a code point is the code value that the coding scheme (Unicode) assigns to a character, and it corresponds to one character in a string (such as a letter or a Chinese character);
a UTF-16 code unit is the basic unit used to encode Unicode: a basic character takes one code unit (16 bits), while a supplementary character needs two.
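To make that concrete, here is a minimal sketch that walks a string one code point at a time rather than one char at a time. The class name CodePointWalk, the helper countSupplementary, and the sample string are made up for this illustration; codePointAt, Character.charCount and Character.isSupplementaryCodePoint are standard JDK methods.

```java
public class CodePointWalk {
    // Hypothetical helper: counts the supplementary characters in a string
    // by stepping over whole code points instead of individual chars.
    static int countSupplementary(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);           // reads one or two chars
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            i += Character.charCount(cp);        // advance by 1 or 2 code units
        }
        return count;
    }

    public static void main(String[] args) {
        // "A", then U+10400 (two chars), then "B".
        String s = "A" + new String(Character.toChars(0x10400)) + "B";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(countSupplementary(s));           // 1
    }
}
```

A plain loop using charAt(i) with i++ would instead see the two surrogates as two separate chars, which is why the two counts differ for this string but not for "Hello".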