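Java Source Code Analysis: An Overview of the String Class
1. Introduction
String is one of the classes Java programmers use most often. We all know that String is immutable: once created, its contents never change. But why is it immutable, and what does the class do internally to guarantee that? This article walks through the String source code to find out.
Let's start with the class declaration: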
public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence
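From this declaration we can see that:
- String is declared final, so it cannot be subclassed. This alone does not make it immutable, but it prevents a subclass from overriding its methods and breaking immutability.
- It implements Serializable, so instances can be serialized, that is, converted into a byte sequence that preserves the object's state.
- It implements Comparable, which imposes a total ordering (the natural ordering) on its instances. Lists and arrays of Comparable objects can be sorted into natural order with Collections.sort and Arrays.sort.
- It implements CharSequence, a readable sequence of char values that provides uniform, read-only access to many different kinds of char sequences.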
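To make the Comparable point concrete, here is a small sketch of natural ordering in action. The class name NaturalOrderDemo and its helper methods are made up for illustration; only Arrays.sort and Collections.sort come from the JDK. Keep in mind that String's natural order compares UTF-16 values, so uppercase ASCII letters sort before lowercase ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NaturalOrderDemo {
    // Sorts a copy of the array using String's natural (Comparable) order.
    static String[] sortedArray(String... items) {
        String[] copy = items.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the list using String's natural (Comparable) order.
    static List<String> sortedList(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // prints [Apple, fig, pear]
        System.out.println(Arrays.toString(sortedArray("pear", "Apple", "fig")));
        // prints [C, a, b] because 'C' (67) is smaller than 'a' (97)
        System.out.println(sortedList(Arrays.asList("b", "a", "C")));
    }
}
```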
2. Key fields
/** The value is used for character storage. */
private final char value[];
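This character array stores the contents of the string. It is also declared final, so the reference cannot be reassigned once the constructor has set it. Note that final does not protect the array's elements; String is immutable because the array is private and no public method modifies it or exposes it.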
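The difference between a final reference and immutable contents is easy to miss, so here is a minimal sketch. FinalArrayDemo and its DATA field are hypothetical names used only for illustration; they are not part of String.

```java
class FinalArrayDemo {
    // final fixes the reference, not the elements of the array
    private static final char[] DATA = {'a', 'b', 'c'};

    static String mutateAndRead() {
        DATA[0] = 'X';           // allowed: writing an element
        // DATA = new char[3];   // would not compile: a final field cannot be reassigned
        return new String(DATA);
    }

    public static void main(String[] args) {
        System.out.println(mutateAndRead()); // prints Xbc
    }
}
```

This is why String has to keep value private and never hand it out: the final modifier on its own would not be enough.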
/** Cache the hash code for the string */
private int hash; // Default to 0
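The hash of a String is an int (4 bytes, 32 bits) and defaults to 0. The field caches the result of hashCode(); see the hashCode() method for how the value is calculated.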
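The hashCode() source is not shown in this section, but the String.hashCode() Javadoc documents the formula as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using int arithmetic. The sketch below recomputes that formula from the outside; HashDemo and polynomialHash are illustrative names, not JDK code.

```java
class HashDemo {
    // Recomputes the documented polynomial with int overflow, one char at a time.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("abc"));                      // prints 96354
        System.out.println(polynomialHash("abc") == "abc".hashCode());  // prints true
    }
}
```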
static final boolean COMPACT_STRINGS;
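COMPACT_STRINGS is the flag that controls string compaction. By default, compaction is enabled. (This field belongs to the compact strings feature added in JDK 9, where value is stored as a byte[]; the other snippets in this article appear to come from an earlier version that stores a char[].)
It is initialized in a static block: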
static {
    COMPACT_STRINGS = true;
}
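This marks String as compactable. The assignment only supplies a default, though: the actual value of the field is injected by the JVM.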
3. Constructors
String has many overloaded constructors; the most common ones are examined below.
3.1 Constructing from a string
/**
* Initializes a newly created {@code String} object so that it represents
* the same sequence of characters as the argument; in other words, the
* newly created string is a copy of the argument string. Unless an
* explicit copy of {@code original} is needed, use of this constructor is
* unnecessary since Strings are immutable.
*
* @param original
* A {@code String}
*/
public String(String original) {
    this.value = original.value;
    this.hash = original.hash;
}
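The constructor simply assigns the value and hash of the argument original to the new object. In other words, the new String is a copy of the argument: a distinct object that shares the same underlying character array, which is safe because that array is never modified.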
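A quick way to see what "copy" means here is to compare identity and equality. This is only a sketch; CopyConstructorDemo and isDistinctButEqual are hypothetical names.

```java
class CopyConstructorDemo {
    // True when new String(original) is a different object with equal contents.
    static boolean isDistinctButEqual(String original) {
        String copy = new String(original);
        return copy != original && copy.equals(original);
    }

    public static void main(String[] args) {
        System.out.println(isDistinctButEqual("hello")); // prints true
    }
}
```

As the Javadoc above says, this constructor is rarely needed, since sharing an immutable String is already safe.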
3.2 Constructing from a char array
/**
* Allocates a new {@code String} so that it represents the sequence of
* characters currently contained in the character array argument. The
* contents of the character array are copied; subsequent modification of
* the character array does not affect the newly created string.
*
* @param value
* The initial value of the string
*/
public String(char value[]) {
    this.value = Arrays.copyOf(value, value.length);
}
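A new String is created by copying the contents of the character array, so later changes to the argument array do not affect the new string. The copy is made by Arrays.copyOf:
public static char[] copyOf(char[] original, int newLength) {
    char[] copy = new char[newLength]; // a new char array is allocated here
    System.arraycopy(original, 0, copy, 0,
                     Math.min(original.length, newLength));
    return copy;
}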
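The effect of this defensive copy can be observed from the outside. The following is a minimal sketch; CharArrayCopyDemo and buildThenMutate are names invented for the example.

```java
class CharArrayCopyDemo {
    // Builds a String from a char array, then changes the array afterwards.
    static String buildThenMutate() {
        char[] chars = {'a', 'b', 'c'};
        String s = new String(chars);
        chars[0] = 'X'; // only the caller's array changes
        return s;
    }

    public static void main(String[] args) {
        System.out.println(buildThenMutate()); // prints abc
    }
}
```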
3.3 Constructing from a byte array
/**
* Constructs a new {@code String} by decoding the specified array of bytes
* using the platform's default charset. The length of the new {@code
* String} is a function of the charset, and hence may not be equal to the
* length of the byte array.
*
* <p> The behavior of this constructor when the given bytes are not valid
* in the default charset is unspecified. The {@link
* java.nio.charset.CharsetDecoder} class should be used when more control
* over the decoding process is required.
*
* @param bytes
* The bytes to be decoded into characters
*
* @since JDK1.1
*/
public String(byte bytes[]) {
    this(bytes, 0, bytes.length);
}
public String(byte bytes[], int offset, int length) {
    checkBounds(bytes, offset, length);
    this.value = StringCoding.decode(bytes, offset, length);
}
public String(byte ascii[], int hibyte) {
    this(ascii, hibyte, 0, ascii.length);
}
/* Common private utility method used to bounds check the byte array
 * and requested offset & length values used by the String(byte[],..)
 * constructors.
 */
private static void checkBounds(byte[] bytes, int offset, int length) {
    if (length < 0)
        throw new StringIndexOutOfBoundsException(length);
    if (offset < 0)
        throw new StringIndexOutOfBoundsException(offset);
    if (offset > bytes.length - length)
        throw new StringIndexOutOfBoundsException(offset + length);
}
The decode methods in StringCoding:
static char[] decode(byte[] ba, int off, int len) {
    String csn = Charset.defaultCharset().name(); // name of the platform default charset
    try {
        // use charset name decode() variant which provides caching.
        return decode(csn, ba, off, len); // decode with that charset
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return decode("ISO-8859-1", ba, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: "
                         + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}
static char[] decode(String charsetName, byte[] ba, int off, int len)
    throws UnsupportedEncodingException
{
    StringDecoder sd = deref(decoder);
    String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
    if ((sd == null) || !(csn.equals(sd.requestedCharsetName())
                          || csn.equals(sd.charsetName()))) {
        sd = null;
        try {
            Charset cs = lookupCharset(csn);
            if (cs != null)
                sd = new StringDecoder(cs, csn);
        } catch (IllegalCharsetNameException x) {}
        if (sd == null)
            throw new UnsupportedEncodingException(csn);
        set(decoder, sd);
    }
    return sd.decode(ba, off, len);
}
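In Java, a String instance holds a char[] array whose elements are UTF-16 code units of Unicode text. String and char are the in-memory form, while byte is the serialized form used for network transmission and storage. Many transmission and storage operations therefore need to convert between byte[] and String, and String provides a family of overloaded constructors that turn a byte array into a String. Any conversion between byte[] and String raises the question of character encoding.
String(byte[] bytes, Charset charset) decodes the given byte array using charset, producing a Unicode char[] array from which the new String is built.
- The bytes were produced by encoding text with some charset. To convert them into a Unicode char[] array without garbled characters, they must be decoded with that same charset.
When a String is built from a byte[], StringCoding.decode does the decoding, using the charsetName or charset we specify. If no charset is specified, StringCoding.decode first tries the platform default charset; if that charset is not supported, it falls back to decoding with ISO-8859-1.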
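The following sketch shows why the charset matters: the same two bytes decode to one character under UTF-8 and to two unrelated characters under ISO-8859-1. DecodeDemo and its method names are invented for this example; the constructors and StandardCharsets constants are standard JDK API.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+00E9 (e with acute accent)
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(decodeUtf8(bytes).length());   // prints 1
        System.out.println(decodeLatin1(bytes).length()); // prints 2 (garbled: U+00C3, U+00A9)
    }
}
```

Passing the charset explicitly, as here, avoids depending on the platform default.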
4. Nested classes
The CaseInsensitiveComparator nested class
public static final Comparator<String> CASE_INSENSITIVE_ORDER
                                     = new CaseInsensitiveComparator();
private static class CaseInsensitiveComparator
        implements Comparator<String>, java.io.Serializable {
    // use serialVersionUID from JDK 1.2.2 for interoperability
    private static final long serialVersionUID = 8575799808933029326L;
    // Compares two strings, ignoring case. Both the uppercase and the
    // lowercase forms are checked in order to handle Georgian characters.
    public int compare(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) { // 1. compare the characters at the same position
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) { // 2. compare again after converting both to uppercase
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) { // 3. compare again after converting both to lowercase
                        // No overflow because of numeric promotion
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }
    /** Replaces the de-serialized object. */
    private Object readResolve() { return CASE_INSENSITIVE_ORDER; }
}
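This nested class compares two strings while ignoring case.
A natural question: if the characters still differ after being converted to uppercase, why compare them again in lowercase? Isn't that redundant?
It is not. The reason is given in a comment inside String's regionMatches() method: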
public boolean regionMatches(boolean ignoreCase, int toffset,
        String other, int ooffset, int len) {
    char ta[] = value;
    int to = toffset;
    char pa[] = other.value;
    int po = ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0)
            || (toffset > (long)value.length - len)
            || (ooffset > (long)other.value.length - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion. So we need to make one last check before
            // exiting.
            // (The comment above is the explanation we were looking for.)
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}
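The comment says that converting to uppercase does not work properly for the Georgian alphabet, which has unusual case-conversion rules, so one last check in lowercase is needed before giving up.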
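From the caller's side, the comparator is used through the public constant String.CASE_INSENSITIVE_ORDER. The sketch below shows a comparison and a sort; CaseInsensitiveDemo and sortIgnoringCase are hypothetical names. It sticks to ASCII input and does not try to reproduce the Georgian case.

```java
import java.util.Arrays;

class CaseInsensitiveDemo {
    // Sorts a copy of the array, ignoring case differences.
    static String[] sortIgnoringCase(String... items) {
        String[] copy = items.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // prints 0: the strings differ only in case
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("Hello", "hELLO"));
        // prints [A, b, c]; natural order would give [A, b, c] only by luck, e.g. "a","B" sorts as [B, a]
        System.out.println(Arrays.toString(sortIgnoringCase("b", "A", "c")));
    }
}
```

Note that this comparator works char by char and does not take the locale into account.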
5. Common methods
5.1 The length method
This method returns the length of the string.
public int length() {
    return value.length;
}
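Getting the length of a String simply returns the number of elements in the char[] array, that is, value.length. This counts UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two.
5.2 The isEmpty method
Checks whether the string is the empty string, "".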
public boolean isEmpty() {
    return value.length == 0;
}
It checks whether value.length, the length of the char[] array, is 0.
5.3 The charAt method
This method returns the character at the given index.
public char charAt(int index) {
    if ((index < 0) || (index >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return value[index];
}
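After checking that the index is in range, it returns the element of the char array at that index.
5.4 The codePointAt method
This method returns the Unicode code point at the given index. If the char at that index is a high surrogate and the next char is a low surrogate, the pair is combined into a single supplementary code point.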
public int codePointAt(int index) {
    if ((index < 0) || (index >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return Character.codePointAtImpl(value, index, value.length);
}
//---------------------Character---------------------------------
static int codePointAtImpl(char[] a, int index, int limit) {
    char c1 = a[index];
    if (isHighSurrogate(c1) && ++index < limit) {
        char c2 = a[index];
        if (isLowSurrogate(c2)) {
            return toCodePoint(c1, c2);
        }
    }
    return c1;
}
public static int toCodePoint(char high, char low) {
    // Optimized form of:
    // return ((high - MIN_HIGH_SURROGATE) << 10)
    //         + (low - MIN_LOW_SURROGATE)
    //         + MIN_SUPPLEMENTARY_CODE_POINT;
    return ((high << 10) + low) + (MIN_SUPPLEMENTARY_CODE_POINT
                                   - (MIN_HIGH_SURROGATE << 10)
                                   - MIN_LOW_SURROGATE);
}
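To see the surrogate handling at work, the sketch below uses a string that holds a single supplementary character, U+1F600, written as the surrogate pair D83D DE00. CodePointDemo and describe are illustrative names only.

```java
class CodePointDemo {
    // Returns {length in chars, length in code points, code point at index 0}.
    static int[] describe(String s) {
        return new int[] {
            s.length(),
            s.codePointCount(0, s.length()),
            s.codePointAt(0)
        };
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // one supplementary character stored as two chars
        int[] info = describe(s);
        System.out.println(info[0]);                      // prints 2
        System.out.println(info[1]);                      // prints 1
        System.out.println(Integer.toHexString(info[2])); // prints 1f600
        // Index 1 holds a lone low surrogate, which is returned unchanged.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // prints de00
    }
}
```

Working through toCodePoint by hand gives the same value: ((0xD83D - 0xD800) << 10) + (0xDE00 - 0xDC00) + 0x10000 = 0xF400 + 0x200 + 0x10000 = 0x1F600.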
5.5 getChars
This method copies characters from the string into a destination char array.
void getChars(char dst[], int dstBegin) {
    System.arraycopy(value, 0, dst, dstBegin, value.length);
}
public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {
    if (srcBegin < 0) {
        throw new StringIndexOutOfBoundsException(srcBegin);
    }
    if (srcEnd > value.length) {
        throw new StringIndexOutOfBoundsException(srcEnd);
    }
    if (srcBegin > srcEnd) {
        throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
    }
    System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
}
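The public overload copies the characters from index srcBegin (inclusive) to srcEnd (exclusive) of the string into the dst array, overwriting its elements starting at dstBegin. For example: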
String str = "abc";
char[] endCh = {'1', '2', '3'};
str.getChars(1, 2, endCh, 0);
System.out.println(endCh);
The output is b23: only the character at index 1 ('b') is copied, and it overwrites endCh[0].
5.6 The concat method
This method concatenates two strings.
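public String concat(String str) {
    int otherLen = str.length(); // length of the string to append
    if (otherLen == 0) {
        return this; // nothing to append, so return this instance
    }
    int len = value.length; // length of the current string
    // copy the current characters into a new, larger char array
    char buf[] = Arrays.copyOf(value, len + otherLen);
    // copy the argument's characters into buf, right after the current ones
    str.getChars(buf, len);
    // build the result String from buf
    return new String(buf, true);
}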
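Two behaviours of concat are worth checking from the outside: appending an empty string returns the same instance, and the receiver is never modified. This is a small sketch with made-up names (ConcatDemo, returnsSameInstanceForEmpty).

```java
class ConcatDemo {
    // True when appending "" hands back the very same object.
    static boolean returnsSameInstanceForEmpty(String s) {
        return s.concat("") == s;
    }

    public static void main(String[] args) {
        String a = "abc";
        String b = a.concat("def");
        System.out.println(b);                              // prints abcdef
        System.out.println(a);                              // prints abc (a is unchanged)
        System.out.println(returnsSameInstanceForEmpty(a)); // prints true
    }
}
```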
5.7 The getBytes method
public byte[] getBytes(String charsetName)
        throws UnsupportedEncodingException {
    if (charsetName == null) throw new NullPointerException();
    return StringCoding.encode(charsetName, value, 0, value.length);
}
public byte[] getBytes(Charset charset) {
    if (charset == null) throw new NullPointerException();
    return StringCoding.encode(charset, value, 0, value.length);
}
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}
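All three methods return the contents of the string as a byte array. The overloads that take a parameter encode the characters with the specified charset; the no-argument overload encodes them with the platform default charset.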
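Because the result depends on the charset, the same string can produce byte arrays of different lengths. A minimal sketch, assuming nothing beyond the standard getBytes(Charset) overload; EncodeDemo and its helpers are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String accented = "\u00e9"; // e with acute accent
        String cjk = "\u4e2d";      // a CJK character
        System.out.println(utf8Length(accented));   // prints 2
        System.out.println(latin1Length(accented)); // prints 1
        System.out.println(utf8Length(cjk));        // prints 3
        // Encoding and decoding with the same charset round-trips the text.
        byte[] bytes = cjk.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(cjk)); // prints true
    }
}
```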
5.8 split
This method splits the string around matches of a regular expression.
public String[] split(String regex) {
    return split(regex, 0);
}
// Splits the string around matches of the given regular expression.
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};
        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));
        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}
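The role of limit and of regex metacharacters is easiest to see with a few calls. The sketch below only counts the resulting pieces; SplitDemo and count are names made up for this example.

```java
import java.util.Arrays;

class SplitDemo {
    // Number of pieces produced by s.split(regex, limit).
    static int count(String s, String regex, int limit) {
        return s.split(regex, limit).length;
    }

    public static void main(String[] args) {
        // limit == 0: trailing empty strings are dropped
        System.out.println(Arrays.toString("a,b,,".split(",")));     // prints [a, b]
        // negative limit: trailing empty strings are kept
        System.out.println(Arrays.toString("a,b,,".split(",", -1))); // prints [a, b, , ]
        // positive limit: at most that many pieces, the last one holds the rest
        System.out.println(Arrays.toString("a,b,c".split(",", 2)));  // prints [a, b,c]
        // "." is a regex metacharacter that matches every character
        System.out.println(count("a.b", ".", 0));                    // prints 0
        // escaped, it is a literal dot and takes the fast path (case 2 in the comment)
        System.out.println(count("a.b", "\\.", 0));                  // prints 2
    }
}
```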