Java Source Code Analysis: An Overview of the String Class

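1. Introduction

String is one of the classes Java programmers use most often. We all know that String is immutable: once created, its contents cannot change. But why is it immutable, and what does it do internally to guarantee that? This article looks at the String source code to find out.

Let's start with the class declaration: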

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence

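From this declaration we can tell:

  • String is declared final, so it cannot be subclassed. That alone does not make it immutable: immutability comes from the private final storage array and the absence of any method that modifies it. Forbidding subclasses only ensures that no subclass can break this guarantee.
  • It implements Serializable, so instances can be serialized. Serialization converts an object that implements Serializable into a byte sequence that preserves the object's state.
  • It implements Comparable<String>, which gives strings a total (natural) ordering. Lists and arrays of Comparable objects can be sorted by natural order with Collections.sort and Arrays.sort.
  • It implements CharSequence, a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences.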
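To make the Comparable and CharSequence points concrete, here is a minimal sketch. The class name DeclarationDemo and the helpers sortedCopy and countVowels are made up for this illustration and are not part of the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DeclarationDemo {
    // Hypothetical helper: accepts any CharSequence (String, StringBuilder, ...).
    static int countVowels(CharSequence seq) {
        int count = 0;
        for (int i = 0; i < seq.length(); i++) {
            if ("aeiou".indexOf(seq.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: sorts a copy using String's natural (Comparable) order.
    static List<String> sortedCopy(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortedCopy(Arrays.asList("pear", "apple", "fig")));
        System.out.println(countVowels("banana"));                 // a String
        System.out.println(countVowels(new StringBuilder("sky"))); // another CharSequence
    }
}
```

Because countVowels is written against CharSequence, the same code works for String and StringBuilder without conversion.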

2. Important Fields

/** The value is used for character storage. */
private final char value[];

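This is the character array that stores the string's contents. Because the field is final, the reference cannot be reassigned after construction. Note that final does not protect the array's elements; String stays immutable because the array is private and no public method modifies or exposes it.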
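The difference between a final reference and immutable contents can be seen in a small sketch. LeakyHolder is a hypothetical class written for this example; it deliberately does what String avoids, namely sharing its internal array with callers.

```java
public class FinalArrayDemo {
    // Hypothetical holder that, unlike String, leaks its internal array.
    static final class LeakyHolder {
        private final char[] value;

        LeakyHolder(char[] value) {
            this.value = value;      // no defensive copy
        }

        char[] raw() {
            return value;            // exposes internal state
        }

        @Override
        public String toString() {
            return new String(value);
        }
    }

    static String mutateThroughLeak() {
        LeakyHolder holder = new LeakyHolder(new char[] {'a', 'b', 'c'});
        holder.raw()[0] = 'X';       // allowed: final only fixes the reference
        return holder.toString();
    }

    public static void main(String[] args) {
        System.out.println(mutateThroughLeak());
    }
}
```

The field is final, yet the observable content changes, which is why String combines final with private access and defensive copies.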

/** Cache the hash code for the string */
private int hash; // Default to 0

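The hash field caches the string's hash code as a 32-bit (4-byte) int. It defaults to 0 and is filled in the first time hashCode() is called; see the hashCode() method for how it is computed.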
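The Javadoc of String.hashCode() documents the value as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. The sketch below recomputes that polynomial by hand so it can be compared with the built-in result; HashDemo and polynomialHash are illustrative names, not JDK code.

```java
public class HashDemo {
    // Recomputes the documented polynomial using Horner's rule.
    // int overflow is intentional and matches String.hashCode().
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "abc";
        System.out.println(polynomialHash(s)); // 96354
        System.out.println(s.hashCode());      // same value
    }
}
```

Since the contents never change, the result can be cached in the hash field after the first computation.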

static final boolean COMPACT_STRINGS;

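COMPACT_STRINGS is the flag that enables string compaction; by default, strings are compactable. Note that this field exists only in JDK 9 and later, where value is stored as a byte[] rather than the char[] shown above (the rest of this article follows the JDK 8 source).
It is assigned in a static initializer: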

static {
        COMPACT_STRINGS = true;
    }

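This marks String as compactable. The true written here is only a placeholder: the actual value of the field is injected by the JVM, and compaction can be disabled with the -XX:-CompactStrings option.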

3. Constructors

String has many overloaded constructors. The most commonly used ones are covered below.

3.1 Constructing from another String

/**
     * Initializes a newly created {@code String} object so that it represents
     * the same sequence of characters as the argument; in other words, the
     * newly created string is a copy of the argument string. Unless an
     * explicit copy of {@code original} is needed, use of this constructor is
     * unnecessary since Strings are immutable.
     *
     * @param  original
     *         A {@code String}
     */
    public String(String original) {
        this.value = original.value;
        this.hash = original.hash;
    }

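This constructor assigns the value and hash of the argument original directly to the new object. The result is a new String object that represents the same character sequence as the argument. Because strings are immutable, the two objects can safely share the same underlying array, so no characters are actually copied.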
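A short sketch of what this means in practice: the new object is equal to the original but is not the same object. CopyConstructorDemo and its helper methods are names chosen for this example.

```java
public class CopyConstructorDemo {
    // Content comparison: true when both strings hold the same characters.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    // Identity comparison: true only when both references point to one object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String original = "hello";
        String copy = new String(original);
        System.out.println(sameContent(original, copy)); // true
        System.out.println(sameObject(original, copy));  // false
    }
}
```

This is why the Javadoc calls the constructor unnecessary in most cases: the extra object adds nothing that the original reference does not already provide.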

3.2 Constructing from a char array

/**
     * Allocates a new {@code String} so that it represents the sequence of
     * characters currently contained in the character array argument. The
     * contents of the character array are copied; subsequent modification of
     * the character array does not affect the newly created string.
     *
     * @param  value
     *         The initial value of the string
     */
    public String(char value[]) {
        this.value = Arrays.copyOf(value, value.length);
    }

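A new String is created by copying the contents of the character array, so later changes to the argument array do not affect the new string. The copy is made by Arrays.copyOf: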

public static char[] copyOf(char[] original, int newLength) {
    char[] copy = new char[newLength]; // a new char array is allocated here
    System.arraycopy(original, 0, copy, 0,
                     Math.min(original.length, newLength));
    return copy;
}
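The effect of this defensive copy is easy to check. In the sketch below (CharArrayCtorDemo and buildThenMutate are illustrative names), the source array is modified after the String has been built.

```java
public class CharArrayCtorDemo {
    static String buildThenMutate() {
        char[] chars = {'j', 'a', 'v', 'a'};
        String s = new String(chars); // contents are copied here
        chars[0] = 'J';               // modifies only the caller's array
        return s;
    }

    public static void main(String[] args) {
        System.out.println(buildThenMutate()); // java
    }
}
```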

3.3 Constructing from a byte array

/**
     * Constructs a new {@code String} by decoding the specified array of bytes
     * using the platform's default charset.  The length of the new {@code
     * String} is a function of the charset, and hence may not be equal to the
     * length of the byte array.
     *
     * <p> The behavior of this constructor when the given bytes are not valid
     * in the default charset is unspecified.  The {@link
     * java.nio.charset.CharsetDecoder} class should be used when more control
     * over the decoding process is required.
     *
     * @param  bytes
     *         The bytes to be decoded into characters
     *
     * @since  JDK1.1
     */
public String(byte bytes[]) {
    this(bytes, 0, bytes.length);
}

public String(byte bytes[], int offset, int length) {
        checkBounds(bytes, offset, length);
        this.value = StringCoding.decode(bytes, offset, length);
    }

public String(byte ascii[], int hibyte) {
        this(ascii, hibyte, 0, ascii.length);
    }

/* Common private utility method used to bounds check the byte array
     * and requested offset & length values used by the String(byte[],..)
     * constructors.
     */
private static void checkBounds(byte[] bytes, int offset, int length) {
    if (length < 0)
        throw new StringIndexOutOfBoundsException(length);
    if (offset < 0)
        throw new StringIndexOutOfBoundsException(offset);
    if (offset > bytes.length - length)
        throw new StringIndexOutOfBoundsException(offset + length);
}

//---------------------StringCoding---------------------------------
static char[] decode(byte[] ba, int off, int len) {
    String csn = Charset.defaultCharset().name();	// name of the platform default charset
    try {
        // use charset name decode() variant which provides caching.
        return decode(csn, ba, off, len);	// decode using that charset
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return decode("ISO-8859-1", ba, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: "
                         + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}

static char[] decode(String charsetName, byte[] ba, int off, int len)
        throws UnsupportedEncodingException
    {
        StringDecoder sd = deref(decoder);
        String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
        if ((sd == null) || !(csn.equals(sd.requestedCharsetName())
                              || csn.equals(sd.charsetName()))) {
            sd = null;
            try {
                Charset cs = lookupCharset(csn);
                if (cs != null)
                    sd = new StringDecoder(cs, csn);
            } catch (IllegalCharsetNameException x) {}
            if (sd == null)
                throw new UnsupportedEncodingException(csn);
            set(decoder, sd);
        }
        return sd.decode(ba, off, len);
    }

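In Java (JDK 8), a String instance holds a char[] array whose elements are UTF-16 code units. String and char are the in-memory representation, whereas byte is the serialized form used for network transfer and storage. Converting between byte[] and String is therefore needed all the time, and String provides a family of overloaded constructors that turn a byte array into a String. Any conversion between byte[] and String inevitably raises the question of character encoding.

String(byte[] bytes, Charset charset) decodes the given byte array with charset into a char[] of UTF-16 code units and constructs a new String from it.

  • The bytes here were encoded with some charset. To convert them into a char[] without producing garbled text, the same charset must be specified for decoding.

When a String is constructed from a byte[], StringCoding.decode does the decoding, using the charsetName or charset that was passed in. If no charset is given, StringCoding.decode first tries the platform default charset; only if that charset turns out to be unsupported does it fall back to ISO-8859-1 for decoding.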
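The importance of decoding with the right charset can be shown with three bytes that encode one Chinese character in UTF-8. This is a minimal sketch; DecodeDemo and its two helper methods are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+4E2D (one character, three bytes).
        byte[] bytes = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};

        String correct = decodeUtf8(bytes);
        String garbled = decodeLatin1(bytes);

        System.out.println(correct.length()); // 1: decoded as a single character
        System.out.println(garbled.length()); // 3: each byte became its own character
    }
}
```

Passing an explicit Charset, as here, also avoids depending on the platform default, which may differ between machines.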

4. Inner Class

The CaseInsensitiveComparator inner class

public static final Comparator<String> CASE_INSENSITIVE_ORDER
                                         = new CaseInsensitiveComparator();
    private static class CaseInsensitiveComparator
            implements Comparator<String>, java.io.Serializable {
        // use serialVersionUID from JDK 1.2.2 for interoperability
        private static final long serialVersionUID = 8575799808933029326L;

        // Compares two strings. Both the uppercase and the lowercase forms are
        // checked, for compatibility with the Georgian alphabet.
        public int compare(String s1, String s2) {
            int n1 = s1.length();
            int n2 = s2.length();
            int min = Math.min(n1, n2);
            for (int i = 0; i < min; i++) {
                char c1 = s1.charAt(i);
                char c2 = s2.charAt(i);
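                if (c1 != c2) { // 1. compare the characters at the same index
                    c1 = Character.toUpperCase(c1);
                    c2 = Character.toUpperCase(c2);
                    if (c1 != c2) { // 2. compare again after converting both to uppercase
                        c1 = Character.toLowerCase(c1);
                        c2 = Character.toLowerCase(c2);
                        if (c1 != c2) { // 3. compare again after converting both to lowercase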
                            // No overflow because of numeric promotion
                            return c1 - c2;
                        }
                    }
                }
            }
            return n1 - n2;
        }

        /** Replaces the de-serialized object. */
        private Object readResolve() { return CASE_INSENSITIVE_ORDER; }
    }

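This inner class is used to compare two strings in a case-insensitive way.

A natural question is: if the characters still differ after being converted to uppercase, why compare them again in lowercase? Isn't that redundant?

It is not. The reason is given in a comment inside String's regionMatches() method: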

public boolean regionMatches(boolean ignoreCase, int toffset,
        String other, int ooffset, int len) {
    char ta[] = value;
    int to = toffset;
    char pa[] = other.value;
    int po = ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0)
            || (toffset > (long)value.length - len)
            || (ooffset > (long)other.value.length - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion.  So we need to make one last check before
            // exiting.
            // (This is the comment that explains the extra lowercase check.)
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}

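In other words: conversion to uppercase does not work properly for the Georgian alphabet, which has unusual case-conversion rules, so one last lowercase check is needed before giving up.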
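In everyday code this comparator is reached through the public constant String.CASE_INSENSITIVE_ORDER. The following sketch contrasts it with natural ordering; CaseInsensitiveDemo and its two helper methods are illustrative names.

```java
import java.util.Arrays;

public class CaseInsensitiveDemo {
    // Sorts a copy of the input by natural (case-sensitive) order.
    static String[] sortNatural(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the input ignoring case differences.
    static String[] sortIgnoringCase(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // Uppercase letters sort before lowercase ones in natural order.
        System.out.println(Arrays.toString(sortNatural("apple", "Banana", "cherry")));
        System.out.println(Arrays.toString(sortIgnoringCase("apple", "Banana", "cherry")));
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("JAVA", "java")); // 0
    }
}
```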

5. Common Methods

5.1 The length method

This method returns the length of the string.

public int length() {
    return value.length;
}

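Getting the length of a String simply returns the number of elements in the char[] array, that is, value.length. The length is therefore counted in char (UTF-16 code unit) values, not in Unicode characters.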
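One consequence of counting char values is that a character outside the Basic Multilingual Plane contributes 2 to length(). A small sketch (LengthDemo is a name chosen for the example):

```java
public class LengthDemo {
    static int charUnits(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";
        System.out.println(charUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
    }
}
```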

5.2 The isEmpty method

Checks whether a string is the empty string, i.e. "".

public boolean isEmpty() {
        return value.length == 0;
    }

It checks whether value.length, the length of the char[] array, is 0.

5.3 The charAt method

This method returns the character at the given index.

public char charAt(int index) {
        if ((index < 0) || (index >= value.length)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return value[index];
    }

After a bounds check, it returns the element of the char array at the given index.

5.4 The codePointAt method

This method returns the Unicode code point of the character at the given index.

public int codePointAt(int index) {
        if ((index < 0) || (index >= value.length)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return Character.codePointAtImpl(value, index, value.length);
    }

//---------------------Character---------------------------------
static int codePointAtImpl(char[] a, int index, int limit) {
        char c1 = a[index];
        if (isHighSurrogate(c1) && ++index < limit) {
            char c2 = a[index];
            if (isLowSurrogate(c2)) {
                return toCodePoint(c1, c2);
            }
        }
        return c1;
    }

public static int toCodePoint(char high, char low) {
        // Optimized form of:
        // return ((high - MIN_HIGH_SURROGATE) << 10)
        //         + (low - MIN_LOW_SURROGATE)
        //         + MIN_SUPPLEMENTARY_CODE_POINT;
        return ((high << 10) + low) + (MIN_SUPPLEMENTARY_CODE_POINT
                                       - (MIN_HIGH_SURROGATE << 10)
                                       - MIN_LOW_SURROGATE);
    }
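The unoptimized formula from the comment in toCodePoint can be checked against codePointAt directly. The sketch below is only an illustration; CodePointDemo and combine are made-up names, and the constants come from java.lang.Character.

```java
public class CodePointDemo {
    // The readable form of the surrogate-pair formula quoted in the JDK comment.
    static int combine(char high, char low) {
        return ((high - Character.MIN_HIGH_SURROGATE) << 10)
                + (low - Character.MIN_LOW_SURROGATE)
                + Character.MIN_SUPPLEMENTARY_CODE_POINT;
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // surrogate pair for U+1F600
        System.out.println(Integer.toHexString(combine(s.charAt(0), s.charAt(1)))); // 1f600
        System.out.println(Integer.toHexString(s.codePointAt(0)));                  // 1f600
        // At index 1 there is no following low surrogate, so the char itself is returned.
        System.out.println(Integer.toHexString(s.codePointAt(1)));                  // de00
    }
}
```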

5.5 The getChars method

This method copies characters from the string into a char array.

void getChars(char dst[], int dstBegin) {
        System.arraycopy(value, 0, dst, dstBegin, value.length);
    }

public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {
        if (srcBegin < 0) {
            throw new StringIndexOutOfBoundsException(srcBegin);
        }
        if (srcEnd > value.length) {
            throw new StringIndexOutOfBoundsException(srcEnd);
        }
        if (srcBegin > srcEnd) {
            throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
        }
        System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
    }

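This method copies the characters of the calling string from index srcBegin (inclusive) to srcEnd (exclusive) into the char[] dst, overwriting dst starting at index dstBegin.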

String str = "abc";
char[] endCh = {'1', '2', '3'};
str.getChars(1, 2, endCh, 0);
System.out.println(endCh);

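Output:

b23

Only the character at index 1 ('b') lies in the range from 1 (inclusive) to 2 (exclusive), so it overwrites endCh[0] and the remaining elements are unchanged.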

5.6 The concat method

This method concatenates strings.

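public String concat(String str) {  // concatenate str to the end of this string
        int otherLen = str.length();    // length of the string to append
        if (otherLen == 0) {
            return this;
        }
        int len = value.length; // length of this string
        char buf[] = Arrays.copyOf(value, len + otherLen); // new, larger char array holding this string's characters
        str.getChars(buf, len); // copy str's characters into buf starting at index len (returns nothing)
        return new String(buf, true); // wrap buf in a new String without copying it again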
    }
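Two details of the code above are worth checking by hand: concat returns this when the argument is empty, and the receiver is never modified. ConcatDemo is a name chosen for this sketch.

```java
public class ConcatDemo {
    static boolean returnsSameObjectForEmpty(String s) {
        return s.concat("") == s; // no new object when nothing is appended
    }

    public static void main(String[] args) {
        String a = "abc";
        String b = a.concat("def");
        System.out.println(b);                            // abcdef
        System.out.println(a);                            // abc (unchanged)
        System.out.println(returnsSameObjectForEmpty(a)); // true
    }
}
```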

5.7 The getBytes method

public byte[] getBytes(String charsetName)
            throws UnsupportedEncodingException {
        if (charsetName == null) throw new NullPointerException();
        return StringCoding.encode(charsetName, value, 0, value.length);
    }

public byte[] getBytes(Charset charset) {
        if (charset == null) throw new NullPointerException();
        return StringCoding.encode(charset, value, 0, value.length);
    }

public byte[] getBytes() {
        return StringCoding.encode(value, 0, value.length);
    }

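All three methods encode the string into a sequence of bytes. The overloads that take a parameter encode with the specified charset; the no-argument overload encodes with the platform default charset.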
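Because the result depends on the charset, the same string can produce byte arrays of different lengths. A minimal sketch follows (EncodeDemo and its helper names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Encodes and decodes with the same charset; lossless when every
    // character of s is representable in that charset.
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // U+00E9
        System.out.println(encodedLength(eAcute, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedLength(eAcute, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength("\u4e2d", StandardCharsets.UTF_8));    // 3
        System.out.println(roundTrip("\u4e2d", StandardCharsets.UTF_8).equals("\u4e2d")); // true
    }
}
```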

5.8 The split method

This method splits a string into substrings.

public String[] split(String regex) {
        return split(regex, 0);
    }

public String[] split(String regex, int limit) {    // split this string around matches of the regular expression
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }
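The behavior of limit and of regex metacharacters is easiest to see with a few calls. This is a small sketch (SplitDemo is a made-up class name) that only uses the public split overloads shown above.

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // limit == 0 (the default): trailing empty strings are removed.
        System.out.println(Arrays.toString("a,b,,".split(",")));     // [a, b]

        // Negative limit: trailing empty strings are kept.
        System.out.println(Arrays.toString("a,b,,".split(",", -1))); // [a, b, , ]

        // Positive limit: at most that many elements; the last one holds the rest.
        System.out.println(Arrays.toString("a,b,c".split(",", 2)));  // [a, b,c]

        // "." is a regex metacharacter that matches every character,
        // so all pieces are empty and get removed.
        System.out.println("a.b".split(".").length);                 // 0

        // Escaping the dot takes the fast path for a literal character.
        System.out.println(Arrays.toString("a.b".split("\\.")));     // [a, b]
    }
}
```

The first, third, and last calls qualify for the fast path in the source above; the unescaped "." falls through to Pattern.compile(regex).split(this, limit).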