1. Background

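With the rise of the mobile internet and the popularity of microservice architectures, many large systems are now split into individual microservices by business function and similar criteria, so an application ends up making many RPC calls. For high availability, services are usually deployed in two data centers: when one data center goes down, the caller can retry against the same service in the other. For example, if service a in data center A fails N times in a row, or M times within one minute, it is marked unavailable for one minute. After that minute the caller probes it again, and the recovery rule can be, for example, that 3 successful calls within one minute mark the service available again. Cross-data-center retry like this really belongs in the RPC framework, but what can we do if the framework version in use does not provide it?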

2. Solution

1. Designing the factory bean

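Suppose the interface we depend on is com.demo.HelloService. We configure two consumers for it, one per data center: helloServiceA for data center A and helloServiceB for data center B. There is also a default group parameter, for example data center A (that is, this system is itself deployed in data center A and calls the remote HelloService in data center A, so calls stay within the same data center). The interface, the default group name, and the two consumers are all wrapped in one factory bean, RecoveryConsumerBean. RecoveryConsumerBean implements the FactoryBean interface and, as shown below, builds the client proxy with a dynamic proxy. Part of the code for reference: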

    @Override
    public Object getObject() throws Exception {
        if (client == null) {
            Object obj = jsfConsume.values().iterator().next();
            client = ProxyFactory.buildProxy(obj.getClass().getInterfaces()[0], JDKInvocationHandler.createHandler(this));
        }
        return client;
    }

    @Override
    public Class<?> getObjectType() {
        return sameGroupUseRpc == null ? null : sameGroupUseRpc.getValidInstance().getClass().getInterfaces()[0];
    }

    @Override
    public boolean isSingleton() {
        return true;
    }
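
To make the proxy idea concrete, here is a minimal, self-contained sketch that uses only the JDK's java.lang.reflect.Proxy. It is not the article's RecoveryConsumerBean: the class FailoverProxyDemo, the nested HelloService interface, and the failoverProxy method are hypothetical stand-ins, and the handler simply tries each data center in order without any health tracking.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.List;

class FailoverProxyDemo {
    /** Hypothetical interface standing in for com.demo.HelloService. */
    public interface HelloService {
        String hello(String name);
    }

    /** Builds a proxy that tries each target in order until one call succeeds. */
    @SuppressWarnings("unchecked")
    static <T> T failoverProxy(Class<T> type, List<T> targets) {
        InvocationHandler handler = (proxy, method, args) -> {
            Throwable last = null;
            for (T target : targets) {
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    // Unwrap the real exception thrown by the target
                    last = e.getCause() != null ? e.getCause() : e;
                }
            }
            throw last != null ? last : new IllegalStateException("no targets configured");
        };
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type}, handler);
    }

    /** Data center A always fails here, so the call is served by data center B. */
    static String demo() {
        HelloService roomA = name -> { throw new IllegalStateException("room A is down"); };
        HelloService roomB = name -> "hello " + name + " from B";
        HelloService client = failoverProxy(HelloService.class, Arrays.asList(roomA, roomB));
        return client.hello("world");
    }
}
```

The caller only sees a HelloService; which data center actually served the call is hidden behind the proxy, which is the property the factory bean relies on.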

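The actual execution logic lives in JDKInvocationHandler, which covers whether the interface is downgraded, the retry count, the retry strategy, and so on. Part of the code for reference: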

public class JDKInvocationHandler implements InvocationHandler {

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (TOSTRING_METHOD_NAME.equals(method.getName())) {
            return "dynamicProxy";
        }
        /**
         * Fetch the configuration of the current interface; if the downgrade switch is on, return the downgrade result
         */
        final InterfaceConf interfaceConf = consumerBean.getInterfaceConf();
        if (interfaceConf.isDown()) {
            if (Void.class == method.getReturnType() || void.class == method.getReturnType()) {
                return null;
            }
            Object obj = interfaceConf.getDefaultValueInstance().get(method.getName());
            if (obj != null) {
                return obj;
            } else {
                log.error("invoke {} default is null", method.getName());
            }
        }

        Throwable ee = null;
        InterfaceConf.AutoRetryConf autRetry = interfaceConf.getAutoRetry();
        UseRpc sourceObject = null;
   
        int tryCount = autRetry.getCount() < 0 ? 0 : autRetry.getCount();
        for (int i = 0; i <= tryCount; i++) {
            try {
                sourceObject = consumerBean.select(sourceObject, autRetry.isSameDirection());
                Object invokeResult = method.invoke(sourceObject.getValidInstance(), args);
                callSuccess(sourceObject);
                return invokeResult;
            } catch (Throwable e) {
                if (e instanceof InvocationTargetException && e.getCause() != null) {
                    ee = e.getCause();
                } else {
                    ee = e;
                }
                // Exceptions that do not trigger the failover strategy: stop retrying and rethrow
                if (!interfaceConf.isValidException(ee)) {
                    break;
                }
                callFail(interfaceConf, ee, sourceObject);
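                // sourceObject is still null if select() itself failed on the first attempt
                log.error("Failover proxy call failed, class:{},method:{},group:{}", method.getDeclaringClass(), method.getName(), sourceObject == null ? null : sourceObject.getGroup(), ee);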
            }
        }
        throw ee;
    }
}
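
The retry loop above has two ingredients that are easy to miss: the number of attempts is the retry count plus one, and only certain exceptions should trigger a retry in the other data center. The sketch below isolates just those two points under simplifying assumptions. RetryLoopDemo, callWithRetry, and the predicate are hypothetical names, and groups are picked round-robin rather than through the article's consumerBean.select.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

class RetryLoopDemo {
    /** Tries the groups in turn; stops early on exceptions that should not trigger failover. */
    static <T> T callWithRetry(List<Callable<T>> groups, int retries,
                               Predicate<Throwable> triggersFailover) throws Exception {
        Exception last = null;
        int attempts = Math.max(0, retries) + 1; // a negative retry count means "no retry"
        for (int i = 0; i < attempts; i++) {
            Callable<T> target = groups.get(i % groups.size());
            try {
                return target.call();
            } catch (Exception e) {
                last = e;
                if (!triggersFailover.test(e)) {
                    break; // e.g. a business exception: retrying elsewhere would not help
                }
            }
        }
        throw last;
    }

    private static String run(Callable<String> roomA, Callable<String> roomB) {
        try {
            // Assumption for this sketch: only I/O errors count as data center failures
            return callWithRetry(Arrays.asList(roomA, roomB), 1, e -> e instanceof IOException);
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    /** A connection error in A is retried against B. */
    static String demoFailover() {
        return run(() -> { throw new ConnectException("room A down"); }, () -> "ok from B");
    }

    /** A business error in A is rethrown without calling B. */
    static String demoBusinessError() {
        return run(() -> { throw new IllegalArgumentException("bad input"); }, () -> "ok from B");
    }
}
```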

2. RPC failure detection and health-probing logic

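Detection strategy 1, consecutive failures: define a dedicated call-status statistics class, RpcCallerStatus. If, say, 3 failures in a row count as a failure, this can be done with an inner class PersistCounter that increments an AtomicLong atomically and checks the result.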

    static class PersistCounter {
        @Getter
        private final AtomicLong currentFails = new AtomicLong();

        public boolean fail(long maxFail) {
            log.debug("PersistCounter failed currentFails:{}", currentFails.get());
            // True only on the call that brings the streak to exactly maxFail
            return currentFails.incrementAndGet() == maxFail;
        }

        private void reset() {
            currentFails.set(0);
        }


    }
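
For a consecutive-failure rule the streak has to be reset whenever a call succeeds, otherwise unrelated failures spread over a long period would eventually add up to the threshold. The excerpt above does not show where reset() is called, so the following is only a sketch of how such a counter might be driven. ConsecutiveFailureDemo and its success() method are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;

class ConsecutiveFailureDemo {
    private final AtomicLong currentFails = new AtomicLong();
    private final long maxFail;

    ConsecutiveFailureDemo(long maxFail) {
        this.maxFail = maxFail;
    }

    /** Returns true exactly once per streak, when the streak first reaches maxFail. */
    boolean fail() {
        return currentFails.incrementAndGet() == maxFail;
    }

    /** Any successful call breaks the streak. */
    void success() {
        currentFails.set(0);
    }

    /** Two failures, a success, then three failures: only the last call trips the threshold. */
    static boolean[] demo() {
        ConsecutiveFailureDemo counter = new ConsecutiveFailureDemo(3);
        boolean first = counter.fail();
        boolean second = counter.fail();
        counter.success();
        boolean third = counter.fail();
        boolean fourth = counter.fail();
        boolean fifth = counter.fail();
        return new boolean[] {first, second, third, fourth, fifth};
    }
}
```

Because the comparison is == rather than >=, further failures after the threshold do not report true again until the counter is reset, so the "mark unavailable" action fires once per streak.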

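Detection strategy 2, failure frequency: to mark a service unavailable after, say, 3 failures within one minute, Google's Guava cache can be used. (A sliding-window design is preferable: keep a running total that accumulates each second's count, and when a second's bucket expires, subtract that bucket's count from the total.)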

    static class FreqFailCounter {
        private LoadingCache<Long, AtomicLong> cache;
        /**
         * Total number of failures recorded within the window (freqSecond seconds, for example 60)
         */
        @Getter
        private final AtomicLong totalCount = new AtomicLong();

        public FreqFailCounter(int freqSecond) {
            cache = createCache(freqSecond);
        }

        public boolean fail(long freqFail) {
            try {
                AtomicLong value = cache.get(TimeUnit.SECONDS.convert(System.currentTimeMillis(), TimeUnit.MILLISECONDS));
                value.incrementAndGet();
                log.debug("FreqFailCounter failed totalCount:{}", totalCount.get());
                if (totalCount.incrementAndGet() == freqFail) {
                    cache.invalidateAll();
                    return true;
                }
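            } catch (Exception e) {
                // Do not swallow the error silently; a failed counter update should be visible in the logs
                log.warn("FreqFailCounter failed to record a failure", e);
            }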
            return false;
        }

        public void reset(int freqSecond) {
            cache.invalidateAll();
            totalCount.set(0);
            cache = createCache(freqSecond);
        }

        private LoadingCache<Long, AtomicLong> createCache(int freqSecond) {
            return CacheBuilder.newBuilder()
                    .maximumSize(freqSecond)
                    .initialCapacity(freqSecond)
                    .expireAfterWrite(freqSecond, TimeUnit.SECONDS)
                    .removalListener(new RemovalListener<Long, AtomicLong>() {
                        public void onRemoval(RemovalNotification<Long, AtomicLong> removalNotification) {
                            long value = totalCount.addAndGet(-removalNotification.getValue().get());
                            if (value < 0) {
                                totalCount.set(0);
                            }
                            log.debug("FreqFailCounter remove totalCount:{}", totalCount.get());
                        }
                    })
                    .build(new CacheLoader<Long, AtomicLong>() {
                        @Override
                        public AtomicLong load(Long aLong) throws Exception {
                            return new AtomicLong();
                        }
                    });
        }
    }
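
One caveat with the Guava version: Guava caches clean up expired entries lazily, during later cache operations, so the removal listener (and therefore totalCount) can lag behind real time. The sliding window described above can also be written without a cache, as a ring of per-second buckets plus a running total. The sketch below is one such variant, not the article's implementation. SlidingWindowFailCounter is a hypothetical name, and the current time is passed in as a parameter purely to make the behaviour easy to follow and test.

```java
import java.util.Arrays;

class SlidingWindowFailCounter {
    private final int windowSeconds;
    private final long[] bucketSecond; // which second each slot currently holds
    private final int[] bucketCount;   // failures recorded in that second
    private int total;                 // running total over the whole window

    SlidingWindowFailCounter(int windowSeconds) {
        this.windowSeconds = windowSeconds;
        this.bucketSecond = new long[windowSeconds];
        this.bucketCount = new int[windowSeconds];
        Arrays.fill(bucketSecond, -1L);
    }

    /** Records one failure at the given second and returns the total within the window. */
    synchronized int fail(long nowSecond) {
        evictExpired(nowSecond);
        int slot = (int) (nowSecond % windowSeconds);
        if (bucketSecond[slot] != nowSecond) {
            // The slot holds a different second: drop its count before reusing it
            total -= bucketCount[slot];
            bucketCount[slot] = 0;
            bucketSecond[slot] = nowSecond;
        }
        bucketCount[slot]++;
        total++;
        return total;
    }

    /** Returns the number of failures still inside the window at the given second. */
    synchronized int total(long nowSecond) {
        evictExpired(nowSecond);
        return total;
    }

    /** Subtracts every bucket that has fallen out of the window from the running total. */
    private void evictExpired(long nowSecond) {
        for (int i = 0; i < windowSeconds; i++) {
            if (bucketCount[i] > 0 && nowSecond - bucketSecond[i] >= windowSeconds) {
                total -= bucketCount[i];
                bucketCount[i] = 0;
            }
        }
    }
}
```

A caller would compare the value returned by fail(...) with the configured threshold, in the same way the article's code compares totalCount with freqFail.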
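
Delayed ban: when either of the two strategies above marks service a in data center A as unavailable, interface a can be held back for one minute. This can be implemented with the Delayed interface:

    static class DelayItem implements Delayed {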
        private static final long NANO_ORIGIN = System.nanoTime();
        private static final AtomicLong SEQUENCER = new AtomicLong(0);

        private final long sequenceNumber;
        private final long time;

        final static long now() {
            return System.nanoTime() - NANO_ORIGIN;
        }

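        /**
         * Creates the delay item. The timeout is given in nanoseconds (for example, 60 seconds expressed in nanoseconds)
         * @param timeout delay in nanoseconds
         */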
        private DelayItem(long timeout) {
            this.time = now() + timeout;
            this.sequenceNumber = SEQUENCER.getAndIncrement();
        }

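        /**
         * Remaining delay, computed against the monotonic System.nanoTime() clock so that elapsed time is measured precisely
         * @param unit the unit of the returned value
         * @return the remaining delay in the given unit; zero or negative once the delay has elapsed
         */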
        @Override
        public long getDelay(TimeUnit unit) {
            long d = unit.convert(time - now(), TimeUnit.NANOSECONDS);
            return d;
        }
        @Override
        public int compareTo(Delayed other) {
            if (other == this) {
                return 0;
            }
            if (other instanceof DelayItem) {
                DelayItem x = (DelayItem) other;
                long diff = time - x.time;
                if (diff < 0) {
                    return -1;
                } else if (diff > 0) {
                    return 1;
                } else if (sequenceNumber < x.sequenceNumber) {
                    return -1;
                } else {
                    return 1;
                }
            }
            long d = (getDelay(TimeUnit.NANOSECONDS) - other.getDelay(TimeUnit.NANOSECONDS));
            return (d == 0) ? 0 : ((d < 0) ? -1 : 1);
        }
    }
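
Delayed objects are most often consumed through java.util.concurrent.DelayQueue, which only hands out an element once its delay has elapsed. The article's code polls getDelay directly instead, so the following is just an illustration of the contract a Delayed implementation has to satisfy. BanDemo, Ban, and the service label are hypothetical, and the delay is shortened to 200 ms so the example finishes quickly.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class BanDemo {
    /** A ban on one service that expires after the given delay. */
    static final class Ban implements Delayed {
        final String service;
        private final long expiresAtNanos;

        Ban(String service, long delay, TimeUnit unit) {
            this.service = service;
            this.expiresAtNanos = System.nanoTime() + unit.toNanos(delay);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(expiresAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    /** Returns what the queue hands out immediately and after waiting for the ban to expire. */
    static String[] demo() {
        DelayQueue<Ban> bans = new DelayQueue<>();
        bans.add(new Ban("A/helloService", 200, TimeUnit.MILLISECONDS));
        try {
            Ban early = bans.poll();                    // ban still active: nothing is released
            Ban later = bans.poll(5, TimeUnit.SECONDS); // waits until the ban has expired
            return new String[] {
                early == null ? null : early.service,
                later == null ? null : later.service
            };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```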
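
Health probing: availability is decided from the current call status (status) together with whether a delay ban exists:

    public boolean isOk() {
        // A delay ban exists, so handle it specially
        if (this.delay != null) {
            if (this.delay.getDelay(TimeUnit.NANOSECONDS) > 0) {
                // The ban has not expired yet; return the current status
                log.info("Delay ban still active, remaining ms: {}", this.delay.getDelay(TimeUnit.NANOSECONDS) / 1000000);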
                if (status) {
                    clearDelay();
                }
                return status;
            } else {
                if (status) {
                    // The ban has expired and the service is available again: clear the ban and return the status.
                    clearDelay();
                    return status;
                } else {
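                    // The ban has expired but the service is still marked unavailable: re-arm the ban and report the service as available so that the caller sends a probe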
                    delayWork();
                    return !status;
                }
            }
        }
        return status;
    }
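
The excerpt does not show how status flips back to true after successful probes. One possible shape, sketched here under stated assumptions, is a small state object that lets calls through again once the ban has elapsed and restores the service after a number of probe successes. ProbeStateDemo and all of its members are hypothetical. It is a simplified variant that counts consecutive probe successes rather than successes within a one-minute window, and it takes the clock as a parameter for clarity.

```java
class ProbeStateDemo {
    private final long banMillis;
    private final int successesToRecover;
    private boolean available = true;
    private long banUntil;      // only meaningful while the service is unavailable
    private int probeSuccesses; // probe successes since the last failure

    ProbeStateDemo(long banMillis, int successesToRecover) {
        this.banMillis = banMillis;
        this.successesToRecover = successesToRecover;
    }

    /** Called when a failure-detection strategy trips: start (or restart) the ban. */
    synchronized void markUnavailable(long nowMillis) {
        available = false;
        banUntil = nowMillis + banMillis;
        probeSuccesses = 0;
    }

    /** True if a call may be routed here: either healthy, or the ban has elapsed and a probe is allowed. */
    synchronized boolean isOk(long nowMillis) {
        return available || nowMillis >= banUntil;
    }

    /** Enough successful probes restore the service. */
    synchronized void onSuccess() {
        if (!available && ++probeSuccesses >= successesToRecover) {
            available = true;
        }
    }

    /** A failed probe restarts the ban and discards earlier probe successes. */
    synchronized void onFailure(long nowMillis) {
        if (!available) {
            markUnavailable(nowMillis);
        }
    }

    synchronized boolean isAvailable() {
        return available;
    }
}
```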

3. Configuration management

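All of the values above need to be stored in a configuration service: the default group of each RPC interface, the interface information, the downgrade settings (whether downgrade is allowed, the downgrade data, and so on), the failure-detection thresholds (for example, 3 failures within one minute), and the recovery-probing settings (for example, 3 successes within one minute count as recovered). The configuration service can be etcd, ZooKeeper, MySQL, or similar, so that the values can be adjusted dynamically.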
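
As a rough sketch of what one interface's configuration entry might look like once it has been fetched from the configuration service, the class below parses a properties-style text into typed fields with defaults. Everything here is an assumption made for illustration: the class name RecoveryConfigDemo, the key names, and the default values are hypothetical and are not taken from the article's InterfaceConf.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class RecoveryConfigDemo {
    final String defaultGroup;     // data center to prefer, e.g. "A"
    final boolean downgrade;       // whether the downgrade switch is on
    final int retryCount;          // extra attempts after the first call
    final int failWindowSeconds;   // length of the failure-frequency window
    final int failThreshold;       // failures within the window that mark the service unavailable
    final int recoverySuccesses;   // successful probes needed to mark it available again

    RecoveryConfigDemo(Properties p) {
        defaultGroup = p.getProperty("defaultGroup", "A");
        downgrade = Boolean.parseBoolean(p.getProperty("downgrade", "false"));
        // Negative retry counts are treated as "no retry"
        retryCount = Math.max(0, Integer.parseInt(p.getProperty("retry.count", "1")));
        failWindowSeconds = Integer.parseInt(p.getProperty("fail.windowSeconds", "60"));
        failThreshold = Integer.parseInt(p.getProperty("fail.threshold", "3"));
        recoverySuccesses = Integer.parseInt(p.getProperty("recovery.successes", "3"));
    }

    /** Parses the text form in which the entry might be stored in the configuration service. */
    static RecoveryConfigDemo parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RecoveryConfigDemo(p);
    }
}
```

Re-parsing the entry whenever the configuration service reports a change is one way to get the dynamic adjustment mentioned above.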