1. Background
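With the rise of the mobile internet and the popularity of microservice architecture, many large systems are now split into individual microservices along business lines, so an application ends up making many RPC calls. For high availability, services are usually deployed in two data centers, so that when one data center goes down the call can be retried against the same service in the other one.
A typical policy looks like this: if service a in data center A fails N times in a row, or M times within one minute, mark service a in data center A as unavailable for one minute. After that minute, start probing it again, for example marking it available once 3 calls succeed within one minute.
Cross-data-center retry of this kind really belongs in the RPC framework. But what can we do if the version of the framework we are on does not provide it?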
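2. Solution
1. Design a factory bean
Suppose the interface we depend on is com.demo.HelloService. We configure one consumer of it per data center, for example helloServiceA for data center A and helloServiceB for data center B, plus a default group parameter, here A (that is, this system is itself deployed in data center A and calls the HelloService in data center A, so calls stay within the same data center by default). The interface, the default group name, and the two consumers are all wrapped in a single factory bean, RecoveryConsumerBean. RecoveryConsumerBean implements the FactoryBean interface and builds the client through a dynamic proxy. Part of the code for reference: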
@Override
public Object getObject() throws Exception {
    // Build the proxy lazily, once, from the interface of the first configured consumer
    if (client == null) {
        Object obj = jsfConsume.values().iterator().next();
        client = ProxyFactory.buildProxy(obj.getClass().getInterfaces()[0], JDKInvocationHandler.createHandler(this));
    }
    return client;
}

@Override
public Class<?> getObjectType() {
    // The type is unknown until the same-group consumer has been set
    return sameGroupUseRpc == null ? null : sameGroupUseRpc.getValidInstance().getClass().getInterfaces()[0];
}

@Override
public boolean isSingleton() {
    return true;
}
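To make the proxy step concrete, here is a minimal sketch that uses only the JDK's java.lang.reflect.Proxy. It does not reproduce RecoveryConsumerBean or ProxyFactory, which the article shows only in part. The sayHello method, the FailoverProxySketch class, and the "try the default group first, then the others" order are assumptions made for illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for com.demo.HelloService; the real interface is not shown in the article.
interface HelloService {
    String sayHello(String name);
}

final class FailoverProxySketch {
    private FailoverProxySketch() {
    }

    /** Builds a proxy that calls the default group first and falls back to the other groups in order. */
    @SuppressWarnings("unchecked")
    static <T> T build(Class<T> iface, Map<String, T> consumers, String defaultGroup) {
        if (!consumers.containsKey(defaultGroup)) {
            throw new IllegalArgumentException("unknown default group: " + defaultGroup);
        }
        // Insertion order is the try order: default (same data center) group first.
        Map<String, T> ordered = new LinkedHashMap<>();
        ordered.put(defaultGroup, consumers.get(defaultGroup));
        ordered.putAll(consumers);

        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                // toString/hashCode/equals are answered locally, never sent over RPC.
                return method.invoke(ordered, args);
            }
            Throwable last = null;
            for (T target : ordered.values()) {
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    // Unwrap the reflection wrapper to get the real failure.
                    last = e.getCause() != null ? e.getCause() : e;
                }
            }
            throw last;
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }
}
```

With a failing consumer for group A and a working one for group B, a call through the proxy returns B's result. A real handler would also decide which exceptions are worth retrying, as the article's handler does.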
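The actual execution logic lives in JDKInvocationHandler. It covers whether the interface is degraded, the retry count, the retry strategy, and so on. Part of the code for reference: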
public class JDKInvocationHandler implements InvocationHandler {
    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (TOSTRING_METHOD_NAME.equals(method.getName())) {
            return "dynamicProxy";
        }
        // Load the configuration of the current interface; if the degradation switch is on, return the fallback
        final InterfaceConf interfaceConf = consumerBean.getInterfaceConf();
        if (interfaceConf.isDown()) {
            if (Void.class == method.getReturnType() || void.class == method.getReturnType()) {
                return null;
            }
            Object obj = interfaceConf.getDefaultValueInstance().get(method.getName());
            if (obj != null) {
                return obj;
            } else {
                // No fallback value is configured for this method: log it and fall through to the real call
                log.error("invoke {} default is null", method.getName());
            }
        }
        Throwable ee = null;
        InterfaceConf.AutoRetryConf autRetry = interfaceConf.getAutoRetry();
        UseRpc sourceObject = null;
        // One initial attempt plus up to tryCount retries
        int tryCount = autRetry.getCount() < 0 ? 0 : autRetry.getCount();
        for (int i = 0; i <= tryCount; i++) {
            try {
                // Pick the consumer for this attempt, given the one used by the previous attempt
                sourceObject = consumerBean.select(sourceObject, autRetry.isSameDirection());
                Object invokeResult = method.invoke(sourceObject.getValidInstance(), args);
                callSuccess(sourceObject);
                return invokeResult;
            } catch (Throwable e) {
                // Unwrap the reflection wrapper so the caller sees the real exception
                if (e instanceof InvocationTargetException && e.getCause() != null) {
                    ee = e.getCause();
                } else {
                    ee = e;
                }
                // Exceptions that do not trigger the disaster-recovery policy are rethrown without retrying
                if (!interfaceConf.isValidException(ee)) {
                    break;
                }
                callFail(interfaceConf, ee, sourceObject);
                // sourceObject is still null if select() itself failed on the first attempt
                log.error("Disaster-recovery proxy call failed, class:{},method:{},group:{}", method.getDeclaringClass(), method.getName(), sourceObject == null ? null : sourceObject.getGroup(), ee);
            }
        }
        throw ee;
    }
}
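2. RPC failure detection and probing logic
Check strategy 1, consecutive failures: define a dedicated call-status statistics class, RpcCallerStatus. For example, 3 failures in a row count as a failure of the service. This can be done with an inner class, PersistCounter, that increments an AtomicLong and compares it with the threshold: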
static class PersistCounter {
    @Getter
    private final AtomicLong currentFails = new AtomicLong();

    // Returns true only on the call that brings the streak to exactly maxFail
    public boolean fail(long maxFail) {
        log.debug("PersistCounter failed currentFails:{}", currentFails.get());
        return currentFails.incrementAndGet() == maxFail;
    }

    private void reset() {
        currentFails.set(0);
    }
}
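The excerpt does not show when reset() is called. A minimal, dependency-free sketch of how such a counter could be driven is given below, assuming the streak is cleared on every successful call. The class name ConsecutiveFailureCounter and the success() method are illustrative, not taken from the article.

```java
import java.util.concurrent.atomic.AtomicLong;

final class ConsecutiveFailureCounter {
    private final AtomicLong currentFails = new AtomicLong();

    /** Records one failure; returns true only on the call that reaches maxFail in a row. */
    boolean fail(long maxFail) {
        return currentFails.incrementAndGet() == maxFail;
    }

    /** Assumption: any successful call breaks the streak. */
    void success() {
        currentFails.set(0);
    }

    long current() {
        return currentFails.get();
    }
}
```

Because the comparison is an equality check, the counter signals once per streak, so the caller marks the service unavailable a single time rather than on every later failure.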
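Check strategy 2, failure frequency: to mark a service unavailable after, say, 3 failures within one minute, you can use a cache from Google's Guava. (Ideally this is a sliding-window design: keep a running total to which each second's call count is added, and when a second's entry expires, subtract its count from the total.)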
static class FreqFailCounter {
    // One entry per second, holding the number of failures recorded in that second
    private LoadingCache<Long, AtomicLong> cache;
    /**
     * Total number of failures within the window (for example 60 seconds)
     */
    @Getter
    private final AtomicLong totalCount = new AtomicLong();

    public FreqFailCounter(int freqSecond) {
        cache = createCache(freqSecond);
    }

    public boolean fail(long freqFail) {
        try {
            AtomicLong value = cache.get(TimeUnit.SECONDS.convert(System.currentTimeMillis(), TimeUnit.MILLISECONDS));
            value.incrementAndGet();
            log.debug("FreqFailCounter failed totalCount:{}", totalCount.get());
            if (totalCount.incrementAndGet() == freqFail) {
                cache.invalidateAll();
                return true;
            }
        } catch (Exception e) {
            // Do not swallow the exception silently; a cache load failure should be visible in the logs
            log.warn("FreqFailCounter failed to record a failure", e);
        }
        return false;
    }

    public void reset(int freqSecond) {
        cache.invalidateAll();
        totalCount.set(0);
        cache = createCache(freqSecond);
    }

    private LoadingCache<Long, AtomicLong> createCache(int freqSecond) {
        return CacheBuilder.newBuilder()
                .maximumSize(freqSecond)
                .initialCapacity(freqSecond)
                .expireAfterWrite(freqSecond, TimeUnit.SECONDS)
                // Guava cleans up expired entries lazily, during later cache operations,
                // so totalCount can lag behind the true window total for a while
                .removalListener(new RemovalListener<Long, AtomicLong>() {
                    @Override
                    public void onRemoval(RemovalNotification<Long, AtomicLong> removalNotification) {
                        // Subtract the expired second's failures from the running total
                        long value = totalCount.addAndGet(-removalNotification.getValue().get());
                        if (value < 0) {
                            totalCount.set(0);
                        }
                        log.debug("FreqFailCounter remove totalCount:{}", totalCount.get());
                    }
                })
                .build(new CacheLoader<Long, AtomicLong>() {
                    @Override
                    public AtomicLong load(Long aLong) throws Exception {
                        return new AtomicLong();
                    }
                });
    }
}
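The sliding-window idea (a running total plus per-second counts that are subtracted when they expire) can also be sketched without Guava. The version below is an illustration under stated assumptions, not the article's implementation: SlidingWindowFailCounter is a made-up name, the clock is injected so the behavior is deterministic, the clock is assumed never to go backwards, and fail() reports true whenever the total is at or above the threshold rather than only on the exact hit.

```java
import java.util.function.LongSupplier;

final class SlidingWindowFailCounter {
    private final int windowSeconds;
    private final long[] slotSecond;   // which second each slot currently holds
    private final long[] slotCount;    // failures recorded in that second
    private final LongSupplier clockSeconds;
    private long total;                // running total over the window

    SlidingWindowFailCounter(int windowSeconds, LongSupplier clockSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        this.windowSeconds = windowSeconds;
        this.slotSecond = new long[windowSeconds];
        this.slotCount = new long[windowSeconds];
        this.clockSeconds = clockSeconds;
    }

    /** Records one failure and reports whether the window total has reached the threshold. */
    synchronized boolean fail(long threshold) {
        long now = clockSeconds.getAsLong();
        expire(now);
        int slot = (int) Math.floorMod(now, (long) windowSeconds);
        slotSecond[slot] = now;
        slotCount[slot]++;
        total++;
        return total >= threshold;
    }

    /** Current number of failures inside the window. */
    synchronized long total() {
        expire(clockSeconds.getAsLong());
        return total;
    }

    // Subtracts the counts of seconds that have fallen out of the window.
    private void expire(long now) {
        for (int i = 0; i < windowSeconds; i++) {
            if (slotCount[i] > 0 && now - slotSecond[i] >= windowSeconds) {
                total -= slotCount[i];
                slotCount[i] = 0;
            }
        }
    }
}
```

In production the clock would be something like () -> System.currentTimeMillis() / 1000. Expiry here happens eagerly on every call, so the total never lags the way a lazily cleaned cache can.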
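Delay queue: once either of the two strategies above marks service a in data center A as unavailable, interface a can be held back for one minute. This can be implemented with the Delayed interface: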
static class DelayItem implements Delayed {
    private static final long NANO_ORIGIN = System.nanoTime();
    private static final AtomicLong SEQUENCER = new AtomicLong(0);
    private final long sequenceNumber;
    private final long time;

    static long now() {
        return System.nanoTime() - NANO_ORIGIN;
    }

    /**
     * Creates the delay item.
     * @param timeout the delay in nanoseconds, for example 60 seconds expressed in nanoseconds
     */
    private DelayItem(long timeout) {
        this.time = now() + timeout;
        this.sequenceNumber = SEQUENCER.getAndIncrement();
    }

    /**
     * Returns the remaining delay. Measuring time this way accounts for elapsed time more precisely.
     * @param unit the unit of the returned value
     * @return the remaining delay in the given unit; zero or negative once the delay has elapsed
     */
    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(time - now(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        if (other == this) {
            return 0;
        }
        if (other instanceof DelayItem) {
            // Order by expiry time, then by creation order to break ties
            DelayItem x = (DelayItem) other;
            long diff = time - x.time;
            if (diff < 0) {
                return -1;
            } else if (diff > 0) {
                return 1;
            } else if (sequenceNumber < x.sequenceNumber) {
                return -1;
            } else {
                return 1;
            }
        }
        long d = (getDelay(TimeUnit.NANOSECONDS) - other.getDelay(TimeUnit.NANOSECONDS));
        return (d == 0) ? 0 : ((d < 0) ? -1 : 1);
    }
}
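The excerpt defines the Delayed item but does not show a queue consuming it. One way a Delayed object is typically used is with java.util.concurrent.DelayQueue, whose poll() only hands back elements whose delay has elapsed. The sketch below assumes that usage. GroupBan and BanRegistry are hypothetical names, and the article's own code may simply hold a single DelayItem per service instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// A ban on one data center group that expires after the given delay.
final class GroupBan implements Delayed {
    final String group;
    private final long deadlineNanos;

    GroupBan(String group, long delay, TimeUnit unit) {
        this.group = group;
        this.deadlineNanos = System.nanoTime() + unit.toNanos(delay);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(deadlineNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}

final class BanRegistry {
    private final DelayQueue<GroupBan> bans = new DelayQueue<>();

    void ban(String group, long delay, TimeUnit unit) {
        bans.add(new GroupBan(group, delay, unit));
    }

    /** Returns the groups whose ban has expired and that are due for a probe. */
    List<String> releaseExpired() {
        List<String> released = new ArrayList<>();
        GroupBan expired;
        // poll() is non-blocking and returns null while the earliest ban is still running.
        while ((expired = bans.poll()) != null) {
            released.add(expired.group);
        }
        return released;
    }

    int pending() {
        return bans.size();
    }
}
```

A background task or the call path itself could call releaseExpired() periodically and start probing the returned groups.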
Probing: availability is decided from the current call status (status), whether a delay item exists, and similar conditions:
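public boolean isOk() {
    // A delay ban exists, so it needs special handling
    if (this.delay != null) {
        if (this.delay.getDelay(TimeUnit.NANOSECONDS) > 0) {
            // The ban has not expired yet: return the current status
            log.info("Delay ban still in effect, remaining: {} ms", this.delay.getDelay(TimeUnit.NANOSECONDS) / 1000000);
            if (status) {
                clearDelay();
            }
            return status;
        } else {
            if (status) {
                // The ban has expired and the service is available again: clear the ban and return the status
                clearDelay();
                return status;
            } else {
                // The ban has expired but the service is still marked unavailable:
                // re-arm the ban and report "available" so that the caller probes the service
                delayWork();
                return !status;
            }
        }
    }
    return status;
}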
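The branches above amount to a small state machine: while the ban is running, calls are refused; once it expires, one call is let through as a probe and the ban is re-armed; a successful probe eventually flips the status back. A simplified sketch with an injected clock follows. ProbeGate and its method names are assumptions for illustration, and it leaves out the "N successes before recovery" threshold mentioned in the background section.

```java
import java.util.function.LongSupplier;

final class ProbeGate {
    private final long banNanos;
    private final LongSupplier nanoClock;
    private boolean available = true;
    private long banUntil; // only meaningful while available is false

    ProbeGate(long banNanos, LongSupplier nanoClock) {
        this.banNanos = banNanos;
        this.nanoClock = nanoClock;
    }

    /** Called when a failure strategy trips: refuse calls for banNanos. */
    synchronized void markUnavailable() {
        available = false;
        banUntil = nanoClock.getAsLong() + banNanos;
    }

    /** Called when probing has succeeded often enough. */
    synchronized void markAvailable() {
        available = true;
    }

    /** Should the caller use this service right now? */
    synchronized boolean isOk() {
        if (available) {
            return true;
        }
        long now = nanoClock.getAsLong();
        if (now - banUntil < 0) {
            return false;            // ban still running
        }
        banUntil = now + banNanos;   // re-arm the ban
        return true;                 // let this one call through as a probe
    }
}
```

In production the clock would be System::nanoTime. Re-arming before returning true means that at most one probe per ban period reaches a service that is still down.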
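3. Configuration management
All of the values listed above should be stored in a configuration service so that they can be adjusted dynamically: the default group of each RPC interface, the interface information, the degradation settings (whether the interface can be degraded, the fallback data, and so on), the failure-detection statistics (for example 3 failures within one minute), and the recovery probing settings (for example 3 successes within one minute count as recovered). The configuration service can be backed by etcd, ZooKeeper, MySQL, and so on.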
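As an illustration of what such a configuration might carry, the sketch below reads the values from a java.util.Properties object, which could be filled from whichever configuration service is used. Every key name, default value, and the RecoveryConf class itself are hypothetical. The article's InterfaceConf is not shown in full, so this is not its actual shape.

```java
import java.util.Properties;

final class RecoveryConf {
    final String defaultGroup;        // group (data center) to call first
    final boolean degradeEnabled;     // whether the degradation switch is on
    final int retryCount;             // retries after the first attempt
    final int failThreshold;          // failures that mark the service unavailable
    final int failWindowSeconds;      // window for the failure-frequency check
    final int recoverThreshold;       // successful probes needed to recover

    private RecoveryConf(Properties p) {
        this.defaultGroup = p.getProperty("default.group", "A");
        this.degradeEnabled = Boolean.parseBoolean(p.getProperty("degrade.enabled", "false"));
        this.retryCount = intOf(p, "retry.count", 1);
        this.failThreshold = intOf(p, "fail.threshold", 3);
        this.failWindowSeconds = intOf(p, "fail.window.seconds", 60);
        this.recoverThreshold = intOf(p, "recover.threshold", 3);
    }

    /** Builds a snapshot from the given properties; missing keys fall back to defaults. */
    static RecoveryConf from(Properties p) {
        return new RecoveryConf(p);
    }

    private static int intOf(Properties p, String key, int fallback) {
        String raw = p.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }
}
```

Keeping the snapshot immutable lets a configuration listener swap in a new RecoveryConf atomically when a value changes, instead of mutating fields that in-flight calls are reading.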