最近部署了基于MongoDB的CAS 5.2.x集群,偶尔会出现一些页面反复跳转问题。简单查了一下日志,似乎是票据(ticket)验证的问题。
前几天,在登录CAS服务仪表盘(dashboard)时,页面提示"cannot validate CAS ticket: ST-***"。查询三个节点的日志(grep ST-*** cas/log/*),发现在节点1上有记录:
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,144 DEBUG [org.apereo.cas.ticket.registry.MongoDbTicketRegistry] - <Ticket [ST-***] has expired and is now removed from the collection>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,144 DEBUG [org.apereo.cas.ticket.registry.MongoDbTicketRegistry] - <Deleting ticket [ST-***]>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,145 DEBUG [org.springframework.data.mongodb.core.MongoTemplate] - <Remove using query: { "ticketId" : "ST-***"} in collection: serviceTicketsCollection.>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,145 DEBUG [org.apereo.cas.ticket.registry.MongoDbTicketRegistry] - <Deleted ticket [ST-***] with result [WriteResult{acknowledged=false}]>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,146 WARN [org.apereo.cas.DefaultCentralAuthenticationService] - <Service ticket [ST-***] does not exist.>
在节点2上的记录是,
cas-2019-03-27-1.log:2019-03-27 10:23:12,183 DEBUG [org.apereo.cas.web.pac4j.CasSecurityInterceptor$1] - <url: https://sso.****/cas/status/dashboard?ticket=ST-***>
cas-2019-03-27-1.log:org.pac4j.core.exception.TechnicalException: cannot validate CAS ticket: ST-***
cas-2019-03-27-1.log:Caused by: org.jasig.cas.client.validation.TicketValidationException: Ticket 'ST-***' not recognized
从日志时间看,在节点1上的10:23:24时,该服务票据(service ticket)已过期、被删除。但在节点2上的10:23:12时,有外部请求过来要求验证该服务票据。
查CAS代码,发现serviceTicketsCollection在创建时会添加一个过期时间的索引(support/cas-server-support-mongo-ticket-registry/org/apereo/cas/ticket/registry/MongoDbTicketRegistry.java):
private DBCollection createTicketCollection(final TicketDefinition ticket, final MongoDbConnectionFactory factory) {
final String collectionName = ticket.getProperties().getStorageName();
LOGGER.debug("Setting up MongoDb Ticket Registry instance [{}]", collectionName);
factory.createCollection(mongoTemplate, collectionName, this.dropCollection);
LOGGER.debug("Creating indices on collection [{}] to auto-expire documents...", collectionName);
final DBCollection collection = mongoTemplate.getCollection(collectionName);
collection.createIndex(new BasicDBObject(TicketHolder.FIELD_NAME_EXPIRE_AT, 1),
new BasicDBObject(FIELD_NAME_EXPIRE_AFTER_SECONDS, ticket.getProperties().getStorageTimeout()));
return collection;
}
在ticket配置代码中有(support/cas-server-support-mongo-ticket-registry/org/apereo/cas/config/MongoDbTicketRegistryTicketCatalogConfiguration.java):
protected void buildAndRegisterServiceTicketDefinition(final TicketCatalog plan, final TicketDefinition metadata) {
metadata.getProperties().setStorageName("serviceTicketsCollection");
metadata.getProperties().setStorageTimeout(casProperties.getTicket().getSt().getTimeToKillInSeconds());
super.buildAndRegisterServiceTicketDefinition(plan, metadata);
}
即storageTimeout在service ticket中为timeToKillInSeconds。
Service TIcket的超时,在代码中为(core/cas-server-core-configuration/org/apereo/cas/configuration/model/core/ticket/ServiceTicketProperties.java):
public class ServiceTicketProperties implements Serializable {
private static final long serialVersionUID = -7445209580598499921L;
/**
* Controls number of times a service ticket can be used within CAS server. Usage in CAS context means service ticket validation
* transaction.
*/
private int numberOfUses = 1;
/**
* Time in seconds that service tickets should be considered live in CAS server.
*/
private long timeToKillInSeconds = 10;
所以,service ticket默认是10秒过期。
按照CAS协议(https://apereo.github.io/cas/5.2.x/protocol/CAS-Protocol-Specification.html),服务票据应该是在登录后由CAS服务器颁发给被认证方,被认证方发回CAS服务器再次验证。但从两个节点的记录上看来,这个时间顺序发生了颠倒,应该是节点2相关记录的时间在节点1相关记录的时间前。简单查看了一下两个节点的时间,发现偶尔相差超过10秒,应该是时间不同步导致的问题。
因为单独登录CAS各个节点查询时间,本就有所误差,为了进一步确认问题,写了一个脚本确认个节点的时间是否有问题:
import paramiko
class SSHConnection(object):
def __init__(self, host, port=22, username='root', pwd='****'):
self.host = host
self.port = port
self.username = username
self.pwd = pwd
self.__k = None
def connect(self):
transport = paramiko.Transport((self.host,self.port))
transport.connect(username=self.username,password=self.pwd)
self.__transport = transport
def close(self):
self.__transport.close()
def cmd(self, command):
ssh = paramiko.SSHClient()
ssh._transport = self.__transport
stdin, stdout, stderr = ssh.exec_command(command)
result = stdout.read()
return result
cas_node1 = SSHConnection(host='****')
cas_node1.connect()
cas_node2 = SSHConnection(host='****')
cas_node2.connect()
cas_node3 = SSHConnection(host='****')
cas_node3.connect()
print("CAS Node 1: %s" % cas_node1.cmd('date'))
print("CAS Node 2: %s" % cas_node2.cmd('date'))
print("CAS Node 3: %s" % cas_node3.cmd('date'))
cas_node1.close()
cas_node2.close()
cas_node3.close()
运行了几次,发现果然出现了一些超过10秒误差的记录:
CAS Node1: Wed Mar 27 11:27:54 CST 2019
CAS Node2: Wed Mar 27 11:28:07 CST 2019
CAS Node3: Wed Mar 27 11:28:03 CST 2019
之后查看了一下系统服务中的chronyd, ntp相关状态,发现chronyd开启,chronyd服务日志记录每过几分钟就报出现10多秒的误差。回忆了一下之前配置几个节点,发现的ntpdate相关配置导致的DNS问题,找到了/var/spool/cron/root中的nptdate相关配置。原来,该节点由某私有云配置时,自动写了定时服务,每5分钟与厂商的时间服务器同步一下。但根据mail记录,该时间服务器不够精准,过十多分钟就会有超过10秒的误差。根据chronyd文档,该服务更适合虚拟机、会断电设备,故而删去跟ntp有关的ntpdate定时任务。之后再运行时间检测脚本,发现各节点同步到同一秒了:
CAS Node 1: Mon Apr 1 21:47:11 CST 2019
CAS Node 2: Mon Apr 1 21:47:11 CST 2019
CAS Node 3: Mon Apr 1 21:47:11 CST 2019