最近部署了基于MongoDB的CAS 5.2.x集群,偶尔会出现一些页面反复跳转问题。简单查了一下日志,似乎是票据(ticket)验证的问题。

前几天,在登录CAS服务仪表盘(dashboard)时,页面提示"cannot validate CAS ticket: ST-***"。查询三个节点的日志(grep ST-*** cas/log/*),发现在节点1上有记录:

cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,144 DEBUG [org.apereo.cas.ticket.registry.MongoDbTicketRegistry] - <Ticket [ST-***] has expired and is now removed from the collection>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,144 DEBUG [org.apereo.cas.ticket.registry.MongoDbTicketRegistry] - <Deleting ticket [ST-***]>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,145 DEBUG [org.springframework.data.mongodb.core.MongoTemplate] - <Remove using query: { "ticketId" : "ST-***"} in collection: serviceTicketsCollection.>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,145 DEBUG [org.apereo.cas.ticket.registry.MongoDbTicketRegistry] - <Deleted ticket [ST-***] with result [WriteResult{acknowledged=false}]>
cas/log/cas-2019-03-27-2.log:2019-03-27 10:23:24,146 WARN [org.apereo.cas.DefaultCentralAuthenticationService] - <Service ticket [ST-***] does not exist.>

在节点2上的记录是,

cas-2019-03-27-1.log:2019-03-27 10:23:12,183 DEBUG [org.apereo.cas.web.pac4j.CasSecurityInterceptor$1] - <url: https://sso.****/cas/status/dashboard?ticket=ST-***>
cas-2019-03-27-1.log:org.pac4j.core.exception.TechnicalException: cannot validate CAS ticket: ST-***
cas-2019-03-27-1.log:Caused by: org.jasig.cas.client.validation.TicketValidationException: Ticket 'ST-***' not recognized

从日志时间看,在节点1上的10:23:24时,该服务票据(service ticket)已过期、被删除。但在节点2上的10:23:12时,有外部请求过来要求验证该服务票据。

查CAS代码,发现serviceTicketsCollection在创建时会添加一个过期时间的索引(support/cas-server-support-mongo-ticket-registry/org/apereo/cas/ticket/registry/MongoDbTicketRegistry.java):

private DBCollection createTicketCollection(final TicketDefinition ticket, final MongoDbConnectionFactory factory) {
        final String collectionName = ticket.getProperties().getStorageName();
        LOGGER.debug("Setting up MongoDb Ticket Registry instance [{}]", collectionName);
        factory.createCollection(mongoTemplate, collectionName, this.dropCollection);

        LOGGER.debug("Creating indices on collection [{}] to auto-expire documents...", collectionName);
        final DBCollection collection = mongoTemplate.getCollection(collectionName);
        collection.createIndex(new BasicDBObject(TicketHolder.FIELD_NAME_EXPIRE_AT, 1),
                new BasicDBObject(FIELD_NAME_EXPIRE_AFTER_SECONDS, ticket.getProperties().getStorageTimeout()));
        return collection;
    }

在ticket配置代码中有(support/cas-server-support-mongo-ticket-registry/org/apereo/cas/config/MongoDbTicketRegistryTicketCatalogConfiguration.java):

protected void buildAndRegisterServiceTicketDefinition(final TicketCatalog plan, final TicketDefinition metadata) {
        metadata.getProperties().setStorageName("serviceTicketsCollection");
        metadata.getProperties().setStorageTimeout(casProperties.getTicket().getSt().getTimeToKillInSeconds());
        super.buildAndRegisterServiceTicketDefinition(plan, metadata);
    }

即storageTimeout在service ticket中为timeToKillInSeconds。

Service TIcket的超时,在代码中为(core/cas-server-core-configuration/org/apereo/cas/configuration/model/core/ticket/ServiceTicketProperties.java):

public class ServiceTicketProperties implements Serializable {

    private static final long serialVersionUID = -7445209580598499921L;
    /**
     * Controls number of times a service ticket can be used within CAS server. Usage in CAS context means service ticket validation
     * transaction.
     */
    private int numberOfUses = 1;

    /**
     * Time in seconds that service tickets should be considered live in CAS server.
     */
    private long timeToKillInSeconds = 10;

所以,service ticket默认是10秒过期。

按照CAS协议(https://apereo.github.io/cas/5.2.x/protocol/CAS-Protocol-Specification.html),服务票据应该是在登录后由CAS服务器颁发给被认证方,被认证方发回CAS服务器再次验证。但从两个节点的记录上看来,这个时间顺序发生了颠倒,应该是节点2相关记录的时间在节点1相关记录的时间前。简单查看了一下两个节点的时间,发现偶尔相差超过10秒,应该是时间不同步导致的问题。

因为单独登录CAS各个节点查询时间,本就有所误差,为了进一步确认问题,写了一个脚本确认个节点的时间是否有问题:

import paramiko

class SSHConnection(object):

    def __init__(self, host, port=22, username='root', pwd='****'):
        self.host = host
        self.port = port
        self.username = username
        self.pwd = pwd
        self.__k = None

    def connect(self):
        transport = paramiko.Transport((self.host,self.port))
        transport.connect(username=self.username,password=self.pwd)
        self.__transport = transport

    def close(self):
        self.__transport.close()

    def cmd(self, command):
        ssh = paramiko.SSHClient()
        ssh._transport = self.__transport
        stdin, stdout, stderr = ssh.exec_command(command)
        result = stdout.read()
        return result


cas_node1 = SSHConnection(host='****')
cas_node1.connect()
cas_node2 = SSHConnection(host='****')
cas_node2.connect()
cas_node3 = SSHConnection(host='****')
cas_node3.connect()
print("CAS Node 1: %s" % cas_node1.cmd('date'))
print("CAS Node 2: %s" % cas_node2.cmd('date'))
print("CAS Node 3: %s" % cas_node3.cmd('date'))
cas_node1.close()
cas_node2.close()
cas_node3.close()

运行了几次,发现果然出现了一些超过10秒误差的记录:

CAS Node1: Wed Mar 27 11:27:54 CST 2019

CAS Node2: Wed Mar 27 11:28:07 CST 2019

CAS Node3: Wed Mar 27 11:28:03 CST 2019

之后查看了一下系统服务中的chronyd, ntp相关状态,发现chronyd开启,chronyd服务日志记录每过几分钟就报出现10多秒的误差。回忆了一下之前配置几个节点,发现的ntpdate相关配置导致的DNS问题,找到了/var/spool/cron/root中的nptdate相关配置。原来,该节点由某私有云配置时,自动写了定时服务,每5分钟与厂商的时间服务器同步一下。但根据mail记录,该时间服务器不够精准,过十多分钟就会有超过10秒的误差。根据chronyd文档,该服务更适合虚拟机、会断电设备,故而删去跟ntp有关的ntpdate定时任务。之后再运行时间检测脚本,发现各节点同步到同一秒了:

CAS Node 1: Mon Apr  1 21:47:11 CST 2019

CAS Node 2: Mon Apr  1 21:47:11 CST 2019

CAS Node 3: Mon Apr  1 21:47:11 CST 2019