公司所有的sybase server实际在一台物理服务器上,由于是开发数据库,所以备份很简单.对数据库每周执行一次dump database备份,同时每天执行1次dump transaction备份.
同事建立环境的时候,针对每个sybase实例,都建立了相应的backupserver,而且简单的将sybase server名称写在备份脚本中.增加删除sybase实例时,需要修改备份脚本,
所以此次修改了备份脚本,使其自动获取当前系统中运行的sybase server,同时删除了过多的备份服务器,所有sybase实例使用相同的备份服务器.
for server in .....
do
isql -Usa -Psybasedba -S$server <<EOF
sp_configure "allow updates",1
go
update sysservers set srvnetname='ASE15_BS' where srvname='SYB_BACKUP'
go
sp_configure "allow updates",0
go
EOF
done
修改脚本,完成上述修改后,简单测试没有发现错误, 就修改了cron job,运行1周后,检查日志时发现频繁出现错误
Can't open a connection to site 'SYB_BACKUP'. See the error log file in the ASE boot directory.
检查备份服务器的错误日志,频繁出现错误
Open Server Error: 16137.10.0: The maximum number of site handlers, 10 has been exceeded
开始时猜测是否与备份服务器的参数有关,使其无法备份所有sybase实例,尝试在备份服务器启动参数中增加-P100和-m128, 但实验中仍然出现上述错误.
后来仔细检查发现都与日志备份有关,而且此错误并非固定出现,有时连续多天正常,有时则连续多天错误.在百度上搜索也未找到类似错误,手册中也没有找到.
查找多天没有找到思路,当时也想建立多个备份服务器来避开这个问题.但总是觉得很奇怪,备份是串行的,在dump database,dump tran同时执行时,备份服务器
中也就执行2个备份而已,而且错误还多在仅有dump tran执行时发生,dump database日志中没有出现错误.
于是想通过truss来查找错误时的原因(当时想查看backupserver是否存在多个LWP或fork出的进程是否遇到什么错误). 在尝试重现错误时,无意发现短时间内
重复执行备份事务日志的脚本,错误一定会出现. 可以重现错误后,使用truss来trace备份服务器,发现当备份10个sybase实例后,新的sybase实例发起的socket建立
连接,随即没有其他操作,错误就写到错误日志中,关闭了socket. trace文件排除了多个LWP的可能性,也排除了fork出的进程出现异常的猜测.
按照trace文件,我的理解是备份服务器在备份完成后,没有立即释放某种资源(猜测是否考虑磁带释放或类似的原因),当其累计超过10个后,就提示不能执行备份.
这就解释了错误不固定出现的原因,当事务日志很少时,备份很快,错误就出现了,当事务日志多时,sybase实例备份间隔长,资源已经释放,错误就消失了.
基于上述理解,修改事务日志备份脚本,增加sleep 60后,错误消失.
下面是备份脚本,供大家参考. 其中sybase实例的获取是通过检查dataserver进程的启动参数来得到.
备份数据库的脚本:
#!/usr/bin/bash
# from process startup arguments to get all sybase server from current server
# before backup,remove db backup.
# wan qisheng
# 2014.7.10
# all database dump keep 14 days,so when dump database is doning,
# we have another backup before 7 days.
# wan qisheng
# 2014.10.11
. ~/.profile
backup_dir=$1
cd $backup_dir
for server in `ps -u $LOGNAME -o pid,comm|grep dataserver|awk '{print $1}'|xargs -l1 pargs -a |awk '{print $2}'|grep ^-s|sed 's/-s//g'`
do
date "+$server begin at %Y-%m-%d %H:%M:%S"
# remove backups before 14 days
find ${backup_dir} -name "${server}.*.db.dmp" -ctime +10 -exec rm {} \;
isql -Usa -Ppasswd -S${server} -w 1024 -b<<EOF >${server}.sql
set nocount on
select 'dump database '||name||
' to "${backup_dir}/${server}.'||name||'.'||
rtrim(convert(char(8),getdate(),112))||'.'||rtrim(convert(char(4),datepart(hh,getdate())*100 + datepart(mi,getdate())))||
'.db.dmp"'||char(10)||
'go'
from sysdatabases
go
EOF
isql -Usa -Ppasswd -S${server} -i ${server}.sql >${server}.log
date "+$server finish at %Y-%m-%d %H:%M:%S"
done
date "+ it's finished at %Y-%m-%d %H:%M:%S"
备份事务日志的脚本:
#!/usr/bin/bash
# from process startup arguments to get all sybase server from current server
# before backup,remove db backup.
# wan qisheng
# 2014.7.10
# all database dump keep 14 days,so when dump database is doning,
# we have another backup before 7 days.
# wan qisheng
# 2014.10.11
. ~/.profile
backup_dir=$1
cd $backup_dir
for server in `ps -u $LOGNAME -o pid,comm|grep dataserver|awk '{print $1}'|xargs -l1 pargs -a |awk '{print $2}'|grep ^-s|sed 's/-s//g'`
do
date "+$server begin at %Y-%m-%d %H:%M:%S"
# remove backups before 14 days
find ${backup_dir} -name "${server}.*.tran.dmp" -ctime +14 -exec rm {} \;
isql -Usa -Ppasswd -S${server} -w 1024 -b<<EOF >${server}.sql
set nocount on
select 'dump transaction '||name||
' to "${backup_dir}/${server}.'||name||'.'||
rtrim(convert(char(8),getdate(),112))||'.'||rtrim(convert(char(4),datepart(hh,getdate())*100 + datepart(mi,getdate())))||
'.tran.dmp"'||char(10)||
'go'
from sysdatabases
go
EOF
isql -Usa -Ppasswd -S${server} -i ${server}.sql >${server}.log
date "+$server finish at %Y-%m-%d %H:%M:%S"
# avoid the backup server error :The maximum number of site handlers, 10 has been exceeded
# after increase the sleep, the error not happened.
sleep 60
done
date "+ it's finished at %Y-%m-%d %H:%M:%S"