7.统计技术
(1)计算描述性统计
 整体特征:
 观察的数量、总和及其范围(最小值和最大值)
 集中趋势的度量,例如均值,中位数和众数
 变化的度量,例如标准偏差和方差
 统计信息:
 中位数:
 通过查询统计观察值的数量,根据这个数量,可以确定中位数的计算需要一个值还是两个值,以及在有序的观察值集合内它们排在第几个
 通过包含order by子句的查询,对观察值进行排序,并利用limit子句提取中间的一个或两个值
 .如果只有一个中间值,那么它就是中位数,如果中间值有两个,那么需要求它们的平均值
 奇数个数:select score from t order by score limit 18,1
 偶数个数:select score from t order by score limit 18,2
 众数:select score,count(score) as frequency from testscore group by score order by frequency desc
 合计函数:
 均值:mean()
 方差:
 总体方差:VAR_POP()
 抽样方差:VAR_SAMP()
 标准差:
 总体标准差:STDDEV_POP()
 抽样标准差:STDDEV_SAMP ()
 (2)分组描述性统计
 group by
 (3)生成频率分布
 根据数据集的汇总信息,导出频率分布
 频数分布:select score,count(score) as counts from testscore group by score
 百分比代替计数生成相对频率分布:set @n=(select count(score) from testscore);select score,(count(score)100)/@n as percent from testscore group by score;
 (4)统计缺失的值
 统计集合中null值的数量
 count()统计总行数,count(score)统计没有缺失的分数,两个值之差即为缺失分数的数量,并且通过与总数的差可以计算缺失分数的百分比:
 select count() as ‘n(total)’, count(score) as ‘n(nonmissing)’, count()-count(score) as ‘n(missing)’,((count()-count(score))100/count() as ‘%missing’) from t;
 使用sum(isnull(score))直接计算,如果参数为null,则isnull()函数返回1,否则返回0:select count() as ‘n(total)’,count(score) as ‘n(nonmissing)’,sum(isnull(score)) as ‘n(missing)’,(sum(isnull(score))100)/count() as ‘%missing’ from t;
 (5)计算线性回归或相关系数
 使用汇总函数来计算所需的统计结果
 计算回归方程所需的各项,包括观察数量,每个变量的均值、总和、平方和,以及每个变量的乘积之和
 select
 @n:=count(score) as N,
 @meanX:=avg(age) as ‘X mean’,
 @sumX:=SUM(age) as ‘X sum’,
 @sumXX:=SUM(ageage) as ‘X sum of squares’,
 @meanY:=AVG(score) as ‘Y mean’,
 @sumY:=SUM(score) as ‘Y sum’,
 @sumYY:=SUM(scorescore) as ‘Y sum of squares’,
 @sumXY:=SUM(agescore) as 'XY sum’
 from testscore\G
 根据上述各项,计算回归的斜率和截距
 set @b:=(@n*@sumXY-@sumX*@sumY)/(@n*@sumXY-@sumX);
 set @a:=(@meanY-@b*@meanX);
 select @b as slope,@a as intercept;
 那么回归方程是:
 select concat(‘Y=’,@b,‘X+’,@a) as ‘least-squares regression’;
 计算相关系数:
 select (@nsumXY-@sumX@sumY)/SQRT((@n*@sumXY-@sumX*@sumX)(@n@sumXY-@sumY*@sumY)) as correlation
 (6)生成随机数
 使用rand()函数,rand()函数可以产生0到1之间的随机数
 select rand()
 (7)随机排序
 order by rand()
 select name from t order by rand();
 (8)随机选择数据
 随机化值,然后选择第一个(如果需要多个值,则选择前几个):order by rand(),加入limit 1,就可以选择第一个值
 (9)计算相邻两行之差
 利用表的自连接,获取相邻两个行并计算它们之间的差值:
 select t1.id as id1,t2.id as id2,
 t2.date,
 t1.ab as ab1,t2.ab as ab2,
 t1.h as h1,t2.h as h2,
 t2.ab-t1.ab as abdiff,
 t2.h-t1.h as hdiff,
 truncate(ifnull((t2.h-t1.h)/(t2.ab-t1.ab),0),3) as ba
 from player_states as t1 inner join player_states as t2
 on t1.id+1=t2.id
 order by t1.id;
 (10)查找累计总和与移动平均值
 利用表的自连接生成每个测量点连续的观察值集合,然后使用合计函数处理每组值,以计算总和或平均值
 select t1.stage,t1.km,sum(t2.km) as ‘cum.km’
 from marathon as t1 inner join marathon as t2
 on t1.stage>=t2.stage
 group by t1.stage;
 (11)指定排名
 决定排名的方法,然后将值按所需顺序排列,并按照排名方法指定排名
 简单地用行号给有序集合中的值指定排名:
 set @rownum:=0;
 select @rownum:=@rownum+1 as rank,score from t order by score desc;
 这种排名没有考虑到相同名次(即多个相同的值)的可能性
 仅在值改变时提升排名(相同分数排名相同):
 set @rank=0,@prev_val=NULL;
 select @rank:=if(@prev_val=score,@rank,@rank+1) as rank,@prev_val:=score as score from t order by score desc;
 按行号对值进行排名,并在遇到相同值的情况下,将所有相同值的排名都设置为第一个值的行号:
 set @rownum=0,@rank=0,@prev_val=NULL
 select @rownum:=@rownum+1 as row,
 @rank:=if(@prev_val<>score,@rownum,@rank) as rank,
 @prev_val:=score as score
 from t order by score desc;