数据分析第一步,一定是先从数据库提取数据做基本的描述统计分析,整体查看做到心中有数,而80%的描述统计需要用SQL尝试取数后,或放到Excel或其他Tableau等工具,待基本了解数据整体现状之后,才会上模型上算法,去进一步挖掘数据价值。

不要忽略这第一步,实际工作中,它是极其高频的使用场景,80%的数据需求依赖于SQL解决,不会SQL的数据分析师, 只能站在数据仓库的门口望洋兴叹,因为自己手里没有"钥匙",没法进去盘点库存。

今儿大白话说《如何用SQL做数据分析》

1. 目的

用最简单的例子,达到80%的效果:

通过一个实例数据库,数据足够简单, 粗略演示下最基本的数据探查过程。真实的工作环境,数据环境可能会复杂些,我们一步一步来,以后会写更多文章来补充。

这篇文章,希望你能get到下面几点

  • 常用SQL练手
  • 数据探查思路
  • 隐藏价值提取

2. 数据

  • 网上共享Mysql实例数据库: employees, 以及安装方法

如果不想花时间搭建环境,我已经在云端搭建了数据学习环境,按照下面文章中第三部分“学习环境”的说明,申请远程访问权限:

2019年,数据民工的大白话规划

  • 数据库表结构



用sql做数据分析 如何用sql进行数据分析_sql 统计每月入职离职人数


mysql> show tables;+---------------------+| Tables_in_employees |+---------------------+| departments         || dept_emp            || dept_manager        || employees           || salaries            || titles              |+---------------------+# 部门表, 记录部门编号和部门名称CREATE TABLE `departments` (  `dept_no` char(4) NOT NULL,  `dept_name` varchar(40) NOT NULL,  PRIMARY KEY (`dept_no`),  UNIQUE KEY `dept_name` (`dept_name`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 部门员工表,记录各部门下属员工编号, 以及员工在该部门工作的起始和终止日期CREATE TABLE `dept_emp` (  `emp_no` int(11) NOT NULL,  `dept_no` char(4) NOT NULL,  `from_date` date NOT NULL,  `to_date` date NOT NULL,  PRIMARY KEY (`emp_no`,`dept_no`),  KEY `emp_no` (`emp_no`),  KEY `dept_no` (`dept_no`),  CONSTRAINT `dept_emp_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE,  CONSTRAINT `dept_emp_ibfk_2` FOREIGN KEY (`dept_no`) REFERENCES `departments` (`dept_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 部门经理表, 记录对应的部门经理的员工编号, 以及在该部门工作的起始和终止日期CREATE TABLE `dept_manager` (  `dept_no` char(4) NOT NULL,  `emp_no` int(11) NOT NULL,  `from_date` date NOT NULL,  `to_date` date NOT NULL,  PRIMARY KEY (`emp_no`,`dept_no`),  KEY `emp_no` (`emp_no`),  KEY `dept_no` (`dept_no`),  CONSTRAINT `dept_manager_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE,  CONSTRAINT `dept_manager_ibfk_2` FOREIGN KEY (`dept_no`) REFERENCES `departments` (`dept_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 员工信息表, 记录员工初始日期, 姓名, 性别和雇佣日期信息CREATE TABLE `employees` (  `emp_no` int(11) NOT NULL,  `birth_date` date NOT NULL,  `first_name` varchar(14) NOT NULL,  `last_name` varchar(16) NOT NULL,  `gender` enum('M','F') NOT NULL,  `hire_date` date NOT NULL,  PRIMARY KEY (`emp_no`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 薪水表, 记录员工某段日期的薪水CREATE TABLE `salaries` (  `emp_no` int(11) NOT NULL,  `salary` int(11) NOT NULL,  `from_date` date NOT NULL,  `to_date` date NOT NULL,  PRIMARY KEY (`emp_no`,`from_date`),  KEY `emp_no` (`emp_no`),  CONSTRAINT `salaries_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 岗位表, 记录员工的岗位信息, 即该岗位任职起始和终止日期信息CREATE TABLE `titles` (  `emp_no` int(11) NOT NULL,  `title` varchar(50) NOT NULL,  `from_date` date NOT NULL,  `to_date` date DEFAULT NULL,  PRIMARY KEY (`emp_no`,`title`,`from_date`),  KEY `emp_no` (`emp_no`),  CONSTRAINT `titles_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

3. 好奇心

几张简单的表, 放在你面前, 你会好奇那些问题? 我先抛砖:

数据能否反应公司发展趋势?

  • 公司各年工资支出
  • 公司各年入职和离职数量

数据能否反应公司对男女员工价值观?

  • 公司男女比例
  • 公司年龄分布

数据能否反应那种岗位涨薪最快?

  • 不同岗位, 每年收入增长
  • 有岗位升迁的员工, 每年收入增长

4. 满足好奇心

针对上面的好奇心,你可以先尝试自己完成数据提取和分析, 然后对比下面的思路

公司发展趋势

  • 公司各年工资支出 :线性递增
mysql> select year(from_date), sum(salary) as cost       from salaries       group by year(from_date);+-----------------+-------------+| year(from_date) | cost        |+-----------------+-------------+|            1985 |   972864875 ||            1986 |  2052895941 ||            1987 |  3156881054 ||            1988 |  4295598688 ||            1989 |  5454260439 ||            1990 |  6626146391 ||            1991 |  7798804412 ||            1992 |  9027872610 ||            1993 | 10215059054 ||            1994 | 11429450113 ||            1995 | 12638817464 ||            1996 | 13888587737 ||            1997 | 15056011781 ||            1998 | 16220495471 ||            1999 | 17360258862 ||            2000 | 17535667603 ||            2001 | 17507737308 ||            2002 | 10243347616 |+-----------------+-------------+18 rows in set (1.86 sec)

上面数据,放到Excel表里,很容易生成一个柱状图:


用sql做数据分析 如何用sql进行数据分析_sql 统计每月入职离职人数_02


公司各年入职和离职数量: 入大于出,但离职率在逐年递增

# 每年入职mysql> select year(tt.from_date) as from_date, count(*) as num       from       (select emp_no, min(from_date) as from_date, max(to_date) as to_date       from dept_emp       group by emp_no) tt       group by year(tt.from_date)       order by from_date;+-----------+-------+| from_date | num   |+-----------+-------+|      1985 | 18293 ||      1986 | 19878 ||      1987 | 19987 ||      1988 | 20103 ||      1989 | 20310 ||      1990 | 20042 ||      1991 | 19817 ||      1992 | 20203 ||      1993 | 19914 ||      1994 | 19894 ||      1995 | 19875 ||      1996 | 20176 ||      1997 | 20120 ||      1998 | 19828 ||      1999 | 19848 ||      2000 |  1736 |+-----------+-------+16 rows in set (0.71 sec)# 每年离职mysql> select year(tt.to_date) as to_date, count(*) as num       from       (select emp_no, min(from_date) as from_date, max(to_date) as to_date       from dept_emp       group by emp_no) tt       group by year(tt.to_date)       order by to_date;+---------+--------+| to_date | num    |+---------+--------+|    1985 |     89 ||    1986 |    374 ||    1987 |    668 ||    1988 |    905 ||    1989 |   1288 ||    1990 |   1594 ||    1991 |   1883 ||    1992 |   2317 ||    1993 |   2742 ||    1994 |   3135 ||    1995 |   3739 ||    1996 |   4330 ||    1997 |   5069 ||    1998 |   5941 ||    1999 |   6964 ||    2000 |   7610 ||    2001 |   7241 ||    2002 |   4011 ||    9999 | 240124 | ====> 表示当前在职+---------+--------+19 rows in set (0.57 sec)

通过上面数据,尝试计算下面两个指标:

  1. 每年在职员工数量
  2. 每年离职率


用sql做数据分析 如何用sql进行数据分析_sql 如何更新excel数据_03


是不是很容易,下面的数据,自己尝试画下图?

当然,具体绘图不是重点,重点是数据反应背后的现状

图只是让数据显得直观,更容易发现数据的特征

员工价值观

  • 公司男女和年龄交叉分布
mysql> select gender, round((to_days('2002-08-01') - to_days(birth_date))/365,0)  as ages,       count(*)       from       (select emp_no, birth_date, gender       from employees) t1       join       (select emp_no from titles where to_date='9999-01-01') t2 on t1.emp_no=t2.emp_no       group by gender, round((to_days('2002-08-01') - to_days(birth_date))/365,0);+--------+------+----------+| gender | ages | count(*) |+--------+------+----------+| M      |   38 |    10846 || M      |   39 |    11064 || M      |   40 |    11135 || M      |   41 |    11038 || M      |   42 |    10891 || M      |   43 |    11261 || M      |   44 |    11071 || M      |   45 |    11104 || M      |   46 |    11033 || M      |   47 |    11122 || M      |   48 |    11046 || M      |   49 |    11051 || M      |   50 |    11105 || M      |   51 |      347 || F      |   38 |     7139 || F      |   39 |     7311 || F      |   40 |     7329 || F      |   41 |     7381 || F      |   42 |     7513 || F      |   43 |     7465 || F      |   44 |     7469 || F      |   45 |     7149 || F      |   46 |     7342 || F      |   47 |     7428 || F      |   48 |     7515 || F      |   49 |     7325 || F      |   50 |     7415 || F      |   51 |      229 |+--------+------+----------+28 rows in set (0.86 sec)

岗位涨薪最快

  • 总体每年收入增长
mysql> select  round((max-min)/min/(years-1),2) as increaceByYear, count(*) as num       from       (select emp_no, count(*) as years,       max(salary) as max, min(salary) as min       from salaries       group by emp_no       having years>1) tt       group by round((max-min)/min/(years-1),2)  order by increaceByYear;+----------------+-------+| increaceByYear | num   |+----------------+-------+|           0.00 |  1705 ||           0.01 |  7925 ||           0.02 | 35345 ||           0.03 | 68993 ||           0.04 | 72082 ||           0.05 | 56231 ||           0.06 | 31329 ||           0.07 | 11807 ||           0.08 |  3868 ||           0.09 |  1502 ||           0.10 |   605 ||           0.11 |   223 |+----------------+-------+12 rows in set (1.13 sec)
  • 有岗位升迁的员工, 每年收入增长
mysql> select round((max-min)/min/(years-1),2) as increaceByYear, count(*) as num       from       (       select t1.emp_no, max(salary) as max, min(salary) as min, count(*) years       from       (select emp_no, salary       from salaries) t1       join       (select emp_no, count(*) as tt       from titles       group by emp_no       having tt>1) t2 on t1.emp_no=t2.emp_no       group by t1.emp_no       having years>1       ) tt       group by  round((max-min)/min/(years-1),2)  order by increaceByYear;+----------------+-------+| increaceByYear | num   |+----------------+-------+|           0.00 |    22 ||           0.01 |  1295 ||           0.02 | 16156 ||           0.03 | 35924 ||           0.04 | 37833 ||           0.05 | 29389 ||           0.06 | 14890 ||           0.07 |  4040 ||           0.08 |   645 ||           0.09 |    74 ||           0.10 |     2 |+----------------+-------+11 rows in set (1.91 sec)

各部门员工分布

mysql> select dept_no, count(*) from dept_emp group by dept_no;+---------+----------+| dept_no | count(*) |+---------+----------+| d001    |    20211 || d002    |    17346 || d003    |    17786 || d004    |    73485 || d005    |    85707 || d006    |    20117 || d007    |    52245 || d008    |    21126 || d009    |    23580 |+---------+----------+9 rows in set (0.32 sec)

其他:

员工忠诚度

  • 在岗时间
  • 在岗时间与岗位收入相关性