sql没有足够的值
(DATA SCIENCE)
SQL is a powerful language. SQL is a part of most of the tech stacks you’ll work with. For a developer, the use of SQL might be limited to inserting and retrieving data in the database, but for data analysts, data scientists and data engineers, it is usually much more than that. SQL gives you direct access to the database — there’s a whole lot of analytics that can be done right there — without getting data out of the database and loading it into pandas or PySpark. Obviously, what you can do within the database is limited because of the resources.
SQL是一种强大的语言。 SQL是您将使用的大多数技术堆栈的一部分。 对于开发人员而言,SQL的使用可能仅限于在数据库中插入和检索数据,但是对于数据分析人员,数据科学家和数据工程师而言,SQL的用途远不止于此。 SQL使您可以直接访问数据库-可以在此处直接进行很多分析-无需将数据从数据库中取出并加载到pandas或PySpark中。 显然,由于资源的原因,您在数据库中可以执行的操作受到限制。
From what I have observed over the years, people who work with a statistical programming language like R, Julia or Python tend to do almost everything in that language whereas some of the stuff can be actually done more efficiently with SQL at times. Apart from the basic selects, inserts, updates, joins and subqueries, there are a lot of advanced features of SQL which can be used for data analysis that we don’t exploit often enough.
根据我多年来的观察,使用诸如R,Julia或Python之类的统计编程语言的人往往会使用该语言来执行几乎所有操作,而有些事情实际上有时可以用SQL更有效地完成。 除了基本的选择,插入,更新,联接和子查询外,还有很多SQL的高级功能可用于数据分析,但我们很少利用这些功能。
There’s a post on KDNuggets which says that it’s the last guide that you’d need for data analysis. Although it’s a well written guide but I think that it definitely is NOT the last guide you’d need for data analysis. You’ll need to know more. I’d say that the Medium post you’re reading right now is also not the last guide you’ll need to be great at SQL. This just talks about a few neglected, underused but powerful features of SQL. Let’s go ahead and go over some of them.
KDNuggets上有一篇文章,说这是您进行数据分析所需的最新指南 。 尽管这是一本写得很好的指南,但我认为它绝对不是您进行数据分析所需的最后指南。 您需要了解更多。 我想说的是,您现在正在阅读的中级帖子也不是您精通SQL的最新指南。 此处仅讨论了一些被忽略,未充分利用但功能强大SQL功能。 让我们继续研究其中的一些。
(Hierarchical Queries)
Enterprise relational databases like Oracle had started supporting storage and retrieval of hierarchical data long time ago. Before MySQL 8 was released, MySQL was probably one of the few databases which didn’t support a straightforward way of querying hierarchical data. I have had to refer this article by Mike Hillyer many times over the last couple of years while implementing hierarchical storage in MySQL. It’s a great read.
像Oracle这样的企业关系数据库很久以前就开始支持分层数据的存储和检索。 在发布MySQL 8之前,MySQL可能是少数几个不支持直接查询分层数据的数据库之一。 在MySQL中实施分层存储时,最近几年我不得不多次引用Mike Hillyer的文章 。 这是一本好书。
Hierarchical data is everywhere if you think about it — categories and sub-categories and further subcategories of products, organizational hierarchy, animal and plant species, family trees and so on. Normal SQL features aren’t enough to query hierarchical data efficiently as it would result in a lot of subqueries (and confusion). In MySQL 5.7 or earlier, you’d use something called session variables to do hierarchical queries and in MySQL 8 or later and in other databases, you’d use recursive common table expressions.
如果您考虑一下,分层数据无处不在-产品的类别和子类别以及其他子类别,组织层次结构,动植物种类,家谱等。 普通SQL功能不足以有效地查询分层数据,因为它会导致大量子查询(和混乱)。 在MySQL 5.7或更早的版本中,您将使用称为会话变量的内容进行分层查询,而在MySQL 8或更高版本中以及其他数据库中,则将使用递归公用表表达式。
Hierarchical data is everywhere — product categories & subcategories, organizational hierarchies, family trees etc.
层次结构数据无处不在-产品类别和子类别,组织层次结构,族谱等。
I’ll give you some context on this. An ex-colleague called me up one day and asked me about how to run a hierarchical query on MySQL 5.7 — until this version MySQL did not support common table expressions. So, here’s what the query would look like. Let’s now talk about CTEs 😄
我会给你一些背景。 一位前同事一天打电话给我,问我如何在MySQL 5.7上运行分层查询-直到此版本MySQL不支持公用表表达式。 因此,这就是查询的样子。 现在让我们谈谈CTE s
create database random;
drop table if exists random.organization_dimension;
create table random.organization_dimension as
select 1 organization_key,
1 organization_parent_key;
alter table random.organization_dimension change organization_parent_key organization_parent_key int(1) null;
truncate table random.organization_dimension;
select * from random.organization_dimension;
insert into random.organization_dimension
select 1, null union all
select 2, 1 union all
select 7, 1 union all
select 3, 2 union all
select 4, 2 union all
select 8, 7 union all
select 9, 7 union all
select 5, 4 union all
select 6, 4 union all
select 10, 9 union all
select 13, 9 union all
select 11, 10 union all
select 12, 10;
select * from random.organization_dimension;
alter table random.organization_dimension
add column organization_name varchar(20);
update random.organization_dimension set organization_name = concat('Firm',organization_key);
-- For given Firm ID = 7 as in the example.
select t.organization_key,
t.organization_parent_key,
t.organization_name
from random.organization_dimension t
where not exists (select 1 from random.organization_dimension t0
where t0.organization_parent_key = t.organization_key)
and t.organization_key >= 7
union
select t1.organization_key,
t1.organization_parent_key,
t1.organization_name
from random.organization_dimension t1
where t1.organization_key in
(select t.organization_parent_key
from random.organization_dimension t
where not exists (select 1 from random.organization_dimension t0
where t0.organization_parent_key = t.organization_key)
and t.organization_key >= 7)
order by 2;
select r2.*
from (select @_id as _id,
(select @_id := organization_parent_key
from random.organization_dimension
where organization_key = _id) as parent_id,
@level := @level + 1 as _level
from (select @_id := 12, @level := 0) vars, random.organization_dimension h
where @_id <> 0) r1
join random.organization_dimension r2 on r1._id = r2.organization_key
order by r1._level desc;(Recursive Common Table Expressions)
Fondly known as a CTE, this is a very powerful construct that is supported in all major relational databases (including most relational data warehouses). It essentially reduced the need for writing one query within another within another within another, i.e., nested subqueries. A CTE is essentially a view defined within your query with a scope of that query. The CTE doesn’t exist as a database object, it exists as part of the query. CTEs are also extremely good for readability (if done well).
这是众所周知的CTE,是一种非常强大的构造,所有主要的关系数据库(包括大多数关系数据仓库)都支持该构造。 从本质上讲,它减少了在另一个查询(另一个嵌套子查询)中的另一个查询中编写一个查询的需求。 CTE本质上是在查询中定义的具有该查询范围的视图。 CTE不作为数据库对象存在,而是作为查询的一部分存在。 CTE的可读性也非常好(如果做得好)。
A CTE is essentially a view defined within your query with a scope of that query.
CTE本质上是在查询中定义的具有该查询范围的视图。
I wrote a piece about generating random data using SQL a while back and I have used CTEs in some of the queries there. You can go have a quick look here. Also a recursive CTE is just like a recursive function or method that calls itself. Imagine a tree data structure where you start parsing and keep parsing till you get to a leaf node. That’s recursivity or recursiveness of the data queried by using recursive CTEs.
不久前,我写了一篇有关使用SQL生成随机数据的文章,在那里我在某些查询中使用了CTE。 您可以在这里快速浏览。 另外,递归CTE就像调用自身的递归函数或方法一样。 想象一下一个树数据结构,在该结构中您开始进行解析并一直进行解析,直到到达叶节点为止。 这就是使用递归CTE查询的数据的递归性或递归性。
with recursive series as (
select 1 as id union all
select id + 1 as id
from series
where id < 100),
cities as (select 'Santa Clara' city union all
select 'Los Angeles' union all
select 'Santa Clarita' union all
select 'San Bernardino' union all
select 'Alameda' union all
select 'San Mateo' union all
select 'Santa Barbara'
),
first_names as (select 'John' first_name union all
select 'James' first_name union all
select 'David' first_name union all
select 'Jeremy' first_name union all
select 'Ron' first_name union all
select 'Katie' first_name union all
select 'Nikita' first_name union all
select 'Rachel' first_name union all
select 'Tom' first_name
),
last_names as (select 'Smith' last_name union all
select 'Johnson' last_name union all
select 'Williams' last_name union all
select 'Brown' last_name union all
select 'Jones' last_name union all
select 'Miller' last_name union all
select 'Davis' last_name union all
select 'Wilson' last_name union all
select 'West' last_name
)
select id, user_id, first_name, last_name
dob, city, salary - mod(salary, 100) salary
from (select id,
substring(md5(rand()),1,20) user_id,
(select first_name from first_names order by rand() limit 1) first_name,
(select last_name from last_names order by rand() limit 1) last_name,
date(concat_ws('-',(floor(1919+rand()*100)),
(floor(1+rand()*12)),
(floor(1+rand()*28))
)
) dob,
(select city from cities order by rand() limit 1) city,
floor((rand() * (120000 + 1)) + 35000) salary
from series) as t;(Window Functions)
The most commonly used analytic functions in SQL are aggregate functions. While aggregate functions in SQL work on the whole data set or parts of the data set in conjunction with the GROUP BY clause, window functions extend the functionality of aggregate functions to do more than just calculate basic aggregates. Window functions provide the ability to second order aggregate & summary operations on the dataset, and on parts of the dataset, optionally with moving windows and other interesting features. This gives us the chance to calculate things like
SQL中最常用的分析函数是聚合函数。 虽然SQL中的聚合函数与GROUP BY子句一起在整个数据集或部分数据集上工作,但窗口函数扩展了聚合函数的功能,而不仅仅是计算基本聚合。 窗口函数提供了对数据集以及部分数据集进行二阶聚合和汇总操作的功能,还可以选择移动窗口和其他有趣的功能。 这使我们有机会计算类似
- moving average of sales over the last seven working days
- running total of revenue month on month
This gives us great power to analyse data within SQL without taking it outside to be processed in pandas or PySpark.
这使我们拥有强大的能力来分析SQL中的数据,而无需将其带到大熊猫或PySpark中进行处理。
select ,
row_number() over () rn,
rank() over () rnk,
dense_rank() over () dns_rnk,
lead() over () lead_name,
lag() over () lag_name,
first_value() over () fv_name,
last_value() over () lv_name,
ntile(4) over () quartile,
ntile(5) over () quintile
from devstronomy.planet as p;Here’s a great in-depth piece about Window Functions in SQL. Do give it a read if you’re interested.
这是有关SQL中的Window函数的一篇很棒的深入文章。 如果您有兴趣,请阅读。
(Procedural Language)
From an earlier article I wrote about the many flavours of SQL —
在较早的文章中,我写了关于SQL的多种风格的文章-
One of the measures of how powerful your flavour of SQL is to go through the full feature set that the database has to offer — and find out whether your flavour of SQL is Turing Complete. But even a Turing complete SQL cannot guarantee handling specific use cases that the query DSL allows but the architecture doesn’t. For this reason, so many flavours of databases/SQL exist.
衡量SQL风格有多强大的一种方法是浏览数据库必须提供的全部功能集-并确定 SQL 风格是否 为Turing Complete 。 但是,即使是图灵完整SQL也不能保证处理查询DSL允许但架构不允许的特定用例。 因此,存在许多种数据库/ SQL。
The procedural language extension to SQL provided by some of the major relational databases make them Turning complete languages. Some of the extensions are PL/SQL, T-SQL and plgsql. A procedural language lets you access a lot of constructs available in a full fledged programming language, natively within the database itself. For example, in the procedural version of SQL, you’ll be able to write loop constructs, you’ll be able to write structured programs using functions & procedures, you’ll have access to a lot of data types and data structures and you’ll be able to do object-oriented programming.
一些主要的关系数据库提供的对SQL的过程语言扩展使它们成为Turning完整的语言。 一些扩展是PL / SQL,T-SQL和plgsql。 程序语言使您可以在数据库本身内部访问以成熟的编程语言提供的许多构造。 例如,在SQL的过程版本中,您将能够编写循环结构,能够使用函数和过程来编写结构化程序,可以访问许多数据类型和数据结构,并且就能进行面向对象的编程。
You can imagine a lot can be done using a full programming language’s features right within SQL with direct access to data without pushing it to an external system. Doing that not only removes the additional layer, it also saves a lot of network time spent on moving data around.
您可以想象在SQL中使用完整的编程语言功能直接访问数据而无需将其推送到外部系统可以完成很多工作。 这样做不仅消除了额外的层,还节省了在移动数据上花费的大量网络时间。
(Conclusion)
There are many more superpowerful features of SQL that we could talk about but in my experience, these were some of them which I felt were underused by a lot of people (and teams) I worked with and observed — and these included database engineers too!
我们可以讨论SQL的许多其他超强大功能,但是根据我的经验,我觉得其中的一些功能被我与之合作并观察到的很多人(和团队)没有充分利用-其中也包括数据库工程师!
翻译自: https://towardsdatascience.com/4-advanced-sql-features-you-havent-used-enough-919f154ff530
















