How Redshift differs from PostgreSQL

wx5bcd90b347f01 2021-10-25 14:59:14

Amazon Redshift debuted in 2012 as the first cloud data warehouse, and remains the most popular one today. But Redshift wasn't developed in-house. Amazon invested $20 million in a company called ParAccel, and in return gained the license to use code from ParAccel Analytic Database (PADB) for Redshift. PADB was notable because it was a columnar database that ran on commodity hardware, which made it a natural choice as a basis for a cloud-based analytic database platform.

While that deal might have seemed worthwhile for ParAccel at the time, it worked out even better for Amazon in the long run. In 2015 Amazon CTO Werner Vogels called Redshift "the fastest-growing service in AWS, ever." Meanwhile, ParAccel was acquired by Actian in 2013, and PADB was renamed Actian Matrix.

PADB was itself based on PostgreSQL – so to some extent, Redshift is based on PostgreSQL – but "based on" leaves a lot of room for difference. If you're familiar with PostgreSQL features and syntax, how easy will it be to get used to Redshift?

First, there are architectural differences between Redshift and PostgreSQL:

Under the hood, PostgreSQL is a traditional row-oriented relational database, great for processing transactional data. Redshift is a columnar database better suited for analytics, and thus a more appropriate platform for a data warehouse.
PostgreSQL runs on a single node, and before version 9.6 introduced parallel query a single database connection could not use more than one CPU core. Redshift is architected for parallel processing across multiple nodes.
When you load data into a Redshift table, Redshift distributes the rows of the table across nodes according to the table's distribution style. One of the distribution styles is key distribution, in which the rows are distributed according to the values in a specified column. Redshift's documentation says, "The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together." PostgreSQL lacks distribution styles and distribution keys.
Redshift doesn't support indexes. Instead, a table can define a sort key, which determines how rows are ordered when the data is loaded. When you insert, update, or copy data in a Redshift table, new rows are added to an unsorted region and are sorted only when the table is vacuumed or deep copied.
Redshift doesn't enforce primary key, foreign key, or uniqueness constraints, though Amazon says "primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity."
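
Key distribution is easier to picture with a toy model. The sketch below is only an illustration in Python: the slice count, the table and column names, and the use of CRC32 as the hash are all assumptions, and it does not reproduce Redshift's actual placement logic. It shows why two tables distributed on the same joining column end up with matching values on the same slice.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical slice count, chosen only for illustration


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice with a stable hash.

    CRC32 is a stand-in; it is not Redshift's real hash function.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute(rows, dist_key, num_slices=NUM_SLICES):
    """Group rows by the slice their distribution-key value hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[dist_key], num_slices)].append(row)
    return slices


# Two hypothetical tables, both distributed on customer_id.
customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(1, 7)]
orders = [{"order_id": 100 + i, "customer_id": (i % 6) + 1} for i in range(12)]

customer_slices = distribute(customers, "customer_id")
order_slices = distribute(orders, "customer_id")

# Both tables hash the same column, so each order lands on the slice that
# already holds its customer and the join needs no cross-slice lookups.
for slice_id, slice_orders in sorted(order_slices.items()):
    local_ids = {c["customer_id"] for c in customer_slices[slice_id]}
    assert all(o["customer_id"] in local_ids for o in slice_orders)
```

If the two tables were distributed on different columns, the final check would generally fail, which is the situation a shared distribution key is meant to avoid.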

The architectural changes Amazon made to Redshift make it better able to handle large volumes of data for analytical queries. PostgreSQL can serve as a data warehouse for smaller volumes of data, but it can't match the performance of Redshift's column-oriented architecture. But the good news is that if you're familiar with PostgreSQL commands and concepts, learning Redshift isn't like learning a foreign language. It's more like the difference between dialects – the American and British versions of English, for example.
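
To see why a column-oriented layout suits analytical queries, consider a rough Python sketch. The rows and column names are made up, and in-memory lists are only a stand-in for on-disk storage; the point is that an aggregate over one column touches far less data when each column is stored together.

```python
# Hypothetical sales rows; the column names are made up for illustration.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 210.0},
]


def to_columns(row_store):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in row_store] for name in row_store[0]}


columns = to_columns(rows)

# Row-oriented access: every whole row is visited to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column-oriented access: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns
```

Both totals agree; what differs is how much data has to be scanned to compute them, and that gap grows with the number of columns in the table.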

Differences in SQL

Both databases use SQL as their native language. While a lot of the two platforms' SQL syntax is the same, there are plenty of differences as well. The list of Redshift SQL commands differs from the list of PostgreSQL commands, and even when both platforms implement the same command, their syntax is often different.

Per Amazon's documentation, here are some of the major differences between Redshift and PostgreSQL SQL commands:

CREATE TABLE
Redshift doesn't support tablespaces, table partitioning, inheritance, or certain constraints. The Redshift implementation of CREATE TABLE lets you define the sort and distribution algorithms for tables to optimize parallel processing.
ALTER TABLE
Redshift doesn't support ALTER COLUMN actions, and ADD COLUMN supports adding only one column in each ALTER TABLE statement.
COPY
The Redshift COPY command is specialized to enable loading of data from Amazon S3 buckets and Amazon DynamoDB tables and to facilitate automatic compression.
INSERT, UPDATE, and DELETE
Redshift doesn't support the WITH clause.
VACUUM
The parameters for VACUUM differ between the two databases. For example, the default VACUUM operation in PostgreSQL reclaims space and makes it available for reuse. The default VACUUM operation in Redshift is VACUUM FULL, which reclaims disk space and re-sorts all rows.
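
What VACUUM FULL does in Redshift can be pictured with a small Python toy model. This is a sketch under simplifying assumptions, not Redshift's implementation: the ToyTable class, its method names, and the "_deleted" marker are all hypothetical. New rows go to an unsorted region, deleted rows are only marked, and the vacuum step both drops the marked rows and re-sorts everything on the sort key.

```python
class ToyTable:
    """Toy model of a table with a sorted region and an unsorted region."""

    def __init__(self, sort_key):
        self.sort_key = sort_key
        self.sorted_region = []
        self.unsorted_region = []

    def insert(self, row):
        # New rows are appended to the unsorted region, not merged in place.
        self.unsorted_region.append(row)

    def delete(self, predicate):
        # Deleted rows are only marked; their space is not reclaimed yet.
        for row in self.sorted_region + self.unsorted_region:
            if predicate(row):
                row["_deleted"] = True

    def vacuum_full(self):
        # Drop marked rows (reclaim space), then re-sort all remaining rows.
        live = [r for r in self.sorted_region + self.unsorted_region
                if not r.get("_deleted")]
        self.sorted_region = sorted(live, key=lambda r: r[self.sort_key])
        self.unsorted_region = []


table = ToyTable(sort_key="event_time")
for t in (30, 10, 20):
    table.insert({"event_time": t})
table.delete(lambda r: r["event_time"] == 10)
table.vacuum_full()
```

After the vacuum step the unsorted region is empty and the surviving rows sit in sort-key order, which is the state that lets range scans on the sort key skip data.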

Unsupported features and datatypes

Amazon lists many PostgreSQL features and functions that are not supported in Redshift. In addition to the aforementioned lack of support for constraints and indexes, the list of unsupported features includes stored procedures, triggers, value expressions, and inheritance.
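
Because Redshift does not enforce uniqueness, the check has to happen somewhere in the load process. One possible approach, sketched here in Python with a hypothetical batch and a hypothetical helper name, is to look for repeated key values before the data is loaded.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that appear more than once in rows."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical batch about to be loaded; user_id is the declared primary key.
batch = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "b@example.com"},
    {"user_id": 1, "email": "a+dup@example.com"},
]

duplicates = find_duplicate_keys(batch, ["user_id"])
```

Here duplicates contains the single key (1,), so the batch would need to be deduplicated or rejected before loading. In PostgreSQL the primary key constraint would have raised an error on the third row instead.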

Redshift lacks support for several PostgreSQL data types:

Arrays
BIT, BIT VARYING
BYTEA
Composite types
Date/time types INTERVAL and TIME
Enumerated types
Geometric types
JSON
Network address types
Numeric types SERIAL, BIGSERIAL, SMALLSERIAL, and MONEY
Object identifier types
Pseudo-types
Range types
Text search types
TXID_SNAPSHOT
UUID
XML

Redshift does support these data types:

Data Type	Aliases	Description
SMALLINT	INT2	Signed two-byte integer
INTEGER	INT, INT4	Signed four-byte integer
BIGINT	INT8	Signed eight-byte integer
DECIMAL	NUMERIC	Exact numeric of selectable precision
REAL	FLOAT4	Single precision floating-point number
DOUBLE PRECISION	FLOAT8, FLOAT	Double precision floating-point number
BOOLEAN	BOOL	Logical Boolean (true/false)
CHAR	CHARACTER, NCHAR, BPCHAR	Fixed-length character string
VARCHAR	CHARACTER VARYING, NVARCHAR, TEXT	Variable-length character string with a user-defined limit
DATE		Calendar date (year, month, day)
TIMESTAMP	TIMESTAMP WITHOUT TIME ZONE	Date and time (without time zone)
TIMESTAMPTZ	TIMESTAMP WITH TIME ZONE	Date and time (with time zone)
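
When planning a migration, it can help to scan a PostgreSQL schema for types that fall outside this table. The following Python sketch assumes the supported names and aliases listed above are the complete set; the helper names and the sample table definition are hypothetical, and the type-name parsing is deliberately naive.

```python
# Supported type names and aliases, copied from the table above.
SUPPORTED = {
    "SMALLINT", "INT2", "INTEGER", "INT", "INT4", "BIGINT", "INT8",
    "DECIMAL", "NUMERIC", "REAL", "FLOAT4", "DOUBLE PRECISION", "FLOAT8",
    "FLOAT", "BOOLEAN", "BOOL", "CHAR", "CHARACTER", "NCHAR", "BPCHAR",
    "VARCHAR", "CHARACTER VARYING", "NVARCHAR", "TEXT", "DATE", "TIMESTAMP",
    "TIMESTAMP WITHOUT TIME ZONE", "TIMESTAMPTZ", "TIMESTAMP WITH TIME ZONE",
}


def normalize(type_name):
    """Upper-case a type name and drop a trailing length/precision suffix."""
    return type_name.split("(")[0].strip().upper()


def unsupported_columns(columns):
    """Return {column: type} for declared types not found in SUPPORTED."""
    return {name: t for name, t in columns.items()
            if normalize(t) not in SUPPORTED}


# Hypothetical PostgreSQL table definition.
pg_columns = {
    "id": "uuid",
    "payload": "json",
    "price": "numeric(10,2)",
    "created_at": "timestamp with time zone",
    "title": "varchar(255)",
}

flagged = unsupported_columns(pg_columns)
```

In this example flagged holds the id and payload columns, whose uuid and json types appear in the unsupported list, while the numeric, timestamp, and varchar columns pass.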

Conclusion

If you have SQL skills you developed from working with PostgreSQL, you'll be able to get by in Amazon Redshift pretty well – but you'll have to familiarize yourself with the differences between the two platforms. Redshift is optimized for analytical queries, and its commands and syntax reflect that. As long as you keep a command reference page bookmarked, you should be fine.