How Redshift differs from PostgreSQL

 

Amazon Redshift debuted in 2012 as the first cloud data warehouse, and remains the most popular one today. But Redshift wasn't developed in-house. Amazon invested $20 million in a company called ParAccel, and in return gained the license to use code from ParAccel Analytic Database (PADB) for Redshift. PADB was notable because it was a columnar database that ran on commodity hardware, which made it a natural choice as a basis for a cloud-based analytic database platform.

While that deal might have seemed worthwhile for ParAccel at the time, it worked out even better for Amazon in the long run. In 2015 Amazon CTO Werner Vogels called Redshift "the fastest-growing service in AWS, ever." Meanwhile, ParAccel was acquired by Actian in 2013, and PADB was renamed Actian Matrix.

PADB was itself based on PostgreSQL – so to some extent, Redshift is based on PostgreSQL – but "based on" leaves a lot of room for difference. If you're familiar with PostgreSQL features and syntax, how easy will it be to get used to Redshift?

First, there are architectural differences between Redshift and PostgreSQL:

  • Under the hood, PostgreSQL is a traditional row-oriented relational database, great for processing transactional data. Redshift is a ​​columnar database​​ better suited for analytics, and thus a more appropriate platform for a data warehouse.
  • In PostgreSQL a single database connection ​​cannot utilize more than one CPU​​, while Redshift is architected for parallel processing across multiple nodes.
  • When you load data into a Redshift table, Redshift distributes the rows of the table across nodes according to the table's ​​distribution style​​. One of the distribution styles is key distribution, in which the rows are distributed according to the values in a specified column. Redshift's documentation says, "The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together." PostgreSQL lacks distribution styles and distribution keys.
  • Redshift doesn't support indexes. Instead, each table has a sort key, which determines how rows are ordered when the data is loaded. When you insert, update, or copy data in a Redshift table, new rows get added to an unsorted region, and are sorted only when the table is ​​vacuumed​​ or ​​deep copied​​.
  • Redshift doesn't enforce primary key, foreign key, or uniqueness constraints, though ​​Amazon says​​ "primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity."

The architectural changes Amazon made to ​​Redshift​​ make it better able to handle large volumes of data for analytical queries. PostgreSQL can serve as a data warehouse for smaller volumes of data, but it can't match the performance of Redshift's column-oriented architecture. But the good news is that if you're familiar with PostgreSQL commands and concepts, learning Redshift isn't like learning a foreign language. It's more like the difference between dialects – the American and British versions of English, for example.

Differences in SQL

Both databases use SQL as their native language. While a lot of the two platforms' SQL syntax is the same, there are plenty of differences as well. The list of ​​Redshift SQL commands​​ differs from the list of ​​PostgreSQL commands​​, and even when both platforms implement the same command, their syntax is often different.

Per ​​Amazon's documentation​​, here are some of the major differences between Redshift and PostgreSQL SQL commands:

  • CREATE TABLE
    Redshift doesn't support tablespaces, table partitioning, inheritance, and certain constraints. The Redshift implementation of CREATE TABLE enables you to define the sort and distribution algorithms for tables to optimize parallel processing.
  • ALTER TABLE
    Redshift doesn't support ALTER COLUMN actions, and ADD COLUMN supports adding only one column in each ALTER TABLE statement.
  • COPY
    The Redshift COPY command is specialized to enable loading of data from Amazon S3 buckets and Amazon DynamoDB tables and to facilitate automatic compression.
  • INSERT, UPDATE, and DELETE
    Redshift doesn't support the WITH clause.
  • VACUUM
    The parameters for VACUUM are different between the two databases. For example, the default VACUUM operation in PostgreSQL reclaims space and makes it available for reuse. The default VACUUM operation in Redshift is VACUUM FULL, which reclaims disk space and resorts all rows.

Unsupported features and datatypes

Amazon lists many ​​PostgreSQL features​​ and ​​functions​​ that are not supported in Redshift. In addition to the aforementioned lack of support for constraints and indexes, the list of unsupported features includes stored procedures, triggers, value expressions, and inheritance.

Redshift ​​lacks support​​ for several PostgreSQL data types:

  • Arrays
  • BIT, BIT VARYING
  • BYTEA
  • Composite types
  • Date/time types INTERVAL and TIME
  • Enumerated types
  • Geometric types
  • JSON
  • Network address types
  • Numeric types SERIAL, BIGSERIAL, SMALLSERIAL, and MONEY
  • Object identifier types
  • Pseudo-types
  • Range types
  • Text search types
  • TXID_SNAPSHOT
  • UUID
  • XML

Redshift ​​does support​​ these data types:

Data Type

Aliases

Description

SMALLINT

INT2

Signed two-byte integer

INTEGER

INT, INT4

Signed four-byte integer

BIGINT

INT8

Signed eight-byte integer

DECIMAL

NUMERIC

Exact numeric of selectable precision

REAL

FLOAT4

Single precision floating-point number

DOUBLE PRECISION

FLOAT8, FLOAT

Double precision floating-point number

BOOLEAN

BOOL

Logical Boolean (true/false)

CHAR

CHARACTER, NCHAR, BPCHAR

Fixed-length character string

VARCHAR

CHARACTER VARYING, NVARCHAR, TEXT

Variable-length character string with a user-defined limit

DATE

 

Calendar date (year, month, day)

TIMESTAMP

TIMESTAMP WITHOUT TIME ZONE

Date and time (without time zone)

TIMESTAMPTZ

TIMESTAMP WITH TIME ZONE

Date and time (with time zone)

Conclusion

If you have SQL skills you developed from working with PostgreSQL, you'll be able to get by in ​​Amazon Redshift​​ pretty well – but you'll have to familiarize yourself with the differences between the two platforms. Redshift is optimized for analytical queries, and its commands and syntax reflect that. As long as you keep a command reference page bookmarked, you should be fine.