Introduction to SQL Window Functions / 1
Window functions are an extremely powerful powerful part of the SQL 2003 standard, supported by most modern releases of databases such as Oracle 8+, Postgres 9.1+, SQL Server 2005+ and others. Sadly neither SQLLite or MySql seem to support them yet, but if you are working with a database where they are available, do use them: they can make your life a lot easier. Generally, with window functions, you can write simpler and faster code than you would without.
Let’s start with a very simple schema and some data:
CREATE TABLE Sales (
article_id VARCHAR(10),
year INTEGER,
quantity INTEGER
);
INSERT INTO Sales VALUES ('Shirts', 2008, 63);
INSERT INTO Sales VALUES ('Shirts', 2009, 66);
INSERT INTO Sales VALUES ('Shirts', 2010, 68);
INSERT INTO Sales VALUES ('Shirts', 2011, 72);
INSERT INTO Sales VALUES ('Shirts', 2012, 69);
INSERT INTO Sales VALUES ('Shirts', 2013, 71);
INSERT INTO Sales VALUES ('Shirts', 2014, 74);
INSERT INTO Sales VALUES ('Jeans', 2008, 31);
INSERT INTO Sales VALUES ('Jeans', 2009, 33);
INSERT INTO Sales VALUES ('Jeans', 2010, 36);
INSERT INTO Sales VALUES ('Jeans', 2011, 33);
INSERT INTO Sales VALUES ('Jeans', 2012, 38);
INSERT INTO Sales VALUES ('Jeans', 2013, 41);
INSERT INTO Sales VALUES ('Jeans', 2014, 40);
You can play with this dataset on SQLFiddle.
Let’s start simple: let’s extract the average, minimum and maximum sales for each article type. This can be achieved with a simple GROUP BY
query:
SELECT
article_id,
AVG(quantity) as average,
MIN(quantity) as worst,
MAX(quantity) as best
FROM Sales
GROUP BY article_id
which returns:
ARTICLE_ID AVERAGE WORST BEST
Jeans 36 31 41
Shirts 69 63 74
What if we wanted to extract the year with the best result? Then the query becomes more complicated. For example, we could write it like this:
SELECT
article_id,
year,
quantity
FROM Sales s1
WHERE s1.quantity = ( SELECT MAX(quantity)
FROM Sales s2
WHERE s1.article_id = s2.article_id )
which returns:
ARTICLE_ID YEAR QUANTITY
Shirts 2014 74
Jeans 2013 41
Another option is to use the ALL
clause, and write the query like this:
SELECT
article_id,
year,
quantity
FROM Sales s1
WHERE s1.quantity >= ALL ( SELECT s2.quantity
FROM Sales s2
WHERE s1.article_id = s2.article_id )
The two queries above should have the same execution plan and performances.
Simple window functions Link to heading
Ok, so far so good. Let us now suppose you wanted to extract the sales for each year and article and the maximum for that article type. We can do it like this:
SELECT
article_id,
year,
quantity,
(SELECT MAX(s2.quantity)
FROM Sales s2
WHERE s2.article_id = s1.article_id) AS maxForArticle
FROM Sales s1
This query returns, as expected:
ARTICLE_ID YEAR QUANTITY MAXFORARTICLE
Shirts 2008 63 74
Shirts 2009 66 74
Shirts 2010 68 74
Shirts 2011 72 74
Shirts 2012 69 74
Shirts 2013 71 74
Shirts 2014 74 74
Jeans 2008 31 41
Jeans 2009 33 41
Jeans 2010 36 41
Jeans 2011 33 41
Jeans 2012 38 41
Jeans 2013 41 41
Jeans 2014 40 41
This is a situation where window functions can help. Let’s start by seeing the query:
SELECT
article_id,
year,
quantity,
MAX(quantity) OVER (PARTITION BY article_id) AS maxForArticle
FROM Sales
The syntax is: window function name (in this case MAX
), then the OVER
clause and a window definition. In this case, we want to take the maximum value of the field quantity
on the subset of records with the given article_id
, so we’ll PARTITION
the data by article_id
. We can use different functions; AVG
, MIN
, COUNT
and others that we’ll discuss later.
Window definitions can include a PARTITION
clause and/or an ORDER BY
clause. Note that we can use multiple windows in a query:
SELECT
article_id,
year,
quantity,
MAX(quantity) OVER (PARTITION BY article_id) AS maxByArticle,
MAX(quantity) OVER (PARTITION BY year) AS maxByYear
FROM Sales s1
ORDER BY article_id, year
In this case we have two windows, one that works by article_id
and one by year
.
Window functions, for what we saw so far, seem little more than just syntax, but actually they can be far more efficient than sub-selects. Databases have special support for window functions and can execute better plans for them.
LAG and LEAD functions Link to heading
Let’s now start with an harder request. Let’s suppose we want to extract the difference in sales between one year and the previous. We can use the functions LAG
or LEAD
to solve this problem. These window functions extract (respectively) the previous or next value in the given window.
Let’s take a look at this query:
SELECT
article_id,
year,
quantity,
LAG(quantity) OVER (PARTITION BY article_id
ORDER BY year) AS prevYear
FROM Sales s1
ORDER BY year, article_id
The data returned by this query is:
ARTICLE_ID YEAR QUANTITY PREVYEAR
Jeans 2008 31 (null)
Shirts 2008 63 (null)
Jeans 2009 33 31
Shirts 2009 66 63
Jeans 2010 36 33
Shirts 2010 68 66
Jeans 2011 33 36
Shirts 2011 72 68
Jeans 2012 38 33
Shirts 2012 69 72
Jeans 2013 41 38
Shirts 2013 71 69
Jeans 2014 40 41
Shirts 2014 74 71
We are asking for the value value (LAG
) of the given expression - in this case quantity
- on the previous record OVER
the given window - in this case PARTITION BY article_id ORDER BY year
. This is a partition that includes the sales of the same article as the current row. With the ORDER BY
clause, we have sorted the data by year
and so we know what the previous record is - the record for the previous year. Note that even though we have a different ORDER BY
in the outer SELECT
, the window function has no problem pulling the correct data.
To get the value in the next record, we can use LEAD
.
Let’s now solve the request above and extract the difference between the sales of one year from the previous:
SELECT
article_id,
year,
quantity,
quantity - LAG(quantity) OVER (PARTITION BY article_id
ORDER BY year)
AS diffFromPrev
FROM Sales s1
ORDER BY year, article_id
which gives this result:
ARTICLE_ID YEAR QUANTITY DIFFFROMPREV
Jeans 2008 31 (null)
Shirts 2008 63 (null)
Jeans 2009 33 2
Shirts 2009 66 3
Jeans 2010 36 3
Shirts 2010 68 2
Jeans 2011 33 -3
Shirts 2011 72 4
Jeans 2012 38 5
Shirts 2012 69 -3
Jeans 2013 41 3
Shirts 2013 71 2
Jeans 2014 40 -1
Shirts 2014 74 3
Finally, if you want to use a window function over all the data set, you can use the syntax OVER ()
.
Conclusions Link to heading
Window functions can not only solve lots of problems that would require writing very complex queries, or ad-hoc iterative code in the requesting application, but they can also often do it with higher performances. Go, learn and use them!
Update: part two is available.