Introduction to SQL Window Functions / 1
Window functions are an extremely powerful powerful part of the SQL 2003 standard, supported by most modern releases of databases such as Oracle 8+, Postgres 9.1+, SQL Server 2005+ and others. Sadly neither SQLLite or MySql seem to support them yet, but if you are working with a database where they are available, do use them: they can make your life a lot easier. Generally, with window functions, you can write simpler and faster code than you would without.
Let’s start with a very simple schema and some data:
CREATE TABLE Sales ( article_id VARCHAR(10), year INTEGER, quantity INTEGER ); INSERT INTO Sales VALUES ('Shirts', 2008, 63); INSERT INTO Sales VALUES ('Shirts', 2009, 66); INSERT INTO Sales VALUES ('Shirts', 2010, 68); INSERT INTO Sales VALUES ('Shirts', 2011, 72); INSERT INTO Sales VALUES ('Shirts', 2012, 69); INSERT INTO Sales VALUES ('Shirts', 2013, 71); INSERT INTO Sales VALUES ('Shirts', 2014, 74); INSERT INTO Sales VALUES ('Jeans', 2008, 31); INSERT INTO Sales VALUES ('Jeans', 2009, 33); INSERT INTO Sales VALUES ('Jeans', 2010, 36); INSERT INTO Sales VALUES ('Jeans', 2011, 33); INSERT INTO Sales VALUES ('Jeans', 2012, 38); INSERT INTO Sales VALUES ('Jeans', 2013, 41); INSERT INTO Sales VALUES ('Jeans', 2014, 40);
You can play with this dataset on SQLFiddle.
Let’s start simple: let’s extract the average, minimum and maximum sales for each article type. This can be achieved with a simple
GROUP BY query:
SELECT article_id, AVG(quantity) as average, MIN(quantity) as worst, MAX(quantity) as best FROM Sales GROUP BY article_id
ARTICLE_ID AVERAGE WORST BEST Jeans 36 31 41 Shirts 69 63 74
What if we wanted to extract the year with the best result? Then the query becomes more complicated. For example, we could write it like this:
SELECT article_id, year, quantity FROM Sales s1 WHERE s1.quantity = ( SELECT MAX(quantity) FROM Sales s2 WHERE s1.article_id = s2.article_id )
ARTICLE_ID YEAR QUANTITY Shirts 2014 74 Jeans 2013 41
Another option is to use the
ALL clause, and write the query like this:
SELECT article_id, year, quantity FROM Sales s1 WHERE s1.quantity >= ALL ( SELECT s2.quantity FROM Sales s2 WHERE s1.article_id = s2.article_id )
The two queries above should have the same execution plan and performances.
Simple window functions
Ok, so far so good. Let us now suppose you wanted to extract the sales for each year and article and the maximum for that article type. We can do it like this:
SELECT article_id, year, quantity, (SELECT MAX(s2.quantity) FROM Sales s2 WHERE s2.article_id = s1.article_id) AS maxForArticle FROM Sales s1
This query returns, as expected:
ARTICLE_ID YEAR QUANTITY MAXFORARTICLE Shirts 2008 63 74 Shirts 2009 66 74 Shirts 2010 68 74 Shirts 2011 72 74 Shirts 2012 69 74 Shirts 2013 71 74 Shirts 2014 74 74 Jeans 2008 31 41 Jeans 2009 33 41 Jeans 2010 36 41 Jeans 2011 33 41 Jeans 2012 38 41 Jeans 2013 41 41 Jeans 2014 40 41
This is a situation where window functions can help. Let’s start by seeing the query:
SELECT article_id, year, quantity, MAX(quantity) OVER (PARTITION BY article_id) AS maxForArticle FROM Sales
The syntax is: window function name (in this case
MAX), then the
OVER clause and a window definition. In this case, we want to take the maximum value of the field
quantity on the subset of records with the given
article_id, so we’ll
PARTITION the data by
article_id. We can use different functions;
COUNT and others that we’ll discuss later.
Window definitions can include a
PARTITION clause and/or an
ORDER BY clause. Note that we can use multiple windows in a query:
SELECT article_id, year, quantity, MAX(quantity) OVER (PARTITION BY article_id) AS maxByArticle, MAX(quantity) OVER (PARTITION BY year) AS maxByYear FROM Sales s1 ORDER BY article_id, year
In this case we have two windows, one that works by
article_id and one by
Window functions, for what we saw so far, seem little more than just syntax, but actually they can be far more efficient than sub-selects. Databases have special support for window functions and can execute better plans for them.
LAG and LEAD functions
Let’s now start with an harder request. Let’s suppose we want to extract the difference in sales between one year and the previous. We can use the functions
LEAD to solve this problem. These window functions extract (respectively) the previous or next value in the given window.
Let’s take a look at this query:
SELECT article_id, year, quantity, LAG(quantity) OVER (PARTITION BY article_id ORDER BY year) AS prevYear FROM Sales s1 ORDER BY year, article_id
The data returned by this query is:
ARTICLE_ID YEAR QUANTITY PREVYEAR Jeans 2008 31 (null) Shirts 2008 63 (null) Jeans 2009 33 31 Shirts 2009 66 63 Jeans 2010 36 33 Shirts 2010 68 66 Jeans 2011 33 36 Shirts 2011 72 68 Jeans 2012 38 33 Shirts 2012 69 72 Jeans 2013 41 38 Shirts 2013 71 69 Jeans 2014 40 41 Shirts 2014 74 71
We are asking for the value value (
LAG) of the given expression - in this case
quantity - on the previous record
OVER the given window - in this case
PARTITION BY article_id ORDER BY year. This is a partition that includes the sales of the same article as the current row. With the
ORDER BY clause, we have sorted the data by
year and so we know what the previous record is - the record for the previous year. Note that even though we have a different
ORDER BY in the outer
SELECT, the window function has no problem pulling the correct data.
To get the value in the next record, we can use
Let’s now solve the request above and extract the difference between the sales of one year from the previous:
SELECT article_id, year, quantity, quantity - LAG(quantity) OVER (PARTITION BY article_id ORDER BY year) AS diffFromPrev FROM Sales s1 ORDER BY year, article_id
which gives this result:
ARTICLE_ID YEAR QUANTITY DIFFFROMPREV Jeans 2008 31 (null) Shirts 2008 63 (null) Jeans 2009 33 2 Shirts 2009 66 3 Jeans 2010 36 3 Shirts 2010 68 2 Jeans 2011 33 -3 Shirts 2011 72 4 Jeans 2012 38 5 Shirts 2012 69 -3 Jeans 2013 41 3 Shirts 2013 71 2 Jeans 2014 40 -1 Shirts 2014 74 3
Finally, if you want to use a window function over all the data set, you can use the syntax
Window functions can not only solve lots of problems that would require writing very complex queries, or ad-hoc iterative code in the requesting application, but they can also often do it with higher performances. Go, learn and use them!
Update: part two is available.