Land Technologies
< all posts

Postgres SQL Distinct On

Posted by Vijay Patel on June 29, 2021 · 4 min read datasqlpostgres

The data team at LandTech were recently looking at the Postgres DISTINCT ON function for a particular data transformation in our pipelines. Fair warning, DISTINCT ON is Postgres specific function - in other databases, the window partition function would be used instead. In the article we have included both versions, and explain window partition functions further in a separate post.

Take an example data question: I want to find the orders that a specific type of item was most recently added too. So in the example below, Milk was most recently added to order_id: 4, whereas Bread was most recently added to order_id: 1.

 id | order_id | order_item_id | item_id | item_short_name | item_added_date 
----+----------+---------------+---------+-----------------+-----------------
  1 |        1 |             1 |       1 | Toothpaste      | 2020-06-01
  2 |        1 |             2 |       1 | Toothpaste      | 2020-06-01
  3 |        1 |             3 |       2 | Bread           | 2020-06-01
  4 |        1 |             4 |       3 | Milk            | 2020-06-01
  5 |        2 |             1 |       1 | Toothpaste      | 2020-06-05
  6 |        3 |             1 |       2 | Bread           | 2020-05-30
  7 |        4 |             1 |       3 | Milk            | 2020-06-15

First go

Our natural inclination was to start with something like:

select item_id, max(item_added_date) from order_items group by item_id;
 item_id |    max     
---------+------------
       3 | 2020-06-15
       2 | 2020-06-01
       1 | 2020-06-05

But the problem is immediately obvious - we lose all the other order information for those items e.g. order_id.

Now, we could join the data back to the original table to find the orders:

select o.* from order_items o 
join (
	select item_id, max(item_added_date) from order_items group by item_id
) as t 
on t.item_id=o.item_id 
and t.max = o.item_added_date;
 id | order_id | order_item_id | item_id | item_short_name | item_added_date 
----+----------+---------------+---------+-----------------+-----------------
  3 |        1 |             3 |       2 | Bread           | 2020-06-01
  5 |        2 |             1 |       1 | Toothpaste      | 2020-06-05
  7 |        4 |             1 |       3 | Milk            | 2020-06-15

But it seems a little tedious to inner join to the subquery, esp. if we only want one order from each grouping.

A better way using DISTINCT ON

We are always looking for simpler ways to express things in SQL, and DISTINCT ON stood out as a way to simplify such a query:

select distinct on (item_id) * 
from order_items 
order by item_id, item_added_date desc;
 id | order_id | order_item_id | item_id | item_short_name | item_added_date 
----+----------+---------------+---------+-----------------+-----------------
  5 |        2 |             1 |       1 | Toothpaste      | 2020-06-05
  3 |        1 |             3 |       2 | Bread           | 2020-06-01
  7 |        4 |             1 |       3 | Milk            | 2020-06-15

On first look, it’s a little difficult to decipher what’s going on. So how does it work?

  1. The DISTINCT ON creates groups (partitions) against item_id from the full dataset
  2. Then it orders each group by item_id and item_added_date descending
  3. And finally picks the first item from each ordered group

Or more visually: Distinct On Visualised

One thing to be careful of is that if you do not use an ORDER BY clause with DISTINCT ON (in our case, order by item_added_date), then the first row selected from each group/partition is unpredictable.

In summary, the DISTINCT ON approach is useful as it retains all the original row information of the resulting orders e.g. order_id and order_item_id. And the query is nice and concise.

An alternative using a window function

There are other options like a window function; however, for such a case a window function felt more complicated:

select * from ( 
	select 
		*, 
		ROW_NUMBER() over (partition by item_id order by item_added_date desc) as rank
	from order_items
) as t WHERE rank = 1

This works by:

  1. Partitioning data on item_id
  2. Ordering each partition by item_added_date
  3. Adding a row number as rank, on a per-partition basis
  4. Finally, selecting the rows ranked 1, to get the rows at the top of each partition
 id | order_id | order_item_id | item_id | item_short_name | item_added_date | rank 
----+----------+---------------+---------+-----------------+-----------------+------
  5 |        2 |             1 |       1 | Toothpaste      | 2020-06-05      |    1
  3 |        1 |             3 |       2 | Bread           | 2020-06-01      |    1
  7 |        4 |             1 |       3 | Milk            | 2020-06-15      |    1

Seems a very round about approach for this use case, though windows functions are generally more powerful and flexible for more complicated use cases. They also are more suitable in non Postgres databases where DISTINCT ON is not available. See our post on window functions for more on this.


datasqlpostgres
Land Technologies

We are the engineers behind LandInsight and LandEnhance. We’re helping property professionals build more houses, one line of code at a time. We're a remote-first company with our tech team in Europe, and yes, we're hiring!

© 2021, Built with Gatsby
LAND TECHNOLOGIES LTD, m: WeWork c/o LandTech, 10 Devonshire Square, London, EC2M 4PL t: 0203 086 7855
© 2016 Land Technologies Ltd. Registered Company Number 08845300.