Pagination of query results in Postgres

Marko's recent blog post on speeding up count(*) in Postgres sent me to his previous post where I saw a very valid reason from Darren Duncan for pagination of results from a database. The point being that web applications are usually expected to show the page links and allow the user to jump to different pages in the result.

One solution is to run the query twice, once with just a count(*) and then the actual query, but that'd be very inefficient.

A few experiments later I think I have a solution; a bit inefficient than running the plain query but much better than running the query twice.
with
REAL_QUERY as
 (select relname, oid from pg_class limit 5)
-- User should ignore the rest of the query
select *, show_rowcounter() as row_count
from (select REAL_QUERY.*, increment_rowcounter() as row_number
      from REAL_QUERY,
      (select reset_rowcounter()) as _c
      order by row_number
      offset 0) as _subq;

Don't get scared by the length of the query above, all one needs to do is put the real query between the first pair of parenthesis after the REAL_QUERY identifier.

The result would be the same as the original query, but with two additional columns: row_number and row_count. row_number numbers each row starting from 1, and row_count shows the total number of rows in the result.
.   relname    | oid  | row_number | row_count 
---------------+------+------------+-----------
 pg_statistic  | 2619 |          1 |         5
 pg_type       | 1247 |          2 |         5
 pg_attribute  | 1249 |          3 |         5
 pg_toast_1262 | 2844 |          4 |         5
 pg_toast_2604 | 2830 |          5 |         5
(5 rows)
The 'ORDER BY row_number' clause adds the overhead, but it is necessary so that the row_count is calculated before the first row is produced at the top level. I wish I could introduce a MATERIALIZE node in the query plan at will.

If your REAL_QUERY has an ORDER BY or GROUP BY clause then you can remove the ORDER BY row_number clause from the outer query, since Postgres will  make sure that show_rowcounter() will not be called before the last call of increment_rowcounter().

The above trick uses the Common-Table-Expression feature (introduced in Postgres version 8.4), because I wanted to make it look like a template where user's real query is visible at the top.

If you are running on an older version you can easily modify it to be a simple query because CTE used above is not recursive.

select *, show_rowcounter() as row_count
from (select REAL_QUERY.*, increment_rowcounter() as row_number
      from (select relname, oid from pg_class limit 5) as REAL_QUERY,
      (select reset_rowcounter()) as _c
      order by row_number
      offset 0) as _subq;
Now the guts of the trickery: PL/Perl functions:

-- PL/Perl
create or replace function reset_rowcounter() returns int as $p$
    $_SHARED{rowcounter} = 0;
$p$ language plperl stable;

create or replace function increment_rowcounter() returns int as $p$
    $_SHARED{rowcounter} = $_SHARED{rowcounter} + 1;
    return $_SHARED{rowcounter};
$p$ language plperl;

create or replace function show_rowcounter() returns int as $p$
    return $_SHARED{rowcounter};
$p$ language plperl;
BTW, this also shows how one can get Oracle's ROWNUM like feature in Postgres.

9 comments:

  1. I use a simpler solution that takes advantage of using COUNT() as a window function:

    SELECT ...
    ,COUNT(*) OVER() AS fullRowCount
    FROM ...
    WHERE ...
    ORDER BY ...
    LIMIT 25 OFFSET 50;

    ReplyDelete
  2. That's interesting... I didn't think about that one. Maybe this will help me eliminate that expensive ORDER BY; all I need is a blocking plan node that waits until all rows have been consumed, and the emits the rows in a FIFO order.

    This wouldn't work in pre-8.4 versions though.

    ReplyDelete
  3. Gurjeet,

    What is the PL/PgSQL equivalent of $_SHARED{}? Thanks!

    ReplyDelete
  4. Thanks for the suggestion, Michael. I tested the tip and it worked both correctly and faster.

    In the old design, my "count(*)" query took 300 ms to run, while the query for a single page of results took 58 ms. (So, 358 ms total). Using the window function method, the single query took 314 ms. That's about 10% faster, not counting any overhead associated with running two queries instead of one.

    A great tip!

    ReplyDelete
  5. :-) Glad it worked for you, Mark. I adapted it from a strategy I used when we were running Oracle.

    ReplyDelete
  6. @gwchamb, PL/PgSQL does not have any equivalent of $_SHARED. I'd have loved to use it if there were anything similar, to avoid external dependency on perl.

    @Mark, if you adapted from the above query, can you please share your version of the windowing query, thanks.

    ReplyDelete
  7. Here's an example of doing it w/ window functions. The following will even work on databases that support window functions but don't have a LIMIT OFFSET feature.

    WITH base AS (
    -- Real Query
    SELECT r.*,
    100.0 AS rpp, -- results per page
    row_number() OVER (ORDER BY rental_date) row_num,
    COUNT(1) OVER () row_count
    FROM rental r
    WHERE staff_id = 1
    )
    SELECT ceil(row_num / rpp) AS page,
    ceil(row_count / rpp) AS pages,
    base.*
    FROM base
    WHERE ceil(row_num / rpp) = 2
    ORDER BY row_num

    ReplyDelete
  8. Why not use CURSORs instead?
    MOVE by required offset, then FETCH page of query results and MOVE FORWARD ALL to get number of total query results.

    ReplyDelete
  9. @Tomas, it appears that cursors are not well supported in DBD::Pg:

    http://search.cpan.org/dist/DBD-Pg/Pg.pm#Cursors

    Besides, the windowing method only adds a small amount of syntax to existing query... what could be simpler than that?

    Cursors use PL/PGSQL, which ads more complexity.

    ReplyDelete