Table "irclog"
  Column   |           Type           |                      Modifiers
-----------+--------------------------+-----------------------------------------------------
 id        | integer                  | not null default nextval('irclog_id_seq'::regclass)
 channel   | character varying(30)    |
 day       | character(10)            |
 nick      | character varying(40)    |
 timestamp | integer                  |
 line      | text                     |
 spam      | boolean                  | default false
 time      | timestamp with time zone | default now()
Indexes:
    "irclog_pkey" PRIMARY KEY, btree (id)
    "time_idx" btree ("time")
I tried a few different queries, and tried to optimize them using indexes (single- and multi-column).
select channel, exists(select 1
                       from irclog as i
                       where i.channel = o.channel
                       and i.nick = o.nick) as I_interacted
from irclog as o
where o.nick = 'gurjeet';
Sometimes I had to drop a previously created index as I was running out of disk.
But because the table was huge for my machine (1.3 GB of table data on 128 MB VPS hosted machine), the queries were taking too long, and I had to cancel them after running at max for 20 minutes. I knew the full table scans take only about 58 seconds, and aggregate queries to check most recent activity in channels took about 60 seconds, so I was targeting to make my "interacted" query run in almost the same time.
After a few failed trials, I finally tried this query:
select channel,
       sum( (nick='gurjeet')::int ) as I_interacted,
       count(*) as channel_rows
from irclog
group by channel
order by 3 asc;
This query ran in the same time as a full table scan, and gave me everything I needed to know; which channels I have never interacted with, and which of those I should focus on deleting to free up the most space.
A DELETE and a VACUUM FULL later, I was back to about 65% disk usage, and table size reduced by 50%.
This goes on to show that indexes are not always the solution to long-running queries, and that SQL is tricky, you need to know the right question to get the right answer.
I guess this is the reason we can devise an 'Index Advisor', but don't have any way of implementing a 'Query Advisor'.
 
No comments:
Post a Comment