I have a table with 5M rows, the table contains lab observation codes and display names, both columns have type varchar(2000) and both are indexed.
The query looks like:
select code_1_text, count(code_1_text)
from demo.observation_lab
group by code_1_text
order by 2 desc
The table contains ~1000 distinct display names.
It takes 4 minutes for the query to complete on a VM with some pretty old Xeon, 4 cores, 32G RAM, NVME SSD and Linux on board.
Isn't it too slow? During the execution I see no active hardware resources consumption - CPU load is 12-25%, RAM is almost free, swap file is not used.
The execution plan is as follows:
Relative Cost = 3098072
• Call module F, which populates temp-file B.
• Call module J, which populates temp-file C.
• Read temp-file C, looping on count([value]) and a counter.
• For each row:
- Output the row.
Module: F
• Divide master map demo.observation_lab.IDKEY into subranges of IDs.
• Call module A in parallel on each subrange, piping results into temp-file D.
• Read temp-file D, looping on a counter.
• For each row:
- Check distinct values for %SQLUPPER(code_1_text) using temp-file B,
subscripted by a hash of the %SQLUPPER(code_1_text).
- For each distinct row:
· Add a row to temp-file B, subscripted by the hash,
with node data of %SQLUPPER(code_1_text).
- Update the accumulated count([value]) in temp-file B,
subscripted by the hash
Module: J
• Read temp-file B, looping on the hash subscript.
• For each row:
- Add a row to temp-file C, subscripted by count([value]) and a counter,
with node data of the uncollate expression.
Module: A
• Call module B, which populates temp-file A.
• Read temp-file A, looping on the hash subscript.
• For each row:
- Add a row to temp-file D, subscripted by a counter, with node data of %SQLUPPER(code_1_text) and count([value]).
Module: B
• Read master map demo.observation_lab.IDKEY, looping on the subrange of ID.
• For each row:
- Check distinct values for %SQLUPPER(code_1_text) using temp-file A,
subscripted by a hash of the %SQLUPPER(code_1_text).
- For each distinct row:
· Add a row to temp-file A, subscripted by the hash,
with node data of %SQLUPPER(code_1_text).
- Update the accumulated count([value]) in temp-file A,
subscripted by the hash
Why is it using temp files instead of RAM?
Also, if I rewrite the query to aggregate lab observation codes (which are much shorter than display names) the query starts fetching almost immediately.