Performance for 'select count(*) from (select distinct <c> from <t>)' is poor if <c> is a wide varchar column with NULLs in all rows #8330

Open
pavel-zotov opened this issue Nov 27, 2024 · 0 comments


Weird cases were encountered while re-implementing the test for #542 (aka CORE-214).
In short:
if a table contains a varchar column with length >= 100, then:
case-1) if all rows in this column have a NULL value, the query
select count(*) from ( select distinct <text_column> from <table> )
works VERY SLOWLY compared to the query
select count( distinct <text_column> ) from <table>.
This is especially true for a column with charset = utf8;
case-2) if all rows in this column have the same NOT-null value (i.e. count(distinct <text_column>) = 1), the ratio is reversed: the first query can run ~2x FASTER than the second (but only if the column has charset = utf8).

Below are details.
Ticket #542 was fixed on 31-may-2015 by the following commit
(3.0.0.31846): https://sourceforge.net/p/firebird/code/61671/

The comment in this commit refers to the worst case of input data for the quick-sort algorithm, namely when all values in the array are the same.
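As a toy illustration of why all-equal input is pathological (this is only a sketch of the textbook effect, not Firebird's actual sort code): a naive quick-sort that partitions around the first element leaves one side empty on every step, so the recursion degrades from O(n log n) to O(n^2):

import sys
import time

def naive_quicksort(a):
    # Two-way partition around the first element: with all-equal keys the
    # "left" side is always empty, so depth and total work grow quadratically.
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return naive_quicksort(left) + [pivot] + naive_quicksort(right)

sys.setrecursionlimit(20_000)
for n in (1_000, 2_000, 4_000):
    data = ['x' * 100] * n            # every "row" holds the same wide value
    t0 = time.perf_counter()
    naive_quicksort(data)
    # doubling n roughly quadruples the time: the O(n^2) signature
    print(n, round(time.perf_counter() - t0, 2))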

I've checked two tables, each with 100'000 rows and several columns of miscellaneous datatypes (int, boolean, ... date, time, ..., varchar).
The first table is named 'test_one_unique_value'; each of its columns contains only one unique NOT-null value.
The second table is named 'test_null_in_all_rows'; each of its columns has NULL in all rows.
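The attached .fbk (see below) contains the real data; the following is only a sketch of how an equivalent of 'test_null_in_all_rows' might be created from Python, assuming the firebird-driver package; the connection string, credentials and column width of 1000 are placeholders, not values taken from the backup:

from firebird.driver import connect  # assumption: firebird-driver package (FB 3+)

con = connect('localhost:/path/to/test.fdb', user='SYSDBA', password='masterkey')
cur = con.cursor()
cur.execute("""
    recreate table test_null_in_all_rows (
        nul_txt_1251 varchar(1000) character set win1251,
        nul_txt_utf8 varchar(1000) character set utf8
    )
""")
con.commit()
# 100'000 rows with NULL in every column (the case-1 pattern):
for _ in range(100_000):
    cur.execute('insert into test_null_in_all_rows default values')
con.commit()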

I can confirm that this commit really did fix the problem described in the ticket for the queries:
query_1: select count(*) from (select distinct <col> from T )
query_2: select count(distinct <col>) from T
Before the fix (on 3.0.0.31845) these queries ran for 288 vs 842 ms, while after the fix the times became 278 vs 249 ms (these are times from the trace).

So, the execution times of these queries became almost the same, i.e. the time ratio is about 1, which made it seem like the problem was completely solved.
But this is not so: some hard-to-explain ratios can be observed if we use a table with long textual columns, especially with utf8.
The effect is particularly strong when all rows in such a column are NULL.

To reproduce this:

  • download the .fbk (it was created in an up-to-date FB 3.0.13):
    core-0214_-_ods_12_0.fbk.7z.zip
  • unpack it and restore it under an appropriate FB version (I ran checks on FB 3.x, 4.x, 5.x and 6.x);
  • run the following queries:
select count(*) from (select distinct F_TXT_1251 from test_one_unique_value); -- 1a
select count(distinct F_TXT_1251) from test_one_unique_value; -- 1b
select count(*) from (select distinct F_TXT_UTF8 from test_one_unique_value); -- 2a
select count(distinct F_TXT_UTF8) from test_one_unique_value; -- 2b

select count(*) from (select distinct NUL_TXT_1251 from test_null_in_all_rows); -- 3a
select count(distinct NUL_TXT_1251) from test_null_in_all_rows; -- 3b
select count(*) from (select distinct NUL_TXT_UTF8 from test_null_in_all_rows); -- 4a
select count(distinct NUL_TXT_UTF8) from test_null_in_all_rows; -- 4b

Columns F_TXT_1251 and F_TXT_UTF8 contain non-empty (and identical) text in all rows.

  • F_TXT_1251 contains a string generated from random ascii-only characters; the original idea was to use cyrillic text in this field, but I later gave up on it;
  • F_TXT_UTF8 contains a string generated from random greek characters (without any ascii).

I've checked the following for these queries:

  • CPU time (the difference between the 'cpu.user_time' counters provided by the Python package psutil; see the sketch after this list);
  • elapsed time in the trace.

The DB page_size was 8192 in all cases.
Several declared column widths were tested: 10, 100, 1000, 2000 and 4097 characters.
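A minimal sketch of such a measurement, assuming the firebird-driver Python package and a SuperServer setup (one server process); the connection string, credentials and server PID below are placeholders, not values from the attached harness:

import time
import psutil
from firebird.driver import connect  # assumption: firebird-driver package (FB 3+)

def run_and_measure(con, server_proc, sql):
    # Execute <sql> once; return (elapsed_sec, server_cpu_user_sec).
    cpu_before = server_proc.cpu_times().user
    t0 = time.perf_counter()
    cur = con.cursor()
    cur.execute(sql)
    cur.fetchone()                    # force the aggregate to be evaluated
    cur.close()
    return time.perf_counter() - t0, server_proc.cpu_times().user - cpu_before

# Hypothetical connection string and PID; with Classic, each attachment has
# its own process, so the PID would have to be that of the attachment.
con = connect('localhost:/path/to/core-0214.fdb', user='SYSDBA', password='masterkey')
server_proc = psutil.Process(4242)    # PID of the firebird process, looked up beforehand

for tag, sql in (
    ('4a', 'select count(*) from (select distinct NUL_TXT_UTF8 from test_null_in_all_rows)'),
    ('4b', 'select count(distinct NUL_TXT_UTF8) from test_null_in_all_rows'),
):
    elapsed, cpu = run_and_measure(con, server_proc, sql)
    print(f'{tag}: elapsed={elapsed:.3f}s, server cpu.user={cpu:.3f}s')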

Results can be found in the attached .zip, file: "core-0214_-_time_for_misc_field_length.xlsx"
This is how it looks:
[image: screenshot of core-0214_-_time_for_misc_field_length.xlsx]

Also, several runs were made on FB 3.x, 4.x, 5.x and 6.x, with the purpose of finding out whether such uncommon ratios (less than 0.5 or more than 10) are present for other (non-varchar) data types.
Results can be found in the attached .zip, file: "core-0214_-_cpu_time_for_misc_datatypes.xlsx".
It looks like this:
[image: screenshot of core-0214_-_cpu_time_for_misc_datatypes.xlsx]

performance-issues-for-textual-columns-while-reimplementing-test-for-core-0214.zip
