Time comparison does not always capture the correct data when the result set is limited #27504

Closed

eschutho opened this issue Mar 14, 2024 · 15 comments
eschutho commented Mar 14, 2024

Bug description

When fetching a set of data over a period of time and adding a time comparison, if the result set is truncated, some of the comparison information will also be missing.
Here is an example with missing data when a small row limit is applied:
[Screenshot: 2024-03-13 at 4 55 31 PM]

In the above example, I am trying to fetch the top 10 years between 1989 and 2004 in which the name Alex was most popular, and compare each count to the period 10 years prior. As you can see, the values for 1981, 1982, 1983, and 1984 are missing, although I would expect to get those values: if I fetch more data by removing the row limit, they exist.

Here is an example with all the data when a larger row limit is applied:
[Screenshot: 2024-03-13 at 4 55 59 PM]

I believe the reason this bug exists is that we are making two separate db requests, one for each time period where the name is "Alex", limiting each request to 10 rows and sorting on the count. Because the counts in each response will differ, when we make the comparison, the comparison values for the dates in the original time series may not exist.

Here are the current db requests when a row limit is applied:
[Screenshots: 2024-03-13 at 5 01 33 PM and 5 01 59 PM]

I believe what needs to happen instead is that the first query (time range A) should join the data from time range B with a left outer join, and only then apply the limit; that would give the correct results.

This is an example of a query that would fetch all the data required:

SELECT 
  query1.ds AS ds1,
  query1.count AS count1,
  query2.ds AS ds2,
  query2.count AS count2
FROM 
  (
    SELECT 
      DATE_TRUNC('year', ds) AS ds,
      COUNT(*) AS count
    FROM 
      public.birth_names
    WHERE 
      ds >= TO_TIMESTAMP('1989-03-13 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US') AND
      ds < TO_TIMESTAMP('2004-03-13 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US') AND
      name IN ('Alex')
    GROUP BY 
      DATE_TRUNC('year', ds)
    ORDER BY 
      ds DESC
  ) AS query1
LEFT JOIN 
  (
    SELECT 
      DATE_TRUNC('year', ds) AS ds,
      COUNT(*) AS count
    FROM 
      public.birth_names
    WHERE 
      ds >= TO_TIMESTAMP('1979-03-13 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US') AND
      ds < TO_TIMESTAMP('1994-03-13 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US') AND
      name IN ('Alex')
    GROUP BY 
      DATE_TRUNC('year', ds)
    ORDER BY 
      ds DESC
  ) AS query2 
ON 
  query1.ds = query2.ds + INTERVAL '10 years'
ORDER BY 
  count1 DESC
LIMIT 10;
[Screenshot: 2024-03-15 at 1 37 45 PM]

How to reproduce the bug

  1. create a line chart with a date range
  2. add a time comparison to the chart
  3. truncate the result set by applying a row limit

Screenshots/recordings

See above

Superset version

master / latest-dev

Python version

3.10

Node version

18 or greater

Browser

Chrome

Additional context

I am using the examples db on Postgres with the birth_names dataset. This issue will also exist for some chart types that have server pagination, as it applies the same row limit.

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
zhaoyongjie commented Mar 14, 2024

In the above example, I am trying to fetch the top 10 years between 1989 and 2004 in which the name Alex was most popular, and compare each count to the period 10 years prior.

@eschutho I think you have misunderstood what "Top N" analytics means, and what the ORDER BY and LIMIT clauses mean in SQL.

The first screenshot from your post shows the Explore page sorting by the count and then retrieving only the top 10 rows. If you really want to fetch the top 10 years, you should set ds as the value of the SORT BY control.

[Screenshots: Explore page SORT BY control settings]

zhaoyongjie commented:
@eschutho BTW, I'll open the SIP as soon as possible this week.

eschutho commented Mar 15, 2024

Hi, thanks @zhaoyongjie. Apologies: in the last query that I showed as the correct example, I should have sorted by count instead of by date. That may be where the confusion came from; I have updated it in my example. But essentially, I'm looking to sort by the count, so in this case 1995 had 12 instances and that row should be first. I want the top ten years in which the name had the highest count. This is the corrected version:
[Screenshot: 2024-03-15 at 1 37 45 PM]

zhaoyongjie commented Mar 17, 2024

so in this case 1995 had 12 instances and that row should be first. I want the top ten years in which the name had the highest count. This is the corrected version:

How to sort the result set after it has been fetched from the database is a separate topic; there is a sort operator in post-processing for that.

Your SQL has some issues:

  1. The ORDER BY and LIMIT clauses should always be combined; together they mean getting the "top N" records from a table. The subqueries in your post don't include a LIMIT clause.
  2. The SQL and the user interface (Explore page) are inconsistent: "top ten years" was selected in the UI, so the SQL should be ORDER BY ds LIMIT 10, not only in the subquery but also in the outer query.
  3. There is significant extra computation: a 10-record result set joined against an unpredictable number of records.

As mentioned before, if you really need a secondary sort, the sort operator might help.

Antonio-RiveroMartnez commented Mar 18, 2024

  1. ... The subqueries in your post don't include a LIMIT clause.
  2. There is significant extra computation: a 10-record result set joined against an unpredictable number of records.

Including that in the subquery would truncate the set used for the JOIN, correct? Shouldn't we leverage the DBMS's optimizations by executing the query as a whole instead? Otherwise we end up with incomplete result sets if, for example, the LIMIT is too low, and as stated before, data inconsistency is something we should never aim for.

To tackle this bug, I would say that in general we have two limitations, depending on which database we are working with:

  1. Not all databases support CTEs
  2. Not all databases support JOINs

For the first, we would have to write something that checks whether or not CTEs are supported; if they are, make use of them, and if not, write the query manually.

For the second, if the database supports joins, we should ensure data consistency at query-construction time with the outer joins; if joins aren't supported, we should fall back to DataFrame joins at post-processing time, but only as a fallback, because data inconsistency might arise.
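
A minimal sketch of that dispatch, assuming a supports_joins capability flag and a run_sql callable (the names here are illustrative, not Superset's actual API):

def compare_with_offset(run_sql, main_sql, offset_sql, joined_sql, supports_joins):
    """run_sql: callable mapping a SQL string to a pandas DataFrame."""
    if supports_joins:
        # Preferred path: the database joins both time slices before the
        # row limit is applied, so no comparison rows are dropped.
        return run_sql(joined_sql)
    # Fallback: two independently limited queries joined in pandas. Rows
    # present in one slice may be missing from the other, which is exactly
    # the inconsistency this issue describes.
    df = run_sql(main_sql)
    offset_df = run_sql(offset_sql)
    return df.merge(offset_df, on="ds", how="outer", suffixes=("", "__offset"))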

zhaoyongjie commented Mar 18, 2024

@Antonio-RiveroMartnez
Upon rereading this issue, I believe the confusion lies in understanding the 'limit' control and the 'sort by' control in the UI, as well as how to apply them in SQL.

  1. The value of the 'limit' control is 10, and the 'sort by' control is 'ds'. This indicates sorting by the 'ds' column and then returning the top 10 records.
  2. The value of the 'limit' control is 10, and the 'sort by' control is 'count'. This indicates sorting by the 'count' column (a metric) and then returning the top 10 records.

This logic is included in all visualizations. Currently, there isn't a control to specify a column as a secondary sort. In other words, Superset doesn't have a way to define sorting by ds and returning the top 10, then sorting by count and returning the top 10. I also posted the result in a comment above. The result remains the same whether you use a Pandas join or a SQL join.
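
A minimal pandas sketch of why the join engine doesn't change the outcome once each slice has been limited independently (the numbers are made up for illustration):

import pandas as pd

# Two time slices, each already truncated to its own top N by count.
current = pd.DataFrame({"ds": [1990, 1994, 1995], "count": [25, 19, 19]})
shifted = pd.DataFrame({"ds": [1984, 1985], "count": [12, 10]})

# Align the shifted slice to the current timestamps (10-year offset), then join.
shifted["ds"] = shifted["ds"] + 10
joined = current.merge(shifted, on="ds", how="left", suffixes=("", "__10y"))
print(joined)
#      ds  count  count__10y
# 0  1990     25         NaN   <- 1980 never made the shifted slice's top N
# 1  1994     19        12.0
# 2  1995     19        10.0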

zhaoyongjie commented:
  1. Not all databases support CTEs
  2. Not all databases support JOINs

It's not only about compatibility with CTEs and JOINs, but also about how to define a time delta in different databases.
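
For instance, the 10-year shift in the join condition needs different SQL per engine. A hypothetical mapping (the templates use each engine's standard syntax, but the structure is only illustrative):

# Hypothetical per-engine templates for shifting a timestamp column by N years;
# the PostgreSQL form matches the example query earlier in this issue.
TIME_SHIFT_TEMPLATES = {
    "postgresql": "{col} + INTERVAL '{n} years'",
    "mysql": "DATE_ADD({col}, INTERVAL {n} YEAR)",
    "sqlite": "DATETIME({col}, '+{n} years')",
}

def shift_expr(engine, col, n):
    return TIME_SHIFT_TEMPLATES[engine].format(col=col, n=n)

print(shift_expr("postgresql", "query2.ds", 10))  # query2.ds + INTERVAL '10 years'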

Antonio-RiveroMartnez commented Mar 18, 2024

... Superset doesn't have a way to define sorting by ds and returning the top 10, then sorting by count and returning the top 10.

Totally agree on that, but that's not the point of my comment. I'm talking about sorting and limiting on just one column, for example ds as you mentioned, or the count, and limiting that result.

I disagree that the result is the same whether you use Pandas or SQL joins. Unless I'm missing something and you can join with Pandas before query execution, if you rely on Pandas the result set may have changed between the execution of query A and query B, and you don't necessarily get the comparison you're looking for, since data from the outer query might not be present in query B.

I believe the main thing to tackle is how LIMIT and OFFSET impact data consistency when the comparison is made.

eschutho commented Mar 18, 2024

@zhaoyongjie I was trying this out with a name that has more results. Do you think this is expected behavior if I fetch 50 rows sorted by count vs. 10 rows? It seems the data should stay consistent when I put a row limit on it. In this case, 1994 doesn't return a value for the time shift when I put a limit on the rows. WDYT?

[Screenshots: 2024-03-15 at 1 50 42 PM and 1 50 28 PM]

zhaoyongjie commented Mar 18, 2024

@eschutho
This behavior is totally by design. The SQL from your second screenshot, in plain terms, says "show me the top 10 results sorted by birth count, descending".

The 10th record in the first data slice (1979-2004) might be 1994 or 1995 (taken from the first screenshot), because they have the same count:

ds                        count
1994-01-01                  19

or

ds                        count
1995-01-01                  19

The 10th record in the second data slice (1969-1994) is:

ds                        count
1985-01-01                  10

Since the 10th record in the first data slice is unpredictable, you also can't predict the result of the join in the 10th row. Apparently, your database returns 1994-01-01 as the 10th record, so the left join gives a null.

eschutho commented Mar 18, 2024

Right, I agree. Because we are fetching the data separately, in two separate queries, we aren't guaranteed that the results in the first set will be available in the second set, which makes it difficult to do the pandas post-processing comparison. My original example shows this happening on the first and seventh rows as well. I'm not sure most users would understand that the data is actually there, and that the discrepancies come from how our business logic applies the row limit. Do you think we can improve on this so that we can show all the data in these comparisons?

[Screenshot: 2024-03-13 at 4 48 53 PM]

zhaoyongjie commented Mar 20, 2024

we aren't guaranteed that the results in the first set will be available in the second set, which makes it difficult to do the pandas post-processing comparison. My original example shows this happening on the first and seventh rows as well.

@eschutho, I believe this issue may not stem from the JOIN logic, but rather from the ORDER BY clause. When constructing a time comparison, human intuition typically prefers to navigate through the time dimension. Thus you addressed the question in your initial comment: sorting by ds in the subquery and then applying the count sort in the outer query.

However, this approach might lead to inconsistent results between a time-comparison query and a directly constructed query.

Let's revisit your last screenshot. The value 1875 is the highest value from 1989 to 2004 (the first data slice), and the corresponding dimension value is 1990. However, the shifted data slice does not contain data for 1980, so a left join returns a null. The KEY point is why 1980 is not in the second data slice: it's because it's not in the top-N series of the shifted data slice. If you don't use time comparison in Explore and instead directly select that time range, 1980 will still not be in the result set.

If we automatically apply the ds sort in the subquery, 1980 will appear in the result set when building the time comparison. However, if we directly pick a time range, 1980 will disappear again.

Do you think we can improve on this so that we can show all the data in these comparisons?

I have also been considering this issue. I don't know if we should provide a way to tell users to use a time column as the ORDER BY when they want to construct a time comparison, perhaps using a popup or modal to inform the user.

eschutho commented:
Thanks @zhaoyongjie! I agree with your assessment of why we're missing data, but a quick question on this:

It's because it's not in the top-N series of the shifted data slice.

Do you think that by putting the limit in the outer scope, after the join, we can solve this?

I don't know if we should provide a way to tell users to use a time column as the ORDER BY when they want to construct a time comparison ...

I had a similar question on the example in #27545, and maybe this is at the crux of what we're trying to solve: the use case where someone doesn't want the time dimension as the sort or group by. It seems like a common use case to sort by the top 10 names, or to group by product type rather than date. How can we do this, and do you think the join query could solve both of these problems?

zhaoyongjie commented Mar 21, 2024

@eschutho

Do you think by putting the limit on the outer scope after the join, we can solve this?

The key issue is that if we change any ORDER BY or LIMIT in the inner query, in other words, move the LIMIT clause to the outer query, we would get inconsistent result sets between queries with and without time comparison.

I had a similar question on the example in #27545 and ...

I've replied on the Table with time comparison PR.

eschutho commented May 6, 2024

The fix for this is in https://github.com/apache/superset/pull/27718/files#diff-d3bf055ecf6a74aa0acbb0650d176b6c251aea7796543f77a2df3fd8b7e4c4b4:

# From the linked PR: replace the left join with a full outer join so that
# rows from both the main and offset DataFrames are kept.
df = dataframe_utils.full_outer_join_df(
    left_df=df,
    right_df=offset_df,
    rsuffix=R_SUFFIX,
)
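
For reference, a minimal standalone sketch of what such a full outer join does in plain pandas (full_outer_join_df and R_SUFFIX come from the linked PR; this approximation of their behavior is an assumption, not the actual implementation):

import pandas as pd

R_SUFFIX = "__offset"  # stand-in for the suffix constant used in the PR

# Main slice and time-shifted slice, indexed by the (re-aligned) timestamp.
df = pd.DataFrame({"ds": [1990, 1995], "count": [25, 19]}).set_index("ds")
offset_df = pd.DataFrame({"ds": [1985, 1990], "count": [12, 10]}).set_index("ds")

# A full outer join keeps index values from both slices, so neither side's
# row limit silently drops the other's comparison rows.
joined = df.join(offset_df, how="outer", rsuffix=R_SUFFIX)
print(joined)
#       count  count__offset
# ds
# 1985    NaN           12.0
# 1990   25.0           10.0
# 1995   19.0            NaN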
