Time comparison does not always capture the correct data when the result set is limited #27504
@eschutho I think you misunderstand the meaning of "TOP N" analytics, and what the ORDER BY and LIMIT clauses mean in SQL. The first screenshot from your post in the Explore page asks to sort by the count and then retrieve only the top 10 rows. If you really want to fetch the top 10 years, you should set …
@eschutho BTW, I'll open the SIP as soon as possible later this week.
Hi, thanks @zhaoyongjie. Apologies, but the last query that I showed as the correct example should have been sorted by count instead of by date. That may be where the confusion came from; I updated it in my example. But essentially, I'm looking to sort by the count, so in this case 1995, with 12 instances, should be the first row. I want the top ten years in which the name had the highest count. This is the corrected version:
How to sort the result set after fetching the data from the database should be a separate topic; there is a … for that. Your SQL has some issues:
As mentioned before, if you really need a secondary sort, the sort operator might help.
Including that in the subquery would truncate the set used to JOIN, correct? Shouldn't we leverage the DBMS's optimizations by executing the query as a whole instead? Otherwise we end up with incomplete result sets if, for example, the LIMIT is too low, and as stated before, data inconsistency is something we should never aim for. To tackle this bug, I would say that in general we have two limitations depending on which database we are working with:
For the first one, we would have to write something that checks whether or not CTEs are supported; if they are, make use of them, and if not, write the query manually. For the second, if the database supports joins, we should ensure data consistency at pre-query execution with the outer joins; if joins are not supported, we should fall back to DataFrame joins at post-processing time, but only as a fallback, because data inconsistency might arise.
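The fallback chain described above could be sketched roughly as follows. This is a hypothetical sketch, not Superset's actual API: `supports_cte` and `supports_join` stand in for whatever capability flags the database engine spec would expose.

```python
# Hypothetical sketch of the strategy selection described above.
# `supports_cte` / `supports_join` are assumed capability flags,
# not real Superset engine-spec attributes.

def choose_comparison_strategy(supports_cte: bool, supports_join: bool) -> str:
    """Pick how to build the time-comparison query for a given database."""
    if supports_cte:
        # Preferred: one CTE-based query, with the limit applied after
        # the outer join, so the DBMS optimizer handles everything.
        return "cte"
    if supports_join:
        # Fall back to a manually written outer-join query.
        return "sql_join"
    # Last resort: two queries joined as DataFrames in post-processing,
    # accepting the risk of inconsistent result sets.
    return "dataframe_join"
```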
@Antonio-RiveroMartnez
The logic is included in all visualizations. Currently, there isn't a control to specify how to sort a column as a secondary sort. In other words, Superset doesn't have a way to sort by ds and return the top 10, then sort those 10 by count. I also posted the result in the comment: the result remains the same whether you use a Pandas join or a SQL join.
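For clarity, the missing "secondary sort" control would correspond to something like this two-stage pandas operation (toy data, hypothetical counts):

```python
import pandas as pd

# Toy data standing in for the birth_names aggregation (made-up counts).
df = pd.DataFrame({
    "ds": pd.date_range("1985", periods=15, freq="YS"),
    "count": [5, 12, 7, 9, 11, 3, 8, 14, 6, 10, 4, 13, 2, 15, 1],
})

# Stage 1: sort by ds and keep the latest 10 rows.
latest_10 = df.sort_values("ds", ascending=False).head(10)

# Stage 2: re-sort those 10 rows by count -- the secondary sort that
# the comment says Superset currently has no control for.
result = latest_10.sort_values("count", ascending=False)
```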
It's not only about compatibility of CTE and JOIN, but also about how to define a time delta in different databases.
Totally agree on that, but that's not the point of my comment; I'm talking about sorting and limiting on just one column, for example ds as you mentioned, or the count, and limiting that result. I disagree with the result being the same whether you use Pandas or SQL joins; unless I'm missing something and you can join with Pandas at pre-query execution, if you rely on Pandas, the result set may have changed between the execution of query A and query B. Plus, you don't necessarily have the comparison you might be looking to apply, since data from the outer query might not be present in query B. I believe the main thing to tackle is how LIMIT and OFFSET could impact data consistency when the comparison is made.
@zhaoyongjie I was trying this out with a name that has more results. Do you think this is expected behavior when fetching 50 rows sorted by count vs. 10 rows? It seems that the data should be consistent even if I put a row limit on it. In this case, 1994 doesn't return a value for the time shift when I put a limit on the rows. WDYT?
@eschutho The 10th record in the first data slice (1979-2004) might be 1994 or 1995 (per your first screenshot), because they have the same count:
or
The 10th record in the second data slice(1969-1994) is :
Since the 10th record in the first data slice is unpredictable, you also can't predict the result of the join in the 10th row. Apparently, your database returns …
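The tie-at-the-limit problem can be reproduced in a few lines (hypothetical counts, with 1994 and 1995 tied as in the screenshots above). Without a tie-breaker, which tied row survives a top-N cut is up to the database's sort implementation; adding a second sort column makes it deterministic:

```python
import pandas as pd

# Hypothetical counts where the 10th and 11th rows tie (1994 vs 1995).
df = pd.DataFrame({
    "year": list(range(1985, 1997)),
    "count": [20, 19, 18, 17, 16, 15, 14, 13, 12, 7, 7, 5],
})

# Without a tie-breaker, which count == 7 row lands in the top 10 is
# implementation-defined on most databases.
top10 = df.sort_values("count", ascending=False).head(10)

# A deterministic alternative: break ties on a second column.
stable_top10 = df.sort_values(
    ["count", "year"], ascending=[False, False]
).head(10)
```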
@eschutho, I believe this issue may not stem from the JOIN logic, but rather from the ORDER BY clause. When constructing a time comparison, human intuition typically prefers to navigate through the time dimension. Thus, you've addressed the question in your initial comment: sorting by … However, this approach might lead to inconsistent results between the time comparison and a directly constructed query. Let's revisit your last screenshot. The value … If we automatically apply the …
I've also considered this issue. I don't know if we should provide a way to tell users to use a time column as the ORDER BY when they want to construct a time comparison, so that we could inform them via a popup or modal.
Thanks @zhaoyongjie! I agree with your assessment of why we're missing data, but a quick question on this:
Do you think by putting the limit on the outer scope after the join, we can solve this?
I had a similar question on the example here, and maybe this is at the crux of what we're trying to solve: the use case where someone does not want the time dimension as the order by or group by. It seems like a common use case to sort by the top 10 names, or to group by product type rather than date. How can we do this, and do you think the join query could solve both of these problems?
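One way to picture "putting the limit on the outer scope after the join" is the pandas equivalent below: join the full series first, then sort and limit the joined result. Data and the limit of 3 are hypothetical, chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical full (unlimited) series for each period.
current = pd.DataFrame({"year": [1995, 1996, 1997, 1998, 1999],
                        "count": [12, 9, 8, 7, 3]})
prior = pd.DataFrame({"year": [1985, 1986, 1987, 1988, 1989],
                      "count": [2, 11, 10, 6, 5]})

# Join FIRST (shifting the prior period forward 10 years), THEN
# sort and limit the combined result.
joined = current.merge(
    prior.assign(year=prior["year"] + 10),
    on="year", how="left", suffixes=("", "_prior"),
)
top = joined.sort_values("count", ascending=False).head(3)

# The limit can no longer drop a comparison row that a surviving
# current-period row needs: every kept year carries its prior value.
```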
The key issue is that if we change any …
I've replied on the Table with time comparison PR |
Fix for this is in https://github.com/apache/superset/pull/27718/files#diff-d3bf055ecf6a74aa0acbb0650d176b6c251aea7796543f77a2df3fd8b7e4c4b4
|
Bug description
When fetching a set of data over a period of time and adding a time comparison, if the result set is truncated, some of the comparison information will also be missing.
Here is an example with missing data when a small row limit is applied:
In the above example, I am trying to fetch the top 10 years between 1989 and 2004 in which the name Alex was the most popular, and compare each count to the period 10 years prior. The values for 1981, 1982, 1983, and 1984 are missing, although I would expect to get them, because if I fetch more data by removing the row limit, those values exist.
Here is an example with all the data when a larger row limit is applied:
I believe this bug exists because we are making two separate db requests, one for each time period where the name is "Alex", limiting each request to 10 rows, and sorting on the count. Because the counts in each response are going to differ, when we make the comparison, values for the dates in the original time series may not exist.
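A toy reproduction of this failure mode (made-up counts, a limit of 5, and a 10-year shift; not the real birth_names data): each period is queried and limited independently, so the limit on period B can cut exactly the row that period A needs for its comparison.

```python
import pandas as pd

# Hypothetical per-year counts for the two periods.
period_a = pd.DataFrame({"year": [1995, 1996, 1997, 1998, 1999, 2000],
                         "count": [12, 9, 8, 7, 3, 2]})
period_b = pd.DataFrame({"year": [1985, 1986, 1987, 1988, 1989, 1990],
                         "count": [2, 11, 10, 6, 5, 4]})

# Each "db request" is sorted by count and limited separately.
LIMIT = 5
top_a = period_a.nlargest(LIMIT, "count")
top_b = period_b.nlargest(LIMIT, "count")

# Shift period B forward by 10 years and join onto the original series.
top_b_shifted = top_b.assign(year=top_b["year"] + 10)
compared = top_a.merge(top_b_shifted, on="year", how="left",
                       suffixes=("", "_10y_ago"))

# 1995's comparison value is missing: 1985 (count 2) was cut by
# period B's limit, even though the row exists in the full data.
```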
Here are the current db requests when a row limit is applied:
I believe what needs to happen instead is that the first query with time range A needs to join the data from time range B with a left outer join, and then apply the limit; that would give the correct results.
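The proposed query shape can be illustrated with an in-memory SQLite table (table and column names are hypothetical; the actual fix in the linked PR, and the time-shift syntax, will differ per database):

```python
import sqlite3

# Minimal illustration of the proposed shape: LEFT OUTER JOIN the
# shifted period first, then apply ORDER BY / LIMIT on the joined
# result. Names and counts are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE birth_counts (year INTEGER, cnt INTEGER);
    INSERT INTO birth_counts VALUES
        (1985, 2), (1986, 11), (1987, 10), (1988, 6), (1989, 5),
        (1995, 12), (1996, 9), (1997, 8), (1998, 7), (1999, 3);
""")

rows = conn.execute("""
    SELECT a.year, a.cnt, b.cnt AS cnt_10y_ago
    FROM birth_counts AS a
    LEFT OUTER JOIN birth_counts AS b
        ON b.year = a.year - 10           -- the time shift
    WHERE a.year BETWEEN 1995 AND 1999
    ORDER BY a.cnt DESC
    LIMIT 5                               -- limit applied AFTER the join
""").fetchall()

# Every selected year keeps its comparison value, including 1995 -> 1985.
```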
This is an example of a query that would fetch all the data required:
How to reproduce the bug
Screenshots/recordings
See above
Superset version
master / latest-dev
Python version
3.10
Node version
18 or greater
Browser
Chrome
Additional context
I am using the examples db on Postgres with the birth_names dataset. This issue will also exist on some chart types that also have server pagination as it will apply the same row limit.