Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Nondeterministic results on TPCH 15 #861

Closed
rommelDB opened this issue Jul 8, 2020 · 3 comments
Closed

[BUG] Nondeterministic results on TPCH 15 #861

rommelDB opened this issue Jul 8, 2020 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@rommelDB
Copy link
Contributor

rommelDB commented Jul 8, 2020

Describe the bug
The single-node SF1 validation runs in a non-deterministic way, sometimes the test passes but sometimes reports:

AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (1, 5)
[right]: (0, 5)

The expected output is:

idx | s_suppkey | s_name | s_address | s_phone | total_revenue
1 | 8449 | Supplier#000008449 | Wp34zim9qYFbVctdW | 20-469-856-8873 | 1772627.2305372064

@rommelDB rommelDB added bug Something isn't working ? - Needs Triage needs team to review and classify labels Jul 8, 2020
@wmalpica
Copy link
Contributor

@rommelDB can you please add the actual query to the issue. Please provide more context.

@wmalpica wmalpica removed the ? - Needs Triage needs team to review and classify label Jul 10, 2020
@rommelDB
Copy link
Contributor Author

TPCH Query 15

	query ="""
		with revenue (suplier_no, total_revenue) as (
			select
				l_suppkey,
				sum(l_extendedprice * (1-l_discount))
			from
				lineitem
			where
				l_shipdate >= date '1996-01-01'
				and l_shipdate < date '1996-01-01' + interval '3' month
			group by
				l_suppkey
		)
		select
			s_suppkey,
			s_name,
			s_address,
			s_phone,
			total_revenue
		from
			supplier, revenue
		where 
			s_suppkey = suplier_no
			and total_revenue = (
				select
					max(total_revenue)
				from
					revenue
			)
		order by
			s_suppkey
	"""

@rommelDB
Copy link
Contributor Author

Analyzing the logical plan I found that the problem occurs because there is a floating-point subexpression that is calculated more than once with slightly different results. So, this issue is more related to issue #696 cc @williamBlazing @Christian8491

DEBUG: com.blazingdb.calcite.application.RelationalAlgebraGenerator - optimized
LogicalSort(sort0=[$0], dir0=[ASC])
  LogicalProject(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$4])
    LogicalJoin(condition=[=($4, $5)], joinType=[inner])
      LogicalProject(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], EXPR$1=[$5])
        LogicalJoin(condition=[=($0, $4)], joinType=[inner])
          BindableTableScan(table=[[main, supplier]], projects=[[0, 1, 2, 4]], aliases=[[s_suppkey, s_name, s_address, s_phone]])
          LogicalProject(l_suppkey=[$0], EXPR$1=[CASE(=($2, 0), null:FLOAT, $1)])
            LogicalAggregate(group=[{0}], EXPR$1=[$SUM0($1)], agg#1=[COUNT($1)])
              LogicalProject(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
                BindableTableScan(table=[[main, lineitem]], filters=[[AND(>=($3, 1996-01-01 00:00:00), <($3, 1996-04-01 00:00:00))]], projects=[[2, 5, 6, 10]], aliases=[[l_suppkey, $f1, $f2, $f3]])
      LogicalAggregate(group=[{}], EXPR$0=[MAX($0)])
        LogicalProject(EXPR$1=[$1])
          LogicalProject(l_suppkey=[$0], EXPR$1=[CASE(=($2, 0), null:FLOAT, $1)])
            LogicalAggregate(group=[{0}], EXPR$1=[$SUM0($1)], agg#1=[COUNT($1)])
              LogicalProject(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
                BindableTableScan(table=[[main, lineitem]], filters=[[AND(>=($3, 1996-01-01 00:00:00), <($3, 1996-04-01 00:00:00))]], projects=[[2, 5, 6, 10]], aliases=[[l_suppkey, $f1, $f2, $f3]])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants