Rishabh/clarify questions system prompt #181

Merged
merged 14 commits into from
Jun 25, 2024
12 changes: 6 additions & 6 deletions data/instruct_advanced_postgres.csv
@@ -41,10 +41,10 @@ SPM (Selling Profit Margin) = (Total Amount from Sells - (Tax + Commission)) / T
TAC = Total Active Customers who joined within a specified timeframe
CR = Rank customers by their total transaction volume, identifying the customer with the highest transaction volume as rank 1. This involves joining price data with ticker identifiers and filtering for a specified date range."
car_dealership,instructions_cte_join,"What is the average number of days between the sale date and payment received date, rounded to 2 decimal places?","When getting duration between sale and payment date for each sale, get the latest payment for sale by aggregating over the payments_received table first.","WITH sale_payments AS (SELECT s.id AS sale_id, s.sale_date, MAX(p.payment_date) AS latest_payment_date FROM sales s JOIN payments_received p ON s.id = p.sale_id GROUP BY 1,2) SELECT ROUND(AVG(latest_payment_date - sale_date), 2) AS avg_days_to_payment FROM sale_payments","When getting duration between sale and payment date for each sale, get the latest payment for sale by aggregating over the payments_received table first. ASP = Calculate the average price of sales within a specific timeframe Last 30 days = Use a range from the current date minus a certain interval to the current date, always ensure to make the necessary joins before utilizing the sales data. TSC = Count of sales within a specified period"
-car_dealership,instructions_cte_join,"Return the highest sale price for each make and model of cars that have been sold and are no longer in inventory, ordered by the sale price from highest to lowest.","When getting a car's inventory status, always take the latest status from the inventory_snapshots table","WITH latest_inventory_status AS (SELECT car_id, is_in_inventory, ROW_NUMBER() OVER(PARTITION BY car_id ORDER BY snapshot_date DESC, crtd_ts DESC) AS rn FROM inventory_snapshots) SELECT c.make, c.model, MAX(s.sale_price) AS highest_sale_price FROM cars c JOIN sales s ON c.id = s.car_id JOIN latest_inventory_status lis ON c.id = lis.car_id WHERE lis.is_in_inventory = FALSE AND lis.rn = 1 GROUP BY c.make, c.model ORDER BY highest_sale_price DESC","TSC = Count of sales within a specified period
+car_dealership,instructions_cte_join,"Return the highest sale price for each make and model of cars that have been sold and are no longer in inventory, ordered by the sale price from highest to lowest. Use the most recent date in the inventory_snapshots table to determine that car's inventory status.","When getting a car's inventory status, always take the latest status from the inventory_snapshots table","WITH latest_inventory_status AS (SELECT car_id, is_in_inventory, ROW_NUMBER() OVER(PARTITION BY car_id ORDER BY snapshot_date DESC) AS rn FROM inventory_snapshots) SELECT c.make, c.model, MAX(s.sale_price) AS highest_sale_price FROM cars c JOIN sales s ON c.id = s.car_id JOIN latest_inventory_status lis ON c.id = lis.car_id WHERE lis.is_in_inventory = FALSE AND lis.rn = 1 GROUP BY c.make, c.model ORDER BY highest_sale_price DESC","Recall that a car can have multiple entries in the inventory_snapshot table.
+TSC = Count of sales within a specified period
MoM = Change in total receivable amounts from one month to the next, comparing with the immediately preceding month.
-ASP = Mean sale price for a designated start period
-When getting a car's inventory status, always take the latest status from the inventory_snapshots table"
+ASP = Mean sale price for a designated start period"
car_dealership,instructions_cte_join,"Who are the top 5 salespersons by total sales amount? Return their ID, first name, last name and total sales amount.","To get the total sales amount per salesperson, join the salespersons and sales tables, group by salesperson, and sum the sale_price. Always order results with NULLS last.","WITH salesperson_sales AS (SELECT s.id, s.first_name, s.last_name, SUM(sa.sale_price) AS total_sales FROM salespersons s LEFT JOIN sales sa ON s.id = sa.salesperson_id GROUP BY s.id) SELECT id, first_name, last_name, total_sales FROM salesperson_sales ORDER BY total_sales DESC NULLS LAST LIMIT 5","PMSR = per month sales revenue
Always join sales with cars before using the sales table
Weekend days are Saturday and Sunday
@@ -58,11 +58,11 @@ To get the number of sales made by each salesperson in the past 30 days, join th
ASP = Calculate the average sale price without specifying the period
GPM = Define gross profit margin as a ratio without specifying how to calculate total revenue or total cost"
car_dealership,instructions_cte_window,"Return the first name, last name, total sales amount, number of sales, and SR for each salesperson",SR = sales rank of each salesperson ordered by their total sales amount descending,"WITH salesperson_sales AS (SELECT salesperson_id, SUM(sale_price) AS total_sales, COUNT(*) AS num_sales FROM sales GROUP BY salesperson_id) SELECT s.first_name, s.last_name, ss.total_sales, ss.num_sales, RANK() OVER (ORDER BY ss.total_sales DESC) AS sales_rank FROM salesperson_sales ss JOIN salespersons s ON ss.salesperson_id = s.id","SR = sales rank of each salesperson ordered by their total sales amount descending To determine the sales performance per territory, sum the sales amount and count the sales, grouping by territory To calculate the average sale price, join the sales table with itself on the salesperson_id and find the ratio of total sales amount to number of sales To assess inventory turnover, compare inventory snapshots with sales on matching days, focusing on the quantity of items sold."
-car_dealership,instructions_cte_window,What is the total payments received per month? Also calculate the MoM change for each month.,"MoM change = (current month value - prev month value). Return months with no payments as 0. MoM will always be zero for the first month that appears in your answer.","WITH monthly_totals AS (SELECT DATE_TRUNC('month', payment_date) AS dt, SUM(payment_amount) AS total_payments FROM payments_received GROUP BY dt), monthly_range AS (SELECT generate_series(DATE_TRUNC('month', MIN(payment_date)), DATE_TRUNC('month', MAX(payment_date)), '1 month'::interval) AS dt FROM payments_received), monthly_totals_with_zero AS (SELECT mr.dt, COALESCE(mt.total_payments, 0) AS total_payments FROM monthly_range mr LEFT JOIN monthly_totals mt ON mr.dt = mt.dt) SELECT m.dt::DATE AS MONTH, m.total_payments, m.total_payments - lag(m.total_payments, 1) OVER (ORDER BY dt) AS mom_change FROM monthly_totals_with_zero m ORDER BY m.dt;WITH monthly_payments AS (SELECT DATE_TRUNC('month', pr.payment_date) AS MONTH, SUM(pr.payment_amount) AS total_payments FROM payments_received pr GROUP BY MONTH ORDER BY MONTH), monthly_range AS (SELECT generate_series(DATE_TRUNC('month', MIN(pr.payment_date)), DATE_TRUNC('month', MAX(pr.payment_date)), '1 month'::interval) AS MONTH FROM payments_received pr), monthly_payments_with_zeros AS (SELECT mr.month, COALESCE(mp.total_payments, 0) AS total_payments FROM monthly_range mr LEFT JOIN monthly_payments mp ON mr.month = mp.month) SELECT mp.month, mp.total_payments, COALESCE(mp.total_payments - lag(mp.total_payments, 1) OVER (ORDER BY mp.month), 0) AS mom_change FROM monthly_payments_with_zeros mp ORDER BY mp.month;","To ascertain the volume of sales conducted by each salesperson over a recent period, merge the salespersons and sales tables, applying a filter for recent sales transactions.
+car_dealership,instructions_cte_window,What is the total payments received per month? Also calculate the MoM change for each month.,"MoM change = (current month value - prev month value). Return all months in your answer, including those where there were no payments. MoM will always be zero for the first month that appears in your answer.","WITH monthly_totals AS (SELECT DATE_TRUNC('month', payment_date) AS dt, SUM(payment_amount) AS total_payments FROM payments_received GROUP BY dt), monthly_range AS (SELECT generate_series(DATE_TRUNC('month', MIN(payment_date)), DATE_TRUNC('month', MAX(payment_date)), '1 month'::interval) AS dt FROM payments_received), monthly_totals_with_zero AS (SELECT mr.dt, COALESCE(mt.total_payments, 0) AS total_payments FROM monthly_range mr LEFT JOIN monthly_totals mt ON mr.dt = mt.dt) SELECT m.dt::DATE AS MONTH, m.total_payments, m.total_payments - lag(m.total_payments, 1) OVER (ORDER BY dt) AS mom_change FROM monthly_totals_with_zero m ORDER BY m.dt;WITH monthly_payments AS (SELECT DATE_TRUNC('month', pr.payment_date) AS MONTH, SUM(pr.payment_amount) AS total_payments FROM payments_received pr GROUP BY MONTH ORDER BY MONTH), monthly_range AS (SELECT generate_series(DATE_TRUNC('month', MIN(pr.payment_date)), DATE_TRUNC('month', MAX(pr.payment_date)), '1 month'::interval) AS MONTH FROM payments_received pr), monthly_payments_with_zeros AS (SELECT mr.month, COALESCE(mp.total_payments, 0) AS total_payments FROM monthly_range mr LEFT JOIN monthly_payments mp ON mr.month = mp.month) SELECT mp.month, mp.total_payments, mp.total_payments - lag(mp.total_payments, 1) OVER (ORDER BY mp.month) AS mom_change FROM monthly_payments_with_zeros mp ORDER BY mp.month;","To ascertain the volume of sales conducted by each salesperson over a recent period, merge the salespersons and sales tables, applying a filter for recent sales transactions.
To determine the average duration from sale date to payment date, perform a join between the sales and payments tables
To calculate the average selling price, join the sales and products tables, group by product name, and compute the ratio of total sales amount to the number of sales
-MoM change = (current month value - prev month value). Return months with no payments as 0."
-car_dealership,instructions_date_join,"What are the PMSPS and PMSR in the last 6 months excluding the current month, for salespersons hired between 2022 and 2023 (both inclusive)? Include months where metrics are 0. Order by month ascending.",PMSPS = per month salesperson sales count. PMSR = per month sales revenue in dollars. Truncate date to month for aggregation.,"WITH date_range AS (SELECT generate_series(date_trunc('month', CURRENT_DATE - interval '6 months'), date_trunc('month', CURRENT_DATE - interval '1 month'), '1 month')::DATE AS month_start), sales_metrics AS (SELECT date_trunc('month', s.sale_date) AS sale_month, COUNT(s.id) AS PMSPS, SUM(s.sale_price) AS PMSR FROM sales s JOIN salespersons sp ON s.salesperson_id = sp.id WHERE EXTRACT(YEAR FROM sp.hire_date) BETWEEN 2022 AND 2023 AND s.sale_date >= date_trunc('month', CURRENT_DATE - interval '6 months') AND s.sale_date < date_trunc('month', CURRENT_DATE) GROUP BY sale_month) SELECT dr.month_start, COALESCE(sm.PMSPS, 0) AS PMSPS, COALESCE(sm.PMSR, 0) AS PMSR FROM date_range dr LEFT JOIN sales_metrics sm ON dr.month_start = sm.sale_month ORDER BY dr.month_start ASC","PMSPS = per month salesperson sales count. PMSR = per month sales revenue in dollars. Truncate date to month for aggregation.
+MoM change = (current month value - prev month value). Return all months in your answer, including those where there were no payments."
+car_dealership,instructions_date_join,"What are the PMSPS and PMSR in the last 6 months excluding the current month, for salespersons hired between 2022 and 2023 (both inclusive)? Return all months in your answer, including those where metrics are 0. Order by month ascending.",PMSPS = per month salesperson sales count. PMSR = per month sales revenue in dollars. Truncate date to month for aggregation.,"WITH date_range AS (SELECT generate_series(date_trunc('month', CURRENT_DATE - interval '6 months'), date_trunc('month', CURRENT_DATE - interval '1 month'), '1 month')::DATE AS month_start), sales_metrics AS (SELECT date_trunc('month', s.sale_date) AS sale_month, COUNT(s.id) AS PMSPS, SUM(s.sale_price) AS PMSR FROM sales s JOIN salespersons sp ON s.salesperson_id = sp.id WHERE EXTRACT(YEAR FROM sp.hire_date) BETWEEN 2022 AND 2023 AND s.sale_date >= date_trunc('month', CURRENT_DATE - interval '6 months') AND s.sale_date < date_trunc('month', CURRENT_DATE) GROUP BY sale_month) SELECT dr.month_start, COALESCE(sm.PMSPS, 0) AS PMSPS, COALESCE(sm.PMSR, 0) AS PMSR FROM date_range dr LEFT JOIN sales_metrics sm ON dr.month_start = sm.sale_month ORDER BY dr.month_start ASC","PMSPS = per month salesperson sales count. PMSR = per month sales revenue in dollars. Truncate date to month for aggregation.
ASP = Average Sale Price during a specific timeframe
To calculate the average days between a sale date and when the payment was received, join the relevant tables.
TSC = Total Sales Count for a given period"
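
For readers who want to sanity-check the MoM rows above outside of Postgres, here is a minimal Python sketch of the same idea: fill in every month between the first and last payment, treat missing months as 0, and subtract the previous month's total. The function names and the dictionary input are illustrative assumptions, not part of this repository. The sketch follows the instruction text and reports 0 for the first month; note that a bare `lag(...)` in SQL yields NULL for that first row unless it is wrapped in `COALESCE`.

```python
from datetime import date
from typing import Dict, List, Tuple


def month_starts(first: date, last: date) -> List[date]:
    """Every first-of-month date from `first` to `last`, inclusive."""
    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append(date(year, month, 1))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def mom_changes(totals: Dict[date, float]) -> List[Tuple[date, float, float]]:
    """Return (month, total, mom_change) rows, zero-filling missing months.

    `totals` is a hypothetical mapping from first-of-month dates to the
    summed payment amount for that month; it must not be empty.
    """
    rows = []
    previous = None
    for month in month_starts(min(totals), max(totals)):
        total = totals.get(month, 0)  # months with no payments count as 0
        change = 0 if previous is None else total - previous
        rows.append((month, total, change))
        previous = total
    return rows
```

With payments only in January and March, February still appears in the output with a total of 0, which is the behaviour the reworded instruction asks for.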
2 changes: 1 addition & 1 deletion data/questions_gen_postgres.csv
@@ -232,6 +232,6 @@ Assume the rating of a business to be its average rating, and compute it before
"Which merchants earliest coupon start date was within a year of the merchant's registration? Return the merchant id, registration date, and earliest coupon id and start date","WITH earliest_coupons AS (SELECT c.merchant_id, MIN(c.start_date) AS earliest_coupon_start_date FROM consumer_div.coupons c GROUP BY c.merchant_id) SELECT m.mid AS merchant_id, m.created_at AS merchant_registration_date, ec.earliest_coupon_start_date, c.cid AS earliest_coupon_id FROM consumer_div.merchants m JOIN earliest_coupons ec ON m.mid = ec.merchant_id JOIN consumer_div.coupons c ON ec.merchant_id = c.merchant_id AND ec.earliest_coupon_start_date = c.start_date WHERE ec.earliest_coupon_start_date <= m.created_at + INTERVAL '1 year';",ewallet,date_functions,
"Return the name and phone number of the salesperson with the shortest time from being hired to getting fired. Return the number of days he/she was employed for.","SELECT s.first_name, s.last_name, s.phone, s.termination_date - s.hire_date AS days_employed FROM salespersons s ORDER BY days_employed ASC LIMIT 1;",car_dealership,date_functions,
"Return the number of payments made on weekends to the vendor named 'Utility Company'","SELECT COUNT(*) AS weekend_payments FROM payments_made WHERE vendor_name = 'Utility Company' AND EXTRACT(DOW FROM payment_date) IN (0, 6);",car_dealership,date_functions,
-"show me the daily total amount of payments received in the whole of last week, split by the payment_method","SELECT payment_date, payment_method, SUM(payment_amount) AS total_amount FROM payments_received WHERE payment_date BETWEEN DATE_TRUNC('WEEK', CURRENT_DATE) - INTERVAL '1 week' AND DATE_TRUNC('WEEK', CURRENT_DATE) GROUP BY payment_date, payment_method ORDER BY payment_date DESC, payment_method ASC;",car_dealership,date_functions,
+"show me the daily total amount of payments received in the whole of the last ISO week, split by the payment_method","SELECT payment_date, payment_method, SUM(payment_amount) AS total_amount FROM payments_received WHERE payment_date >= DATE_TRUNC('WEEK', CURRENT_DATE) - INTERVAL '1 week' AND payment_date < DATE_TRUNC('WEEK', CURRENT_DATE) GROUP BY payment_date, payment_method ORDER BY payment_date DESC, payment_method ASC;",car_dealership,date_functions,
"Which cars were in inventory in the latest snapshot for march 2023? Return the car id, make, model, and year.","WITH latest_snapshot AS (SELECT MAX(snapshot_date) AS snapshot_date FROM inventory_snapshots WHERE snapshot_date BETWEEN '2023-03-01' AND '2023-03-31' ), latest_snapshot_data AS (SELECT inv.car_id FROM inventory_snapshots inv JOIN latest_snapshot ls ON inv.snapshot_date = ls.snapshot_date WHERE inv.is_in_inventory = TRUE ) SELECT c.id, c.make, c.model, c.year FROM cars c JOIN latest_snapshot_data lsd ON c.id = lsd.car_id;",car_dealership,date_functions,
"What were the total quarterly sales in 2023 grouped by customer's state? Represent each quarter as the first date in the quarter.","SELECT DATE_TRUNC('QUARTER', s.sale_date) AS QUARTER, c.state, SUM(s.sale_price) AS total_sales FROM sales s JOIN customers c ON s.customer_id = c.id WHERE EXTRACT(YEAR FROM s.sale_date) = 2023 GROUP BY c.state, QUARTER HAVING SUM(s.sale_price) > 0 ORDER BY QUARTER, c.state ;",car_dealership,date_functions,
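
The "last ISO week" row above replaces an inclusive `BETWEEN` with a half-open range (`>=` start, `<` end). Because `BETWEEN` includes both endpoints, the earlier form also matched the Monday that begins the current week. The sketch below mirrors the half-open logic in Python; the function names are illustrative assumptions and do not exist in this repository.

```python
from datetime import date, timedelta
from typing import Tuple


def last_iso_week_bounds(today: date) -> Tuple[date, date]:
    """Return (start, end) of the previous ISO week as a half-open range.

    `start` is the Monday of last week; `end` is the Monday of the current
    week and is excluded from the range.
    """
    this_monday = today - timedelta(days=today.weekday())  # Monday == 0
    return this_monday - timedelta(weeks=1), this_monday


def in_last_iso_week(day: date, today: date) -> bool:
    """True when `day` falls in the previous ISO week relative to `today`."""
    start, end = last_iso_week_bounds(today)
    return start <= day < end
```

For `today = date(2024, 6, 25)` (a Tuesday), the range covers Monday 17 June through Sunday 23 June, and Monday 24 June is excluded.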
5 changes: 0 additions & 5 deletions eval/anthropic_runner.py
@@ -77,8 +77,6 @@ def run_anthropic_eval(args):
row = input_rows[i]
result_dict = f.result()
query_gen = result_dict["query"]
-print("Query for")
-print(query_gen)
reason = result_dict["reason"]
err = result_dict["err"]
# save custom metrics
@@ -142,9 +140,6 @@ def run_anthropic_eval(args):
os.makedirs(output_dir)
output_df.to_csv(output_file, index=False, float_format="%.2f")

-# get average rate of exact matches
-avg_acc = output_df["exact_match"].sum() / len(output_df)
-print(f"Average rate of exact match: {avg_acc:.2f}")
# get average rate of correct results
avg_subset = output_df["correct"].sum() / len(output_df)
print(f"Average correct rate: {avg_subset:.2f}")
19 changes: 4 additions & 15 deletions eval/bedrock_runner.py
@@ -22,7 +22,7 @@ def process_row(row, model_id, decimal_points):
body = json.dumps(
{
"prompt": row["prompt"],
-"max_gen_len": 400,
+"max_gen_len": 600,
"temperature": 0,
"top_p": 1,
}
@@ -38,20 +38,9 @@ def process_row(row, model_id, decimal_points):
generated_query = model_response["generation"]
end_time = time()

-if "```sql" in generated_query:
-generated_query = (
-generated_query.split("[/SQL]")[0]
-.split("```sql")[-1]
-.split("```")[0]
-.split(";")[0]
-.strip()
-+ ";"
-)
-else:
-generated_query = (
-generated_query.split("[/SQL]")[0].split("```")[1].split(";")[0].strip()
-+ ";"
-)
+generated_query = (
+generated_query.split("```sql")[-1].split("```")[0].split(";")[0].strip() + ";"
+)

row["generated_query"] = generated_query
row["latency_seconds"] = end_time - start_time
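
To make the simplified extraction above easier to reason about, here is the same split chain pulled into a standalone function. The name `extract_sql` is an assumption for illustration; in the diff the expression is written inline in `process_row`.

```python
def extract_sql(generation: str) -> str:
    """Return the first SQL statement found in a model generation."""
    return (
        generation.split("```sql")[-1]  # text after the last ```sql marker, or all of it
        .split("```")[0]  # stop at the next fence, if any
        .split(";")[0]  # keep only the first statement
        .strip()
        + ";"
    )
```

Two behaviours are worth noting. A generation that continues directly from a prompt ending in a `sql` fence, such as `SELECT 1;` followed by a closing fence, comes back as `SELECT 1;`. A generation wrapped in a bare fence with no `sql` tag now yields just `;`, since the text before the first fence is empty; the removed `else` branch used to handle that case.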
8 changes: 5 additions & 3 deletions prompts/prompt_cot.md
@@ -1,4 +1,6 @@
-<|start_header_id|>user<|end_header_id|>
+<|start_header_id|>system<|end_header_id|>
+
+Follow instructions to the letter, and answer questions without making any additional assumptions.<|start_header_id|>user<|end_header_id|>

Generate a SQL query to answer this question: `{user_question}`
{instructions}
@@ -9,8 +11,8 @@ DDL statements:

I will reflect on the user's request before answering the question.

-The question for which a SQL query must be generated is the following: `{user_question}`
+I was asked to generate a SQL query for this question: `{user_question}`

{instruction_reflections}
-With this in mind, here is the SQL query that best answers the question `{user_question}`, and only references the appropriate tables and columns in the DDL statements:
+With this in mind, here is the SQL query that best answers the question while only using appropriate tables and columns from the DDL statements:
```sql