- Relational databases: Postgres, MySQL, Oracle, MS SQL
- Clients: DBeaver, JetBrains products
- Table/relation, record/row/tuple, schema, database
- Connection string
- Select, projection
- `count()`
- Conditions: =, >, <, !=, <>, in, like, ~, ~~, is null
- Sorting
- Table and column aliases
- Conventions. "Name" vs Name
- Formatting
- Connect to the database
- Look through the tables and their columns, and find the corresponding entities and their attributes in the UI
- Try changing something in the UI and see the database state change accordingly
- Select injections whose name ends with `01`
- Select peaks of some injection, sort peaks by area
- Select Injections of some Batch (open a batch, take its ID from address bar), order them by Acquisition Time. Now try by Name desc.
- Find chromatograms in some injection with no association with substances (`substance` column is null)
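The filtering and sorting exercises above can be tried on a throwaway database first. This is a minimal sketch that uses Python's built-in `sqlite3` instead of Postgres so it runs without a server; the `injections` table and its columns are a simplified, hypothetical stand-in for the real schema.

```python
import sqlite3

# Hypothetical, simplified schema; the real tables have more columns.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table injections (id integer primary key, name text, batch_id integer);
    insert into injections (id, name, batch_id) values
        (1, 'run_01', 10), (2, 'run_02', 10), (3, 'blank_01', 10), (4, 'run_01', 11);
""")

# `like` with a leading % matches any prefix; `desc` reverses the sort order.
# `i` is a table alias, `injection_name` is a column alias.
rows = db.execute("""
    select i.id, i.name as injection_name
    from injections i
    where i.batch_id = 10 and i.name like '%01'
    order by i.name desc
""").fetchall()
print(rows)
```

The same `select` should work unchanged in Postgres, since it only relies on `like`, aliases and `order by`.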
- Select from multiple tables without joins
- Unique columns (Keys), Primary Key, Secondary Key
- Foreign Key
- Joining 2 tables: get chromatogram with peaks
- Joining 3 tables: get injection with chromatograms and their peaks
- One-to-One, One-to-Many, Many-to-Many
- Outer joins (Left, Right), Inner join. The default is different for different vendors.
- Additional joining conditions
- `using` keyword
- (Does not relate to joins) Select all the injections that aren't added to any batches, ordered by import time. We should get the same records as this page shows.
- Get all the peaks in all batches sorted by Area. Note that peaks reference injections, and injections reference batches, so there will be 3 tables involved.
- Find batches and their peaks, but this time we're interested only in peaks on Total chromatograms (see the `chromatograms.total_signal` column)
- Count the number of rows in the last query and compare it to the number of rows we got before that.
- Get structure, MF, alias of all substances (the structure itself is in the `structures` table). We're not interested in substances w/o structure. Try solving the last condition first with `left join` and an additional `where` clause, then try with just `inner join`.
- Get all peaks from injections of some batch. Select from injections, the rest should be joined. We don't want to see injections w/o peaks.
- Modify the last query to only capture `modified_manually` peaks. Try doing this with `where` and with additional conditions inside `join`.
- Can you rewrite the last statement with a `right join` if you swap tables in `join` and `select` statements?
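The difference between filtering in `where` and filtering inside the `join` condition is easiest to see side by side. The sketch below assumes a hypothetical two-table schema (`injections`, and `peaks` with an `injection_id` foreign key) and runs on SQLite through Python's `sqlite3`; it is an illustration, not the reference solution.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: inj_c has no peaks, only peak 1 is modified manually.
db.executescript("""
    create table injections (id integer primary key, name text);
    create table peaks (id integer primary key,
                        injection_id integer references injections(id),
                        area real, modified_manually boolean);
    insert into injections values (1, 'inj_a'), (2, 'inj_b'), (3, 'inj_c');
    insert into peaks values (1, 1, 5.0, 1), (2, 1, 7.0, 0), (3, 2, 3.0, 0);
""")

def query(sql):
    return db.execute(sql).fetchall()

# An inner join drops injections without peaks...
inner = query("""
    select i.name, p.id from injections i
    join peaks p on p.injection_id = i.id
    order by i.name, p.id""")

# ...and so does a left join followed by a null check in `where`.
left_filtered = query("""
    select i.name, p.id from injections i
    left join peaks p on p.injection_id = i.id
    where p.id is not null
    order by i.name, p.id""")

# Condition in `where`: rows are filtered after the join, unmatched injections vanish.
in_where = query("""
    select i.name, p.id from injections i
    left join peaks p on p.injection_id = i.id
    where p.modified_manually
    order by i.name""")

# Condition in `on`: only the matching is restricted, every injection stays.
in_join = query("""
    select i.name, p.id from injections i
    left join peaks p on p.injection_id = i.id and p.modified_manually
    order by i.name""")

print(inner, left_filtered, in_where, in_join, sep="\n")
```

Injections without a manually modified peak still appear in `in_join`, with `None` in the peak column.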
- Unique substances within peaks
- Number of peaks in injection
- Injections with n of peaks > 0
- Distinct keyword
- Group by multiple fields: injections w/ 2 peaks and 2 substances
- Get injection w/ max number of peaks, join them with peaks
- Aggregation functions: count(), max(), min(), avg(), sum()
string_agg(col, 'separator' order by col)
- Look at the `detector_runs` table. It represents physical detectors which may produce more than 1 chromatogram. Now find a way to see all `type`s of detectors currently present in DB. Try doing this with `distinct`, then see if you can achieve the same with `group by`.
- Count how many rows of each detector type are present in DB.
- Each chromatogram references its `detector_run`. Count how many `chromatograms` of each detector `type` are present in DB. This will require both `join` and `group by`.
- Select all chromatograms and an average peak area within those chromatograms. We're interested only in stats across peaks that are not `modified_manually`.
- See if there are injections with duplicated names in the database. Note that `ID` is unique, while `name` of injections isn't. Using `having` and `count()` you can filter out those injection names that are not duplicated, leaving just the duplicates.
- Get injections within some batch that have more than 2 chromatograms
- Then add a comma-separated list of `detector_runs.type`s within each injection
- Add a number of peaks within each injection and an average peak area
- Similar to the previous task, get all peaks within a batch, and for each injection get an average area. But now we need 2 rows for each injection: one for `modified_manually` peaks (with the average area among these peaks) and one for the others (likewise with their average area). If an injection has only one type of peaks, then there will be only 1 row, not 2.
- Now leave only those injections that have more than 1 peak. Filter out the rest.
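As a warm-up for the grouping exercises, here is a small sketch of `group by` with `having`, plus the `distinct` vs `group by` comparison. It assumes a hypothetical one-table schema and runs on SQLite via Python's `sqlite3`. For comma-separated lists, Postgres offers `string_agg`; this sketch leaves that part out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical data: 'std' appears three times, 'mix' twice, 'blank' once.
db.executescript("""
    create table injections (id integer primary key, name text);
    insert into injections values
        (1, 'std'), (2, 'std'), (3, 'blank'), (4, 'std'), (5, 'mix'), (6, 'mix');
""")

# `where` filters rows before grouping; `having` filters the groups afterwards.
duplicates = db.execute("""
    select name, count(*) as n
    from injections
    group by name
    having count(*) > 1
    order by n desc, name
""").fetchall()

# Without aggregates, `distinct` and `group by` produce the same set of values.
distinct_names = db.execute(
    "select distinct name from injections order by name").fetchall()
grouped_names = db.execute(
    "select name from injections group by name order by name").fetchall()

print(duplicates)
print(distinct_names == grouped_names)
```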
- Sub-select in `where`: injections with peaks where peak area = sum(all peaks on that chromatogram)
- Sub-select in `select`: peak area compared to the sum of peak areas on that chromatogram
- Select a single peak with the largest area from the table. First we need to select the max area within the table (sub-select), and in the outer select we can find the peak with that exact area.
- Now select the largest peak across each chromatogram. The output should show chromatogram data and an additional column `max_peak_area`. First do this with a `join` and `group by`. Then try doing the same with a sub-select: `select (select ... from ..) from ...`.
- Now that you have chromatograms with their Max Peak Area, calculate the sum of these areas per injection. So the output should have columns: `injection, max_peak_area_sum`. Notice that you couldn't do this without sub-selects this time, as first we had to prepare the data set to sum across.
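The three places a sub-select can appear (`where`, `select`, `from`) can be sketched as follows. The schema is a hypothetical simplification (`peaks.chromatogram_id`, `chromatograms.injection_id`), and the SQL runs on SQLite through Python's `sqlite3`; treat it as one possible shape of the solution.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical data: injection 100 has chromatograms 1 and 2, injection 200 has 3.
db.executescript("""
    create table chromatograms (id integer primary key, injection_id integer);
    create table peaks (id integer primary key, chromatogram_id integer, area real);
    insert into chromatograms values (1, 100), (2, 100), (3, 200);
    insert into peaks values (1, 1, 5.0), (2, 1, 9.0), (3, 2, 4.0), (4, 3, 2.0), (5, 3, 1.0);
""")

# Sub-select in `where`: find the max first, then the peak with that exact area.
largest_peak = db.execute("""
    select id, area from peaks
    where area = (select max(area) from peaks)
""").fetchall()

# Sub-select in `select` (correlated, one value per chromatogram), wrapped in a
# sub-select in `from` so the outer query can sum the prepared values.
per_injection = db.execute("""
    select injection_id as injection, sum(max_peak_area) as max_peak_area_sum
    from (
        select c.injection_id,
               (select max(p.area) from peaks p
                where p.chromatogram_id = c.id) as max_peak_area
        from chromatograms c
    ) as chromatogram_maxima
    group by injection_id
    order by injection_id
""").fetchall()

print(largest_peak)
print(per_injection)
```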
- Strings: upper(), lower(), replace(), concat()
- coalesce()
- Numbers: algebraic operations, round(), least()/greatest()
- Division of integers vs double
- Casting
- boolean
- Dates: `now()`, `date_trunc('day', creation_time)`, `extract(year from creation_time)`, `to_char(current_timestamp, 'month')`, `'2022-11-29'::timestamp`, `interval '1 day'`, `extract(days from creation_time - date_trunc('year', creation_time))`
- Using functions in conditions, selects, group by
- Select all users from the `users` table. Combine firstname and lastname into 1 column, separate them with a space.
- There are injections with the same name (injection and INJECTION are considered same for this task) in our database. Find all the rows of such duplicates, select all the information about these injections.
- Peaks have area, as well as chromatograms. Find the percentage of the peak/chromatogram. Compare this to the column `area_perc` in the peak - does it give the same result?
- Calculate a number of injections created monthly, you should get something like this:
month    | injections_created
2022 Nov | 133
2022 Dec | 564
- Find injections that contain two dashes in their names: `xxxxxxx-xxxx-xxxx`. The first 2 parts of it (`xxxxxxx-xxxx`) are the experiment name. List all the experiments and the number of injections in each of them.
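Two pitfalls from the topics above, integer division and `null` propagation in string concatenation, can be checked quickly. The sketch uses SQLite via Python's `sqlite3` and a hypothetical `users` table; the `||` operator, `cast` and `coalesce()` used here exist in Postgres as well.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical users table; Morty has no lastname on purpose.
db.executescript("""
    create table users (id integer primary key, firstname text, lastname text);
    insert into users values (1, 'Rick', 'Sanchez'), (2, 'Morty', null);
""")

# Dividing two integers truncates; casting one operand gives a fractional result.
int_div, real_div = db.execute(
    "select 7 / 2, cast(7 as real) / 2").fetchone()

# Concatenating with null yields null, so the whole name is lost...
naive = db.execute(
    "select firstname || ' ' || lastname from users order by id").fetchall()

# ...unless coalesce() supplies a fallback value.
safe = db.execute("""
    select firstname || ' ' || coalesce(lastname, '?') as full_name
    from users order by id
""").fetchall()

print(int_div, real_div)
print(naive)
print(safe)
```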
- We need to calculate the stats on our tables - the number of rows in different tables and the ratio between these numbers. Try to do this with sub-selects and then with CTEs. Example of an output:
n_of_peaks | n_of_chromatograms | n_of_injections | peaks_per_chromatogram | peaks_per_injection | chromatograms_per_injection
500 | 200 | 10 | 2.5 | 50 | 20
- List all peaks (full rows) which are the largest peaks on a chromatogram and their area is greater than all other peaks on that chromatogram combined (you can't use chromatogram area as a shortcut - you must sum up peak areas). We're not interested in cases when there's only 1 peak per chromatogram.
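For the table statistics exercise, one way to structure the CTE variant is to collect the counts once and compute the ratios in the outer query. This is a sketch on SQLite (via Python's `sqlite3`) with hypothetical, nearly empty tables; multiplying by `1.0` is an assumption-free way to avoid integer division in both SQLite and Postgres.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table injections (id integer primary key);
    create table chromatograms (id integer primary key, injection_id integer);
    create table peaks (id integer primary key, chromatogram_id integer);
""")
# Hypothetical data: 2 injections, 4 chromatograms, 10 peaks.
db.executemany("insert into injections values (?)", [(i,) for i in range(1, 3)])
db.executemany("insert into chromatograms values (?, ?)",
               [(c, 1 + c % 2) for c in range(1, 5)])
db.executemany("insert into peaks values (?, ?)",
               [(p, 1 + p % 4) for p in range(1, 11)])

# The CTE names the intermediate result so the ratios can reuse the counts.
stats = db.execute("""
    with counts as (
        select (select count(*) from peaks)         as n_of_peaks,
               (select count(*) from chromatograms) as n_of_chromatograms,
               (select count(*) from injections)    as n_of_injections
    )
    select n_of_peaks, n_of_chromatograms, n_of_injections,
           1.0 * n_of_peaks / n_of_chromatograms      as peaks_per_chromatogram,
           1.0 * n_of_peaks / n_of_injections         as peaks_per_injection,
           1.0 * n_of_chromatograms / n_of_injections as chromatograms_per_injection
    from counts
""").fetchone()
print(stats)
```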
- row_number()
- sum()
- lag()
- List the largest peak of each chromatogram
- List the first peak of each chromatogram (by `rt_minutes`)
- List peaks of chromatograms and show how much time elapsed between two peaks (use `start_minutes` and `end_minutes` of peaks). Add yet another column `not_resolved` and set it to `true` if the peak touches the previous peak (the values of the borders are equal).
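The window functions listed above can be sketched on a small hypothetical `peaks` table. The example runs on SQLite (3.25 or newer, as bundled with current Python versions) through `sqlite3`; SQLite returns `1`/`0` where Postgres would return `true`/`false`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical data: on chromatogram 1, peak 2 starts exactly where peak 1 ends.
db.executescript("""
    create table peaks (id integer primary key, chromatogram_id integer, area real,
                        start_minutes real, end_minutes real);
    insert into peaks values
        (1, 1, 5.0, 0.5, 1.0), (2, 1, 9.0, 1.0, 1.5), (3, 1, 2.0, 2.0, 2.5),
        (4, 2, 4.0, 0.2, 0.4);
""")

rows = db.execute("""
    select id,
           -- rank 1 is the largest peak of its chromatogram
           row_number() over (partition by chromatogram_id
                              order by area desc) as area_rank,
           -- lag() looks at the previous row of the same partition
           start_minutes - lag(end_minutes) over (partition by chromatogram_id
                                                  order by start_minutes) as gap_minutes,
           coalesce(start_minutes = lag(end_minutes) over (partition by chromatogram_id
                                                           order by start_minutes),
                    false) as not_resolved
    from peaks
    order by id
""").fetchall()
for row in rows:
    print(row)
```

Keeping only the rows with `area_rank = 1` requires wrapping this query in a sub-select or CTE, because window functions cannot be used in `where`.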
Calculate conversion for each injection in a batch, the result should be similar to this table.
Conversion represents the amount of reactant (e.g. Core) that went into forming Product molecules. It tells us the effectiveness of a reaction. One way to calculate it is: `product_amount / (reactant_amount + product_amount)`. Because we don't really have amounts, we will use Peak Area instead as an approximation. So: `conversion = product_peak_area / (product_peak_area + reactant_peak_area)`.
Notes:
- We need to consider only peaks on a single chromatogram - typically we need to choose some extracted UV chromatogram (like `UV 254`). See the `chromatograms.nm` column.
- It's possible that both Product and Core don't have peaks at all. In this case we want to show `N/A`
- If we have only Core, then it's 0
- If it's only Product, then 1
- It's possible there is more than 1 peak of Product or Core. In such a case let's take the largest.
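The conversion rules in the notes can be expressed with conditional aggregation. The sketch below makes strong simplifying assumptions: peaks are already restricted to one chromatogram, and a hypothetical `peaks.substance` text column says whether a peak is Product or Core. The real schema will differ. It runs on SQLite via Python's `sqlite3`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical data: injection 1 has both substances (two Product peaks),
# 2 has only Core, 3 has only Product, 4 has no peaks at all.
db.executescript("""
    create table injections (id integer primary key);
    create table peaks (id integer primary key, injection_id integer,
                        substance text, area real);
    insert into injections values (1), (2), (3), (4);
    insert into peaks values
        (1, 1, 'Product', 30.0), (2, 1, 'Product', 10.0), (3, 1, 'Core', 10.0),
        (4, 2, 'Core', 5.0),
        (5, 3, 'Product', 8.0);
""")

rows = db.execute("""
    with areas as (
        -- max() picks the largest peak when a substance has several
        select i.id as injection_id,
               max(case when p.substance = 'Product' then p.area end) as product_area,
               max(case when p.substance = 'Core' then p.area end) as core_area
        from injections i
        left join peaks p on p.injection_id = i.id
        group by i.id
    )
    select injection_id,
           case
               when product_area is null and core_area is null then null
               else coalesce(product_area, 0.0)
                    / (coalesce(product_area, 0.0) + coalesce(core_area, 0.0))
           end as conversion
    from areas
    order by injection_id
""").fetchall()

# A null conversion means neither substance had a peak; display it as N/A.
shown = [(inj, "N/A" if conv is None else conv) for inj, conv in rows]
print(shown)
```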
Calculate chromatogram names the same way Peaksel does it. Notice that some Injections may contain the same detectors more than once. In some cases even though physically it's the same detector, it may take different measurements at different times, and so in the extreme cases we may get dozens and even hundreds of detector_runs per Injection, see this example: 03JUN2020_COV_AAA_PL_021.
In order to differentiate between detector_runs with the same name, Peaksel suffixes them with a letter: A, B, C, etc. You need to write a query that returns a list of chromatograms with their names within the injection the same way Peaksel does it. Note, that if there's just 1 instance of a detector_run of each type in the injection, then we don't have A/B/C suffixes.
These names also consist of:
- Analytical Method (UV, MS, ELS, etc).
- Detector Sub-Type (if applicable): SQD, QTOF, etc.
- If it's a Mass Spec, then ion_mode will say if it's positive or negative (+ or - sign)
- In select statement
- Inside `count()`
- For each injection we'd like to see its "status". Possible values:
- TBD (To Be Done) - when an injection contains no substances
- Analyzed - when there's at least one substance
- Curated Manually - when there's at least one peak `modified_manually`
- Also add another column: if injection has a creator, then let's show their First Last name. If not, let's show "Unknown".
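The status exercise combines a conditional expression in `select` with one inside `count()`. The following sketch assumes that substances are linked through a hypothetical `peaks.substance_id` column and that injections carry a `creator_id`; both are guesses made for illustration. It runs on SQLite via Python's `sqlite3`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical data: inj_a has an assigned substance, inj_b has a manually
# modified peak and no creator, inj_c has no peaks at all.
db.executescript("""
    create table users (id integer primary key, firstname text, lastname text);
    create table injections (id integer primary key, name text, creator_id integer);
    create table peaks (id integer primary key, injection_id integer,
                        substance_id integer, modified_manually boolean);
    insert into users values (1, 'Rick', 'Sanchez');
    insert into injections values (1, 'inj_a', 1), (2, 'inj_b', null), (3, 'inj_c', 1);
    insert into peaks values (1, 1, 7, 0), (2, 1, null, 0), (3, 2, null, 1);
""")

rows = db.execute("""
    select i.name,
           case
               -- count() skips nulls, so only manually modified peaks are counted
               when count(case when p.modified_manually then 1 end) > 0
                   then 'Curated Manually'
               when count(p.substance_id) > 0 then 'Analyzed'
               else 'TBD'
           end as status,
           coalesce(u.firstname || ' ' || u.lastname, 'Unknown') as creator
    from injections i
    left join users u on u.id = i.creator_id
    left join peaks p on p.injection_id = i.id
    group by i.id, i.name, u.firstname, u.lastname
    order by i.id
""").fetchall()
for row in rows:
    print(row)
```

The branches are evaluated top to bottom, so "Curated Manually" wins over "Analyzed" when both conditions hold; whether that is the intended priority is an assumption.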
- Selecting from functions: `now()`, `random()`, `generate_series()`
- Cross join
- For each day of current and previous years, get the number of injections uploaded. Days without injections should have 0.
- There are 5 dice: two are 4-sided (numbers 1 through 4) and the other 3 are 6-sided (numbers 1 through 6). What's the probability the sum is going to be 15? To figure it out you need to generate all possible combinations of these 5 dice, then count the number of combinations that sum up to 15 and number of all possible combinations. Then divide one by the other.
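The dice exercise boils down to a cross join of five small number series. In Postgres the series would come from `generate_series(1, 6)`; the sketch below substitutes a recursive CTE because it runs on SQLite (via Python's `sqlite3`), where `generate_series` is not always available. The brute-force check with `itertools` is only there to validate the query.

```python
import sqlite3
from itertools import product

db = sqlite3.connect(":memory:")

favourable, total = db.execute("""
    with recursive n(v) as (
        select 1 union all select v + 1 from n where v < 6
    ),
    d4 as (select v from n where v <= 4),  -- faces of a 4-sided die
    d6 as (select v from n)                -- faces of a 6-sided die
    select sum(case when a.v + b.v + c.v + d.v + e.v = 15 then 1 else 0 end),
           count(*)
    from d4 a cross join d4 b cross join d6 c cross join d6 d cross join d6 e
""").fetchone()

probability = favourable / total
print(favourable, total, probability)

# Independent check in plain Python over the same 5 dice.
rolls = list(product(range(1, 5), range(1, 5),
                     range(1, 7), range(1, 7), range(1, 7)))
expected = sum(1 for roll in rolls if sum(roll) == 15)
```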
- We will consider a famous Tuesday Boy problem in Probability Theory. Our goal would be to generate a set of rows and then calculate probabilities (frequencies).
Problem1 - probability of a boy given families with at least one boy
There's a set of families with 2 children. We're interested in families where at least one of the kids is a boy. When selecting a random family out of this set, what's the probability that there are 2 boys?
Feel free to calculate the probability, but then we'll need to check it with SQL:
- Generate a set of rows that represent families. Columns: `child1_boy` (boolean), `child2_boy` (boolean). Each value is either `true` or `false` with a 50% chance.
- Out of the set, filter out families that don't have boys
- And finally count a) families with 2 boys b) number of all families. Then calculate the proportion of one to another
Is the result consistent with what you predicted?
Problem2 - probability of a boy given families with at least one boy born on Tue
Now the condition is a little more complicated: our families have at least one boy born on Tue. When selecting a random family, what's the probability that there are 2 boys?
- Let's add 2 additional columns to our generated data set: `child1_weekday`, `child2_weekday`. Use either numbers or strings to denote days of the week. Each day is equally probable.
- From the generated set keep only the rows where there's at least one boy and at least one of the boys has a birthday on Tue.
- Now calculate the proportion of families with 2 boys
Does the result surprise you? Can you explain why this is the case?
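A way to cross-check a random simulation of both problems is to enumerate every equally likely family exactly once with a cross join, which gives exact fractions. Note that this swaps the random generation described above for exhaustive enumeration. The sketch runs on SQLite via Python's `sqlite3`; weekday `2` standing for Tuesday is an arbitrary choice.

```python
import sqlite3
from fractions import Fraction

db = sqlite3.connect(":memory:")
db.executescript("""
    create table sex (is_boy integer);
    insert into sex values (0), (1);
    create table weekday (day integer);
    insert into weekday values (1), (2), (3), (4), (5), (6), (7);
    -- every combination of two children is equally likely
    create view families as
        select s1.is_boy as child1_boy, w1.day as child1_weekday,
               s2.is_boy as child2_boy, w2.day as child2_weekday
        from sex s1 cross join weekday w1 cross join sex s2 cross join weekday w2;
""")

def share_of_two_boys(condition):
    """Fraction of families with 2 boys among families matching `condition`."""
    two_boys, total = db.execute(f"""
        select sum(case when child1_boy = 1 and child2_boy = 1 then 1 else 0 end),
               count(*)
        from families
        where {condition}
    """).fetchone()
    return Fraction(two_boys, total)

problem1 = share_of_two_boys("child1_boy = 1 or child2_boy = 1")
problem2 = share_of_two_boys(
    "(child1_boy = 1 and child1_weekday = 2)"
    " or (child2_boy = 1 and child2_weekday = 2)")
print(problem1, problem2)
```

With randomly generated rows the observed frequencies should approach these fractions as the number of rows grows.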
- `union all` vs `union`
We want to generate statistics for uploading injections, which should look like this:
user | uploaded_month | count
---------------------------------
Rick | Jan | 100
Rick | Feb | 50
...
Morty | Jan | 111
Morty | Feb | 232
...
all | Jan | 666
all | Feb | 282
...
Rick | all | 2523
Morty | all | 1880
...
all | all | 12513
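One way to build such a report is to stack four groupings with `union all`, using the literal `'all'` where a dimension is collapsed. The sketch assumes a hypothetical flat `uploads` table with the month already extracted; in the real schema this would come from joining injections to users and formatting the upload time. It runs on SQLite via Python's `sqlite3`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical, pre-flattened data: one row per uploaded injection.
db.executescript("""
    create table uploads (uploader text, uploaded_month text);
    insert into uploads values
        ('Rick', 'Jan'), ('Rick', 'Jan'), ('Rick', 'Feb'), ('Morty', 'Jan');
""")

report = db.execute("""
    select uploader, uploaded_month, count(*) from uploads
    group by uploader, uploaded_month
    union all
    select 'all', uploaded_month, count(*) from uploads group by uploaded_month
    union all
    select uploader, 'all', count(*) from uploads group by uploader
    union all
    select 'all', 'all', count(*) from uploads
""").fetchall()
for row in report:
    print(row)

# `union` removes duplicate rows, `union all` keeps them.
deduplicated = db.execute("select 1 union select 1").fetchall()
kept = db.execute("select 1 union all select 1").fetchall()
```

Row order across the stacked parts is not guaranteed, so a real report would add an explicit `order by`.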