pg_duckdb #253
TODO: should we split up another variant between pg-duckdb-motherduck & pg-duckdb-parquet
Yeah, I think that makes sense. I/we can contribute the pg-duckdb-motherduck one if needed.
TODO: test, view probably doesn't work, but seems weird inlining huge schema on every query. can have some sed replace on \bhits\b I guess
Using a view should work fine, that's currently the recommended approach when reading from the same parquet file multiple times.
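For comparison, the `\bhits\b` substitution floated in the TODO could be sketched as below, should the view ever prove unworkable. This is only an illustration: `inline_hits` and the `parquet_source` placeholder are hypothetical names, and the sketch does not skip string literals that happen to contain the word `hits`.

```python
import re


def inline_hits(query: str, source: str) -> str:
    """Replace every standalone `hits` identifier in `query` with `source`."""
    # \b only matches at word boundaries, so identifiers such as
    # `hits_total` are left untouched.
    # A function replacement keeps backslashes in `source` from being
    # interpreted as regex escapes.
    return re.sub(r"\bhits\b", lambda _match: source, query)


# `parquet_source` is a placeholder for whatever expression reads the file.
print(inline_hits("SELECT COUNT(*) FROM hits;", "parquet_source"))
```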
for i in $(seq 1 $TRIES); do
    psql postgres://postgres:duckdb@localhost:5432/postgres -t -c '\timing' -c "$query" | grep 'Time'
done;
Opening and closing the Postgres connection for each query has significant overhead when connecting to MotherDuck, because each Postgres backend needs to open its own connection to MotherDuck. Writing a Python script instead might be preferable; otherwise you can do something like this:
echo "$query"
(
    echo '\timing'
    yes "$query" | head -n $TRIES
) | psql postgres://postgres:duckdb@localhost:5432/postgres -t | grep 'Time'
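The Python route mentioned above could look roughly like the following sketch, which keeps one connection open for all runs. It assumes the `psycopg2` driver (not named in this thread) and that every query returns rows; `time_query` and `run_benchmark` are hypothetical helper names. Timings are taken on the client, so they will not match psql's `\timing` output exactly.

```python
import time


def time_query(cursor, query, tries=3):
    """Run `query` `tries` times on one already-open cursor.

    Returns the wall-clock seconds of each run, measured on the client.
    """
    timings = []
    for _ in range(tries):
        start = time.perf_counter()
        cursor.execute(query)
        cursor.fetchall()  # drain the result set so the whole query is timed
        timings.append(time.perf_counter() - start)
    return timings


def run_benchmark(dsn, queries, tries=3):
    """Time every query over a single connection (assumes psycopg2 is installed)."""
    import psycopg2  # assumption: any DB-API driver would work similarly

    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # like psql, do not wrap all runs in one transaction
    try:
        with conn.cursor() as cur:
            for query in queries:
                print(query)
                for seconds in time_query(cur, query, tries):
                    print(f"Time: {seconds * 1000:.3f} ms")
    finally:
        conn.close()
```

Calling `run_benchmark("postgres://postgres:duckdb@localhost:5432/postgres", queries)` with the benchmark's query strings would then reuse one backend, and so one MotherDuck connection, for every run.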
Other than that, for MotherDuck it's important that the machine that runs the benchmark is in AWS us-east-1 to get the lowest latency to MotherDuck.
@@ -0,0 +1,4 @@
SELECT duckdb.cache('s3://clickhouse-public-datasets/hits_compatible/hits.parquet', 'parquet');
create view hits as
select WatchID, JavaEnable, Title, GoodEvent, 'epoch'::timestamp + (EventTime || 'second')::interval EventTime, 'epoch'::timestamp + (EventDate || 'day')::interval EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, 'epoch'::timestamp + (ClientEventTime || 'second')::interval ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, 'epoch'::timestamp + (LocalEventTime || 'second')::interval LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID
I'm doing `'epoch'::timestamp + (EventTime || 'second')::interval EventTime` because `'epoch'::timestamp + interval '1second' * EventTime` complained about `interval * double` not being allowed. I adjusted the PG type on EventTime so the rewrite wouldn't get rid of the explicit `EventTime::bigint`, but then pg_duckdb somehow turned the expression into `EventTime::bigint::double`, so it could still complain about `interval * double`.
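To make the intent of the view expression concrete, the same arithmetic can be mirrored outside the database. The sketch below assumes, as the SQL does, that EventTime holds whole seconds and EventDate whole days since the Unix epoch; the function names are illustrative only.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # counterpart of 'epoch'::timestamp


def from_epoch_seconds(seconds: int) -> datetime:
    # Mirrors 'epoch'::timestamp + (EventTime || 'second')::interval
    return EPOCH + timedelta(seconds=seconds)


def from_epoch_days(days: int) -> datetime:
    # Mirrors 'epoch'::timestamp + (EventDate || 'day')::interval
    return EPOCH + timedelta(days=days)


# One day expressed either way should land on the same timestamp.
print(from_epoch_seconds(86400) == from_epoch_days(1))
```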
This queries local parquet files; a `pg_duckdb-motherduck` variant can be added in the future using the MotherDuck integration instead of `read_parquet`.