Trimmed means? #14
Just FYI: as I am experimenting with TDIGEST, I quickly hacked a function to convert a TDIGEST to DOUBLE PRECISION[]:

```sql
CREATE OR REPLACE FUNCTION tdigest_to_array(digest TDIGEST)
RETURNS DOUBLE PRECISION[] AS $$
SELECT
    regexp_split_to_array(
        left(
            regexp_replace(
                digest::TEXT,
                '^flags 0 count ([0-9]*) compression ([0-9]*) centroids ([0-9]*)',
                '0,\1) (\2, \3)'
            ),
            -1
        ),
        '(\) \()|,'
    )::DOUBLE PRECISION[]
$$ LANGUAGE sql IMMUTABLE;
```
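For illustration, the helper above could be used like this (a sketch; the table `t`, its column `x`, and the compression value `100` are assumptions, not part of the original post):

```sql
-- Hypothetical table with a numeric column to aggregate.
CREATE TABLE t (x DOUBLE PRECISION);
INSERT INTO t SELECT random() FROM generate_series(1, 1000);

-- Build a digest with compression 100 and flatten it to an array.
-- Given the regexp above, the array starts with the header fields
-- (flags, count, compression, centroid count) followed by
-- (mean, count) pairs, one per centroid.
SELECT tdigest_to_array(tdigest(x, 100)) FROM t;
```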
Ah, right - the API does not allow calculating trimmed means. The t-digest allows that in principle, but it's not implemented. Adding it should not be very difficult, though - I just didn't need it and no one has requested it so far. As for exposing the t-digest as double precision, I'm not against doing that in principle, but it seems a bit fragile - people need to know which field is what, etc. I wonder if exposing it as a more structured type (e.g. json) would be better. Or maybe something like …

BTW what's your use case for aggregating the data outside the database?
Thanks for the reply! An API for trimmed means would be really great, and maybe more people would appreciate it :) As for the serialization of TDIGEST to DOUBLE PRECISION[], you are right of course - it is fragile, that's why I called it a hack, but it got the job done for me. Maybe defining a CAST to JSONB or something similar would be sufficient. It would be great if there were a standard, compact text/string serialization format for TDIGEST, so you could generate TDIGESTs on the DB side and consume them in any other programming language/framework. My use case is fetching pre-aggregated multi-dimensional statistics to front-end JS code and doing the additional filtering and grouping in crossfilter2 (https://github.com/crossfilter/crossfilter) on the client side.
Interesting - does crossfilter support t-digest out of the box, or does it require some custom code? I'm not very familiar with this JS stuff, so maybe it's a naive question. Anyway, for JS I think a JSON format would be a good match - the question is whether to tweak the text format or simply add a custom …
I've pushed two WIP changes to this branch: https://github.com/tvondra/tdigest/tree/json-output The first one adds a cast to JSON, so that it's possible to do something like this:
or the more SQL-standard-compliant form:
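As a sketch of what the two spellings might look like (hedged - the exact output format is whatever the json-output branch produces, and the table `t`, column `x`, and compression `100` are hypothetical):

```sql
-- PostgreSQL cast syntax:
SELECT tdigest(x, 100)::json FROM t;

-- SQL-standard CAST syntax, producing the same result:
SELECT CAST(tdigest(x, 100) AS json) FROM t;
```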
The format is fairly simple, I think. The one detail I decided not to include is the … The second commit adds the trimmed-mean functionality.
Both patches need more testing, but some initial feedback would be handy.
Re the cast and trimmed mean: great news, thank you very much! Will try to test it ASAP and let you know... Re crossfilter: no, crossfilter does not support t-digest out of the box, but you can define custom reducers for any data structure with just a couple of lines of code.
I tried the json-output branch against the official docker image postgres:12.4 and I can confirm that it builds, installs and works fine. For my use case it would be very beneficial if you could compact the JSON representation to a minimum: minimal JSON property names and, especially, trimming the trailing zeros from the numbers in the sum array. I am transferring (tens or possibly hundreds of) thousands of data points from the PG backend to the client, so every byte counts towards the required bandwidth and transfer time. Further, it would also be very beneficial if you could extend the trimmed mean API to the TDIGEST type itself, so I can calculate the trimmed mean of precalculated TDIGEST(s). Everyone's use case is different, so take this as just my 2 cents in the discussion. Thanks for your work on this excellent PG extension!
Yeah, I'll definitely add the trimmed mean for pre-calculated tdigests - I've been too lazy to do that in the experimental WIP patch. I'll see what I can do to minimize the JSON. I'm not a huge fan of minimizing the property names, because there are not that many of them and it'd make the format harder to work with, so that doesn't seem like a good trade-off. About the trailing zeroes - I suppose you mean trailing zeroes after the decimal point, right? Like 0.5130000 should become 0.513, etc. I'll see what I can do about it - I'm simply relying on the API that does the formatting (using …).
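To illustrate the desired trimming in plain SQL (a hedged sketch only - the extension does its formatting in C, and this approach assumes the value is known to contain a decimal point):

```sql
-- Strip trailing zeroes after the decimal point, then any
-- dangling decimal point that remains.
SELECT rtrim(rtrim('0.5130000', '0'), '.');  -- '0.513'
SELECT rtrim(rtrim('12.000', '0'), '.');     -- '12'
```

Note this must only be applied to strings that actually contain a decimal point; naively applying it to an integer like '100' would corrupt the value.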
I've pushed (into the branch) a somewhat more complete version, adding a function to calculate the mean from a t-digest and tweaking the formatting. Can you check whether this works fine for your use case?
Yes, the new version also builds, installs and works fine on the official docker image postgres:12.4.
I have pushed most of the improvements discussed in this thread into the master branch, after some cleanup and polishing. That includes:
Can you take a look at the improvements and see if there's something that doesn't work the way you'd like? All of this is in the 1.3.0-dev version. It's not the default version, so it needs to be specified in CREATE EXTENSION explicitly. I've also modified the tests quite a bit, to improve the coverage and reduce the duration.
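Since 1.3.0-dev is not the default version, installing it would presumably look like this (standard CREATE EXTENSION / ALTER EXTENSION syntax; the version string is taken from the comment above):

```sql
-- Install the development version explicitly; a plain
-- CREATE EXTENSION tdigest; would install the default version.
CREATE EXTENSION tdigest VERSION '1.3.0-dev';

-- An already-installed extension could be upgraded instead:
ALTER EXTENSION tdigest UPDATE TO '1.3.0-dev';
```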
Hello, in the heading of the README you mention:

"This PostgreSQL extension implements t-digest, a data structure for on-line accumulation of rank-based statistics such as quantiles and trimmed means."

But I have the impression that you don't expose an API for getting the trimmed mean, or at least a list of centroids and their parameters, programmatically.
Am I missing something? Is it possible to calculate a trimmed mean using your extension?