Do we really care a lot about duplicate commands in histdb call? Improves histdb perf by 100x #52
Ok, it looks like sqlitebrowser itself also has a huge impact on performance, so it's not a good tool for comparisons. Running it in the shell, the time difference is 200ms with the original query, 3ms with my adjustment. When filtering by a directory, the original query takes 400ms in sqlitebrowser but only 100ms from the shell; the modified query takes 10ms in sqlitebrowser, 4ms in the shell. Let me know if you think this change is a good idea - I can send a PR. |
Indexing on time makes a lot of sense, as this doesn't affect the outputs. Similarly the changes in your other issue make sense for this reason. I am not sure about changing the outputs, since quite a lot of people use this and I personally often get upset when people change the behavior of things that I use. I think my preferred solution if you want the speed benefit of removing the group by is just to hack it locally. This is probably also influenced by the fact that I don't find the delay perceptible (and I work on a pretty old computer). |
Well, for a normal histdb call it doesn't really matter much, I think - but do you also not notice a delay in the interactive version, where it does a synchronous query on every keystroke? I mean, my goal is to have |
For whatever reason, in the reverse-isearch mine is not noticeably slow. That hstr thing looks like something you could imitate well using fzf and a command to run your faster query. If you make something which works well let me know and I will put it in the readme / in as a script since many people are keen on that sort of interface. |
Forgot to respond to this earlier: sadly, indexing on time doesn't help at all as long as you group the results (try explain query plan and you'll see it will still scan the whole table).
It's not bad enough to be really annoying, but it's definitely noticeable for me, especially for the first three chars you type in (try "cat"); the delay is gone once you enter a longer string. My patch is below, btw - I think the interactive part can be applied without any visible changes (since you only ever see one result anyway). The creation of the time index is not in there. I could also send a PR if you want, but it's probably not worth another migration (just apply it on creation of new databases).
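The "still scans the whole table" claim can be checked directly. This is a sketch using Python's built-in sqlite3 module; the schema is a simplified, hypothetical stand-in for histdb's history table, not its real one:

```python
import sqlite3

# Simplified, hypothetical stand-in for histdb's history table.
con = sqlite3.connect(":memory:")
con.executescript("""
create table history (id integer primary key, command_id int,
                      place_id int, start_time int);
create index hist_time on history(start_time);
""")

def plan(sql):
    # explain query plan rows are (id, parent, notused, detail)
    return [row[3] for row in con.execute("explain query plan " + sql)]

# With the GROUP BY, sqlite must aggregate before it can sort, so the
# hist_time index cannot satisfy the ORDER BY and temp b-trees are used:
grouped = plan("""
    select command_id, max(start_time) from history
    group by command_id, place_id
    order by max(start_time) desc limit 10""")

# Without the GROUP BY, the index already yields rows in time order:
plain = plan("""
    select command_id, start_time from history
    order by start_time desc limit 10""")

print(grouped)
print(plain)
```

On recent SQLite versions the first plan contains USE TEMP B-TREE steps, while the second walks hist_time with no separate sort step, which matches the observation above.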
Would probably be a bit harder since I also want to have an interactive toggle to make it only search for commands in the current dir (which IMO is kind of a killer feature of histdb). Or maybe change the order according to that. I'll let you know when I make something.

diff --git a/histdb-interactive.zsh b/histdb-interactive.zsh
index decc877..c2139d7 100644
--- a/histdb-interactive.zsh
+++ b/histdb-interactive.zsh
@@ -59,7 +59,7 @@ _histdb_isearch_query () {
commands.argv,
places.dir,
places.host,
-datetime(max(history.start_time), 'unixepoch')
+datetime(history.start_time, 'unixepoch')
from history left join commands
on history.command_id = commands.rowid
left join places
@@ -67,8 +67,7 @@ on history.place_id = places.rowid
where commands.argv glob '*$(sql_escape ${BUFFER})*'
${where_host}
${where_dir}
-group by commands.argv, places.dir, places.host
-order by ${maxmin}(history.start_time) ${ascdesc}
+order by history.start_time ${ascdesc}
limit 1
offset ${offset}"
local result=$(_histdb_query -separator $'\n' "$query")
diff --git a/sqlite-history.zsh b/sqlite-history.zsh
index 033199c..70a7057 100644
--- a/sqlite-history.zsh
+++ b/sqlite-history.zsh
@@ -311,7 +311,7 @@ histdb () {
local seps=$(echo "$cols" | tr -c -d ',' | tr ',' $sep)
cols="${cols}, replace(commands.argv, '
', '
-$seps') as argv, max(start_time) as max_start"
+$seps') as argv, start_time as max_start"
local mst="datetime(max_start, 'unixepoch')"
local dst="datetime('now', 'start of day')"
@@ -325,7 +325,6 @@ from
left join commands on history.command_id = commands.id
left join places on history.place_id = places.id
where ${where}
-group by history.command_id, history.place_id
order by max_start desc
${limit:+limit $limit}) order by max_start asc"
@@ -336,7 +335,6 @@ from
left join commands on history.command_id = commands.id
left join places on history.place_id = places.id
where ${where}
-group by history.command_id, history.place_id
order by max_start desc) order by max_start asc"
if [[ $debug = 1 ]]; then |
OK - thanks - I'll try and look into this a bit more this week. Even for single character queries I find the isearch is basically instant, so I think there must be something weird going on but I can't guess what. |
I have a feeling the query plan for this is a bit stupid anyway - it should be scanning commands for matching history rows and then using the index on command_id to efficiently search history, rather than scanning history. I will see whether I can coerce it into doing something else with the same output by moving some joins around. |
This might be a definite no and I quite understand if it is, but is there any chance I could have your history file to see whether I can reproduce the performance you are getting? Alternatively if you could select count(*) in all the tables or some other statistics (e.g. distinct count on columns) I can try generating some test data of the same size. |
Also, as another line of inquiry, could you try
I tried making that index before and sqlite used it in the select query, but it didn't really change performance, I think. My histdb file is maybe 90% imported with https://github.com/phiresky/ts-histdbimport. That could change some perf characteristics because most things are in the same "place", I guess? I also vacuumed it before testing. I just wrote a script to generate a history file: https://gist.github.com/phiresky/f9de114c7eda7cf80409a1fe304787ca . I'll send a copy of a generated db and also see if that behaves the same as mine. |
Here you go, here's a slightly larger file (generated with the above script): 400k entries, 40MB. zsh-history.db.gz

Timing of a simple histdb call: 0.99s to 1.003s, via SCAN TABLE (once for the count, once for the results). After creating the index and applying the patch: 0.8s - 0.90s. Almost all of that time is spent on the select count(*); without the count the time is 0.05s. If I replace the count query with a simplified one, it goes down to ~0.15s:

@@ -325,19 +326,16 @@ from
left join commands on history.command_id = commands.id
left join places on history.place_id = places.id
where ${where}
-group by history.command_id, history.place_id
order by max_start desc
${limit:+limit $limit}) order by max_start asc"
## min max date?
- local count_query="select count(*) from (select ${cols}
+ local count_query="select count(*)
from
history
left join commands on history.command_id = commands.id
left join places on history.place_id = places.id
-where ${where}
-group by history.command_id, history.place_id
-order by max_start desc) order by max_start asc"
+where ${where}"
if [[ $debug = 1 ]]; then
echo "$query"

(not sure why the select is wrapped in another select?) |
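For what it's worth, the simplified count is not just faster - it counts something different once duplicate runs exist. A sketch (Python's sqlite3, schema cut down to the relevant columns, not histdb's real one):

```python
import sqlite3

# Hypothetical miniature of the history table, just enough to show why
# the grouped and the plain count queries can disagree.
con = sqlite3.connect(":memory:")
con.executescript("""
create table history (command_id int, place_id int, start_time int);
-- command 1 ran twice in the same place, command 2 ran once:
insert into history values (1, 1, 100), (1, 1, 101), (2, 1, 102);
""")

# Original shape: count the grouped subquery, collapsing repeated runs.
grouped = con.execute("""select count(*) from
    (select command_id from history
     group by command_id, place_id)""").fetchone()[0]

# Simplified shape: count raw history rows, duplicates included.
plain = con.execute("select count(*) from history").fetchone()[0]

print(grouped, plain)
```

So the simplification trades an exact deduplicated count for speed; whether that matters depends on what the count is displayed for.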
Thanks for all this effort - I'm quite busy with work at the moment so I don't have much spare time to think about other programs, but I will get to this sooner or later. The inner / outer select is there to get the presentation right, and it may be it could be rolled into one select. The difficulty is probably that it's sorted one way on the inside and the other way on the outside. |
I just had a quick go with the db you are looking at there; generating a query with
and for the count query
I have rearranged the joins a bit in that query from what master produces at the moment. If I chuck that into sqlite, it is a bit slow, but maybe 5x quicker than on your machine - roughly what spec are you running on? Mine is an i5-3320 at 2.6GHz with a reasonable SSD and 16G of memory. The reason for the group by is that I often run the same command lots of times in a row because I'm changing inputs and inspecting outputs - I don't like having these dups in the history or the reverse-isearch, as they are clutter. However, I do store the repeat runs just in case I want to know about them later for some reason (admittedly this has never come up yet). I don't know if there is a more efficient way to get this effect - I could in principle prevent the append of the duplicates into the log, but then their time & exit status information would be lost. |
Sorry, I didn't really look into this further. I'm using histdb in parallel with normal history currently, so I only use it for archiving and specific queries (filtering by dir is awesome) - so I currently don't really care about performance. Just one more note since I just thought of this: the query with duplicates could have a LIMIT set to wanted_limit * 1.5, then be fed into another query that filters duplicates out. Then, if the number of returned entries is less than the wanted number of rows, the query can be re-executed in a loop with the limit increasing exponentially (2x, then 4x, then 8x) until the desired number of rows is returned or the limit reaches the db size. That way the change would be invisible from the user's perspective and good for performance, with the only cost being code complexity. |
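That over-fetch-and-dedup loop could look roughly like this (a sketch in Python's sqlite3; the function name, the reduced schema, and the client-side dedup key are all illustrative, not from histdb):

```python
import sqlite3

def fetch_deduped(con, wanted, db_size):
    """Fetch `wanted` distinct (command, place) rows without GROUP BY:
    over-fetch with a padded LIMIT, drop duplicates client-side, and
    retry with an exponentially larger limit if too few remain."""
    limit = int(wanted * 1.5)
    while True:
        rows = con.execute(
            """select command_id, place_id, start_time from history
               order by start_time desc limit ?""", (limit,)).fetchall()
        seen, out = set(), []
        for cmd, place, t in rows:
            if (cmd, place) not in seen:
                seen.add((cmd, place))
                out.append((cmd, place, t))
        # Stop once we have enough, or once the limit covers the whole db.
        if len(out) >= wanted or limit >= db_size:
            return out[:wanted]
        limit *= 2

# Tiny demo table: command 1 ran three times, commands 2 and 3 once each.
con = sqlite3.connect(":memory:")
con.executescript("""
create table history (command_id int, place_id int, start_time int);
insert into history values (1,1,1),(1,1,2),(1,1,3),(2,1,4),(3,1,5);
""")
print(fetch_deduped(con, 3, 5))
```

The loop only pays the full-scan cost when duplicates are dense near the top of the history, which is exactly the case the 1.5x padding is meant to absorb.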
Think the same, this is the last wanted improvement that I hope |
As usual, happy to accept pull requests that improve perf whilst preserving the output I want. Duplicate commands in the log are not output that I want, so unfortunately the performance cost of removing them has to be paid. @tmpm697 you have made several requests for things to be different in some way, but you haven't shown any work you've done to make them different. I know this probably comes across harshly, but I want to make the terms on which this software is made available quite clear: I made this program to meet my needs, which it does, so I don't need to do any more work on it. If you want it to change, do some work to change it and you will find satisfaction. Phiresky's last comment describes one way that you could try to improve this aspect of query performance. I'm not going to do it because I also don't care to spend time making history queries faster; queries I make return in about 0.04 seconds, and I think I would not intervene until the latency was about 0.2 seconds. |
(semi-related to #2)

When you remove the group by clause and add an index on time:

create index hist_time on history(start_time);

For example compare this (slow): to this:

For me, this reduces the duration of simple calls like histdb or histdb borg from 1100ms to 10ms. It would even be possible to wrap this in another call to remove duplicates, with the only caveat that you may then get somewhat fewer results than you requested with --limit
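The slow/fast queries referred to above did not survive this capture, but their shapes can be sketched from the patches quoted in the comments (Python's sqlite3, reduced hypothetical schema): the grouped form returns one row per (command, place), while the ungrouped form lets the time index drive the sort but surfaces every repeated run - which is why a wrapping dedup step can hand back fewer rows than the limit asked for.

```python
import sqlite3

# Reduced, hypothetical schema; column names follow the diffs above.
con = sqlite3.connect(":memory:")
con.executescript("""
create table history (command_id int, place_id int, start_time int);
create index hist_time on history(start_time);
-- command 1 ran three times in the same place, command 2 once:
insert into history values (1,1,10),(1,1,11),(1,1,12),(2,1,13);
""")

# Original (slow) shape: GROUP BY collapses repeats, keeping the latest time.
grouped = con.execute("""
    select command_id, datetime(max(start_time), 'unixepoch') from history
    group by command_id, place_id
    order by max(start_time) desc""").fetchall()

# Patched (fast) shape: no GROUP BY, so hist_time can satisfy the ORDER BY,
# but each run of a duplicated command appears as its own row.
plain = con.execute("""
    select command_id, datetime(start_time, 'unixepoch') from history
    order by start_time desc""").fetchall()

print(len(grouped), len(plain))
```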