Denote Candidate Generator #41
Comments
Thanks for writing this! I can definitely use your help in fleshing out the ideas for this feature to make it as useful as possible. Here are my current thoughts: I'm first trying to find the best architecture for extending the information available to search on. I'm thinking of adding a new concept called "mapping" (alongside "candidate generators" and "priors"), which takes a candidate document with certain properties and adds additional properties that can be searched. So for example, with the current filesystem candidate generator, the user can select a directory of denote files and get a basic search for each one:
This alone provides search based on the filename, content, and other filesystem things. So to add support for denote, I'm thinking of adding a new entity to the system called a "mapping". Each mapping would have an input property, and add new properties to the candidate. So the denote mapping would take in a "filename", and add "denote-date", "denote-title", "denote-keywords" (the three of which are taken directly from the filename), and "denote-tags" from the content of the file. A "bibtex" mapping could be created too, adding bibtex-related properties. Now we have the same candidates but they have more properties:
I'm leaning toward keeping mappings separate from candidate generators, but another method would be to have a mechanism to extend existing candidate generators, so someone could use FILESYSTEM as their base generator and create DENOTE or BIBTEX, each of which adds some things to FILESYSTEM. This has the advantage of simplicity, but lacks compatibility: with mappings, by contrast, one can add "denote" and "org" mappings at the same time. So besides the UX of creating the generators, I could definitely use your input on what kind of things related to denote you'd like to search. Something I can think of:
Definitely let me know what you think!
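The denote mapping described above (a "filename" property in; "denote-date", "denote-title", and "denote-keywords" out) could look roughly like the following. This is only an illustrative sketch in Python, not p-search's Emacs Lisp code: the function name `denote_mapping` and the dictionary-based candidate are assumptions, and it handles only Denote's basic `IDENTIFIER--TITLE__KEYWORDS.EXT` naming scheme (no signatures).

```python
import os
import re
from datetime import datetime

# Basic Denote file name: 20240131T093000--my-first-note__emacs_search.org
DENOTE_RE = re.compile(
    r"^(?P<id>\d{8}T\d{6})"          # identifier, doubles as the creation date
    r"--(?P<title>[^_.]*)"           # hyphen-separated title slug
    r"(?:__(?P<keywords>[^.]*))?"    # optional underscore-separated keywords
    r"(?:\..*)?$"                    # optional file extension
)


def denote_mapping(candidate):
    """Hypothetical mapping: derive denote-* properties from "filename".

    Returns a new candidate with the extra properties, or None when the
    file name does not follow the Denote scheme.
    """
    name = os.path.basename(candidate["filename"])
    match = DENOTE_RE.match(name)
    if match is None:
        return None
    keywords = match.group("keywords")
    return {
        **candidate,
        "denote-date": datetime.strptime(match.group("id"), "%Y%m%dT%H%M%S"),
        "denote-title": match.group("title"),
        "denote-keywords": keywords.split("_") if keywords else [],
    }
```

The original properties are kept and the new ones are layered on top, which is what lets several mappings (denote, org, bibtex) be applied to the same candidate.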
May also be helpful to collect the document title, but I suspect that would come from other mappings. As you build this, I would really appreciate documentation for how I can extend p-search.
I think this mappings proposal would mostly work.
I definitely think this is the way to go (at least, for most files, again BibTeX may add some complications).
Adding mappings to generators should probably be an entirely separate step.
Those are very similar to what I was thinking, though it could also be prioritization based on other date predicates. I am a bit confused how you are differentiating tags and keywords in this case, but yes, those seem like a reasonable set to start.
Sounds good! So I've been reading up on multi-field search and found the algorithm BM25F, an extension of what it's currently doing, to be a good fit for field search. I now have an idea of how things will fit together and am working on it. Making this package as extensible as possible is definitely one of my goals, so after I get these features in, I will probably start on writing the documentation. I hope to have all of this done within a month and will keep you posted. Definitely let me know any other ideas or feature requests you might have.
Oh, maybe I misunderstood Denote's conventions. So the "...__a_b_c.org" and the front matter "#+filetags: :denote:testing:" are actually conveying the same information I guess.
I wasn't thinking clearly there. They exist separately for a reason: not all denote files are org (they can include, for example, images or PDFs). So denote keywords exist, and org tags may exist as well, but that would probably best be a different mapping.
Ok, that makes sense. I'm not too familiar with denote so I'll definitely have questions about how to make this the most useful.
It might be reasonable to provide an example of a simple mapping, and I can try to work from there.
Sounds good. I'll message back when I get the mapping system complete. So far it's going well; here's the WIP #43, still a lot more to implement.
This seems like a really clean interface! That said, I do have a couple questions. Given the example prior that you show, if I specify both the Also, for these sorts of matchers, how do you plan on supporting them? Because not everyone who uses p-search necessarily uses denote, this seems like it might be something that should be in a separate file. If so, would you be interested in me working with it a bit & contributing it once the mappings are merged?
As a note, when I use a slightly edited version of your code, I am currently getting the following error:
When running
Also, it's unclear if you can stack mappings, or what happens if a document can't be mapped.
I want to clarify my previous comment to ask a few specific questions.
So the way it's currently set up is that if you don't specify fields, it will search all the fields, and give any extra specified weight to the field. If you do specify fields (multiple select WIP) it will only search in those fields. That said, another query feature whose details I'm still working out is being able to specify fields in the query string, so for example,
Currently nil means drop, but it doesn't have to be this way. In fact I was also thinking of having it not drop, or even better, of having this as an argument that can be passed to the mapping (thus allowing filters from the mapping infrastructure).
Mappings do indeed compose. The mappings run in order, each one adding any additional fields or creating as many sub-documents as needed. So for example, one mapping I want to create in the future is one that splits the document up every certain number of lines. So in theory you could first run it through the denote mapping, giving the candidate the denote fields, then you could run it through the file-splitting mapping, keeping the denote fields, but creating several documents from each one. Improving the UI on this is definitely something I want to do, as you can set up pretty complicated things.
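To make the composition concrete, here is a rough Python sketch of mappings run in order, where one mapping adds a field and a later one splits each document into several while keeping the fields added earlier. The names `run_mappings`, `add_line_count`, and `split_every` are hypothetical, and modelling a mapping as a function returning a list of documents is an assumption made for the illustration, not p-search's actual interface.

```python
def run_mappings(candidates, mappings):
    """Apply each mapping in order; a mapping turns one document into a list."""
    for mapping in mappings:
        next_round = []
        for doc in candidates:
            next_round.extend(mapping(doc))
        candidates = next_round
    return candidates


def add_line_count(doc):
    # Field-adding mapping: one document in, one enriched document out.
    return [{**doc, "line-count": len(doc["content"].splitlines())}]


def split_every(n):
    # Splitting mapping: one document in, one sub-document per n lines out.
    def mapping(doc):
        lines = doc["content"].splitlines()
        chunks = [
            {**doc, "content": "\n".join(lines[i:i + n]), "chunk": i // n}
            for i in range(0, len(lines), n)
        ]
        return chunks or [doc]  # keep empty documents as they are
    return mapping


docs = run_mappings(
    [{"name": "note.org", "content": "a\nb\nc\nd\ne"}],
    [add_line_count, split_every(2)],
)
```

Because the splitting mapping copies the incoming document before overriding its content, every sub-document still carries the "line-count" field computed by the earlier mapping.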
In the current implementation, a weight of N means that a term in that field is counted as if it occurred N times. It's kind of hard to put a number on how much higher the score will be, but it can be thought of like this: if the title had a weight of 10 and a document had term X in its title, another document would need to have ten occurrences of X in its body to be given the same score.
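The weighting rule above can be illustrated with a small Python sketch. It shows only the weighted term-count step (a term in a field of weight N counts N times); the saturation and length normalization that a BM25F-style scorer applies afterwards are left out, and `weighted_term_frequency` is a made-up name, not a p-search function.

```python
from collections import Counter


def weighted_term_frequency(fields, weights, term):
    """Sum per-field occurrences of `term`, each multiplied by the field weight.

    Fields without an explicit weight count once per occurrence.
    """
    total = 0
    for name, text in fields.items():
        occurrences = Counter(text.lower().split())[term]
        total += weights.get(name, 1) * occurrences
    return total


weights = {"title": 10}
in_title = {"title": "emacs", "body": "some unrelated notes"}
in_body = {"title": "misc", "body": " ".join(["emacs"] * 10)}

tf_title = weighted_term_frequency(in_title, weights, "emacs")
tf_body = weighted_term_frequency(in_body, weights, "emacs")
```

With a title weight of 10, one occurrence in the title and ten occurrences in the body produce the same weighted count, which matches the equivalence described in the comment.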
Definitely! Also "time" was one that feels like it could come up a lot. I'm still trying to figure out the queries to make available for number fields. And thanks for that PR btw! I merged it into #43 and after certain details are figured out and it's all tested, I'll merge it in.
Okay, that makes a lot of sense, and I like the idea of a special query syntax. However, I think maybe a query syntax of
I'm not sure that mappings should remove documents, in general. I can get behind splitting documents (maybe), but removing them seems counterintuitive. As I was writing the mappings that you merged, this was confusing, and my thought was that I should only return changes, with nil signalling no change. Since that is not currently the case, I'll make another PR so that they don't drop items. A specific, separate value to remove the document seems reasonable.
I think one of the major modes of composition I'd like to see is (perhaps) allowing multiple values for a given field, and letting new mappings add new types of fields as necessary. Consider, for example, title, in the case of Denote: I could store not only the sluggified title, but the document's regular title. If I call them both
I'm not sure that I as a developer should be setting the weight or priority of different fields, but it does seem reasonable that I can set a default. Perhaps providing a UI to modify this would be helpful (it may be something that's best done as part of creating the prior using it?).
Great. Some thoughts for number-related queries:
With similar queries for date/times?
I'll be submitting another one shortly, to address some misunderstandings that I had.
As I have been thinking about this a bit more, I think the best way to approach this is:
This makes it so that, as a developer of mappings, I only have to concentrate on two cases:
This would simplify the code that I have in the denote and pdf info mappers, and would make it quite a bit easier to add future mappers (such as a denote graph mapper or, for users of org-roam, org-roam-related mappers).
Oh yeah, that makes more sense.
I was thinking about this too. Sometimes you'd even want to extract multiple values for the same field. Like for example, in an HTML or markdown file, you might want to extract the various headings, so in that case you could have multiple
I agree these should definitely be easily customizable. Maybe something like
I like this approach too. It looks like it simplifies mapping and filtering. I'll update the code to have these semantics. This way, I think we could get the
I merged in the branch. There should now be support for multiple field values (fields can be a list of strings instead of a string). Also, a returned nil has the meaning you mentioned, of just doing nothing. There is now a default mapping (-f) for changing the functionality of this. I also moved the generator and mapping enhancements to a new subdirectory called "extensions" with a name prefix of "psx", analogous to how org babel extensions have the "ob-" prefix. There may well be bugs, so definitely let me know if something is off. Next up I'll work on tackling some of the other issues. Also, documentation is a huge need so I'll be working on that too.
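The two semantics mentioned here (a field may hold several values, and a mapping that returns nil changes nothing) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the merged code: `apply_mapping` is a hypothetical helper, `None` stands in for nil, every field is normalized to a list, and new values are appended to existing ones instead of overwriting them, which the thread does not confirm.

```python
def as_list(value):
    """Normalize a single value or a list of values to a list."""
    return list(value) if isinstance(value, list) else [value]


def apply_mapping(doc, mapping):
    """Merge the fields returned by `mapping` into `doc`.

    A result of None means "no change": the document is passed through.
    """
    new_fields = mapping(doc)
    if new_fields is None:
        return doc
    merged = {name: as_list(value) for name, value in doc.items()}
    for name, value in new_fields.items():
        existing = merged.get(name, [])
        # Assumption: append unseen values rather than replace the field.
        merged[name] = existing + [v for v in as_list(value) if v not in existing]
    return merged


doc = {"title": "my-first-note"}
enriched = apply_mapping(doc, lambda d: {"title": ["My First Note"], "heading": "Intro"})
untouched = apply_mapping(doc, lambda d: None)
```

Under these assumptions a mapping author only has to return the fields being contributed, or nothing at all when the document does not apply.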
Is there an example of adding new values (not overwriting) to an existing field?
That makes sense!
Right off the bat, it looks like Other than that, I'll be using it and let you know if I find any other bugs. Thank you for all the work that you've put in to this!
I could have been clearer here, sorry. As a developer, if I define a field, it seems reasonable that I can provide a default priority (ideally, yes, through customize).
If this is going to be the approach, there are a couple of things that need to be clarified as a developer.
I don't have an example yet of this, but essentially the "new-fields" argument to
will also produce the same result.
Oh, good catch! Updating this right now.
And thank you for collaborating with me on this! Having someone to go back and forth with makes a great difference.
Oh, I see what you mean. So the first thought that comes to mind is to have an additional option on the query prior (probably the only one that will be dealing with these weights) which, when selected, lets you input override weights. I think the key to this will be creating a brand new transient input to make setting the weights effortless. Another idea is to have a session-level customization, like the typical Emacs defcustom, but applying only to the current session. I'll need to think about this a little bit more.
So as alluded to above, if you want to just add to an existing field, you should be able to just call
So As for the other items, I think there could be like a standard definition of common fields. Doing this wouldn't change the working of p-search, but rather the way different extensions coordinate on defining properties. We'd definitely need to think through things. Like let's say we have Thanks again for your help working these things out!
I think calling it
Hmm... To the earlier part of this, yes, standard definitions of common fields really do seem necessary.
Ok, let's go with that then. I'll update the code, changing the property title to be "name" and leaving "title" to be a field. Then, I'll define the fields "author", "title", "keywords", "creation-date", "modification-date", "language", and "file-type" in p-search.el.
So the field refactoring, adding the default fields, should be done in PR #65. This also adds a new mechanism to query by category, so I changed up psx-denote a bit since querying by category is now possible. If you still want to be able to query categorical fields like text, I created issue #66 for being able to query categories as text.
The formatted text version was a stop-gap until category queries were supported as it is. I'll take a look at #65, and on merge, I think this issue will be closed.
Ok, should be merged in now. Definitely let me know if anything's not working as expected.
It works wonderfully so far! Thank you!
And thank you for your ideas working with me through this!
See also #18.
It could be useful to implement a candidate generator that takes all denote-managed files in a directory and treats them as candidates. Depending on future plans for metadata searchability, this could also provide information about tags, title, date, etc.
This may also suggest implementation of other priors (tags, linking metadata, etc.).
I am more than happy to help!