Denote Candidate Generator #41

Closed
swflint opened this issue Dec 12, 2024 · 28 comments
Comments

@swflint
Contributor

swflint commented Dec 12, 2024

See also #18.

It could be useful to implement a candidate generator that takes all denote-managed files in a directory and treats them as candidates. Depending on future plans for metadata searchability, this could also provide information about tags, title, date, etc.

This may also suggest implementation of other priors (tags, linking metadata, etc.).

I am more than happy to help!

@zkry
Owner

zkry commented Dec 12, 2024

Thanks for writing this! I can definitely use your help in fleshing out the ideas for this feature to make it the most useful. Here are my current thoughts:

I'm first trying to find the best architecture for extending the information available to search on. I'm thinking of adding a new concept called "mapping" (alongside "candidate generators" and "priors") which takes a candidate document with certain properties and adds additional properties that can be searched on.

So for example, with the current filesystem candidate generator, the user can select a directory of denote files and get a basic search for each one:

FILESYSTEM
|-> 20220610T162327--my-note-1__notetaking_philosophy.txt
      \--> filename, content, title
|-> 20220311T031232--my-note-2__cooking.txt
      \--> filename, content, title
...

This alone provides search based on the filename, content, and other filesystem things. So to add support for denote, I'm thinking of adding a new entity to the system called a "mapping". Each mapping would have an input property, and add new properties to the candidate. So the denote mapping would take in a "filename", and add "denote-date", "denote-title", "denote-keywords" (the three of which are taken directly from the filename), and "denote-tags" from the content of the file. A "bibtex" mapping could be created too, adding bibtex-related properties. Now we have the same candidates but they have more properties:

FILESYSTEM + DENOTE
|-> 20220610T162327--my-note-1__notetaking_philosophy.txt
      \--> filename, content, title, denote-date, denote-title, denote-keywords, denote-tags
|-> 20220311T031232--my-note-2__cooking.txt
      \--> filename, content, title, denote-date, denote-title, denote-keywords, denote-tags
...

I'm leaning toward keeping mappings separate from candidate generators, but another method would be to have a mechanism to extend existing candidate generators, so someone could use FILESYSTEM as their base generator, and create DENOTE or BIBTEX which has some stuff added to FILESYSTEM. This has the advantage of simplicity, but lacks compatibility. So for example, with mappings, one can add "denote" and "org" mappings at the same time.
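
To make the idea concrete, here is a minimal, self-contained sketch of what a "mapping" does conceptually: it takes a candidate and returns the same candidate with extra properties. It is not p-search code. The candidate is modeled as a plain alist, and the `my/` function names and the regular expression are assumptions made up for this illustration (the real Denote package has its own parsing functions).

```elisp
;; Toy model only: a candidate is an alist such as ((filename . "...")).

(defun my/denote-fields-from-filename (file-name)
  "Return an alist of Denote-style fields parsed from FILE-NAME, or nil.
Assumes names shaped like ID--TITLE__KEYWORDS.EXT, keywords optional."
  (when (string-match
         "\\`\\([0-9]\\{8\\}T[0-9]\\{6\\}\\)--\\([^_.]+\\)\\(?:__\\([^.]+\\)\\)?\\."
         file-name)
    (let ((keywords (match-string 3 file-name)))
      (list (cons 'denote-date (match-string 1 file-name))
            (cons 'denote-title (match-string 2 file-name))
            (cons 'denote-keywords
                  (and keywords (split-string keywords "_" t)))))))

(defun my/denote-mapping (candidate)
  "Return CANDIDATE extended with Denote fields when its filename has them.
Candidates without a Denote-style filename are returned unchanged."
  (let* ((name (alist-get 'filename candidate))
         (fields (and name (my/denote-fields-from-filename name))))
    (append candidate fields)))
```

Because the mapping only appends properties, a second mapping (say, one for org metadata) could be applied to its output in the same way, which is the compatibility argument made above.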

So besides the UX of creating the generators, I could definitely use your input on what kind of things related to denote you'd like to search. Some things I can think of:

  • Input a target date and prioritize notes whose denote date is closer to that date
  • pagerank-like ranking, giving higher probability to documents with more inward links
  • search by keyword: you are prompted to select a keyword from a list of available ones, and the selected keyword gets a higher probability.
  • same as above but for tags

Definitely let me know what you think!
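
As a rough illustration of the first idea in the list above (prioritizing notes close to a target date), here is a small sketch. It is not part of p-search: the function names, the exponential decay, and the 30-day default scale are all assumptions chosen for the example. It relies on the list form of `encode-time`, available in Emacs 27 and later.

```elisp
(defun my/denote-id-to-time (identifier)
  "Convert a Denote IDENTIFIER such as \"20220610T162327\" to a time value.
The timestamp is interpreted as UTC to keep the example deterministic."
  (encode-time
   (list (string-to-number (substring identifier 13 15)) ; seconds
         (string-to-number (substring identifier 11 13)) ; minutes
         (string-to-number (substring identifier 9 11))  ; hours
         (string-to-number (substring identifier 6 8))   ; day
         (string-to-number (substring identifier 4 6))   ; month
         (string-to-number (substring identifier 0 4))   ; year
         nil nil t)))                                    ; ignored, no DST, UTC

(defun my/date-proximity-score (identifier target &optional scale-days)
  "Return a score in (0, 1] that decays as IDENTIFIER moves away from TARGET.
Both arguments are Denote-style identifiers.  SCALE-DAYS (default 30)
is an assumed decay constant for this sketch."
  (let* ((scale (float (or scale-days 30)))
         (diff-seconds (abs (float-time
                             (time-subtract (my/denote-id-to-time identifier)
                                            (my/denote-id-to-time target)))))
         (diff-days (/ diff-seconds 86400.0)))
    (exp (- (/ diff-days scale)))))
```

A note dated exactly on the target scores 1.0, and the score falls off smoothly on either side, so a note that matches strongly for other reasons can still surface.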

@swflint
Contributor Author

swflint commented Dec 13, 2024

> Thanks for writing this! I can definitely use your help in fleshing
> out the ideas for this feature to make it the most useful. Here are my
> current thoughts:
>
> I'm first trying to find the best architecture for extending the
> information available to search on. I'm thinking of adding a new
> concept called "mapping" (alongside "candidate generators" and
> "priors") which takes a candidate document with certain properties and
> adds additional properties that can be searched on.
>
> So for example, with the current filesystem candidate generator, the
> user can select a directory of denote files and get a basic search for
> each one:
>
> FILESYSTEM
> |-> 20220610T162327--my-note-1__notetaking_philosophy.txt
>       \--> filename, content, title
> |-> 20220311T031232--my-note-2__cooking.txt
>       \--> filename, content, title
> ...
>
> This alone provides search based on the filename, content, and other
> filesystem things. So to add support for denote, I'm thinking of
> adding a new entity to the system called a "mapping". Each mapping
> would have an input property, and add new properties to the
> candidate. So the denote mapping would take in a "filename", and add
> "denote-date", "denote-title", "denote-keywords" (the three of which
> are taken directly from the filename), and "denote-tags" from the
> content of the file. A "bibtex" mapping could be created too, adding
> bibtex-related properties. Now we have the same candidates but they
> have more properties:

May also be helpful to collect the document title, but I suspect that would come from other mappings.

As you build this, I would really appreciate documentation for how I can extend p-search.
I can think of a number of other mappings that could be interesting (CODEOWNERS, information from pdfinfo, etc.), but I would have to think about that more.

> FILESYSTEM + DENOTE
> |-> 20220610T162327--my-note-1__notetaking_philosophy.txt
>       \--> filename, content, title, denote-date, denote-title, denote-keywords, denote-tags
> |-> 20220311T031232--my-note-2__cooking.txt
>       \--> filename, content, title, denote-date, denote-title, denote-keywords, denote-tags
> ...

I think this mappings proposal would mostly work.
My only concern is with the BibTeX example, where a single file may have multiple candidates.
That said, it would probably make sense to figure out a way to support that as well.

> I'm leaning toward keeping mappings separate from candidate generators, but
> another method would be to have a mechanism to extend existing
> candidate generators, so someone could use FILESYSTEM as their base
> generator, and create DENOTE or BIBTEX which has some stuff added to
> FILESYSTEM. This has the advantage of simplicity, but lacks
> compatibility. So for example, with mappings, one can add "denote" and
> "org" mappings at the same time.

I definitely think this is the way to go (at least, for most files, again BibTeX may add some complications).

> So besides the UX of creating the generators, I could definitely use
> your input on what kind of things related to denote you'd like to
> search. Some things I can think of:

Adding mappings to generators should probably be an entirely separate step.
If (I have not quite played with this enough) there is a good way to save this in an editable way (as a dir-local?), all the better.

>   • Input a target date and prioritize notes whose denote date is
>     closer to that date
>   • pagerank-like ranking, giving higher probability to documents with
>     more inward links
>   • search by keyword: you are prompted to select a keyword from a
>     list of available ones, and the selected keyword gets a higher
>     probability.
>   • same as above but for tags

Those are very similar to what I was thinking, though it could also be prioritization based on other date predicates.
For example, I would prefer notes from a date after the provided one, but if something else scores highly for other reasons, show it.

I am a bit confused how you are differentiating tags and keywords in this case, but yes, those seem like a reasonable set to start.

@zkry
Owner

zkry commented Dec 14, 2024

Sounds good! So I've been reading up on multi-field search and found the algorithm BM25F, an extension of what p-search is currently doing, to be a good fit for field search. I have an idea now of how things will fit together and am working on this now.

Making this package as extensible as possible is definitely one of my goals and so after I get in these features, I will probably start on writing the documentation. I hope to have all of this done within a month and will definitely keep you posted. Definitely let me know any other ideas or feature requests you might have.

> I am a bit confused how you are differentiating tags and keywords in this case, but yes, those seem like a reasonable set to start.

oh, maybe I misunderstood Denote's conventions. So the "...__a_b_c.org" and the front matter "#+filetags: :denote:testing:" are actually conveying the same information I guess.

@swflint
Contributor Author

swflint commented Dec 14, 2024

> oh, maybe I misunderstood Denote's conventions. So the "...__a_b_c.org" and the front matter "#+filetags: :denote:testing:" are actually conveying the same information I guess.

I wasn't thinking clearly there. They exist separately for a reason: not all denote files are org (can include, for example, images or PDFs). So denote keywords exist, and org tags may exist as well, but that would probably best be a different mapping.

@swflint swflint closed this as completed Dec 14, 2024
@swflint swflint reopened this Dec 14, 2024
@zkry
Owner

zkry commented Dec 14, 2024

Ok, that makes sense. I'm not too familiar with denote so I'll definitely have questions how to make this the most useful.

@swflint
Contributor Author

swflint commented Dec 14, 2024

It might be reasonable to provide an example of a simple mapping, and I can try to work from there.

@zkry
Owner

zkry commented Dec 15, 2024

sounds good. I'll message back when I get the mapping system complete. So far it's going well, here's the WIP #43, still a lot more to implement.

@zkry
Owner

zkry commented Dec 19, 2024

So I've put in a lot of work concerning mappings and fields and I'm pretty close to having the feature done. The mapping API for creating extensions seems to be pretty solid so I thought I'd share it here.

So to add a new mapping type, this is the code that would be needed:

;; Declare the searchable fields this mapping adds, with their types and weights.
(p-search-def-field 'denote-title 'text :weight 3)
(p-search-def-field 'denote-type 'category)
(p-search-def-field 'denote-identifier 'text :weight 10)
(p-search-def-field 'denote-keywords 'category)

(defconst p-search-mapping-denote
 (p-search-candidate-mapping
  :name "Denote"
  :required-property-list '(file-name)
  :input-spec '()
  :options-spec '()
  :function
  ;; Called with the mapping's arguments (unused here) and the document to map.
  (lambda (_ document)
    (let* ((file-name (p-search-document-property document 'file-name)))
      (when (denote-file-is-note-p file-name)
        (let* ((id (p-search-document-property document 'id))
               (new-id (list 'denote id))
               (identifier (denote-retrieve-filename-identifier file-name))
               (title (denote-retrieve-filename-title file-name))
               (keywords (denote-retrieve-filename-keywords file-name))
               (type (denote-file-type file-name))
               (new-fields `((denote-type . ,type)
                             (denote-identifier . ,identifier))))
          ;; Keywords and title are optional parts of a Denote file name.
          (when keywords
            (push (cons 'denote-keywords (string-split keywords "_")) new-fields))
          (when title
            (push (cons 'denote-title title) new-fields))
          ;; Return a list of documents, which allows 1-to-many mappings.
          (list (p-search-document-extend document new-id new-fields))))))))

(add-to-list 'p-search-candidate-mappings p-search-mapping-denote)

So you basically provide a function taking two arguments: the mapping's args (blank in this case) and the document to be mapped. You then return a list of documents, allowing for 1-to-many mappings. Mappings have to specify the properties they expect to exist (:required-property-list '(file-name), for example), and you can read such a property from the document with p-search-document-property. Only documents with this property will be mapped; the rest will be filtered out. You can then call p-search-document-extend to create a new document with the additional fields. Fields can be given additional properties; for example, here we're setting denote-identifier's weight to 10 (i.e. an occurrence in this field counts ten times as much).

Definitely let me know if there's anything awkward about this API or something could be improved.

You can see the result here that a text query with the field specified will only search said field.

[Screenshot 2024-12-20 at 12:58:50 AM]

@swflint
Contributor Author

swflint commented Dec 19, 2024

This seems like a really clean interface! That said, I do have a couple questions.

Given the example prior that you show, if I specify both the title and denote-title fields, is that taken as a disjunction (so it matches if either fits) or a conjunction (both must match)?

Also, for these sorts of matchers, how do you plan on supporting them? Because not everyone who uses p-search necessarily uses denote, this seems like it might be something that should be in a separate file. If so, would you be interested in me working with it a bit & contributing it once the mappings are merged?

@swflint
Contributor Author

swflint commented Dec 19, 2024

As a note, when I use a slightly edited version of your code, I am currently getting the following error:

  error("Not a transient prefix: %s" p-search-transient-dispatcher)
  transient-args(p-search-transient-dispatcher)
  (let* ((args (transient-args 'p-search-transient-dispatcher))) (p-search--add-mapping mapping args) (p-search-restart-calculation))
  p-search-transient-mapping-create(#s(p-search-candidate-mapping :name "Denote" :required-property-list (file-name) :input-spec nil :options-spec nil :function p-search-denote-mapper))
  (if (and (not input-specs) (not option-specs)) (p-search-transient-mapping-create mapping) (apply #'p-search-dispatch-transient (list (apply #'vector (cons "Input" (seq-map #'(lambda ... ...) input-specs))) (apply #'vector (cons "Options" (seq-map #'(lambda ... ...) option-specs))) (vector "Actions" (list "c" "create" (list 'lambda nil '(interactive) (list 'p-search-transient-mapping-create mapping)))))))
  (let* ((input-specs (progn (or (progn (and (memq ... cl-struct-p-search-candidate-mapping-tags) t)) (signal 'wrong-type-argument (list 'p-search-candidate-mapping mapping))) (aref mapping 3))) (option-specs (progn (or (progn (and (memq ... cl-struct-p-search-candidate-mapping-tags) t)) (signal 'wrong-type-argument (list 'p-search-candidate-mapping mapping))) (aref mapping 4)))) (if (and (not input-specs) (not option-specs)) (p-search-transient-mapping-create mapping) (apply #'p-search-dispatch-transient (list (apply #'vector (cons "Input" (seq-map #'... input-specs))) (apply #'vector (cons "Options" (seq-map #'... option-specs))) (vector "Actions" (list "c" "create" (list 'lambda nil '... (list ... mapping))))))))
  p-search-dispatch-add-mapping(#s(p-search-candidate-mapping :name "Denote" :required-property-list (file-name) :input-spec nil :options-spec nil :function p-search-denote-mapper))
  (let* ((available-mappings (seq-filter #'(lambda (mapping) (p-search-candidate-with-properties-exists-p (progn ... ...))) p-search-candidate-mappings)) (selections (seq-map #'(lambda (m) (cons (progn ... ...) m)) available-mappings)) (selection (completing-read "Mapping: " selections nil t)) (selected-mapping (alist-get selection selections nil nil #'equal))) (p-search-dispatch-add-mapping selected-mapping))
  p-search-add-mapping()
  funcall-interactively(p-search-add-mapping)
  command-execute(p-search-add-mapping)

When running M Denote RET.

@swflint
Contributor Author

swflint commented Dec 20, 2024

Also, it's unclear if you can stack mappings, or what happens if a document can't be mapped.

@swflint
Contributor Author

swflint commented Dec 20, 2024

I want to clarify my previous comment to ask a few specific questions.

  1. What are the semantics if p-search-candidate-mapping-function returns nil on a given document? Is the document dropped, or kept in its original form (I would prefer this, it seems more flexible)?
  2. Do mappings compose? If so, what is the order of composition? Is there a way to specify this more explicitly (e.g., order of mapping editing) or provide p-search with information about what fields a mapping may add to resolve mapping order?
  3. How should field weights be assigned? What is the scale of weights?
  4. Are there plans for field types other than text and category? For example, a numeric field would be particularly helpful as I have a draft of a mapping to use denote-explore to add denote-graph metadata.

@zkry
Owner

zkry commented Dec 20, 2024

> Given the example prior that you show, if I specify both the title and denote-title fields, is that taken as a disjunction (so it matches if either fits) or a conjunction (both must match)?

So the way it's currently set up is that if you don't specify fields, it will search all the fields and give any extra specified weight to each field. If you do specify fields (multiple select is WIP), it will only search in those fields. That said, another query feature whose details I'm still working out is being able to specify fields in the query string, so for example, foo:denote-title would search for foo in the "denote-title" field. This would in turn allow queries like "foo:denote-title AND bar:denote-type" to be treated as a conjunction.

> Also, it's unclear if you can stack mappings, or what happens if a document can't be mapped.
> What are the semantics if p-search-candidate-mapping-function returns nil on a given document? Is the document dropped, or kept in its original form (I would prefer this, it seems more flexible)?

Currently nil means drop, but it doesn't have to be this way. In fact I was also thinking of having it not drop, or even better to have this as an argument that can be passed to the mapping (thus allowing filters from the mapping infrastructure)

> Do mappings compose? If so, what is the order of composition? Is there a way to specify this more explicitly (e.g., order of mapping editing) or provide p-search with information about what fields a mapping may add to resolve mapping order?

Mappings do indeed compose. The mappings run in order, each adding any additional fields or creating as many sub-documents as it needs. So for example, one mapping I want to create in the future is one that splits the document up every certain number of lines. So in theory you could first run it through the denote mapping, giving the candidate the denote fields, then you could run it through the file-splitting mapping, keeping the denote fields, but creating several documents from each one.
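
The ordering described above can be sketched with plain functions. This is a toy model rather than p-search's implementation: documents are alists, each mapping takes one document and returns a list of documents, and all `my/` names are made up for the example.

```elisp
(defun my/run-mappings (mappings documents)
  "Run MAPPINGS (a list of functions) in order over DOCUMENTS.
Each mapping takes one document and returns a list of documents,
so a later mapping sees everything an earlier one produced."
  (dolist (mapping mappings documents)
    (setq documents (apply #'append (mapcar mapping documents)))))

(defun my/mark-denote (doc)
  "Toy mapping: add a `denote-title' field and return one document."
  (list (cons '(denote-title . "my-note") doc)))

(defun my/split-lines (doc)
  "Toy mapping: return one document per line of DOC's `content'.
Every other field of DOC is carried over to each new document."
  (mapcar (lambda (line)
            (cons (cons 'content line)
                  (assq-delete-all 'content (copy-alist doc))))
          (split-string (alist-get 'content doc) "\n" t)))
```

Running `(my/run-mappings (list #'my/mark-denote #'my/split-lines) docs)` first adds the denote field and then splits, so each resulting piece still carries `denote-title`.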

Improving the UI on this is definitely something I want to do as you can set up pretty complicated things.

> How should field weights be assigned? What is the scale of weights?

The current implementation makes it so that a term in a field with a weight of N is counted as if it occurred N times. It's kind of hard to put a number on how much higher its score will be, but you can think of it like this: if a title had a weight of 10 and a document had term X in its title, another document would need to have ten occurrences of X in its body to be given the same score.
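
The counting rule can be illustrated in a few lines. This sketch covers only the "weight N counts as N occurrences" part, not the actual scoring formula p-search uses; the function names and the simple tokenizer are assumptions for the example.

```elisp
(defun my/count-term (term text)
  "Count case-insensitive whole-token occurrences of TERM in TEXT."
  (let ((count 0)
        (needle (downcase term)))
    (dolist (token (split-string (downcase text) "[^[:alnum:]]+" t) count)
      (when (string= token needle)
        (setq count (1+ count))))))

(defun my/weighted-term-frequency (term fields weights)
  "Sum occurrences of TERM over FIELDS, scaling each field by its weight.
FIELDS is an alist of (NAME . TEXT) and WEIGHTS an alist of (NAME . N).
Fields without an entry in WEIGHTS count with weight 1."
  (let ((total 0))
    (dolist (field fields total)
      (setq total (+ total
                     (* (alist-get (car field) weights 1)
                        (my/count-term term (cdr field))))))))
```

With a title weight of 10, a document whose title contains the term once gets the same weighted count as a document whose body contains it ten times, matching the example above.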

> Are there plans for field types other than text and category? For example, a numeric field would be particularly helpful as I have a draft of a mapping to use denote-explore to add denote-graph metadata.

Definitely! Also "time" was one that feels like it could come up a lot. I'm still trying to figure out the queries to make available for number fields.

And thanks for that PR btw! I merged it into #43 and after certain details are figured out and it's all tested, I'll merge it in.

@swflint
Contributor Author

swflint commented Dec 20, 2024

> So the way it's currently set up is that if you don't specify fields, it will search all the fields and give any extra specified weight to each field. If you do specify fields (multiple select is WIP), it will only search in those fields. That said, another query feature whose details I'm still working out is being able to specify fields in the query string, so for example, foo:denote-title would search for foo in the "denote-title" field. This would in turn allow queries like "foo:denote-title AND bar:denote-type" to be treated as a conjunction.

Okay, that makes a lot of sense, and I like the idea of a special query syntax. However, I think maybe a query syntax of denote-title:foo would make slightly more sense (given how that sort of thing works in other search tools).
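
For illustration, parsing the field-first form could look like the sketch below. The function name and the return shape are assumptions for the example; nothing here reflects how p-search actually parses queries.

```elisp
(defun my/parse-query-token (token)
  "Split TOKEN of the form FIELD:TERM into (FIELD . TERM).
A token without a colon yields (nil . TOKEN), which a caller could
interpret as \"search every field\"."
  (if (string-match "\\`\\([^:]+\\):\\(.+\\)\\'" token)
      (cons (intern (match-string 1 token)) (match-string 2 token))
    (cons nil token)))
```

So `(my/parse-query-token "denote-title:foo")` yields the pair of the symbol `denote-title` and the string "foo", while a bare "foo" keeps its field as nil.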

> Currently nil means drop, but it doesn't have to be this way. In fact I was also thinking of having it not drop, or even better to have this as an argument that can be passed to the mapping (thus allowing filters from the mapping infrastructure)

I'm not sure that mappings should remove documents, in general. I can get behind splitting documents (maybe), but removing them seems counterintuitive.

As I was writing the mappings that you merged, this was confusing, and my thought was that I should only return changes, with nil signalling no change. Since that is not currently the case, I'll make another PR to make them so they don't drop items. A specific, separate value to remove the document seems reasonable.

> Mappings do indeed compose. The mappings run in order, each adding any additional fields or creating as many sub-documents as it needs. So for example, one mapping I want to create in the future is one that splits the document up every certain number of lines. So in theory you could first run it through the denote mapping, giving the candidate the denote fields, then you could run it through the file-splitting mapping, keeping the denote fields, but creating several documents from each one.
>
> Improving the UI on this is definitely something I want to do as you can set up pretty complicated things.

I think one of the major modes of composition I'd like to see is (perhaps) allowing multiple values for a given field, and letting new mappings add new types of fields as necessary. Consider, for example, title, in the case of Denote: I could store not only the sluggified title, but the document's regular title. If I call them both title, then my queries could operate on both, without necessarily having to think about which source the metadata came from (which seems helpful to me, I have some PDFs that are managed by denote, and have denote-derived titles, as well as pdf title, and I'd like to search by both as the title).

> The current implementation makes it so that a term in a field with a weight of N is counted as if it occurred N times. It's kind of hard to put a number on how much higher its score will be, but you can think of it like this: if a title had a weight of 10 and a document had term X in its title, another document would need to have ten occurrences of X in its body to be given the same score.

I'm not sure that I as a developer should be setting the weight or priority of different fields, but it does seem reasonable that I can set a default. Perhaps providing a UI to modify this would be helpful (it may be something that's best done as part of creating the prior using it?)

> Definitely! Also "time" was one that feels like it could come up a lot. I'm still trying to figure out the queries to make available for number fields.

Great.

Some thoughts for number-related queries:

  • Score transform ($y = \frac{mx + b}{s}$, where $m$ and $b$ are values provided by the user, $s$ is inferred from the max or similar)
  • number is less than (scaled for distance?)
  • number is greater than (scaled for distance?)
  • number is near?

With similar queries for date/times?
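
A minimal sketch of two of these, assuming plain numeric field values. The function names and the particular "near" formula are choices made for this example only, not a proposal for how p-search should score them.

```elisp
(defun my/linear-score (x m b s)
  "Return (M * X + B) / S as a float.
M and B are user-provided; S is a normalizer, e.g. the maximum value."
  (/ (+ (* m x) b) (float s)))

(defun my/near-score (x target scale)
  "Return a score in (0, 1] that peaks at 1.0 when X equals TARGET.
SCALE sets how quickly the score falls off with distance."
  (/ 1.0 (+ 1.0 (/ (abs (- x target)) (float scale)))))
```

For example, `(my/linear-score 5 2 0 10)` evaluates to 1.0 and `(my/near-score 5 3 2)` to 0.5. "Less than" and "greater than" variants scaled for distance could be built the same way by only penalizing values on one side of the threshold.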

> And thanks for that PR btw! I merged it into #43 and after certain details are figured out and it's all tested, I'll merge it in.

I'll be submitting another one shortly, to address some misunderstandings that I had.

@swflint
Contributor Author

swflint commented Dec 20, 2024

> As I was writing the mappings that you merged, this was confusing, and my thought was that I should only return changes, with nil signalling no change.
> Since that is not currently the case, I'll make another PR to make them so they don't drop items.
> A specific, separate value to remove the document seems reasonable.

As I have been thinking about this a bit more, I think the best way to approach this is:

  • If it is possible to modify the document, return one or more modified documents.
  • If the document should be removed, this should be signaled explicitly, with some non-nil sentinel (e.g., :remove)
  • If it is not possible to modify the document, simply return nil

This makes it so that, as a developer of mappings, I only have to concentrate on two cases:

  1. Modification of the document
  2. Deletion of the document

This would simplify the code that I have in the denote and pdf info mappers, and would make it quite a bit easier to add future mappers (such as a denote graph mapper, or for users of org-roam, org-roam-related mappers).
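
The three return conventions proposed above can be sketched as a small dispatcher. This is only an illustration of the proposed semantics, with documents modeled as alists; `my/apply-mapping` and `my/tag-org-files` are hypothetical names, not p-search functions.

```elisp
(defun my/apply-mapping (mapping-fn document)
  "Apply MAPPING-FN to DOCUMENT and return the resulting list of documents.
The result of MAPPING-FN is interpreted as follows:
- a list of documents replaces DOCUMENT (1-to-many is allowed);
- the keyword `:remove' drops DOCUMENT;
- nil means the mapping does not apply, so DOCUMENT is kept as is."
  (let ((result (funcall mapping-fn document)))
    (cond ((eq result :remove) nil)
          ((null result) (list document))
          (t result))))

(defun my/tag-org-files (document)
  "Toy mapping: add a `type' field to .org files and ignore everything else."
  (when (string-suffix-p ".org" (alist-get 'file-name document))
    (list (cons '(type . "org") document))))
```

With this convention a mapping author writes only the "modify" branch (and, rarely, the `:remove` branch); documents the mapping does not understand pass through untouched.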

@zkry
Owner

zkry commented Dec 22, 2024

> However, I think maybe a query syntax of denote-title:foo would make slightly more sense (given how that sort of thing works in other search tools).

Oh yeah, that makes more sense.

> I think one of the major modes of composition I'd like to see is (perhaps) allowing multiple values for a given field, and letting new mappings add new types of fields as necessary.

I was thinking about this too. Sometimes you'd even want to extract multiple values for the same field. Like for example, in an HTML or markdown file, you might want to extract the various headings, so in that case you could have multiple h1s.

> I'm not sure that I as a developer should be setting the weight or priority of different fields, but it does seem reasonable that I can set a default.

I agree these should definitely be easily customizable. Maybe something like (p-search-set-field 'denote-title :weight 3). I'd have to think of a way to have a nice defcustom interface.

> As I have been thinking about this a bit more, I think the best way to approach this is:
>
>   • If it is possible to modify the document, return one or more modified documents.
>   • If the document should be removed, this should be signaled explicitly, with some non-nil sentinel (e.g., :remove)
>   • If it is not possible to modify the document, simply return nil

I like this approach too. It looks like it simplifies mapping and filtering. I'll update the code to have these semantics. This way, I think we could get the p-search-denote-only-denote-p and p-search-pdfinfo-drop-non-pdfs-p properties by default.

@zkry
Owner

zkry commented Dec 22, 2024

I merged in the branch. There should now be support for multiple field values (fields can be a list of strings instead of a string). Also a returned nil has the meaning you mentioned, of just doing nothing. There is now a default mapping (-f) for changing the functionality of this.

I also moved the generator and mapping enhancements to a new subdirectory called "extensions" with a name prefix of "psx", analogous to how org babel extensions have the "ob-" prefix.

There may definitely be bugs so definitely let me know if something is off. Next up I'll work on tackling some of the other issues. Also documentation is a huge need so I'll be working on that too.

@swflint
Contributor Author

swflint commented Dec 22, 2024

> I merged in the branch. There should now be support for multiple field values (fields can be a list of strings instead of a string). Also a returned nil has the meaning you mentioned, of just doing nothing. There is now a default mapping (-f) for changing the functionality of this.

Is there an example of adding new values (not overwriting) to an existing field?

> I also moved the generator and mapping enhancements to a new subdirectory called "extensions" with a name prefix of "psx", analogous to how org babel extensions have the "ob-" prefix.

That makes sense!

> There may definitely be bugs so definitely let me know if something is off. Next up I'll work on tackling some of the other issues. Also documentation is a huge need so I'll be working on that too.

Right off the bat, it looks like extensions/psx-search-pdfinfo.el should be extensions/psx-pdfinfo.el.

Other than that, I'll be using it and let you know if I find any other bugs.

Thank you for all the work that you've put in to this!

@swflint
Contributor Author

swflint commented Dec 22, 2024

> > I'm not sure that I as a developer should be setting the weight or
> > priority of different fields, but it does seem reasonable that I can
> > set a default.
>
> I agree these should definitely be easily customizable. Maybe
> something like (p-search-set-field 'denote-title :weight 3). I'd
> have to think of a way to have a nice defcustom interface.

I could have been clearer here, sorry.

As a developer, if I define a field, it seems reasonable that I can provide a default priority (ideally, yes, through customize).
However, as a user, it would be helpful (I think) to be able to selectively override priorities on a per-session basis.
This could be as simple as a new type of prior, or a separate section (I am not sure which would be best).

@swflint
Contributor Author

swflint commented Dec 22, 2024

> I merged in the branch. There should now be support for multiple field
> values (fields can be a list of strings instead of a string). Also a
> returned nil has the meaning you mentioned, of just doing
> nothing. There is now a default mapping (-f) for changing the
> functionality of this.

If this is going to be the approach, there are a couple of things that need to be clarified from a developer's point of view.

  • Is there a way I can add to an existing field? If so, what is the way to do so? (This seems like something relevant to the two mappings I worked on, the candidate generator will provide an initial title, so will Denote, I would like to have access to both)
  • Are there fields that p-search should pre-define, and have as un-prefixed? My initial thought is that the following fields would be common enough:
    • author
    • title
    • keywords or tags
    • creation-date
    • modification-date

@zkry
Owner

zkry commented Dec 22, 2024

> Is there an example of adding new values (not overwriting) to an existing field?

I don't have an example of this yet, but essentially the "new-fields" argument to (p-search-document-extend document &optional new-id new-fields new-props) will add the new fields, not overwrite them. So one mapping can call (p-search-document-extend a b '((h1 . ("my title")))) and a second one can call (p-search-document-extend c d '((h1 . ("my other title")))) and the final resulting document will have ("my title" "my other title") as the value for h1.

(p-search-document-extend a b '((h1 . "my title")))
(p-search-document-extend c d '((h1 . "my other title")))

will also produce the same result.
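
The accumulating behaviour described above can be modeled with a small helper. This is a sketch of the semantics only, not the code behind p-search-document-extend; `my/merge-fields` and `my/as-list` are hypothetical names, and fields are modeled as an alist.

```elisp
(defun my/as-list (value)
  "Return VALUE itself if it is a list, otherwise wrap it in a list."
  (if (listp value) value (list value)))

(defun my/merge-fields (fields new-fields)
  "Return FIELDS with NEW-FIELDS added; values for the same key accumulate.
Both are alists of (NAME . VALUE), where VALUE is a string or a list
of strings.  FIELDS itself is not modified."
  (let ((result (copy-alist fields)))
    (dolist (entry new-fields result)
      (let ((old (my/as-list (alist-get (car entry) result)))
            (new (my/as-list (cdr entry))))
        (setf (alist-get (car entry) result) (append old new))))))
```

Merging `((h1 . ("my title")))` and then `((h1 . "my other title"))` leaves h1 with both values, whether each value was passed as a string or as a one-element list.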

> Right off the bat, it looks like extensions/psx-search-pdfinfo.el should be extensions/psx-pdfinfo.el.

oh, good catch! Updating this right now.

> Thank you for all the work that you've put in to this!

And thank you for collaborating with me on this! Having someone to go back and forth with makes a great difference.


> As a developer, if I define a field, it seems reasonable that I can provide a default priority (ideally, yes, through customize).
> However, as a user, it would be helpful (I think) to be able to selectively override priorities on a per-session basis.
> This could be as simple as a new type of prior, or a separate section (I am not sure which would be best).

Oh, I see what you mean. So the first thought that comes to mind is to have an additional option on the query prior (probably the only one that will be dealing with these weights) which, when selected, lets you input override weights. I think the key to this will be creating a brand new transient input to make setting the weights effortless. Another idea is to have a per-session customization interface, like the typical Emacs defcustom, but one that only applies to the current session. I'll need to think about this a little bit more.

Is there a way I can add to an existing field? If so, what is the way to do so? (This seems like something relevant to the two mappings I worked on: the candidate generator will provide an initial title, and so will Denote, and I would like to have access to both.)

So as alluded to above, if you just want to add to an existing field, you should be able to call p-search-document-extend with the key/value you want to add. Currently, mappings can only add to fields, not modify or remove them. Document properties, on the other hand, are not meant to be used for search, but rather to define what operations and features (e.g. mappings, priors) are available to the candidate. A property of git-root informs the system that it can apply the git-related priors. A property of buffer informs the system that buffer-related priors are applicable. As with the denote extension, a property of file-name indicates that the candidate can be mapped with the denote mapping. file-name, git-root, and buffer are not meant to be searchable via queries, however. Unlike fields, a candidate's properties can be replaced, by passing them as the new-props argument.

CANDIDATE-GENERATOR
 |  |  |  |  |  | 
 |  |  |  |  |  |
 v  v  v  |  v  v
          |
          |
MAPPING_1(x) --> (PROPS, FIELDS)
                   |       |
          +--------+       |
          |                v
MAPPING_2(x) --> (PROPS, FIELDS)
                   |       |
          +--------+       |
          |                v
MAPPING_3(x) --> (PROPS, FIELDS)
                    |      |
                    |      v
         CANDIDATE( v   ,  v )
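
The flow in the diagram can be sketched in plain Emacs Lisp: each mapping receives the current properties and returns a (PROPS . FIELDS) pair, where the returned properties replace the previous ones and the returned fields accumulate. This is a model under assumptions, not p-search's implementation: my/apply-mappings and both example mappings are hypothetical, and properties and fields are assumed to be alists.

```elisp
(defun my/apply-mappings (props mappings)
  "Run MAPPINGS in order, starting from the alist PROPS.
Each mapping takes a props alist and returns (NEW-PROPS . NEW-FIELDS),
where NEW-FIELDS is an alist of (NAME . LIST-OF-VALUES).
Return (FINAL-PROPS . ALL-FIELDS).  Hypothetical model only."
  (let ((fields nil))
    (dolist (mapping mappings)
      (let ((out (funcall mapping props)))
        ;; Properties are replaced by whatever the mapping returns.
        (setq props (car out))
        ;; Fields are only ever added to.
        (dolist (field (cdr out))
          (setf (alist-get (car field) fields)
                (append (alist-get (car field) fields) (cdr field))))))
    (cons props fields)))

(defun my/filename-mapping (props)
  "Hypothetical mapping: add a title field from the file-name property."
  (cons props
        (list (list 'title (file-name-base (alist-get 'file-name props))))))

(defun my/denote-like-mapping (props)
  "Hypothetical mapping: add the text between \"--\" and \"__\" as a title."
  (let ((name (alist-get 'file-name props)))
    (cons props
          (when (string-match "--\\(.*?\\)__" name)
            (list (list 'title (match-string 1 name)))))))

(my/apply-mappings
 '((file-name . "20220610T162327--my-note-1__notetaking_philosophy.txt"))
 (list #'my/filename-mapping #'my/denote-like-mapping))
;; => (((file-name . "20220610T162327--my-note-1__notetaking_philosophy.txt"))
;;     (title "20220610T162327--my-note-1__notetaking_philosophy" "my-note-1"))
```

Both example mappings pass the properties through unchanged; a mapping that wanted to replace them would simply return a different alist as the car of its result.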

Are there fields that p-search should pre-define, and have as un-prefixed? My initial thought is that the following fields would be common enough:

So title is treated specially in p-search. It is a property, not a field, and it indicates how the candidate should be named in the system. This can be accessed via (p-search-document-property doc 'title), and every document should have a title and any mapping can change it (via p-search-document-extend). I am currently working on having the title searchable just like fields. title being a property, there can't be more than one of it (though as a field there could be as many titles as needed). I could change the property to name and have it only for the document's name, and then let title be a field. I'll have to think more about what to do for this specific case, and definitely let me know if you have any ideas.

As for the other items, I think there could be like a standard definition of common fields. Doing this wouldn't change the working of p-search, but rather the way different extensions coordinate on defining properties. We'd definitely need to think through things. Like let's say we have modification-date and it gets populated by the filesystem's modified time, then git comes and adds the last committed time. And then maybe there's some metadata contained in the file itself. If a user wanted to make a query like "modified close to this time," would this work as desired? The alternative would be having many different numerical fields like file-modified, git-modified, denote-modified, etc., and the question would be how these would all be queried. Maybe giving fields semantic metadata could work, like (p-search-def-field 'pdf-title 'text :weight 3 :class 'title), in order to group them. Definitely let me know your thoughts on this too.
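
One way to picture the grouping idea is a small registry that maps each source-specific field to a semantic class, so that a query against the class can gather every matching field. This is a plain Emacs Lisp sketch under assumptions: my/field-classes and my/fields-of-class are hypothetical names, not part of p-search, and the timestamps are made-up illustrative values.

```elisp
(defvar my/field-classes
  '((file-modified   . modification-date)
    (git-modified    . modification-date)
    (denote-modified . modification-date)
    (pdf-title       . title))
  "Hypothetical registry mapping a field name to its semantic class.")

(defun my/fields-of-class (class fields)
  "Return the (FIELD . VALUE) pairs of FIELDS whose field belongs to CLASS.
FIELDS is an alist; class membership is looked up in `my/field-classes'."
  (let (matches)
    (dolist (entry fields)
      (when (eq (alist-get (car entry) my/field-classes) class)
        (push entry matches)))
    ;; `push' built the list backwards; restore the original order.
    (nreverse matches)))

;; Gather every modification date of a candidate, whatever its source:
(my/fields-of-class 'modification-date
                    '((file-modified . 1735689600)
                      (pdf-title . "Notes")
                      (git-modified . 1735603200)))
;; => ((file-modified . 1735689600) (git-modified . 1735603200))
```

Because each result keeps its original field name, a query over the class can still tell which source a value came from.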

Thanks again for your help working these things out!

@swflint
Contributor Author

swflint commented Jan 1, 2025

So title is treated specially in p-search. It is a property, not a
field, and it indicates how the candidate should be named in the
system. This can be accessed via (p-search-document-property doc 'title),
and every document should have a title and any mapping can
change it (via p-search-document-extend). I am currently working on
having the title searchable just like fields. title being a
property, there can't be more than one of it (though as a field there
could be as many titles as needed). I could change the property to
name and have it only for the document's name, and then let title be a
field. I'll have to think more about what to do for this specific case,
and definitely let me know if you have any ideas.

I think name and identifier both seem like reasonable options (a variation of the latter would make the most sense, imho).

As for the other items, I think there could be like a standard
definition of common fields. Doing this wouldn't change the working
of p-search, but rather the way different extensions coordinate on
defining properties. We'd definitely need to think through
things. Like let's say we have modification-date and it gets
populated by the filesystem's modified time, then git comes and adds
the last committed time. And then maybe there's some metadata
contained in the file itself. If a user wanted to make a query like
"modified close to this time," would this work as desired? The
alternative would be having many different numerical fields like
file-modified, git-modified, denote-modified, etc., and the
question would be how these would all be queried. Maybe giving fields
semantic metadata could work, like
(p-search-def-field 'pdf-title 'text :weight 3 :class 'title),
in order to group them. Definitely let me know your thoughts on this too.

Hmm...
This is the tough one.
I do think, for example, that it is reasonable to have multiple versions of the modification date using the same name (and likely, tbh, for other sorts of information).
Following the modification date example, I may have touched a file more recently than it was touched "in git", but I do not necessarily need to think about that when I set priors.
It could be interesting to provide information about the source as meta-metadata, however.

To the earlier part of this, yes, standard definitions of common fields really do seem necessary.
Like I noted previously, I think author, title, keywords (or tags, including a text-formatted version), and creation/modification date information are the minimum.
Taking a note from the University of Central Florida libraries, language and file-type may be useful as standard fields as well.

@zkry
Owner

zkry commented Jan 2, 2025

Ok, let's go with that then. I'll update the code, changing the property title to be "name" and leaving "title" to be a field. Then, I'll define the fields "author", "title", "keywords", "creation-date", "modification-date", "language", and "file-type" in p-search.el.

@zkry
Owner

zkry commented Jan 6, 2025

So the field refactoring, which adds the default fields, should be done in PR #65.

This also adds a new mechanism to query by category, so I changed up psx-denote a bit since querying by category is now possible.

If you still want to be able to query categorical fields like text, I created issue #66 for being able to query categories as text.

@swflint
Contributor Author

swflint commented Jan 6, 2025

The formatted text version was just a stop-gap until category queries were supported anyway. I'll take a look at #65, and on merge, I think this issue can be closed.

@zkry
Owner

zkry commented Jan 7, 2025

Ok, it should be merged in now. Definitely let me know if anything's not working as expected.

@swflint
Contributor Author

swflint commented Jan 7, 2025

It works wonderfully so far! Thank you!

@swflint swflint closed this as completed Jan 7, 2025
@zkry
Owner

zkry commented Jan 9, 2025

And thank you for your ideas and for working with me through this!
