Integrate AppMap data with the context search algorithm. #2109

kgilpin · 2024-11-06T20:35:47Z

Objective

Improve the relevance of code snippet search results in the presence of AppMap data.

The current algorithm selects code snippets in two phases:

1. Find the best-matching AppMap data files, and add code snippets that are referred to by events in
   those AppMap files. Select a number of events such that the character count of the collected
   snippets matches 3/4 of a threshold.
2. Perform a second code search, unboosted by AppMap data, to fill in the remaining 1/4 of the
   threshold.
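
The budget split in the two phases above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: `Snippet`, `fillBudget`, and `collectContext` are hypothetical names, both candidate lists are assumed to be sorted by relevance already, and the real algorithm varies the number of events rather than walking a flat list.

```typescript
// Hypothetical types and names, for illustration only.
interface Snippet {
  location: string; // e.g. "app/models/user.rb:10"
  content: string;
}

// Take snippets in order until the next one would exceed the character budget.
function fillBudget(candidates: Snippet[], charBudget: number): Snippet[] {
  const selected: Snippet[] = [];
  let used = 0;
  for (const snippet of candidates) {
    if (used + snippet.content.length > charBudget) break;
    selected.push(snippet);
    used += snippet.content.length;
  }
  return selected;
}

function collectContext(
  appmapSnippets: Snippet[],
  searchSnippets: Snippet[],
  charLimit: number
): Snippet[] {
  // Phase 1: snippets referenced by AppMap events get a fixed 3/4 of the budget.
  const appmapBudget = Math.floor(charLimit * 0.75);
  const fromAppMaps = fillBudget(appmapSnippets, appmapBudget);

  // Phase 2: an unboosted code search fills the remaining 1/4, skipping duplicates.
  const seen = new Set(fromAppMaps.map((s) => s.location));
  const remaining = searchSnippets.filter((s) => !seen.has(s.location));
  const fromSearch = fillBudget(remaining, charLimit - appmapBudget);

  return [...fromAppMaps, ...fromSearch];
}
```

Note that the 3/4 share is reserved for AppMap-derived snippets regardless of how relevant they are.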

sequenceDiagram
    participant S as SearchContextCollector
    participant F as FileIndex
    participant SI as SnippetIndex
    participant E as EventCollector
    participant A as AppMapIndex

    S->>A: search AppMaps with vectorTerms
    activate A
    A-->>S: AppMapSearchResponse
    deactivate A

    S->>F: buildIndex - files
    activate F
    F-->>S: fileIndex
    deactivate F

    S->>F: search fileIndex with vectorTerms
    activate F
    F-->>S: FileSearchResult[]
    deactivate F

    S->>SI: buildIndex - snippets
    activate SI
    SI-->>S: snippetIndex
    deactivate SI

    loop collect context with varying events
        S->>E: collectEvents
        activate E
        E-->>S: contextCandidate
        deactivate E

        S->>SI: collectSnippets
        activate SI
        SI-->>S: sourceContext
        deactivate SI

        S->>S: applyContext
        activate S
        S-->>S: appliedContext
        deactivate S
    end

    S-->>S: return searchResponse and context

Some problems with this approach include:

- The fixed 3/4 allocation of context that comes directly from AppMaps. If AppMap data is
  available and is only minimally relevant to the user's question, 3/4 of the search context will
  still be populated by snippets referenced in that AppMap data.
- The search algorithm used to select the events from the matching AppMap data does not index the
  full code of the functions referred to by the events; it only matches on certain keywords that
  are present in the AppMap data index (such as function names, parameter names, etc).

Logically, if a user records AppMap data, then the functions that are referenced by that AppMap data
are likely to be more relevant to any user question than functions that are not referenced. Keyword
search (BM25) is still a factor, but a keyword search match that is referenced by AppMap data should
be considered more relevant than a keyword search match that is not referenced.

This technique can be extended to other types of references, such as stack traces and errors, in
which non-code files that are relevant to runtime execution contain embedded references to code
object names and file locations.
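
As an illustration of how such embedded references might be pulled out of a non-code file, here is a minimal sketch. It assumes references take the form `path/to/file.ext:line`, which is only one of many formats that stack traces use; `extractLocations` is a hypothetical helper, not an existing function.

```typescript
// Hypothetical helper: find "path/to/file.ext:LINE" references in free text.
// Assumed format: a path ending in an extension, then a colon and a line number.
function extractLocations(text: string): { path: string; line: number }[] {
  const pattern = /([\w./-]+\.\w+):(\d+)/g;
  const locations: { path: string; line: number }[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    locations.push({ path: match[1], line: Number(match[2]) });
  }
  return locations;
}
```

The extracted locations could then be mapped to snippet identifiers and boosted in the same way as AppMap references.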

Overview

The first phase of searching is to select relevant content that will be used for boosting search
results.

The second phase of searching is to index all the possible snippets that may match the user's
question, and then apply boost factors from the results obtained in phase one.

The file index and the snippet index are similar. Both contain an identifier, directory, file path,
tokens, and words. File path, tokens, and words are indexed. A snippet entry also includes the range
within the file from which the snippet was obtained.
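
The entry shapes described above might look something like the following. The field names, the line-based range, and the identifier format are assumptions made for illustration; they are not confirmed type definitions.

```typescript
// Hypothetical entry shapes; names and types are assumptions.
interface FileIndexEntry {
  id: string;
  directory: string;
  filePath: string; // indexed
  tokens: string[]; // indexed
  words: string[];  // indexed
}

// A snippet entry adds the range within the file that the snippet covers.
interface SnippetIndexEntry extends FileIndexEntry {
  range: { startLine: number; endLine: number };
}

// Hypothetical identifier format: "<file path>:<start>-<end>".
function snippetId(filePath: string, startLine: number, endLine: number): string {
  return `${filePath}:${startLine}-${endLine}`;
}

const exampleSnippet: SnippetIndexEntry = {
  id: snippetId("app/models/user.rb", 10, 24),
  directory: "/home/user/project",
  filePath: "app/models/user.rb",
  tokens: ["find_by_email"],
  words: ["find", "by", "email"],
  range: { startLine: 10, endLine: 24 },
};
```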

Snippets can be boosted by applying boost factors to specific identifiers. Applying a boost factor
makes a snippet more likely to be chosen, although the snippet must still be a BM25 match. Boosts
are applied when some relevant external data, such as a trace or stack trace, refers to a snippet.
In those cases, the reference in the external data is a strong indication that the snippet is
likely to be relevant to the search.
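
One way to picture this is the sketch below. It assumes the boost is a multiplier on the BM25 score, which the description does not specify; `ScoredSnippet` and `applyBoosts` are hypothetical names.

```typescript
// Hypothetical names; bm25Score is assumed to come from a prior BM25 search.
interface ScoredSnippet {
  snippetId: string;
  bm25Score: number;
}

// Assumption: a boost multiplies the BM25 score. Unboosted snippets keep a factor of 1.
function applyBoosts(
  results: ScoredSnippet[],
  boosts: Map<string, number>
): { snippetId: string; score: number }[] {
  return results
    .filter((r) => r.bm25Score > 0) // a boost alone never selects a snippet
    .map((r) => ({
      snippetId: r.snippetId,
      score: r.bm25Score * (boosts.get(r.snippetId) ?? 1),
    }))
    .sort((a, b) => b.score - a.score);
}
```

With a multiplicative boost, a snippet that has no keyword match keeps a score of zero however strongly it is referenced, which matches the requirement that a boosted snippet must also be a BM25 match.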

Task

These changes refer to the context search algorithm.

The first algorithm step is to choose the most relevant AppMap data files. This step should remain,
but it should be migrated to the new FileIndex implementation. AppMap data files should not be
indexed directly. Rather, there is an index directory for each AppMap data file, with the same name
as the AppMap file without the extension. Within this directory are files that contain keywords
which have been extracted from the AppMap data file. These keywords can be used to match the search
query.
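
The mapping from an AppMap data file to its index directory could be sketched like this. It assumes that the extension being dropped is `.appmap.json`; `indexDirectoryFor` is a hypothetical helper, and the names of the keyword files inside the directory are deliberately left out.

```typescript
// Assumption: AppMap data files end in ".appmap.json".
const APPMAP_EXTENSION = ".appmap.json";

// Hypothetical helper: the index directory has the same name as the
// AppMap file, minus the extension.
function indexDirectoryFor(appmapFile: string): string {
  if (!appmapFile.endsWith(APPMAP_EXTENSION)) {
    throw new Error(`Not an AppMap data file: ${appmapFile}`);
  }
  return appmapFile.slice(0, -APPMAP_EXTENSION.length);
}
```

The keyword files inside that directory, rather than the AppMap data file itself, would then be added to the FileIndex.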

The second step of the current algorithm is to select AppMap events and collect some number of these
into the output. This step will be removed. Instead of selecting AppMap events in one step, and
collecting relevant source code snippets in a second step, these steps will be combined.

Snippets are added to a SnippetIndex. AppMap data elements that are not code snippets, such as HTTP
client and server requests and SQL queries, will also be added to the SnippetIndex. Then those
snippets that are referenced by an AppMap data file selected in the first step will be boosted, as
described in the Overview.
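
A minimal sketch of how boosts might be derived from the AppMap data files selected in the first step. All names here are hypothetical, and the uniform boost factor of 2 is an arbitrary placeholder, not a proposed value.

```typescript
// Hypothetical shapes for illustration.
type SnippetKind = "code" | "sql-query" | "http-server-request" | "http-client-request";

interface IndexedSnippet {
  id: string;
  kind: SnippetKind; // non-code AppMap elements are indexed alongside code snippets
  content: string;
}

interface SelectedAppMap {
  name: string;
  referencedIds: string[]; // ids of snippets referenced by this AppMap's events
}

// Every snippet referenced by a selected AppMap receives the same boost factor.
function boostsFromAppMaps(appmaps: SelectedAppMap[], boostFactor = 2): Map<string, number> {
  const boosts = new Map<string, number>();
  for (const appmap of appmaps) {
    for (const id of appmap.referencedIds) {
      boosts.set(id, boostFactor);
    }
  }
  return boosts;
}
```

Snippets that no selected AppMap refers to get no entry in the map, so they compete on their BM25 score alone.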

kgilpin added the "enhancement" (New feature or request) label on Nov 6, 2024
