feat: Introduce custom identifier extraction mechanism #62

jpedroh · 2024-07-20T14:39:19Z

Until now, we used matching handlers to capture special node identifiers, which is not what they were designed for. This approach also required users to add Rust code to the project, limiting how general and extensible the tool could be.

This PR changes how node identifiers are extracted by moving the extraction process to the parsing step. This improves performance (as it runs only once per parse) and leverages Tree-sitter’s pattern matching query functionality.

Users can now provide a configuration with node types and a Tree-sitter query expression to extract identifiers. For example, in a Java class, a user can extract a field declaration identifier using the query (variable_declarator name: _ @field_name), which captures the field name.

However, Tree-sitter pattern matching can fall short in some cases. For instance, when trying to retrieve the identifier for a class with an inner class:

class A {
    class B {
    }
}

Using the query (class_declaration (identifier) @class_name) matches both classes A and B, resulting in [A, B] as the identifier, which is incorrect. Since Tree-sitter’s query language doesn’t support matching a single entry, this would have to be done in userland code, which would complicate the identifier extraction process.

To address this, this PR introduces the option to use a Regular Expression for identifier extraction. The regular expression runs on the node’s source code and captures only the first match. In this case, class [A-Za-z_][A-Za-z0-9_]* correctly matches the class name, and we can safely discard the match for class B (since only the first match is considered).
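To make the first-match behaviour concrete, here is a minimal, dependency-free sketch. It is not the code from this PR: to avoid pulling in a regex crate, it uses a hand-written scanner that stands in for the pattern class [A-Za-z_][A-Za-z0-9_]*, and the function names class_matches and first_class_match are hypothetical.

```rust
/// Characters allowed at the start of the name: [A-Za-z_].
fn is_name_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

/// Characters allowed in the rest of the name: [A-Za-z0-9_].
fn is_name_part(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Collects every non-overlapping match of `class [A-Za-z_][A-Za-z0-9_]*`
/// in `source`, scanning left to right (hand-rolled stand-in for a regex).
pub fn class_matches(source: &str) -> Vec<&str> {
    const KEYWORD: &str = "class ";
    let mut matches = Vec::new();
    let mut pos = 0;
    while let Some(offset) = source[pos..].find(KEYWORD) {
        let start = pos + offset;
        let name_start = start + KEYWORD.len();
        let rest = &source[name_start..];
        match rest.chars().next() {
            Some(c) if is_name_start(c) => {
                // The name ends at the first character outside [A-Za-z0-9_].
                let name_len = rest
                    .find(|c: char| !is_name_part(c))
                    .unwrap_or(rest.len());
                matches.push(&source[start..name_start + name_len]);
                pos = name_start + name_len;
            }
            // "class " not followed by a valid name: retry one byte later.
            _ => pos = start + 1,
        }
    }
    matches
}

/// Keeps only the first match, mirroring the behaviour described above.
pub fn first_class_match(source: &str) -> Option<&str> {
    class_matches(source).into_iter().next()
}
```

Run on the nested-class example, class_matches finds both "class A" and "class B", while first_class_match returns only "class A", so the inner class no longer leaks into the outer class's identifier.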

These changes simplify the introduction of new extractors and eliminate approximately 600 lines of Rust code (tests and source code) previously used for node identifier extraction.
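As a rough illustration of what such a configuration could look like, the sketch below maps node types to an extraction strategy. The type IdentifierExtractor, the function java_identifier_extractors, and the node type name field_declaration are assumptions made for this example; only the query and the regular expression come from the description above, and the actual types in the codebase may differ.

```rust
use std::collections::HashMap;

/// How the identifier of a node is extracted (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum IdentifierExtractor {
    /// A Tree-sitter query; the captured nodes form the identifier.
    TreeSitterQuery(&'static str),
    /// A regular expression run on the node's source code; first match only.
    Regex(&'static str),
}

/// Hypothetical Java configuration: node type -> extractor.
pub fn java_identifier_extractors() -> HashMap<&'static str, IdentifierExtractor> {
    let mut extractors = HashMap::new();
    // Field names are well served by a Tree-sitter query.
    extractors.insert(
        "field_declaration", // assumed node type name
        IdentifierExtractor::TreeSitterQuery("(variable_declarator name: _ @field_name)"),
    );
    // Class names use the regex so an inner class is not captured too.
    extractors.insert(
        "class_declaration",
        IdentifierExtractor::Regex(r"class [A-Za-z_][A-Za-z0-9_]*"),
    );
    extractors
}
```

With a layout like this, supporting a new node type would mean adding one more entry to the map rather than writing a new matching handler.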

coveralls · 2024-07-20T14:42:38Z

coverage: 80.247% (-2.0%) from 82.296%
when pulling 4003a53 on feat-introduce-identifier
into f6d66a5 on main.

jpedroh added 4 commits July 19, 2024 15:43

feat: Introduce identifier field on NonTerminal

f59d7f8

feat: initial iteration on tree sitter query

40a066e

feat: add regex matching as well

fb64495

feat: move identifier extraction to a trait

b359e4e

chore: remove unused code

c553bbf

refactor: simplify code

d6ada29

jpedroh force-pushed the feat-introduce-identifier branch from 76756d8 to d6ada29 Compare July 20, 2024 18:02

jpedroh added 8 commits July 20, 2024 15:05

refactor: rename

654c310

refactor: remove unused variable

9c9368f

chore: allow dead code

859e629

refactor: simplify conditions

3f9105a

fix: incorrect query for import declaration

425974d

performance: move regex initialization for better perf

c8a3640

perf: hoist query initialization

459d32a

refactor: condense on filter map

4003a53

jpedroh marked this pull request as ready for review July 21, 2024 18:04

jpedroh merged commit 2c6a135 into main Jul 21, 2024
8 checks passed

jpedroh deleted the feat-introduce-identifier branch July 21, 2024 18:05
