Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Use git attributes to identify binary files #2083

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dustinbyrne
Copy link
Contributor

@dustinbyrne dustinbyrne commented Oct 22, 2024

Fixes #2042

Add support for identifying binary files using git attributes in addition to file extensions.

  • Add getGitAttributes function to read and parse the .gitattributes file in packages/cli/src/fulltext/FileIndex.ts.
  • Update isBinaryFile function to check git attributes information in addition to file extensions in packages/cli/src/fulltext/FileIndex.ts.
  • Update filterFiles function to use the getGitAttributes function to determine if a file is binary based on git attributes in packages/cli/src/fulltext/FileIndex.ts.
  • Add tests to verify that the context search/lookup/collector correctly identifies binary files using git attributes information in packages/cli/tests/unit/fulltext/FileIndex.spec.ts.

For more details, open the Copilot Workspace session.

Fixes #2042

Add support for identifying binary files using git attributes in addition to file extensions.

* Add `getGitAttributes` function to read and parse the `.gitattributes` file in `packages/cli/src/fulltext/FileIndex.ts`.
* Update `isBinaryFile` function to check git attributes information in addition to file extensions in `packages/cli/src/fulltext/FileIndex.ts`.
* Update `filterFiles` function to use the `getGitAttributes` function to determine if a file is binary based on git attributes in `packages/cli/src/fulltext/FileIndex.ts`.
* Add tests to verify that the context search/lookup/collector correctly identifies binary files using git attributes information in `packages/cli/tests/unit/fulltext/FileIndex.spec.ts`.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/getappmap/appmap-js/issues/2042?shareId=XXXX-XXXX-XXXX-XXXX).
@dustinbyrne dustinbyrne force-pushed the dustinbyrne/fix/use-git-attributes branch from 3201148 to 190762d Compare October 22, 2024 21:28
@dustinbyrne dustinbyrne marked this pull request as ready for review October 23, 2024 17:44
};
const isBinaryFile = (fileName: string, gitAttributes?: ReturnType<typeof getGitAttributes>) =>
BINARY_FILE_EXTENSIONS.some((ext) => fileName.endsWith(ext)) ||
gitAttributes?.some((attr) => attr.binary && minimatch(fileName, attr.pattern, { dot: true }));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The { dot: true } option was missing from the demonstration this morning. This fixes the issue with dot files appearing with a **/* pattern.

binary: attributes.includes('binary'),
};
});
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is much simpler than actual gitattributes(5):

Each line in gitattributes file is of form:

pattern attr1 attr2 ...

That is, a pattern followed by an attributes list, separated by whitespaces. Leading and trailing whitespaces are ignored. Lines that begin with # are ignored. Patterns that begin with a double quote are quoted in C style. When the pattern matches the path in question, the attributes listed on the line are given to the path.

When deciding what attributes are assigned to a path, Git consults $GIT_DIR/info/attributes file (which has the highest precedence), .gitattributes file in the same directory as the path in question, and its parent directories up to the toplevel of the work tree (the further the directory that contains .gitattributes is from the path in question, the lower its precedence). Finally global and system-wide files are considered (they have the lowest precedence).

This might be fine for our purposes, but the simplification warrants a comment at least — and I think skipping comments is easy enough to just do it.

Comment on lines +311 to +313
if (!existsSync(gitAttributesPath)) return [];

const gitAttributesContent = readFileSync(gitAttributesPath, 'utf-8');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From node.js doc:

Using fs.exists() to check for the existence of a file before calling fs.open(), fs.readFile(), or fs.writeFile() is not recommended. Doing so introduces a race condition, since other processes may change the file's state between the two calls. Instead, user code should open/read/write the file directly and handle the error raised if the file does not exist.

(Essentially, exists returning true doesn't guarantee that read will succeed, so you have to handle the error anyway, so might as well skip the exists.)

return {
pattern,
binary: attributes.includes('binary'),
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only pulling binary like that, why even bother returning other patterns? Perhaps it should just return a list of binary patterns instead. This will also simplify upstream code — you can just add those to excluded patterns.

@dustinbyrne dustinbyrne marked this pull request as draft November 11, 2024 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use git attributes to identify binary files
2 participants