Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Use git attributes to identify binary files #2083

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 29 additions & 6 deletions packages/cli/src/fulltext/FileIndex.ts
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
import { stat } from 'node:fs/promises';
import path, { join } from 'node:path';
import { readFileSync } from 'node:fs';

import sqlite3 from 'better-sqlite3';
import assert from 'assert';
import makeDebug from 'debug';
import { existsSync } from 'fs';
import minimatch from 'minimatch';

import listProjectFiles from './listProjectFiles';
import queryKeywords from './queryKeywords';
Expand Down Expand Up @@ -80,11 +82,14 @@ export class FileIndex {
? await listGitProjectFiles(directory)
: await listProjectFiles(directory);

const gitAttributes = getGitAttributes(directory);

const filteredFileNames = await filterFiles(
directory,
fileNames,
excludePatterns,
includePatterns
includePatterns,
gitAttributes
);

const options = {
Expand Down Expand Up @@ -254,9 +259,9 @@ const DATA_FILE_EXTENSIONS: string[] = [
'xml',
].map((ext) => '.' + ext);

const isBinaryFile = (fileName: string) => {
return BINARY_FILE_EXTENSIONS.some((ext) => fileName.endsWith(ext));
};
const isBinaryFile = (fileName: string, gitAttributes?: ReturnType<typeof getGitAttributes>) =>
BINARY_FILE_EXTENSIONS.some((ext) => fileName.endsWith(ext)) ||
gitAttributes?.some((attr) => attr.binary && minimatch(fileName, attr.pattern, { dot: true }));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The { dot: true } option was missing from the demonstration this morning. This fixes the issue with dot files appearing with a **/* pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good fix to have. Is it in any other merged PRs?


const isDataFile = (fileName: string) => {
return DATA_FILE_EXTENSIONS.some((ext) => fileName.endsWith(ext));
Expand All @@ -266,11 +271,12 @@ export async function filterFiles(
directory: string,
fileNames: string[],
excludePatterns?: RegExp[],
includePatterns?: RegExp[]
includePatterns?: RegExp[],
gitAttributes?: ReturnType<typeof getGitAttributes>
): Promise<string[]> {
const result: string[] = [];
for (const fileName of fileNames) {
if (isBinaryFile(fileName)) continue;
if (isBinaryFile(fileName, gitAttributes)) continue;

const includeFile = fileNameMatchesFilterPatterns(fileName, includePatterns, excludePatterns);
if (!includeFile) continue;
Expand Down Expand Up @@ -298,3 +304,20 @@ export async function filterFiles(
}
return result;
}

// Reads git attribute patterns from a .gitattributes file.
export function getGitAttributes(directory: string) {
const gitAttributesPath = join(directory, '.gitattributes');
if (!existsSync(gitAttributesPath)) return [];

const gitAttributesContent = readFileSync(gitAttributesPath, 'utf-8');
Comment on lines +311 to +313
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From node.js doc:

Using fs.exists() to check for the existence of a file before calling fs.open(), fs.readFile(), or fs.writeFile() is not recommended. Doing so introduces a race condition, since other processes may change the file's state between the two calls. Instead, user code should open/read/write the file directly and handle the error raised if the file does not exist.

(Essentially, exists returning true doesn't guarantee that read will succeed, so you have to handle the error anyway, so might as well skip the exists.)

const lines = gitAttributesContent.split('\n').filter(Boolean);

return lines.map((line) => {
const [pattern, ...attributes] = line.split(/\s+/);
return {
pattern,
binary: attributes.includes('binary'),
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only pulling binary like that, why even bother returning other patterns? Perhaps it should just return a list of binary patterns instead. This will also simplify upstream code — you can just add those to excluded patterns.

});
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is much simpler than actual gitattributes(5):

Each line in gitattributes file is of form:

pattern attr1 attr2 ...

That is, a pattern followed by an attributes list, separated by whitespaces. Leading and trailing whitespaces are ignored. Lines that begin with # are ignored. Patterns that begin with a double quote are quoted in C style. When the pattern matches the path in question, the attributes listed on the line are given to the path.

When deciding what attributes are assigned to a path, Git consults $GIT_DIR/info/attributes file (which has the highest precedence), .gitattributes file in the same directory as the path in question, and its parent directories up to the toplevel of the work tree (the further the directory that contains .gitattributes is from the path in question, the lower its precedence). Finally global and system-wide files are considered (they have the lowest precedence).

This might be fine for our purposes, but the simplification warrants a comment at least — and I think skipping comments is easy enough to just do it.

49 changes: 48 additions & 1 deletion packages/cli/tests/unit/fulltext/FileIndex.spec.ts
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ import { join } from 'node:path';
import sqlite3 from 'better-sqlite3';
import tmp from 'tmp';

import { FileIndex, filterFiles } from '../../../src/fulltext/FileIndex';
import { FileIndex, filterFiles, getGitAttributes } from '../../../src/fulltext/FileIndex';
import * as querySymbols from '../../../src/fulltext/querySymbols';
import { Git, GitState } from '../../../src/telemetry';
import * as listGitProjectFiles from '../../../src/fulltext/listGitProjectFIles';
Expand Down Expand Up @@ -183,4 +183,51 @@ describe(filterFiles, () => {
'large.txt',
]);
});

it('filters out files based on git attributes', async () => {
const dir = tmp.dirSync({ unsafeCleanup: true }).name;
writeFileSync(join(dir, '.gitattributes'), '*.my-extension binary\n*.bin binary\n*.txt text\n');
writeFileSync(join(dir, 'file.txt'), 'hello');
writeFileSync(join(dir, 'file.my-extension'), 'hello');
writeFileSync(join(dir, 'file.bin'), 'hello');
writeFileSync(join(dir, '.file.bin'), 'hello');

const fileList = readdirSync(dir);
const gitAttributes = getGitAttributes(dir);
const filtered = await filterFiles(dir, fileList, undefined, undefined, gitAttributes);
expect(filtered).toEqual(['.gitattributes', 'file.txt']);
});
});

describe('getGitAttributes', () => {
it('returns an empty array if the .gitattributes file does not exist', () => {
const dir = tmp.dirSync({ unsafeCleanup: true }).name;
expect(getGitAttributes(dir)).toEqual([]);
});

it('returns an empty array if the .gitattributes file is empty', () => {
const dir = tmp.dirSync({ unsafeCleanup: true }).name;
writeFileSync(join(dir, '.gitattributes'), '');
expect(getGitAttributes(dir)).toEqual([]);
});

it('returns an expected result if an item is invalid', () => {
const dir = tmp.dirSync({ unsafeCleanup: true }).name;
writeFileSync(join(dir, '.gitattributes'), 'invalid');
expect(getGitAttributes(dir)).toEqual([
{
pattern: 'invalid',
binary: false,
},
]);
});

it('returns an array of attributes if the .gitattributes file is valid', () => {
const dir = tmp.dirSync({ unsafeCleanup: true }).name;
writeFileSync(join(dir, '.gitattributes'), '*.txt text\n*.bin binary\n');
expect(getGitAttributes(dir)).toEqual([
{ pattern: '*.txt', binary: false },
{ pattern: '*.bin', binary: true },
]);
});
});
Loading