chore: avoid known file signatures in datatypeId (#155)
This is a performance optimization for the Mapeo indexer: it avoids
trying to parse files that are not Mapeo docs. For example, a hypercore
might have PNG files written to it, and every PNG file starts with the
bytes '89 50 4E 47 0D 0A 1A 0A'. If a dataTypeId matched the start of
that signature, the indexer would treat any PNG in the core as a Mapeo
datatype and try to parse it. Parsing would fail and the file would be
ignored, but the attempt would still have a performance cost.

This commit adds a check to the build script that throws an error if a
new dataType is added whose ID matches one of the known file signature
prefixes. In some cases we don't check against the whole file signature;
we just avoid starting data type IDs with byte(s) that are common in
file signatures.
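
To make the collision concrete, here is a minimal sketch. It assumes the indexer reads the first 6 bytes of a block as the data type ID; the 6-byte length comes from the build check, but the block layout and the helper name `leadingIdHex` are assumptions for illustration, not Mapeo's actual indexer code.

```javascript
// Hypothetical helper: read the first 6 bytes of a block as a hex data type ID.
// The 6-byte length matches the build check; the block layout is an assumption.
function leadingIdHex(block) {
  return Buffer.from(block).subarray(0, 6).toString('hex')
}

// The 8-byte PNG signature quoted in the commit message
const pngHeader = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])

// A data type ID equal to the first 6 bytes of the PNG signature would make
// every PNG blob look like a document of that type.
const collidingId = '89504e470d0a'
console.log(leadingIdHex(pngHeader) === collidingId) // true
```

Under that assumption, rejecting such IDs at build time means a PNG blob can never be mistaken for a known data type.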
gmaclennan authored Oct 26, 2023
1 parent a977f6b commit 01c3bf6
Showing 1 changed file with 43 additions and 0 deletions.
43 changes: 43 additions & 0 deletions scripts/lib/parse-config.js
@@ -8,6 +8,48 @@ import { capitalize, PROJECT_ROOT } from './utils.js'
// These messages are embedded in others and do not define Mapeo data types
const EMBEDDED_MESSAGES = ['tags', 'common']

// We avoid creating data type IDs that match these, since blobs (e.g. icons)
// can be stored in Mapeo hypercores, and we want to avoid trying to parse a
// file blob as a Mapeo datatype. This just minimizes cases where the Mapeo
// indexer might try to parse (and fail) a document that is not actually a Mapeo
// doc.
const KNOWN_FILE_SIGNATURE_PREFIXES = [
  [0xef, 0xbb, 0xbf], // UTF-8 BOM
  [0xfe, 0xff], // UTF-16 BOM
  [0x3c, 0x3f, 0x78, 0x6d, 0x6c], // `<?xml` e.g. SVG file (icons are written as raw XML blobs)
  [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a], // PNG file
  [0x42, 0x4d], // BMP
  [0xff], // MP4 AAC, MP3 - a few formats start with this
  [0x66], // M4A / AAC, FLAC - a few formats start with this
  [0x52, 0x49, 0x46, 0x46, 0x57, 0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20], // WAV
]

/** @param {string} dataTypeId */
function validateDatatypeId(dataTypeId) {
  const buf = Buffer.from(dataTypeId, 'hex')
  if (buf.length !== 6) {
    throw new Error('datatypeId must be 6 bytes encoded as hex: ' + dataTypeId)
  }
  const matchingKnownFileSignature = KNOWN_FILE_SIGNATURE_PREFIXES.find(
    (prefix) => {
      let doesMatch = true
      for (let i = 0; i < Math.min(prefix.length, 6); i++) {
        if (prefix[i] !== buf[i]) {
          doesMatch = false
        }
      }
      return doesMatch
    }
  )
  if (matchingKnownFileSignature) {
    throw new Error(
      'This datatype ID (' +
        dataTypeId +
        ') matches a known file signature, please choose a different one'
    )
  }
}

/**
 * Parse the proto message types and check:
 *
@@ -65,6 +107,7 @@ export function parseConfig() {
throw new Error('Duplicate dataTypeId in ' + filepath)
}
duplicateIdCheck.set(dataTypeId, schemaName)
validateDatatypeId(dataTypeId)

dataTypeIds[schemaName] = dataTypeId
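
As a rough illustration of how the prefix check behaves, the sketch below re-expresses the matching logic with `some` and `every` and runs it on a few sample IDs. The helper name `matchesKnownPrefix`, the reduced prefix list, and the sample IDs are made up for this example; the commit itself uses the `find` loop shown in the diff.

```javascript
// Subset of the signature prefixes from the diff, for illustration only
const PREFIXES = [
  [0xef, 0xbb, 0xbf], // UTF-8 BOM
  [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a], // PNG
  [0xff], // single-byte prefix shared by a few formats
]

// Hypothetical helper: true if the hex-encoded ID starts with any prefix.
// Prefixes longer than the ID are compared on the ID's bytes only.
function matchesKnownPrefix(idHex) {
  const buf = Buffer.from(idHex, 'hex')
  return PREFIXES.some((prefix) =>
    prefix.slice(0, buf.length).every((byte, i) => byte === buf[i])
  )
}

console.log(matchesKnownPrefix('89504e470d0a')) // true: first 6 bytes of PNG
console.log(matchesKnownPrefix('ff0000000000')) // true: starts with 0xff
console.log(matchesKnownPrefix('5974dd14937e')) // false: made-up ID, no match
```

Note that a one-byte prefix such as `[0xff]` rules out every ID beginning with that byte, which is the "avoid starting data type IDs with common bytes" behavior the commit message describes.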
