This is an example Node.js application that processes a text corpus, generates embeddings for "chunks" of the text, and saves the embeddings to a local file. The embeddings can then be used in another application, such as a Retrieval-Augmented Generation (RAG) system or a 2D/3D clustering demonstration using UMAP dimensionality reduction.
There are two main scripts in this project:
- `embeddings-replicate.js`: Generates embeddings using the Llama model on Replicate.
- `embeddings-transformers.js`: Generates embeddings using the bge-small model with transformers.js.

Both scripts output the embeddings to `embeddings.json`.
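The exact contents of `embeddings.json` depend on the script, but it is easiest to think of it as an array of chunk/vector pairs that a downstream RAG or UMAP demo can load directly. Below is a minimal sketch of reading the file and comparing two chunk vectors; the `text`/`embedding` field names are illustrative assumptions, not necessarily the scripts' actual output:

```js
import fs from 'fs';

// Assumed shape (illustrative): [{ text: '...', embedding: [0.01, -0.02, ...] }, ...]
const entries = JSON.parse(fs.readFileSync('embeddings.json', 'utf-8'));

// Example downstream use: cosine similarity between two chunk vectors,
// the basic building block of the retrieval step in a RAG system.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity(entries[0].embedding, entries[1].embedding));
```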
- `.env`: API token for Replicate
- Uses open-source models for faster and cheaper text embeddings
- Uses the transformers.js package and the bge-small model for embeddings generation
- `embeddings-transformers.js`: Script to process a text file and generate embeddings using the bge-small model
- Install dependencies:

  ```
  npm install
  ```

- Set up the `.env` file with your Replicate API token:

  ```
  REPLICATE_API_TOKEN=your_api_token_here
  ```
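  How the scripts read the token is an assumption here, but a typical pattern with the `dotenv` package looks like this:

  ```js
  // Minimal sketch, assuming the dotenv package is used to load .env
  // into process.env before the Replicate client is created.
  import 'dotenv/config';

  const token = process.env.REPLICATE_API_TOKEN;
  if (!token) {
    throw new Error('REPLICATE_API_TOKEN is not set in .env');
  }
  ```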
- Generate the `embeddings.json` file with Replicate. You'll need to hard-code a text filename and adjust how the text is split up depending on the format of your data:

  ```js
  const raw = fs.readFileSync('text-corpus.txt', 'utf-8');
  let chunks = raw.split(/\n+/);
  ```

  Then:

  ```
  node embeddings-replicate.js
  ```
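  To see roughly what happens to those chunks, here is a minimal sketch of embedding them through the official `replicate` client. The model identifier and input/output fields are placeholders, not necessarily what `embeddings-replicate.js` actually uses:

  ```js
  // Minimal sketch; 'owner/embedding-model:version' is a hypothetical
  // model id -- check embeddings-replicate.js for the real one.
  import fs from 'fs';
  import Replicate from 'replicate';

  const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

  const raw = fs.readFileSync('text-corpus.txt', 'utf-8');
  const chunks = raw.split(/\n+/).filter((c) => c.trim().length > 0);

  const results = [];
  for (const text of chunks) {
    // Send each chunk to the embedding model and keep the vector it returns.
    const output = await replicate.run('owner/embedding-model:version', {
      input: { text },
    });
    results.push({ text, embedding: output });
  }

  fs.writeFileSync('embeddings.json', JSON.stringify(results, null, 2));
  ```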
- Or generate the `embeddings.json` file with transformers.js instead. Adjust the text filename and splitting method as needed:

  ```js
  const raw = fs.readFileSync('text-corpus.txt', 'utf-8');
  let chunks = raw.split(/\n+/);
  ```

  Then:

  ```
  node embeddings-transformers.js
  ```
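For reference, here is a minimal sketch of what the transformers.js path can look like, assuming the `@xenova/transformers` package and the `Xenova/bge-small-en-v1.5` checkpoint (the actual model id and options in `embeddings-transformers.js` may differ):

```js
// Minimal sketch of bge-small embeddings with transformers.js; the real
// script may use a different model id, pooling, or output format.
import fs from 'fs';
import { pipeline } from '@xenova/transformers';

const raw = fs.readFileSync('text-corpus.txt', 'utf-8');
const chunks = raw.split(/\n+/).filter((c) => c.trim().length > 0);

// The feature-extraction pipeline returns one vector per input text.
const extractor = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');

const results = [];
for (const text of chunks) {
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  results.push({ text, embedding: Array.from(output.data) });
}

fs.writeFileSync('embeddings.json', JSON.stringify(results, null, 2));
```

Because this path runs the model locally in-process, no Replicate API token is needed for it.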