Having trouble getting run_CLTrain.sh to execute #4

Open
Elfinwang opened this issue Nov 17, 2024 · 2 comments

Comments

@Elfinwang

I’m having trouble getting the run_CLTrain.sh script to execute.

  1. Where can I get the file referenced by 'file_name="data/data_simcse/${train_file}_for_simcse.csv"'?
  2. I would appreciate some guidance on recommended training parameters, such as the number of epochs.
  3. I downloaded 'nli_for_simcse.csv' from 'https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/nli_for_simcse.csv', but I get an error at the following line of code (src/train.py, line 319):
    examples[sent0_cname][idx] = conv_dict(ast.literal_eval(examples[sent0_cname][idx].replace('-inf', '-2e308')))
    The error occurs when ast.literal_eval tries to parse the string.

I would appreciate your help!!!

@Elfinwang Elfinwang changed the title from 'Having trouble getting run_CLTrain.sh' to 'Having trouble getting run_CLTrain.sh to execute' on Nov 18, 2024
@LZ12DH
Collaborator

LZ12DH commented Nov 26, 2024

Hi,

Thanks for the feedback!

The '${train_file}_for_simcse.csv' file is generated by running 'src/prepare_CL_dataset.py'. Sorry, my training data is over 2 GB, so I could not upload it to the repo. You can run 'prepare_CL_dataset.py' on some query triplets to generate the files.

As for the error you encountered, the cause is that we use a tree-based query encoding, which differs from the plain text used in SimCSE, so the plain-text sentences in 'nli_for_simcse.csv' are not something ast.literal_eval can parse. If you have trouble running the above-mentioned code, you can also contact me by email at [email protected], and I can share a small set of training data with you so you can check whether the error still occurs.
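To illustrate the mismatch, here is a minimal sketch in Python. It is not the code in src/train.py: the `parse_cell` helper and the sample dictionary are hypothetical and only stand in for whatever tree-based encoding the prepared CSV actually contains. It shows that ast.literal_eval accepts a string holding a Python literal (with '-inf' swapped for '-2e308', as in the line from the traceback) but rejects a plain SimCSE sentence.

```python
import ast


def parse_cell(cell: str):
    """Parse a CSV cell as a Python literal.

    Returns the parsed object, or None if the cell is not a valid
    literal (for example, a plain-text sentence).
    """
    try:
        # '-inf' is not a valid literal, but '-2e308' overflows to float -inf.
        return ast.literal_eval(cell.replace('-inf', '-2e308'))
    except (ValueError, SyntaxError):
        return None


# Plain-text sentence in the style of the SimCSE NLI data.
plain_sentence = "A man is playing a guitar."

# Hypothetical literal-encoded cell; the real tree format may differ.
encoded_cell = "{'op': 'scan', 'cost': -inf, 'children': []}"

plain_result = parse_cell(plain_sentence)
encoded_result = parse_cell(encoded_cell)

print("plain sentence parsed:", plain_result is not None)
print("encoded cell parsed:", encoded_result is not None)
```

Running a check like this over the first few rows of a CSV file is a quick way to tell whether it was produced for plain-text SimCSE or contains literal-encoded cells.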

Hope this reply clarifies your doubts!

@MattCremeens

MattCremeens commented Jan 2, 2025

@LZ12DH, would you mind sharing a sample of the training triplets referenced in the following line?

dataset_file = '../data/data_simcse/' + dataset + '/training_triplets_' + dataset + '_total.csv'

Is each row just an id, a query, a query rewrite that increases efficiency, and a query rewrite that decreases efficiency? How did you create this set of triplets? Did you find them somewhere, or did you construct them yourself?
