Exploring Naive Approaches to Tell Apart LLMs Productions from Human-written Text

This is the official repository for code and results of the paper:

Exploring Naive Approaches to Tell Apart LLMs Productions from Human-written Text. Oliver Giudice, Alessandro Maggi, Matteo Nardelli, NLPIR 2023

Powerful Large Language Models (large LMs or LLMs) such as BERT and GPT are making the task of detecting machine-generated text increasingly prominent and crucial for minimizing the threats posed by the misuse of text generation models. Nonetheless, only a limited number of efforts exist so far, which can be classified into simple classifiers, zero-shot approaches, and fine-tuned LMs. These approaches usually rely on LMs whose discrimination accuracy decreases as the size difference in favor of the generator model increases (hence, a detector should always employ an LM with at least as many parameters as the source LM). Also, most of these approaches do not explicitly investigate whether the syntactic structure of sentences can provide additional information that helps to build better detectors. All these considerations call the generalization ability of detection methods into question. While generation techniques become more and more capable of producing human-like text, are detection techniques capable of keeping up if not properly trained? In this paper, we evaluate the most effective (and reproducible) detection method available in the state of the art in order to identify the limits of its robustness. We complement this analysis by discussing results obtained using a novel naive approach that demonstrably achieves comparable results in terms of robustness with respect to much more advanced and sophisticated state-of-the-art methods.

Setup

Create a new environment using conda:

conda env create -f env_transf291.yml

Activate the environment and install ipykernel:

conda activate transf291
conda install ipykernel

You should now be able to use your conda environment as a Jupyter Notebook kernel.

If transf291 does not appear among the available kernels, try registering it as a kernel as follows:

ipython kernel install --user --name=transf291
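To confirm that the kernel was registered, the sketch below lists the kernel names Jupyter knows about. It is not part of this repository: the helper names `kernel_names` and `is_kernel_registered` are hypothetical, and the script assumes the `jupyter` command is on your PATH and supports `jupyter kernelspec list --json`.

```python
import json
import shutil
import subprocess


def kernel_names(kernelspec_json):
    """Return the sorted kernel names found in `jupyter kernelspec list --json` output."""
    data = json.loads(kernelspec_json)
    # The JSON output maps each kernel name to its spec under the "kernelspecs" key.
    return sorted(data.get("kernelspecs", {}))


def is_kernel_registered(name="transf291"):
    """Check whether a kernel with the given name is visible to Jupyter."""
    if shutil.which("jupyter") is None:
        # Jupyter is not installed or not on PATH, so no kernel can be found.
        return False
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return name in kernel_names(result.stdout)


if __name__ == "__main__":
    print("transf291 registered:", is_kernel_registered())
```

Run it from the activated environment; if it reports that the kernel is not registered, repeat the ipython kernel install command above.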

Start Jupyter Notebook:

jupyter notebook

Citation

If you find this code useful for your research, please cite our paper:

@inproceedings{giudice2023text,
   title={Exploring Naive Approaches to Tell Apart LLMs Productions from Human-written Text},
   author={Giudice, Oliver and Maggi, Alessandro and Nardelli, Matteo},
   booktitle={7th International Conference on Natural Language Processing and Information Retrieval},
   year={2023},
   organization={ACM}
}
