Code to replicate *The Undesirable Dependence on Frequency of Gender Bias Metrics Based on Word Embeddings* (2022).
This guide was run on Ubuntu 18.04.4 LTS with Python 3.9.12 and R 4.2.0. You can set up a conda environment (see the end of this guide), but it is not compulsory.
Install Python requirements:
python -m pip install -r requirements.txt
Install R requirements:
Rscript install_packages.R
Clone Stanford's GloVe repo into this repository:
git clone https://github.com/stanfordnlp/GloVe.git
or, alternatively, add it as a submodule:
git submodule add https://github.com/stanfordnlp/GloVe
To build GloVe:
- On Linux:
cd GloVe && make
- On Windows:
make -C "GloVe"
- Download the 2021 English Wikipedia dump into the `corpora` dir:
mkdir -p corpora
WIKI_URL=https://archive.org/download/enwiki-20210401
WIKI_FILE=enwiki-20210401-pages-articles.xml.bz2
wget -c -b -P corpora/ $WIKI_URL/$WIKI_FILE
# flag "-c": continue getting a partially-downloaded file
# flag "-b": go to background after startup. Output is redirected to wget-log.
- Extract the dump into a raw `.txt` file:
src/data/extract_wiki_dump.sh corpora/enwiki-20210401-pages-articles.xml.bz2
- Create a text file with one line per sentence, removing articles with fewer than 50 words:
python3 -u src/data/tokenize_and_clean_corpus.py corpora/enwiki-20210401-pages-articles.txt
- Remove non-alphanumeric symbols from sentences, clean up whitespace, and convert uppercase to lowercase:
CORPUS_IN=corpora/enwiki-20210401-pages-articles_sentences.txt
CORPUS_OUT=corpora/wiki2021.txt
src/data/clean_corpus.sh $CORPUS_IN > $CORPUS_OUT
Check the number of lines, tokens, and characters in the preprocessed corpus:
wc corpora/wiki2021.txt
# 78051838 1748884626 10453280228 corpora/wiki2021.txt
- Shuffle the corpus multiple times. Set the seeds in `src/data/shuffle_corpus_multiple.sh`. Each new corpus is named `corpora/wiki2021s<seed>.txt`.
CORPUS_IN=corpora/wiki2021.txt
bash src/data/shuffle_corpus_multiple.sh $CORPUS_IN
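For reference, here is a minimal Python sketch of a seeded line-level shuffle, assuming the script shuffles whole lines with one fixed seed per output file (the actual logic lives in `src/data/shuffle_corpus_multiple.sh`, which may stream instead of loading the corpus into memory):

```python
import random

def shuffle_corpus(path_in: str, path_out: str, seed: int) -> None:
    """Shuffle the lines of a corpus file with a fixed seed (loads the file into memory)."""
    with open(path_in, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(path_out, "w", encoding="utf-8") as f:
        f.writelines(lines)

# e.g. produce corpora/wiki2021s1.txt ... corpora/wiki2021s5.txt
for seed in range(1, 6):
    shuffle_corpus("corpora/wiki2021.txt", f"corpora/wiki2021s{seed}.txt", seed)
```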
- Create the vocabulary of the original and shuffled corpora using the GloVe module:
mkdir -p data/working &&
OUT_DIR=data/working &&
VOCAB_MINCOUNT=100 &&
IDS=(wiki2021 wiki2021s1 wiki2021s2 wiki2021s3 wiki2021s4 wiki2021s5) &&
for id in "${IDS[@]}"; do
  corpus=corpora/$id.txt
  src/corpus2vocab.sh $corpus $OUT_DIR $VOCAB_MINCOUNT
done
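GloVe's `vocab_count` writes one `word count` pair per line. A minimal sketch for reading such a file into a dict (the filename below is an assumption; check `data/working/` for the actual output names):

```python
def load_vocab(path: str) -> dict:
    """Read a GloVe-style vocab file with one 'word count' pair per line."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.split()
            vocab[word] = int(count)
    return vocab

# Hypothetical filename; check data/working/ for the actual output names
vocab = load_vocab("data/working/vocab-wiki2021.txt")
print(len(vocab))  # number of words with count >= VOCAB_MINCOUNT
```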
- Create co-occurrence matrices in `scipy.sparse` format (`.npz` files) using the GloVe module:
src/corpus2cooc_multiple.sh
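The resulting matrices can be inspected with `scipy.sparse`; a minimal sketch, with a hypothetical filename:

```python
import scipy.sparse

# Hypothetical path; check the script for the actual output names
cooc = scipy.sparse.load_npz("data/working/cooc-wiki2021.npz")
print(cooc.shape)  # square matrix with one row/column per vocabulary word
print(cooc.nnz)    # number of stored (non-zero) co-occurrence counts
```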
- Download GloVe and word2vec pretrained embeddings:
python3 -u src/download_pretrained_we.py
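The same pretrained vectors are available through `gensim`'s downloader API (whether the script uses this API is an assumption), so you can also load them directly:

```python
import gensim.downloader as api

# Downloads on first call and caches under ~/gensim-data
glove = api.load("glove-wiki-gigaword-300")
w2v = api.load("word2vec-google-news-300")
print(glove.most_similar("queen", topn=3))
```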
- Train SGNS on the corpora with the `gensim` library. For each corpus, this saves a `.model` file with the trained model and a `.npy` file with the embeddings in array format. If the model is large, files with extensions `.trainables.syn1neg.npy` and `.wv.vectors.npy` might be saved alongside the `.model` file.
bash src/corpus2sgns_multiple.sh
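To sanity-check a trained model afterwards, the saved files can be loaded with `gensim` and `numpy`; the paths below are hypothetical, so check the script for the actual naming scheme:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical output paths; check src/corpus2sgns_multiple.sh for the real ones
model = Word2Vec.load("data/working/w2v_wiki2021.model")
vectors = np.load("data/working/w2v_wiki2021.npy")

print(model.wv.most_similar("doctor", topn=3))  # nearest neighbours in the trained space
print(vectors.shape)                            # (vocabulary size, embedding dimension)
```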
- Train GloVe on the corpora and save one `.npy` file for each corpus with the vectors in array format.
bash src/corpus2glove_multiple.sh
- Compute female vs. male $Bias_{WE}$ with pre-trained word embeddings. The lists of context words are specified in `words_lists/` as text files. Results are saved to `results/bias_{modelname}_{A}_{B}.csv`, with one row per word in the vocabulary (see the loading sketch after the commands below).
mkdir -p results &&
A="FEMALE" &&
B="MALE" &&
python3 -u src/pretrained2biasdf.py $A $B "glove-wiki-gigaword-300" &&
python3 -u src/pretrained2biasdf.py $A $B "word2vec-google-news-300"
- Compute female vs. male $Bias_{WE}$ and $Bias_{PMI}$ for the original and shuffled corpora. The lists of context words are specified in `words_lists/` as text files. Results are saved to `results/bias_{modelname}_{A}_{B}.csv`, with one row per word in the vocabulary (rough forms of both metrics are sketched after this step).
A="FEMALE" &&
B="MALE" &&
nohup src/biasdf_multiple.sh $A $B &
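For orientation, the two bias metrics have roughly the following shape, where $w$ is a target word and $A$, $B$ are the female and male context-word lists. This is a hedged paraphrase, not the paper's exact notation; consult the paper for the precise definitions.

$$Bias_{WE}(w) = \frac{1}{|A|}\sum_{a \in A} \cos(\vec{w}, \vec{a}) - \frac{1}{|B|}\sum_{b \in B} \cos(\vec{w}, \vec{b})$$

$$Bias_{PMI}(w) = \operatorname{PMI}(w, A) - \operatorname{PMI}(w, B)$$

where the PMI terms are estimated from the corpus co-occurrence counts computed earlier.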
To replicate the figures for pre-trained embeddings:
mkdir -p results/plots
R -e 'rmarkdown::render("plots_pretrained.Rmd", "html_document")'
To replicate the tables and figures for embeddings trained on the corpora:
R -e 'rmarkdown::render("plots_trained.Rmd", "html_document")'
Results are saved as HTML documents.
You can create a `bias-frequency` conda environment to install requirements and dependencies. This is not compulsory.
To install Miniconda if needed, run:
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
sha256sum Miniconda3-py37_4.10.3-Linux-x86_64.sh
bash Miniconda3-py37_4.10.3-Linux-x86_64.sh
# and follow stdout instructions to run commands with `conda`
To create a conda env with Python and R:
conda config --add channels conda-forge
conda create -n "bias-frequency" --channel=defaults python=3.9.12
conda install -n "bias-frequency" --channel=conda-forge r-base=4.2.0
Activate the environment with `conda activate bias-frequency` and install pip with `conda install pip`.