
This repo accompanies our manuscript in preparation on the extent and drivers of gender and race/ethnicity imbalance in infectious disease dynamics publication and citation practices. Much of the code is drawn from Dale Zhou and colleagues' repository, and this work was inspired by Jordan Dworkin and colleagues' work in neuroscience.

Instructions

The goal of the coding notebook is to analyze the predicted gender and race/ethnicity of first and last authors in the reference lists of manuscripts in progress. The code first cleans your .bib file so that it contains only the references you have cited in your manuscript. The cleaned .bib is then used to generate a table of names, which are used to query the probabilistic gender (Gender API or genderize.io) and race/ethnicity (ethnicolr) databases. Proportions of the predicted gender for first and last author pairs (man/man, man/woman, woman/man, and woman/woman) and predicted race (white/white and first and/or last author of color) are calculated from the database probabilities.

1. Obtain a .bib file of your manuscript's reference list. You can do this with common reference managers. Please try to export your .bib in an output style that uses full first names (rather than only first initials) and full author lists (rather than author lists abbreviated with "et al."). If a journal only provides first initials, the code will try to automatically find the full first name using the paper title or DOI (this can typically retrieve the first name 70% of the time).
   - Export .bib from Mendeley
   - Export .bib from Zotero
     Tips for Microsoft Word and Google Docs integration: Set your citation style to BibTeX in Word and Google Docs, and copy the reference list into a .bib file. Make sure you edit Zotero/styles/bibtex.csl: change et-al-min and et-al-first to something large (like 100) to avoid the last author being listed as "et al." Remove the line `<text variable="abstract" prefix=" abstractNote={" suffix="}"/>` to avoid the .bib file getting very large.
   - Export .bib from EndNote. Note: Please export full first names by choosing an output style that does so by default (e.g., MLA style).
   - Export .bib from ReadCube Papers

2. Launch the coding environment. Please refresh the page if the Binder does not load after 5-10 minutes. Sometimes this takes multiple attempts. Alternatively, you can clone the repo, but this requires several package installations.

   Option 1 (recommended):

   Option 2 (experiencing intermittent server issues): Visit https://notebooks.gesis.org/binder/, paste https://github.com/jtaube/IDD_bib_analysis into the GitHub repository name or URL field, and press "launch."

3. Open the notebook cleanBib.ipynb. Follow the instructions above each code block. It can take 10 minutes to 1 hour to complete all of the instructions, depending on the state and size of your .bib file. We expect that the most time-consuming step will be manually modifying the .bib file to find missing author names, fill in incomplete entries, and fix formatting errors. These problems arise because the automated methods of reference managers and Google Scholar sometimes cannot retrieve full information, for example when a journal only provides an author's first initial instead of their full first name.

Input/output

| Input | Output |
| --- | --- |
| `.bib` file(s) (REQUIRED) | `cleanedBib.csv`: table of author first names, titles, and .bib keys |
| `.aux` file (OPTIONAL) | `predictions.csv`: table of author first names, estimated gender classification, and confidence |
| | `race_gender_citations.pdf`: heat map of your citations broken down by probabilistic gender and race estimations |

FAQ

Why do I receive an error when running the code?

The most common errors are due to misformatted .bib files. Error messages are very detailed, and the bottom of the printed message indicates the line and type of problem in the .bib file. You will need to manually correct the formatting errors or incomplete entries in the .bib file. After editing the .bib file, try re-running the code block that gave you the error. If you cannot resolve an error, please open an Issue, paste the error text or a screenshot of the error, and attach the files that you used so that we can reproduce the error. We will try to help resolve it.

Common errors

TokenRequired

TokenRequired: syntax error in line X: entry key expected

This error message indicates that on line X of your uploaded .bib file, there is an incomplete entry that is missing a unique key for the citation. For instance, @article{, should be changed to @article{yourUniqueCitationKey,

Syntax Error

in line X: 'Y' expected

This error message could indicate that there is an unexpected character at line X of your .bib file, such as a space in the name of a field. For example, instead of Early Access Date = …, the field should be changed to EarlyAccessDate = …

Key Error

KeyError: 'author' or KeyError: 'editor'

This error message could indicate that you have cited a book and there is no author field. In this case, if there is an editor field, please change editor to author. Otherwise, add the author metadata.
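
The three problems above can often be spotted before uploading the file. The following is a minimal sketch of such a pre-check, not the parser the notebook uses: the function name `precheck_bib`, the sample entries, and the messages are all hypothetical, and the regular expressions assume one field per line, so they will miss problems in more compactly formatted files.

```python
import re

# Hypothetical sample: line 1 lacks a key, line 4 has a space in a field
# name, and the entry starting on line 8 has an editor but no author.
SAMPLE_BIB = """@article{,
  author = {Doe, Jane and Roe, Richard},
  title = {An example title},
  Early Access Date = {2020-01-01},
  year = {2020}
}

@book{smith2019,
  editor = {Smith, Alex},
  title = {An edited volume},
  year = {2019}
}
"""

# Assumes each entry opens on its own line, e.g. "@article{key,".
ENTRY_START = re.compile(r"^\s*@(\w+)\s*\{\s*([^,\s]*)\s*,?\s*$")
# Assumes each field starts on its own line, e.g. "  title = {...},".
FIELD_NAME = re.compile(r"^\s*([A-Za-z][\w \-]*?)\s*=")


def precheck_bib(text):
    """Return (line_number, message) tuples for likely formatting problems."""
    problems = []
    entry = None  # the entry currently being read, or None

    def finish(e):
        # Report an entry that ended without an author field.
        if e is None or "author" in e["fields"]:
            return
        if "editor" in e["fields"]:
            hint = "rename 'editor' to 'author'"
        else:
            hint = "add the author metadata"
        problems.append((e["line"], f"entry '{e['key']}' has no author field: {hint}"))

    for number, line in enumerate(text.splitlines(), start=1):
        start = ENTRY_START.match(line)
        if start:
            finish(entry)
            kind, key = start.group(1).lower(), start.group(2)
            if kind in ("comment", "string", "preamble"):
                entry = None
                continue
            if not key:
                problems.append((number, "entry key expected"))
            entry = {"line": number, "key": key, "fields": set()}
            continue
        field = FIELD_NAME.match(line)
        if field and entry is not None:
            name = field.group(1)
            if " " in name:
                problems.append((number, f"space in field name '{name}'"))
            entry["fields"].add(name.replace(" ", "").lower())
    finish(entry)
    return problems


for line_number, message in precheck_bib(SAMPLE_BIB):
    print(f"line {line_number}: {message}")
```

A check like this only narrows down where to look; the error message printed by the notebook remains the authoritative indication of the line and type of problem.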

What should I do if the Binder crashes, times out, or takes very long to launch?

If the Binder crashes, please refresh it or re-launch it following step 2 of the Instructions. This has often resolved the issue. The environment will time out if you are inactive for over 10 minutes (but leaving the window open counts as activity). Long launch times (>15 minutes) can be due to a recent patch by us (a temporary slow-down from re-building the Docker image) or heavy load on the server. Please try again at a later time. Please refer to the Binder User Guide and FAQ for other questions.

Will this method work on non-Western names and how accurate is it?

Yes, the Gender API supports 177 countries but will classify genders with varying confidence. Dworkin et al. (2020; Supplementary Tables 1 and 2) assessed the extent of potential gender mislabeling by manually inspecting a sample of 200 authors. They found that the automated determination procedure had an accuracy of ≈ 0.96 at the level of individual authors and ≈ 0.92 at the level of article gender categories. Because errors in gender determination would break the links between citation behavior and author gender, any incorrect estimation in the present data likely biases the results towards the null. The ethnicolr race probabilities use the last-name–race data from the 2000 and 2010 censuses, the Wikipedia data collected by Skiena and colleagues, and the Florida voter registration data from early 2017. Across cross-validation folds, the average precision was 0.83, recall was 0.84, and F1 score was 0.83 for the Florida model. Please see this confusion matrix for the accuracy and precision of the algorithm during cross-validation and Sood & Laohaprapanon (2018).

How are proportions calculated, especially when considering gender-neutral or race-ambiguous names?

MIGHT NEED TO CHANGE The proportions for predicted gender and race are now weighted probabilistically. For instance, if Gender API predicts a name as man with 72% accuracy, then 0.72 is added to the man proportion. The proportions are calculated from weighted sums across all author pairs. Similarly, for predicted race, the ethnicolr package can be used to make binary predictions but also provides probabilities that an author belongs to each racial category. Consider the last name "Smith." The model's probabilities for the name "Smith" are 73% white, 25% Black, 1% Hispanic, and <1% Asian. We use all four probabilities to estimate how citers probabilistically assign racial categories to names, either implicitly or explicitly, while reading and citing papers. Note that imperfections in the algorithm's predictions will break the links between citation behavior and author race, and therefore any incorrect estimation in the present data likely biases the results towards the null.
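
To make the weighting concrete, here is a small sketch of how probabilistic proportions could be computed for author pairs. It is an illustration under stated assumptions, not the notebook's code: the function names and example probabilities are hypothetical, and it assumes that the first-author and last-author predictions are independent, so a pair's category probability is the product of the two individual probabilities.

```python
from itertools import product

GENDERS = ("man", "woman")


def pair_probabilities(first_p_man, last_p_man):
    """Joint probabilities of the four first/last author gender categories.

    Assumption of this sketch: the two predictions are independent.
    """
    first = {"man": first_p_man, "woman": 1.0 - first_p_man}
    last = {"man": last_p_man, "woman": 1.0 - last_p_man}
    return {f"{f}/{l}": first[f] * last[l] for f, l in product(GENDERS, GENDERS)}


def weighted_proportions(pairs):
    """Average the category probabilities across all (first, last) pairs."""
    if not pairs:
        raise ValueError("at least one author pair is required")
    totals = {f"{f}/{l}": 0.0 for f, l in product(GENDERS, GENDERS)}
    for first_p_man, last_p_man in pairs:
        for category, p in pair_probabilities(first_p_man, last_p_man).items():
            totals[category] += p  # weighted sum, not a hard 0/1 count
    return {category: total / len(pairs) for category, total in totals.items()}


def race_pair_probabilities(first_p_white, last_p_white):
    """Probabilities of the two race categories, under the same assumption."""
    white_white = first_p_white * last_p_white
    return {"white/white": white_white, "author of color": 1.0 - white_white}


# Hypothetical probabilities that the first and last author are men.
example_pairs = [(0.72, 0.95), (0.10, 0.60), (0.50, 0.50)]
proportions = weighted_proportions(example_pairs)
for category, value in proportions.items():
    print(f"{category}: {value:.3f}")
```

Because each pair contributes probabilities that sum to 1, the four proportions also sum to 1, and a gender-neutral name (probability near 0.5) spreads its weight across categories instead of being forced into one.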

Are self-citations included?

We do not include self-citations by default because we seek to measure engagement with and citation of other researchers' work. We define self-citations as those including your first or last author as a co-author.
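
As an illustration of this definition, the sketch below flags a reference as a self-citation when your first or last author appears anywhere in its author list. The function names and the name normalization (lowercasing, dropping periods, collapsing spaces) are assumptions made for this example; the notebook may match names differently.

```python
def normalize(name):
    """Lowercase a name, drop periods, and collapse repeated whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def is_self_citation(cited_authors, your_first_author, your_last_author):
    """True if your first or last author is a co-author of the cited paper."""
    own = {normalize(your_first_author), normalize(your_last_author)}
    return any(normalize(author) in own for author in cited_authors)


# Hypothetical names for illustration only.
print(is_self_citation(["Jane Doe", "Alex Smith"], "jane doe", "Richard Roe"))
print(is_self_citation(["Alex Smith", "Sam Lee"], "Jane Doe", "Richard Roe"))
```

Exact matching on full names is deliberately strict; entries that list only initials would need manual checking.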

What if a reference has only 1 author?

We omit that paper from your analysis.

What if the author list includes a committee or consortium?

Please use either the closest named author (first or last) or the lead of the committee/consortium.

What if a reference has more than 1 first author or last author?

We do not automatically account for these cases. If you are aware of papers with co-first or co-last authors, then you could manually add duplicate entries for each co-first or co-last author so that they are double-counted.

Should I include the diversity statement references in the proportion calculations?

Please do not include the diversity statement references. The descriptive statistic of primary interest is of your citation practices.

What is a .bib file?

The .bib file is a bibliography with tagged entry fields used by LaTeX to format a typeset manuscript's reference list and its in-line citations. If you are not using LaTeX to write your manuscript, common reference managers that are linked to Microsoft Word or Google Docs also allow you to export .bib files (see Instructions, Step 1). When you are asked to edit the .bib file, this means that you should open the file with a text editor (if you want to edit your own copy of the file) or within the Binder environment (if you want to edit a temporary copy of your uploaded file). Each entry starts with an @ symbol and includes a reference key, then lists metadata for author, year, journal, etc. Some instructions will ask you to edit the list of names in the "author" data, and some will ask you to remove entire entries.

What is an .aux file?

The .aux file is generated when you compile the .tex file to build your manuscript. It is linked to the .bib file(s) used to populate your manuscript's reference list and records the citations used.
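
For readers curious about how the cited keys can be recovered, the following sketch reads the `\citation{...}` lines that a BibTeX-based LaTeX run writes to the .aux file. It is an assumption-laden illustration, not necessarily how the notebook does it: the function name and sample content are hypothetical, and .aux files produced with other bibliography backends may record citations in a different form.

```python
import re

# Matches lines such as \citation{doe2020,smith2019}.
CITATION = re.compile(r"\\citation\{([^}]*)\}")


def cited_keys(aux_text):
    """Return the set of citation keys recorded in the text of an .aux file."""
    keys = set()
    for match in CITATION.finditer(aux_text):
        for key in match.group(1).split(","):
            key = key.strip()
            if key:
                keys.add(key)
    return keys


# Hypothetical .aux content for illustration only.
SAMPLE_AUX = r"""\relax
\citation{doe2020,smith2019}
\bibstyle{plain}
\bibdata{references}
\citation{doe2020}
"""

print(sorted(cited_keys(SAMPLE_AUX)))
```

Comparing this set against the keys in the .bib file is one way to keep only the references that were actually cited.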

I have an idea to advance this project, suggestions about how to improve the notebook, and/or found a bug. Can I contribute? How do I contribute?

If you have suggestions for changes, please open an Issue or Pull Request. We welcome feedback on any pain points in running this code notebook (there is an Issue in which you can submit feedback).
To modify the notebook cleanBib.ipynb, please:

1. Test that the code works as intended and does not seem to break any existing code (we can also help to check this later) by pasting it into the cleanBib.ipynb Jupyter notebook and running it in an active Binder session.
2. When you're confident it works as intended, copy the code again if you made any modifications during testing. Close the current Binder session, start a fresh one, and open cleanBib.ipynb without running anything in it (to remove traces of when you last ran it and how many times you ran the code). Paste in your code, then go to File > Download As > Notebook (.ipynb).
3. Create a fork of our GitHub repository in your own GitHub account.
4. Upload and commit your modified cleanBib.ipynb to your fork.
5. Submit a Pull Request to our GitHub repository.
