Author: Nerses Nersesyan
ModelJack is a project for emulating interesting language APIs with simple models that run locally, avoiding the latency, costs and request limits of the hosted services.
In the first example we show how to train a model that emulates the Google Perspective API using data from the Wikipedia Talk project and the fasttext library.
The Perspective API is a demo released by the Google Jigsaw team. The API scores a comment based on its potential impact on a conversation, detecting personal attacks. More detailed information about the project can be found here.
Detecting and reducing toxic comments and personal attacks is very important for most platforms with user-generated content. The Perspective API is potentially very useful, but is a demo limited to 1000 requests.
Can we emulate it so that developers can integrate this functionality into their platforms today?
Download the data to this directory and run:
python ft_cls.py
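The contents of ft_cls.py are not reproduced here. As a rough illustration only, a fastText classification script of this kind could look like the minimal sketch below. It is not the author's code: the file paths, the label names `__label__attack` and `__label__ok`, and the hyperparameters are all assumptions. It uses the `train_supervised`, `test` and `predict` calls of the fastText Python module.

```python
def clean(text):
    """Collapse all whitespace; fastText's predict() rejects text containing newlines."""
    return " ".join(text.split())


def train(train_path, test_path):
    """Train a supervised fastText model and print precision/recall on a held-out file.

    Both files are assumed to hold one example per line in fastText format:
    "__label__attack some comment text" or "__label__ok some comment text".
    """
    # Imported lazily so the pure-Python helpers work without fastText installed.
    import fasttext

    # Hyperparameters are illustrative guesses, not tuned values.
    model = fasttext.train_supervised(input=train_path, epoch=10, lr=0.5, wordNgrams=2)
    n, precision, recall = model.test(test_path)
    print(f"{n} test examples, precision@1={precision:.3f}, recall@1={recall:.3f}")
    return model


def attack_score(labels, probs, target="__label__attack"):
    """Return the probability assigned to the (assumed) attack label, clipped to [0, 1]."""
    for label, prob in zip(labels, probs):
        if label == target:
            return min(1.0, max(0.0, float(prob)))
    return 0.0


def score_comment(model, comment):
    """Score one comment with a trained model, loosely mimicking an API-style score."""
    labels, probs = model.predict(clean(comment), k=-1)  # k=-1 returns every label
    return attack_score(labels, probs)
```

With a trained model, `score_comment(model, "some comment")` returns a single number that can be thresholded locally, with no request limit.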
See Results
For training and evaluation of the model we used the Wikipedia Talk project dataset.
The Wikipedia Talk project release includes:
- a large historical corpus of discussion comments on Wikipedia talk pages
- a sample of over 100k comments with human labels for whether the comment contains a personal attack
- a sample of over 100k comments with human labels for whether the comment has an aggressive tone
Please refer to meta.wikimedia.org/wiki/Research:Detox/Data_Release for documentation of the schema of each data set.
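Since each comment is labelled by several annotators, the labels have to be aggregated before training. The sketch below shows one possible way to do that and to emit lines in fastText's `__label__` format. The column names `rev_id` and `attack`, the `NEWLINE_TOKEN` and `TAB_TOKEN` placeholders, the majority-vote rule and the label names are assumptions for illustration; check them against the schema documentation linked above.

```python
from collections import defaultdict


def majority_labels(annotation_rows):
    """Map rev_id -> True when more than half of its annotators flagged an attack.

    Each row is a dict with (assumed) keys "rev_id" and "attack" (0 or 1).
    """
    votes = defaultdict(list)
    for row in annotation_rows:
        votes[row["rev_id"]].append(float(row["attack"]))
    return {rev_id: sum(v) / len(v) > 0.5 for rev_id, v in votes.items()}


def clean_comment(text):
    """Replace the (assumed) placeholder tokens and collapse whitespace."""
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    return " ".join(text.split())


def to_fasttext_line(comment, is_attack):
    """Format one example as '__label__<name> <text>', the layout fastText expects."""
    label = "__label__attack" if is_attack else "__label__ok"
    return f"{label} {clean_comment(comment)}"


if __name__ == "__main__":
    # Tiny in-memory example; real rows would come from csv.DictReader on the TSV files.
    annotations = [
        {"rev_id": "1", "attack": "1.0"},
        {"rev_id": "1", "attack": "0.0"},
        {"rev_id": "1", "attack": "1.0"},
        {"rev_id": "2", "attack": "0.0"},
    ]
    comments = {"1": "You areNEWLINE_TOKENwrong", "2": "Thanks for the edit"}
    labels = majority_labels(annotations)
    for rev_id, text in comments.items():
        print(to_fasttext_line(text, labels[rev_id]))
```

A strict majority is only one choice; a lower threshold would trade precision for recall on the attack class.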
Ex Machina: Personal Attacks Seen at Scale - documentation on the data collection and modeling methodology from Google and Wikimedia
Conversation AI - The Conversation AI Research Github Organization at Google