Merge pull request #582 from deeppavlov/dev
Release v1.11.0
dilyararimovna authored Oct 17, 2023
2 parents 9d1e48d + 6d6d5a7 commit 0015ded
Showing 264 changed files with 9,788 additions and 1,055 deletions.
14 changes: 14 additions & 0 deletions annotators/IntentCatcherTransformers/intent_phrases_commands.json
@@ -1,5 +1,19 @@
{
    "intent_phrases": {
        "test_command": {
            "phrases": [
                "test_command",
                "test command"
            ],
            "reg_phrases": [
                "test_command",
                "test command"
            ],
            "min_precision": 0.94,
            "punctuation": [
                "."
            ]
        },
        "track_object": {
            "phrases": [
                "((track)|(follow)|(trail)|(trace)|(find)|(rail)|(groove)|(monitor)) a ((human)|(man)|(car)|(bicycle)|(girl)|(dude)|(bag)|(chair)|(black dog)|(white cat))",
2 changes: 1 addition & 1 deletion annotators/IntentCatcherTransformers/utils.py
@@ -3,7 +3,7 @@
from itertools import chain
from typing import List

-from common.universal_templates import join_sentences_in_or_pattern
+from common.join_pattern import join_sentences_in_or_pattern


def get_regexp(intent_phrases_path):
2 changes: 1 addition & 1 deletion annotators/asr/requirements.txt
@@ -4,4 +4,4 @@ gunicorn==19.9.0
requests==2.28.2
sentry-sdk==1.19.1
jinja2<=3.1.2
-Werkzeug>=2.2.2
+Werkzeug>=2.2.2,<3.0
2 changes: 1 addition & 1 deletion annotators/combined_classification/server.py
@@ -7,7 +7,7 @@

from sentry_sdk.integrations.flask import FlaskIntegration
from deeppavlov import build_model
-from common.utils import combined_classes
+from common.combined_classes import combined_classes

logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO)

20 changes: 20 additions & 0 deletions annotators/combined_classification_ru/Dockerfile
@@ -0,0 +1,20 @@
FROM deeppavlov/deeppavlov:1.2.0-gpu

WORKDIR /base/DeepPavlov


WORKDIR /src
RUN mkdir common

COPY annotators/combined_classification_ru/requirements.txt ./requirements.txt
RUN pip install -r requirements.txt

ARG SERVICE_PORT
ENV SERVICE_PORT=$SERVICE_PORT
ARG CONFIG
ENV CONFIG=$CONFIG

COPY annotators/combined_classification_ru/ ./
COPY common/ common/

CMD gunicorn --workers=1 server:app -b 0.0.0.0:${SERVICE_PORT} --timeout=1200 --preload
32 changes: 32 additions & 0 deletions annotators/combined_classification_ru/README.md
@@ -0,0 +1,32 @@

# Combined_classification

## Description

This model is based on the transformer-agnostic multitask neural architecture. It can solve several tasks simultaneously, almost as well as single-task models.

The models were trained on the following datasets:

**Factoid classification** : For the Factoid task, we used the same Yahoo ConversVsInfo dataset that was used to train the Dream socialbot in Alexa Prize. Note that the validation set in this task was the same as the test set.

**Midas classification** : For the Midas task, we used the same Midas classification dataset that was used to train the Dream socialbot in Alexa Prize. Note that the validation set in this task was the same as the test set.

**Emotion classification** : For the Emotion classification task, we used the emo\_go\_emotions dataset, with all 28 classes compressed into the seven basic emotions as in the original paper. Note that these 7 emotions are not exactly the same as the 7 emotions in the original Dream socialbot in Alexa Prize: one emotion differs (love vs. disgust), so the scores are not comparable with the original model. Note that this task is multiclass.

**Topic classification**: For the Topic classification task, we used the dataset made by Dilyara Zharikova. The dataset was further filtered and improved for the final model version, to make the model suitable for DREAM. Note that the original topics model doesn’t account for these dataset changes (which also affected the number of classes), and thus its scores are not comparable with ours.

**Sentiment classification** : For the Sentiment classification task, we used the Dynabench dataset (r1 + r2).

**Toxic classification** : For the toxic classification task, we used the dataset from Kaggle <https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data> with the 7 toxic classes that are of interest to us. Note that this task is multilabel.

The model also contains 3 replacement models for Amazon services.

The models (multitask and comparative single-task) were trained with an initial learning rate of 2e-5 (with validation patience 2, it could be dropped up to 2 times), batch size 32, the AdamW optimizer (betas (0.9, 0.99)), and early stopping on 3 epochs. The early-stopping criterion was the average accuracy over all tasks for multitask models, or the single-task accuracy for single-task models.
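
As a rough illustration of the learning-rate drops described above, the sketch below lists the values the rate would take if each drop divides it by 2.0. The divisor is taken from `learning_rate_drop_div` in `combined_classifier_ru.json`; that the same divisor applies to every model mentioned here is an assumption, and `lr_after_drops` is a hypothetical helper, not part of DeepPavlov.

```python
def lr_after_drops(initial_lr, drop_div, max_drops):
    """Return the learning rate after 0, 1, ..., max_drops drops.

    Each drop divides the current rate by drop_div (assumed constant).
    """
    return [initial_lr / drop_div**k for k in range(max_drops + 1)]


# Initial rate 2e-5, divisor 2.0, at most 2 drops (values from the text and config).
schedule = lr_after_drops(2e-5, 2.0, 2)
print(schedule)
```

Under these assumptions the rate never falls below a quarter of its initial value.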

This model (with a distilbert-base-uncased backbone) takes only 2439 Mb for 9 tasks, whereas single-task models with the same backbone each take up almost the same memory (~2437 Mb for each of these 9 tasks).
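
The memory comparison can be worked through with the figures quoted above. This is only arithmetic on those two numbers; the variable names are illustrative.

```python
MULTITASK_MB = 2439      # one multitask model covering all 9 tasks
SINGLE_TASK_MB = 2437    # approximate size of each single-task model
NUM_TASKS = 9

# Total footprint if every task were served by its own single-task model.
separate_total_mb = SINGLE_TASK_MB * NUM_TASKS

# How many times smaller the multitask deployment is.
saving_factor = separate_total_mb / MULTITASK_MB

print(separate_total_mb, round(saving_factor, 2))
```

In other words, serving nine separate models would need roughly nine times the memory of the single multitask model.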

## I/O
text here if i/o specified
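
A minimal sketch of a client for the `/model` endpoint, based on the `server.py` added in this commit: the endpoint reads a `sentences` list from the JSON body and returns one dictionary per sentence, keyed by task name, with class probabilities rounded to two decimals. The helper names (`build_request`, `top_class`, `classify`), the local URL, and the sample response values are assumptions made for illustration; only the port (8198), the `sentences` key, and the `not_toxic`/`toxic` class names come from the commit.

```python
import json
import urllib.request

# Assumed local address; port 8198 is the SERVICE_PORT used in this commit.
URL = "http://0.0.0.0:8198/model"


def build_request(sentences):
    """Build the JSON body that /model reads via request.json.get("sentences")."""
    return {"sentences": list(sentences)}


def top_class(task_probs):
    """Return the highest-probability class for one task of one sentence."""
    return max(task_probs, key=task_probs.get)


def classify(sentences, url=URL):
    """POST the sentences and return the decoded list of per-sentence results."""
    body = json.dumps(build_request(sentences)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Made-up response for a single sentence; the probabilities are illustrative only.
sample = [{"toxic_classification": {"not_toxic": 0.97, "toxic": 0.03}}]
print(top_class(sample[0]["toxic_classification"]))
```

With the service running, `classify(["поменяем тему"])` would send the request; it is not called here so the sketch can be read without a live server.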

## Dependencies

66 changes: 66 additions & 0 deletions annotators/combined_classification_ru/combined_classifier_ru.json
@@ -0,0 +1,66 @@
{
    "metadata":{
        "variables":{
            "MODELS_PATH": "~/.deeppavlov/models",
            "DP_NAME":"distilrubert-base-cased-conversational",
            "BACKBONE":"DeepPavlov/{DP_NAME}",
            "NAME":"rumtl",
            "SAVE_LOAD_PATH":"{MODELS_PATH}/{NAME}",
            "BATCH_SIZE":160,
            "NUM_TRAIN_EPOCHS":30,
            "GRADIENT_ACC_STEPS":1
        },
        "download":[{
            "url": "http://files.deeppavlov.ai/dream_data/russian_mtl/rumtl.pth.tar.gz",
            "subdir": "{MODELS_PATH}"
        }]
    },
    "chainer":{
        "in":[
            "x_emo","x_sentiment","x_toxic","x_factoid","x_midas","x_topics"
        ],
        "in_y":[
            "y_emo","y_sentiment","y_toxic","y_factoid","y_midas","y_topics"
        ],
        "pipe":[
            {
                "class_name":"multitask_pipeline_preprocessor",
                "possible_keys_to_extract":[0],
                "preprocessor":"TorchTransformersPreprocessor",
                "do_lower_case":true,
                "n_task":6,
                "vocab_file":"{BACKBONE}",
                "max_seq_length":128,
                "in":["x_emo","x_sentiment","x_toxic","x_factoid","x_midas","x_topics"],
                "out":["bert_features_emo","bert_features_sentiment","bert_features_toxic","bert_features_factoid","bert_features_midas","bert_features_topics"]
            },
            {
                "id":"multitask_transformer",
                "class_name":"multitask_transformer",
                "optimizer_parameters":{
                    "lr":2e-5
                },
                "gradient_accumulation_steps":"{GRADIENT_ACC_STEPS}",
                "learning_rate_drop_patience":2,
                "learning_rate_drop_div":2.0,
                "return_probas":true,
                "new_model":false,
                "backbone_model":"{BACKBONE}",
                "save_path":"{MODELS_PATH}/{NAME}",
                "load_path":"{MODELS_PATH}/{NAME}",
                "tasks":{
                    "emo":{"type":"classification","options":7},
                    "sentiment":{"type":"classification","options":3},
                    "toxic":{"type":"classification","options":2},
                    "factoid":{"type":"classification","options":2},
                    "midas":{"type":"classification","options":15},
                    "topics":{"type":"classification", "options":76}
                },
                "in":["bert_features_emo","bert_features_sentiment","bert_features_toxic","bert_features_factoid","bert_features_midas","bert_features_topics"],
                "in_y":["y_emo","y_sentiment","y_toxic","y_factoid","y_midas","y_topics"],
                "out":["y_emo_pred","y_sentiment_pred","y_toxic_pred","y_factoid_pred","y_midas_pred","y_topics_pred"]
            }
        ],
        "out":["y_emo_pred","y_sentiment_pred","y_toxic_pred","y_factoid_pred","y_midas_pred","y_topics_pred"]
    }
}
21 changes: 21 additions & 0 deletions annotators/combined_classification_ru/load_test.py
@@ -0,0 +1,21 @@
from locust import HttpUser, task

batch = [
    {"sentences": ["i love you", "i hate you", "i dont care"]},
    {"sentences": ["почему ты так глуп"]},
    {"sentences": ["поговорим о играх"]},
    {"sentences": ["поговорим о фильмах"]},
    {"sentences": ["поменяем тему"]},
]


class QuickstartUser(HttpUser):
    @task
    def hello_world(self):
        ans = self.client.post("", json=batch[self.batch_index % len(batch)])
        self.batch_index += 1
        if ans.status_code != 200:
            print(ans.status_code, ans.text)

    def on_start(self):
        self.batch_index = 0
2 changes: 2 additions & 0 deletions annotators/combined_classification_ru/load_test.sh
@@ -0,0 +1,2 @@
pip install -r requirements_load_test.txt
locust -f load_test.py --headless -u 10 -r 2 --host http://0.0.0.0:$SERVICE_PORT/model
10 changes: 10 additions & 0 deletions annotators/combined_classification_ru/requirements.txt
@@ -0,0 +1,10 @@
gunicorn==19.9.0
sentry-sdk[flask]==0.14.1
itsdangerous==2.0.1
uvicorn==0.13.0
prometheus-client==0.13.0
filelock==3.4.2
transformers==4.15.0
jinja2<=3.0.3
Werkzeug<=2.0.3
pytorch-crf==0.7.2
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
locust==1.4.3
90 changes: 90 additions & 0 deletions annotators/combined_classification_ru/server.py
@@ -0,0 +1,90 @@
import logging
import os
import time

from flask import Flask, request, jsonify
import sentry_sdk

from sentry_sdk.integrations.flask import FlaskIntegration
from deeppavlov import build_model
from common.combined_classes import combined_classes


supported_tasks = [
    "emotion_classification",
    "sentiment_classification",
    "toxic_classification",
    "factoid_classification",
    "midas_classification",
    "topics_ru",
]

combined_classes = {task: combined_classes[task] for task in combined_classes if task in supported_tasks}
combined_classes["toxic_classification"] = ["not_toxic", "toxic"] # As Russian toxic supports only TWO classes

logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO)

sentry_sdk.init(dsn=os.getenv("SENTRY_DSN"), integrations=[FlaskIntegration()])

logger = logging.getLogger(__name__)


def get_result(sentences, sentences_with_history, postannotations=False):
    logger.debug((sentences, sentences_with_history, postannotations))
    ans = [{} for _ in sentences]
    if not sentences:
        logger.exception("Input sentences not received")
        sentences = [" "]
    # if not sentences_with_history:
    #     logger.exception("Input sentences with history not received")
    #     sentences_with_history = sentences
    data = [sentences for _ in range(len(combined_classes))]
    try:
        prob_lists = model(*data)
        for task_name, prob_list in zip(combined_classes, prob_lists):
            for i in range(len(prob_list)):
                ans[i][task_name] = {
                    class_: round(float(prob), 2) for class_, prob in zip(combined_classes[task_name], prob_list[i])
                }
    except Exception as e:
        sentry_sdk.capture_exception(e)
        logger.exception(e)

    return ans


try:
    model = build_model("combined_classifier_ru.json", download=True)
    logger.info("Making test res")
    test_res = get_result(["a"], ["a"])
    logger.info("model loaded, test query processed")
except Exception as e:
    sentry_sdk.capture_exception(e)
    logger.exception(e)
    raise e

app = Flask(__name__)


@app.route("/model", methods=["POST"])
def respond():
    t = time.time()
    sentences = request.json.get("sentences", [" "])
    sentences_with_hist = request.json.get("sentences_with_history", sentences)
    answer = get_result(sentences, sentences_with_hist)
    logger.debug(f"combined_classification result: {answer}")
    logger.info(f"combined_classification exec time: {time.time() - t}")
    return jsonify(answer)


@app.route("/batch_model", methods=["POST"])
def batch_respond():
    t = time.time()
    sep = " [SEP] "
    utterances_with_histories = request.json.get("utterances_with_histories", [[" "]])
    sentences_with_hist = [sep.join(s) for s in utterances_with_histories]
    sentences = [s[-1].split(sep)[-1] for s in utterances_with_histories]
    answer = get_result(sentences, sentences_with_hist)
    logger.debug(f"combined_classification batch result: {answer}")
    logger.info(f"combined_classification exec time: {time.time() - t}")
    return jsonify([{"batch": answer}])
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
SERVICE_PORT: 8198
SERVICE_NAME: combined_classification_ru
CONFIG: combined_classifier_ru.json
CUDA_VISIBLE_DEVICES: '0'
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
name: combined-classification-ru
endpoints:
- model
- batch_model
compose:
  env_file:
  - .env
  build:
    args:
      SERVICE_PORT: 8198
      SERVICE_NAME: combined_classification_ru
      CONFIG: combined_classifier_ru.json
    context: .
    dockerfile: ./annotators/combined_classification_ru/Dockerfile
  command: gunicorn --workers=1 server:app -b 0.0.0.0:8198 --timeout 600
  environment:
  - CUDA_VISIBLE_DEVICES=0
  deploy:
    resources:
      limits:
        memory: 2G
      reservations:
        memory: 2G
  volumes:
  - ./common:/src/common
  - ./annotators/combined_classification_ru:/src
  - ~/.deeppavlov:/root/.deeppavlov
  - ~/.deeppavlov/cache:/root/.cache
  ports:
  - 8198:8198
proxy:
  command:
  - nginx
  - -g
  - daemon off;
  build:
    context: dp/proxy/
    dockerfile: Dockerfile
  environment:
  - PROXY_PASS=dream.deeppavlov.ai:8198
  - PORT=8198