diff --git a/.env b/.env index 62e7cd93de..c747bbe0d8 100644 --- a/.env +++ b/.env @@ -31,3 +31,4 @@ WIKI_FACTS_URL=http://wiki-facts:8116/respond FACT_RANDOM_SERVICE_URL=http://fact-random:8119/respond INFILLING_SERVICE_URL=http://infilling:8122/respond DIALOGPT_SERVICE_URL=http://dialogpt:8091/respond +DIALOGPT_CONTINUE_SERVICE_URL=http://dialogpt:8125/continue diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 47321c924f..771d1db004 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -8,12 +8,12 @@ #### Create a new issue -First, make sure the issue doesn't exist [in the list](https://github.com/deepmipt/dream/issues) yet. If a related issue doesn't exist, you can [open a new one](https://github.com/deepmipt/dream/issues/new). +First, make sure the issue doesn't exist [in the list](https://github.com/deeppavlovteam/dream/issues) yet. If a related issue doesn't exist, you can [open a new one](https://github.com/deeppavlovteam/dream/issues/new). #### Solve an issue -Scan through our [existing issues](https://github.com/deepmipt/dream/issues) to find one that interests you. You can narrow down the search using `labels` as filters. If you find an issue to work on, you are welcome to open a PR with a fix. +Scan through our [existing issues](https://github.com/deeppavlovteam/dream/issues) to find one that interests you. You can narrow down the search using `labels` as filters. If you find an issue to work on, you are welcome to open a PR with a fix. #### Fork and make changes diff --git a/README.md b/README.md index 85b04ae336..bf2d4f7666 100644 --- a/README.md +++ b/README.md @@ -50,18 +50,18 @@ We provide a demo of Dream Socialbot on [our website](https://demo.deeppavlov.ai ### Dream Mini Mini version of DeepPavlov Dream Socialbot. This is a generative-based socialbot that uses [English DialoGPT model](https://huggingface.co/microsoft/DialoGPT-medium) to generate most of the responses. It also contains intent catcher and responder components to cover special user requests. -[Link to the distribution.](https://github.com/deepmipt/dream/tree/main/assistant_dists/dream_mini) +[Link to the distribution.](https://github.com/deeppavlovteam/dream/tree/main/assistant_dists/dream_mini) ### Dream Russian Russian version of DeepPavlov Dream Socialbot. This is a generative-based socialbot that uses [Russian DialoGPT model](https://huggingface.co/Grossmend/rudialogpt3_medium_based_on_gpt2) to generate most of the responses. It also contains intent catcher and responder components to cover special user requests. 
-[Link to the distribution.](https://github.com/deepmipt/dream/tree/main/assistant_dists/dream_russian) +[Link to the distribution.](https://github.com/deeppavlovteam/dream/tree/main/assistant_dists/dream_russian) # Quick Start ### Clone the repo ``` -git clone https://github.com/deepmipt/dream.git +git clone https://github.com/deeppavlovteam/dream.git ``` @@ -184,85 +184,91 @@ Dream Architecture is presented in the following image: ## Annotators -| Name | Requirements | Description | -|-------------------------------|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| ASR | 30 MiB RAM | calculates overall ASR confidence for a given utterance and grades it as either _very low_, _low_, _medium_, or _high_ (for Amazon markup) | -| Badlisted words | 110 MiB RAM | detects words and phrases from the badlist | -| Combined classification | 1.5 GiB RAM, 3.5 GiB GPU | BERT-based model including topic classification, dialog acts classification, sentiment, toxicity, emotion, factoid classification | -| COMeT | 4.5 GiB RAM, 2.2 GiB GPU | Commonsense prediction models COMeT Atomic and ConceptNet | -| Convers Evaluator Annotator | 1.5 GiB RAM, 4.5 GiB GPU | is trained on the Alexa Prize data from the previous competitions and predicts whether the candidate response is interesting, comprehensible, on-topic, engaging, or erroneous | -| Entity detection | 3.1 GiB RAM | extracts entities and their types from utterances | -| Entity linking | 16 GiB RAM, 1.5 GiB GPU | finds Wikidata entity ids for the entities detected with Entity Detection | -| Entity Storer | 220 MiB RAM | a rule-based component, which stores entities from the user's and socialbot's utterances if opinion expression is detected with patterns or MIDAS Classifier and saves them along with the detected attitude to dialogue state | -| Fact random | 50 MiB RAM | returns random facts for the given entity (for entities from user utterance) | -| Fact retrieval | 400 MiB GPU | extracts facts from Wikipedia and wikiHow | -| Intent catcher | 2.7 GiB RAM | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps | -| KBQA | 360 MiB GPU | answers user's factoid questions based on Wikidata KB | -| MIDAS classification | 4.5 GiB GPU | BERT-based model trained on a semantic classes subset of MIDAS dataset | -| NER | 800 MiB RAM | extracts person names, names of locations, organizations from uncased text | -| News API annotator | 70 MiB RAM | extracts the latest news about entities or topics using the GNews API. DeepPavlov Dream deployments utilize our own API key. 
| -| Sentrewrite | 30 MiB RAM | rewrites user's utterances by replacing pronouns with specific names that provide more useful information to downstream components | -| Sentseg | 1 GiB RAM | allows us to handle long and complex user's utterances by splitting them into sentences and recovering punctuation | -| Spacy nounphrases | 200 MiB RAM | extracts nounphrases using Spacy and filters out generic ones | -| Speech Function Classifier | | a hierarchical algorithm based on several linear models and a rule-based approach for the prediction of speech functions described by Eggins and Slade | -| Speech Function Predictor | | yields probabilities of speech functions that can follow a speech function predicted by Speech Function Classifier | -| Spelling preprocessing | 30 MiB RAM | pattern-based component to rewrite different colloquial expressions to a more formal style of conversation | -| Topic recommendation | 40 MiB RAM | offers a topic for further conversation using the information about the discussed topics and user's preferences. Current version is based on Reddit personalities (see Dream Report for Alexa Prize 4). | -| User Persona Extractor | 40 MiB RAM | determines which age category the user belongs to based on some key words | -| Wiki parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Linking | +| Name | Requirements | Description | +|-----------------------------|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| ASR | 40 MiB RAM | calculates overall ASR confidence for a given utterance and grades it as either _very low_, _low_, _medium_, or _high_ (for Amazon markup) | +| Badlisted words | 150 MiB RAM | detects words and phrases from the badlist | +| Combined classification | 1.5 GiB RAM, 3.5 GiB GPU | BERT-based model including topic classification, dialog acts classification, sentiment, toxicity, emotion, factoid classification | +| COMeT Atomic | 2 GiB RAM, 1.1 GiB GPU | Commonsense prediction models COMeT Atomic | +| COMeT ConceptNet | 2 GiB RAM, 1.1 GiB GPU | Commonsense prediction models COMeT ConceptNet | +| Convers Evaluator Annotator | 1 GiB RAM, 4.5 GiB GPU | is trained on the Alexa Prize data from the previous competitions and predicts whether the candidate response is interesting, comprehensible, on-topic, engaging, or erroneous | +| Entity Detection | 1.5 GiB RAM, 3.2 GiB GPU | extracts entities and their types from utterances | +| Entity Linking | 640 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection | +| Entity Storer | 220 MiB RAM | a rule-based component, which stores entities from the user's and socialbot's utterances if opinion expression is detected with patterns or MIDAS Classifier and saves them along with the detected attitude to dialogue state | +| Fact Random | 50 MiB RAM | returns random facts for the given entity (for entities from user utterance) | +| Fact Retrieval | 7.4 GiB RAM, 1.2 GiB GPU | extracts facts from Wikipedia and wikiHow | +| Intent Catcher | 1.7 GiB RAM, 2.4 GiB GPU | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps | +| KBQA | 2 GiB RAM, 1.4 GiB GPU | answers user's factoid questions based on Wikidata KB | +| MIDAS Classification | 1.1 GiB RAM, 4.5 GiB GPU | BERT-based model trained on a semantic classes 
subset of MIDAS dataset | +| MIDAS Predictor | 30 MiB RAM | BERT-based model trained on a semantic classes subset of MIDAS dataset | +| NER | 2.2 GiB RAM, 5 GiB GPU | extracts person names, names of locations, organizations from uncased text | +| News API annotator | 80 MiB RAM | extracts the latest news about entities or topics using the GNews API. DeepPavlov Dream deployments utilize our own API key. | +| Personality Catcher | 30 MiB RAM | | +| Sentrewrite | 200 MiB RAM | rewrites user's utterances by replacing pronouns with specific names that provide more useful information to downstream components | +| Sentseg | 1 GiB RAM | allows us to handle long and complex user's utterances by splitting them into sentences and recovering punctuation | +| Spacy Nounphrases | 180 MiB RAM | extracts nounphrases using Spacy and filters out generic ones | +| Speech Function Classifier | | a hierarchical algorithm based on several linear models and a rule-based approach for the prediction of speech functions described by Eggins and Slade | +| Speech Function Predictor | | yields probabilities of speech functions that can follow a speech function predicted by Speech Function Classifier | +| Spelling Preprocessing | 30 MiB RAM | pattern-based component to rewrite different colloquial expressions to a more formal style of conversation | +| Topic recommendation | 40 MiB RAM | offers a topic for further conversation using the information about the discussed topics and user's preferences. Current version is based on Reddit personalities (see Dream Report for Alexa Prize 4). | +| User Persona Extractor | 40 MiB RAM | determines which age category the user belongs to based on some key words | +| Wiki Parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Linking | +| Wiki Facts | 1.7 GiB RAM | | ## Services -| Name | Requirements | Description | -|---------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| DialoGPT | 1.3 GiB RAM, 1 GiB GPU | generative service based on Transformers generative model, the model is set in docker compose argument `PRETRAINED_MODEL_NAME_OR_PATH` (for example, `microsoft/DialoGPT-small` with 0.2-0.5 sec on GPU) | -| Infilling | 1.7 GiB RAM, 1 GiB GPU | generative service based on Infilling model, for the given utterance returns utterance where `_` from original text is replaced with generated tokens | +| Name | Requirements | Description | +|---------------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| DialoGPT | 1.2 GiB RAM, 2.1 GiB GPU | generative service based on Transformers generative model, the model is set in docker compose argument `PRETRAINED_MODEL_NAME_OR_PATH` (for example, `microsoft/DialoGPT-small` with 0.2-0.5 sec on GPU) | +| Infilling | 1 GiB RAM, 1.2 GiB GPU | generative service based on Infilling model, for the given utterance returns utterance where `_` from original text is replaced with generated tokens | +| Knowledge Grounding | 2 GiB RAM, 2.1 GiB GPU | generative service based on BlenderBot architecture providing a response to the context taking into account an additional text paragraph | +| Masked LM | 1.1 GiB RAM, 1 GiB GPU | | 
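The services above are plain HTTP endpoints whose addresses are configured in `.env` (for example, `DIALOGPT_SERVICE_URL=http://dialogpt:8091/respond` and the newly added `DIALOGPT_CONTINUE_SERVICE_URL=http://dialogpt:8125/continue`). A minimal sketch of querying such a service from inside the compose network is shown below; the payload key `dialog_contexts` and the shape of the returned JSON are illustrative assumptions, not the service's documented request schema — see the service's `server.py` for the real contract.

```python
# Minimal sketch of calling a Dream generative service over HTTP.
# Assumption: the request key "dialog_contexts" and the response layout are
# placeholders for illustration; the actual schema lives in the service's server.py.
import os

import requests

DIALOGPT_URL = os.getenv("DIALOGPT_SERVICE_URL", "http://dialogpt:8091/respond")


def generate_reply(dialog_context):
    """Send the recent dialog turns to the generative service and return its JSON reply."""
    payload = {"dialog_contexts": [dialog_context]}  # hypothetical key name
    response = requests.post(DIALOGPT_URL, json=payload, timeout=2)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(generate_reply(["Hi!", "Hello, how are you?", "Good, tell me about movies."]))
```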
## Skills -| Name | Requirements | Description | -|-------------------------------|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Christmas Skill | | supports FAQ, facts, and scripts for Christmas | -| Comet Dialog skill | | uses COMeT ConceptNet model to express an opinion, to ask a question or give a comment about user's actions mentioned in the dialogue | -| Convert Reddit | 900 MiB RAM | uses a ConveRT encoder to build efficient representations for sentences | -| Dummy Skill | a part of agent container | a fallback skill with multiple non-toxic candidate responses | -| Dummy Skill Dialog | 600 MiB RAM | returns the next turn from the Topical Chat dataset if the response of the user to the Dummy Skill is similar to the corresponding response in the source data | -| Eliza | 30 MiB RAM | Chatbot (https://github.com/wadetb/eliza) | -| Emotion skill | 30 MiB RAM | returns template responses to emotions detected by Emotion Classification from Combined Classification annotator | -| Factoid QA | 200 MiB RAM | answers factoid questions | -| Game Cooperative skill | 120 MiB RAM | provides user with a conversation about computer games: the charts of the best games for the past year, past month, and last week | -| Intent Responder | 40 MiB RAM | provides template-based replies for some of the intents detected by Intent Catcher annotator | -| Knowledge Grounding skill | 60 MiB RAM, 1.5 GiB GPU | generates a response based on the dialogue history and provided knowledge related to the current conversation topic | -| Meta Script skill | 150 MiB RAM | provides a multi-turn dialogue around human activities. The skill uses COMeT Atomic model to generate commonsensical descriptions and questions on several aspects | -| Misheard ASR | 40 MiB RAM | uses the ASR Processor annotations to give feedback to the user when ASR confidence is too low | -| News API skill | 60 MiB RAM | presents the top-rated latest news about entities or topics using the GNews API | -| Oscar Skill | | supports FAQ, facts, and scripts for Oscar | -| Personal Info skill | 40 MiB RAM | queries and stores user's name, birthplace, and location | -| Personality Catcher | 30 MiB RAM | | -| DFF Program Y skill | 800 MiB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot | -| DFF Program Y Dangerous skill | 150 MiB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot, containing responses to dangerous situations in a dialog | -| DFF Program Y Wide skill | 130 MiB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot, which includes only very general templates (with lower confidence) | -| Small Talk skill | 35 MiB RAM | asks questions using the hand-written scripts for 25 topics, including but not limited to love, sports, work, pets, etc. | -| SuperBowl Skill | | supports FAQ, facts, and scripts for SuperBowl | -| Valentine's Day Skill | | supports FAQ, facts, and scripts for Valentine's Day | -| Wikidata Dial Skill | | generates an utterance using Wikidata triplets. 
Not turned on, needs improvement | -| DFF Animals skill | 250 MiB RAM | is created using DFF and has three branches of conversation about animals: user's pets, pets of the socialbot, and wild animals | -| DFF Art skill | 200 MiB RAM | DFF-based skill to discuss art | -| DFF Book skill | 450 MiB RAM | **[New DFF version]** detects book titles and authors mentioned in the user's utterance with the help of Wiki parser and Entity linking and recommends books by leveraging information from the GoodReads database | -| DFF Bot Persona skill | 170 MiB RAM | aims to discuss user favorites and 20 most popular things with short stories expressing the socialbot's opinion towards them | -| DFF Coronavirus skill | 150 MiB RAM | **[New DFF version]** retrieves data about the number of coronavirus cases and deaths in different locations sourced from the John Hopkins University Center for System Science and Engineering | -| DFF Food skill | 170 MiB RAM | constructed with DFF to encourage food-related conversation | -| DFF Friendship skill | 100 MiB RAM | **[New DFF version]** DFF-based skill to greet the user in the beginning of the dialog, and forward the user to some scripted skill | -| DFF Funfact skill | 100 MiB RAM | **[New DFF version]** Tells user fun facts | -| DFF Gaming skill | 120 MiB RAM | provides a video games discussion. Gaming Skill is for more general talk about video games | -| DFF Gossip skill | 95 MiB RAM | DFF-based skill to discuss other people with news about them | -| DFF Grounding skill | 90 MiB RAM | **[New DFF version]** DFF-based skill to answer what is the topic of the conversation, to generate acknowledgement, to generate universal responses on some dialog acts by MIDAS | -| DFF Movie skill | 1.1 GiB RAM | is implemented using DFF and takes care of the conversations related to movies | -| DFF Music skill | 100 MiB RAM | DFF-based skill to discuss music | -| DFF Science skill | 90 MiB RAM | DFF-based skill to discuss science | -| DFF Short Story skill | 90 MiB RAM | **[New DFF version]** tells user short stories from 3 categories: (1) bedtime stories, such as fables and moral stories, (2) horror stories, and (3) funny ones | -| DFF Sports Skill | 100 MiB RAM | DFF-based skill to discuss sports | -| DFF Travel skill | 90 MiB RAM | DFF-based skill to discuss travel | -| DFF Weather skill | 1.4 GiB RAM | **[New DFF version]** uses the OpenWeatherMap service to get the forecast for the user's location | -| DFF Wiki skill | 160 MiB RAM | used for making scenarios with the extraction of entities, slot filling, facts insertion, and acknowledgements | +| Name | Requirements | Description | +|-------------------------------|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Christmas Skill | | supports FAQ, facts, and scripts for Christmas | +| Comet Dialog skill | | uses COMeT ConceptNet model to express an opinion, to ask a question or give a comment about user's actions mentioned in the dialogue | +| Convert Reddit | 1.2 GiB RAM | uses a ConveRT encoder to build efficient representations for sentences | +| Dummy Skill | a part of agent container | a fallback skill with multiple non-toxic candidate responses | +| Dummy Skill Dialog | 600 MiB RAM | returns the next turn from the Topical Chat dataset if the response of the user to the Dummy Skill is similar to the corresponding response in the 
source data | +| Eliza | 30 MiB RAM | Chatbot (https://github.com/wadetb/eliza) | +| Emotion skill | 40 MiB RAM | returns template responses to emotions detected by Emotion Classification from Combined Classification annotator | +| Factoid QA | 170 MiB RAM | answers factoid questions | +| Game Cooperative skill | 100 MiB RAM | provides user with a conversation about computer games: the charts of the best games for the past year, past month, and last week | +| Knowledge Grounding skill | 100 MiB RAM | generates a response based on the dialogue history and provided knowledge related to the current conversation topic | +| Meta Script skill | 150 MiB RAM | provides a multi-turn dialogue around human activities. The skill uses COMeT Atomic model to generate commonsensical descriptions and questions on several aspects | +| Misheard ASR | 40 MiB RAM | uses the ASR Processor annotations to give feedback to the user when ASR confidence is too low | +| News API skill | 60 MiB RAM | presents the top-rated latest news about entities or topics using the GNews API | +| Oscar Skill | | supports FAQ, facts, and scripts for Oscar | +| Personal Info skill | 40 MiB RAM | queries and stores user's name, birthplace, and location | +| DFF Program Y skill | 800 MiB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot | +| DFF Program Y Dangerous skill | 100 MiB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot, containing responses to dangerous situations in a dialog | +| DFF Program Y Wide skill | 110 MiB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot, which includes only very general templates (with lower confidence) | +| Small Talk skill | 35 MiB RAM | asks questions using the hand-written scripts for 25 topics, including but not limited to love, sports, work, pets, etc. | +| SuperBowl Skill | | supports FAQ, facts, and scripts for SuperBowl | +| Text QA | 1.8 GiB RAM, 2.8 GiB GPU | | +| Valentine's Day Skill | | supports FAQ, facts, and scripts for Valentine's Day | +| Wikidata Dial Skill | | generates an utterance using Wikidata triplets. 
Not turned on, needs improvement | +| DFF Animals skill | 200 MiB RAM | is created using DFF and has three branches of conversation about animals: user's pets, pets of the socialbot, and wild animals | +| DFF Art skill | 100 MiB RAM | DFF-based skill to discuss art | +| DFF Book skill | 400 MiB RAM | **[New DFF version]** detects book titles and authors mentioned in the user's utterance with the help of Wiki parser and Entity linking and recommends books by leveraging information from the GoodReads database | +| DFF Bot Persona skill | 150 MiB RAM | aims to discuss user favorites and 20 most popular things with short stories expressing the socialbot's opinion towards them | +| DFF Coronavirus skill | 110 MiB RAM | **[New DFF version]** retrieves data about the number of coronavirus cases and deaths in different locations sourced from the John Hopkins University Center for System Science and Engineering | +| DFF Food skill | 150 MiB RAM | constructed with DFF to encourage food-related conversation | +| DFF Friendship skill | 100 MiB RAM | **[New DFF version]** DFF-based skill to greet the user in the beginning of the dialog, and forward the user to some scripted skill | +| DFF Funfact skill | 100 MiB RAM | **[New DFF version]** Tells user fun facts | +| DFF Gaming skill | 80 MiB RAM | provides a video games discussion. Gaming Skill is for more general talk about video games | +| DFF Gossip skill | 95 MiB RAM | DFF-based skill to discuss other people with news about them | +| DFF Grounding skill | 90 MiB RAM | **[New DFF version]** DFF-based skill to answer what is the topic of the conversation, to generate acknowledgement, to generate universal responses on some dialog acts by MIDAS | +| DFF Intent Responder | 100 MiB RAM | **[New DFF version]** provides template-based replies for some of the intents detected by Intent Catcher annotator | +| DFF Movie skill | 1.1 GiB RAM | is implemented using DFF and takes care of the conversations related to movies | +| DFF Music skill | 70 MiB RAM | DFF-based skill to discuss music | +| DFF Science skill | 90 MiB RAM | DFF-based skill to discuss science | +| DFF Short Story skill | 90 MiB RAM | **[New DFF version]** tells user short stories from 3 categories: (1) bedtime stories, such as fables and moral stories, (2) horror stories, and (3) funny ones | +| DFF Sport Skill | 70 MiB RAM | DFF-based skill to discuss sports | +| DFF Travel skill | 70 MiB RAM | DFF-based skill to discuss travel | +| DFF Weather skill | 1.4 GiB RAM | **[New DFF version]** uses the OpenWeatherMap service to get the forecast for the user's location | +| DFF Wiki skill | 150 MiB RAM | used for making scenarios with the extraction of entities, slot filling, facts insertion, and acknowledgements | # Components Russian Version diff --git a/README_ru.md b/README_ru.md index 2a44c3df70..6699373248 100644 --- a/README_ru.md +++ b/README_ru.md @@ -55,20 +55,20 @@ Deepy GoBot Base содержит аннотатор исправления оп Мини-версия DeepPavlov Dream Socialbot. Данная версия основана на нейросетевой генерации с использованием [English DialoGPT модели](https://huggingface.co/microsoft/DialoGPT-medium). Дистрибутив также содержит компоненты для детектирования запросов пользователя и выдачи специальных ответов на них. -[Link to the distribution.](https://github.com/deepmipt/dream/tree/main/assistant_dists/dream_mini) +[Link to the distribution.](https://github.com/deeppavlovteam/dream/tree/main/assistant_dists/dream_mini) ### Dream Russian Русскоязычная версия DeepPavlov Dream Socialbot. 
Данная версия основана на нейросетевой генерации с использованием [Russian DialoGPT модели](https://huggingface.co/Grossmend/rudialogpt3_medium_based_on_gpt2). Дистрибутив также содержит компоненты для детектирования запросов пользователя и выдачи специальных ответов на них. -[Link to the distribution.](https://github.com/deepmipt/dream/tree/main/assistant_dists/dream_russian) +[Link to the distribution.](https://github.com/deeppavlovteam/dream/tree/main/assistant_dists/dream_russian) # Quick Start ### Склонируйте репозиторий ``` -git clone https://github.com/deepmipt/dream.git +git clone https://github.com/deeppavlovteam/dream.git ``` @@ -184,27 +184,27 @@ docker-compose -f docker-compose.yml -f assistant_dists/dream/docker-compose.ove ## Аннотаторы (Annotators) -| Name | Requirements | Description | -|------------------------|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Badlisted words | 50 MiB RAM | Аннотатор детекции нецензурных слов основан на лемматизации с помощью pymorphy2 и проверки по словарю нецензурных слов. | -| Entity Detection | 3 GiB RAM | Аннотатор извлечения не именованных сущностей и определения их типа для русского языка нижнего регистра на основе на основе нейросетевой модели ruBERT (PyTorch). | -| Entity Linking | 300 MiB RAM | Аннотатор связывания (нахождения Wikidata id) сущностей, извлеченных с помощью Entity detection, на основе дистиллированной модели ruBERT. | -| Intent Catcher | 1.8 MiB RAM, 4.9 Gib GPU | Аннотатор детектирования специальных намерений пользователя на основе многоязычной модели Universal Sentence Encoding. | -| NER | 1.8 GiB RAM, 4.9 Gib GPU | Аннотатор извлечения именованных сущностей для русского языка нижнего регистра на основе нейросетевой модели Conversational ruBERT (PyTorch). | -| Sentseg | 2.4 GiB RAM, 4.9 Gib GPU | Аннотатор восстановления пунктуации для русского языка нижнего регистра на основе нейросетевой модели ruBERT (PyTorch). Модель обучена на русскоязычных субтитрах. | -| Spacy Annotator | 250 MiB RAM | Аннотатор токенизации и аннотирования токенов на основе библиотеки spacy и входящей в нее модели “ru_core_news_sm”. | -| Spelling Preprocessing | 4.4 GiB RAM | Аннотатор исправления опечаток и грамматических ошибок на основе модели расстояния Левенштейна. Используется предобученная модель из библиотеки DeepPavlov. | -| Toxic Classification | 1.9 GiB RAM, 1.2 Gib GPU | Классификатор токсичности для фильтрации реплик пользователя [от Сколтеха](https://huggingface.co/SkolkovoInstitute/russian_toxicity_classifier) | -| Wiki Parser | 100 MiB RAM | Аннотатор извлечения триплетов из Wikidata для сущностей, извлеченных с помощью Entity detection. | -| DialogRPT | 3.9 GiB RAM, 2 GiB GPU | Сервис оценки вероятности реплики понравиться пользователю (updown) на основе ранжирующей модели DialogRPT, которая дообучена на основе генеративной модели Russian DialoGPT на комментариев с сайта Пикабу. | +| Name | Requirements | Description | +|------------------------|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Badlisted words | 50 MiB RAM | Аннотатор детекции нецензурных слов основан на лемматизации с помощью pymorphy2 и проверки по словарю нецензурных слов. 
| +| Entity Detection | 3 GiB RAM | Аннотатор извлечения не именованных сущностей и определения их типа для русского языка нижнего регистра на основе на основе нейросетевой модели ruBERT (PyTorch). | +| Entity Linking | 300 MiB RAM | Аннотатор связывания (нахождения Wikidata id) сущностей, извлеченных с помощью Entity detection, на основе дистиллированной модели ruBERT. | +| Intent Catcher | 1.8 GiB RAM, 5 Gib GPU | Аннотатор детектирования специальных намерений пользователя на основе многоязычной модели Universal Sentence Encoding. | +| NER | 1.8 GiB RAM, 5 Gib GPU | Аннотатор извлечения именованных сущностей для русского языка нижнего регистра на основе нейросетевой модели Conversational ruBERT (PyTorch). | +| Sentseg | 2.4 GiB RAM, 5 Gib GPU | Аннотатор восстановления пунктуации для русского языка нижнего регистра на основе нейросетевой модели ruBERT (PyTorch). Модель обучена на русскоязычных субтитрах. | +| Spacy Annotator | 250 MiB RAM | Аннотатор токенизации и аннотирования токенов на основе библиотеки spacy и входящей в нее модели “ru_core_news_sm”. | +| Spelling Preprocessing | 4.5 GiB RAM | Аннотатор исправления опечаток и грамматических ошибок на основе модели расстояния Левенштейна. Используется предобученная модель из библиотеки DeepPavlov. | +| Toxic Classification | 1.9 GiB RAM, 1.3 Gib GPU | Классификатор токсичности для фильтрации реплик пользователя [от Сколтеха](https://huggingface.co/SkolkovoInstitute/russian_toxicity_classifier) | +| Wiki Parser | 100 MiB RAM | Аннотатор извлечения триплетов из Wikidata для сущностей, извлеченных с помощью Entity detection. | +| DialogRPT | 3.9 GiB RAM, 2.2 GiB GPU | Сервис оценки вероятности реплики понравиться пользователю (updown) на основе ранжирующей модели DialogRPT, которая дообучена на основе генеративной модели Russian DialoGPT на комментариев с сайта Пикабу. | ## Навыки и Сервисы (Skills & Services) | Name | Requirements | Description | |----------------------|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| DialoGPT | 2.8 GiB RAM, 2 GiB GPU | Сервис генерации реплики по текстовому контексту диалога на основе предобученной модели Russian [DialoGPT](https://huggingface.co/Grossmend/rudialogpt3_medium_based_on_gpt2) | +| DialoGPT | 2.8 GiB RAM, 2.2 GiB GPU | Сервис генерации реплики по текстовому контексту диалога на основе предобученной модели Russian [DialoGPT](https://huggingface.co/Grossmend/rudialogpt3_medium_based_on_gpt2) | | Dummy Skill | a part of agent container | Навык для генерации ответов-заглушек и выдачис лучайных вопросов из базы в каечстве linking-questions. | | Personal Info Skill | 40 MiB RAM | Сценарный навык для извлечения и запоминания основной личной информации о пользователе. | -| DFF Generative Skill | 50 MiB RAM | **[New DFF version]** навык, выдающий 5 гипотез, выданных сервисом DialoGPT | +| DFF Generative Skill | 50 MiB RAM | **[New DFF version]** навык, выдающий 5 гипотез, выданных сервисом DialoGPT | | DFF Intent Responder | 50 MiB RAM | **[New DFF version]** Сценарный навык на основе DFF для ответа на специальные намерения пользователя. | | DFF Program Y Skill | 80 MiB RAM | **[New DFF version]** Сценарный навык на основе DFF для ответа на общие вопросы в виде AIML компоненты. | | DFF Friendship Skill | 70 MiB RAM | **[New DFF version]** Сценарный навык на основе DFF приветственной части диалога с пользователем. 
| diff --git a/annotators/ConversationEvaluator/Dockerfile b/annotators/ConversationEvaluator/Dockerfile index 26bf4b7034..c9abf8251d 100644 --- a/annotators/ConversationEvaluator/Dockerfile +++ b/annotators/ConversationEvaluator/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.0 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.0 ARG CONFIG ARG DATA_URL=http://files.deeppavlov.ai/alexaprize_data/cobot_conveval2.tar.gz @@ -19,5 +20,4 @@ RUN pip install -r requirements.txt COPY annotators/ConversationEvaluator/ ./ COPY common/ common/ -RUN python -m deeppavlov install $CONFIG CMD gunicorn --workers=1 --bind 0.0.0.0:8004 --timeout=300 server:app diff --git a/annotators/ConversationEvaluator/requirements.txt b/annotators/ConversationEvaluator/requirements.txt index 4257f455b5..bd85d05a20 100644 --- a/annotators/ConversationEvaluator/requirements.txt +++ b/annotators/ConversationEvaluator/requirements.txt @@ -8,3 +8,5 @@ cachetools==4.0.0 blinker==1.4 jinja2<=3.0.3 Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/annotators/DeepPavlovEmotionClassification/requirements.txt b/annotators/DeepPavlovEmotionClassification/requirements.txt index a87d105e0c..3c5c6c0d7a 100644 --- a/annotators/DeepPavlovEmotionClassification/requirements.txt +++ b/annotators/DeepPavlovEmotionClassification/requirements.txt @@ -2,3 +2,5 @@ sentry-sdk==0.13.0 gunicorn==19.9.0 jinja2<=3.0.3 Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/annotators/DeepPavlovFactoidClassification/requirements.txt b/annotators/DeepPavlovFactoidClassification/requirements.txt index 251e4f465a..92bebdc567 100644 --- a/annotators/DeepPavlovFactoidClassification/requirements.txt +++ b/annotators/DeepPavlovFactoidClassification/requirements.txt @@ -3,3 +3,5 @@ requests==2.23.0 gunicorn==19.9.0 jinja2<=3.0.3 Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/annotators/DeepPavlovSentimentClassification/requirements.txt b/annotators/DeepPavlovSentimentClassification/requirements.txt index a87d105e0c..3c5c6c0d7a 100644 --- a/annotators/DeepPavlovSentimentClassification/requirements.txt +++ b/annotators/DeepPavlovSentimentClassification/requirements.txt @@ -2,3 +2,5 @@ sentry-sdk==0.13.0 gunicorn==19.9.0 jinja2<=3.0.3 Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/annotators/DeepPavlovToxicClassification/requirements.txt b/annotators/DeepPavlovToxicClassification/requirements.txt index 251e4f465a..92bebdc567 100644 --- a/annotators/DeepPavlovToxicClassification/requirements.txt +++ b/annotators/DeepPavlovToxicClassification/requirements.txt @@ -3,3 +3,5 @@ requests==2.23.0 gunicorn==19.9.0 jinja2<=3.0.3 Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/annotators/IntentCatcherTransformers/Dockerfile b/annotators/IntentCatcherTransformers/Dockerfile index c3da5f939e..becf9df233 100644 --- a/annotators/IntentCatcherTransformers/Dockerfile +++ b/annotators/IntentCatcherTransformers/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.17.2 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.17.2 RUN apt-key del 7fa2af80 && \ rm -f /etc/apt/sources.list.d/cuda*.list && \ @@ -23,7 +24,6 @@ COPY ./common/ ./common/ COPY annotators/IntentCatcherTransformers/ /src WORKDIR /src -RUN python -m 
deeppavlov install ${CONFIG_NAME} RUN python -m deeppavlov download ${CONFIG_NAME} RUN python train_model_if_not_exist.py diff --git a/annotators/IntentCatcherTransformers/requirements.txt b/annotators/IntentCatcherTransformers/requirements.txt index 6f1a463e73..7beda13232 100644 --- a/annotators/IntentCatcherTransformers/requirements.txt +++ b/annotators/IntentCatcherTransformers/requirements.txt @@ -12,4 +12,8 @@ pandas==0.25.3 huggingface-hub==0.0.8 datasets==1.11.0 scikit-learn==0.21.2 -xeger==0.3.5 \ No newline at end of file +xeger==0.3.5 +transformers==4.6.0 +torch==1.6.0 +torchvision==0.7.0 +cryptography==2.8 \ No newline at end of file diff --git a/annotators/NER_deeppavlov/Dockerfile b/annotators/NER_deeppavlov/Dockerfile index 19574f897d..5cf0b958a8 100644 --- a/annotators/NER_deeppavlov/Dockerfile +++ b/annotators/NER_deeppavlov/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:1.0.0rc1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@1.0.0rc1 ARG CONFIG ARG PORT @@ -9,13 +10,12 @@ ENV CONFIG=$CONFIG ENV PORT=$PORT COPY ./annotators/NER_deeppavlov/requirements.txt /src/requirements.txt -RUN pip install -r /src/requirements.txt +RUN pip install --upgrade pip && pip install -r /src/requirements.txt COPY $SRC_DIR /src WORKDIR /src -RUN python -m deeppavlov install $CONFIG RUN python -m deeppavlov download $CONFIG CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8021 diff --git a/annotators/NER_deeppavlov/requirements.txt b/annotators/NER_deeppavlov/requirements.txt index 59ddf832bb..7897128cfe 100644 --- a/annotators/NER_deeppavlov/requirements.txt +++ b/annotators/NER_deeppavlov/requirements.txt @@ -4,4 +4,10 @@ gunicorn==19.9.0 requests==2.22.0 itsdangerous==2.0.1 jinja2<=3.0.3 -Werkzeug<=2.0.3 \ No newline at end of file +Werkzeug<=2.0.3 +transformers==4.6.0 +torch==1.6.0 +torchvision==0.7.0 +cryptography==2.8 +datasets==1.11.0 +huggingface-hub==0.0.8 \ No newline at end of file diff --git a/annotators/combined_classification/Dockerfile b/annotators/combined_classification/Dockerfile index 5c52339ce4..2c19b5319b 100644 --- a/annotators/combined_classification/Dockerfile +++ b/annotators/combined_classification/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 #RUN rm DeepPavlov @@ -9,24 +10,19 @@ RUN git clone https://github.com/dimakarp1996/DeepPavlov.git WORKDIR /base/DeepPavlov RUN git checkout pal-bert+ner - ARG CONFIG ARG PORT ENV CONFIG=$CONFIG ENV PORT=$PORT - -#RUN pip install -r requirements.txt WORKDIR /src RUN mkdir common - COPY annotators/combined_classification/ ./ COPY common/ common/ RUN ls /tmp -#RUN python -m deeppavlov install $CONFIG RUN pip install -r requirements.txt ARG DATA_URL=http://files.deeppavlov.ai/alexaprize_data/pal_bert_7in1/model.pth.tar ADD $DATA_URL /tmp diff --git a/annotators/dialog_breakdown/Dockerfile b/annotators/dialog_breakdown/Dockerfile index 691a2a370a..b5e1ce240b 100644 --- a/annotators/dialog_breakdown/Dockerfile +++ b/annotators/dialog_breakdown/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.0 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.0 ARG CONFIG ARG PORT @@ -18,5 +19,4 @@ COPY common/ common/ RUN sed -i "s|$SED_ARG|g" "$CONFIG" -RUN python -m deeppavlov install $CONFIG CMD gunicorn --workers=1 --bind 0.0.0.0:8082 --timeout=300 server:app diff --git a/annotators/dialog_breakdown/requirements.txt b/annotators/dialog_breakdown/requirements.txt index 
c196f9d997..3386b6e6db 100644 --- a/annotators/dialog_breakdown/requirements.txt +++ b/annotators/dialog_breakdown/requirements.txt @@ -1,7 +1,9 @@ -gunicorn==19.9.0 -sentry-sdk[flask]==0.14.1 -flask==1.1.1 -itsdangerous==2.0.1 -requests==2.22.0 -jinja2<=3.0.3 -Werkzeug<=2.0.3 \ No newline at end of file +gunicorn==19.9.0 +sentry-sdk[flask]==0.14.1 +flask==1.1.1 +itsdangerous==2.0.1 +requests==2.22.0 +jinja2<=3.0.3 +Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 \ No newline at end of file diff --git a/annotators/emotion_classification_deepy/Dockerfile b/annotators/emotion_classification_deepy/Dockerfile index cba9789d6d..dc72337b50 100644 --- a/annotators/emotion_classification_deepy/Dockerfile +++ b/annotators/emotion_classification_deepy/Dockerfile @@ -1,7 +1,7 @@ FROM deeppavlov/base-gpu:0.12.0 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.0 WORKDIR /app COPY . . -RUN python -m deeppavlov install emo_bert.json && \ - python -m deeppavlov download emo_bert.json \ No newline at end of file +RUN python -m deeppavlov download emo_bert.json \ No newline at end of file diff --git a/annotators/emotion_classification_deepy/requirements.txt b/annotators/emotion_classification_deepy/requirements.txt index a87d105e0c..f719fd6312 100644 --- a/annotators/emotion_classification_deepy/requirements.txt +++ b/annotators/emotion_classification_deepy/requirements.txt @@ -2,3 +2,6 @@ sentry-sdk==0.13.0 gunicorn==19.9.0 jinja2<=3.0.3 Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 + diff --git a/annotators/entity_detection/Dockerfile b/annotators/entity_detection/Dockerfile index b1051b84d9..6e84697aa0 100644 --- a/annotators/entity_detection/Dockerfile +++ b/annotators/entity_detection/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 RUN apt-get update && apt-get install git -y diff --git a/annotators/entity_detection/server.py b/annotators/entity_detection/server.py index 961431d3f6..273397e361 100644 --- a/annotators/entity_detection/server.py +++ b/annotators/entity_detection/server.py @@ -103,7 +103,7 @@ def get_result(request, what_to_annotate): utts_nums, ): utt_entities = {} - for entity, tag, finegrained_tag, (start_offset, end_offset) in zip( + for entity, tag, finegrained_tags, (start_offset, end_offset) in zip( entity_substr_list, tags_list, finegrained_tags_list, entity_offsets_list ): entity_init = uttr[start_offset:end_offset] @@ -112,10 +112,18 @@ def get_result(request, what_to_annotate): if entity.lower() not in stopwords and len(entity) > 2 and start_offset >= last_utt_start: entity = EVERYTHING_EXCEPT_LETTERS_DIGITALS_AND_SPACE.sub(" ", entity) entity = DOUBLE_SPACES.sub(" ", entity).strip() - if finegrained_tag[0][0] > 0.5: - tag = finegrained_tag[0][1].lower() + filtered_finegrained_tags = [] + if finegrained_tags[0][0] > 0.5: + tag = finegrained_tags[0][1].lower() + conf, finegrained_tag = finegrained_tags[0] + filtered_finegrained_tags.append((finegrained_tag.lower(), round(conf, 3))) + for finegrained_elem in finegrained_tags[1:]: + conf, finegrained_tag = finegrained_elem + if conf > 0.2: + filtered_finegrained_tags.append((finegrained_tag.lower(), round(conf, 3))) else: tag = "misc" + filtered_finegrained_tags.append(("misc", 1.0)) if not finegrained: tag = replace_finegrained_tags(tag) if "entities" in utt_entities: @@ -124,6 +132,7 @@ def get_result(request, 
what_to_annotate): { "text": entity, "label": tag, + "finegrained_label": filtered_finegrained_tags, "offsets": (start_offset - last_utt_start, end_offset - last_utt_start), } ) @@ -133,6 +142,7 @@ def get_result(request, what_to_annotate): { "text": entity, "label": tag, + "finegrained_label": filtered_finegrained_tags, "offsets": (start_offset - last_utt_start, end_offset - last_utt_start), } ] diff --git a/annotators/entity_detection/src/entity_detection_parser.py b/annotators/entity_detection/src/entity_detection_parser.py index c1392fd987..8140891e9c 100644 --- a/annotators/entity_detection/src/entity_detection_parser.py +++ b/annotators/entity_detection/src/entity_detection_parser.py @@ -12,9 +12,9 @@ # See the License for the specific language governing permissions and # limitations under the License. +from collections import defaultdict from logging import getLogger from typing import List -from collections import defaultdict import numpy as np from nltk.corpus import stopwords @@ -122,7 +122,7 @@ def __call__( def tags_from_probas(self, tokens, probas): tags = [] tag_probas = [] - for token, proba in zip(tokens, probas): + for proba in probas: if proba[0] < self.thres_proba: tag_num = np.argmax(proba[1:]) + 1 else: @@ -157,7 +157,7 @@ def entities_from_tags(self, text, tokens, tags, tag_probas): replace_tokens = [("'s", ""), (" .", ""), ("{", ""), ("}", ""), (" ", " "), ('"', "'"), ("(", ""), (")", "")] cnt = 0 - for n, (tok, tag, probas) in enumerate(zip(tokens, tags, tag_probas)): + for tok, tag, probas in zip(tokens, tags, tag_probas): if tag.split("-")[-1] in self.entity_tags: f_tag = tag.split("-")[-1] if tag.startswith("B-") and any(entity_dict.values()): @@ -165,7 +165,7 @@ def entities_from_tags(self, text, tokens, tags, tag_probas): entity = " ".join(entity) for old, new in replace_tokens: entity = entity.replace(old, new) - if entity: + if entity and entity.lower() not in self.stopwords: entities_dict[c_tag].append(entity) entities_positions_dict[c_tag].append(entity_positions_dict[c_tag]) cur_probas = entity_probas_dict[c_tag] @@ -174,12 +174,13 @@ def entities_from_tags(self, text, tokens, tags, tag_probas): entity_positions_dict[c_tag] = [] entity_probas_dict[c_tag] = [] - entity_dict[f_tag].append(tok) - entity_positions_dict[f_tag].append(cnt) - if f_tag == "MISC": - entity_probas_dict[f_tag].append(self.misc_proba) - else: - entity_probas_dict[f_tag].append(probas[self.tags_ind[tag]]) + if tok not in {"?", "!"}: + entity_dict[f_tag].append(tok) + entity_positions_dict[f_tag].append(cnt) + if f_tag == "MISC": + entity_probas_dict[f_tag].append(self.misc_proba) + else: + entity_probas_dict[f_tag].append(probas[self.tags_ind[tag]]) elif any(entity_dict.values()): for tag, entity in entity_dict.items(): @@ -189,7 +190,7 @@ def entities_from_tags(self, text, tokens, tags, tag_probas): entity = entity.replace(old, new) if entity.replace(" - ", "-").lower() in text.lower(): entity = entity.replace(" - ", "-") - if entity: + if entity and entity.lower() not in self.stopwords: entities_dict[c_tag].append(entity) entities_positions_dict[c_tag].append(entity_positions_dict[c_tag]) cur_probas = entity_probas_dict[c_tag] @@ -200,6 +201,20 @@ def entities_from_tags(self, text, tokens, tags, tag_probas): entity_probas_dict[c_tag] = [] cnt += 1 + if any(entity_dict.values()): + for tag, entity in entity_dict.items(): + c_tag = tag.split("-")[-1] + entity = " ".join(entity) + for old, new in replace_tokens: + entity = entity.replace(old, new) + if entity.replace(" - ", "-").lower() 
in text.lower(): + entity = entity.replace(" - ", "-") + if entity and entity.lower() not in self.stopwords: + entities_dict[c_tag].append(entity) + entities_positions_dict[c_tag].append(entity_positions_dict[c_tag]) + cur_probas = entity_probas_dict[c_tag] + entities_probas_dict[c_tag].append(round(sum(cur_probas) / len(cur_probas), 4)) + entities_list = [entity for tag, entities in entities_dict.items() for entity in entities] entities_positions_list = [ position for tag, positions in entities_positions_dict.items() for position in positions diff --git a/annotators/entity_detection/test_entity_detection.py b/annotators/entity_detection/test_entity_detection.py index 2d17399f81..b5c12d9481 100644 --- a/annotators/entity_detection/test_entity_detection.py +++ b/annotators/entity_detection/test_entity_detection.py @@ -14,12 +14,24 @@ def main(): { "entities": ["capital", "russia"], "labelled_entities": [ - {"text": "capital", "offsets": [12, 19], "label": "misc"}, - {"text": "russia", "offsets": [23, 29], "label": "location"}, + {"text": "capital", "offsets": [12, 19], "label": "misc", "finegrained_label": [["misc", 1.0]]}, + { + "text": "russia", + "offsets": [23, 29], + "label": "location", + "finegrained_label": [["country", 0.953]], + }, + ], + } + ], + [ + { + "entities": ["politics"], + "labelled_entities": [ + {"text": "politics", "offsets": [17, 25], "label": "misc", "finegrained_label": [["misc", 1.0]]} ], } ], - [{"entities": ["politics"], "labelled_entities": [{"text": "politics", "offsets": [17, 25], "label": "misc"}]}], ] count = 0 diff --git a/annotators/entity_detection_rus/Dockerfile b/annotators/entity_detection_rus/Dockerfile index 27cb71c743..1ba0eb74fd 100644 --- a/annotators/entity_detection_rus/Dockerfile +++ b/annotators/entity_detection_rus/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 ARG CONFIG ARG PORT diff --git a/annotators/entity_linking/Dockerfile b/annotators/entity_linking/Dockerfile index c35baa1a8b..27ae58a462 100644 --- a/annotators/entity_linking/Dockerfile +++ b/annotators/entity_linking/Dockerfile @@ -5,13 +5,20 @@ RUN apt-key del 7fa2af80 && \ curl https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb \ -o cuda-keyring_1.0-1_all.deb && \ dpkg -i cuda-keyring_1.0-1_all.deb +RUN apt-get -y update +RUN apt-get install -y build-essential zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget llvm \ + libncurses5-dev libncursesw5-dev xz-utils libffi-dev liblzma-dev RUN apt-get -y update && \ apt-get install -y software-properties-common && \ apt-get update && apt-get install git -y +RUN apt-get install -y sqlite3 + +ARG LANGUAGE=EN +ENV LANGUAGE ${LANGUAGE} + ARG CONFIG -ARG COMMIT=0.13.0 ARG PORT ARG SRC_DIR ARG SED_ARG=" | " @@ -22,12 +29,9 @@ ENV PORT=$PORT COPY ./annotators/entity_linking/requirements.txt /src/requirements.txt RUN pip install -r /src/requirements.txt -RUN pip install git+https://github.com/deepmipt/DeepPavlov.git@${COMMIT} - COPY $SRC_DIR /src WORKDIR /src - RUN python -m deeppavlov install $CONFIG RUN sed -i "s|$SED_ARG|g" "$CONFIG" diff --git a/annotators/entity_linking/README.md b/annotators/entity_linking/README.md index c744c1b4af..63db7a6794 100644 --- a/annotators/entity_linking/README.md +++ b/annotators/entity_linking/README.md @@ -3,7 +3,7 @@ Arguments: "entity_substr" - batch of lists of entity substrings for which we want to find ids in Wikidata, "template" - template of the sentence 
(if the sentence with the entity matches one of the templates), "context" - text with the entity. ```python -requests.post("http://0.0.0.0:8079/model", json = {"entity_substr": [["Forrest Gump"]], "template": [""], "context": ["Who directed Forrest Gump?"]}).json() +requests.post("http://0.0.0.0:8079/model", json = {"entity_substr": [["Forrest Gump"]], "entity_tags": [[[("film", 0.9)]]], "context": ["Who directed Forrest Gump?"]}).json() ``` Output: [[[['Q134773', 'Q3077690', 'Q552213', 'Q5365088', 'Q17006552']], [[0.02, 0.02, 0.02, 0.02, 0.02]]]] diff --git a/annotators/entity_linking/entity_linking_eng.json b/annotators/entity_linking/entity_linking_eng.json new file mode 100644 index 0000000000..c07ab31058 --- /dev/null +++ b/annotators/entity_linking/entity_linking_eng.json @@ -0,0 +1,71 @@ +{ + "chainer": { + "in": ["entity_substr", "entity_tags", "sentences"], + "pipe": [ + { + "class_name": "src.torch_transformers_el_ranker:TorchTransformersEntityRankerInfer", + "id": "entity_descr_ranking", + "pretrained_bert": "{TRANSFORMER}", + "text_encoder_weights_path": "{MODELS_PATH}/entity_descr_nll_ranking/text_encoder.pth.tar", + "descr_encoder_weights_path": "{MODELS_PATH}/entity_descr_nll_ranking/descr_encoder.pth.tar", + "special_token_id": 30522, + "descr_batch_size": 10, + "device": "cpu" + }, + { + "class_name": "src.entity_linking:EntityLinker", + "in": ["entity_substr", "entity_tags", "sentences"], + "out": ["entity_ids", "entity_conf", "entity_pages", "first_pars", "dbpedia_types"], + "load_path": "{DOWNLOADS_PATH}/entity_linking_eng/el_eng_dream", + "add_info_filename": "{DOWNLOADS_PATH}/entity_linking_eng/el_eng_dream/add_info.db", + "tags_filename": "{MODELS_PATH}/finegrained_tags/tag.dict", + "words_dict_filename": "{DOWNLOADS_PATH}/entity_linking_eng/words_dict.pickle", + "ngrams_matrix_filename": "{DOWNLOADS_PATH}/entity_linking_eng/ngrams_matrix.npz", + "entity_ranker": "#entity_descr_ranking", + "rank_in_runtime": true, + "num_entities_for_bert_ranking": 20, + "use_gpu": false, + "include_mention": false, + "num_entities_to_return": 5, + "lemmatize": true, + "use_tags": true, + "use_descriptions": true, + "full_paragraph": true, + "return_confidences": true, + "lang": "en" + } + ], + "out": ["entity_substr", "entity_ids", "entity_conf", "entity_pages", "first_pars", "dbpedia_types"] + }, + "metadata": { + "variables": { + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "TRANSFORMER": "{DOWNLOADS_PATH}/torch_bert_models/bert_small", + "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/entity_linking/entity_descr_nll_ranking.tar.gz", + "subdir": "{MODELS_PATH}/entity_descr_nll_ranking" + }, + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/entity_linking/bert_small.tar.gz", + "subdir": "{TRANSFORMER}" + }, + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/entity_linking/el_eng_dream_files.tar.gz", + "subdir": "{DOWNLOADS_PATH}/entity_linking_eng" + }, + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/entity_linking/el_eng_tags.tar.gz", + "subdir": "{MODELS_PATH}/finegrained_tags" + }, + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/entity_linking/word_spelling.tar.gz", + "subdir": "{DOWNLOADS_PATH}/entity_linking_eng" + } + ] + } +} diff --git a/annotators/entity_linking/kbqa_entity_linking.json b/annotators/entity_linking/kbqa_entity_linking.json deleted file mode 100644 index e6fb111e7f..0000000000 ---
a/annotators/entity_linking/kbqa_entity_linking.json +++ /dev/null @@ -1,72 +0,0 @@ -{ - "chainer": { - "in": ["entity_substr", "template", "context"], - "pipe": [ - { - "class_name": "rel_ranking_infer", - "id": "entity_descr_ranking", - "ranker": {"config_path": "{CONFIGS_PATH}/classifiers/entity_ranking_bert_eng_no_mention_lite.json"}, - "batch_size": 32, - "load_path": "{DOWNLOADS_PATH}/wikidata_eng", - "rel_q2name_filename": "q_to_descr_en.pickle", - "rels_to_leave": 20 - }, - { - "class_name": "kbqa_entity_linking:KBEntityLinker", - "in": ["entity_substr", "template", "context"], - "out": ["entity_ids", "confidences"], - "load_path": "{DOWNLOADS_PATH}/wikidata_eng", - "inverted_index_filename": "inverted_index_eng.pickle", - "entities_list_filename": "entities_list.pickle", - "q2name_filename": "wiki_eng_q_to_name.pickle", - "q2descr_filename": "q_to_descr_en.pickle", - "who_entities_filename": "who_entities.pickle", - "entity_ranker": "#entity_descr_ranking", - "num_entities_for_bert_ranking": 20, - "build_inverted_index": false, - "use_descriptions": true, - "use_prefix_tree": false, - "num_entities_to_return": 5 - }, - { - "class_name": "q_to_page:QToPage", - "in": ["entity_ids"], - "out": ["entity_pages"], - "q_to_page_filename": "{DOWNLOADS_PATH}/wikidata_eng/q_to_page_en.pickle" - } - ], - "out": ["entity_ids", "confidences", "entity_pages"] - }, - "metadata": { - "variables": { - "ROOT_PATH": "~/.deeppavlov", - "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", - "MODELS_PATH": "{ROOT_PATH}/models", - "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" - }, - "requirements": [ - "{DEEPPAVLOV_PATH}/requirements/tf.txt", - "{DEEPPAVLOV_PATH}/requirements/bert_dp.txt", - "{DEEPPAVLOV_PATH}/requirements/rapidfuzz.txt", - "{DEEPPAVLOV_PATH}/requirements/hdt.txt", - "{DEEPPAVLOV_PATH}/requirements/spelling.txt", - "{DEEPPAVLOV_PATH}/requirements/spacy.txt", - "{DEEPPAVLOV_PATH}/requirements/en_core_web_sm.txt", - "{DEEPPAVLOV_PATH}/requirements/pyinflect.txt" - ], - "download": [ - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/kbqa_entity_linking_eng.tar.gz", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/q_to_descr_en.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/q_to_page_en.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - } - ] - } -} diff --git a/annotators/entity_linking/kbqa_entity_linking_lite.json b/annotators/entity_linking/kbqa_entity_linking_lite.json deleted file mode 100644 index f3caa9fe9f..0000000000 --- a/annotators/entity_linking/kbqa_entity_linking_lite.json +++ /dev/null @@ -1,62 +0,0 @@ -{ - "chainer": { - "in": ["entity_substr", "template", "context"], - "pipe": [ - { - "class_name": "kbqa_entity_linking:KBEntityLinker", - "in": ["entity_substr", "template", "context"], - "out": ["entity_ids", "confidences"], - "load_path": "{DOWNLOADS_PATH}/wikidata_eng", - "inverted_index_filename": "inverted_index_eng.pickle", - "entities_list_filename": "entities_list.pickle", - "q2name_filename": "wiki_eng_q_to_name.pickle", - "types_dict_filename": "types_dict.pickle", - "q2descr_filename": "q_to_descr_en.pickle", - "who_entities_filename": "who_entities.pickle", - "build_inverted_index": false, - "use_descriptions": false, - "use_prefix_tree": false, - "num_entities_to_return": 5 - }, - { - "class_name": "first_par_extractor", - "in": ["entity_ids"], - "out": ["first_par"], - "wiki_first_par_filename": "{DOWNLOADS_PATH}/wikidata_eng/q_to_par_en.pickle" 
- } - ], - "out": ["entity_ids", "confidences", "first_par"] - }, - "metadata": { - "variables": { - "ROOT_PATH": "~/.deeppavlov", - "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", - "MODELS_PATH": "{ROOT_PATH}/models", - "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" - }, - "requirements": [ - "{DEEPPAVLOV_PATH}/requirements/tf.txt", - "{DEEPPAVLOV_PATH}/requirements/bert_dp.txt", - "{DEEPPAVLOV_PATH}/requirements/rapidfuzz.txt", - "{DEEPPAVLOV_PATH}/requirements/hdt.txt", - "{DEEPPAVLOV_PATH}/requirements/spelling.txt", - "{DEEPPAVLOV_PATH}/requirements/spacy.txt", - "{DEEPPAVLOV_PATH}/requirements/en_core_web_sm.txt", - "{DEEPPAVLOV_PATH}/requirements/pyinflect.txt" - ], - "download": [ - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/kbqa_entity_linking_eng.tar.gz", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/types_dict.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/q_to_par_en.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - } - ] - } -} diff --git a/annotators/entity_linking/kbqa_entity_linking_page.json b/annotators/entity_linking/kbqa_entity_linking_page.json deleted file mode 100644 index 4f9363de91..0000000000 --- a/annotators/entity_linking/kbqa_entity_linking_page.json +++ /dev/null @@ -1,82 +0,0 @@ -{ - "chainer": { - "in": ["entity_substr", "template", "long_context", "entity_types", "short_context"], - "pipe": [ - { - "class_name": "rel_ranking_infer", - "id": "entity_descr_ranking", - "ranker": {"config_path": "{CONFIGS_PATH}/classifiers/entity_ranking_bert_eng_no_mention_lite.json"}, - "batch_size": 32, - "load_path": "{DOWNLOADS_PATH}/wikidata_eng", - "rel_q2name_filename": "q_to_descr_en.pickle", - "rels_to_leave": 20 - }, - { - "class_name": "kbqa_entity_linking:KBEntityLinker", - "in": ["entity_substr", "template", "long_context", "entity_types", "short_context"], - "out": ["entity_ids", "confidences", "tokens_match_conf"], - "load_path": "{DOWNLOADS_PATH}/wikidata_eng", - "inverted_index_filename": "inverted_index_eng.pickle", - "entities_list_filename": "entities_list.pickle", - "q2name_filename": "wiki_eng_q_to_name.pickle", - "q2descr_filename": "q_to_descr_en.pickle", - "who_entities_filename": "who_entities.pickle", - "entity_ranker": "#entity_descr_ranking", - "num_entities_for_bert_ranking": 20, - "build_inverted_index": false, - "use_descriptions": true, - "use_prefix_tree": false, - "num_entities_to_return": 5 - }, - { - "class_name": "q_to_page:FirstParExtractor", - "in": ["entity_ids"], - "out": ["first_par"], - "wiki_first_par_filename": "{DOWNLOADS_PATH}/wikidata_eng/q_to_par_en.pickle" - }, - { - "class_name": "q_to_page:QToPage", - "in": ["entity_ids"], - "out": ["entity_pages_titles"], - "q_to_page_filename": "{DOWNLOADS_PATH}/wikidata_eng/q_to_page_en.pickle" - } - ], - "out": ["entity_ids", "confidences", "tokens_match_conf", "first_par", "entity_pages_titles"] - }, - "metadata": { - "variables": { - "ROOT_PATH": "~/.deeppavlov", - "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", - "MODELS_PATH": "{ROOT_PATH}/models", - "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" - }, - "requirements": [ - "{DEEPPAVLOV_PATH}/requirements/tf.txt", - "{DEEPPAVLOV_PATH}/requirements/bert_dp.txt", - "{DEEPPAVLOV_PATH}/requirements/rapidfuzz.txt", - "{DEEPPAVLOV_PATH}/requirements/hdt.txt", - "{DEEPPAVLOV_PATH}/requirements/spelling.txt", - "{DEEPPAVLOV_PATH}/requirements/spacy.txt", - "{DEEPPAVLOV_PATH}/requirements/en_core_web_sm.txt", - 
"{DEEPPAVLOV_PATH}/requirements/pyinflect.txt" - ], - "download": [ - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/kbqa_entity_linking_eng.tar.gz", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/q_to_descr_en.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/q_to_par_en.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - }, - { - "url": "http://files.deeppavlov.ai/kbqa/wikidata/q_to_page_en.pickle", - "subdir": "{DOWNLOADS_PATH}/wikidata_eng" - } - ] - } -} \ No newline at end of file diff --git a/annotators/entity_linking/q_to_page.py b/annotators/entity_linking/q_to_page.py deleted file mode 100644 index 91393ded5e..0000000000 --- a/annotators/entity_linking/q_to_page.py +++ /dev/null @@ -1,54 +0,0 @@ -from deeppavlov.core.common.registry import register -from deeppavlov.core.models.component import Component -from deeppavlov.core.commands.utils import expand_path -from deeppavlov.core.common.file import load_pickle - - -@register("q_to_page") -class QToPage(Component): - def __init__(self, q_to_page_filename, entities_num=5, **kwargs): - self.q_to_page = load_pickle(str(expand_path(q_to_page_filename))) - self.entities_num = entities_num - - def __call__(self, entities_batch): - pages_batch = [] - for entities_list in entities_batch: - if entities_list: - pages_list = [] - for entities in entities_list: - pages = [] - if entities: - for entity in entities[: self.entities_num]: - page = self.q_to_page.get(entity, "") - if page: - pages.append(page) - pages_list.append(pages) - pages_batch.append(pages_list) - else: - pages_batch.append([]) - - return pages_batch - - -@register("first_par_extractor") -class FirstParExtractor(Component): - def __init__(self, wiki_first_par_filename, entities_num=2, **kwargs): - self.wiki_first_par = load_pickle(str(expand_path(wiki_first_par_filename))) - self.entities_num = entities_num - - def __call__(self, entities_batch): - batch_first_par = [] - for entities_list in entities_batch: - if entities_list: - first_par_list = [] - for entities in entities_list: - first_par = [] - for entity in entities[: self.entities_num]: - if entity in self.wiki_first_par: - first_par.append(self.wiki_first_par[entity]) - first_par_list.append(first_par) - batch_first_par.append(first_par_list) - else: - batch_first_par.append([]) - - return batch_first_par diff --git a/annotators/entity_linking/requirements.txt b/annotators/entity_linking/requirements.txt index 7a90818d10..d00011fe27 100644 --- a/annotators/entity_linking/requirements.txt +++ b/annotators/entity_linking/requirements.txt @@ -1,8 +1,13 @@ -sentry-sdk[flask]==0.14.1 -flask==1.1.1 -itsdangerous==2.0.1 +Flask==1.1.1 +nltk==3.2.5 gunicorn==19.9.0 requests==2.22.0 +sentry-sdk==0.12.3 +rapidfuzz==0.7.6 +torch==1.6.0 +transformers==4.6.0 +deeppavlov==0.17.2 +itsdangerous==2.0.1 jinja2<=3.0.3 Werkzeug<=2.0.3 -inflect==5.3.0 +cryptography==2.8 \ No newline at end of file diff --git a/annotators/entity_linking/server.py b/annotators/entity_linking/server.py index 8b91dc0dda..7dc5b34322 100644 --- a/annotators/entity_linking/server.py +++ b/annotators/entity_linking/server.py @@ -1,15 +1,13 @@ import logging import os -import re import time from flask import Flask, request, jsonify import sentry_sdk -from sentry_sdk.integrations.flask import FlaskIntegration from deeppavlov import build_model logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 
level=logging.INFO) logger = logging.getLogger(__name__) -sentry_sdk.init(dsn=os.getenv("SENTRY_DSN"), integrations=[FlaskIntegration()]) +sentry_sdk.init(os.getenv("SENTRY_DSN")) app = Flask(__name__) @@ -23,140 +21,63 @@ logger.exception(e) raise e -GENRES_TEMPLATE = re.compile( - r"(\brock\b|heavy metal|\bjazz\b|\bblues\b|\bpop\b|\brap\b|hip hop\btechno\b" r"|dubstep|classic)" -) -SPORT_TEMPLATE = re.compile(r"(soccer|football|basketball|baseball|tennis|mma|boxing|volleyball|chess|swimming)") - -genres_dict = { - "rock": "Q11399", - "heavy metal": "Q38848", - "jazz": "Q8341", - "blues": "Q9759", - "pop": "Q37073", - "rap": "Q6010", - "hip hop": "Q6010", - "techno": "Q170611", - "dubstep": "Q20474", - "classic": "Q9730", -} - -sport_dict = { - "soccer": "Q2736", - "football": "Q2736", - "basketball": "Q5372", - "baseball": "Q5369", - "tennis": "Q847", - "mma": "Q114466", - "boxing": "Q32112", - "volleyball": "Q1734", - "chess": "Q718", - "swimming": "Q31920", -} - - -def extract_topic_skill_entities(utt, entity_substr_list, entity_ids_list): - found_substr = "" - found_id = "" - found_genres = re.findall(GENRES_TEMPLATE, utt) - if found_genres: - genre = found_genres[0] - genre_id = genres_dict[genre] - if all([genre not in elem for elem in entity_substr_list]) or all( - [genre_id not in entity_ids for entity_ids in entity_ids_list] - ): - found_substr = genre - found_id = genre_id - found_sport = re.findall(SPORT_TEMPLATE, utt) - if found_sport: - sport = found_sport[0] - sport_id = sport_dict[sport] - if all([sport not in elem for elem in entity_substr_list]) or all( - [sport_id not in entity_ids for entity_ids in entity_ids_list] - ): - found_substr = sport - found_id = sport_id - - return found_substr, found_id - @app.route("/model", methods=["POST"]) def respond(): st_time = time.time() inp = request.json entity_substr_batch = inp.get("entity_substr", [[""]]) - template_batch = inp.get("template", [""]) + entity_tags_batch = inp.get( + "entity_tags", [["" for _ in entity_substr_list] for entity_substr_list in entity_substr_batch] + ) context_batch = inp.get("context", [[""]]) - logger.info(f"entity linking, input {entity_substr_batch}") - long_context_batch = [] - short_context_batch = [] - for entity_substr_list, context_list in zip(entity_substr_batch, context_batch): - last_utt = context_list[-1] - if ( - len(last_utt) > 1 - and any([entity_substr.lower() == last_utt.lower() for entity_substr in entity_substr_list]) - or any([entity_substr.lower() == last_utt[:-1] for entity_substr in entity_substr_list]) - ): - context = " ".join(context_list) - else: - context = last_utt - if isinstance(context, list): - context = " ".join(context) - if isinstance(last_utt, list): - short_context = " ".join(last_utt) + opt_context_batch = [] + for hist_utt in context_batch: + hist_utt = [utt for utt in hist_utt if len(utt) > 1] + last_utt = hist_utt[-1] + if last_utt[-1] not in {".", "!", "?"}: + last_utt = f"{last_utt}." + if len(hist_utt) > 1: + prev_utt = hist_utt[-2] + if prev_utt[-1] not in {".", "!", "?"}: + prev_utt = f"{prev_utt}." 
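            # At this point the dialog context has been reduced for the linker: empty and
            # one-character utterances are dropped, terminal punctuation is enforced, and
            # only the last one or two utterances are kept. For instance, a history ending in
            # ["how are you", "i like football"] is passed on as
            # ["how are you.", "i like football."].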
+ opt_context_batch.append([prev_utt, last_utt]) else: - short_context = last_utt - long_context_batch.append(context) - short_context_batch.append(short_context) + opt_context_batch.append([last_utt]) - entity_types_batch = [[[] for _ in entity_substr_list] for entity_substr_list in entity_substr_batch] entity_info_batch = [[{}] for _ in entity_substr_batch] try: - entity_ids_batch, conf_batch, tokens_match_conf_batch, entity_pages_batch, entity_pages_titles_batch = el( - entity_substr_batch, template_batch, long_context_batch, entity_types_batch, short_context_batch - ) + ( + entity_substr_batch, + entity_ids_batch, + conf_batch, + entity_pages_batch, + first_pars_batch, + dbpedia_types_batch, + ) = el(entity_substr_batch, entity_tags_batch, opt_context_batch) entity_info_batch = [] for ( entity_substr_list, entity_ids_list, conf_list, - tokens_match_conf_list, entity_pages_list, - entity_pages_titles_list, - context, + first_pars_list, + dbpedia_types_list, ) in zip( - entity_substr_batch, - entity_ids_batch, - conf_batch, - tokens_match_conf_batch, - entity_pages_batch, - entity_pages_titles_batch, - short_context_batch, + entity_substr_batch, entity_ids_batch, conf_batch, entity_pages_batch, first_pars_batch, dbpedia_types_batch ): entity_info_list = [] - for entity_substr, entity_ids, conf, tokens_match_conf, entity_pages, entity_pages_titles in zip( - entity_substr_list, - entity_ids_list, - conf_list, - tokens_match_conf_list, - entity_pages_list, - entity_pages_titles_list, + for entity_substr, entity_ids, confs, entity_pages, first_pars, dbpedia_types in zip( + entity_substr_list, entity_ids_list, conf_list, entity_pages_list, first_pars_list, dbpedia_types_list ): entity_info = {} entity_info["entity_substr"] = entity_substr entity_info["entity_ids"] = entity_ids - entity_info["confidences"] = [float(elem) for elem in conf] - entity_info["tokens_match_conf"] = [float(elem) for elem in tokens_match_conf] - entity_info["entity_pages"] = entity_pages - entity_info["entity_pages_titles"] = entity_pages_titles - entity_info_list.append(entity_info) - topic_substr, topic_id = extract_topic_skill_entities(context, entity_substr_list, entity_ids_list) - if topic_substr: - entity_info = {} - entity_info["entity_substr"] = topic_substr - entity_info["entity_ids"] = [topic_id] - entity_info["confidences"] = [float(1.0)] - entity_info["tokens_match_conf"] = [float(1.0)] + entity_info["confidences"] = [float(elem[2]) for elem in confs] + entity_info["tokens_match_conf"] = [float(elem[0]) for elem in confs] + entity_info["pages_titles"] = entity_pages + entity_info["first_paragraphs"] = first_pars + entity_info["dbpedia_types"] = dbpedia_types entity_info_list.append(entity_info) entity_info_batch.append(entity_info_list) except Exception as e: diff --git a/annotators/entity_linking/src/entity_linking.py b/annotators/entity_linking/src/entity_linking.py new file mode 100644 index 0000000000..5910023c7b --- /dev/null +++ b/annotators/entity_linking/src/entity_linking.py @@ -0,0 +1,542 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import re +import sqlite3 +from logging import getLogger +from typing import List, Dict, Tuple +from collections import defaultdict + +import nltk +from nltk.corpus import stopwords +from rapidfuzz import fuzz + +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component +from deeppavlov.core.models.serializable import Serializable +from deeppavlov.core.commands.utils import expand_path +from src.find_word import WordSearcher + +log = getLogger(__name__) +nltk.download("stopwords") + + +@register("entity_linker") +class EntityLinker(Component, Serializable): + """ + Class for linking of entity substrings in the document to entities in Wikidata + """ + + def __init__( + self, + load_path: str, + tags_filename: str, + add_info_filename: str, + words_dict_filename: str = None, + ngrams_matrix_filename: str = None, + entity_ranker=None, + num_entities_for_bert_ranking: int = 50, + num_entities_to_return: int = 10, + max_text_len: int = 300, + max_paragraph_len: int = 150, + lang: str = "ru", + use_descriptions: bool = True, + use_tags: bool = False, + lemmatize: bool = False, + full_paragraph: bool = False, + use_connections: bool = False, + **kwargs, + ) -> None: + """ + + Args: + load_path: path to folder with inverted index files + entity_ranker: component deeppavlov.models.kbqa.rel_ranking_bert + num_entities_for_bert_ranking: number of candidate entities for BERT ranking using description and context + ngram_range: char ngrams range for TfidfVectorizer + num_entities_to_return: number of candidate entities for the substring which are returned + lang: russian or english + use_description: whether to perform entity ranking by context and description + lemmatize: whether to lemmatize tokens + **kwargs: + """ + super().__init__(save_path=None, load_path=load_path) + self.lemmatize = lemmatize + self.tags_filename = tags_filename + self.add_info_filename = add_info_filename + self.words_dict_filename = words_dict_filename + self.ngrams_matrix_filename = ngrams_matrix_filename + self.num_entities_for_bert_ranking = num_entities_for_bert_ranking + self.entity_ranker = entity_ranker + self.num_entities_to_return = num_entities_to_return + self.max_text_len = max_text_len + self.max_paragraph_len = max_paragraph_len + self.lang = f"@{lang}" + if self.lang == "@en": + self.stopwords = set(stopwords.words("english")) + elif self.lang == "@ru": + self.stopwords = set(stopwords.words("russian")) + self.use_descriptions = use_descriptions + self.use_connections = use_connections + self.use_tags = use_tags + self.full_paragraph = full_paragraph + self.re_tokenizer = re.compile(r"[\w']+|[^\w ]") + self.not_found_str = "not in wiki" + self.related_tags = { + "loc": ["gpe", "country", "city", "us_state", "river"], + "gpe": ["loc", "country", "city", "us_state"], + "work_of_art": ["product", "law"], + "product": ["work_of_art"], + "law": ["work_of_art"], + "org": ["fac", "business"], + "business": ["org"], + "actor": ["per"], + "athlete": ["per"], + "musician": ["per"], + "politician": ["per"], + "writer": ["per"], + } + self.word_searcher = None + if self.words_dict_filename: + self.word_searcher = WordSearcher(self.words_dict_filename, self.ngrams_matrix_filename) + self.load() + + def load(self) -> None: + with open(str(expand_path(self.tags_filename)), "r") as fl: + lines = fl.readlines() + tags = [] + for line in lines: + tag_str = 
line.strip().split()[:-1] + tags.append("_".join(tag_str)) + if "O" in tags: + tags.remove("O") + self.cursors = {} + for tag in tags: + conn = sqlite3.connect(f"{self.load_path}/{tag.lower()}.db", check_same_thread=False) + cur = conn.cursor() + self.cursors[tag.lower()] = cur + conn = sqlite3.connect(str(expand_path(self.add_info_filename)), check_same_thread=False) + self.add_info_cur = conn.cursor() + + def save(self) -> None: + pass + + def __call__( + self, + entity_substr_batch: List[List[str]], + entity_tags_batch: List[List[str]] = None, + sentences_batch: List[List[str]] = None, + entity_offsets_batch: List[List[List[int]]] = None, + sentences_offsets_batch: List[List[Tuple[int, int]]] = None, + ): + if sentences_offsets_batch is None and sentences_batch is not None: + sentences_offsets_batch = [] + for sentences_list in sentences_batch: + sentences_offsets_list = [] + start = 0 + for sentence in sentences_list: + end = start + len(sentence) + sentences_offsets_list.append([start, end]) + start = end + 1 + sentences_offsets_batch.append(sentences_offsets_list) + + if sentences_batch is None: + sentences_batch = [[] for _ in entity_substr_batch] + sentences_offsets_batch = [[] for _ in entity_substr_batch] + + log.info(f"sentences_batch {sentences_batch}") + if entity_offsets_batch is None and sentences_batch is not None: + entity_offsets_batch = [] + for entity_substr_list, sentences_list in zip(entity_substr_batch, sentences_batch): + text = " ".join(sentences_list).lower() + log.info(f"text {text}") + entity_offsets_list = [] + for entity_substr in entity_substr_list: + st_offset = text.find(entity_substr.lower()) + end_offset = st_offset + len(entity_substr) + entity_offsets_list.append([st_offset, end_offset]) + entity_offsets_batch.append(entity_offsets_list) + + entity_ids_batch, entity_conf_batch, entity_pages_batch = [], [], [] + for entity_substr_list, entity_offsets_list, entity_tags_list, sentences_list, sentences_offsets_list in zip( + entity_substr_batch, entity_offsets_batch, entity_tags_batch, sentences_batch, sentences_offsets_batch + ): + entity_ids_list, entity_conf_list, entity_pages_list = self.link_entities( + entity_substr_list, + entity_offsets_list, + entity_tags_list, + sentences_list, + sentences_offsets_list, + ) + log.info(f"entity_ids_list {entity_ids_list} entity_conf_list {entity_conf_list}") + if self.num_entities_to_return == 1: + entity_pages_list = [entity_pages[0] for entity_pages in entity_pages_list] + else: + entity_pages_list = [entity_pages[: self.num_entities_to_return] for entity_pages in entity_pages_list] + entity_ids_batch.append(entity_ids_list) + entity_conf_batch.append(entity_conf_list) + entity_pages_batch.append(entity_pages_list) + first_par_batch, dbpedia_types_batch = self.extract_add_info(entity_pages_batch) + return entity_ids_batch, entity_conf_batch, entity_pages_batch, first_par_batch, dbpedia_types_batch + + def extract_add_info(self, entity_pages_batch: List[List[List[str]]]): + first_par_batch, dbpedia_types_batch = [], [] + for entity_pages_list in entity_pages_batch: + first_par_list, dbpedia_types_list = [], [] + for entity_pages in entity_pages_list: + first_pars, dbpedia_types = [], [] + for entity_page in entity_pages: + try: + query = "SELECT * FROM entity_additional_info WHERE page_title='{}';".format(entity_page) + res = self.add_info_cur.execute(query) + fetch_res = res.fetchall() + first_par = fetch_res[0][1] + dbpedia_types_elem = fetch_res[0][2].split() + first_pars.append(first_par) + 
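                        # The query above fetches the matching row from entity_additional_info:
                        # the code reads the first paragraph from the second column and a
                        # space-separated DBpedia type string from the third; any lookup failure
                        # falls back to an empty paragraph and an empty type list in the
                        # except branch below.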
dbpedia_types.append(dbpedia_types_elem) + except Exception as e: + first_pars.append("") + dbpedia_types.append([]) + log.info(f"error {e}") + first_par_list.append(first_pars) + dbpedia_types_list.append(dbpedia_types) + first_par_batch.append(first_par_list) + dbpedia_types_batch.append(dbpedia_types_list) + return first_par_batch, dbpedia_types_batch + + def link_entities( + self, + entity_substr_list: List[str], + entity_offsets_list: List[List[int]], + entity_tags_list: List[str], + sentences_list: List[str], + sentences_offsets_list: List[List[int]], + ) -> List[List[str]]: + log.info( + f"entity_substr_list {entity_substr_list} entity_tags_list {entity_tags_list} " + f"entity_offsets_list {entity_offsets_list}" + ) + entity_ids_list, conf_list, pages_list, pages_dict_list, descr_list = [], [], [], [], [] + if entity_substr_list: + entities_scores_list = [] + cand_ent_scores_list = [] + for entity_substr, tags in zip(entity_substr_list, entity_tags_list): + for symb_old, symb_new in [("'", "''"), ("-", " "), ("@", ""), (".", ""), (" ", " ")]: + entity_substr = entity_substr.replace(symb_old, symb_new) + cand_ent_init = defaultdict(set) + if len(entity_substr) > 1: + cand_ent_init = self.find_exact_match(entity_substr, tags) + all_low_conf = True + for entity_id in cand_ent_init: + entity_info_set = cand_ent_init[entity_id] + for entity_info in entity_info_set: + if entity_info[0] == 1.0: + all_low_conf = False + break + if not all_low_conf: + break + clean_tags = [tag for tag, conf in tags] + corr_tags, corr_clean_tags = [], [] + for tag, conf in tags: + if tag in self.related_tags: + corr_tag_list = self.related_tags[tag] + for corr_tag in corr_tag_list: + if corr_tag not in clean_tags and corr_tag not in corr_clean_tags: + corr_tags.append([corr_tag, conf]) + corr_clean_tags.append(corr_tag) + + if (not cand_ent_init or all_low_conf) and corr_tags: + corr_cand_ent_init = self.find_exact_match(entity_substr, corr_tags) + cand_ent_init = {**cand_ent_init, **corr_cand_ent_init} + entity_substr_split = [ + word for word in entity_substr.split(" ") if word not in self.stopwords and len(word) > 0 + ] + if ( + not cand_ent_init + and len(entity_substr_split) == 1 + and self.word_searcher + and all([letter.isalpha() for letter in entity_substr_split[0]]) + ): + corr_words = self.word_searcher(entity_substr_split[0], set(clean_tags + corr_clean_tags)) + if corr_words: + cand_ent_init = self.find_exact_match(corr_words[0], tags + corr_tags) + if not cand_ent_init and len(entity_substr_split) > 1: + cand_ent_init = self.find_fuzzy_match(entity_substr_split, tags) + + cand_ent_scores = [] + for entity in cand_ent_init: + entities_scores = list(cand_ent_init[entity]) + entities_scores = sorted(entities_scores, key=lambda x: (x[0], x[3], x[2]), reverse=True) + cand_ent_scores.append(([entity] + list(entities_scores[0]))) + + cand_ent_scores = sorted(cand_ent_scores, key=lambda x: (x[1], x[4], x[3]), reverse=True) + cand_ent_scores = cand_ent_scores[: self.num_entities_for_bert_ranking] + cand_ent_scores_list.append(cand_ent_scores) + entity_ids = [elem[0] for elem in cand_ent_scores] + pages = [elem[5] for elem in cand_ent_scores] + scores = [elem[1:5] for elem in cand_ent_scores] + entities_scores_list.append( + {entity_id: entity_scores for entity_id, entity_scores in zip(entity_ids, scores)} + ) + entity_ids_list.append(entity_ids) + pages_list.append(pages) + pages_dict_list.append({entity_id: page for entity_id, page in zip(entity_ids, pages)}) + descr_list.append([elem[6] for elem in 
cand_ent_scores]) + + if self.use_descriptions: + substr_lens = [len(entity_substr.split()) for entity_substr in entity_substr_list] + entity_ids_list, conf_list = self.rank_by_description( + entity_substr_list, + entity_tags_list, + entity_offsets_list, + entity_ids_list, + descr_list, + entities_scores_list, + sentences_list, + sentences_offsets_list, + substr_lens, + ) + pages_list = [ + [pages_dict.get(entity_id, "") for entity_id in entity_ids] + for entity_ids, pages_dict in zip(entity_ids_list, pages_dict_list) + ] + + return entity_ids_list, conf_list, pages_list + + def process_cand_ent(self, cand_ent_init, entities_and_ids, entity_substr_split, tag, tag_conf): + for entity_title, entity_id, entity_rels, anchor_cnt, _, page, descr in entities_and_ids: + substr_score = self.calc_substr_score(entity_title, entity_substr_split) + cand_ent_init[entity_id].add((substr_score, anchor_cnt, entity_rels, tag_conf, page, descr)) + return cand_ent_init + + def find_exact_match(self, entity_substr, tags): + entity_substr = entity_substr.lower() + entity_substr_split = entity_substr.split() + cand_ent_init = defaultdict(set) + for tag, tag_conf in tags: + if tag.lower() in self.cursors: + query = "SELECT * FROM inverted_index WHERE title MATCH '{}';".format(entity_substr) + res = self.cursors[tag.lower()].execute(query) + entities_and_ids = res.fetchall() + if entities_and_ids: + cand_ent_init = self.process_cand_ent( + cand_ent_init, entities_and_ids, entity_substr_split, tag, tag_conf + ) + if tags and tags[0][0] == "misc" and not cand_ent_init: + for tag in self.cursors: + query = "SELECT * FROM inverted_index WHERE title MATCH '{}';".format(entity_substr) + res = self.cursors[tag].execute(query) + entities_and_ids = res.fetchall() + if entities_and_ids: + cand_ent_init = self.process_cand_ent( + cand_ent_init, entities_and_ids, entity_substr_split, tag, tag_conf + ) + return cand_ent_init + + def find_fuzzy_match(self, entity_substr_split, tags): + entity_substr_split = [word.lower() for word in entity_substr_split] + cand_ent_init = defaultdict(set) + for tag, tag_conf in tags: + if tag.lower() in self.cursors: + for word in entity_substr_split: + query = "SELECT * FROM inverted_index WHERE title MATCH '{}';".format(word) + res = self.cursors[tag.lower()].execute(query) + part_entities_and_ids = res.fetchall() + cand_ent_init = self.process_cand_ent( + cand_ent_init, part_entities_and_ids, entity_substr_split, tag, tag_conf + ) + return cand_ent_init + + def calc_substr_score(self, entity_title, entity_substr_split): + label_tokens = entity_title.split() + cnt = 0.0 + for ent_tok in entity_substr_split: + found = False + for label_tok in label_tokens: + if label_tok == ent_tok: + found = True + break + if found: + cnt += 1.0 + else: + for label_tok in label_tokens: + if label_tok[:2] == ent_tok[:2]: + fuzz_score = fuzz.ratio(label_tok, ent_tok) + if fuzz_score >= 80.0 and not found: + cnt += fuzz_score * 0.01 + break + substr_score = round(cnt / max(len(label_tokens), len(entity_substr_split)), 3) + if len(label_tokens) == 2 and len(entity_substr_split) == 1: + if entity_substr_split[0] == label_tokens[1]: + substr_score = 0.5 + elif entity_substr_split[0] == label_tokens[0]: + substr_score = 0.3 + return substr_score + + def rank_by_description( + self, + entity_substr_list: List[str], + entity_tags_list: List[List[Tuple[str, int]]], + entity_offsets_list: List[List[int]], + cand_ent_list: List[List[str]], + cand_ent_descr_list: List[List[str]], + entities_scores_list: List[Dict[str, 
Tuple[int, float]]], + sentences_list: List[str], + sentences_offsets_list: List[Tuple[int, int]], + substr_lens: List[int], + ) -> List[List[str]]: + entity_ids_list = [] + conf_list = [] + contexts = [] + for entity_start_offset, entity_end_offset in entity_offsets_list: + sentence = "" + rel_start_offset = 0 + rel_end_offset = 0 + found_sentence_num = 0 + for num, (sent, (sent_start_offset, sent_end_offset)) in enumerate( + zip(sentences_list, sentences_offsets_list) + ): + if entity_start_offset >= sent_start_offset and entity_end_offset <= sent_end_offset: + sentence = sent + found_sentence_num = num + rel_start_offset = entity_start_offset - sent_start_offset + rel_end_offset = entity_end_offset - sent_start_offset + break + context = "" + if sentence: + start_of_sentence = 0 + end_of_sentence = len(sentence) + if len(sentence) > self.max_text_len: + start_of_sentence = max(rel_start_offset - self.max_text_len // 2, 0) + end_of_sentence = min(rel_end_offset + self.max_text_len // 2, len(sentence)) + text_before = sentence[start_of_sentence:rel_start_offset] + text_after = sentence[rel_end_offset:end_of_sentence] + context = text_before + "[ent]" + text_after + if self.full_paragraph: + cur_sent_len = len(re.findall(self.re_tokenizer, context)) + first_sentence_num = found_sentence_num + last_sentence_num = found_sentence_num + context = [context] + while True: + added = False + if last_sentence_num < len(sentences_list) - 1: + sentence_tokens = re.findall(self.re_tokenizer, sentences_list[last_sentence_num + 1]) + last_sentence_len = len(sentence_tokens) + if cur_sent_len + last_sentence_len < self.max_paragraph_len: + context.append(sentences_list[last_sentence_num + 1]) + cur_sent_len += last_sentence_len + last_sentence_num += 1 + added = True + if first_sentence_num > 0: + sentence_tokens = re.findall(self.re_tokenizer, sentences_list[first_sentence_num - 1]) + first_sentence_len = len(sentence_tokens) + if cur_sent_len + first_sentence_len < self.max_paragraph_len: + context = [sentences_list[first_sentence_num - 1]] + context + cur_sent_len += first_sentence_len + first_sentence_num -= 1 + added = True + if not added: + break + context = " ".join(context) + + log.info(f"rank, context: {context}") + contexts.append(context) + + scores_list = self.entity_ranker(contexts, cand_ent_list, cand_ent_descr_list) + + for context, entity_tags, candidate_entities, substr_len, entities_scores, scores in zip( + contexts, entity_tags_list, cand_ent_list, substr_lens, entities_scores_list, scores_list + ): + log.info(f"len candidate entities {len(candidate_entities)}") + if len(context.split()) < 4: + entities_with_scores = [ + ( + entity, + round(entities_scores.get(entity, (0.0, 0, 0))[0], 2), + entities_scores.get(entity, (0.0, 0, 0))[1], + entities_scores.get(entity, (0.0, 0, 0))[2], + 0.95, + ) + for entity, score in scores + ] + else: + entities_with_scores = [ + ( + entity, + round(entities_scores.get(entity, (0.0, 0, 0))[0], 2), + entities_scores.get(entity, (0.0, 0, 0))[1], + entities_scores.get(entity, (0.0, 0, 0))[2], + round(score, 2), + ) + for entity, score in scores + ] + log.info(f"len entities with scores {len(entities_with_scores)}") + if entity_tags and entity_tags[0][0] == "misc": + entities_with_scores = sorted(entities_with_scores, key=lambda x: (x[1], x[2], x[4]), reverse=True) + else: + entities_with_scores = sorted(entities_with_scores, key=lambda x: (x[1], x[4], x[3]), reverse=True) + log.info(f"--- entities_with_scores {entities_with_scores}") + + if not 
entities_with_scores: + top_entities = [self.not_found_str] + top_conf = [(0.0, 0, 0, 0.0)] + elif entities_with_scores and substr_len == 1 and entities_with_scores[0][1] < 1.0: + top_entities = [self.not_found_str] + top_conf = [(0.0, 0, 0, 0.0)] + elif entities_with_scores and ( + entities_with_scores[0][1] < 0.3 + or (entities_with_scores[0][4] < 0.13 and entities_with_scores[0][3] < 20) + or (entities_with_scores[0][4] < 0.3 and entities_with_scores[0][3] < 4) + or entities_with_scores[0][1] < 0.6 + ): + top_entities = [self.not_found_str] + top_conf = [(0.0, 0, 0, 0.0)] + else: + top_entities = [score[0] for score in entities_with_scores] + top_conf = [score[1:] for score in entities_with_scores] + + log.info(f"--- top_entities {top_entities} top_conf {top_conf}") + + high_conf_entities = [] + high_conf_nums = [] + for elem_num, (entity, conf) in enumerate(zip(top_entities, top_conf)): + if len(conf) == 3 and conf[0] == 1.0 and conf[2] > 50 and conf[3] > 0.3: + new_conf = list(conf) + if new_conf[2] > 55: + new_conf[3] = 1.0 + new_conf = tuple(new_conf) + high_conf_entities.append((entity,) + new_conf) + high_conf_nums.append(elem_num) + + high_conf_entities = sorted(high_conf_entities, key=lambda x: (x[1], x[4], x[3]), reverse=True) + for n, elem_num in enumerate(high_conf_nums): + if elem_num - n >= 0 and elem_num - n < len(top_entities): + del top_entities[elem_num - n] + del top_conf[elem_num - n] + + log.info(f"top entities {top_entities} top_conf {top_conf}") + log.info(f"high_conf_entities {high_conf_entities}") + + top_entities = [elem[0] for elem in high_conf_entities] + top_entities + top_conf = [elem[1:] for elem in high_conf_entities] + top_conf + + log.info(f"top entities {top_entities} top_conf {top_conf}") + + if self.num_entities_to_return == 1 and top_entities: + entity_ids_list.append(top_entities[0]) + conf_list.append(top_conf[0]) + else: + entity_ids_list.append(top_entities[: self.num_entities_to_return]) + conf_list.append(top_conf[: self.num_entities_to_return]) + return entity_ids_list, conf_list diff --git a/annotators/entity_linking/src/find_word.py b/annotators/entity_linking/src/find_word.py new file mode 100644 index 0000000000..a294199c38 --- /dev/null +++ b/annotators/entity_linking/src/find_word.py @@ -0,0 +1,79 @@ +import itertools +import pickle +from collections import Counter +import numpy as np +import scipy as sp +from deeppavlov.core.commands.utils import expand_path + +Sparse = sp.sparse.csr_matrix + + +class WordSearcher: + def __init__(self, words_dict_filename: str, ngrams_matrix_filename: str): + self.words_dict_filename = words_dict_filename + self.ngrams_matrix_filename = ngrams_matrix_filename + self.load() + self.make_ngrams_dicts() + + def load(self): + with open(str(expand_path(self.words_dict_filename)), "rb") as fl: + self.words_dict = pickle.load(fl) + words_list = list(self.words_dict.keys()) + self.words_list = sorted(words_list) + + loader = np.load(str(expand_path(self.ngrams_matrix_filename)), allow_pickle=True) + self.count_matrix = Sparse((loader["data"], loader["indices"], loader["indptr"]), shape=loader["shape"]) + + def make_ngrams_dicts(self): + letters = "abcdefghijklmnopqrstuvwxyz" + self.bigrams_dict, self.trigrams_dict = {}, {} + bigram_combs = list(itertools.product(letters, letters)) + bigram_combs = ["".join(comb) for comb in bigram_combs] + trigram_combs = list(itertools.product(letters, letters, letters)) + trigram_combs = ["".join(comb) for comb in trigram_combs] + for cnt, bigram in enumerate(bigram_combs): + 
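            # Each lowercase character bigram gets a consecutive column id ("aa" -> 0,
            # "ab" -> 1, ..., "zz" -> 675); the trigram loop below continues the numbering
            # from len(bigram_combs) = 676, so bigrams and trigrams share one feature space
            # with the precomputed n-gram count matrix loaded in load().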
self.bigrams_dict[bigram] = cnt + for cnt, trigram in enumerate(trigram_combs): + self.trigrams_dict[trigram] = cnt + len(bigram_combs) + + def __call__(self, query, tags): + ngrams_list = [] + for i in range(len(query) - 1): + ngram = query[i : i + 2].lower() + ngram_id = self.bigrams_dict[ngram] + ngrams_list.append(ngram_id) + for i in range(len(query) - 2): + ngram = query[i : i + 3].lower() + ngram_id = self.trigrams_dict[ngram] + ngrams_list.append(ngram_id) + ngrams_with_cnts = Counter(ngrams_list).most_common() + ngram_ids = [elem[0] for elem in ngrams_with_cnts] + ngram_cnts = [1 for _ in ngrams_with_cnts] + + indptr = np.array([0, len(ngram_cnts)]) + query_matrix = Sparse( + (ngram_cnts, ngram_ids, indptr), shape=(1, len(self.bigrams_dict) + len(self.trigrams_dict)) + ) + + scores = query_matrix * self.count_matrix + scores = np.squeeze(scores.toarray() + 0.0001) + + thresh = 1000 + if thresh >= len(scores): + o = np.argpartition(-scores, len(scores) - 1)[0:thresh] + else: + o = np.argpartition(-scores, thresh)[0:thresh] + o_sort = o[np.argsort(-scores[o])] + o_sort = o_sort.tolist() + + found_words = [self.words_list[n] for n in o_sort] + found_words = [ + word + for word in found_words + if ( + word.startswith(query[0]) + and abs(len(word) - len(query)) < 3 + and self.words_dict[word].intersection(tags) + ) + ] + return found_words diff --git a/annotators/entity_linking/src/torch_transformers_el_ranker.py b/annotators/entity_linking/src/torch_transformers_el_ranker.py new file mode 100644 index 0000000000..3510ee8ad1 --- /dev/null +++ b/annotators/entity_linking/src/torch_transformers_el_ranker.py @@ -0,0 +1,171 @@ +from pathlib import Path +from logging import getLogger +from typing import List, Tuple, Union + +import torch +from transformers import AutoTokenizer, AutoModel +from transformers.data.processors.utils import InputFeatures + +from deeppavlov.core.commands.utils import expand_path +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component + +log = getLogger(__name__) + + +@register("torch_transformers_entity_ranker_preprocessor") +class TorchTransformersEntityRankerPreprocessor(Component): + def __init__( + self, + vocab_file: str, + do_lower_case: bool = True, + max_seq_length: int = 512, + return_tokens: bool = False, + special_tokens: List[str] = None, + **kwargs, + ) -> None: + self.max_seq_length = max_seq_length + self.return_tokens = return_tokens + if Path(vocab_file).is_file(): + vocab_file = str(expand_path(vocab_file)) + self.tokenizer = AutoTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case) + else: + self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) + if special_tokens is not None: + special_tokens_dict = {"additional_special_tokens": special_tokens} + self.tokenizer.add_special_tokens(special_tokens_dict) + + def __call__(self, texts_a: List[str]) -> Union[List[InputFeatures], Tuple[List[InputFeatures], List[List[str]]]]: + # in case of iterator's strange behaviour + if isinstance(texts_a, tuple): + texts_a = list(texts_a) + lengths = [] + for text_a in texts_a: + encoding = self.tokenizer.encode_plus( + text_a, + add_special_tokens=True, + pad_to_max_length=True, + return_attention_mask=True, + ) + input_ids = encoding["input_ids"] + lengths.append(len(input_ids)) + + input_features = self.tokenizer( + text=texts_a, + add_special_tokens=True, + max_length=self.max_seq_length, + padding="max_length", + return_attention_mask=True, + truncation=True, + 
return_tensors="pt", + ) + return input_features + + +@register("torch_transformers_entity_ranker_infer") +class TorchTransformersEntityRankerInfer: + def __init__( + self, + pretrained_bert, + text_encoder_weights_path, + descr_encoder_weights_path, + special_token_id: int = 30522, + do_lower_case: bool = True, + batch_size: int = 5, + descr_batch_size: int = 30, + device: str = "gpu", + **kwargs, + ): + self.device = torch.device("cuda" if torch.cuda.is_available() and device == "gpu" else "cpu") + self.pretrained_bert = str(expand_path(pretrained_bert)) + self.preprocessor = TorchTransformersEntityRankerPreprocessor( + vocab_file=self.pretrained_bert, + do_lower_case=do_lower_case, + special_tokens=["[ent]"], + ) + self.text_encoder = AutoModel.from_pretrained(self.pretrained_bert) + tokenizer = AutoTokenizer.from_pretrained(self.pretrained_bert) + self.text_encoder.resize_token_embeddings(len(tokenizer) + 1) + self.descr_encoder = AutoModel.from_pretrained(self.pretrained_bert) + self.text_encoder_weights_path = str(expand_path(text_encoder_weights_path)) + text_encoder_checkpoint = torch.load(self.text_encoder_weights_path, map_location=self.device) + self.text_encoder.load_state_dict(text_encoder_checkpoint["model_state_dict"]) + self.text_encoder.to(self.device) + self.descr_encoder_weights_path = str(expand_path(descr_encoder_weights_path)) + descr_encoder_checkpoint = torch.load(self.descr_encoder_weights_path, map_location=self.device) + self.descr_encoder.load_state_dict(descr_encoder_checkpoint["model_state_dict"]) + self.descr_encoder.to(self.device) + self.special_token_id = special_token_id + self.batch_size = batch_size + self.descr_batch_size = descr_batch_size + + def __call__( + self, + contexts_batch: List[str], + candidate_entities_batch: List[List[str]], + candidate_entities_descr_batch: List[List[str]], + ): + entity_embs = [] + num_batches = len(contexts_batch) // self.batch_size + int(len(contexts_batch) % self.batch_size > 0) + for ii in range(num_batches): + contexts_list = contexts_batch[ii * self.batch_size : (ii + 1) * self.batch_size] + context_features = self.preprocessor(contexts_list) + text_input_ids = context_features["input_ids"].to(self.device) + text_attention_mask = context_features["attention_mask"].to(self.device) + entity_tokens_pos = [] + for input_ids_list in text_input_ids: + found_n = -1 + for n, input_id in enumerate(input_ids_list): + if input_id == self.special_token_id: + found_n = n + break + if found_n == -1: + found_n = 0 + entity_tokens_pos.append(found_n) + + text_encoder_output = self.text_encoder(input_ids=text_input_ids, attention_mask=text_attention_mask) + text_hidden_states = text_encoder_output.last_hidden_state + for i in range(len(entity_tokens_pos)): + pos = entity_tokens_pos[i] + entity_embs.append(text_hidden_states[i, pos].detach().cpu().numpy().tolist()) + + scores_batch = [] + for entity_emb, candidate_entities_list, candidate_entities_descr_list in zip( + entity_embs, candidate_entities_batch, candidate_entities_descr_batch + ): + if candidate_entities_list: + num_batches = len(candidate_entities_descr_list) // self.descr_batch_size + int( + len(candidate_entities_descr_list) % self.descr_batch_size > 0 + ) + scores_list = [] + for jj in range(num_batches): + cur_descr_list = candidate_entities_descr_list[ + jj * self.descr_batch_size : (jj + 1) * self.descr_batch_size + ] + entity_emb_list = [entity_emb for _ in cur_descr_list] + entity_emb_t = torch.Tensor(entity_emb_list).to(self.device) + descr_features = 
self.preprocessor(cur_descr_list) + descr_input_ids = descr_features["input_ids"].to(self.device) + descr_attention_mask = descr_features["attention_mask"].to(self.device) + descr_encoder_output = self.descr_encoder( + input_ids=descr_input_ids, attention_mask=descr_attention_mask + ) + descr_cls_emb = descr_encoder_output.last_hidden_state[:, :1, :].squeeze(1) + + bs, emb_dim = entity_emb_t.size() + entity_emb_t = entity_emb_t.reshape(bs, 1, emb_dim) + descr_cls_emb = descr_cls_emb.reshape(bs, emb_dim, 1) + dot_products = torch.matmul(entity_emb_t, descr_cls_emb).squeeze(1).squeeze(1) + cur_scores_list = dot_products.detach().cpu().numpy().tolist() + scores_list += cur_scores_list + + entities_with_scores = [ + (entity, round(min(max(score - 114.0, 0.0), 28.0) / 28.0, 3)) + for entity, score in zip(candidate_entities_list, scores_list) + ] + entities_with_scores = sorted(entities_with_scores, key=lambda x: x[1], reverse=True) + scores_batch.append(entities_with_scores) + else: + scores_batch.append([]) + + return scores_batch diff --git a/annotators/entity_linking/test_el.py b/annotators/entity_linking/test_el.py index 39e2f4e402..72fab81251 100644 --- a/annotators/entity_linking/test_el.py +++ b/annotators/entity_linking/test_el.py @@ -7,28 +7,24 @@ def main(): url = "http://0.0.0.0:8075/model" request_data = [ - {"entity_substr": [["Forrest Gump"]], "template": [""], "context": [["Who directed Forrest Gump?"]]}, { - "entity_substr": [["Robert Lewandowski"]], - "template": [""], - "context": [["What team Robert Lewandowski plays for?"]], + "entity_substr": [["forrest gump"]], + "entity_tags": [[[("film", 0.9)]]], + "context": [["who directed forrest gump?"]], + }, + { + "entity_substr": [["robert lewandowski"]], + "entity_tags": [[[("per", 0.9)]]], + "context": [["what team does robert lewandowski play for?"]], }, ] - if use_context: - gold_results = [ - ["Q134773", "Q3077690", "Q552213", "Q5365088"], - ["Q151269", "Q187312", "Q273773", "Q104913", "Q1153256"], - ] - - else: - gold_results = [ - ["Q134773", "Q3077690", "Q552213", "Q5365088", "Q17006552"], - ["Q151269", "Q104913", "Q768144", "Q2403374", "Q170095"], - ] + gold_results = [["Q134773"], ["Q151269", "Q215925"]] + count = 0 for data, gold_result in zip(request_data, gold_results): result = requests.post(url, json=data).json() + print(result) entity_ids = result[0][0]["entity_ids"] if entity_ids == gold_result: count += 1 diff --git a/annotators/entity_linking_deepy/Dockerfile b/annotators/entity_linking_deepy/Dockerfile index e268e3b73d..78bc8d7981 100644 --- a/annotators/entity_linking_deepy/Dockerfile +++ b/annotators/entity_linking_deepy/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 ARG CONFIG ARG COMMIT @@ -21,6 +22,4 @@ COPY $SRC_DIR /src WORKDIR /src -RUN python -m deeppavlov install $CONFIG - CMD python -m deeppavlov riseapi $CONFIG -p $PORT -d diff --git a/annotators/entity_linking_deepy/requirements.txt b/annotators/entity_linking_deepy/requirements.txt index ad39c3791a..7dd6c520d7 100644 --- a/annotators/entity_linking_deepy/requirements.txt +++ b/annotators/entity_linking_deepy/requirements.txt @@ -1,3 +1,5 @@ aiohttp jinja2<=3.0.3 -Werkzeug<=2.0.3 \ No newline at end of file +Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 \ No newline at end of file diff --git a/annotators/entity_linking_rus/Dockerfile b/annotators/entity_linking_rus/Dockerfile index 5c79a0216c..6475b9913d 
100644 --- a/annotators/entity_linking_rus/Dockerfile +++ b/annotators/entity_linking_rus/Dockerfile @@ -33,7 +33,6 @@ RUN pip install -r /src/requirements.txt COPY $SRC_DIR /src WORKDIR /src -RUN python -m deeppavlov install $CONFIG RUN sed -i "s|$SED_ARG|g" "$CONFIG" diff --git a/annotators/entity_linking_rus/requirements.txt b/annotators/entity_linking_rus/requirements.txt index 47b59b9619..daa2394e33 100644 --- a/annotators/entity_linking_rus/requirements.txt +++ b/annotators/entity_linking_rus/requirements.txt @@ -10,3 +10,4 @@ deeppavlov==0.17.2 itsdangerous==2.0.1 jinja2<=3.0.3 Werkzeug<=2.0.3 +cryptography==2.8 diff --git a/annotators/fact_retrieval/Dockerfile b/annotators/fact_retrieval/Dockerfile index d7036a7860..50100b7c77 100644 --- a/annotators/fact_retrieval/Dockerfile +++ b/annotators/fact_retrieval/Dockerfile @@ -24,13 +24,12 @@ ENV PORT=$PORT COPY ./annotators/fact_retrieval/requirements.txt /src/requirements.txt RUN pip install -r /src/requirements.txt -RUN pip install git+https://github.com/deepmipt/DeepPavlov.git@${COMMIT} +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@${COMMIT} COPY $SRC_DIR /src WORKDIR /src COPY ./common/ ./common/ -RUN python -m deeppavlov install $CONFIG RUN sed -i "s|$SED_ARG|g" "$CONFIG" RUN sed -i "s|$SED_ARG|g" "$CONFIG_WIKI" diff --git a/annotators/fact_retrieval/requirements.txt b/annotators/fact_retrieval/requirements.txt index a9862c25ce..80cc9f10a4 100644 --- a/annotators/fact_retrieval/requirements.txt +++ b/annotators/fact_retrieval/requirements.txt @@ -8,4 +8,8 @@ requests==2.22.0 pytorch-lightning==0.9.0 torch==1.6.0 transformers==2.11.0 -faiss-cpu==1.7.0 \ No newline at end of file +faiss-cpu==1.7.0 +tensorflow==1.15.5 +cryptography==2.8 +https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5 +spacy==2.2.3 \ No newline at end of file diff --git a/annotators/hypothesis_scorer/Dockerfile b/annotators/hypothesis_scorer/Dockerfile index 3e48038672..c5a7c6ebd0 100644 --- a/annotators/hypothesis_scorer/Dockerfile +++ b/annotators/hypothesis_scorer/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.14.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.14.1 ARG MIDAS_DATA_URL=https://files.deeppavlov.ai/alexaprize_data/midas.tar.gz ARG CONVERT_DATA_URL=https://files.deeppavlov.ai/alexaprize_data/convert_reddit_v2.8.tar.gz diff --git a/annotators/intent_catcher_deepy/Dockerfile b/annotators/intent_catcher_deepy/Dockerfile index 0ee88680b6..2cc8adce29 100644 --- a/annotators/intent_catcher_deepy/Dockerfile +++ b/annotators/intent_catcher_deepy/Dockerfile @@ -18,11 +18,11 @@ COPY ./requirements.txt requirements.txt RUN pip install --upgrade pip && \ pip install -r requirements.txt && \ - pip install git+git://github.com/deepmipt/DeepPavlov.git@dbcaf73acd8580e2bec337300ab0d29887d78c51 + pip install git+git://github.com/deeppavlovteam/DeepPavlov.git@dbcaf73acd8580e2bec337300ab0d29887d78c51 RUN python -c "import tensorflow_hub as hub; hub.Module(\"https://tfhub.dev/google/universal-sentence-encoder/2\")" && \ wget -O /usr/local/lib/python3.6/dist-packages/deeppavlov/utils/server/server.py \ - https://raw.githubusercontent.com/deepmipt/DeepPavlov/1e707d55ca090782f16f918f15450d1d07d27c85/deeppavlov/utils/server/server.py + https://raw.githubusercontent.com/deeppavlovteam/DeepPavlov/1e707d55ca090782f16f918f15450d1d07d27c85/deeppavlov/utils/server/server.py COPY ./ / diff --git a/annotators/kbqa/Dockerfile 
b/annotators/kbqa/Dockerfile index 4316baeab0..d6bf48c142 100644 --- a/annotators/kbqa/Dockerfile +++ b/annotators/kbqa/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 ARG CONFIG ARG COMMIT @@ -14,12 +15,7 @@ ENV COMMIT=$COMMIT COPY ./annotators/kbqa/requirements.txt /src/requirements.txt RUN pip install -r /src/requirements.txt -RUN cd DeepPavlov && \ - git config --global user.email "you@example.com" && \ - git config --global user.name "Your Name" && \ - git fetch --all --tags --prune && \ - git checkout $COMMIT && \ - pip install -e . +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@$COMMIT COPY $SRC_DIR /src diff --git a/annotators/kbqa/kbqa_cq_mt_bert_lite.json b/annotators/kbqa/kbqa_cq_mt_bert_lite.json index aa0281db14..812c53d301 100644 --- a/annotators/kbqa/kbqa_cq_mt_bert_lite.json +++ b/annotators/kbqa/kbqa_cq_mt_bert_lite.json @@ -1,6 +1,6 @@ { "chainer": { - "in": ["x_init", "x_init", "template_type", "entities", "types"], + "in": ["x_init", "x_init", "template_type", "entities", "tags"], "pipe": [ { "class_name": "template_matcher", @@ -14,7 +14,7 @@ "id": "linker_entities", "url": "{ENTITY_LINKING_URL}", "out": ["entity_ids"], - "param_names": ["entity_substr", "template_found"] + "param_names": ["entity_substr", "entity_tags", "context"] }, { "class_name": "wiki_parser", @@ -25,7 +25,7 @@ "lang": "@en" }, { - "class_name": "kbqa_entity_linker", + "class_name": "kbqa_entity_linking:KBEntityLinker", "id": "linker_types", "load_path": "{DOWNLOADS_PATH}/wikidata_eng", "inverted_index_filename": "inverted_index_types_eng.pickle", @@ -36,7 +36,7 @@ "use_prefix_tree": false }, { - "class_name": "rel_ranking_infer", + "class_name": "rel_ranking_infer:RelRankerInfer", "id": "rel_r_inf", "ranker": {"config_path": "/src/rel_ranking_bert_en.json"}, "load_path": "{DOWNLOADS_PATH}/wikidata_eng", @@ -44,7 +44,7 @@ "rels_to_leave": 40 }, { - "class_name": "query_generator", + "class_name": "query_generator:QueryGenerator", "id": "query_g", "linker_entities": "#linker_entities", "linker_types": "#linker_types", @@ -62,17 +62,16 @@ "max_comb_num": 50, "use_wp_api_requester": false, "use_el_api_requester": true, - "in": ["x_init", "x_init", "template_type", "entities", "types"], + "in": ["x_init", "x_init", "template_type", "entities", "tags"], "out": ["candidate_rels_answers", "entities", "template_answers"] }, { - "class_name": "rel_ranking_infer", + "class_name": "rel_ranking_infer:RelRankerInfer", "rank": false, "wiki_parser": "#wiki_p", "batch_size": 32, "load_path": "{DOWNLOADS_PATH}/wikidata_eng", "rel_q2name_filename": "wiki_dict_properties.pickle", - "use_mt_bert": true, "use_api_requester": false, "return_confidences": true, "return_all_possible_answers": true, @@ -94,8 +93,6 @@ "ENTITY_LINKING_URL": "http://entity-linking:8075/model" }, "requirements": [ - "{DEEPPAVLOV_PATH}/requirements/tf.txt", - "{DEEPPAVLOV_PATH}/requirements/bert_dp.txt", "{DEEPPAVLOV_PATH}/requirements/fasttext.txt", "{DEEPPAVLOV_PATH}/requirements/rapidfuzz.txt", "{DEEPPAVLOV_PATH}/requirements/hdt.txt", diff --git a/annotators/entity_linking/kbqa_entity_linking.py b/annotators/kbqa/kbqa_entity_linking.py similarity index 69% rename from annotators/entity_linking/kbqa_entity_linking.py rename to annotators/kbqa/kbqa_entity_linking.py index 7441d18197..e8923dc0d2 100644 --- a/annotators/entity_linking/kbqa_entity_linking.py +++ b/annotators/kbqa/kbqa_entity_linking.py @@ -12,18 +12,14 @@ # See the 
License for the specific language governing permissions and # limitations under the License. -import os import re import sqlite3 -import logging +from logging import getLogger from typing import List, Dict, Tuple, Optional, Any from collections import defaultdict, Counter -import en_core_web_sm -import inflect import nltk import pymorphy2 -import sentry_sdk from nltk.corpus import stopwords from rapidfuzz import fuzz from hdt import HDTDocument @@ -33,12 +29,9 @@ from deeppavlov.core.models.serializable import Serializable from deeppavlov.core.commands.utils import expand_path from deeppavlov.core.common.file import load_pickle, save_pickle -from deeppavlov.models.spelling_correction.levenshtein.levenshtein_searcher import LevenshteinSearcher -from deeppavlov.models.kbqa.rel_ranking_infer import RelRankerInfer +from rel_ranking_infer import RelRankerInfer -sentry_sdk.init(os.getenv("SENTRY_DSN")) -logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.DEBUG) -log = logging.getLogger(__name__) +log = getLogger(__name__) @register("kbqa_entity_linker") @@ -83,7 +76,6 @@ def __init__( **kwargs, ) -> None: """ - Args: load_path: path to folder with inverted index files inverted_index_filename: file with dict of words (keys) and entities containing these words @@ -111,13 +103,11 @@ def __init__( include_mention: whether to leave or delete entity mention from the sentence before passing to BERT ranker num_entities_to_return: how many entities for each substring the system returns lemmatize: whether to lemmatize tokens of extracted entity - use_prefix_tree: whether to use prefix tree for search of entities with typos in entity labels **kwargs: """ super().__init__(save_path=save_path, load_path=load_path) self.morph = pymorphy2.MorphAnalyzer() self.lemmatize = lemmatize - self.use_prefix_tree = use_prefix_tree self.inverted_index_filename = inverted_index_filename self.entities_list_filename = entities_list_filename self.build_inverted_index = build_inverted_index @@ -145,36 +135,27 @@ def __init__( self.stopwords = set(stopwords.words("russian")) self.re_tokenizer = re.compile(r"[\w']+|[^\w ]") self.entity_ranker = entity_ranker - self.nlp = en_core_web_sm.load() - self.inflect_engine = inflect.engine() self.use_descriptions = use_descriptions self.include_mention = include_mention self.num_entities_to_return = num_entities_to_return self.num_entities_for_bert_ranking = num_entities_for_bert_ranking - self.black_list_what_is = { - "Q277759", # book series - "Q11424", # film - "Q7889", # video game - "Q2743", # musical theatre - "Q5398426", # tv series - "Q506240", # television film - "Q21191270", # television series episode - "Q7725634", # literary work - "Q131436", # board game - "Q1783817", # cooperative board game - } + self.black_list_what_is = set( + [ + "Q277759", # book series + "Q11424", # film + "Q7889", # video game + "Q2743", # musical theatre + "Q5398426", # tv series + "Q506240", # television film + "Q21191270", # television series episode + "Q7725634", # literary work + "Q131436", # board game + "Q1783817", # cooperative board game + ] + ) if self.use_descriptions and self.entity_ranker is None: raise ValueError("No entity ranker is provided!") - if self.use_prefix_tree: - alphabet = ( - r"!#%\&'()+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz½¿ÁÄ" - + "ÅÆÇÉÎÓÖ×ÚßàáâãäåæçèéêëíîïðñòóôöøùúûüýāăąćČčĐėęěĞğĩīİıŁłńňŌōőřŚśşŠšťũūůŵźŻżŽžơưșȚțəʻ" - + "ʿΠΡβγБМавдежикмностъяḤḥṇṬṭầếờợ–‘’Ⅲ−∗" - ) - dictionary_words = 
list(self.inverted_index.keys()) - self.searcher = LevenshteinSearcher(alphabet, dictionary_words) - if self.build_inverted_index: if self.kb_format == "hdt": self.doc = HDTDocument(str(expand_path(self.kb_filename))) @@ -221,118 +202,82 @@ def __call__( self, entity_substr_batch: List[List[str]], templates_batch: List[str] = None, - long_context_batch: List[str] = None, + context_batch: List[str] = None, entity_types_batch: List[List[List[str]]] = None, - short_context_batch: List[str] = None, ) -> Tuple[List[List[List[str]]], List[List[List[float]]]]: + log.info( + f"entity_substr_batch {entity_substr_batch} templates_batch {templates_batch} context_batch {context_batch}" + ) entity_ids_batch = [] confidences_batch = [] - tokens_match_conf_batch = [] if templates_batch is None: templates_batch = ["" for _ in entity_substr_batch] - if long_context_batch is None: - long_context_batch = ["" for _ in entity_substr_batch] - if short_context_batch is None: - short_context_batch = ["" for _ in entity_substr_batch] + if context_batch is None: + context_batch = ["" for _ in entity_substr_batch] if entity_types_batch is None: entity_types_batch = [[[] for _ in entity_substr_list] for entity_substr_list in entity_substr_batch] - for entity_substr_list, template_found, long_context, entity_types_list, short_context in zip( - entity_substr_batch, templates_batch, long_context_batch, entity_types_batch, short_context_batch + for entity_substr_list, template_found, context, entity_types_list in zip( + entity_substr_batch, templates_batch, context_batch, entity_types_batch ): entity_ids_list = [] confidences_list = [] - tokens_match_conf_list = [] for entity_substr, entity_types in zip(entity_substr_list, entity_types_list): - entity_ids, confidences, tokens_match_conf = self.link_entity( - entity_substr, long_context, short_context, template_found, entity_types - ) + entity_ids, confidences = self.link_entity(entity_substr, context, template_found, entity_types) if self.num_entities_to_return == 1: if entity_ids: entity_ids_list.append(entity_ids[0]) confidences_list.append(confidences[0]) - tokens_match_conf_list.append(tokens_match_conf[0]) else: entity_ids_list.append("") confidences_list.append(0.0) - tokens_match_conf_list.append(0.0) else: entity_ids_list.append(entity_ids[: self.num_entities_to_return]) confidences_list.append(confidences[: self.num_entities_to_return]) - tokens_match_conf_list.append(tokens_match_conf[: self.num_entities_to_return]) entity_ids_batch.append(entity_ids_list) confidences_batch.append(confidences_list) - tokens_match_conf_batch.append(tokens_match_conf_list) - - return entity_ids_batch, confidences_batch, tokens_match_conf_batch - - def lemmatize_substr(self, text): - lemm_text = "" - if text: - pr_text = self.nlp(text) - processed_tokens = [] - for token in pr_text: - if token.tag_ in ["NNS", "NNP"] and self.inflect_engine.singular_noun(token.text): - processed_tokens.append(self.inflect_engine.singular_noun(token.text)) - else: - processed_tokens.append(token.text) - lemm_text = " ".join(processed_tokens) - return lemm_text + + return entity_ids_batch, confidences_batch def link_entity( self, entity: str, - long_context: Optional[str] = None, - short_context: Optional[str] = None, + context: Optional[str] = None, template_found: Optional[str] = None, entity_types: List[str] = None, cut_entity: bool = False, ) -> Tuple[List[str], List[float]]: confidences = [] - tokens_match_conf = [] if not entity: entities_ids = ["None"] else: - entity_is_uttr = False - 
lets_talk_phrases = ["let's talk", "let's chat", "what about", "do you know", "tell me about"] - found_lets_talk_phrase = any([phrase in short_context for phrase in lets_talk_phrases]) - if ( - short_context - and (entity == short_context or entity == short_context[:-1] or found_lets_talk_phrase) - and len(entity.split()) == 1 - ): - lemm_entity = self.lemmatize_substr(entity) - entity_is_uttr = True - else: - lemm_entity = entity - - candidate_entities = self.candidate_entities_inverted_index(lemm_entity) + candidate_entities = self.candidate_entities_inverted_index(entity) if self.types_dict: if entity_types: entity_types = set(entity_types) candidate_entities = [ - ent - for ent in candidate_entities - if self.types_dict.get(ent[1], set()).intersection(entity_types) + entity + for entity in candidate_entities + if self.types_dict.get(entity[1], set()).intersection(entity_types) ] - if template_found in ["what is xxx?", "what was xxx?"] or entity_is_uttr: + if template_found in ["what is xxx?", "what was xxx?"]: candidate_entities_filtered = [ - ent - for ent in candidate_entities - if not self.types_dict.get(ent[1], set()).intersection(self.black_list_what_is) + entity + for entity in candidate_entities + if not self.types_dict.get(entity[1], set()).intersection(self.black_list_what_is) ] if candidate_entities_filtered: candidate_entities = candidate_entities_filtered - if cut_entity and candidate_entities and len(lemm_entity.split()) > 1 and candidate_entities[0][3] == 1: - lemm_entity = self.cut_entity_substr(lemm_entity) - candidate_entities = self.candidate_entities_inverted_index(lemm_entity) - candidate_entities, candidate_names = self.candidate_entities_names(lemm_entity, candidate_entities) - entities_ids, confidences, tokens_match_conf, srtd_cand_ent = self.sort_found_entities( - candidate_entities, candidate_names, lemm_entity, entity, long_context + if cut_entity and candidate_entities and len(entity.split()) > 1 and candidate_entities[0][3] == 1: + entity = self.cut_entity_substr(entity) + candidate_entities = self.candidate_entities_inverted_index(entity) + candidate_entities, candidate_names = self.candidate_entities_names(entity, candidate_entities) + entities_ids, confidences, srtd_cand_ent = self.sort_found_entities( + candidate_entities, candidate_names, entity, context ) if template_found: entities_ids = self.filter_entities(entities_ids, template_found) - return entities_ids, confidences, tokens_match_conf + return entities_ids, confidences def cut_entity_substr(self, entity: str): word_tokens = nltk.word_tokenize(entity.lower()) @@ -351,10 +296,8 @@ def candidate_entities_inverted_index(self, entity: str) -> List[Tuple[Any, Any, for tok in word_tokens: candidate_entities_for_tok = set() if len(tok) > 1: - found = False if tok in self.inverted_index: candidate_entities_for_tok = set(self.inverted_index[tok]) - found = True if self.lemmatize: if self.lang_str == "@ru": @@ -367,12 +310,6 @@ def candidate_entities_inverted_index(self, entity: str) -> List[Tuple[Any, Any, candidate_entities_for_tok = candidate_entities_for_tok.union( set(self.inverted_index[lemmatized_tok]) ) - found = True - - if not found and self.use_prefix_tree: - words_with_levens_1 = self.searcher.search(tok, d=1) - for word in words_with_levens_1: - candidate_entities_for_tok = candidate_entities_for_tok.union(set(self.inverted_index[word[0]])) candidate_entities_for_tokens.append(candidate_entities_for_tok) for candidate_entities_for_tok in candidate_entities_for_tokens: @@ -391,100 +328,41 @@ 
def sort_found_entities( self, candidate_entities: List[Tuple[int, str, int]], candidate_names: List[List[str]], - lemm_entity: str, entity: str, context: str = None, ) -> Tuple[List[str], List[float], List[Tuple[str, str, int, int]]]: entities_ratios = [] - lemm_entity = lemm_entity.lower() for candidate, entity_names in zip(candidate_entities, candidate_names): entity_num, entity_id, num_rels, tokens_matched = candidate - fuzz_ratio = max([fuzz.ratio(name.lower(), lemm_entity) for name in entity_names]) - entity_tokens = re.findall(self.re_tokenizer, entity.lower()) - lemm_entity_tokens = re.findall(self.re_tokenizer, lemm_entity.lower()) - entity_tokens = { - word for word in entity_tokens if (len(word) > 1 and word != "'s" and word not in self.stopwords) - } - lemm_entity_tokens = { - word for word in lemm_entity_tokens if (len(word) > 1 and word != "'s" and word not in self.stopwords) - } - match_counts = [] - for name in entity_names: - name_tokens = re.findall(self.re_tokenizer, name.lower()) - name_tokens = { - word for word in name_tokens if (len(word) > 1 and word != "'s" and word not in self.stopwords) - } - entity_inters_len = len(entity_tokens.intersection(name_tokens)) - lemm_entity_inters_len = len(lemm_entity_tokens.intersection(name_tokens)) - - entity_ratio_1 = 0.0 - entity_ratio_2 = 0.0 - if len(entity_tokens): - entity_ratio_1 = entity_inters_len / len(entity_tokens) - if entity_ratio_1 > 1.0 and entity_ratio_1 != 0.0: - entity_ratio_1 = 1.0 / entity_ratio_1 - if len(name_tokens): - entity_ratio_2 = entity_inters_len / len(name_tokens) - if entity_ratio_2 > 1.0 and entity_ratio_2 != 0.0: - entity_ratio_2 = 1.0 / entity_ratio_2 - - lemm_entity_ratio_1 = 0.0 - lemm_entity_ratio_2 = 0.0 - if len(lemm_entity_tokens): - lemm_entity_ratio_1 = lemm_entity_inters_len / len(lemm_entity_tokens) - if lemm_entity_ratio_1 > 1.0 and lemm_entity_ratio_1 != 0.0: - lemm_entity_ratio_1 = 1.0 / lemm_entity_ratio_1 - if len(name_tokens): - lemm_entity_ratio_2 = lemm_entity_inters_len / len(name_tokens) - if lemm_entity_ratio_2 > 1.0 and lemm_entity_ratio_2 != 0.0: - lemm_entity_ratio_2 = 1.0 / lemm_entity_ratio_2 - - match_count = max(entity_ratio_1, entity_ratio_2, lemm_entity_ratio_1, lemm_entity_ratio_2) - match_counts.append(match_count) - match_counts = sorted(match_counts, reverse=True) - if match_counts: - tokens_matched = match_counts[0] - else: - tokens_matched = 0.0 - + fuzz_ratio = max([fuzz.ratio(name.lower(), entity) for name in entity_names]) entities_ratios.append((entity_num, entity_id, tokens_matched, fuzz_ratio, num_rels)) srtd_with_ratios = sorted(entities_ratios, key=lambda x: (x[2], x[3], x[4]), reverse=True) if self.use_descriptions: log.debug(f"context {context}") id_to_score = { - entity_id: (tokens_matched, score, num_rels) - for _, entity_id, tokens_matched, score, num_rels in srtd_with_ratios[ - : self.num_entities_for_bert_ranking - ] + entity_id: (tokens_matched, score) + for _, entity_id, tokens_matched, score, _ in srtd_with_ratios[: self.num_entities_for_bert_ranking] } entity_ids = [entity_id for _, entity_id, _, _, _ in srtd_with_ratios[: self.num_entities_for_bert_ranking]] scores = self.entity_ranker.rank_rels(context, entity_ids) entities_with_scores = [ - (entity_id, id_to_score[entity_id][0], id_to_score[entity_id][1], id_to_score[entity_id][2], score) - for entity_id, score in scores + (entity_id, id_to_score[entity_id][0], id_to_score[entity_id][1], score) for entity_id, score in scores ] - entities_with_scores = sorted(entities_with_scores, 
key=lambda x: (x[1], x[2], x[3], x[4]), reverse=True) - + entities_with_scores = sorted(entities_with_scores, key=lambda x: (x[1], x[2], x[3]), reverse=True) entities_with_scores = [ - ent - for ent in entities_with_scores - if ( - ent[4] > self.descr_rank_score_thres - or ent[2] == 100.0 - or (ent[1] == 1.0 and ent[2] > 92.0 and ent[3] > 20 and ent[4] > 0.2) - ) + entity + for entity in entities_with_scores + if (entity[3] > self.descr_rank_score_thres or entity[2] == 100.0) ] log.debug(f"entities_with_scores {entities_with_scores[:10]}") - entity_ids = [ent for ent, *_ in entities_with_scores] - confidences = [score for *_, score in entities_with_scores] - tokens_match_conf = [ratio for _, ratio, *_ in entities_with_scores] + entity_ids = [entity for entity, _, _, _ in entities_with_scores] + confidences = [score for _, _, _, score in entities_with_scores] else: entity_ids = [ent[1] for ent in srtd_with_ratios] - confidences = [ent[4] * 0.01 for ent in srtd_with_ratios] - tokens_match_conf = [ent[2] for ent in srtd_with_ratios] + confidences = [ent[3] * 0.01 for ent in srtd_with_ratios] - return entity_ids, confidences, tokens_match_conf, srtd_with_ratios + return entity_ids, confidences, srtd_with_ratios def candidate_entities_names( self, entity: str, candidate_entities: List[Tuple[int, str, int]] diff --git a/annotators/kbqa/query_generator.py b/annotators/kbqa/query_generator.py new file mode 100644 index 0000000000..bcc656455a --- /dev/null +++ b/annotators/kbqa/query_generator.py @@ -0,0 +1,294 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
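The entity-linking diff above narrows the linker's batched interface: `long_context_batch`/`short_context_batch` collapse into a single `context_batch`, the prefix-tree and lemmatization paths are removed, and `__call__` now returns only entity ids and confidences (no per-entity token-match scores). A minimal usage sketch of that interface, matching the signature in the patch; the `linker` instance and the example strings are assumptions, not part of the patch:

```
# Sketch only: assumes `linker` is an already-built "kbqa_entity_linker" component
# configured with num_entities_to_return > 1.
entity_substr_batch = [["forrest gump"]]     # detected entity substrings per utterance
templates_batch = ["what is xxx?"]           # optional matched question templates
context_batch = ["what is forrest gump?"]    # single context string per utterance

entity_ids_batch, confidences_batch = linker(
    entity_substr_batch, templates_batch, context_batch
)
# entity_ids_batch[0][0]  -> ranked Wikidata ids found for "forrest gump"
# confidences_batch[0][0] -> scores in the same order
```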
+ +import itertools +import json +import re +import time +from logging import getLogger +from typing import Tuple, List, Optional, Union, Dict, Any +from collections import namedtuple, OrderedDict + +import numpy as np +import nltk + +from deeppavlov.core.common.registry import register +from deeppavlov.models.kbqa.wiki_parser import WikiParser +from deeppavlov.models.kbqa.rel_ranking_infer import RelRankerInfer +from deeppavlov.models.kbqa.utils import extract_year, extract_number, order_of_answers_sorting, make_combs, fill_query +from query_generator_base import QueryGeneratorBase + +log = getLogger(__name__) + + +@register("query_generator") +class QueryGenerator(QueryGeneratorBase): + """ + Class for query generation using Wikidata hdt file + """ + + def __init__( + self, + wiki_parser: WikiParser, + rel_ranker: RelRankerInfer, + entities_to_leave: int = 5, + rels_to_leave: int = 7, + max_comb_num: int = 10000, + return_all_possible_answers: bool = False, + return_answers: bool = False, + *args, + **kwargs, + ) -> None: + """ + + Args: + wiki_parser: component deeppavlov.models.kbqa.wiki_parser + rel_ranker: component deeppavlov.models.kbqa.rel_ranking_infer + entities_to_leave: how many entities to leave after entity linking + rels_to_leave: how many relations to leave after relation ranking + max_comb_num: the maximum number of combinations of candidate entities and relations + return_all_possible_answers: whether to return all found answers + return_answers: whether to return answers or candidate answers + **kwargs: + """ + self.wiki_parser = wiki_parser + self.rel_ranker = rel_ranker + self.entities_to_leave = entities_to_leave + self.rels_to_leave = rels_to_leave + self.max_comb_num = max_comb_num + self.return_all_possible_answers = return_all_possible_answers + self.return_answers = return_answers + self.replace_tokens = [ + ("wdt:p31", "wdt:P31"), + ("pq:p580", "pq:P580"), + ("pq:p582", "pq:P582"), + ("pq:p585", "pq:P585"), + ("pq:p1545", "pq:P1545"), + ] + super().__init__( + wiki_parser=self.wiki_parser, + rel_ranker=self.rel_ranker, + entities_to_leave=self.entities_to_leave, + rels_to_leave=self.rels_to_leave, + return_answers=self.return_answers, + *args, + **kwargs, + ) + + def __call__( + self, + question_batch: List[str], + question_san_batch: List[str], + template_type_batch: Union[List[List[str]], List[str]], + entities_from_ner_batch: List[List[str]], + types_from_ner_batch: List[List[str]], + q_type_flags_batch: List[str] = None, + ) -> List[Union[List[Tuple[str, Any]], List[str]]]: + + candidate_outputs_batch = [] + template_answers_batch = [] + qg_tm1 = time.time() + candidate_outputs = [] + template_answer = "" + if q_type_flags_batch is None: + q_type_flags_batch = ["" for _ in question_batch] + try: + log.info(f"kbqa inputs {question_batch} {entities_from_ner_batch} {types_from_ner_batch}") + for question, question_sanitized, template_type, entities_from_ner, types_from_ner, q_type_flag in zip( + question_batch, + question_san_batch, + template_type_batch, + entities_from_ner_batch, + types_from_ner_batch, + q_type_flags_batch, + ): + if template_type == "-1": + template_type = "7" + candidate_outputs, template_answer = self.find_candidate_answers( + question, question_sanitized, template_type, entities_from_ner, types_from_ner, q_type_flag + ) + except Exception as e: + log.info("query generator is broken") + log.exception(e) + candidate_outputs_batch.append(candidate_outputs) + template_answers_batch.append(template_answer) + qg_tm2 = time.time() + 
log.debug(f"--------query generator time {qg_tm2-qg_tm1}") + if self.return_answers: + answers = self.rel_ranker( + question_batch, candidate_outputs_batch, entities_from_ner_batch, template_answers_batch + ) + log.debug(f"(__call__)answers: {answers}") + if not answers: + answers = ["Not Found"] + return answers + else: + log.debug(f"(__call__)candidate_outputs_batch: {[output[:5] for output in candidate_outputs_batch]}") + return candidate_outputs_batch, entities_from_ner_batch, template_answers_batch + + def query_parser( + self, + question: str, + query_info: Dict[str, str], + entities_and_types_select: List[str], + entity_ids: List[List[str]], + type_ids: List[List[str]], + rels_from_template: Optional[List[Tuple[str]]] = None, + answer_types: Optional[List[str]] = None, + ) -> List[List[Union[Tuple[Any, ...], Any]]]: + question_tokens = nltk.word_tokenize(question) + query = query_info["query_template"].lower() + for old_tok, new_tok in self.replace_tokens: + query = query.replace(old_tok, new_tok) + log.debug(f"\n_______________________________\nquery: {query}\n_______________________________\n") + rels_for_search = query_info["rank_rels"] + rel_types = query_info["rel_types"] + query_seq_num = query_info["query_sequence"] + return_if_found = query_info["return_if_found"] + define_sorting_order = query_info["define_sorting_order"] + property_types = query_info["property_types"] + log.debug(f"(query_parser)query: {query}, {rels_for_search}, {query_seq_num}, {return_if_found}") + query_triplets = re.findall(r"{[ ]?(.*?)[ ]?}", query)[0].split(" . ") + log.debug(f"(query_parser)query_triplets: {query_triplets}") + query_triplets = [triplet.split(" ")[:3] for triplet in query_triplets] + query_sequence_dict = {num: triplet for num, triplet in zip(query_seq_num, query_triplets)} + query_sequence = [] + for i in range(1, max(query_seq_num) + 1): + query_sequence.append(query_sequence_dict[i]) + triplet_info_list = [ + ("forw" if triplet[2].startswith("?") else "backw", search_source, rel_type) + for search_source, triplet, rel_type in zip(rels_for_search, query_triplets, rel_types) + if search_source != "do_not_rank" + ] + log.debug(f"(query_parser)rel_directions: {triplet_info_list}") + entity_ids = [entity[: self.entities_to_leave] for entity in entity_ids] + entity_ids = [[entity for entity in entities if entity != "not in wiki"] for entities in entity_ids] + rel_tm1 = time.time() + if rels_from_template is not None: + rels = [[(rel, 1.0) for rel in rel_list] for rel_list in rels_from_template] + else: + rels = [self.find_top_rels(question, entity_ids, triplet_info) for triplet_info in triplet_info_list] + log.info(f"(query_parser)rels: {rels}") + rels = [[rel for rel in rel_list if rel[1] > 0.95] for rel_list in rels] + rel_tm2 = time.time() + log.debug(f"--------rels find time: {rel_tm2-rel_tm1}") + rels_from_query = [triplet[1] for triplet in query_triplets if triplet[1].startswith("?")] + answer_ent = re.findall(r"select [\(]?([\S]+) ", query) + order_info_nt = namedtuple("order_info", ["variable", "sorting_order"]) + order_variable = re.findall(r"order by (asc|desc)\((.*)\)", query) + if order_variable: + if define_sorting_order: + answers_sorting_order = order_of_answers_sorting(question) + else: + answers_sorting_order = order_variable[0][0] + order_info = order_info_nt(order_variable[0][1], answers_sorting_order) + else: + order_info = order_info_nt(None, None) + log.debug(f"question, order_info: {question}, {order_info}") + filter_from_query = re.findall(r"contains\((\?\w), 
(.+?)\)", query) + log.debug(f"(query_parser)filter_from_query: {filter_from_query}") + + year = extract_year(question_tokens, question) + number = extract_number(question_tokens, question) + log.debug(f"year {year}, number {number}") + if year: + filter_info = [(elem[0], elem[1].replace("n", year)) for elem in filter_from_query] + elif number: + filter_info = [(elem[0], elem[1].replace("n", number)) for elem in filter_from_query] + else: + filter_info = [elem for elem in filter_from_query if elem[1] != "n"] + for unk_prop, prop_type in property_types.items(): + filter_info.append((unk_prop, prop_type)) + log.debug(f"(query_parser)filter_from_query: {filter_from_query}") + rel_combs = make_combs(rels, permut=False) + entity_positions, type_positions = [elem.split("_") for elem in entities_and_types_select.split(" ")] + log.debug(f"entity_positions {entity_positions}, type_positions {type_positions}") + selected_entity_ids = [entity_ids[int(pos) - 1] for pos in entity_positions if int(pos) > 0] + selected_type_ids = [type_ids[int(pos) - 1] for pos in type_positions if int(pos) > 0] + entity_combs = make_combs(selected_entity_ids, permut=True) + type_combs = make_combs(selected_type_ids, permut=False) + log.debug( + f"(query_parser)entity_combs: {entity_combs[:3]}, type_combs: {type_combs[:3]}," + f" rel_combs: {rel_combs[:3]}" + ) + queries_list = [] + parser_info_list = [] + confidences_list = [] + all_combs_list = list(itertools.product(entity_combs, type_combs, rel_combs)) + answer_types = self.filter_answers(question.lower(), answer_types) + query_tm1 = time.time() + for comb_num, combs in enumerate(all_combs_list): + confidence = np.prod([score for rel, score in combs[2][:-1]]) + confidences_list.append(confidence) + query_hdt_seq = [ + fill_query(query_hdt_elem, combs[0], combs[1], combs[2]) for query_hdt_elem in query_sequence + ] + if comb_num == 0: + log.debug(f"\n__________________________\nfilled query: {query_hdt_seq}\n__________________________\n") + if comb_num > 0: + answer_types = [] + queries_list.append( + ( + rels_from_query + answer_ent, + query_hdt_seq, + filter_info, + order_info, + answer_types, + rel_types, + return_if_found, + ) + ) + + parser_info_list.append("query_execute") + if comb_num == self.max_comb_num: + break + + candidate_outputs = [] + candidate_outputs_list = [] + try: + candidate_outputs_list = self.wiki_parser(parser_info_list, queries_list) + except json.decoder.JSONDecodeError: + log.info("query execute, not received output from wiki parser") + if self.use_wp_api_requester and isinstance(candidate_outputs_list, list) and candidate_outputs_list: + candidate_outputs_list = candidate_outputs_list[0] + + if isinstance(candidate_outputs_list, list) and candidate_outputs_list: + outputs_len = len(candidate_outputs_list) + all_combs_list = all_combs_list[:outputs_len] + confidences_list = confidences_list[:outputs_len] + for combs, confidence, candidate_output in zip(all_combs_list, confidences_list, candidate_outputs_list): + candidate_outputs += [ + [combs[0]] + [rel for rel, score in combs[2][:-1]] + output + [confidence] + for output in candidate_output + ] + if self.return_all_possible_answers: + candidate_outputs_dict = OrderedDict() + for candidate_output in candidate_outputs: + candidate_output_key = (tuple(candidate_output[0]), tuple(candidate_output[1:-2])) + if candidate_output_key not in candidate_outputs_dict: + candidate_outputs_dict[candidate_output_key] = [] + candidate_outputs_dict[candidate_output_key].append(candidate_output[-2:]) + 
candidate_outputs = [] + for (_, candidate_rel_comb), candidate_output in candidate_outputs_dict.items(): + candidate_outputs.append( + list(candidate_rel_comb) + + [tuple([ans for ans, conf in candidate_output]), candidate_output[0][1]] + ) + else: + candidate_outputs = [output[1:] for output in candidate_outputs] + query_tm2 = time.time() + log.debug(f"--------queries execution time: {query_tm2-query_tm1}") + log.info(f"(query_parser)final outputs: {candidate_outputs[:3]}") + + return candidate_outputs diff --git a/annotators/kbqa/query_generator_base.py b/annotators/kbqa/query_generator_base.py new file mode 100644 index 0000000000..b4ef67b595 --- /dev/null +++ b/annotators/kbqa/query_generator_base.py @@ -0,0 +1,382 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import time +from logging import getLogger +from typing import Tuple, List, Optional, Union, Any + +from whapi import search, get_html +from bs4 import BeautifulSoup + +from deeppavlov.core.models.component import Component +from deeppavlov.core.models.serializable import Serializable +from deeppavlov.core.common.file import read_json +from deeppavlov.core.commands.utils import expand_path +from deeppavlov.models.kbqa.template_matcher import TemplateMatcher +from deeppavlov.models.kbqa.entity_linking import EntityLinker +from deeppavlov.models.kbqa.rel_ranking_infer import RelRankerInfer +from deeppavlov.models.kbqa.utils import FilterAnswers + +log = getLogger(__name__) + + +class QueryGeneratorBase(Component, Serializable): + """ + This class takes as input entity substrings, defines the template of the query and + fills the slots of the template with candidate entities and relations. 
+ """ + + def __init__( + self, + template_matcher: TemplateMatcher, + linker_entities: EntityLinker, + linker_types: EntityLinker, + rel_ranker: RelRankerInfer, + load_path: str, + rank_rels_filename_1: str, + rank_rels_filename_2: str, + sparql_queries_filename: str, + wiki_parser=None, + wiki_file_format: str = "hdt", + entities_to_leave: int = 5, + rels_to_leave: int = 7, + answer_types_filename: str = None, + syntax_structure_known: bool = False, + use_wp_api_requester: bool = False, + use_el_api_requester: bool = False, + use_alt_templates: bool = True, + return_answers: bool = False, + *args, + **kwargs, + ) -> None: + """ + + Args: + template_matcher: component deeppavlov.models.kbqa.template_matcher + linker_entities: component deeppavlov.models.kbqa.entity_linking for linking of entities + linker_types: component deeppavlov.models.kbqa.entity_linking for linking of types + rel_ranker: component deeppavlov.models.kbqa.rel_ranking_infer + load_path: path to folder with wikidata files + rank_rels_filename_1: file with list of rels for first rels in questions with ranking + rank_rels_filename_2: file with list of rels for second rels in questions with ranking + sparql_queries_filename: file with sparql query templates + wiki_file_format: format of wikidata file + wiki_parser: component deeppavlov.models.kbqa.wiki_parser + entities_to_leave: how many entities to leave after entity linking + rels_to_leave: how many relations to leave after relation ranking + syntax_structure_known: if syntax tree parser was used to define query template type + use_api_requester: whether deeppavlov.models.api_requester.api_requester component will be used for + Entity Linking and Wiki Parser + return_answers: whether to return answers or candidate answers + """ + super().__init__(save_path=None, load_path=load_path) + self.template_matcher = template_matcher + self.linker_entities = linker_entities + self.linker_types = linker_types + self.wiki_parser = wiki_parser + self.wiki_file_format = wiki_file_format + self.rel_ranker = rel_ranker + self.rank_rels_filename_1 = rank_rels_filename_1 + self.rank_rels_filename_2 = rank_rels_filename_2 + self.rank_list_0 = [] + self.rank_list_1 = [] + self.entities_to_leave = entities_to_leave + self.rels_to_leave = rels_to_leave + self.syntax_structure_known = syntax_structure_known + self.use_wp_api_requester = use_wp_api_requester + self.use_el_api_requester = use_el_api_requester + self.use_alt_templates = use_alt_templates + self.sparql_queries_filename = sparql_queries_filename + self.return_answers = return_answers + self.filter_answers = FilterAnswers(answer_types_filename) + + self.load() + + def load(self) -> None: + with open(self.load_path / self.rank_rels_filename_1, "r") as fl1: + lines = fl1.readlines() + self.rank_list_0 = [line.split("\t")[0] for line in lines] + + with open(self.load_path / self.rank_rels_filename_2, "r") as fl2: + lines = fl2.readlines() + self.rank_list_1 = [line.split("\t")[0] for line in lines] + + self.template_queries = read_json(str(expand_path(self.sparql_queries_filename))) + + def save(self) -> None: + pass + + def find_candidate_answers( + self, + question: str, + question_sanitized: str, + template_types: Union[List[str], str], + entities_from_ner: List[str], + types_from_ner: List[str], + q_type_flag: str, + ) -> Union[List[Tuple[str, Any]], List[str]]: + + candidate_outputs = [] + self.template_nums = template_types + + replace_tokens = [ + (" - ", "-"), + (" .", ""), + ("{", ""), + ("}", ""), + (" ", " "), + ('"', 
"'"), + ("(", ""), + (")", ""), + ("–", "-"), + ] + for old, new in replace_tokens: + question = question.replace(old, new) + + temp_tm1 = time.time() + ( + entities_from_template, + types_from_template, + rels_from_template, + rel_dirs_from_template, + query_type_template, + entity_types, + template_answer, + answer_types, + template_found, + ) = self.template_matcher(question_sanitized, entities_from_ner) + answer_info = answer_types or q_type_flag + temp_tm2 = time.time() + log.debug(f"--------template matching time: {temp_tm2-temp_tm1}") + self.template_nums = [query_type_template] + + log.debug(f"question: {question}\n") + log.debug(f"template_type {self.template_nums}") + log.debug(f"types from template {types_from_template}") + + if entities_from_template or types_from_template: + if rels_from_template[0][0] == "PHOW": + how_to_content = self.find_answer_wikihow(entities_from_template[0]) + candidate_outputs = [["PHOW", how_to_content, 1.0]] + else: + el_tm1 = time.time() + if len(types_from_ner) > 1: + filtered_types = [] + for types in types_from_ner: + if any([elem[0] != "misc" for elem in types]): + filtered_types.append(types) + types_from_ner = [filtered_types[-1]] + entity_ids = self.get_entity_ids( + entities_from_template, "entities", template_found, question, types_from_ner + ) + type_ids = [] + el_tm2 = time.time() + log.debug(f"--------entity linking time: {el_tm2-el_tm1}") + log.debug(f"entities_from_template {entities_from_template}") + log.debug(f"entity_types {entity_types}") + log.debug(f"types_from_template {types_from_template}") + log.debug(f"rels_from_template {rels_from_template}") + log.debug(f"entity_ids {entity_ids}") + log.debug(f"type_ids {type_ids}") + + candidate_outputs = self.sparql_template_parser( + question_sanitized, entity_ids, type_ids, answer_types, rels_from_template, rel_dirs_from_template + ) + + if not candidate_outputs and entities_from_ner: + log.debug(f"(__call__)entities_from_ner: {entities_from_ner}") + log.debug(f"(__call__)types_from_ner: {types_from_ner}") + el_tm1 = time.time() + if len(entities_from_ner) > 1: + filtered_entities, filtered_types = [], [] + for entity, types in zip(entities_from_ner, types_from_ner): + if any([elem[0] != "misc" for elem in types]): + filtered_entities.append(entity) + filtered_types.append(types) + if filtered_entities: + entities_from_ner = [filtered_entities[-1]] + types_from_ner = [filtered_types[-1]] + else: + entities_from_ner, types_from_ner = [], [] + + entity_ids = self.get_entity_ids( + entities_from_ner, "entities", question=question, entity_types=types_from_ner + ) + type_ids = [] + el_tm2 = time.time() + log.debug(f"--------entity linking time: {el_tm2-el_tm1}") + log.debug(f"(__call__)entity_ids: {entity_ids}") + log.debug(f"(__call__)type_ids: {type_ids}") + self.template_nums = template_types + log.debug(f"(__call__)self.template_nums: {self.template_nums}") + if not self.syntax_structure_known: + entity_ids = entity_ids[:3] + candidate_outputs = self.sparql_template_parser(question_sanitized, entity_ids, type_ids, answer_info) + return candidate_outputs, template_answer + + def get_entity_ids( + self, + entities: List[str], + what_to_link: str, + template_found: str = None, + question: str = None, + entity_types: List[List[str]] = None, + ) -> List[List[str]]: + entity_ids = [] + if what_to_link == "entities": + entities = [entity.lower() for entity in entities] + el_output = [] + try: + el_output = self.linker_entities([entities], [entity_types], [[question.lower()]]) + except 
json.decoder.JSONDecodeError: + log.info("not received output from entity linking") + if el_output: + log.info(f"el input {entities} {template_found} {question} el output {el_output}") + if self.use_el_api_requester: + el_output = el_output[0] + entity_ids = [entity_info.get("entity_ids", []) for entity_info in el_output] + if not self.use_el_api_requester and entity_ids: + entity_ids = entity_ids[0] + if what_to_link == "types": + entity_ids, *_ = self.linker_types([entities]) + entity_ids = entity_ids[0] + + return entity_ids + + def sparql_template_parser( + self, + question: str, + entity_ids: List[List[str]], + type_ids: List[List[str]], + answer_types: List[str], + rels_from_template: Optional[List[Tuple[str]]] = None, + rel_dirs_from_template: Optional[List[str]] = None, + ) -> List[Tuple[str]]: + candidate_outputs = [] + log.debug(f"use alternative templates {self.use_alt_templates}") + log.debug(f"(find_candidate_answers)self.template_nums: {self.template_nums}") + templates = [] + for template_num in self.template_nums: + for num, template in self.template_queries.items(): + if (num == template_num and self.syntax_structure_known) or ( + template["template_num"] == template_num and not self.syntax_structure_known + ): + templates.append(template) + templates = [ + template + for template in templates + if ( + not self.syntax_structure_known + and [len(entity_ids), len(type_ids)] == template["entities_and_types_num"] + ) + or self.syntax_structure_known + ] + templates_string = "\n".join([template["query_template"] for template in templates]) + log.debug(f"{templates_string}") + if not templates: + return candidate_outputs + if rels_from_template is not None: + query_template = {} + for template in templates: + if template["rel_dirs"] == rel_dirs_from_template: + query_template = template + if query_template: + entities_and_types_select = query_template["entities_and_types_select"] + candidate_outputs = self.query_parser( + question, + query_template, + entities_and_types_select, + entity_ids, + type_ids, + rels_from_template, + answer_types, + ) + else: + for template in templates: + entities_and_types_select = template["entities_and_types_select"] + candidate_outputs = self.query_parser( + question, + template, + entities_and_types_select, + entity_ids, + type_ids, + rels_from_template, + answer_types, + ) + if candidate_outputs: + return candidate_outputs + + if not candidate_outputs and self.use_alt_templates: + alternative_templates = templates[0]["alternative_templates"] + for template_num, entities_and_types_select in alternative_templates: + candidate_outputs = self.query_parser( + question, + self.template_queries[template_num], + entities_and_types_select, + entity_ids, + type_ids, + rels_from_template, + answer_types, + ) + if candidate_outputs: + return candidate_outputs + + log.debug("candidate_rels_and_answers:\n" + "\n".join([str(output) for output in candidate_outputs[:5]])) + + return candidate_outputs + + def find_top_rels(self, question: str, entity_ids: List[List[str]], triplet_info: Tuple) -> List[Tuple[str, Any]]: + ex_rels = [] + direction, source, rel_type = triplet_info + if source == "wiki": + queries_list = list( + { + (entity, direction, rel_type) + for entity_id in entity_ids + for entity in entity_id[: self.entities_to_leave] + } + ) + parser_info_list = ["find_rels" for i in range(len(queries_list))] + try: + ex_rels = self.wiki_parser(parser_info_list, queries_list) + except json.decoder.JSONDecodeError: + log.info("find_top_rels, not received 
output from wiki parser") + if self.use_wp_api_requester and ex_rels: + ex_rels = [rel[0] for rel in ex_rels] + ex_rels = list(set(ex_rels)) + ex_rels = [rel.split("/")[-1] for rel in ex_rels] + elif source == "rank_list_1": + ex_rels = self.rank_list_0 + elif source == "rank_list_2": + ex_rels = self.rank_list_1 + rels_with_scores = [] + ex_rels = [rel for rel in ex_rels if rel.startswith("P")] + if ex_rels: + rels_with_scores = self.rel_ranker.rank_rels(question, ex_rels) + return rels_with_scores[: self.rels_to_leave] + + def find_answer_wikihow(self, howto_sentence: str) -> str: + tags = [] + search_results = search(howto_sentence, 5) + if search_results: + article_id = search_results[0]["article_id"] + html = get_html(article_id) + page = BeautifulSoup(html, "lxml") + tags = list(page.find_all(["p"])) + if tags: + howto_content = f"{tags[0].text.strip()}@en" + else: + howto_content = "Not Found" + return howto_content diff --git a/annotators/kbqa/rel_ranking_infer.py b/annotators/kbqa/rel_ranking_infer.py new file mode 100644 index 0000000000..e2a2b781a0 --- /dev/null +++ b/annotators/kbqa/rel_ranking_infer.py @@ -0,0 +1,216 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from logging import getLogger +from typing import Tuple, List, Any, Optional +from scipy.special import softmax + +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component +from deeppavlov.core.models.serializable import Serializable +from deeppavlov.core.common.file import load_pickle +from deeppavlov.models.ranking.rel_ranker import RelRanker +from deeppavlov.models.kbqa.wiki_parser import WikiParser +from deeppavlov.models.kbqa.sentence_answer import sentence_answer + +log = getLogger(__name__) + + +@register("rel_ranking_infer") +class RelRankerInfer(Component, Serializable): + """Class for ranking of paths in subgraph""" + + def __init__( + self, + load_path: str, + rel_q2name_filename: str, + ranker: Optional[RelRanker] = None, + wiki_parser: Optional[WikiParser] = None, + batch_size: int = 32, + rels_to_leave: int = 40, + softmax: bool = False, + return_all_possible_answers: bool = False, + return_answer_ids: bool = False, + use_api_requester: bool = False, + return_sentence_answer: bool = False, + rank: bool = True, + return_confidences: bool = False, + **kwargs, + ): + """ + Args: + load_path: path to folder with wikidata files + rel_q2name_filename: name of file which maps relation id to name + ranker: component deeppavlov.models.ranking.rel_ranker + wiki_parser: component deeppavlov.models.wiki_parser + batch_size: infering batch size + rels_to_leave: how many relations to leave after relation ranking + return_all_possible_answers: whether to return all found answers + return_answer_ids: whether to return answer ids from Wikidata + use_api_requester: whether wiki parser will be used as external api + return_sentence_answer: whether to return answer as a sentence + return_confidences: whether 
to return confidences of candidate answers + **kwargs: + """ + super().__init__(save_path=None, load_path=load_path) + self.rel_q2name_filename = rel_q2name_filename + self.ranker = ranker + self.wiki_parser = wiki_parser + self.batch_size = batch_size + self.rels_to_leave = rels_to_leave + self.softmax = softmax + self.return_all_possible_answers = return_all_possible_answers + self.return_answer_ids = return_answer_ids + self.use_api_requester = use_api_requester + self.return_sentence_answer = return_sentence_answer + self.rank = rank + self.return_confidences = return_confidences + self.load() + + def load(self) -> None: + self.rel_q2name = load_pickle(self.load_path / self.rel_q2name_filename) + + def save(self) -> None: + pass + + def __call__( + self, + questions_list: List[str], + candidate_answers_list: List[List[Tuple[str]]], + entities_list: List[List[str]] = None, + template_answers_list: List[str] = None, + ) -> List[str]: + answers = [] + confidence = 0.0 + if entities_list is None: + entities_list = [[] for _ in questions_list] + if template_answers_list is None: + template_answers_list = ["" for _ in questions_list] + for question, candidate_answers, entities, template_answer in zip( + questions_list, candidate_answers_list, entities_list, template_answers_list + ): + answers_with_scores = [] + answer = "Not Found" + if self.rank: + n_batches = len(candidate_answers) // self.batch_size + int( + len(candidate_answers) % self.batch_size > 0 + ) + for i in range(n_batches): + questions_batch = [] + rels_labels_batch = [] + answers_batch = [] + confidences_batch = [] + for candidate_ans_and_rels in candidate_answers[i * self.batch_size : (i + 1) * self.batch_size]: + candidate_rels = [] + if candidate_ans_and_rels: + candidate_rels = candidate_ans_and_rels[:-2] + candidate_rels = [candidate_rel.split("/")[-1] for candidate_rel in candidate_rels] + candidate_answer = candidate_ans_and_rels[-2] + candidate_confidence = candidate_ans_and_rels[-1] + candidate_rels = " # ".join( + [ + self.rel_q2name[candidate_rel] + for candidate_rel in candidate_rels + if candidate_rel in self.rel_q2name + ] + ) + if candidate_rels: + questions_batch.append(question) + rels_labels_batch.append(candidate_rels) + answers_batch.append(candidate_answer) + confidences_batch.append(candidate_confidence) + + if questions_batch: + probas = self.ranker(questions_batch, rels_labels_batch) + probas = [proba[1] for proba in probas] + for j, (answer, confidence, rels_labels) in enumerate( + zip(answers_batch, confidences_batch, rels_labels_batch) + ): + answers_with_scores.append((answer, rels_labels, max(probas[j], confidence))) + + answers_with_scores = sorted(answers_with_scores, key=lambda x: x[-1], reverse=True) + else: + answers_with_scores = [(answer, rels, conf) for *rels, answer, conf in candidate_answers] + + answer_ids = tuple() + if answers_with_scores: + log.debug(f"answers: {answers_with_scores[0]}") + answer_ids = answers_with_scores[0][0] + if self.return_all_possible_answers and isinstance(answer_ids, tuple): + answer_ids_input = [(answer_id, question) for answer_id in answer_ids] + else: + answer_ids_input = [(answer_ids, question)] + parser_info_list = ["find_label" for _ in answer_ids_input] + answer_labels = self.wiki_parser(parser_info_list, answer_ids_input) + if self.use_api_requester: + answer_labels = [label[0] for label in answer_labels] + if self.return_all_possible_answers: + answer_labels = list(set(answer_labels)) + answer_labels = [label for label in answer_labels if (label and 
label != "Not Found")][:5] + answer_labels = [str(label) for label in answer_labels] + if len(answer_labels) > 2: + answer = f"{', '.join(answer_labels[:-1])} and {answer_labels[-1]}" + else: + answer = ", ".join(answer_labels) + else: + answer = answer_labels[0] + if self.return_sentence_answer: + try: + answer = sentence_answer(question, answer, entities, template_answer) + except Exception as e: + log.info(f"Error in sentence answer {e}") + confidence = answers_with_scores[0][2] + + if self.return_confidences: + answers.append((answer, confidence)) + else: + if self.return_answer_ids: + answers.append((answer, answer_ids)) + else: + answers.append(answer) + if not answers: + if self.return_confidences: + answers.append(("Not found", 0.0)) + else: + answers.append("Not found") + + return answers + + def rank_rels(self, question: str, candidate_rels: List[str]) -> List[Tuple[str, Any]]: + rels_with_scores = [] + if question is not None: + n_batches = len(candidate_rels) // self.batch_size + int(len(candidate_rels) % self.batch_size > 0) + for i in range(n_batches): + questions_batch = [] + rels_labels_batch = [] + rels_batch = [] + for candidate_rel in candidate_rels[i * self.batch_size : (i + 1) * self.batch_size]: + if candidate_rel in self.rel_q2name: + questions_batch.append(question) + rels_batch.append(candidate_rel) + rels_labels_batch.append(self.rel_q2name[candidate_rel]) + if questions_batch: + probas = self.ranker(questions_batch, rels_labels_batch) + probas = [proba[1] for proba in probas] + for j, rel in enumerate(rels_batch): + rels_with_scores.append((rel, probas[j])) + if self.softmax: + scores = [score for rel, score in rels_with_scores] + softmax_scores = softmax(scores) + rels_with_scores = [ + (rel, softmax_score) for (rel, score), softmax_score in zip(rels_with_scores, softmax_scores) + ] + rels_with_scores = sorted(rels_with_scores, key=lambda x: x[1], reverse=True) + + return rels_with_scores[: self.rels_to_leave] diff --git a/annotators/kbqa/requirements.txt b/annotators/kbqa/requirements.txt index 8ff7ea67b3..54630a5e11 100644 --- a/annotators/kbqa/requirements.txt +++ b/annotators/kbqa/requirements.txt @@ -9,3 +9,4 @@ Werkzeug<=2.0.3 click==7.1.2 torch==1.6.0 transformers==4.6.0 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu diff --git a/annotators/kbqa/server.py b/annotators/kbqa/server.py index 121557e877..c07c7ff073 100644 --- a/annotators/kbqa/server.py +++ b/annotators/kbqa/server.py @@ -17,7 +17,11 @@ kbqa = build_model(config_name, download=True) if NER_INPUT: test_res = kbqa( - ["What is the capital of Russia?"], ["What is the capital of Russia?"], ["-1"], [["Russia"]], [[]] + ["What is the capital of Russia?"], + ["What is the capital of Russia?"], + ["-1"], + [["Russia"]], + [[[("country", 1.0)]]], ) else: test_res = kbqa(["What is the capital of Russia?"]) @@ -36,7 +40,7 @@ def respond(): questions = inp.get("x_init", [" "]) template_types = ["-1" for _ in questions] entities = inp.get("entities", [[]]) - entity_types = [[] for _ in questions] + entity_tags = inp.get("entity_tags", [[]]) sanitized_questions, sanitized_entities = [], [] nf_numbers = [] if len(questions) == len(entities): @@ -51,7 +55,7 @@ def respond(): kbqa_input = [] if sanitized_questions: if NER_INPUT: - kbqa_input = [sanitized_questions, sanitized_questions, template_types, sanitized_entities, entity_types] + kbqa_input = [sanitized_questions, sanitized_questions, template_types, sanitized_entities, entity_tags] else: kbqa_input = [sanitized_questions] 
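# Illustration only, not part of the patch: with NER_INPUT enabled, the request's new
# "entity_tags" field is forwarded to the model in place of the old empty entity_types,
# so a payload such as
#   {"x_init": ["How old is Donald Trump?"], "entities": [["Donald Trump"]],
#    "entity_tags": [[["per", 1.0]]]}
# produces
#   kbqa_input = [["How old is Donald Trump?"], ["How old is Donald Trump?"], ["-1"],
#                 [["Donald Trump"]], [[["per", 1.0]]]]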
logger.info(f"kbqa_input: {kbqa_input}") diff --git a/annotators/kbqa/test_kbqa.py b/annotators/kbqa/test_kbqa.py index 2073276f4c..f437faf2c2 100644 --- a/annotators/kbqa/test_kbqa.py +++ b/annotators/kbqa/test_kbqa.py @@ -5,8 +5,8 @@ def main(): url = "http://0.0.0.0:8072/model" request_data = [ - {"x_init": ["Who is Donald Trump?"], "entities": [["Donald Trump"]]}, - {"x_init": ["How old is Donald Trump?"], "entities": [["Donald Trump"]]}, + {"x_init": ["Who is Donald Trump?"], "entities": [["Donald Trump"]], "entity_tags": [[["per", 1.0]]]}, + {"x_init": ["How old is Donald Trump?"], "entities": [["Donald Trump"]], "entity_tags": [[["per", 1.0]]]}, ] gold_answers = ["Donald Trump is 45th and current president of the United States.", "Donald Trump is 75 years old."] diff --git a/annotators/midas_classification/Dockerfile b/annotators/midas_classification/Dockerfile index 0afff9a0f5..642920fd82 100644 --- a/annotators/midas_classification/Dockerfile +++ b/annotators/midas_classification/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.14.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.14.1 ARG CONFIG ARG SED_ARG=" | " @@ -15,7 +16,6 @@ COPY . /src/ WORKDIR /src -RUN python -m deeppavlov install $CONFIG RUN python -m spacy download en_core_web_sm RUN sed -i "s|$SED_ARG|g" "$CONFIG" diff --git a/annotators/midas_classification/requirements.txt b/annotators/midas_classification/requirements.txt index 90e3297b25..e93c1afffa 100644 --- a/annotators/midas_classification/requirements.txt +++ b/annotators/midas_classification/requirements.txt @@ -6,4 +6,6 @@ gunicorn==19.9.0 numpy==1.17.2 spacy==3.0.6 jinja2<=3.0.3 -Werkzeug<=2.0.3 \ No newline at end of file +Werkzeug<=2.0.3 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/annotators/sentseg_ru/Dockerfile b/annotators/sentseg_ru/Dockerfile index 1908d8c330..6150cd6e30 100644 --- a/annotators/sentseg_ru/Dockerfile +++ b/annotators/sentseg_ru/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.17.2 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.17.2 ARG CONFIG ARG SED_ARG=" | " @@ -15,7 +16,6 @@ COPY . 
/src/ WORKDIR /src RUN pip install pymorphy2==0.9.1 -RUN python -m deeppavlov install $CONFIG RUN python -m spacy download ru_core_news_sm RUN sed -i "s|$SED_ARG|g" "$CONFIG" diff --git a/annotators/sentseg_ru/requirements.txt b/annotators/sentseg_ru/requirements.txt index 87708bd4d1..9650f64b60 100644 --- a/annotators/sentseg_ru/requirements.txt +++ b/annotators/sentseg_ru/requirements.txt @@ -6,3 +6,7 @@ requests==2.22.0 spacy==3.2.0 jinja2<=3.0.3 Werkzeug<=2.0.3 +transformers==4.6.0 +torch==1.6.0 +torchvision==0.7.0 +cryptography==2.8 \ No newline at end of file diff --git a/annotators/speech_function_classifier/models.py b/annotators/speech_function_classifier/models.py index 6aa2ffa0e7..76e1bc6b57 100644 --- a/annotators/speech_function_classifier/models.py +++ b/annotators/speech_function_classifier/models.py @@ -22,7 +22,7 @@ cuda_is_available = torch.cuda.is_available() -with open("data/res_cor.json") as data: +with open("/models/res_cor.json") as data: res_cor = json.load(data) with open("/models/track_list.txt") as track_list: diff --git a/annotators/spelling_preprocessing_ru/Dockerfile b/annotators/spelling_preprocessing_ru/Dockerfile index b680e623d5..e7d444238d 100644 --- a/annotators/spelling_preprocessing_ru/Dockerfile +++ b/annotators/spelling_preprocessing_ru/Dockerfile @@ -25,14 +25,12 @@ ENV PORT=$PORT COPY ./annotators/spelling_preprocessing_ru/requirements.txt /src/requirements.txt RUN pip install -r /src/requirements.txt -RUN pip install git+https://github.com/deepmipt/DeepPavlov.git@${COMMIT} +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@${COMMIT} COPY $SRC_DIR /src WORKDIR /src -RUN python -m deeppavlov install $CONFIG - RUN sed -i "s|$SED_ARG|g" "$CONFIG" CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8074 diff --git a/annotators/spelling_preprocessing_ru/requirements.txt b/annotators/spelling_preprocessing_ru/requirements.txt index 7582d4cb08..0327ef4901 100644 --- a/annotators/spelling_preprocessing_ru/requirements.txt +++ b/annotators/spelling_preprocessing_ru/requirements.txt @@ -2,4 +2,7 @@ sentry-sdk[flask]==0.14.1 flask==1.1.1 gunicorn==19.9.0 requests==2.22.0 -itsdangerous==2.0.1 \ No newline at end of file +itsdangerous==2.0.1 +sortedcontainers==2.1.0 +git+https://github.com/kpu/kenlm.git@96d303cfb1a0c21b8f060dbad640d7ab301c019a#egg=kenlm +cryptography==2.8 \ No newline at end of file diff --git a/assistant_dists/deepy_adv/README.md b/assistant_dists/deepy_adv/README.md index ac7a2e1698..b28ec81c9d 100644 --- a/assistant_dists/deepy_adv/README.md +++ b/assistant_dists/deepy_adv/README.md @@ -8,10 +8,10 @@ Deepy was inspired by Gerty 3000, a moonbase A.I. Assistant from the Moon Movie ![img](https://cdn-images-1.medium.com/max/800/0*HarsFmC8UKJBaNU6.jpg) ## Learn More About Deepy -Official wiki is located here: [Deepy Wiki](https://github.com/deepmipt/assistant-base/wiki). +Official wiki is located here: [Deepy Wiki](https://github.com/deeppavlovteam/assistant-base/wiki). ## Distributions -You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deepmipt/assistant-base/wiki/Distributions) +You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deeppavlovteam/assistant-base/wiki/Distributions) # Quick Demo 1. 
Clone repository diff --git a/assistant_dists/deepy_base/README.md b/assistant_dists/deepy_base/README.md index ac7a2e1698..b28ec81c9d 100644 --- a/assistant_dists/deepy_base/README.md +++ b/assistant_dists/deepy_base/README.md @@ -8,10 +8,10 @@ Deepy was inspired by Gerty 3000, a moonbase A.I. Assistant from the Moon Movie ![img](https://cdn-images-1.medium.com/max/800/0*HarsFmC8UKJBaNU6.jpg) ## Learn More About Deepy -Official wiki is located here: [Deepy Wiki](https://github.com/deepmipt/assistant-base/wiki). +Official wiki is located here: [Deepy Wiki](https://github.com/deeppavlovteam/assistant-base/wiki). ## Distributions -You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deepmipt/assistant-base/wiki/Distributions) +You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deeppavlovteam/assistant-base/wiki/Distributions) # Quick Demo 1. Clone repository diff --git a/assistant_dists/deepy_faq/README.md b/assistant_dists/deepy_faq/README.md index ac7a2e1698..b28ec81c9d 100644 --- a/assistant_dists/deepy_faq/README.md +++ b/assistant_dists/deepy_faq/README.md @@ -8,10 +8,10 @@ Deepy was inspired by Gerty 3000, a moonbase A.I. Assistant from the Moon Movie ![img](https://cdn-images-1.medium.com/max/800/0*HarsFmC8UKJBaNU6.jpg) ## Learn More About Deepy -Official wiki is located here: [Deepy Wiki](https://github.com/deepmipt/assistant-base/wiki). +Official wiki is located here: [Deepy Wiki](https://github.com/deeppavlovteam/assistant-base/wiki). ## Distributions -You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deepmipt/assistant-base/wiki/Distributions) +You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deeppavlovteam/assistant-base/wiki/Distributions) # Quick Demo 1. Clone repository diff --git a/assistant_dists/deepy_gobot_base/README.md b/assistant_dists/deepy_gobot_base/README.md index ac7a2e1698..b28ec81c9d 100644 --- a/assistant_dists/deepy_gobot_base/README.md +++ b/assistant_dists/deepy_gobot_base/README.md @@ -8,10 +8,10 @@ Deepy was inspired by Gerty 3000, a moonbase A.I. Assistant from the Moon Movie ![img](https://cdn-images-1.medium.com/max/800/0*HarsFmC8UKJBaNU6.jpg) ## Learn More About Deepy -Official wiki is located here: [Deepy Wiki](https://github.com/deepmipt/assistant-base/wiki). +Official wiki is located here: [Deepy Wiki](https://github.com/deeppavlovteam/assistant-base/wiki). ## Distributions -You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deepmipt/assistant-base/wiki/Distributions) +You can find distributions in the /assistant_dists subdirectory of the repository. Learn more about distributions here: [Distributions](https://github.com/deeppavlovteam/assistant-base/wiki/Distributions) # Quick Demo 1. 
Clone repository diff --git a/assistant_dists/dream/cpu.yml b/assistant_dists/dream/cpu.yml index eaf049edb1..1e026ce4eb 100644 --- a/assistant_dists/dream/cpu.yml +++ b/assistant_dists/dream/cpu.yml @@ -27,9 +27,6 @@ services: combined-classification: environment: CUDA_VISIBLE_DEVICES: "" - masked-lm: - environment: - CUDA_VISIBLE_DEVICES: "" text-qa: environment: CUDA_VISIBLE_DEVICES: "" diff --git a/assistant_dists/dream/dev.yml b/assistant_dists/dream/dev.yml index eb2da6c8b2..3fb0f8df13 100755 --- a/assistant_dists/dream/dev.yml +++ b/assistant_dists/dream/dev.yml @@ -246,11 +246,6 @@ services: - "./common:/src/common" ports: - 8086:8086 - masked-lm: - volumes: - - "./services/masked_lm:/src" - ports: - - 8088:8088 entity-storer: volumes: - "./common:/src/common" diff --git a/assistant_dists/dream/docker-compose.override.yml b/assistant_dists/dream/docker-compose.override.yml index 89f468c9ef..171f4a91ed 100644 --- a/assistant_dists/dream/docker-compose.override.yml +++ b/assistant_dists/dream/docker-compose.override.yml @@ -12,7 +12,7 @@ services: comet-conceptnet:8065, news-api-skill:8066, dff-short-story-skill:8057, factoid-qa:8071, kbqa:8072, spelling-preprocessing:8074, entity-linking:8075, wiki-parser:8077, text-qa:8078, knowledge-grounding:8083, combined-classification:8087, knowledge-grounding-skill:8085, - dff-friendship-skill:8086, masked-lm:8088, entity-storer:8089, + dff-friendship-skill:8086, entity-storer:8089, dff-book-skill:8032, dff-grounding-skill:8080, dff-animals-skill:8094, dff-travel-skill:8096, dff-food-skill:8097, dff-sport-skill:8098, dff-science-skill:8101, midas-classification:8090, fact-random:8119, fact-retrieval:8100, @@ -124,9 +124,9 @@ services: deploy: resources: limits: - memory: 256M + memory: 100M reservations: - memory: 256M + memory: 100M sentrewrite: env_file: [.env] @@ -513,10 +513,9 @@ services: env_file: [.env] build: args: - CONFIG: kbqa_entity_linking_page.json + CONFIG: entity_linking_eng.json PORT: 8075 SRC_DIR: annotators/entity_linking - COMMIT: 5b99ac3392e8e178e2bb4f9b218d4ddb2ec2e242 context: ./ dockerfile: annotators/entity_linking/Dockerfile environment: @@ -683,24 +682,6 @@ services: reservations: memory: 256M - masked-lm: - env_file: [.env] - build: - context: ./services/masked_lm/ - args: - SERVICE_PORT: 8088 - PRETRAINED_MODEL_NAME_OR_PATH: "bert-base-uncased" - command: flask run -h 0.0.0.0 -p 8088 - environment: - - CUDA_VISIBLE_DEVICES=0 - - FLASK_APP=server - deploy: - resources: - limits: - memory: 2.5G - reservations: - memory: 2.5G - entity-storer: env_file: [.env] build: @@ -1079,7 +1060,7 @@ services: command: flask run -h 0.0.0.0 -p 8103 environment: - FLASK_APP=server - - CUDA_VISIBLE_DEVICES=7 + - CUDA_VISIBLE_DEVICES=0 deploy: resources: limits: diff --git a/assistant_dists/dream/gpu1.yml b/assistant_dists/dream/gpu1.yml index 82c63a5a75..1d95e8a5ae 100644 --- a/assistant_dists/dream/gpu1.yml +++ b/assistant_dists/dream/gpu1.yml @@ -121,10 +121,6 @@ services: restart: unless-stopped dff-grounding-skill: restart: unless-stopped - masked-lm: - restart: unless-stopped - environment: - - CUDA_VISIBLE_DEVICES=8 dff-friendship-skill: restart: unless-stopped entity-storer: diff --git a/assistant_dists/dream/proxy.yml b/assistant_dists/dream/proxy.yml index b654ee4839..4031e95e47 100644 --- a/assistant_dists/dream/proxy.yml +++ b/assistant_dists/dream/proxy.yml @@ -315,15 +315,6 @@ services: - PROXY_PASS=dream.deeppavlov.ai:8086 - PORT=8086 - masked-lm: - command: ["nginx", "-g", "daemon off;"] - build: - context: dp/proxy/ 
- dockerfile: Dockerfile - environment: - - PROXY_PASS=dream.deeppavlov.ai:8088 - - PORT=8088 - entity-storer: command: ["nginx", "-g", "daemon off;"] build: diff --git a/assistant_dists/dream/test.yml b/assistant_dists/dream/test.yml index 72fdef5c46..fe884d598d 100644 --- a/assistant_dists/dream/test.yml +++ b/assistant_dists/dream/test.yml @@ -58,11 +58,8 @@ services: - "~/.deeppavlov:/root/.deeppavlov" environment: - CUDA_VISIBLE_DEVICES=7 - eliza: convert-reddit: personal-info-skill: - asr: - misheard-asr: dff-book-skill: dff-weather-skill: emotion-skill: @@ -84,9 +81,6 @@ services: - "~/.deeppavlov:/root/.deeppavlov" spelling-preprocessing: dff-grounding-skill: - masked-lm: - environment: - - CUDA_VISIBLE_DEVICES=7 dff-friendship-skill: entity-storer: knowledge-grounding-skill: @@ -124,14 +118,11 @@ services: - CUDA_VISIBLE_DEVICES=8 dff-coronavirus-skill: dff-short-story-skill: - midas-predictor: - environment: - - CUDA_VISIBLE_DEVICES=6 dialogpt: environment: - CUDA_VISIBLE_DEVICES=6 infilling: environment: - - CUDA_VISIBLE_DEVICES=8 + - CUDA_VISIBLE_DEVICES=7 dff-template-skill: version: '3.7' diff --git a/assistant_dists/dream_mini/cpu.yml b/assistant_dists/dream_mini/cpu.yml index 4f4672ad82..d073539763 100644 --- a/assistant_dists/dream_mini/cpu.yml +++ b/assistant_dists/dream_mini/cpu.yml @@ -1,14 +1,14 @@ version: '3.7' services: - convers-evaluator-annotator: + dialogpt: environment: DEVICE: cpu CUDA_VISIBLE_DEVICES: "" - dialogpt: + intent-catcher: environment: DEVICE: cpu CUDA_VISIBLE_DEVICES: "" - intent-catcher: + sentence-ranker: environment: DEVICE: cpu CUDA_VISIBLE_DEVICES: "" diff --git a/assistant_dists/dream_mini/dev.yml b/assistant_dists/dream_mini/dev.yml index b40cacc257..457ef846d7 100644 --- a/assistant_dists/dream_mini/dev.yml +++ b/assistant_dists/dream_mini/dev.yml @@ -5,12 +5,6 @@ services: - ".:/dp-agent" ports: - 4242:4242 - convers-evaluator-annotator: - volumes: - - "./annotators/ConversationEvaluator:/src" - - "~/.deeppavlov:/root/.deeppavlov" - ports: - - 8004:8004 dff-program-y-skill: volumes: - "./skills/dff_program_y_skill:/src" @@ -57,4 +51,9 @@ services: - "./services/dialogpt:/src" ports: - 8125:8125 + sentence-ranker: + volumes: + - "./services/sentence_ranker:/src" + ports: + - 8128:8128 version: "3.7" diff --git a/assistant_dists/dream_mini/docker-compose.override.yml b/assistant_dists/dream_mini/docker-compose.override.yml index 15f02554e0..9d6be70ef4 100644 --- a/assistant_dists/dream_mini/docker-compose.override.yml +++ b/assistant_dists/dream_mini/docker-compose.override.yml @@ -2,29 +2,10 @@ services: agent: command: sh -c 'bin/wait && python -m deeppavlov_agent.run agent.pipeline_config=assistant_dists/dream_mini/pipeline_conf.json' environment: - WAIT_HOSTS: "convers-evaluator-annotator:8004, dff-program-y-skill:8008, sentseg:8011, convers-evaluation-selector:8009, + WAIT_HOSTS: "dff-program-y-skill:8008, sentseg:8011, convers-evaluation-selector:8009, dff-intent-responder-skill:8012, intent-catcher:8014, badlisted-words:8018, - spelling-preprocessing:8074, dialogpt:8125" + spelling-preprocessing:8074, dialogpt:8125, sentence-ranker:8128" WAIT_HOSTS_TIMEOUT: ${WAIT_TIMEOUT:-480} - convers-evaluator-annotator: - env_file: [.env] - build: - args: - CONFIG: conveval.json - PORT: 8004 - DATA_URL: https://files.deeppavlov.ai/alexaprize_data/cobot_conveval2.tar.gz - context: . 
- dockerfile: ./annotators/ConversationEvaluator/Dockerfile - environment: - - CUDA_VISIBLE_DEVICES=0 - deploy: - mode: replicated - replicas: 1 - resources: - limits: - memory: 2G - reservations: - memory: 2G dff-program-y-skill: env_file: [.env] @@ -173,4 +154,22 @@ services: reservations: memory: 2G + sentence-ranker: + env_file: [ .env ] + build: + args: + SERVICE_PORT: 8128 + PRETRAINED_MODEL_NAME_OR_PATH: sentence-transformers/bert-base-nli-mean-tokens + context: ./services/sentence_ranker/ + command: flask run -h 0.0.0.0 -p 8128 + environment: + - CUDA_VISIBLE_DEVICES=0 + - FLASK_APP=server + deploy: + resources: + limits: + memory: 3G + reservations: + memory: 3G + version: '3.7' diff --git a/assistant_dists/dream_mini/pipeline_conf.json b/assistant_dists/dream_mini/pipeline_conf.json index 930ddb471b..cbed0bddc6 100644 --- a/assistant_dists/dream_mini/pipeline_conf.json +++ b/assistant_dists/dream_mini/pipeline_conf.json @@ -202,13 +202,13 @@ ], "state_manager_method": "add_hypothesis_annotation_batch" }, - "convers_evaluator_annotator": { + "sentence_ranker": { "connector": { "protocol": "http", "timeout": 1, - "url": "http://convers-evaluator-annotator:8004/batch_model" + "url": "http://sentence-ranker:8128/respond" }, - "dialog_formatter": "state_formatters.dp_formatters:convers_evaluator_annotator_formatter", + "dialog_formatter": "state_formatters.dp_formatters:sentence_ranker_formatter", "response_formatter": "state_formatters.dp_formatters:simple_formatter_service", "previous_services": ["skills"], "state_manager_method": "add_hypothesis_annotation_batch" diff --git a/assistant_dists/dream_mini/proxy.yml b/assistant_dists/dream_mini/proxy.yml index 8944ef9368..ffdfa47ea6 100644 --- a/assistant_dists/dream_mini/proxy.yml +++ b/assistant_dists/dream_mini/proxy.yml @@ -1,12 +1,4 @@ services: - convers-evaluator-annotator: - command: ["nginx", "-g", "daemon off;"] - build: - context: dp/proxy/ - dockerfile: Dockerfile - environment: - - PROXY_PASS=dream.deeppavlov.ai:8004 - - PORT=8004 dff-program-y-skill: command: ["nginx", "-g", "daemon off;"] @@ -80,4 +72,12 @@ services: - PROXY_PASS=dream.deeppavlov.ai:8125 - PORT=8125 + sentence-ranker: + command: [ "nginx", "-g", "daemon off;" ] + build: + context: dp/proxy/ + dockerfile: Dockerfile + environment: + - PROXY_PASS=dream.deeppavlov.ai:8128 + - PORT=8128 version: '3.7' diff --git a/assistant_dists/dream_sfc/cpu.yml b/assistant_dists/dream_sfc/cpu.yml index d8268f581e..a5362fe5d0 100644 --- a/assistant_dists/dream_sfc/cpu.yml +++ b/assistant_dists/dream_sfc/cpu.yml @@ -27,9 +27,6 @@ services: combined-classification: environment: CUDA_VISIBLE_DEVICES: "" - masked-lm: - environment: - CUDA_VISIBLE_DEVICES: "" text-qa: environment: CUDA_VISIBLE_DEVICES: "" diff --git a/assistant_dists/dream_sfc/dev.yml b/assistant_dists/dream_sfc/dev.yml index 4abfc6812e..ccb16523d7 100755 --- a/assistant_dists/dream_sfc/dev.yml +++ b/assistant_dists/dream_sfc/dev.yml @@ -244,11 +244,6 @@ services: - "./common:/src/common" ports: - 8086:8086 - masked-lm: - volumes: - - "./services/masked_lm:/src" - ports: - - 8088:8088 entity-storer: volumes: - "./common:/src/common" diff --git a/assistant_dists/dream_sfc/docker-compose.override.yml b/assistant_dists/dream_sfc/docker-compose.override.yml index fde44e4a4e..297ba3b171 100644 --- a/assistant_dists/dream_sfc/docker-compose.override.yml +++ b/assistant_dists/dream_sfc/docker-compose.override.yml @@ -12,7 +12,7 @@ services: comet-conceptnet:8065, news-api-skill:8066, dff-short-story-skill:8057, 
factoid-qa:8071, kbqa:8072, spelling-preprocessing:8074, entity-linking:8075, wiki-parser:8077, text-qa:8078, knowledge-grounding:8083, combined-classification:8087, knowledge-grounding-skill:8085, - dff-friendship-skill:8086, masked-lm:8088, entity-storer:8089, + dff-friendship-skill:8086, entity-storer:8089, dff-book-sfc-skill:8034, dff-grounding-skill:8080, dff-animals-skill:8094, dff-travel-skill:8096, dff-food-skill:8097, dff-sport-skill:8098, dff-science-skill:8101, midas-classification:8090, fact-random:8119, fact-retrieval:8100, @@ -659,24 +659,6 @@ services: reservations: memory: 256M - masked-lm: - env_file: [.env] - build: - context: ./services/masked_lm/ - args: - SERVICE_PORT: 8088 - PRETRAINED_MODEL_NAME_OR_PATH: "bert-base-uncased" - command: flask run -h 0.0.0.0 -p 8088 - environment: - - CUDA_VISIBLE_DEVICES=0 - - FLASK_APP=server - deploy: - resources: - limits: - memory: 2.5G - reservations: - memory: 2.5G - entity-storer: env_file: [.env] build: diff --git a/assistant_dists/dream_sfc/gpu1.yml b/assistant_dists/dream_sfc/gpu1.yml index 6b81123b0a..a5c7930c5c 100644 --- a/assistant_dists/dream_sfc/gpu1.yml +++ b/assistant_dists/dream_sfc/gpu1.yml @@ -119,10 +119,6 @@ services: restart: unless-stopped dff-grounding-skill: restart: unless-stopped - masked-lm: - restart: unless-stopped - environment: - - CUDA_VISIBLE_DEVICES=8 dff-friendship-skill: restart: unless-stopped entity-storer: diff --git a/assistant_dists/dream_sfc/proxy.yml b/assistant_dists/dream_sfc/proxy.yml index 35732fc5ce..10b97ee525 100644 --- a/assistant_dists/dream_sfc/proxy.yml +++ b/assistant_dists/dream_sfc/proxy.yml @@ -315,15 +315,6 @@ services: - PROXY_PASS=dream.deeppavlov.ai:8086 - PORT=8086 - masked-lm: - command: ["nginx", "-g", "daemon off;"] - build: - context: dp/proxy/ - dockerfile: Dockerfile - environment: - - PROXY_PASS=dream.deeppavlov.ai:8088 - - PORT=8088 - entity-storer: command: ["nginx", "-g", "daemon off;"] build: diff --git a/assistant_dists/dream_sfc/test.yml b/assistant_dists/dream_sfc/test.yml index 82cd9ba9f4..03b8e28fcf 100644 --- a/assistant_dists/dream_sfc/test.yml +++ b/assistant_dists/dream_sfc/test.yml @@ -85,9 +85,6 @@ services: - "~/.deeppavlov:/root/.deeppavlov" spelling-preprocessing: dff-grounding-skill: - masked-lm: - environment: - - CUDA_VISIBLE_DEVICES=7 dff-friendship-skill: entity-storer: knowledge-grounding-skill: diff --git a/common/dialogflow_framework/requirements.txt b/common/dialogflow_framework/requirements.txt index de00d17fcc..c41148450a 100644 --- a/common/dialogflow_framework/requirements.txt +++ b/common/dialogflow_framework/requirements.txt @@ -6,7 +6,7 @@ gunicorn==19.9.0 healthcheck==1.3.3 # dialogflow framework programy==4.3 -git+https://github.com/deepmipt/dialog_flow_engine.git@3a2e3e5d99cd3090c8f72315885dc91d398f2d74 +git+https://github.com/deeppavlovteam/dialog_flow_engine.git@3a2e3e5d99cd3090c8f72315885dc91d398f2d74 # test jinja2<=3.0.3 Werkzeug<=2.0.3 diff --git a/common/utils.py b/common/utils.py index c8dd4864ae..4dfd16b2de 100644 --- a/common/utils.py +++ b/common/utils.py @@ -1208,18 +1208,6 @@ def is_special_factoid_question(annotated_utterance): ) -def get_conv_eval_annotations(annotated_utterance): - default_conv_eval = { - "isResponseOnTopic": 0.0, - "isResponseInteresting": 0.0, - "responseEngagesUser": 0.0, - "isResponseComprehensible": 0.0, - "isResponseErroneous": 0.0, - } - - return annotated_utterance.get("annotations", {}).get("convers_evaluator_annotator", default_conv_eval) - - def 
get_dialog_breakdown_annotations(annotated_utterance): breakdown = annotated_utterance.get("annotations", {}).get("dialog_breakdown", {}).get("breakdown", 0.0) > 0.5 return breakdown diff --git a/dockerfile_agent b/dockerfile_agent index 69f1ff8bdf..6054414e22 100644 --- a/dockerfile_agent +++ b/dockerfile_agent @@ -22,7 +22,7 @@ WORKDIR /dp-agent RUN mkdir /pavlov && \ cd /pavlov && \ - git clone https://github.com/deepmipt/DeepPavlov && \ + git clone https://github.com/deeppavlovteam/DeepPavlov && \ cd DeepPavlov && \ pip install -e . diff --git a/dp/dockerfile_skill b/dp/dockerfile_skill index 79de7dc040..6577ce4303 100644 --- a/dp/dockerfile_skill +++ b/dp/dockerfile_skill @@ -14,7 +14,7 @@ WORKDIR dp-agent COPY . /base/dp-agent -RUN python -m deeppavlov install $CONFIG && \ - sed -i "/uvicorn.run/s/app,/app, timeout_keep_alive=20,/g" "/base/DeepPavlov/deeppavlov/utils/server/server.py" +# RUN python -m deeppavlov install $CONFIG +RUN sed -i "/uvicorn.run/s/app,/app, timeout_keep_alive=20,/g" "/base/DeepPavlov/deeppavlov/utils/server/server.py" CMD python -m deeppavlov riseapi $CONFIG -p $PORT -d \ No newline at end of file diff --git a/dp/dockerfile_skill_cpu b/dp/dockerfile_skill_cpu index 6c32c43f63..2a1f25e006 100644 --- a/dp/dockerfile_skill_cpu +++ b/dp/dockerfile_skill_cpu @@ -1,4 +1,5 @@ FROM deeppavlov/base-cpu:0.6.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.6.1 ARG skillconfig ARG skillport @@ -27,7 +28,7 @@ RUN python -m deeppavlov download $CONFIG RUN mkdir -p /dp-agent/$DIR COPY $DIR /dp-agent/$DIR -RUN python -m deeppavlov install $CONFIG +# RUN python -m deeppavlov install $CONFIG COPY dp/ /dp-agent/dp RUN python dp/dp_server_config.py diff --git a/dp/dockerfile_skill_gpu b/dp/dockerfile_skill_gpu index 5575111e18..a4c1ad87b6 100644 --- a/dp/dockerfile_skill_gpu +++ b/dp/dockerfile_skill_gpu @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 ARG CONFIG ARG COMMIT=0.13.0 @@ -24,5 +25,5 @@ WORKDIR /src RUN sed -i "s|$SED_ARG|g" "$CONFIG" -RUN python -m deeppavlov install $CONFIG +# RUN python -m deeppavlov install $CONFIG CMD python -m deeppavlov riseapi $CONFIG -p $PORT -d diff --git a/response_selectors/convers_evaluation_based_selector/server.py b/response_selectors/convers_evaluation_based_selector/server.py index 1c02c45616..dfcaa22d7f 100644 --- a/response_selectors/convers_evaluation_based_selector/server.py +++ b/response_selectors/convers_evaluation_based_selector/server.py @@ -20,17 +20,14 @@ low_priority_intents, substitute_nonwords, is_toxic_or_badlisted_utterance, - get_conv_eval_annotations, ) from tag_based_selection import tag_based_response_selection from utils import ( add_question_to_statement, lower_duplicates_score, lower_retrieve_skills_confidence_if_scenario_exist, - calculate_single_convers_evaluator_score, + calculate_single_evaluator_score, downscore_toxic_badlisted_responses, - CONV_EVAL_STRENGTH, - CONFIDENCE_STRENGTH, how_are_you_spec, what_i_can_do_spec, misheard_with_spec1, @@ -104,7 +101,9 @@ def respond(): ) logger.info(msg) - curr_scores += [get_conv_eval_annotations(skill_data)] + curr_scores += [ + calculate_single_evaluator_score(skill_data.get("annotations"), skill_data["confidence"]) + ] curr_is_toxics = np.array(curr_is_toxics) curr_scores = np.array(curr_scores) @@ -293,24 +292,17 @@ def rule_score_based_selection(dialog, candidates, scores, confidences, is_toxic dummy_question_human_attr = candidates[i].get("human_attributes", 
{}) if curr_score is None: - cand_scores = scores[i] + score = scores[i] confidence = confidences[i] skill_name = skill_names[i] - score_conv_eval = calculate_single_convers_evaluator_score(cand_scores) - score = CONV_EVAL_STRENGTH * score_conv_eval + CONFIDENCE_STRENGTH * confidence logger.info( - f"Skill {skill_name} has final score: {score}. Confidence: {confidence}. " - f"Toxicity: {is_toxics[i]}. Cand scores: {cand_scores}" + f"Skill {skill_name} has final score: {score}. Confidence: {confidence}. " f"Toxicity: {is_toxics[i]}" ) curr_single_scores.append(score) else: - cand_scores = scores[i] + score = scores[i] skill_name = skill_names[i] - score_conv_eval = calculate_single_convers_evaluator_score(cand_scores) - score = CONV_EVAL_STRENGTH * score_conv_eval + curr_score - logger.info( - f"Skill {skill_name} has final score: {score}. " f"Toxicity: {is_toxics[i]}. Cand scores: {cand_scores}" - ) + logger.info(f"Skill {skill_name} has final score: {score}. " f"Toxicity: {is_toxics[i]}") curr_single_scores.append(score) highest_conf_exist = True if any(confidences >= 1.0) else False diff --git a/response_selectors/convers_evaluation_based_selector/tag_based_selection.py b/response_selectors/convers_evaluation_based_selector/tag_based_selection.py index 8b0b42233b..d13c4c173e 100644 --- a/response_selectors/convers_evaluation_based_selector/tag_based_selection.py +++ b/response_selectors/convers_evaluation_based_selector/tag_based_selection.py @@ -28,9 +28,6 @@ get_dialog_breakdown_annotations, ) from utils import ( - calculate_single_convers_evaluator_score, - CONV_EVAL_STRENGTH, - CONFIDENCE_STRENGTH, how_are_you_spec, what_i_can_do_spec, misheard_with_spec1, @@ -251,28 +248,6 @@ def acknowledgement_decision(all_user_intents): return False -def compute_curr_single_scores(candidates, scores, confidences): - curr_single_scores = [] - if all(["hypothesis_scorer" in cand["annotations"] for cand in candidates]): - for i in range(len(candidates)): - curr_single_scores.append(candidates[i]["annotations"]["hypothesis_scorer"]) - else: - for i in range(len(scores)): - cand_scores = scores[i] - confidence = confidences[i] - skill_name = candidates[i]["skill_name"] - if all(["dialogrpt" in cand["annotations"] for cand in candidates]): - score_conv_eval = candidates[i]["annotations"]["dialogrpt"] - else: - score_conv_eval = calculate_single_convers_evaluator_score(cand_scores) - score = CONV_EVAL_STRENGTH * score_conv_eval + CONFIDENCE_STRENGTH * confidence - - logger.info(f"Skill {skill_name} has final score: {score}. 
Confidence: {confidence}.") - curr_single_scores.append(score) - - return curr_single_scores - - def add_to_top1_category(cand_id, categorized, _is_require_action_intent): if _is_require_action_intent: categorized["active_same_topic_entity_no_db_reqda"].append(cand_id) @@ -351,7 +326,9 @@ def rule_based_prioritization(cand_uttr, dialog): return flag -def tag_based_response_selection(dialog, candidates, scores, confidences, bot_utterances, all_prev_active_skills=None): +def tag_based_response_selection( + dialog, candidates, curr_single_scores, confidences, bot_utterances, all_prev_active_skills=None +): all_prev_active_skills = all_prev_active_skills if all_prev_active_skills is not None else [] all_prev_active_skills = Counter(all_prev_active_skills) annotated_uttr = dialog["human_utterances"][-1] @@ -423,21 +400,17 @@ def tag_based_response_selection(dialog, candidates, scores, confidences, bot_ut if confidences[cand_id] == 0.0 and cand_uttr["skill_name"] not in ACTIVE_SKILLS: logger.info(f"Dropping cand_id: {cand_id} due to toxicity/badlists") continue + skill_name = cand_uttr["skill_name"] + confidence = confidences[cand_id] + score = curr_single_scores[cand_id] + logger.info(f"Skill {skill_name} has final score: {score}. Confidence: {confidence}.") all_cand_intents, all_cand_topics, all_cand_named_entities, all_cand_nounphrases = get_main_info_annotations( cand_uttr ) skill_name = cand_uttr["skill_name"] _is_dialog_abandon = get_dialog_breakdown_annotations(cand_uttr) and PRIORITIZE_NO_DIALOG_BREAKDOWN - _is_just_prompt = ( - cand_uttr["skill_name"] == "dummy_skill" - and any( - [ - question_type in cand_uttr.get("type", "") - for question_type in ["normal_question", "link_to_for_response_selector"] - ] - ) - ) or cand_uttr.get("response_parts", []) == ["prompt"] + _is_just_prompt = cand_uttr.get("response_parts", []) == ["prompt"] if cand_uttr["confidence"] == 1.0: # for those hypotheses where developer forgot to set tag to MUST_CONTINUE cand_uttr["can_continue"] = MUST_CONTINUE @@ -646,7 +619,6 @@ def tag_based_response_selection(dialog, candidates, scores, confidences, bot_ut logger.info(f"Current CASE: {CASE}") # now compute current scores as one float value - curr_single_scores = compute_curr_single_scores(candidates, scores, confidences) # remove disliked skills from hypotheses if IGNORE_DISLIKED_SKILLS: diff --git a/response_selectors/convers_evaluation_based_selector/utils.py b/response_selectors/convers_evaluation_based_selector/utils.py index 81a2d60bc5..44ae0490f7 100644 --- a/response_selectors/convers_evaluation_based_selector/utils.py +++ b/response_selectors/convers_evaluation_based_selector/utils.py @@ -103,7 +103,7 @@ def lower_duplicates_score(candidates, bot_utt_counter, scores, confidences): # no penalties for repeat intent if cand["skill_name"] == "dff_intent_responder_skill" and "#+#repeat" in cand["text"]: continue - # TODO: remove the quick fix of gcs petitions, issue is https://github.com/deepmipt/assistant/issues/80 + # TODO: remove the quick fix of gcs petitions, issue is https://github.com/deeppavlovteam/assistant/issues/80 if cand["skill_name"] in ["game_cooperative_skill", "news_api_skill", "dff_movie_skill"]: continue @@ -119,8 +119,7 @@ def lower_duplicates_score(candidates, bot_utt_counter, scores, confidences): # apply penalties to non-script skills and in case if response consists only from duplicates if confidences[i] < 1.0 or n_duplicates == len(cand_sents): confidences[i] /= coeff - scores[i]["isResponseInteresting"] /= coeff - 
scores[i]["responseEngagesUser"] /= coeff + scores[i] /= coeff def lower_retrieve_skills_confidence_if_scenario_exist(candidates, scores, confidences): @@ -134,33 +133,42 @@ def lower_retrieve_skills_confidence_if_scenario_exist(candidates, scores, confi for i, cand in enumerate(candidates): if cand["skill_name"] in retrieve_skills: confidences[i] *= lower_coeff - scores[i]["isResponseInteresting"] *= lower_coeff - - -def calculate_single_convers_evaluator_score(cand_scores): - score_conv_eval = sum( - [ - cand_scores["isResponseOnTopic"], - cand_scores["isResponseInteresting"], - cand_scores["responseEngagesUser"], - cand_scores["isResponseComprehensible"], - ] - ) - score_conv_eval -= cand_scores["isResponseErroneous"] - return score_conv_eval + scores[i] *= lower_coeff + + +def calculate_single_evaluator_score(hypothesis_annotations, confidence): + if "convers_evaluator_annotator" in hypothesis_annotations: + cand_scores = hypothesis_annotations["convers_evaluator_annotator"] + score_conv_eval = sum( + [ + cand_scores["isResponseOnTopic"], + cand_scores["isResponseInteresting"], + cand_scores["responseEngagesUser"], + cand_scores["isResponseComprehensible"], + ] + ) + score_conv_eval -= cand_scores["isResponseErroneous"] + score = CONV_EVAL_STRENGTH * score_conv_eval + CONFIDENCE_STRENGTH * confidence + return score + elif "dialogrpt" in hypothesis_annotations: + score_conv_eval = hypothesis_annotations["dialogrpt"] + score = CONV_EVAL_STRENGTH * score_conv_eval + CONFIDENCE_STRENGTH * confidence + return score + elif "sentence_ranker" in hypothesis_annotations: + score_conv_eval = hypothesis_annotations["sentence_ranker"] + score = CONV_EVAL_STRENGTH * score_conv_eval + CONFIDENCE_STRENGTH * confidence + return score + elif "hypothesis_scorer" in hypothesis_annotations: + return hypothesis_annotations["hypothesis_scorer"] + else: + return 0.0 def downscore_toxic_badlisted_responses(scores, confidences, is_toxics): # exclude toxic messages and messages with badlisted phrases ids = np.arange(len(confidences))[is_toxics] logger.info(f"Bot excluded utterances: {ids}. 
is_toxics: {is_toxics}") - scores[ids] = { - "isResponseOnTopic": 0.0, - "isResponseInteresting": 0.0, - "responseEngagesUser": 0.0, - "isResponseComprehensible": 0.0, - "isResponseErroneous": 1.0, - } + scores[ids] = 0.0 confidences[ids] = 0.0 return len(ids), scores, confidences diff --git a/services/dialogpt/server.py b/services/dialogpt/server.py index 5ffe5e200d..53c8dd9c63 100644 --- a/services/dialogpt/server.py +++ b/services/dialogpt/server.py @@ -43,10 +43,14 @@ logging.getLogger("werkzeug").setLevel("WARNING") -def generate_responses(context, model, tokenizer): +def generate_responses(context, model, tokenizer, continue_last_uttr=False): encoded_context = [] - for uttr in context[-MAX_HISTORY_DEPTH:]: + for uttr in context[-MAX_HISTORY_DEPTH:-1]: encoded_context += [tokenizer.encode(uttr + " " + tokenizer.eos_token, return_tensors="pt")] + if continue_last_uttr: + encoded_context += [tokenizer.encode(context[-1] + " ", return_tensors="pt")] + else: + encoded_context += [tokenizer.encode(context[-1] + " " + tokenizer.eos_token, return_tensors="pt")] bot_input_ids = torch.cat(encoded_context, dim=-1) with torch.no_grad(): @@ -93,3 +97,32 @@ def respond(): total_time = time.time() - st_time logger.info(f"dialogpt exec time: {total_time:.3f}s") return jsonify(list(zip(responses, confidences))) + + +@app.route("/continue", methods=["POST"]) +def continue_last_uttr(): + st_time = time.time() + contexts = request.json.get("utterances_histories", []) + + try: + responses = [] + for context in contexts: + curr_responses = [] + outputs = generate_responses(context, model, tokenizer, continue_last_uttr=True) + for response in outputs: + if len(response) > 3: + # drop too short responses + curr_responses += [response] + else: + curr_responses += [""] + + responses += [curr_responses] + + except Exception as exc: + logger.exception(exc) + sentry_sdk.capture_exception(exc) + responses = [[""]] * len(contexts) + + total_time = time.time() - st_time + logger.info(f"dialogpt continue exec time: {total_time:.3f}s") + return jsonify(responses) diff --git a/services/dialogpt/test.py b/services/dialogpt/test.py index 49dfe94518..c445d8d180 100644 --- a/services/dialogpt/test.py +++ b/services/dialogpt/test.py @@ -15,6 +15,15 @@ def test_respond(): len(sample[0]) > 0 and all([len(text) > 0 for text in sample[0]]) and all([conf > 0.0 for conf in sample[1]]) for sample in result ], f"Got\n{result}\n, but expected:\n{gold_result}" + + url = "http://0.0.0.0:8125/continue" + + contexts = [["hi", "hi. how are"], ["let's chat about movies", "cool. 
what movies do you"]] + gold_result = [["I'm good, how are you?", 0.9], ["I like the new one.", 0.9]] + result = requests.post(url, json={"utterances_histories": contexts}).json() + assert [ + all([len(text) > 0 for text in sample]) for sample in result + ], f"Got\n{result}\n, but expected:\n{gold_result}" print("Success") diff --git a/services/knowledge_grounding/requirements.txt b/services/knowledge_grounding/requirements.txt index 1328517d60..ba1fb2c0fc 100644 --- a/services/knowledge_grounding/requirements.txt +++ b/services/knowledge_grounding/requirements.txt @@ -6,3 +6,4 @@ jinja2<=3.0.3 sentry-sdk[flask]==0.14.1 jinja2<=3.0.3 Werkzeug<=2.0.3 +markupsafe==2.0.1 \ No newline at end of file diff --git a/services/sentence_ranker/Dockerfile b/services/sentence_ranker/Dockerfile new file mode 100644 index 0000000000..193c30028d --- /dev/null +++ b/services/sentence_ranker/Dockerfile @@ -0,0 +1,23 @@ +# syntax=docker/dockerfile:experimental + +FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime + +WORKDIR /src + +ARG PRETRAINED_MODEL_NAME_OR_PATH +ENV PRETRAINED_MODEL_NAME_OR_PATH ${PRETRAINED_MODEL_NAME_OR_PATH} +ARG SERVICE_PORT +ENV SERVICE_PORT ${SERVICE_PORT} + +RUN mkdir /data/ + +COPY ./requirements.txt /src/requirements.txt +RUN pip install -r /src/requirements.txt + +RUN python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('${PRETRAINED_MODEL_NAME_OR_PATH}');" +RUN python -c "from transformers import AutoModel; AutoModel.from_pretrained('${PRETRAINED_MODEL_NAME_OR_PATH}');" + +COPY . /src + +CMD gunicorn --workers=1 server:app -b 0.0.0.0:${SERVICE_PORT} --timeout=300 + diff --git a/services/sentence_ranker/README.md b/services/sentence_ranker/README.md new file mode 100644 index 0000000000..bccab841e3 --- /dev/null +++ b/services/sentence_ranker/README.md @@ -0,0 +1,9 @@ +# Sentence Ranker Service + +This is a universal service for evaluation of a sentence pair. + +The model can be selected from HugginFace library and passed as a `PRETRAINED_MODEL_NAME_OR_PATH` parameter. + +The service accepts a batch of sentence pairs (a pair is a list of two strings), and returns a batch of floating point values. + +To rank a list of sentence pairs, one can get floating point values for each pair and maximize the value. 
diff --git a/services/sentence_ranker/requirements.txt b/services/sentence_ranker/requirements.txt new file mode 100644 index 0000000000..3f8f604296 --- /dev/null +++ b/services/sentence_ranker/requirements.txt @@ -0,0 +1,10 @@ +transformers==4.0.1 +sentencepiece==0.1.94 +flask==1.1.1 +gunicorn==19.9.0 +requests==2.22.0 +sentry-sdk[flask]==0.14.1 +scikit-learn==0.21.3 +itsdangerous==2.0.1 +jinja2<=3.0.3 +Werkzeug<=2.0.3 diff --git a/services/sentence_ranker/server.py b/services/sentence_ranker/server.py new file mode 100644 index 0000000000..b847bcd07d --- /dev/null +++ b/services/sentence_ranker/server.py @@ -0,0 +1,95 @@ +import logging +import time +import os + +import sentry_sdk +import torch +from flask import Flask, request, jsonify +from sentry_sdk.integrations.flask import FlaskIntegration +from sklearn.metrics.pairwise import cosine_similarity +from transformers import AutoModel, AutoTokenizer + + +sentry_sdk.init(dsn=os.getenv("SENTRY_DSN"), integrations=[FlaskIntegration()]) + +logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO) +logger = logging.getLogger(__name__) + +PRETRAINED_MODEL_NAME_OR_PATH = os.environ.get( + "PRETRAINED_MODEL_NAME_OR_PATH", "DeepPavlov/bert-base-multilingual-cased-sentence" +) +logger.info(f"PRETRAINED_MODEL_NAME_OR_PATH = {PRETRAINED_MODEL_NAME_OR_PATH}") + +try: + tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME_OR_PATH) + model = AutoModel.from_pretrained(PRETRAINED_MODEL_NAME_OR_PATH) + if torch.cuda.is_available(): + model.to("cuda") + logger.info("sentence-ranker is set to run on cuda") + + logger.info("sentence-ranker is ready") +except Exception as e: + sentry_sdk.capture_exception(e) + logger.exception(e) + raise e + +app = Flask(__name__) +logging.getLogger("werkzeug").setLevel("WARNING") + + +def get_sim_for_pair_embeddings(sentence_pairs_batch): + # source code: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1 + # initialize dictionary to store tokenized sentences + tokens = {"input_ids": [], "attention_mask": []} + + for pair in sentence_pairs_batch: + # encode each sentence and append to dictionary + for sentence in pair: + new_tokens = tokenizer.encode_plus( + sentence, max_length=64, truncation=True, padding="max_length", return_tensors="pt" + ) + tokens["input_ids"].append(new_tokens["input_ids"][0]) + tokens["attention_mask"].append(new_tokens["attention_mask"][0]) + + # reformat list of tensors into single tensor + tokens["input_ids"] = torch.stack(tokens["input_ids"]) + tokens["attention_mask"] = torch.stack(tokens["attention_mask"]) + if torch.cuda.is_available(): + tokens["input_ids"] = tokens["input_ids"].cuda() + tokens["attention_mask"] = tokens["attention_mask"].cuda() + + embeddings = model(**tokens).last_hidden_state + attention_mask = tokens["attention_mask"] + mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float() + masked_embeddings = embeddings * mask + summed = torch.sum(masked_embeddings, 1) + summed_mask = torch.clamp(mask.sum(1), min=1e-9) + mean_pooled = summed / summed_mask + # convert from PyTorch tensor to numpy array + if torch.cuda.is_available(): + mean_pooled = mean_pooled.cpu() + mean_pooled = mean_pooled.detach().numpy() + + # calculate + scores = [] + for i in range(len(sentence_pairs_batch)): + scores += [cosine_similarity([mean_pooled[i * 2]], [mean_pooled[i * 2 + 1]]).tolist()[0][0]] + return scores + + +@app.route("/respond", methods=["POST"]) +def respond(): + st_time = time.time() + 
sentence_pairs = request.json.get("sentence_pairs", []) + + try: + scores = get_sim_for_pair_embeddings(sentence_pairs) + logger.info(f"sentence-ranker output: {scores}") + except Exception as exc: + logger.exception(exc) + sentry_sdk.capture_exception(exc) + scores = [0.0] * len(sentence_pairs) + + total_time = time.time() - st_time + logger.info(f"sentence-ranker exec time: {total_time:.3f}s") + return jsonify([{"batch": scores}]) diff --git a/services/sentence_ranker/test.py b/services/sentence_ranker/test.py new file mode 100644 index 0000000000..bf5346f474 --- /dev/null +++ b/services/sentence_ranker/test.py @@ -0,0 +1,25 @@ +import requests + + +def test_respond(): + url = "http://0.0.0.0:8128/respond" + + sentence_pairs = [ + ["Привет! Как дела?", "хорошо. а у тебя как дела?"], + ["Привет! Как дела?", "какой твой любимый фильм?"], + ["Какой твой любимый фильм?", "Гордость и предубеждение"], + ["Какой твой любимый фильм?", "пересматриваю Гордость и предубеждение иногда."], + ["Какой твой любимый фильм?", "я люблю играть в компьютерные игры."], + ] + + gold = [0.8988315, 0.62241143, 0.65046525, 0.54038674, 0.48419473] + + request_data = {"sentence_pairs": sentence_pairs} + result = requests.post(url, json=request_data).json()[0]["batch"] + for i, score in enumerate(result): + assert score != 0.0, f"Expected:{gold[i]}\tGot\n{score}" + print("Success!") + + +if __name__ == "__main__": + test_respond() diff --git a/services/sentence_ranker/test.sh b/services/sentence_ranker/test.sh new file mode 100755 index 0000000000..cf55721bd3 --- /dev/null +++ b/services/sentence_ranker/test.sh @@ -0,0 +1,4 @@ +#!/bin/bash + + +python test.py diff --git a/services/text_qa/Dockerfile b/services/text_qa/Dockerfile index 9fd1f1bcb0..b7788c41c2 100644 --- a/services/text_qa/Dockerfile +++ b/services/text_qa/Dockerfile @@ -19,7 +19,7 @@ COPY ./requirements.txt /src/requirements.txt RUN pip install -r /src/requirements.txt RUN rm -r /etc/apt/sources.list.d && apt-get update && apt-get install git -y -RUN pip install git+https://github.com/deepmipt/DeepPavlov.git@${COMMIT} +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@${COMMIT} COPY . 
/src diff --git a/services/text_qa/logit_ranker.py b/services/text_qa/logit_ranker.py index 5bb563d57b..c479a5c765 100644 --- a/services/text_qa/logit_ranker.py +++ b/services/text_qa/logit_ranker.py @@ -142,11 +142,18 @@ def __call__( ) logger.info(f"batch_best_answers {batch_best_answers}") if self.top_n == 1: - batch_best_answers = [x[0] for x in batch_best_answers] - batch_best_answers_place = [x[0] for x in batch_best_answers_place] - batch_best_answers_score = [x[0] for x in batch_best_answers_score] - batch_best_answers_doc_ids = [x[0] for x in batch_best_answers_doc_ids] - batch_best_answers_sentences = [x[0] for x in batch_best_answers_sentences] + if batch_best_answers and batch_best_answers[0]: + batch_best_answers = [x[0] for x in batch_best_answers] + batch_best_answers_place = [x[0] for x in batch_best_answers_place] + batch_best_answers_score = [x[0] for x in batch_best_answers_score] + batch_best_answers_doc_ids = [x[0] for x in batch_best_answers_doc_ids] + batch_best_answers_sentences = [x[0] for x in batch_best_answers_sentences] + else: + batch_best_answers = ["" for _ in questions_batch] + batch_best_answers_place = [0 for _ in questions_batch] + batch_best_answers_score = [0.0 for _ in questions_batch] + batch_best_answers_doc_ids = ["" for _ in questions_batch] + batch_best_answers_sentences = ["" for _ in questions_batch] if doc_ids_batch is None: if self.return_answer_sentence: diff --git a/services/text_qa/requirements.txt b/services/text_qa/requirements.txt index 7fdced4dc1..be4769e031 100644 --- a/services/text_qa/requirements.txt +++ b/services/text_qa/requirements.txt @@ -7,4 +7,5 @@ click==7.1.2 jinja2<=3.0.3 Werkzeug<=2.0.3 torch==1.6.0 -transformers==2.11.0 \ No newline at end of file +transformers==2.11.0 +cryptography==2.8 \ No newline at end of file diff --git a/services/wikidata_dial_service/Dockerfile b/services/wikidata_dial_service/Dockerfile index 2a79995511..54eab47cf4 100644 --- a/services/wikidata_dial_service/Dockerfile +++ b/services/wikidata_dial_service/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 ARG CONFIG ARG COMMIT=0.13.0 @@ -26,7 +27,6 @@ WORKDIR /src RUN sed -i "s|$SED_ARG|g" "kg_dial_generator.json" -RUN python -m deeppavlov install $CONFIG RUN python -m spacy download en_core_web_sm CMD gunicorn --workers=1 --timeout 500 --graceful-timeout 500 server:app -b 0.0.0.0:8092 diff --git a/services/wikidata_dial_service/requirements.txt b/services/wikidata_dial_service/requirements.txt index c7fe18eeb5..b5d102f61a 100644 --- a/services/wikidata_dial_service/requirements.txt +++ b/services/wikidata_dial_service/requirements.txt @@ -9,3 +9,5 @@ torch==1.7.0 torchtext==0.4.0 transformers==4.0.0 click==7.1.2 +git+https://github.com/deeppavlovteam/bert.git@feat/multi_gpu +tensorflow==1.15.5 diff --git a/skills/dff_sport_skill/dialogflows/flows/sport.py b/skills/dff_sport_skill/dialogflows/flows/sport.py index 1ca7e797b3..49c6e24703 100644 --- a/skills/dff_sport_skill/dialogflows/flows/sport.py +++ b/skills/dff_sport_skill/dialogflows/flows/sport.py @@ -55,9 +55,6 @@ sentry_sdk.init(dsn=os.getenv("SENTRY_DSN")) LANGUAGE = os.getenv("LANGUAGE", "EN") - -MASKED_LM_SERVICE_URL = os.getenv("MASKED_LM_SERVICE_URL") - logger = logging.getLogger(__name__) diff --git a/skills/dff_travel_skill/dialogflows/flows/travel.py b/skills/dff_travel_skill/dialogflows/flows/travel.py index c757fe2dd1..2c0822e025 100644 --- a/skills/dff_travel_skill/dialogflows/flows/travel.py 
+++ b/skills/dff_travel_skill/dialogflows/flows/travel.py @@ -46,9 +46,6 @@ sentry_sdk.init(dsn=os.getenv("SENTRY_DSN")) - -MASKED_LM_SERVICE_URL = os.getenv("MASKED_LM_SERVICE_URL") - logger = logging.getLogger(__name__) SUPER_CONFIDENCE = 1.0 diff --git a/skills/dummy_skill/connector.py b/skills/dummy_skill/connector.py index 6af1968bc1..1baac00f25 100644 --- a/skills/dummy_skill/connector.py +++ b/skills/dummy_skill/connector.py @@ -222,7 +222,7 @@ async def send(self, payload: Dict, callback: Callable): logger.info("Found special nounphrases for questions. Return question with the same nounphrase.") cands += [choice(questions_same_nps)] confs += [0.5] - attrs += [{"type": "nounphrase_question"}] + attrs += [{"type": "nounphrase_question", "response_parts": ["prompt"]}] human_attrs += [{}] bot_attrs += [{}] @@ -267,13 +267,13 @@ async def send(self, payload: Dict, callback: Callable): else: confs += [0.05] # Use it only as response selector retrieve skill output modifier cands += [link_to_question] - attrs += [{"type": "link_to_for_response_selector"}] + attrs += [{"type": "link_to_for_response_selector", "response_parts": ["prompt"]}] human_attrs += [human_attr] bot_attrs += [{}] elif is_russian: cands += [random.choice(RUSSIAN_RANDOM_QUESTIONS)] confs += [0.8] - attrs += [{"type": "link_to_for_response_selector"}] + attrs += [{"type": "link_to_for_response_selector", "response_parts": ["prompt"]}] human_attrs += [{}] bot_attrs += [{}] @@ -293,7 +293,7 @@ async def send(self, payload: Dict, callback: Callable): logger.info("Found special nounphrases for facts. Return fact with the same nounphrase.") cands += [choice(facts_same_nps)] confs += [0.5] - attrs += [{"type": "nounphrase_fact"}] + attrs += [{"type": "nounphrase_fact", "response_parts": ["body"]}] human_attrs += [{}] bot_attrs += [{}] diff --git a/skills/faq_skill_deepy/requirements.txt b/skills/faq_skill_deepy/requirements.txt index afbbb9e9ac..c7b542f3fd 100644 --- a/skills/faq_skill_deepy/requirements.txt +++ b/skills/faq_skill_deepy/requirements.txt @@ -8,4 +8,5 @@ spacy==2.2.3 deeppavlov==0.14.0 https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5 jinja2<=3.0.3 -Werkzeug<=2.0.3 \ No newline at end of file +Werkzeug<=2.0.3 +cryptography==2.8 \ No newline at end of file diff --git a/skills/wikidata_dial_skill/Dockerfile b/skills/wikidata_dial_skill/Dockerfile index fba4d1909d..0d09caafae 100644 --- a/skills/wikidata_dial_skill/Dockerfile +++ b/skills/wikidata_dial_skill/Dockerfile @@ -1,4 +1,5 @@ FROM deeppavlov/base-gpu:0.12.1 +RUN pip install git+https://github.com/deeppavlovteam/DeepPavlov.git@0.12.1 ARG CONFIG ARG COMMIT=0.13.0 diff --git a/state_formatters/dp_formatters.py b/state_formatters/dp_formatters.py index c202ae2e6c..fdcac14901 100755 --- a/state_formatters/dp_formatters.py +++ b/state_formatters/dp_formatters.py @@ -360,6 +360,14 @@ def convers_evaluator_annotator_formatter(dialog: Dict) -> List[Dict]: return [conv] +def sentence_ranker_formatter(dialog: Dict) -> List[Dict]: + dialog = utils.get_last_n_turns(dialog) + dialog = utils.remove_clarification_turns_from_dialog(dialog) + last_human_uttr = dialog["human_utterances"][-1]["text"] + sentence_pairs = [[last_human_uttr, h["text"]] for h in dialog["human_utterances"][-1]["hypotheses"]] + return [{"sentence_pairs": sentence_pairs}] + + def dp_classes_formatter_service(payload: List): # Used by: dp_toxic_formatter return payload[0] @@ -566,24 +574,23 @@ def 
wp_formatter_dialog(dialog: Dict): def el_formatter_dialog(dialog: Dict): # Used by: entity_linking annotator num_last_utterances = 2 - ner_output = get_entities(dialog["human_utterances"][-1], only_named=True, with_labels=True) - nounphrases = get_entities(dialog["human_utterances"][-1], only_named=False, with_labels=False) - entity_substr_list = [] - if ner_output: - for entity in ner_output: - if entity and isinstance(entity, dict) and "text" in entity and entity["text"].lower() != "alexa": - entity_substr_list.append(entity["text"]) - entity_substr_lower_list = {entity_substr.lower() for entity_substr in entity_substr_list} + entities_with_labels = get_entities(dialog["human_utterances"][-1], only_named=False, with_labels=True) + entity_substr_list, entity_tags_list = [], [] + for entity in entities_with_labels: + if entity and isinstance(entity, dict) and "text" in entity and entity["text"].lower() != "alexa": + entity_substr_list.append(entity["text"]) + if "finegrained_label" in entity: + finegrained_labels = [[label.lower(), conf] for label, conf in entity["finegrained_label"]] + entity_tags_list.append(finegrained_labels) + elif "label" in entity: + entity_tags_list.append([[entity["label"].lower(), 1.0]]) + else: + entity_tags_list.append([["misc", 1.0]]) dialog = utils.get_last_n_turns(dialog, bot_last_turns=1) dialog = utils.replace_with_annotated_utterances(dialog, mode="punct_sent") context = [[uttr["text"] for uttr in dialog["utterances"][-num_last_utterances:]]] - if nounphrases: - entity_substr_list += [ - nounphrase for nounphrase in nounphrases if nounphrase.lower() not in entity_substr_lower_list - ] - entity_substr_list = list(set(entity_substr_list)) - return [{"entity_substr": [entity_substr_list], "template": [""], "context": context}] + return [{"entity_substr": [entity_substr_list], "entity_tags": [entity_tags_list], "context": context}] def kbqa_formatter_dialog(dialog: Dict): @@ -596,17 +603,20 @@ def kbqa_formatter_dialog(dialog: Dict): sentences = [deepcopy(annotations["sentseg"]["punct_sent"])] else: sentences = [deepcopy(dialog["human_utterances"][-1]["text"])] - entity_substr = get_entities(dialog["human_utterances"][-1], only_named=True, with_labels=False) - nounphrases = get_entities(dialog["human_utterances"][-1], only_named=False, with_labels=False) - entities = [] - if entity_substr: - entities = [entity_substr] - elif nounphrases: - entities = [nounphrases] - else: - entities = [[]] - - return [{"x_init": sentences, "entities": entities}] + entities_with_labels = get_entities(dialog["human_utterances"][-1], only_named=False, with_labels=True) + entity_substr_list, entity_tags_list = [], [] + for entity in entities_with_labels: + if entity and isinstance(entity, dict) and "text" in entity and entity["text"].lower() != "alexa": + entity_substr_list.append(entity["text"]) + if "finegrained_label" in entity: + finegrained_labels = [[label.lower(), conf] for label, conf in entity["finegrained_label"]] + entity_tags_list.append(finegrained_labels) + elif "label" in entity: + entity_tags_list.append([[entity["label"].lower(), 1.0]]) + else: + entity_tags_list.append([["misc", 1.0]]) + + return [{"x_init": sentences, "entities": [entity_substr_list], "entity_tags": [entity_tags_list]}] def fact_random_formatter_dialog(dialog: Dict): @@ -640,11 +650,11 @@ def fact_retrieval_formatter_dialog(dialog: Dict): entity_substr_list = [] entity_pages_titles_list = [] for entity_info in entity_info_list: - if "entity_pages" in entity_info and 
entity_info["entity_pages"]: - entity_pages_list.append(entity_info["entity_pages"]) + if "pages_titles" in entity_info and entity_info["pages_titles"]: + entity_pages_list.append(entity_info["first_paragraphs"]) entity_ids_list.append(entity_info["entity_ids"]) entity_substr_list.append(entity_info["entity_substr"]) - entity_pages_titles_list.append(entity_info["entity_pages_titles"]) + entity_pages_titles_list.append(entity_info["pages_titles"]) return [ { "human_sentences": [last_human_utt["text"]], diff --git a/tests/runtests.sh b/tests/runtests.sh index f8424a7ef4..888c0bb9e0 100755 --- a/tests/runtests.sh +++ b/tests/runtests.sh @@ -143,7 +143,7 @@ if [[ "$MODE" == "test_skills" || "$MODE" == "all" ]]; then comet-conceptnet convers-evaluation-selector emotion-skill game-cooperative-skill \ entity-linking kbqa text-qa wiki-parser convert-reddit convers-evaluator-annotator \ dff-book-skill combined-classification knowledge-grounding knowledge-grounding-skill \ - dff-grounding-skill dff-coronavirus-skill dff-friendship-skill masked-lm entity-storer \ + dff-grounding-skill dff-coronavirus-skill dff-friendship-skill entity-storer \ dff-travel-skill dff-animals-skill dff-food-skill dff-sport-skill midas-classification \ fact-random fact-retrieval dff-intent-responder-skill badlisted-words \ dff-gossip-skill dff-wiki-skill topic-recommendation dff-science-skill personal-info-skill \