buriy · ex00 · Feb 24, 2020
diff --git a/Dockerfile b/Dockerfile
@@ -6,7 +6,7 @@ RUN mkdir $PROJECT_DIR
 
 WORKDIR /
 #install components for ru2
-RUN conda install -y -c conda-forge spacy==2.1.9
+RUN conda install -y -c conda-forge spacy==2.2.4
 RUN pip install pymorphy2==0.8
 RUN git clone -b v2.1 https://github.com/buriy/spacy-ru.git
 RUN cp -r /spacy-ru/ru2/. $PROJECT_DIR/ru2

diff --git a/README.md b/README.md
@@ -1,7 +1,48 @@
 # Russian language model for the spaCy library
 
-## Advantages of the ru2 model
-It tries to determine not only x.pos_ but also x.lemma_, the lemma of the word (for nouns, for example, the lemma is the nominative singular form) (edited) 
+## Models for spaCy 2.3 are now available
+
+They are published at https://github.com/buriy/spacy-ru/releases/tag/v2.3_beta and are used as follows:
+
+```bash
+pip install "spacy<2.4"
+wget https://github.com/buriy/spacy-ru/releases/download/v2.3_beta/ru2_combined_400ks_96.zip
+unzip ru2_combined_400ks_96.zip
+```
+Then:
+```
+import spacy
+nlp = spacy.load('ru2_combined_400ks_96')
+```
+
+The rest of this document covers version 2.1.
+
+## Advantages of the ru2 model for version 2.1
+The ru2 model determines not only the POS tag in x.pos_ but also the word's lemma in x.lemma_. For nouns, for example, the lemma is the nominative singular form.
+Because of how the spaCy library is built, you need to write the following to get better lemmas:
+```
+import ru2
+nlp = ru2.load_ru2('ru2')
+```
+instead of the standard
+```
+import spacy
+nlp = spacy.load('ru2')
+```
+
+This also fixes `.noun_chunks` for Russian, although they do not work perfectly yet; we will keep improving them.
+
+## The ru2e model
+This is an "empty" model that uses stemming (`pip install pystemmer`). It is useful for custom classification tasks, especially when data is scarce. Because this model has no POS tagger, it cannot produce lemmas.
+To use stemming, write the following, analogous to the ru2 model above:
+```
+import ru2e
+nlp = ru2e.load_ru2('ru2e')
+```
+
+See the example in
+https://github.com/buriy/spacy-ru/blob/master/notebooks/examples/textcat_news_topics.ipynb
+
 # Installation
 
 Installation is not entirely straightforward at the moment, and thinc does not always work out of the box.
@@ -41,13 +82,14 @@ docker run --rm spacy:ru2
 ```
 
 ### Warnings and possible problems
- - If you need thinc working on a GPU, you may need to fix (explicitly set) the path to cuda and reinstall the library:
+ - If you need GPU support (it speeds up training 2-3x and inference up to 5x), you may need to fix (explicitly set) the path to cuda and reinstall the thinc library:
 ```bash
 pip uninstall -y thinc
 CUDA_HOME=/usr/local/cuda pip install --no-cache-dir thinc==7.0.8
 ```
-Another option is to try something like `pip install "spacy[cuda91]<2.2"` or `pip install "spacy[cuda10]<2.2"` for spacy version 2.1.x.
-It is also worth checking that `cupy` is installed correctly for your cuda version -[link](https://docs-cupy.chainer.org/en/stable/install.html#install-cupy)
+Another option is to try something like `pip install "spacy[cuda91]<2.2"` or `pip install "spacy[cuda10]<2.2"` for spacy version 2.1.x and your cuda version.
+
+If the GPU still does not work, explicitly check that `cupy` is installed correctly for your cuda version: [link](https://docs-cupy.chainer.org/en/stable/install.html#install-cupy)
 Example installation for cuda 10.0:
 ```bash
 $ nvcc -V
@@ -66,13 +108,12 @@ $ pip install --no-cache-dir "spacy[cuda10]<2.2"
 Successfully installed blis-0.2.4 preshed-2.0.1 spacy-2.1.9 thinc-7.0.8
 ```
 
-- If you are switching from xx to ru/ru2, keep in mind that tokenization in ru/ru2 and xx differs, because xx does not split letters from digits or split on hyphens.
+- If you are switching from the multilingual "xx" model to the ru/ru2 models, keep in mind that tokenization differs between them: xx does not split letters from digits or split on hyphens, so words such as "24-часовой" and "va-sya103" remain single, indivisible tokens.
 - On Windows, cloning the repository with the `core.autocrlf true` setting in `git` 
 can corrupt some files and lead to errors such as `msgpack._cmsgpack.unpackbTypeError: unhashable type: 'list'`.
 To avoid this, either clone with `core.autocrlf false` or, for example, 
 download the repository archive manually through the web interface.
 A discussion of the problem and the solution can be found [here](https://github.com/explosion/spaCy/issues/1634).
 - Calling `spacy.displacy.serve()` or some other functions on Python 3 may raise the 
-error `TypeError: __init__() got an unexpected keyword argument 'encoding'`. To avoid this,
-you need to explicitly install the older version `msgpack-numpy<0.4.4.0`. A discussion of the problem and the solution can
-be found [here](https://github.com/explosion/spaCy/issues/2810).
+error `TypeError: __init__() got an unexpected keyword argument 'encoding'`. To avoid this, you previously
+had to explicitly install the older version `msgpack-numpy<0.4.4.0`. This now appears to be fixed. A discussion of the problem and the solution can be found [here](https://github.com/explosion/spaCy/issues/2810).
diff --git a/notebooks/training/factu_ner_example.ipynb b/notebooks/training/factu_ner_example.ipynb
@@ -0,0 +1,113 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "FactRu data path:  /home/exo/projects/spacy-ru/data/factRuEval-2016/\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "<utils.corpus.Corpus at 0x7f67e4516400>"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "\n",
+    "import sys\n",
+    "sys.path.append('/home/exo/projects/spacy-ru/')\n",
+    "import utils.corpus\n",
+    "NERUS = utils.corpus.FactRu.load('ru',\"/home/exo/projects/spacy-ru/data/factRuEval-2016/\")\n",
+    "NERUS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "CORPORA = [NERUS]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "AttributeError",
+     "evalue": "'Corpus' object has no attribute 'ner'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-3-96c500e01a11>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mCORPORA\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m     \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mner\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mds_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mds_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;31mAttributeError\u001b[0m: 'Corpus' object has no attribute 'ner'"
+     ]
+    }
+   ],
+   "source": [
+    "for c in CORPORA:\n",
+    "    print(len(c.ner), len(c.ds_train), len(c.ds_test))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "122 132\n"
+     ]
+    }
+   ],
+   "source": [
+    "for c in CORPORA:\n",
+    "    print(len(c.ds_train), len(c.ds_test))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "spacy-ru",
+   "language": "python",
+   "name": "spacy-ru"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/pyproject.toml b/pyproject.toml
@@ -21,6 +21,7 @@ nerus = "^1.4.0"
 pymystem3 = "^0.2.0"
 pystemmer = "^1.3.0"
 pandas = ">=0.23"
+tqdm = ">=4.46.0"
 
 [tool.poetry.dev-dependencies]
 #spacy = { git = "https://github.com/buriy/spaCy.git", branch = "master" }
diff --git a/ru2/__init__.py b/ru2/__init__.py
@@ -15,5 +15,10 @@ class Russian2(Russian):
     lang = 'ru'
     Defaults = Russian2Defaults
 
+def load_ru2(path, exclude=()):
+    nlp = Russian2()
+    nlp.from_disk(path)
+    nlp.disable_pipes(*exclude)
+    return nlp
 
-__all__ = ['Russian2']
+__all__ = ['Russian2', 'load_ru2']
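The `load_ru2` helper added above follows a simple pattern: build the pipeline object, load it from disk, then disable the excluded components. Below is a minimal sketch of that pattern. It assumes a hypothetical `FakePipeline` stand-in (not spaCy's API) so that it runs without spaCy or a model on disk; the component names are also illustrative. The sketch uses an immutable tuple as the default for `exclude`, which avoids sharing one list object between calls.

```python
from typing import Iterable, List, Optional


class FakePipeline:
    """Hypothetical stand-in for a spaCy Language object (illustration only)."""

    def __init__(self) -> None:
        self.pipe_names: List[str] = []
        self.loaded_from: Optional[str] = None

    def from_disk(self, path: str) -> "FakePipeline":
        # Pretend the model on disk ships these three components.
        self.loaded_from = path
        self.pipe_names = ["tagger", "parser", "ner"]
        return self

    def disable_pipes(self, *names: str) -> None:
        # Drop the named components, keeping the order of the rest.
        self.pipe_names = [n for n in self.pipe_names if n not in names]


def load_pipeline(path: str, exclude: Iterable[str] = ()) -> FakePipeline:
    """Same shape as load_ru2: construct, load from disk, disable pipes."""
    nlp = FakePipeline()
    nlp.from_disk(path)
    nlp.disable_pipes(*exclude)
    return nlp


nlp = load_pipeline("ru2", exclude=["parser"])
print(nlp.pipe_names)  # prints ['tagger', 'ner']
```

Calling `load_pipeline("ru2")` with no `exclude` leaves all three stand-in components in place, because unpacking an empty tuple passes no names to `disable_pipes`.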
diff --git a/utils/corpus.py b/utils/corpus.py
@@ -1,9 +1,19 @@
+import os
 import random
+import urllib.request
+from itertools import chain
+from typing import List
+
 import spacy
 import spacy.gold
-from itertools import chain
+from corus import load_factru
+from corus.sources.factru import DEVSET as FACTRU_DEVSET, TESTSET as FACTRU_TESTSET
+from corus.sources.factru import FactruSpan
+from nerus import const as nerus_const
 from tqdm.auto import tqdm
 
+from .tqdm import TqdmUpTo
+
 
 class Dataset:
     """
@@ -78,9 +88,9 @@ def iter(self, nlp, limit=None):
             random.shuffle(ds_copy)
         if limit:
             ds_copy = ds_copy[:limit]
-        for dict_ in self.ds:
+        for dict_ in ds_copy:
             doc = nlp(dict_["raw"])
-            gold = spacy.gold.GoldParse()
+            gold = spacy.gold.GoldParse(doc)
             gold.ner = dict_.get("entities", [])
             yield doc, gold
 
@@ -138,8 +148,8 @@ def from_gold(lang, ds_gold):
 
     @staticmethod
     def from_raw(lang, ds_train, ds_test):
-        train = RawDataset(lang, ds_train)
-        test = RawDataset(lang, ds_test)
+        train = RawDataset(lang, ds_train, is_train=True)
+        test = RawDataset(lang, ds_test, is_train=False)
         return Corpus(train, test)
 
 
@@ -171,3 +181,61 @@ def tag_morphology(tag):
         k, v = p.split("=", 1)
         info[k] = v
     return info
+
+
+class FactRu(Corpus):
+
+    @staticmethod
+    def _resolve_data_path(data_path, download_if_not_exist):
+        data_path = data_path or os.path.join(nerus_const.SOURCES_DIR, nerus_const.FACTRU_DIR, 'master.zip')
+        print("FactRu data path: ", data_path)
+        if not os.path.exists(data_path):
+            if download_if_not_exist:
+                os.makedirs(os.path.dirname(data_path), exist_ok=True)
+
+                print("Download FactRu corpus to ", data_path)
+                with TqdmUpTo(unit='B', unit_scale=True, unit_divisor=1024, miniters=1,
+                              desc="FactRu corpus downloading") as tqdm_:
+                    try:
+                        urllib.request.urlretrieve(nerus_const.FACTRU_URL, data_path, reporthook=tqdm_.update_to)
+                        tqdm_.total = tqdm_.n
+                    except Exception:
+                        os.remove(data_path)
+                        raise
+                # TODO unpack in script
+                raise Exception("{} has been downloaded, please unzip master.zip and restart the script".format(data_path))
+            else:
+                raise FileNotFoundError("Source data for the FactRuEval corpus does not exist: {}".format(data_path))
+        return data_path
+
+    @staticmethod
+    def _load_list_dict(factru_data) -> List[dict]:
+        output_data = []
+        for element in factru_data:
+            dict_ = {"raw": element.text}
+            entities = []
+            for fact in element.facts:
+                for slot in fact.slots:
+                    if isinstance(slot.value, FactruSpan):
+                        span = slot.value
+                        entities.append((span.start, span.stop, span.type))
+
+                    elif not isinstance(slot.value, str):
+                        for obj in slot.value.objects:
+                            for span in obj.spans:
+                                entities.append((span.start, span.stop, span.type))
+
+            dict_["entities"] = entities
+            output_data.append(dict_)
+        return output_data
+
+    @staticmethod
+    def load(lang: str, data_path: str = None, download_if_not_exist: bool = True) -> Corpus:
+        data_path = FactRu._resolve_data_path(data_path, download_if_not_exist)
+
+        factru_dev_data = load_factru(data_path, sets=[FACTRU_DEVSET])
+        factru_test_data = load_factru(data_path, sets=[FACTRU_TESTSET])
+
+        train = RawDataset(lang, FactRu._load_list_dict(factru_dev_data), is_train=True)
+        test = RawDataset(lang, FactRu._load_list_dict(factru_test_data), is_train=False)
+        return Corpus(train, test)
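`FactRu._load_list_dict` flattens each corpus record into the `{"raw": ..., "entities": [...]}` shape that `RawDataset` consumes. Below is a minimal sketch of that flattening. The namedtuples are hypothetical stand-ins for the corus record types and model only the attributes the method reads. The labels `PER` and `ORG` are illustrative and are not taken from the corpus.

```python
from collections import namedtuple

# Hypothetical stand-ins for the corus types used by _load_list_dict.
Span = namedtuple("Span", "start stop type")
Obj = namedtuple("Obj", "spans")
Value = namedtuple("Value", "objects")
Slot = namedtuple("Slot", "value")
Fact = namedtuple("Fact", "slots")
Record = namedtuple("Record", "text facts")


def to_entities(record):
    """Collect (start, stop, type) tuples from every slot of every fact."""
    entities = []
    for fact in record.facts:
        for slot in fact.slots:
            if isinstance(slot.value, Span):
                # The slot points directly at a span.
                span = slot.value
                entities.append((span.start, span.stop, span.type))
            elif not isinstance(slot.value, str):
                # The slot points at an object group; take all of its spans.
                for obj in slot.value.objects:
                    for span in obj.spans:
                        entities.append((span.start, span.stop, span.type))
            # Plain string values carry no offsets and are skipped.
    return {"raw": record.text, "entities": entities}


text = "Иван Петров работает в Яндексе"
record = Record(text, [Fact([
    Slot(Span(0, 11, "PER")),
    Slot(Value([Obj([Span(23, 30, "ORG")])])),
    Slot("plain string"),
])])
result = to_entities(record)
print(result["entities"])  # prints [(0, 11, 'PER'), (23, 30, 'ORG')]
```

The resulting tuples are character offsets into `raw`, which is the form assigned to `gold.ner` in `Dataset.iter`.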
diff --git a/utils/tqdm.py b/utils/tqdm.py
@@ -19,3 +19,24 @@ def tqdm_batches(batches, total=None, leave=True, **info):
         yield batch
         batch_iter.update(bl)
     batch_iter.close()
+
+
+class TqdmUpTo(tqdm):
+    """tqdm subclass whose `update_to` can serve as a `urlretrieve` reporthook.
+    Provides `update_to(n)` which uses `tqdm.update(delta_n)`.
+    Inspired by [twine#242](https://github.com/pypa/twine/pull/242),
+    [here](https://github.com/pypa/twine/commit/42e55e06).
+    """
+
+    def update_to(self, b=1, bsize=1, tsize=None):
+        """
+        b  : int, optional
+            Number of blocks transferred so far [default: 1].
+        bsize  : int, optional
+            Size of each block (in tqdm units) [default: 1].
+        tsize  : int, optional
+            Total size (in tqdm units). If [default: None] remains unchanged.
+        """
+        if tsize is not None:
+            self.total = tsize
+        self.update(b * bsize - self.n)  # will also set self.n = b * bsize
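`update_to` turns the absolute position reported by a download hook (`b * bsize`) into the increment that `tqdm.update` expects. Below is a minimal sketch of that arithmetic. `MiniBar` is a hypothetical stand-in that mimics only tqdm's `n`, `total` and `update`, so it runs without tqdm installed. The block size and total are made-up numbers.

```python
class MiniBar:
    """Hypothetical stand-in for tqdm: tracks only `n`, `total`, `update`."""

    def __init__(self):
        self.n = 0         # units counted so far
        self.total = None  # expected total, unknown until reported

    def update(self, delta):
        # Like tqdm.update, this takes an increment, not a position.
        self.n += delta

    def update_to(self, b=1, bsize=1, tsize=None):
        # Same logic as TqdmUpTo.update_to: convert the absolute
        # position b * bsize into an increment relative to self.n.
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)


bar = MiniBar()
# Simulated reporthook calls: (blocks so far, block size, total size).
for block_number in range(4):
    bar.update_to(block_number, 8192, 20000)

print(bar.n, bar.total)  # prints 24576 20000
```

Because the last block is usually partial, the counted position can overshoot the reported total, as it does here. That may be why `_resolve_data_path` sets `tqdm_.total = tqdm_.n` once the download finishes.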