diff --git a/sem30/main.ipynb b/sem30/main.ipynb
index 82a7391..ba957e5 100644
--- a/sem30/main.ipynb
+++ b/sem30/main.ipynb
@@ -1,1414 +1,1572 @@
 {
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "a2f21391-ecd3-4a99-9109-8989542546e5",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "# Installing external dependencies"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 21,
-   "id": "e3f70fb0-8b21-40fd-a28a-03fea5100f27",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!pip install -qU https://github.com/kpu/kenlm/archive/master.zip sacremoses"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a73e8d26-6df2-4f4e-a8a6-3d5a28989380",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "# Importing libraries"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 23,
-   "id": "423007be-e0c6-4cb1-8a0f-aa5128106acc",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "import gzip\n",
-    "\n",
-    "import kenlm\n",
-    "from sacremoses import MosesDetokenizer"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "7b642fd2-71db-469b-923f-73ac2d87cb9a",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "# MOSES"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "426c343a-56ff-4f96-b601-ce67a645c334",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Preparing data for training the translator\n",
-    "\n",
-    "Download the data into `/mnt/DATA/MSU2024NLP/corpus` and unpack the archive there; this should produce a `training` folder."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4c4ffa1e-eca0-438a-b622-0f2f2300d1c9",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# !wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz\n",
-    "# !tar zxvf training-parallel-nc-v8.tgz"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b24fd07f-7876-49aa-85db-ddcfe428f4f8",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "### Text tokenization\n",
-    "\n",
-    "Here we use the tokenizer provided by the Moses developers. It is still being maintained (new languages are added, among other updates).\n",
-    "\n",
-    "Other tokenizer implementations can be used instead (for example, modern BPE or WordPiece). Such tokenizers can improve quality for some languages and also make multilingual statistical translation possible [1].\n",
-    "\n",
-    "[1] Asvarov A., Grabovoy A. The impact of multilinguality and tokenization on statistical machine translation // 2024 35th Conference of Open Innovations Association (FRUCT). IEEE, 2024. P. 149–157."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "d5a06ada-053f-45a3-8607-46f0ca862716",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Tokenizer Version 1.1\n",
-      "Language: en\n",
-      "Number of threads: 1\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/tokenizer/tokenizer.perl -l en \\\n",
-    "        < /corpus/training/news-commentary-v8.ru-en.en \\\n",
-    "        > /corpus/news-commentary-v8.ru-en.tok.en\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "8fdcfff2-9a7b-4625-85fe-93c5ec0ac1c8",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Tokenizer Version 1.1\n",
-      "Language: ru\n",
-      "Number of threads: 1\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/tokenizer/tokenizer.perl -l ru \\\n",
-    "        < /corpus/training/news-commentary-v8.ru-en.ru \\\n",
-    "        > /corpus/news-commentary-v8.ru-en.tok.ru\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "714abf17-144d-49c4-8922-f61fecf9ec60",
-   "metadata": {},
-   "source": [
-    "We now have a dataset of about 150 thousand sentences for training the statistical translator. This example dataset has a number of shortcomings, and it is rather small for a statistical translator.\n",
-    "\n",
-    "Notes from experience:\n",
-    "1. Training a decent statistical translator requires several million sentence pairs.\n",
-    "2. Parallel sentences for various languages can be obtained from https://opus.nlpl.eu.\n",
-    "3. Translation quality depends heavily on the quality of the data and of the tokenizer."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "f16b299b-5a47-424b-8beb-54b284b27d4b",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.en\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.en"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "531ab059-b1c7-43ba-a726-d7bc603c6a95",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.ru\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.ru"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "2584696f-14af-4063-a13e-ddb733ed3028",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold .\n",
-      "Lately , with gold prices up more than 300 % over the last decade , it is harder than ever . Just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .\n",
-      "Wouldn ’ t you know it ?\n",
-      "Since their articles appeared , the price of gold has moved up still further .\n",
-      "Gold prices even hit a record-high $ 1,300 recently .\n",
-      "Last December , many gold bugs were arguing that the price was inevitably headed for $ 2,000 .\n",
-      "Now , emboldened by continuing appreciation , some are suggesting that gold could be headed even higher than that .\n",
-      "One successful gold investor recently explained to me that stock prices languished for a more than a decade before the Dow Jones index crossed the 1,000 mark in the early 1980 ’ s .\n",
-      "Since then , the index has climbed above 10,000 .\n",
-      "Now that gold has crossed the magic $ 1,000 barrier , why can ’ t it increase ten-fold , too ?\n"
-     ]
-    }
-   ],
-   "source": [
-    "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.en"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "1fc0eda2-c125-43e3-bc61-373b6d5b0486",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "САН-ФРАНЦІСКО . Разговор о стоимости золота редко получается рациональным , тем более в последнее время , так как цены на золото выросли более чем на 300 % за десятилетие .\n",
-      "Еще в декабре прошлого года экономисты-коллеги Мартин Фельдштейн и Нуриэль Рубини опубликовали свои пророческие статьи в колонках альтернативных мнений , храбро ставя в них под вопрос стремление игры на повышение , благоразумно указывая на риски золота .\n",
-      "И что бы вы думали ?\n",
-      "С тех пор как вышли их статьи , стоимость золота повысилась еще больше .\n",
-      "Недавно цена на золото даже достигла рекордной отметки в 1 300 долларов за унцию .\n",
-      "В декабре прошлого года многие сторонники сохранения денежных функций золота утверждали , что его цена неизбежно дойдет до 2 000 долларов .\n",
-      "Теперь , воодушевленные постоянным ростом цен , некоторые считают , что цена на золото может вырасти еще сильнее .\n",
-      "Один успешный инвестор в золото недавно объяснил мне , что цена акций падала в течение более десятилетия , прежде чем в начале 1980-х гг. индекс Доу Джонса пересек 1 000 отметку .\n",
-      "С тех пор данный индекс постоянно рос и уже превысил 10 000 отметку .\n",
-      "Теперь , когда цена на золото пересекла магический барьер в 1 000 долларов , не может ли она также вырасти в десять раз ?\n"
-     ]
-    }
-   ],
-   "source": [
-    "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.ru"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9206c0eb-6593-4f58-8c28-040e6de63671",
-   "metadata": {},
-   "source": [
-    "### Restoring the case of the first word in a sentence\n",
-    "\n",
-    "Word case (capitalized or lowercase) strongly affects many NLP models. Models are often built to work only with lowercase text (the whole text is lowercased). That would work here too, but the documentation recommends \"redistributing\" word case instead (truecasing): the case of the first word in a sentence should not depend on its position, so capitalization is kept only for words that appear capitalized more often than not."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e54e10d7-2c9d-4024-823d-295a40b1bda4",
-   "metadata": {},
-   "source": [
-    "#### Training"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "c55392f0-853d-46c0-b943-b1ddd7fb9521",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/recaser/train-truecaser.perl \\\n",
-    "        --model /corpus/truecase-model.en \\\n",
-    "        --corpus /corpus/news-commentary-v8.ru-en.tok.en\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "0d89cfc0-9a92-42b0-9ebe-c569aa620d1b",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/recaser/train-truecaser.perl \\\n",
-    "        --model /corpus/truecase-model.ru \\\n",
-    "        --corpus /corpus/news-commentary-v8.ru-en.tok.ru\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b757aa40-dee8-4fe5-a21e-664c5681b8a8",
-   "metadata": {},
-   "source": [
-    "#### Applying"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "id": "47006d65-9d96-4d14-ab2f-7ee6099094bc",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/recaser/truecase.perl \\\n",
-    "        --model /corpus/truecase-model.en \\\n",
-    "        < /corpus/news-commentary-v8.ru-en.tok.en \\\n",
-    "        > /corpus/news-commentary-v8.ru-en.true.en\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "id": "dfd7b142-169c-4d38-a958-ba347731ff9a",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/recaser/truecase.perl \\\n",
-    "        --model /corpus/truecase-model.ru \\\n",
-    "        < /corpus/news-commentary-v8.ru-en.tok.ru \\\n",
-    "        > /corpus/news-commentary-v8.ru-en.true.ru\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "id": "c7a8f2a0-ceb4-4f9c-b342-3589416d72d5",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.en\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.en"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "id": "897f9cd6-7182-4cee-811b-d89a30c26f5f",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.ru\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.ru"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "1e58e6db-4f90-4b2a-ba46-a9db4741f73b",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "San FRANCISCO – It has never been easy to have a rational conversation about the value of gold .\n",
-      "lately , with gold prices up more than 300 % over the last decade , it is harder than ever . just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .\n",
-      "wouldn ’ t you know it ?\n",
-      "since their articles appeared , the price of gold has moved up still further .\n",
-      "gold prices even hit a record-high $ 1,300 recently .\n",
-      "last December , many gold bugs were arguing that the price was inevitably headed for $ 2,000 .\n",
-      "now , emboldened by continuing appreciation , some are suggesting that gold could be headed even higher than that .\n",
-      "one successful gold investor recently explained to me that stock prices languished for a more than a decade before the Dow Jones index crossed the 1,000 mark in the early 1980 ’ s .\n",
-      "since then , the index has climbed above 10,000 .\n",
-      "now that gold has crossed the magic $ 1,000 barrier , why can ’ t it increase ten-fold , too ?\n"
-     ]
-    }
-   ],
-   "source": [
-    "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.en"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "id": "302a7bb4-dd4a-488a-8d94-3c576174cbb6",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "САН-ФРАНЦІСКО . разговор о стоимости золота редко получается рациональным , тем более в последнее время , так как цены на золото выросли более чем на 300 % за десятилетие .\n",
-      "еще в декабре прошлого года экономисты-коллеги Мартин Фельдштейн и Нуриэль Рубини опубликовали свои пророческие статьи в колонках альтернативных мнений , храбро ставя в них под вопрос стремление игры на повышение , благоразумно указывая на риски золота .\n",
-      "и что бы вы думали ?\n",
-      "с тех пор как вышли их статьи , стоимость золота повысилась еще больше .\n",
-      "недавно цена на золото даже достигла рекордной отметки в 1 300 долларов за унцию .\n",
-      "в декабре прошлого года многие сторонники сохранения денежных функций золота утверждали , что его цена неизбежно дойдет до 2 000 долларов .\n",
-      "теперь , воодушевленные постоянным ростом цен , некоторые считают , что цена на золото может вырасти еще сильнее .\n",
-      "один успешный инвестор в золото недавно объяснил мне , что цена акций падала в течение более десятилетия , прежде чем в начале 1980-х гг. индекс Доу Джонса пересек 1 000 отметку .\n",
-      "с тех пор данный индекс постоянно рос и уже превысил 10 000 отметку .\n",
-      "теперь , когда цена на золото пересекла магический барьер в 1 000 долларов , не может ли она также вырасти в десять раз ?\n"
-     ]
-    }
-   ],
-   "source": [
-    "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.ru"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "8cc8ce66-4013-4b46-8594-ab3b774dd50e",
-   "metadata": {},
-   "source": [
-    "After this preprocessing, sentence-initial words are no longer always capitalized; they keep a capital letter only where it is justified."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "cf94c10c-4936-4111-a628-eadc99e1d36b",
-   "metadata": {},
-   "source": [
-    "### Filtering sentences by length (1-80 tokens)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "id": "9895d5a0-1b1e-4e9f-9605-5c72273bec0a",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "clean-corpus.perl: processing /corpus/news-commentary-v8.ru-en.true.ru & .en to /corpus/news-commentary-v8.ru-en.clean, cutoff 1-80, ratio 9\n",
-      "..........(100000).....\n",
-      "Input sentences: 150217  Output sentences:  146806\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/training/clean-corpus-n.perl \\\n",
-    "        /corpus/news-commentary-v8.ru-en.true ru en \\\n",
-    "        /corpus/news-commentary-v8.ru-en.clean 1 80\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "id": "0f462b8d-a190-4705-9cab-b2f781f251b7",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "146806 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.en\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.en"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "id": "fa86725c-ed19-4d3d-b7a1-b6a615f4ad3f",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "146806 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.ru\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.ru"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "82c5ef06-fb26-4a78-871e-82b3cb0218a6",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Training the language model\n",
-    "\n",
-    "The language model (LM) is used to keep the translation output well-formed, so it is built on the target language. This example builds a 3-gram language model with the KenLM library."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "id": "1c161035-4d77-4c54-8cea-b396995b61c5",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "=== 1/5 Counting and sorting n-grams ===\n",
-      "Reading /corpus/news-commentary-v8.ru-en.true.en\n",
-      "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
-      "****************************************************************************************************\n",
-      "Unigram tokens 4008930 types 62481\n",
-      "=== 2/5 Calculating and sorting adjusted counts ===\n",
-      "Chain sizes: 1:749772 2:4672178176 3:8760334336\n",
-      "Statistics:\n",
-      "1 62481 D1=0.624963 D2=0.967568 D3+=1.32128\n",
-      "2 905932 D1=0.742376 D2=1.08513 D3+=1.36658\n",
-      "3 2382985 D1=0.837421 D2=1.16432 D3+=1.35272\n",
-      "Memory estimate for binary LM:\n",
-      "type    MB\n",
-      "probing 63 assuming -p 1.5\n",
-      "probing 68 assuming -r models -p 1.5\n",
-      "trie    25 without quantization\n",
-      "trie    14 assuming -q 8 -b 8 quantization \n",
-      "trie    24 assuming -a 22 array pointer compression\n",
-      "trie    12 assuming -a 22 -q 8 -b 8 array pointer compression and quantization\n",
-      "=== 3/5 Calculating and sorting initial probabilities ===\n",
-      "Chain sizes: 1:749772 2:14494912 3:47659700\n",
-      "=== 4/5 Calculating and writing order-interpolated probabilities ===\n",
-      "Chain sizes: 1:749772 2:14494912 3:47659700\n",
-      "=== 5/5 Writing ARPA model ===\n",
-      "Name:lmplz\tVmPeak:13295048 kB\tVmRSS:22284 kB\tRSSMax:3097284 kB\tuser:5.47198\tsys:1.30134\tCPU:6.77332\treal:6.84119\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/bin/lmplz \\\n",
-    "        -o 3 \\\n",
-    "        < /corpus/news-commentary-v8.ru-en.true.en \\\n",
-    "        > /corpus/news-commentary-v8.ru-en.arpa.en\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 53,
-   "id": "d55b7ef2-cd4e-46eb-89e6-b54c296d7bc2",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Reading /corpus/news-commentary-v8.ru-en.arpa.en\n",
-      "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
-      "****************************************************************************************************\n",
-      "SUCCESS\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/bin/build_binary \\\n",
-    "        /corpus/news-commentary-v8.ru-en.arpa.en \\\n",
-    "        /corpus/news-commentary-v8.ru-en.blm.en\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "eb4849c8-e5ec-4edf-afac-0b2e128bed00",
-   "metadata": {},
-   "source": [
-    "### ARPA language model format\n",
-    "\n",
-    "The ARPA file stores the log10 conditional probabilities and backoff weights estimated from the tokens of the training set. In this format:\n",
-    "1. The header lists every n-gram order requested for training (1-, 2-, and 3-grams in our case).\n",
-    "2. For each order it gives the number of n-grams obtained after training.\n",
-    "3. Then each n-gram is written as one line with:\n",
-    "    - The log10 probability of the last token given the preceding ones (for a unigram there is no context, so this is just the unigram probability): $\\log_{10} p(w_n \\mid w_1, w_2, \\ldots, w_{n-1})$.\n",
-    "    - The n-gram itself: $w_1, w_2, \\ldots, w_n$.\n",
-    "    - The log10 backoff weight of the n-gram, applied when a longer n-gram that extends it is missing from the model (omitted for n-grams of the highest order)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 49,
-   "id": "07920849-4efd-44fe-96e1-bf5ad6e168a4",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\\data\\\n",
-      "ngram 1=62481\n",
-      "ngram 2=905932\n",
-      "ngram 3=2382985\n",
-      "\n",
-      "\\1-grams:\n",
-      "-5.970607\t<unk>\t0\n",
-      "0\t<s>\t-1.3166543\n",
-      "-3.831133\t</s>\t0\n",
-      "-4.735731\tSan\t-0.38199094\n",
-      "-5.8285656\tFRANCISCO\t-0.12937579\n",
-      "-2.2254546\t–\t-0.7769198\n",
-      "-4.735731\tIt\t-0.27971995\n",
-      "-2.4997213\thas\t-0.7978452\n",
-      "-3.5793376\tnever\t-0.29212165\n",
-      "-3.6418643\tbeen\t-0.2847993\n",
-      "-4.0782976\teasy\t-0.40742075\n",
-      "-2.0642972\tto\t-0.90560645\n",
-      "-2.6063344\thave\t-0.70169735\n",
-      "-2.3986692\ta\t-0.69319177\n"
-     ]
-    }
-   ],
-   "source": [
-    "!head -n 20 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.arpa.en"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a17991e8-db30-4317-8d33-75616017630d",
-   "metadata": {},
-   "source": [
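-    "### Working with the language model via KenLM\n",
-    "\n",
-    "Language models are useful in a variety of NLP tasks.\n",
-    "\n",
-    "Example: using perplexity to detect machine-generated text. In KenLM, `score` returns the log10 probability of a sentence, and `perplexity` is computed from it as $10^{-\\mathrm{score}/N}$, where $N$ is the number of tokens in the sentence plus one for the end-of-sentence marker `</s>`."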
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 54,
-   "id": "a97bf13b-a150-446a-908f-cef7a9e098f0",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Loading the LM will be faster if you build a binary file.\n",
-      "Reading /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.arpa.en\n",
-      "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
-      "****************************************************************************************************\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "(-16.963727951049805, 671.8742423267146)"
-      ]
-     },
-     "execution_count": 54,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "language_model = kenlm.Model('/mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.arpa.en')\n",
-    "\n",
-    "sentence = 'What about your name ?'\n",
-    "language_model.score(sentence), language_model.perplexity(sentence)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 55,
-   "id": "f0b34d7c-0d1d-43e4-bf4c-21654e3ae9df",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "(-16.963727951049805, 671.8742423267146)"
-      ]
-     },
-     "execution_count": 55,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "language_model = kenlm.Model('/mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.blm.en')\n",
-    "\n",
-    "sentence = 'What about your name ?'\n",
-    "language_model.score(sentence), language_model.perplexity(sentence)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "4888d7a1-cf8d-4416-8204-28c2e8a7fcca",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Training the translator\n",
-    "\n",
-    "Training consists of:\n",
-    "- word alignment (with GIZA++; the command below enables its multithreaded variant via `-mgiza`);\n",
-    "- phrase extraction and scoring;\n",
-    "- building the lexicalized reordering table.\n",
-    "\n",
-    "P.S. The next cell takes about an hour to run."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 62,
-   "id": "2e946018-d91b-44cb-9fde-5b2e7ae7a293",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/training/train-model.perl \\\n",
-    "        -root-dir /corpus/train \\\n",
-    "        -corpus /corpus/news-commentary-v8.ru-en.clean \\\n",
-    "        -f ru -e en \\\n",
-    "        -alignment grow-diag-final-and -reordering msd-bidirectional-fe \\\n",
-    "        -lm 0:3:/corpus/news-commentary-v8.ru-en.blm.en \\\n",
-    "        -external-bin-dir /opt/moses_tools/ \\\n",
-    "        -cores 4 \\\n",
-    "        --parallel -mgiza -mgiza-cpus 4 &> /corpus/training.log\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5636d121-0a17-4ce2-8b2d-26f523cda11f",
-   "metadata": {},
-   "source": [
-    "After training, we get the following configuration file, which fully describes the translation model.\n",
-    "\n",
-    "The model consists of:\n",
-    "1. A phrase table with probabilities for phrases (n-grams) in the source and target languages.\n",
-    "2. A lexicalized reordering table, whose type string reads as follows:\n",
-    "    - `wbe`: word-based extraction, where probabilities are estimated from the individual words inside phrases;\n",
-    "    - `msd`: the orientation classes used for reordering (monotone, swap, discontinuous);\n",
-    "    - `bidirectional`: whether probabilities are computed left to right or right to left (here both);\n",
-    "    - `fe`: which phrases the probabilities are conditioned on, the source phrase only or the target phrase as well (here both);\n",
-    "    - `allff`: every reordering score is used as a separate feature for estimating the probability.\n",
-    "3. A language model for the output text."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 71,
-   "id": "5bcaba57-746d-47a3-8bb2-04046fd0e34a",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "#########################\n",
-      "### MOSES CONFIG FILE ###\n",
-      "#########################\n",
-      "\n",
-      "# input factors\n",
-      "[input-factors]\n",
-      "0\n",
-      "\n",
-      "# mapping steps\n",
-      "[mapping]\n",
-      "0 T 0\n",
-      "\n",
-      "[distortion-limit]\n",
-      "6\n",
-      "\n",
-      "# feature functions\n",
-      "[feature]\n",
-      "UnknownWordPenalty\n",
-      "WordPenalty\n",
-      "PhrasePenalty\n",
-      "PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/corpus/train/model/phrase-table.gz input-factor=0 output-factor=0\n",
-      "LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz\n",
-      "Distortion\n",
-      "KENLM name=LM0 factor=0 path=/corpus/news-commentary-v8.ru-en.blm.en order=3\n",
-      "\n",
-      "# dense weights for feature functions\n",
-      "[weight]\n",
-      "# The default weights are NOT optimized for translation quality. You MUST tune the weights.\n",
-      "# Documentation for tuning is here: http://www.statmt.org/moses/?n=FactoredTraining.Tuning \n",
-      "UnknownWordPenalty0= 1\n",
-      "WordPenalty0= -1\n",
-      "PhrasePenalty0= 0.2\n",
-      "TranslationModel0= 0.2 0.2 0.2 0.2\n",
-      "LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3\n",
-      "Distortion0= 0.3\n",
-      "LM0= 0.5\n"
-     ]
-    }
-   ],
-   "source": [
-    "!cat /mnt/DATA/MSU2024NLP/corpus/train/model/moses.ini"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "id": "93206383-89ee-4e5d-9916-8020d111c19b",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "! ) и регулирующим ||| ! ) and in ||| 1 7.22626e-06 1 0.0174206 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||\n",
-      "! ) и регулирующим органам , ||| ! ) and in regulators , ||| 1 1.19187e-07 1 0.00118438 ||| 0-0 1-1 2-2 3-3 4-4 5-5 ||| 1 1 1 ||| |||\n",
-      "! ) и регулирующим органам , которые ||| ! ) and in regulators , who ||| 1 2.3692e-08 1 0.00017379 ||| 0-0 1-1 2-2 3-3 4-4 5-5 6-6 ||| 1 1 1 ||| |||\n",
-      "! ) и регулирующим органам ||| ! ) and in regulators ||| 1 1.60583e-07 1 0.00239106 ||| 0-0 1-1 2-2 3-3 4-4 ||| 1 1 1 ||| |||\n",
-      "! ) людям ||| ! ) to people ||| 1 0.0335482 0.5 0.00886362 ||| 0-0 1-1 2-3 ||| 1 2 1 ||| |||\n",
-      "! ) людям ||| listen ! ) to people ||| 1 0.0335482 0.5 1.08136e-07 ||| 0-1 1-2 2-4 ||| 1 2 1 ||| |||\n",
-      "! ) людям о ||| ! ) to people about ||| 0.5 0.00783575 0.5 0.0012305 ||| 0-0 1-1 2-3 3-4 ||| 2 2 1 ||| |||\n",
-      "! ) людям о ||| listen ! ) to people about ||| 0.5 0.00783575 0.5 1.50121e-08 ||| 0-1 1-2 2-4 3-5 ||| 2 2 1 ||| |||\n",
-      "! ) людям о существующих ||| ! ) to people about ||| 0.5 4.81115e-07 0.5 0.0012305 ||| 0-0 1-1 2-3 3-4 ||| 2 2 1 ||| |||\n",
-      "! ) людям о существующих ||| listen ! ) to people about ||| 0.5 4.81115e-07 0.5 1.50121e-08 ||| 0-1 1-2 2-4 3-5 ||| 2 2 1 ||| |||\n",
-      "! ) людям о существующих рисках . ||| ! ) to people about risks . ||| 1 9.15623e-09 1 0.000708028 ||| 0-0 1-1 2-3 3-4 5-5 6-6 ||| 1 1 1 ||| |||\n",
-      "! ) людям о существующих рисках ||| ! ) to people about risks ||| 1 9.82981e-09 0.5 0.00076376 ||| 0-0 1-1 2-3 3-4 5-5 ||| 1 2 1 ||| |||\n",
-      "! ) людям о существующих рисках ||| listen ! ) to people about risks ||| 1 9.82981e-09 0.5 9.31787e-09 ||| 0-1 1-2 2-4 3-5 5-6 ||| 1 2 1 ||| |||\n",
-      "! ) могли ||| ! ) were allowed ||| 1 0.00342545 1 2.13705e-05 ||| 0-0 1-1 2-2 2-3 ||| 2 2 2 ||| |||\n",
-      "! ) могли свободно перемещаться ||| ! ) were allowed to move freely ||| 1 2.8914e-05 1 3.85126e-07 ||| 0-0 1-1 2-2 2-3 4-4 4-5 3-6 4-6 ||| 2 2 2 ||| |||\n",
-      "! ) может ||| ! ) can ||| 1 0.196422 1 0.148141 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||\n",
-      "! ) может заметить ||| ! ) can tell ||| 1 0.00070655 1 0.0019752 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||\n",
-      "! ) может заметить разницу . ||| ! ) can tell the difference . ||| 1 1.46388e-05 1 6.80977e-05 ||| 0-0 1-1 2-2 3-3 4-4 4-5 5-6 ||| 1 1 1 ||| |||\n",
-      "! ) может заметить разницу ||| ! ) can tell the difference ||| 1 1.57156e-05 1 7.34579e-05 ||| 0-0 1-1 2-2 3-3 4-4 4-5 ||| 1 1 1 ||| |||\n",
-      "! ) можно ||| ! ) can be ||| 1 0.053651 1 0.0271419 ||| 0-0 1-1 2-2 2-3 ||| 1 1 1 ||| |||\n",
-      "! ) можно связать с ||| ! ) can be connected to the ||| 1 8.88438e-07 1 2.49973e-07 ||| 0-0 1-1 2-2 2-3 4-4 3-5 4-6 ||| 1 1 1 ||| |||\n",
-      "! , Google ||| ! , Google ||| 1 0.421594 1 0.299012 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||\n",
-      "! , Google и ||| ! , Google , and ||| 1 0.167997 1 0.00288072 ||| 0-0 1-1 2-2 3-3 3-4 ||| 1 1 1 ||| |||\n",
-      "! , ||| ! , ||| 0.333333 0.629148 0.333333 0.400216 ||| 0-0 1-1 ||| 3 3 1 ||| |||\n",
-      "! , ||| ! ; ||| 1 0.117336 0.333333 0.00189467 ||| 0-0 1-1 ||| 1 3 1 ||| |||\n",
-      "! , ||| ! movement , ||| 1 0.629148 0.333333 3.96614e-05 ||| 0-0 1-2 ||| 1 3 1 ||| |||\n",
-      "! , и ||| ! movement , and ||| 1 0.497454 1 2.95386e-05 ||| 0-0 1-2 2-3 ||| 1 1 1 ||| |||\n",
-      "! , и его ||| ! movement , and his ||| 1 0.228702 1 1.04108e-05 ||| 0-0 1-2 2-3 3-4 ||| 1 1 1 ||| |||\n",
-      "! , при этом ||| ! ; nor ||| 1 9.34048e-05 1 1.80558e-05 ||| 0-0 1-1 2-2 3-2 ||| 1 1 1 ||| |||\n",
-      "! , при этом у них нет ||| ! ; nor do they have ||| 1 2.87476e-10 1 7.10103e-10 ||| 0-0 1-1 2-2 3-2 4-3 5-4 4-5 6-5 ||| 1 1 1 ||| |||\n",
-      "! , принимает ||| who ||| 0.000126183 3.82829e-09 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
-      "! , принимает решение ||| who ||| 0.000126183 1.09106e-12 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
-      "! , принимает решение о ||| who ||| 0.000126183 2.77272e-15 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
-      "! , принимает решение о том , ||| who ||| 0.000126183 2.10722e-18 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
-      "! , принимает решение о том ||| who ||| 0.000126183 1.04201e-17 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
-      "! - ||| ! , ||| 0.333333 0.00355096 0.5 0.0950936 ||| 0-0 1-1 ||| 3 2 1 ||| |||\n",
-      "! - ||| meantime , ||| 0.0136986 3.52027e-05 0.5 0.000275631 ||| 0-0 1-1 ||| 73 2 1 ||| |||\n",
-      "! - Министерства Иностранных Дел ||| ! --of the Foreign ||| 1 0.0017815 1 0.00103764 ||| 0-0 1-1 2-1 3-1 4-1 4-3 ||| 1 1 1 ||| |||\n",
-      "! - Министерства Иностранных Дел и ||| ! --of the Foreign and ||| 0.5 0.00140859 1 0.000772806 ||| 0-0 1-1 2-1 3-1 4-1 4-3 5-4 ||| 2 1 1 ||| |||\n",
-      "! - Министерства Иностранных Дел и Министерства ||| ! --of the Foreign and ||| 0.5 1.92977e-08 1 0.000772806 ||| 0-0 1-1 2-1 3-1 4-1 4-3 5-4 ||| 2 1 1 ||| |||\n",
-      "! - вскричала она . ||| ! , she cried . ||| 1 0.00029683 1 0.0100263 ||| 0-0 1-1 3-2 2-3 4-4 ||| 1 1 1 ||| |||\n",
-      "! - вскричала она ||| ! , she cried ||| 1 0.000318666 1 0.0108155 ||| 0-0 1-1 3-2 2-3 ||| 1 1 1 ||| |||\n",
-      "! - невидимая рука Адама Смита ||| ! -- Adam Smith &apos;s invisible hand ||| 1 7.57105e-05 1 1.91127e-05 ||| 0-0 1-1 4-2 5-3 1-4 2-5 3-6 ||| 1 1 1 ||| |||\n",
-      "! ? ! ||| ! ? ! ||| 1 0.655991 1 0.612445 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||\n",
-      "! ? ||| ! ? ||| 1 0.77388 1 0.758011 ||| 0-0 1-1 ||| 1 1 1 ||| |||\n",
-      "! ||| ! doesn ||| 1 0.847666 0.00215983 0.000176055 ||| 0-0 ||| 1 463 1 ||| |||\n",
-      "! ||| ! doesn ’ t have ||| 1 0.847666 0.00215983 4.40182e-12 ||| 0-0 ||| 1 463 1 ||| |||\n",
-      "! ||| ! doesn ’ t ||| 1 0.847666 0.00215983 9.45753e-10 ||| 0-0 ||| 1 463 1 ||| |||\n",
-      "! ||| ! doesn ’ ||| 1 0.847666 0.00215983 2.58544e-06 ||| 0-0 ||| 1 463 1 ||| |||\n",
-      "! ||| ! is as good as dead ||| 1 0.847666 0.00215983 1.11895e-14 ||| 0-0 ||| 1 463 1 ||| |||\n"
-     ]
-    }
-   ],
-   "source": [
-    "extracted = ''\n",
-    "with gzip.open('/mnt/DATA/MSU2024NLP/corpus/train/model/phrase-table.gz','rt') as f:\n",
-    "    for _ in range(1000):\n",
-    "        extracted += next(f)\n",
-    "print('\\n'.join(extracted.split('\\n')[150:200]))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "id": "1ebf1f18-86e1-4273-a07b-6e3acd169bcb",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "! ! ! ||| ! ! ! ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
-      "! ! ||| ! ! ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857\n",
-      "! ! ||| ! ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
-      "! &amp; quot ; ||| ! ” ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857\n",
-      "! &amp; quot ; – ||| ! ” is ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
-      "! &amp; quot ; – требование , ||| ! ” is a demand ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
-      "! &amp; quot ; – требование ||| ! ” is a demand ||| 0.6 0.2 0.2 0.2 0.2 0.6\n",
-      "! &amp; quot ||| ! ” ||| 0.6 0.2 0.2 0.2 0.2 0.6\n",
-      "! &quot; , ||| ! &quot; ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
-      "! &quot; , а ||| ! &quot; rather ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "extracted = ''\n",
-    "with gzip.open('/mnt/DATA/MSU2024NLP/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz','rt') as f:\n",
-    "    for _ in range(10):\n",
-    "        extracted += next(f)\n",
-    "print(extracted)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9d75022a-81f1-4689-be1d-69d8bd12e0ca",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Подготовка тестовой и валидационной выборки"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "312a5667-bf56-454f-a309-5f7495ea741d",
-   "metadata": {},
-   "source": [
-    "### Загрузка данных\n",
-    "\n",
-    "Полученные данные получить в `/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.<en/ru>`"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 63,
-   "id": "71a52c16-5829-4dfe-b443-90d41544b711",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "# !wget https://object.pouta.csc.fi/OPUS-Books/v1/moses/en-ru.txt.zip\n",
-    "# !unzip unzip en-ru.txt.zip"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ec419640-a2aa-4c75-b2d4-2df0c86b9edc",
-   "metadata": {},
-   "source": [
-    "### Готовим тестовые и валидационные данные по аналогии с обучающимися данными"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 64,
-   "id": "aa484e56-e66f-4a21-bec1-c53efe60464d",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Tokenizer Version 1.1\n",
-      "Language: en\n",
-      "Number of threads: 1\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/tokenizer/tokenizer.perl \\\n",
-    "        -l en \\\n",
-    "        < /corpus/testing/Books.en-ru.en \\\n",
-    "        > /corpus/testing/Books.en-ru.tok.en\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 65,
-   "id": "ccad464c-d48f-412c-8c0d-fb2c3cb2b198",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Tokenizer Version 1.1\n",
-      "Language: ru\n",
-      "Number of threads: 1\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/tokenizer/tokenizer.perl \\\n",
-    "        -l ru \\\n",
-    "        < /corpus/testing/Books.en-ru.ru \\\n",
-    "        > /corpus/testing/Books.en-ru.tok.ru\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 66,
-   "id": "5ab3e995-4cfc-4398-b40d-dc7e96fdb238",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/recaser/truecase.perl \\\n",
-    "        --model /corpus/truecase-model.en \\\n",
-    "        < /corpus/testing/Books.en-ru.tok.en \\\n",
-    "        > /corpus/testing/Books.en-ru.true.en\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 67,
-   "id": "558cb6ab-f0f0-43e6-a9eb-822aa9bcd35d",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/recaser/truecase.perl \\\n",
-    "        --model /corpus/truecase-model.ru \\\n",
-    "        < /corpus/testing/Books.en-ru.tok.ru \\\n",
-    "        > /corpus/testing/Books.en-ru.true.ru\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 68,
-   "id": "f421a934-abc1-44ab-a0a1-445671b5ed5b",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "17496 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 69,
-   "id": "c6e4e8ae-66f4-4499-9924-129f4ab2371e",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "17496 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru\n"
-     ]
-    }
-   ],
-   "source": [
-    "!wc -l /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "3f1be11d-03c6-45dd-93be-17ef1b68b6a0",
-   "metadata": {},
-   "source": [
-    "### Делим на тестовую часть и на валидационную (нужен небольшой объем данных)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 70,
-   "id": "49f97a30-6564-42a1-8d3d-fb10f458c3da",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!head -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.val.true.en\n",
-    "!head -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.val.true.ru\n",
-    "\n",
-    "!tail -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.en\n",
-    "!tail -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.ru"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "95c90be5-b2e7-4d8e-a82c-7bd3dc25ed3a",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/bin/moses \\\n",
-    "        -f /corpus/train/model/moses.ini \\\n",
-    "        -threads 4 -s 40 -dl 8 \\\n",
-    "        < /corpus/testing/Books.en-ru.test.true.ru \\\n",
-    "        > /corpus/testing/Books.en-ru.test.true.en.TRANSLATED\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 30,
-   "id": "f45ef53f-84e8-428f-b13f-c1dceb427c09",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "отчего же не потушить свечу, когда смотреть больше не на что, когда гадко смотреть на все это?\n",
-      "why is not calm свечу when look more than that, when гадко look at all this?\n",
-      "\n",
-      "но как?\n",
-      "but how?\n",
-      "\n",
-      "зачем этот кондуктор пробежал по жердочке, зачем они кричат, эти молодые люди в том вагоне?\n",
-      "why this кондуктор пробежал on жердочке, why should they кричат, these young people is вагоне?\n",
-      "\n",
-      "зачем они говорят, зачем они смеются?\n",
-      "why should they say, why should they sneer?\n",
-      "\n",
-      "все неправда, все ложь, все обман, все зло!.. \"\n",
-      "it is untrue, all lies, all fraud, all evil!. \"\n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "with open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.ru', 'r') as fin:\n",
-    "    with open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.en.TRANSLATED', 'r') as fout:\n",
-    "        for _ in range(5):\n",
-    "            text = MosesDetokenizer('ru').detokenize(next(fin).strip().split())  + '\\n' \\\n",
-    "                 + MosesDetokenizer('en').detokenize(next(fout).strip().split()) + '\\n'\n",
-    "            print(text)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 73,
-   "id": "d703529f-f91c-425e-8d51-bfc2a9a45bec",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "It is not advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n",
-      "BLEU = 7.83, 47.7/14.5/5.5/2.1 (BP=0.831, ratio=0.844, hyp_len=20892, ref_len=24754)\n"
-     ]
-    }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/generic/multi-bleu.perl \\\n",
-    "        -lc /corpus/testing/Books.en-ru.test.true.en \\\n",
-    "        < /corpus/testing/Books.en-ru.test.true.en.TRANSLATED\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2d8a4420-7a8a-4a27-a983-1b0819d8b588",
-   "metadata": {},
-   "source": [
-    "## Подбор гиперпараметров на валидации"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 76,
-   "id": "228890f8-d0ce-4728-8f23-e8453afd03fe",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/training/mert-moses.pl \\\n",
-    "        --working-dir /corpus/train/mert-dir/ \\\n",
-    "        --threads 4 \\\n",
-    "        /corpus/testing/Books.en-ru.val.true.ru \\\n",
-    "        /corpus/testing/Books.en-ru.val.true.en \\\n",
-    "        /opt/moses/bin/moses /corpus/train/model/moses.ini \\\n",
-    "        --mertdir /opt/moses/bin/ &> /corpus/mert.out\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 77,
-   "id": "3ad8d545-5921-4fb6-919b-f9ce1fde898b",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "# MERT optimized configuration\n",
-      "# decoder /opt/moses/bin/moses\n",
-      "# BLEU 0.0654292 on dev /corpus/testing/Books.en-ru.val.true.ru\n",
-      "# We were before running iteration 26\n",
-      "# finished Thu Sep 26 23:19:46 UTC 2024\n",
-      "### MOSES CONFIG FILE ###\n",
-      "#########################\n",
-      "\n",
-      "# input factors\n",
-      "[input-factors]\n",
-      "0\n",
-      "\n",
-      "# mapping steps\n",
-      "[mapping]\n",
-      "0 T 0\n",
-      "\n",
-      "[distortion-limit]\n",
-      "6\n",
-      "\n",
-      "# feature functions\n",
-      "[feature]\n",
-      "UnknownWordPenalty\n",
-      "WordPenalty\n",
-      "PhrasePenalty\n",
-      "PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/corpus/train/model/phrase-table.gz input-factor=0 output-factor=0\n",
-      "LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz\n",
-      "Distortion\n",
-      "KENLM name=LM0 factor=0 path=/corpus/news-commentary-v8.ru-en.blm.en order=3\n",
-      "\n",
-      "# dense weights for feature functions\n",
-      "[weight]\n",
-      "\n",
-      "LexicalReordering0= 0.00137749 -0.0615252 0.122087 0.0459385 0.146033 0.162246\n",
-      "Distortion0= -0.0214977\n",
-      "LM0= 0.0788921\n",
-      "WordPenalty0= -0.204385\n",
-      "PhrasePenalty0= 0.0267022\n",
-      "TranslationModel0= 0.00329618 0.0880665 0.0237377 0.0142152\n",
-      "UnknownWordPenalty0= 1\n"
-     ]
-    }
-   ],
-   "source": [
-    "!cat /mnt/DATA/MSU2024NLP/corpus/train/mert-dir/moses.ini"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "81aad600-c036-47e6-aebf-78f755314525",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/bin/moses \\\n",
-    "        -f /corpus/train/mert-dir/moses.ini \\\n",
-    "        -threads 4 -s 40 -dl 8 \\\n",
-    "        < /corpus/testing/Books.en-ru.test.true.ru \\\n",
-    "        > /corpus/testing/Books.en-ru.test.true.en.TRANSLATED.MERT\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 31,
-   "id": "bbdcc0d9-4160-47e3-bc01-4d55de3ec0fb",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "отчего же не потушить свечу, когда смотреть больше не на что, когда гадко смотреть на все это?\n",
-      "why is no longer on свечу calm that, when you look at the same time, when гадко look at all this?\n",
-      "\n",
-      "но как?\n",
-      "but how?\n",
-      "\n",
-      "зачем этот кондуктор пробежал по жердочке, зачем они кричат, эти молодые люди в том вагоне?\n",
-      "why this кондуктор кричат пробежал жердочке on these young people, why they вагоне in?\n",
-      "\n",
-      "зачем они говорят, зачем они смеются?\n",
-      "why do they say, why should they sneer?\n",
-      "\n",
-      "все неправда, все ложь, все обман, все зло!.. \"\n",
-      "still, it is untrue, all of the lie, all of deception. \"all the many evils!\n",
-      "\n"
-     ]
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "a2f21391-ecd3-4a99-9109-8989542546e5",
+      "metadata": {
+        "tags": [],
+        "id": "a2f21391-ecd3-4a99-9109-8989542546e5"
+      },
+      "source": [
+        "# Installing external dependencies"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e3f70fb0-8b21-40fd-a28a-03fea5100f27",
+      "metadata": {
+        "tags": [],
+        "id": "e3f70fb0-8b21-40fd-a28a-03fea5100f27"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -qU https://github.com/kpu/kenlm/archive/master.zip sacremoses"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "a73e8d26-6df2-4f4e-a8a6-3d5a28989380",
+      "metadata": {
+        "tags": [],
+        "id": "a73e8d26-6df2-4f4e-a8a6-3d5a28989380"
+      },
+      "source": [
+        "# Importing libraries"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "423007be-e0c6-4cb1-8a0f-aa5128106acc",
+      "metadata": {
+        "tags": [],
+        "id": "423007be-e0c6-4cb1-8a0f-aa5128106acc"
+      },
+      "outputs": [],
+      "source": [
+        "import gzip\n",
+        "\n",
+        "import kenlm\n",
+        "from sacremoses import MosesDetokenizer"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "7b642fd2-71db-469b-923f-73ac2d87cb9a",
+      "metadata": {
+        "tags": [],
+        "id": "7b642fd2-71db-469b-923f-73ac2d87cb9a"
+      },
+      "source": [
+        "# MOSES"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "426c343a-56ff-4f96-b601-ce67a645c334",
+      "metadata": {
+        "tags": [],
+        "id": "426c343a-56ff-4f96-b601-ce67a645c334"
+      },
+      "source": [
+        "## Preparing data for training the translator\n",
+        "\n",
+        "Download the data into the `/mnt/DATA/MSU2024NLP/corpus` directory and unpack it there (this should produce a `training` directory)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "4c4ffa1e-eca0-438a-b622-0f2f2300d1c9",
+      "metadata": {
+        "id": "4c4ffa1e-eca0-438a-b622-0f2f2300d1c9"
+      },
+      "outputs": [],
+      "source": [
+        "# !wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz\n",
+        "# !tar zxvf training-parallel-nc-v8.tgz"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "b24fd07f-7876-49aa-85db-ddcfe428f4f8",
+      "metadata": {
+        "tags": [],
+        "id": "b24fd07f-7876-49aa-85db-ddcfe428f4f8"
+      },
+      "source": [
+        "### Text tokenization\n",
+        "\n",
+        "Here we use the tokenizer provided by the Moses developers. It is still being updated (new languages are added, and so on).\n",
+        "\n",
+        "Other tokenizer implementations can always be used instead (for example, modern BPE/WordPiece tokenizers). In general, such tokenizers can improve quality for some languages and also make multilingual statistical translation possible.\n",
+        "\n",
+        "[1] Asvarov A., Grabovoy A. The impact of multilinguality and tokenization on statistical machine translation // 2024 35th Conference of Open Innovations Association (FRUCT). — IEEE: 2024. — P. 149–157."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "d5a06ada-053f-45a3-8607-46f0ca862716",
+      "metadata": {
+        "tags": [],
+        "id": "d5a06ada-053f-45a3-8607-46f0ca862716",
+        "outputId": "98ba6981-a551-45c1-ecfd-c5c17b2d3bbc"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Tokenizer Version 1.1\n",
+            "Language: en\n",
+            "Number of threads: 1\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/tokenizer/tokenizer.perl -l en \\\n",
+        "        < /corpus/training/news-commentary-v8.ru-en.en \\\n",
+        "        > /corpus/news-commentary-v8.ru-en.tok.en\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "8fdcfff2-9a7b-4625-85fe-93c5ec0ac1c8",
+      "metadata": {
+        "tags": [],
+        "id": "8fdcfff2-9a7b-4625-85fe-93c5ec0ac1c8",
+        "outputId": "38adb031-3091-4f3b-95da-e3a1f4d13566"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Tokenizer Version 1.1\n",
+            "Language: ru\n",
+            "Number of threads: 1\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/tokenizer/tokenizer.perl -l ru \\\n",
+        "        < /corpus/training/news-commentary-v8.ru-en.ru \\\n",
+        "        > /corpus/news-commentary-v8.ru-en.tok.ru\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "714abf17-144d-49c4-8922-f61fecf9ec60",
+      "metadata": {
+        "id": "714abf17-144d-49c4-8922-f61fecf9ec60"
+      },
+      "source": [
+        "We now have a sample of about 150 thousand sentences for training the statistical translator. This example corpus has a whole range of shortcomings (and is rather small for a statistical translator in general).\n",
+        "\n",
+        "Notes from experience:\n",
+        "1. Training a decent statistical translator requires several million sentence pairs.\n",
+        "2. Parallel sentences for various languages can be taken from https://opus.nlpl.eu.\n",
+        "3. Translation quality depends very strongly on the quality of the data and of the tokenizer."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f16b299b-5a47-424b-8beb-54b284b27d4b",
+      "metadata": {
+        "tags": [],
+        "id": "f16b299b-5a47-424b-8beb-54b284b27d4b",
+        "outputId": "a5430856-a78f-41a6-8676-4c44a2687091"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.en\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.en"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "531ab059-b1c7-43ba-a726-d7bc603c6a95",
+      "metadata": {
+        "tags": [],
+        "id": "531ab059-b1c7-43ba-a726-d7bc603c6a95",
+        "outputId": "8c51beeb-380d-4f3c-80d9-2b7a93c4de11"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.ru\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.ru"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "2584696f-14af-4063-a13e-ddb733ed3028",
+      "metadata": {
+        "tags": [],
+        "id": "2584696f-14af-4063-a13e-ddb733ed3028",
+        "outputId": "7cb3ff8a-2ac3-4d06-bb07-27d492d1e4cc"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold .\n",
+            "Lately , with gold prices up more than 300 % over the last decade , it is harder than ever . Just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .\n",
+            "Wouldn ’ t you know it ?\n",
+            "Since their articles appeared , the price of gold has moved up still further .\n",
+            "Gold prices even hit a record-high $ 1,300 recently .\n",
+            "Last December , many gold bugs were arguing that the price was inevitably headed for $ 2,000 .\n",
+            "Now , emboldened by continuing appreciation , some are suggesting that gold could be headed even higher than that .\n",
+            "One successful gold investor recently explained to me that stock prices languished for a more than a decade before the Dow Jones index crossed the 1,000 mark in the early 1980 ’ s .\n",
+            "Since then , the index has climbed above 10,000 .\n",
+            "Now that gold has crossed the magic $ 1,000 barrier , why can ’ t it increase ten-fold , too ?\n"
+          ]
+        }
+      ],
+      "source": [
+        "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.en"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1fc0eda2-c125-43e3-bc61-373b6d5b0486",
+      "metadata": {
+        "tags": [],
+        "id": "1fc0eda2-c125-43e3-bc61-373b6d5b0486",
+        "outputId": "d4ccef6b-a1de-4485-a058-ad1d8d9e6422"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "САН-ФРАНЦІСКО . Разговор о стоимости золота редко получается рациональным , тем более в последнее время , так как цены на золото выросли более чем на 300 % за десятилетие .\n",
+            "Еще в декабре прошлого года экономисты-коллеги Мартин Фельдштейн и Нуриэль Рубини опубликовали свои пророческие статьи в колонках альтернативных мнений , храбро ставя в них под вопрос стремление игры на повышение , благоразумно указывая на риски золота .\n",
+            "И что бы вы думали ?\n",
+            "С тех пор как вышли их статьи , стоимость золота повысилась еще больше .\n",
+            "Недавно цена на золото даже достигла рекордной отметки в 1 300 долларов за унцию .\n",
+            "В декабре прошлого года многие сторонники сохранения денежных функций золота утверждали , что его цена неизбежно дойдет до 2 000 долларов .\n",
+            "Теперь , воодушевленные постоянным ростом цен , некоторые считают , что цена на золото может вырасти еще сильнее .\n",
+            "Один успешный инвестор в золото недавно объяснил мне , что цена акций падала в течение более десятилетия , прежде чем в начале 1980-х гг. индекс Доу Джонса пересек 1 000 отметку .\n",
+            "С тех пор данный индекс постоянно рос и уже превысил 10 000 отметку .\n",
+            "Теперь , когда цена на золото пересекла магический барьер в 1 000 долларов , не может ли она также вырасти в десять раз ?\n"
+          ]
+        }
+      ],
+      "source": [
+        "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.tok.ru"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "9206c0eb-6593-4f58-8c28-040e6de63671",
+      "metadata": {
+        "id": "9206c0eb-6593-4f58-8c28-040e6de63671"
+      },
+      "source": [
+        "### Restoring the case of the first word of a sentence (truecasing)\n",
+        "\n",
+        "Word case (capitalized or lowercase) has a strong effect on many NLP models. Models are often built to work only with lowercase text (the whole text is lowercased). That would work here too, but the documentation recommends \"redistributing\" word case instead, so that the first word of a sentence is not affected by sentence-initial capitalization and capitalization is kept only for words that occur more often in capitalized form."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "e54e10d7-2c9d-4024-823d-295a40b1bda4",
+      "metadata": {
+        "id": "e54e10d7-2c9d-4024-823d-295a40b1bda4"
+      },
+      "source": [
+        "#### Training"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "c55392f0-853d-46c0-b943-b1ddd7fb9521",
+      "metadata": {
+        "tags": [],
+        "id": "c55392f0-853d-46c0-b943-b1ddd7fb9521"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/recaser/train-truecaser.perl \\\n",
+        "        --model /corpus/truecase-model.en \\\n",
+        "        --corpus /corpus/news-commentary-v8.ru-en.tok.en\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "0d89cfc0-9a92-42b0-9ebe-c569aa620d1b",
+      "metadata": {
+        "tags": [],
+        "id": "0d89cfc0-9a92-42b0-9ebe-c569aa620d1b"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/recaser/train-truecaser.perl \\\n",
+        "        --model /corpus/truecase-model.ru \\\n",
+        "        --corpus /corpus/news-commentary-v8.ru-en.tok.ru\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "b757aa40-dee8-4fe5-a21e-664c5681b8a8",
+      "metadata": {
+        "id": "b757aa40-dee8-4fe5-a21e-664c5681b8a8"
+      },
+      "source": [
+        "#### Applying"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "47006d65-9d96-4d14-ab2f-7ee6099094bc",
+      "metadata": {
+        "tags": [],
+        "id": "47006d65-9d96-4d14-ab2f-7ee6099094bc"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/recaser/truecase.perl \\\n",
+        "        --model /corpus/truecase-model.en \\\n",
+        "        < /corpus/news-commentary-v8.ru-en.tok.en \\\n",
+        "        > /corpus/news-commentary-v8.ru-en.true.en\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "dfd7b142-169c-4d38-a958-ba347731ff9a",
+      "metadata": {
+        "tags": [],
+        "id": "dfd7b142-169c-4d38-a958-ba347731ff9a"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/recaser/truecase.perl \\\n",
+        "        --model /corpus/truecase-model.ru \\\n",
+        "        < /corpus/news-commentary-v8.ru-en.tok.ru \\\n",
+        "        > /corpus/news-commentary-v8.ru-en.true.ru\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "c7a8f2a0-ceb4-4f9c-b342-3589416d72d5",
+      "metadata": {
+        "tags": [],
+        "id": "c7a8f2a0-ceb4-4f9c-b342-3589416d72d5",
+        "outputId": "0ee2cdb2-3577-4f75-e5b1-07d1e6644f53"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.en\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.en"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "897f9cd6-7182-4cee-811b-d89a30c26f5f",
+      "metadata": {
+        "tags": [],
+        "id": "897f9cd6-7182-4cee-811b-d89a30c26f5f",
+        "outputId": "8f536964-ec92-4470-dd64-0f39ca6839d3"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "150217 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.ru\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.ru"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1e58e6db-4f90-4b2a-ba46-a9db4741f73b",
+      "metadata": {
+        "tags": [],
+        "id": "1e58e6db-4f90-4b2a-ba46-a9db4741f73b",
+        "outputId": "f9f873c9-11d4-4122-e005-5ce89d4bd15b"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "San FRANCISCO – It has never been easy to have a rational conversation about the value of gold .\n",
+            "lately , with gold prices up more than 300 % over the last decade , it is harder than ever . just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .\n",
+            "wouldn ’ t you know it ?\n",
+            "since their articles appeared , the price of gold has moved up still further .\n",
+            "gold prices even hit a record-high $ 1,300 recently .\n",
+            "last December , many gold bugs were arguing that the price was inevitably headed for $ 2,000 .\n",
+            "now , emboldened by continuing appreciation , some are suggesting that gold could be headed even higher than that .\n",
+            "one successful gold investor recently explained to me that stock prices languished for a more than a decade before the Dow Jones index crossed the 1,000 mark in the early 1980 ’ s .\n",
+            "since then , the index has climbed above 10,000 .\n",
+            "now that gold has crossed the magic $ 1,000 barrier , why can ’ t it increase ten-fold , too ?\n"
+          ]
+        }
+      ],
+      "source": [
+        "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.en"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "302a7bb4-dd4a-488a-8d94-3c576174cbb6",
+      "metadata": {
+        "tags": [],
+        "id": "302a7bb4-dd4a-488a-8d94-3c576174cbb6",
+        "outputId": "bbc9ffb1-5be1-4e7d-f3a7-a7eed37c5353"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "САН-ФРАНЦІСКО . разговор о стоимости золота редко получается рациональным , тем более в последнее время , так как цены на золото выросли более чем на 300 % за десятилетие .\n",
+            "еще в декабре прошлого года экономисты-коллеги Мартин Фельдштейн и Нуриэль Рубини опубликовали свои пророческие статьи в колонках альтернативных мнений , храбро ставя в них под вопрос стремление игры на повышение , благоразумно указывая на риски золота .\n",
+            "и что бы вы думали ?\n",
+            "с тех пор как вышли их статьи , стоимость золота повысилась еще больше .\n",
+            "недавно цена на золото даже достигла рекордной отметки в 1 300 долларов за унцию .\n",
+            "в декабре прошлого года многие сторонники сохранения денежных функций золота утверждали , что его цена неизбежно дойдет до 2 000 долларов .\n",
+            "теперь , воодушевленные постоянным ростом цен , некоторые считают , что цена на золото может вырасти еще сильнее .\n",
+            "один успешный инвестор в золото недавно объяснил мне , что цена акций падала в течение более десятилетия , прежде чем в начале 1980-х гг. индекс Доу Джонса пересек 1 000 отметку .\n",
+            "с тех пор данный индекс постоянно рос и уже превысил 10 000 отметку .\n",
+            "теперь , когда цена на золото пересекла магический барьер в 1 000 долларов , не может ли она также вырасти в десять раз ?\n"
+          ]
+        }
+      ],
+      "source": [
+        "!head -n 10 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.true.ru"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "8cc8ce66-4013-4b46-8594-ab3b774dd50e",
+      "metadata": {
+        "id": "8cc8ce66-4013-4b46-8594-ab3b774dd50e"
+      },
+      "source": [
+        "After this preprocessing step, sentence-initial words are no longer always capitalized: capitalization remains only where it is justified."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "cf94c10c-4936-4111-a628-eadc99e1d36b",
+      "metadata": {
+        "id": "cf94c10c-4936-4111-a628-eadc99e1d36b"
+      },
+      "source": [
+        "### Filtering sentence pairs by length (1-80 tokens)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "9895d5a0-1b1e-4e9f-9605-5c72273bec0a",
+      "metadata": {
+        "tags": [],
+        "id": "9895d5a0-1b1e-4e9f-9605-5c72273bec0a",
+        "outputId": "af88ce83-eab6-44ab-a906-c96b0fab1418"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "clean-corpus.perl: processing /corpus/news-commentary-v8.ru-en.true.ru & .en to /corpus/news-commentary-v8.ru-en.clean, cutoff 1-80, ratio 9\n",
+            "..........(100000).....\n",
+            "Input sentences: 150217  Output sentences:  146806\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/training/clean-corpus-n.perl \\\n",
+        "        /corpus/news-commentary-v8.ru-en.true ru en \\\n",
+        "        /corpus/news-commentary-v8.ru-en.clean 1 80\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "0f462b8d-a190-4705-9cab-b2f781f251b7",
+      "metadata": {
+        "tags": [],
+        "id": "0f462b8d-a190-4705-9cab-b2f781f251b7",
+        "outputId": "48e4011f-9787-4261-a4b7-6b8d39db31d4"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "146806 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.en\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.en"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "fa86725c-ed19-4d3d-b7a1-b6a615f4ad3f",
+      "metadata": {
+        "tags": [],
+        "id": "fa86725c-ed19-4d3d-b7a1-b6a615f4ad3f",
+        "outputId": "b055c3cf-2e68-4dc3-c4a9-6497b864f097"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "146806 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.ru\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.clean.ru"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "82c5ef06-fb26-4a78-871e-82b3cb0218a6",
+      "metadata": {
+        "tags": [],
+        "id": "82c5ef06-fb26-4a78-871e-82b3cb0218a6"
+      },
+      "source": [
+        "## Training the language model\n",
+        "\n",
+        "The language model (LM) is used to make the translation output fluent, so it is built on the target language. This example builds a 3-gram language model with KenLM."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1c161035-4d77-4c54-8cea-b396995b61c5",
+      "metadata": {
+        "tags": [],
+        "id": "1c161035-4d77-4c54-8cea-b396995b61c5",
+        "outputId": "b4112630-b68b-40b7-b55d-7a1badea3fc8"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "=== 1/5 Counting and sorting n-grams ===\n",
+            "Reading /corpus/news-commentary-v8.ru-en.true.en\n",
+            "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
+            "****************************************************************************************************\n",
+            "Unigram tokens 4008930 types 62481\n",
+            "=== 2/5 Calculating and sorting adjusted counts ===\n",
+            "Chain sizes: 1:749772 2:4672178176 3:8760334336\n",
+            "Statistics:\n",
+            "1 62481 D1=0.624963 D2=0.967568 D3+=1.32128\n",
+            "2 905932 D1=0.742376 D2=1.08513 D3+=1.36658\n",
+            "3 2382985 D1=0.837421 D2=1.16432 D3+=1.35272\n",
+            "Memory estimate for binary LM:\n",
+            "type    MB\n",
+            "probing 63 assuming -p 1.5\n",
+            "probing 68 assuming -r models -p 1.5\n",
+            "trie    25 without quantization\n",
+            "trie    14 assuming -q 8 -b 8 quantization \n",
+            "trie    24 assuming -a 22 array pointer compression\n",
+            "trie    12 assuming -a 22 -q 8 -b 8 array pointer compression and quantization\n",
+            "=== 3/5 Calculating and sorting initial probabilities ===\n",
+            "Chain sizes: 1:749772 2:14494912 3:47659700\n",
+            "=== 4/5 Calculating and writing order-interpolated probabilities ===\n",
+            "Chain sizes: 1:749772 2:14494912 3:47659700\n",
+            "=== 5/5 Writing ARPA model ===\n",
+            "Name:lmplz\tVmPeak:13295048 kB\tVmRSS:22284 kB\tRSSMax:3097284 kB\tuser:5.47198\tsys:1.30134\tCPU:6.77332\treal:6.84119\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/bin/lmplz \\\n",
+        "        -o 3 \\\n",
+        "        < /corpus/news-commentary-v8.ru-en.true.en \\\n",
+        "        > /corpus/news-commentary-v8.ru-en.arpa.en\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "d55b7ef2-cd4e-46eb-89e6-b54c296d7bc2",
+      "metadata": {
+        "tags": [],
+        "id": "d55b7ef2-cd4e-46eb-89e6-b54c296d7bc2",
+        "outputId": "83e68b10-67b9-45a7-9b9a-bb83fb83b9e3"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Reading /corpus/news-commentary-v8.ru-en.arpa.en\n",
+            "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
+            "****************************************************************************************************\n",
+            "SUCCESS\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/bin/build_binary \\\n",
+        "        /corpus/news-commentary-v8.ru-en.arpa.en \\\n",
+        "        /corpus/news-commentary-v8.ru-en.blm.en\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "eb4849c8-e5ec-4edf-afac-0b2e128bed00",
+      "metadata": {
+        "id": "eb4849c8-e5ec-4edf-afac-0b2e128bed00"
+      },
+      "source": [
+        "### ARPA language model format\n",
+        "\n",
+        "The ARPA file stores, as plain text, the log-probabilities estimated from the tokens of the training corpus:\n",
+        "1. The `\\data\\` header lists the n-gram orders requested for training (here 1/2/3-grams) together with the number of n-grams of each order in the trained model.\n",
+        "2. Each order then has its own section (`\\1-grams:`, `\\2-grams:`, `\\3-grams:`).\n",
+        "3. Each line of a section contains three tab-separated fields:\n",
+        "    - The base-10 logarithm of the probability of the last token of the n-gram given the preceding ones (the start of a sentence is represented by the `<s>` token): $\\log_{10} p(w_{n}|w_1, w_2, \\cdots w_{n-1})$.\n",
+        "    - The n-gram itself: $w_1, w_2, \\cdots w_n$.\n",
+        "    - The base-10 logarithm of the backoff weight of the n-gram, which is applied when a longer n-gram extending it is missing from the model (omitted for the highest order)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "07920849-4efd-44fe-96e1-bf5ad6e168a4",
+      "metadata": {
+        "tags": [],
+        "id": "07920849-4efd-44fe-96e1-bf5ad6e168a4",
+        "outputId": "6adf4449-3456-499c-9a09-15bfb6d8a991"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "\\data\\\n",
+            "ngram 1=62481\n",
+            "ngram 2=905932\n",
+            "ngram 3=2382985\n",
+            "\n",
+            "\\1-grams:\n",
+            "-5.970607\t<unk>\t0\n",
+            "0\t<s>\t-1.3166543\n",
+            "-3.831133\t</s>\t0\n",
+            "-4.735731\tSan\t-0.38199094\n",
+            "-5.8285656\tFRANCISCO\t-0.12937579\n",
+            "-2.2254546\t–\t-0.7769198\n",
+            "-4.735731\tIt\t-0.27971995\n",
+            "-2.4997213\thas\t-0.7978452\n",
+            "-3.5793376\tnever\t-0.29212165\n",
+            "-3.6418643\tbeen\t-0.2847993\n",
+            "-4.0782976\teasy\t-0.40742075\n",
+            "-2.0642972\tto\t-0.90560645\n",
+            "-2.6063344\thave\t-0.70169735\n",
+            "-2.3986692\ta\t-0.69319177\n"
+          ]
+        }
+      ],
+      "source": [
+        "!head -n 20 /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.arpa.en"
+      ]
+    },
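+    {
+      "cell_type": "markdown",
+      "id": "arpa-backoff-sketch",
+      "metadata": {
+        "id": "arpa-backoff-sketch"
+      },
+      "source": [
+        "A hedged sketch of how the columns of the listing above are used, assuming the standard backoff reading of the ARPA format: the first column is a base-10 log-probability and the third column is a base-10 log backoff weight. The symbols $p_{\\text{file}}$ (a value read from the file) and $b$ (a backoff weight) are notation introduced here.\n",
+        "\n",
+        "$$\n",
+        "\\log_{10} p(w_3 \\mid w_1 w_2) =\n",
+        "\\begin{cases}\n",
+        "\\log_{10} p_{\\text{file}}(w_3 \\mid w_1 w_2), & \\text{if } w_1 w_2 w_3 \\text{ is listed} \\\\\n",
+        "\\log_{10} b(w_1 w_2) + \\log_{10} p(w_3 \\mid w_2), & \\text{otherwise}\n",
+        "\\end{cases}\n",
+        "$$\n",
+        "\n",
+        "Here $b(w_1 w_2)$ is taken as 1 when the bigram $w_1 w_2$ is not listed, and $p(w_3 \\mid w_2)$ is resolved by the same rule one order lower. For example, the line `-4.735731 San -0.38199094` gives $p(\\text{San}) = 10^{-4.735731} \\approx 1.8 \\cdot 10^{-5}$ and a backoff weight of $10^{-0.38199094} \\approx 0.415$."
+      ]
+    },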
+    {
+      "cell_type": "markdown",
+      "id": "a17991e8-db30-4317-8d33-75616017630d",
+      "metadata": {
+        "id": "a17991e8-db30-4317-8d33-75616017630d"
+      },
+      "source": [
+        "### Working with the language model via KenLM\n",
+        "\n",
+        "Language models are useful in a variety of NLP tasks.\n",
+        "\n",
+        "Example: using perplexity to detect machine-generated text."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "a97bf13b-a150-446a-908f-cef7a9e098f0",
+      "metadata": {
+        "tags": [],
+        "id": "a97bf13b-a150-446a-908f-cef7a9e098f0",
+        "outputId": "9666ab9a-7032-4005-fc5c-a64e505f33a4"
+      },
+      "outputs": [
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "Loading the LM will be faster if you build a binary file.\n",
+            "Reading /mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.arpa.en\n",
+            "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
+            "****************************************************************************************************\n"
+          ]
+        },
+        {
+          "data": {
+            "text/plain": [
+              "(-16.963727951049805, 671.8742423267146)"
+            ]
+          },
+          "execution_count": 54,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "language_model = kenlm.Model('/mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.arpa.en')\n",
+        "\n",
+        "sentence = 'What about your name ?'\n",
+        "language_model.score(sentence), language_model.perplexity(sentence)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f0b34d7c-0d1d-43e4-bf4c-21654e3ae9df",
+      "metadata": {
+        "tags": [],
+        "id": "f0b34d7c-0d1d-43e4-bf4c-21654e3ae9df",
+        "outputId": "91ad9fca-eab0-45af-e357-51b0b985a004"
+      },
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "(-16.963727951049805, 671.8742423267146)"
+            ]
+          },
+          "execution_count": 55,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "language_model = kenlm.Model('/mnt/DATA/MSU2024NLP/corpus/news-commentary-v8.ru-en.blm.en')\n",
+        "\n",
+        "sentence = 'What about your name ?'\n",
+        "language_model.score(sentence), language_model.perplexity(sentence)"
+      ]
+    },
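+    {
+      "cell_type": "markdown",
+      "id": "kenlm-perplexity-sketch",
+      "metadata": {
+        "id": "kenlm-perplexity-sketch"
+      },
+      "source": [
+        "The two numbers returned above are related. A hedged sketch, assuming that `score` is the base-10 log-probability of the whole sentence (including the end-of-sentence token `</s>`) and that perplexity is normalized by the number of tokens plus one for `</s>`; $N$, the number of whitespace-separated tokens, is notation introduced here.\n",
+        "\n",
+        "$$\n",
+        "\\mathrm{PPL} = 10^{-\\frac{\\mathrm{score}}{N + 1}}\n",
+        "$$\n",
+        "\n",
+        "For `'What about your name ?'` we have $N = 5$ and $\\mathrm{score} \\approx -16.9637$, so $\\mathrm{PPL} = 10^{16.9637 / 6} \\approx 671.87$, which agrees with the output above."
+      ]
+    },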
+    {
+      "cell_type": "markdown",
+      "id": "4888d7a1-cf8d-4416-8204-28c2e8a7fcca",
+      "metadata": {
+        "tags": [],
+        "id": "4888d7a1-cf8d-4416-8204-28c2e8a7fcca"
+      },
+      "source": [
+        "## Training the translation model\n",
+        "\n",
+        "Training consists of:\n",
+        "- word alignment (with GIZA++; the command below uses its multi-threaded variant, MGIZA);\n",
+        "- phrase extraction and scoring;\n",
+        "- building the lexicalized reordering table.\n",
+        "\n",
+        "P.S. The next cell takes about an hour to run."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "2e946018-d91b-44cb-9fde-5b2e7ae7a293",
+      "metadata": {
+        "tags": [],
+        "id": "2e946018-d91b-44cb-9fde-5b2e7ae7a293"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/training/train-model.perl \\\n",
+        "        -root-dir /corpus/train \\\n",
+        "        -corpus /corpus/news-commentary-v8.ru-en.clean \\\n",
+        "        -f ru -e en \\\n",
+        "        -alignment grow-diag-final-and -reordering msd-bidirectional-fe \\\n",
+        "        -lm 0:3:/corpus/news-commentary-v8.ru-en.blm.en \\\n",
+        "        -external-bin-dir /opt/moses_tools/ \\\n",
+        "        -cores 4 \\\n",
+        "        --parallel -mgiza -mgiza-cpus 4 &> /corpus/training.log\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "5636d121-0a17-4ce2-8b2d-26f523cda11f",
+      "metadata": {
+        "id": "5636d121-0a17-4ce2-8b2d-26f523cda11f"
+      },
+      "source": [
+        "After training, we obtain the following configuration file, which fully describes the translation model.\n",
+        "\n",
+        "The model consists of:\n",
+        "1. A phrase table: pairs of source-language and target-language phrases with their translation probabilities.\n",
+        "2. A lexicalized reordering table (`wbe-msd-bidirectional-fe-allff`):\n",
+        "    - wbe: orientations are extracted from the word alignments inside the phrases (word-based extraction);\n",
+        "    - msd: three orientation classes are modeled: monotone, swap, and discontinuous;\n",
+        "    - bidirectional: the orientation is computed in both directions, with respect to the previous phrase and the next phrase;\n",
+        "    - fe: probabilities are conditioned on both the source-language (f) and the target-language (e) phrase;\n",
+        "    - allff: every score is used as a separate feature function.\n",
+        "3. A language model of the output text."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "5bcaba57-746d-47a3-8bb2-04046fd0e34a",
+      "metadata": {
+        "tags": [],
+        "id": "5bcaba57-746d-47a3-8bb2-04046fd0e34a",
+        "outputId": "37d9aeab-a32b-41c1-9a05-0cc1eac14d36"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "#########################\n",
+            "### MOSES CONFIG FILE ###\n",
+            "#########################\n",
+            "\n",
+            "# input factors\n",
+            "[input-factors]\n",
+            "0\n",
+            "\n",
+            "# mapping steps\n",
+            "[mapping]\n",
+            "0 T 0\n",
+            "\n",
+            "[distortion-limit]\n",
+            "6\n",
+            "\n",
+            "# feature functions\n",
+            "[feature]\n",
+            "UnknownWordPenalty\n",
+            "WordPenalty\n",
+            "PhrasePenalty\n",
+            "PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/corpus/train/model/phrase-table.gz input-factor=0 output-factor=0\n",
+            "LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz\n",
+            "Distortion\n",
+            "KENLM name=LM0 factor=0 path=/corpus/news-commentary-v8.ru-en.blm.en order=3\n",
+            "\n",
+            "# dense weights for feature functions\n",
+            "[weight]\n",
+            "# The default weights are NOT optimized for translation quality. You MUST tune the weights.\n",
+            "# Documentation for tuning is here: http://www.statmt.org/moses/?n=FactoredTraining.Tuning \n",
+            "UnknownWordPenalty0= 1\n",
+            "WordPenalty0= -1\n",
+            "PhrasePenalty0= 0.2\n",
+            "TranslationModel0= 0.2 0.2 0.2 0.2\n",
+            "LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3\n",
+            "Distortion0= 0.3\n",
+            "LM0= 0.5\n"
+          ]
+        }
+      ],
+      "source": [
+        "!cat /mnt/DATA/MSU2024NLP/corpus/train/model/moses.ini"
+      ]
+    },
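+    {
+      "cell_type": "markdown",
+      "id": "log-linear-model-sketch",
+      "metadata": {
+        "id": "log-linear-model-sketch"
+      },
+      "source": [
+        "The `[weight]` section can be read through the standard log-linear formulation of phrase-based translation. This is a sketch in common textbook notation; the symbols $\\lambda_i$ and $h_i$ are introduced here and do not appear in the file. For a source sentence $f$, the decoder searches for the translation $e$ that maximizes a weighted sum of feature values:\n",
+        "\n",
+        "$$\n",
+        "\\hat{e} = \\arg\\max_{e} \\sum_{i} \\lambda_i \\, h_i(e, f)\n",
+        "$$\n",
+        "\n",
+        "Here $h_i$ are the feature values (log-probabilities from the phrase table, the reordering model, and the language model, plus the penalties), and $\\lambda_i$ are the numbers listed under `[weight]`: $1 + 1 + 1 + 4 + 6 + 1 + 1 = 15$ weights in this file. As the comment in the file itself says, these default weights are not tuned."
+      ]
+    },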
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "93206383-89ee-4e5d-9916-8020d111c19b",
+      "metadata": {
+        "tags": [],
+        "id": "93206383-89ee-4e5d-9916-8020d111c19b",
+        "outputId": "6cbee374-2fe0-4fc1-e2f8-e3af5082de76"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "! ) и регулирующим ||| ! ) and in ||| 1 7.22626e-06 1 0.0174206 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||\n",
+            "! ) и регулирующим органам , ||| ! ) and in regulators , ||| 1 1.19187e-07 1 0.00118438 ||| 0-0 1-1 2-2 3-3 4-4 5-5 ||| 1 1 1 ||| |||\n",
+            "! ) и регулирующим органам , которые ||| ! ) and in regulators , who ||| 1 2.3692e-08 1 0.00017379 ||| 0-0 1-1 2-2 3-3 4-4 5-5 6-6 ||| 1 1 1 ||| |||\n",
+            "! ) и регулирующим органам ||| ! ) and in regulators ||| 1 1.60583e-07 1 0.00239106 ||| 0-0 1-1 2-2 3-3 4-4 ||| 1 1 1 ||| |||\n",
+            "! ) людям ||| ! ) to people ||| 1 0.0335482 0.5 0.00886362 ||| 0-0 1-1 2-3 ||| 1 2 1 ||| |||\n",
+            "! ) людям ||| listen ! ) to people ||| 1 0.0335482 0.5 1.08136e-07 ||| 0-1 1-2 2-4 ||| 1 2 1 ||| |||\n",
+            "! ) людям о ||| ! ) to people about ||| 0.5 0.00783575 0.5 0.0012305 ||| 0-0 1-1 2-3 3-4 ||| 2 2 1 ||| |||\n",
+            "! ) людям о ||| listen ! ) to people about ||| 0.5 0.00783575 0.5 1.50121e-08 ||| 0-1 1-2 2-4 3-5 ||| 2 2 1 ||| |||\n",
+            "! ) людям о существующих ||| ! ) to people about ||| 0.5 4.81115e-07 0.5 0.0012305 ||| 0-0 1-1 2-3 3-4 ||| 2 2 1 ||| |||\n",
+            "! ) людям о существующих ||| listen ! ) to people about ||| 0.5 4.81115e-07 0.5 1.50121e-08 ||| 0-1 1-2 2-4 3-5 ||| 2 2 1 ||| |||\n",
+            "! ) людям о существующих рисках . ||| ! ) to people about risks . ||| 1 9.15623e-09 1 0.000708028 ||| 0-0 1-1 2-3 3-4 5-5 6-6 ||| 1 1 1 ||| |||\n",
+            "! ) людям о существующих рисках ||| ! ) to people about risks ||| 1 9.82981e-09 0.5 0.00076376 ||| 0-0 1-1 2-3 3-4 5-5 ||| 1 2 1 ||| |||\n",
+            "! ) людям о существующих рисках ||| listen ! ) to people about risks ||| 1 9.82981e-09 0.5 9.31787e-09 ||| 0-1 1-2 2-4 3-5 5-6 ||| 1 2 1 ||| |||\n",
+            "! ) могли ||| ! ) were allowed ||| 1 0.00342545 1 2.13705e-05 ||| 0-0 1-1 2-2 2-3 ||| 2 2 2 ||| |||\n",
+            "! ) могли свободно перемещаться ||| ! ) were allowed to move freely ||| 1 2.8914e-05 1 3.85126e-07 ||| 0-0 1-1 2-2 2-3 4-4 4-5 3-6 4-6 ||| 2 2 2 ||| |||\n",
+            "! ) может ||| ! ) can ||| 1 0.196422 1 0.148141 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||\n",
+            "! ) может заметить ||| ! ) can tell ||| 1 0.00070655 1 0.0019752 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||\n",
+            "! ) может заметить разницу . ||| ! ) can tell the difference . ||| 1 1.46388e-05 1 6.80977e-05 ||| 0-0 1-1 2-2 3-3 4-4 4-5 5-6 ||| 1 1 1 ||| |||\n",
+            "! ) может заметить разницу ||| ! ) can tell the difference ||| 1 1.57156e-05 1 7.34579e-05 ||| 0-0 1-1 2-2 3-3 4-4 4-5 ||| 1 1 1 ||| |||\n",
+            "! ) можно ||| ! ) can be ||| 1 0.053651 1 0.0271419 ||| 0-0 1-1 2-2 2-3 ||| 1 1 1 ||| |||\n",
+            "! ) можно связать с ||| ! ) can be connected to the ||| 1 8.88438e-07 1 2.49973e-07 ||| 0-0 1-1 2-2 2-3 4-4 3-5 4-6 ||| 1 1 1 ||| |||\n",
+            "! , Google ||| ! , Google ||| 1 0.421594 1 0.299012 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||\n",
+            "! , Google и ||| ! , Google , and ||| 1 0.167997 1 0.00288072 ||| 0-0 1-1 2-2 3-3 3-4 ||| 1 1 1 ||| |||\n",
+            "! , ||| ! , ||| 0.333333 0.629148 0.333333 0.400216 ||| 0-0 1-1 ||| 3 3 1 ||| |||\n",
+            "! , ||| ! ; ||| 1 0.117336 0.333333 0.00189467 ||| 0-0 1-1 ||| 1 3 1 ||| |||\n",
+            "! , ||| ! movement , ||| 1 0.629148 0.333333 3.96614e-05 ||| 0-0 1-2 ||| 1 3 1 ||| |||\n",
+            "! , и ||| ! movement , and ||| 1 0.497454 1 2.95386e-05 ||| 0-0 1-2 2-3 ||| 1 1 1 ||| |||\n",
+            "! , и его ||| ! movement , and his ||| 1 0.228702 1 1.04108e-05 ||| 0-0 1-2 2-3 3-4 ||| 1 1 1 ||| |||\n",
+            "! , при этом ||| ! ; nor ||| 1 9.34048e-05 1 1.80558e-05 ||| 0-0 1-1 2-2 3-2 ||| 1 1 1 ||| |||\n",
+            "! , при этом у них нет ||| ! ; nor do they have ||| 1 2.87476e-10 1 7.10103e-10 ||| 0-0 1-1 2-2 3-2 4-3 5-4 4-5 6-5 ||| 1 1 1 ||| |||\n",
+            "! , принимает ||| who ||| 0.000126183 3.82829e-09 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
+            "! , принимает решение ||| who ||| 0.000126183 1.09106e-12 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
+            "! , принимает решение о ||| who ||| 0.000126183 2.77272e-15 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
+            "! , принимает решение о том , ||| who ||| 0.000126183 2.10722e-18 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
+            "! , принимает решение о том ||| who ||| 0.000126183 1.04201e-17 1 0.0174419 ||| 2-0 ||| 7925 1 1 ||| |||\n",
+            "! - ||| ! , ||| 0.333333 0.00355096 0.5 0.0950936 ||| 0-0 1-1 ||| 3 2 1 ||| |||\n",
+            "! - ||| meantime , ||| 0.0136986 3.52027e-05 0.5 0.000275631 ||| 0-0 1-1 ||| 73 2 1 ||| |||\n",
+            "! - Министерства Иностранных Дел ||| ! --of the Foreign ||| 1 0.0017815 1 0.00103764 ||| 0-0 1-1 2-1 3-1 4-1 4-3 ||| 1 1 1 ||| |||\n",
+            "! - Министерства Иностранных Дел и ||| ! --of the Foreign and ||| 0.5 0.00140859 1 0.000772806 ||| 0-0 1-1 2-1 3-1 4-1 4-3 5-4 ||| 2 1 1 ||| |||\n",
+            "! - Министерства Иностранных Дел и Министерства ||| ! --of the Foreign and ||| 0.5 1.92977e-08 1 0.000772806 ||| 0-0 1-1 2-1 3-1 4-1 4-3 5-4 ||| 2 1 1 ||| |||\n",
+            "! - вскричала она . ||| ! , she cried . ||| 1 0.00029683 1 0.0100263 ||| 0-0 1-1 3-2 2-3 4-4 ||| 1 1 1 ||| |||\n",
+            "! - вскричала она ||| ! , she cried ||| 1 0.000318666 1 0.0108155 ||| 0-0 1-1 3-2 2-3 ||| 1 1 1 ||| |||\n",
+            "! - невидимая рука Адама Смита ||| ! -- Adam Smith &apos;s invisible hand ||| 1 7.57105e-05 1 1.91127e-05 ||| 0-0 1-1 4-2 5-3 1-4 2-5 3-6 ||| 1 1 1 ||| |||\n",
+            "! ? ! ||| ! ? ! ||| 1 0.655991 1 0.612445 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||\n",
+            "! ? ||| ! ? ||| 1 0.77388 1 0.758011 ||| 0-0 1-1 ||| 1 1 1 ||| |||\n",
+            "! ||| ! doesn ||| 1 0.847666 0.00215983 0.000176055 ||| 0-0 ||| 1 463 1 ||| |||\n",
+            "! ||| ! doesn ’ t have ||| 1 0.847666 0.00215983 4.40182e-12 ||| 0-0 ||| 1 463 1 ||| |||\n",
+            "! ||| ! doesn ’ t ||| 1 0.847666 0.00215983 9.45753e-10 ||| 0-0 ||| 1 463 1 ||| |||\n",
+            "! ||| ! doesn ’ ||| 1 0.847666 0.00215983 2.58544e-06 ||| 0-0 ||| 1 463 1 ||| |||\n",
+            "! ||| ! is as good as dead ||| 1 0.847666 0.00215983 1.11895e-14 ||| 0-0 ||| 1 463 1 ||| |||\n"
+          ]
+        }
+      ],
+      "source": [
+        "extracted = ''\n",
+        "with gzip.open('/mnt/DATA/MSU2024NLP/corpus/train/model/phrase-table.gz','rt') as f:\n",
+        "    for _ in range(1000):\n",
+        "        extracted += next(f)\n",
+        "print('\\n'.join(extracted.split('\\n')[150:200]))"
+      ]
+    },
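+    {
+      "cell_type": "markdown",
+      "id": "phrase-table-scores-sketch",
+      "metadata": {
+        "id": "phrase-table-scores-sketch"
+      },
+      "source": [
+        "A hedged reading of one phrase-table line, assuming the usual Moses layout `source ||| target ||| scores ||| alignment ||| counts`. Under that assumption the four scores are the inverse phrase probability $\\phi(f \\mid e)$, the inverse lexical weight, the direct phrase probability $\\phi(e \\mid f)$, and the direct lexical weight, while the counts are $c(e)$, $c(f)$, and $c(f, e)$ (notation introduced here):\n",
+        "\n",
+        "$$\n",
+        "\\phi(f \\mid e) = \\frac{c(f, e)}{c(e)}, \\qquad \\phi(e \\mid f) = \\frac{c(f, e)}{c(f)}\n",
+        "$$\n",
+        "\n",
+        "For the line `! ||| ! doesn ||| 1 0.847666 0.00215983 0.000176055 ||| 0-0 ||| 1 463 1` this gives $\\phi(f \\mid e) = 1/1 = 1$ and $\\phi(e \\mid f) = 1/463 \\approx 0.00215983$, which agrees with the first and third scores on that line."
+      ]
+    },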
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1ebf1f18-86e1-4273-a07b-6e3acd169bcb",
+      "metadata": {
+        "tags": [],
+        "id": "1ebf1f18-86e1-4273-a07b-6e3acd169bcb",
+        "outputId": "f1c03d00-1e09-4c12-cd76-dad7f5dd7f3a"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "! ! ! ||| ! ! ! ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
+            "! ! ||| ! ! ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857\n",
+            "! ! ||| ! ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
+            "! &amp; quot ; ||| ! ” ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857\n",
+            "! &amp; quot ; – ||| ! ” is ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
+            "! &amp; quot ; – требование , ||| ! ” is a demand ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
+            "! &amp; quot ; – требование ||| ! ” is a demand ||| 0.6 0.2 0.2 0.2 0.2 0.6\n",
+            "! &amp; quot ||| ! ” ||| 0.6 0.2 0.2 0.2 0.2 0.6\n",
+            "! &quot; , ||| ! &quot; ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
+            "! &quot; , а ||| ! &quot; rather ||| 0.6 0.2 0.2 0.6 0.2 0.2\n",
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "extracted = ''\n",
+        "with gzip.open('/mnt/DATA/MSU2024NLP/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz','rt') as f:\n",
+        "    for _ in range(10):\n",
+        "        extracted += next(f)\n",
+        "print(extracted)"
+      ]
+    },
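+    {
+      "cell_type": "markdown",
+      "id": "reordering-smoothing-sketch",
+      "metadata": {
+        "id": "reordering-smoothing-sketch"
+      },
+      "source": [
+        "Each line of the reordering table holds six probabilities: two triples over the orientations monotone, swap, and discontinuous, one triple per direction. The values in this listing are consistent with additive smoothing of the orientation counts. This is inferred from the numbers rather than taken from the training log; $c(o)$ and $\\sigma$ are notation introduced here.\n",
+        "\n",
+        "$$\n",
+        "p(o \\mid f, e) = \\frac{c(o) + \\sigma}{\\sum_{o'} c(o') + 3\\sigma}, \\qquad \\sigma = 0.5\n",
+        "$$\n",
+        "\n",
+        "A single observation gives $(1 + 0.5)/(1 + 1.5) = 0.6$ for the observed orientation and $0.5/2.5 = 0.2$ for the other two. Two observations of the same orientation give $2.5/3.5 \\approx 0.714286$ and $0.5/3.5 \\approx 0.142857$. Each triple sums to 1."
+      ]
+    },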
+    {
+      "cell_type": "markdown",
+      "id": "9d75022a-81f1-4689-be1d-69d8bd12e0ca",
+      "metadata": {
+        "tags": [],
+        "id": "9d75022a-81f1-4689-be1d-69d8bd12e0ca"
+      },
+      "source": [
+        "## Preparing the test and validation sets"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "312a5667-bf56-454f-a309-5f7495ea741d",
+      "metadata": {
+        "id": "312a5667-bf56-454f-a309-5f7495ea741d"
+      },
+      "source": [
+        "### Downloading the data\n",
+        "\n",
+        "Place the downloaded files at `/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.<en/ru>`"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "71a52c16-5829-4dfe-b443-90d41544b711",
+      "metadata": {
+        "tags": [],
+        "id": "71a52c16-5829-4dfe-b443-90d41544b711"
+      },
+      "outputs": [],
+      "source": [
+        "# !wget https://object.pouta.csc.fi/OPUS-Books/v1/moses/en-ru.txt.zip\n",
+        "# !unzip en-ru.txt.zip"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "ec419640-a2aa-4c75-b2d4-2df0c86b9edc",
+      "metadata": {
+        "id": "ec419640-a2aa-4c75-b2d4-2df0c86b9edc"
+      },
+      "source": [
+        "### Prepare the test and validation data the same way as the training data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "aa484e56-e66f-4a21-bec1-c53efe60464d",
+      "metadata": {
+        "tags": [],
+        "id": "aa484e56-e66f-4a21-bec1-c53efe60464d",
+        "outputId": "a03a5bf2-8df0-4ec0-c295-c76aac7656f1"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Tokenizer Version 1.1\n",
+            "Language: en\n",
+            "Number of threads: 1\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/tokenizer/tokenizer.perl \\\n",
+        "        -l en \\\n",
+        "        < /corpus/testing/Books.en-ru.en \\\n",
+        "        > /corpus/testing/Books.en-ru.tok.en\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "ccad464c-d48f-412c-8c0d-fb2c3cb2b198",
+      "metadata": {
+        "tags": [],
+        "id": "ccad464c-d48f-412c-8c0d-fb2c3cb2b198",
+        "outputId": "15e5d66c-7a24-4bb8-8870-b2f2dcfbe939"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Tokenizer Version 1.1\n",
+            "Language: ru\n",
+            "Number of threads: 1\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/tokenizer/tokenizer.perl \\\n",
+        "        -l ru \\\n",
+        "        < /corpus/testing/Books.en-ru.ru \\\n",
+        "        > /corpus/testing/Books.en-ru.tok.ru\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "5ab3e995-4cfc-4398-b40d-dc7e96fdb238",
+      "metadata": {
+        "tags": [],
+        "id": "5ab3e995-4cfc-4398-b40d-dc7e96fdb238"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/recaser/truecase.perl \\\n",
+        "        --model /corpus/truecase-model.en \\\n",
+        "        < /corpus/testing/Books.en-ru.tok.en \\\n",
+        "        > /corpus/testing/Books.en-ru.true.en\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "558cb6ab-f0f0-43e6-a9eb-822aa9bcd35d",
+      "metadata": {
+        "tags": [],
+        "id": "558cb6ab-f0f0-43e6-a9eb-822aa9bcd35d"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/recaser/truecase.perl \\\n",
+        "        --model /corpus/truecase-model.ru \\\n",
+        "        < /corpus/testing/Books.en-ru.tok.ru \\\n",
+        "        > /corpus/testing/Books.en-ru.true.ru\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f421a934-abc1-44ab-a0a1-445671b5ed5b",
+      "metadata": {
+        "tags": [],
+        "id": "f421a934-abc1-44ab-a0a1-445671b5ed5b",
+        "outputId": "9e7da30f-b3f2-4724-8314-9393ff1dc655"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "17496 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "c6e4e8ae-66f4-4499-9924-129f4ab2371e",
+      "metadata": {
+        "tags": [],
+        "id": "c6e4e8ae-66f4-4499-9924-129f4ab2371e",
+        "outputId": "caaac3c6-dc47-49f5-8d06-6c2309e642db"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "17496 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru\n"
+          ]
+        }
+      ],
+      "source": [
+        "!wc -l /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "3f1be11d-03c6-45dd-93be-17ef1b68b6a0",
+      "metadata": {
+        "id": "3f1be11d-03c6-45dd-93be-17ef1b68b6a0"
+      },
+      "source": [
+        "### Split into a validation part and a test part (only a small amount of data is needed)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "49f97a30-6564-42a1-8d3d-fb10f458c3da",
+      "metadata": {
+        "tags": [],
+        "id": "49f97a30-6564-42a1-8d3d-fb10f458c3da"
+      },
+      "outputs": [],
+      "source": [
+        "!head -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.val.true.en\n",
+        "!head -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.val.true.ru\n",
+        "\n",
+        "!tail -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.en > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.en\n",
+        "!tail -n 1000 /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.true.ru > /mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.ru"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "95c90be5-b2e7-4d8e-a82c-7bd3dc25ed3a",
+      "metadata": {
+        "tags": [],
+        "id": "95c90be5-b2e7-4d8e-a82c-7bd3dc25ed3a"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/bin/moses \\\n",
+        "        -f /corpus/train/model/moses.ini \\\n",
+        "        -threads 4 -s 40 -dl 8 \\\n",
+        "        < /corpus/testing/Books.en-ru.test.true.ru \\\n",
+        "        > /corpus/testing/Books.en-ru.test.true.en.TRANSLATED\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f45ef53f-84e8-428f-b13f-c1dceb427c09",
+      "metadata": {
+        "tags": [],
+        "id": "f45ef53f-84e8-428f-b13f-c1dceb427c09",
+        "outputId": "75a36232-e36b-4a6b-a920-e8efe92cdbc2"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "отчего же не потушить свечу, когда смотреть больше не на что, когда гадко смотреть на все это?\n",
+            "why is not calm свечу when look more than that, when гадко look at all this?\n",
+            "\n",
+            "но как?\n",
+            "but how?\n",
+            "\n",
+            "зачем этот кондуктор пробежал по жердочке, зачем они кричат, эти молодые люди в том вагоне?\n",
+            "why this кондуктор пробежал on жердочке, why should they кричат, these young people is вагоне?\n",
+            "\n",
+            "зачем они говорят, зачем они смеются?\n",
+            "why should they say, why should they sneer?\n",
+            "\n",
+            "все неправда, все ложь, все обман, все зло!.. \"\n",
+            "it is untrue, all lies, all fraud, all evil!. \"\n",
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Create the detokenizers once instead of on every loop iteration\n",
+        "detok_ru, detok_en = MosesDetokenizer('ru'), MosesDetokenizer('en')\n",
+        "with open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.ru', 'r') as src, \\\n",
+        "     open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.en.TRANSLATED', 'r') as hyp:\n",
+        "    for _ in range(5):\n",
+        "        # Source sentence, its translation, then a blank line\n",
+        "        print(detok_ru.detokenize(next(src).split()))\n",
+        "        print(detok_en.detokenize(next(hyp).split()) + '\\n')"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "d703529f-f91c-425e-8d51-bfc2a9a45bec",
+      "metadata": {
+        "tags": [],
+        "id": "d703529f-f91c-425e-8d51-bfc2a9a45bec",
+        "outputId": "a59f7cab-a796-4f2a-c7fe-be4480aba28a"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "It is not advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n",
+            "BLEU = 7.83, 47.7/14.5/5.5/2.1 (BP=0.831, ratio=0.844, hyp_len=20892, ref_len=24754)\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/generic/multi-bleu.perl \\\n",
+        "        -lc /corpus/testing/Books.en-ru.test.true.en \\\n",
+        "        < /corpus/testing/Books.en-ru.test.true.en.TRANSLATED\""
+      ]
+    },
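+    {
+      "cell_type": "markdown",
+      "id": "bleu-worked-example",
+      "metadata": {
+        "id": "bleu-worked-example"
+      },
+      "source": [
+        "The BLEU line printed above can be reproduced by hand from its own components. As a sketch (the symbols $p_n$, $c$ and $r$ are our notation, not part of the script output), corpus-level BLEU is the brevity penalty times the geometric mean of the four n-gram precisions:\n",
+        "\n",
+        "$$\\mathrm{BLEU} = \\mathrm{BP} \\cdot \\exp\\Big(\\tfrac{1}{4}\\sum_{n=1}^{4} \\ln p_n\\Big), \\qquad \\mathrm{BP} = \\min\\big(1,\\; e^{1 - r/c}\\big)$$\n",
+        "\n",
+        "With $c = 20892$ (`hyp_len`) and $r = 24754$ (`ref_len`):\n",
+        "\n",
+        "$$\\mathrm{BP} = e^{1 - 24754/20892} \\approx 0.831, \\qquad \\sqrt[4]{47.7 \\cdot 14.5 \\cdot 5.5 \\cdot 2.1} \\approx 9.45, \\qquad 0.831 \\cdot 9.45 \\approx 7.86$$\n",
+        "\n",
+        "This agrees with the reported 7.83 up to the rounding of the printed precisions. The hypothesis is about 16% shorter than the reference, so the brevity penalty alone removes roughly 17% of the score."
+      ]
+    },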
+    {
+      "cell_type": "markdown",
+      "id": "2d8a4420-7a8a-4a27-a983-1b0819d8b588",
+      "metadata": {
+        "id": "2d8a4420-7a8a-4a27-a983-1b0819d8b588"
+      },
+      "source": [
+        "## Tuning hyperparameters on the validation set"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "228890f8-d0ce-4728-8f23-e8453afd03fe",
+      "metadata": {
+        "tags": [],
+        "id": "228890f8-d0ce-4728-8f23-e8453afd03fe"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/training/mert-moses.pl \\\n",
+        "        --working-dir /corpus/train/mert-dir/ \\\n",
+        "        --threads 4 \\\n",
+        "        /corpus/testing/Books.en-ru.val.true.ru \\\n",
+        "        /corpus/testing/Books.en-ru.val.true.en \\\n",
+        "        /opt/moses/bin/moses /corpus/train/model/moses.ini \\\n",
+        "        --mertdir /opt/moses/bin/ &> /corpus/mert.out\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3ad8d545-5921-4fb6-919b-f9ce1fde898b",
+      "metadata": {
+        "id": "3ad8d545-5921-4fb6-919b-f9ce1fde898b",
+        "outputId": "5b3923ef-b40e-416f-b672-128f7a7927d4"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "# MERT optimized configuration\n",
+            "# decoder /opt/moses/bin/moses\n",
+            "# BLEU 0.0654292 on dev /corpus/testing/Books.en-ru.val.true.ru\n",
+            "# We were before running iteration 26\n",
+            "# finished Thu Sep 26 23:19:46 UTC 2024\n",
+            "### MOSES CONFIG FILE ###\n",
+            "#########################\n",
+            "\n",
+            "# input factors\n",
+            "[input-factors]\n",
+            "0\n",
+            "\n",
+            "# mapping steps\n",
+            "[mapping]\n",
+            "0 T 0\n",
+            "\n",
+            "[distortion-limit]\n",
+            "6\n",
+            "\n",
+            "# feature functions\n",
+            "[feature]\n",
+            "UnknownWordPenalty\n",
+            "WordPenalty\n",
+            "PhrasePenalty\n",
+            "PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/corpus/train/model/phrase-table.gz input-factor=0 output-factor=0\n",
+            "LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz\n",
+            "Distortion\n",
+            "KENLM name=LM0 factor=0 path=/corpus/news-commentary-v8.ru-en.blm.en order=3\n",
+            "\n",
+            "# dense weights for feature functions\n",
+            "[weight]\n",
+            "\n",
+            "LexicalReordering0= 0.00137749 -0.0615252 0.122087 0.0459385 0.146033 0.162246\n",
+            "Distortion0= -0.0214977\n",
+            "LM0= 0.0788921\n",
+            "WordPenalty0= -0.204385\n",
+            "PhrasePenalty0= 0.0267022\n",
+            "TranslationModel0= 0.00329618 0.0880665 0.0237377 0.0142152\n",
+            "UnknownWordPenalty0= 1\n"
+          ]
+        }
+      ],
+      "source": [
+        "!cat /mnt/DATA/MSU2024NLP/corpus/train/mert-dir/moses.ini"
+      ]
+    },
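+    {
+      "cell_type": "markdown",
+      "id": "mert-loglinear-note",
+      "metadata": {
+        "id": "mert-loglinear-note"
+      },
+      "source": [
+        "The `[weight]` section printed above holds the coefficients of the decoder's log-linear model. As a simplified sketch in our own notation ($h_i$ for the feature functions, $\\lambda_i$ for their weights, $f$ for the source sentence), the decoder picks the translation\n",
+        "\n",
+        "$$\\hat e = \\arg\\max_{e} \\sum_{i} \\lambda_i\\, h_i(e, f)$$\n",
+        "\n",
+        "and MERT searches for the weights that maximize BLEU of the decoder output on the tuning set, where $f_s$ and $r_s$ are the source and reference sentences:\n",
+        "\n",
+        "$$\\hat\\lambda = \\arg\\max_{\\lambda} \\mathrm{BLEU}\\big(\\{\\hat e_\\lambda(f_s)\\}_{s},\\; \\{r_s\\}_{s}\\big)$$\n",
+        "\n",
+        "In this run there are 15 dense weights: 6 for lexical reordering, 4 for the translation model, and one each for distortion, the language model, the word penalty, the phrase penalty and the unknown-word penalty."
+      ]
+    },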
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "81aad600-c036-47e6-aebf-78f755314525",
+      "metadata": {
+        "tags": [],
+        "id": "81aad600-c036-47e6-aebf-78f755314525"
+      },
+      "outputs": [],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/bin/moses \\\n",
+        "        -f /corpus/train/mert-dir/moses.ini \\\n",
+        "        -threads 4 -s 40 -dl 8 \\\n",
+        "        < /corpus/testing/Books.en-ru.test.true.ru \\\n",
+        "        > /corpus/testing/Books.en-ru.test.true.en.TRANSLATED.MERT\""
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "bbdcc0d9-4160-47e3-bc01-4d55de3ec0fb",
+      "metadata": {
+        "tags": [],
+        "id": "bbdcc0d9-4160-47e3-bc01-4d55de3ec0fb",
+        "outputId": "9663cc99-1408-4508-a144-633db080355a"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "отчего же не потушить свечу, когда смотреть больше не на что, когда гадко смотреть на все это?\n",
+            "why is no longer on свечу calm that, when you look at the same time, when гадко look at all this?\n",
+            "\n",
+            "но как?\n",
+            "but how?\n",
+            "\n",
+            "зачем этот кондуктор пробежал по жердочке, зачем они кричат, эти молодые люди в том вагоне?\n",
+            "why this кондуктор кричат пробежал жердочке on these young people, why they вагоне in?\n",
+            "\n",
+            "зачем они говорят, зачем они смеются?\n",
+            "why do they say, why should they sneer?\n",
+            "\n",
+            "все неправда, все ложь, все обман, все зло!.. \"\n",
+            "still, it is untrue, all of the lie, all of deception. \"all the many evils!\n",
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Create the detokenizers once instead of on every loop iteration\n",
+        "detok_ru, detok_en = MosesDetokenizer('ru'), MosesDetokenizer('en')\n",
+        "with open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.ru', 'r') as src, \\\n",
+        "     open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.en.TRANSLATED.MERT', 'r') as hyp:\n",
+        "    for _ in range(5):\n",
+        "        # Source sentence, its translation, then a blank line\n",
+        "        print(detok_ru.detokenize(next(src).split()))\n",
+        "        print(detok_en.detokenize(next(hyp).split()) + '\\n')"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "491ebe3d-f38f-4d75-8426-590ef6927060",
+      "metadata": {
+        "tags": [],
+        "id": "491ebe3d-f38f-4d75-8426-590ef6927060",
+        "outputId": "6e173281-964c-4725-9fad-e4de1ed47047"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "It is not advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n",
+            "BLEU = 6.87, 41.6/11.6/3.7/1.3 (BP=1.000, ratio=1.065, hyp_len=26361, ref_len=24754)\n"
+          ]
+        }
+      ],
+      "source": [
+        "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
+        "        /opt/moses/scripts/generic/multi-bleu.perl \\\n",
+        "        -lc /corpus/testing/Books.en-ru.test.true.en \\\n",
+        "        < /corpus/testing/Books.en-ru.test.true.en.TRANSLATED.MERT\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Practical applications of simple models"
+      ],
+      "metadata": {
+        "id": "FtyO_b2sxzta"
+      },
+      "id": "FtyO_b2sxzta"
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Characteristics of neural and statistical machine translation in practice.\n",
+        "\n",
+        "\n",
+        "|                  | Neural translation | Statistical translation |\n",
+        "|------------------|:------------------:|:-----------------------:|\n",
+        "| Meaning          |         +          |            +            |\n",
+        "| Coherence        |         +          |           +-            |\n",
+        "| Speed            |         +-         |            +            |\n",
+        "| Interpretability |         +-         |            +            |\n",
+        "| Understanding    |         -          |            +            |"
+      ],
+      "metadata": {
+        "id": "yP9wQy7nx6iR"
+      },
+      "id": "yP9wQy7nx6iR"
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Using \"linear\" language models also has its pros and cons.\n",
+        "\n",
+        "They are often used for tasks such as:\n",
+        "1. Approximating an LLM with KenLM models (this yields a kind of linear approximation).\n",
+        "  - Separating machine-generated text from human-written text.\n",
+        "2. Building a model that describes the data."
+      ],
+      "metadata": {
+        "id": "I6lxvyTg13Qb"
+      },
+      "id": "I6lxvyTg13Qb"
     }
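+    ,
+    {
+      "cell_type": "markdown",
+      "id": "kenlm-perplexity-note",
+      "metadata": {
+        "id": "kenlm-perplexity-note"
+      },
+      "source": [
+        "For the task of separating machine-generated text from human-written text, one possible setup (an assumption for illustration, not something this notebook implements) scores each text by its perplexity under an n-gram model trained on LLM output. For a sentence of $N$ words, the `perplexity` method of the KenLM Python module is computed from the total $\\log_{10}$ probability, with the end-of-sentence token counted as one extra word:\n",
+        "\n",
+        "$$\\mathrm{PPL}(w_1^N) = 10^{-\\frac{1}{N+1}\\sum_{i=1}^{N+1}\\log_{10} p(w_i \\mid w_{i-n+1}^{i-1})}, \\qquad w_{N+1} = \\langle/\\mathrm{s}\\rangle$$\n",
+        "\n",
+        "A text would then be labeled machine-generated when $\\mathrm{PPL} < \\tau$, where the threshold $\\tau$ is a hypothetical parameter that would have to be chosen on held-out data."
+      ]
+    }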
-   ],
-   "source": [
-    "with open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.ru', 'r') as fin:\n",
-    "    with open('/mnt/DATA/MSU2024NLP/corpus/testing/Books.en-ru.test.true.en.TRANSLATED.MERT', 'r') as fout:\n",
-    "        for _ in range(5):\n",
-    "            text = MosesDetokenizer('ru').detokenize(next(fin).strip().split())  + '\\n' \\\n",
-    "                 + MosesDetokenizer('en').detokenize(next(fout).strip().split()) + '\\n'\n",
-    "            print(text)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "491ebe3d-f38f-4d75-8426-590ef6927060",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "It is not advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n",
-      "BLEU = 6.87, 41.6/11.6/3.7/1.3 (BP=1.000, ratio=1.065, hyp_len=26361, ref_len=24754)\n"
-     ]
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.10"
+    },
+    "colab": {
+      "provenance": []
     }
-   ],
-   "source": [
-    "!docker run -v /mnt/DATA/MSU2024NLP/corpus:/corpus haukurp/moses-smt:1.1.0 /bin/bash -c \"\\\n",
-    "        /opt/moses/scripts/generic/multi-bleu.perl \\\n",
-    "        -lc /corpus/testing/Books.en-ru.test.true.en \\\n",
-    "        < /corpus/testing/Books.en-ru.test.true.en.TRANSLATED.MERT\""
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
   },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.10"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
\ No newline at end of file