From 18c4ea42f487a7e1f675e456ec57fd1df4e62ebb Mon Sep 17 00:00:00 2001
From: Muhammed Enes Ekinci <211805030@stu.adu.edu.tr>
Date: Tue, 27 Aug 2024 21:31:31 +0300
Subject: [PATCH] readme

---
 README.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..c19deca
--- /dev/null
+++ b/README.md
@@ -0,0 +1,39 @@
+# News Generation
+
+## Introduction
+
+This project is part of the Teknofest 2024 Türkçe Doğal Dil İşleme (Turkish Natural Language Processing) competition. The aim of the project is to
+generate a news title and news content from a given image.
+
+## Dataset
+
+The dataset is collected from the [Sabah](https://www.sabah.com.tr/timeline/) news website.
+It consists of news titles, news content, and images. All text is in Turkish.
+
+## Data-Preprocessing
+
+- Sample Data:
+  ![image](https://isbh.tmgrup.com.tr/sbh/2022/03/31/balikesirde-tarihi-binada-baslayan-ve-restorana-sicrayan-yangin-sonduruldu-1648686422986.jpeg)
+
+  ***
+
+  ```python
+  title = "Balıkesir’de tarihi bina yangında küle döndü"  # "Historic building in Balıkesir reduced to ashes in a fire"
+  word_index = {'Balıkesir’de': 9, 'tarihi': 5, 'bina': 3, 'yangında': 7, 'küle': 5, 'döndü': 6}
+  tokens = ["start_token", 9, 5, 3, 7, 5, 6, "end_token"]
+  ```
+
+  | Input                                       | Output    |
+  | ------------------------------------------- | --------- |
+  | Image + start_token                         | 9         |
+  | Image + start_token + 9                     | 5         |
+  | Image + start_token + 9 + 5                 | 3         |
+  | Image + start_token + 9 + 5 + 3             | 7         |
+  | Image + start_token + 9 + 5 + 3 + 7         | 5         |
+  | Image + start_token + 9 + 5 + 3 + 7 + 5     | 6         |
+  | Image + start_token + 9 + 5 + 3 + 7 + 5 + 6 | end_token |
+
+## Model
+
+The model combines a CNN and an LSTM: the image is fed to the encoder (CNN), and the CNN's output is
+fed to the decoder (LSTM) together with the input text.
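
The Input/Output table in the Data-Preprocessing section expands one tokenized title into several training samples: each prefix of the sequence is paired with the token that follows it. A minimal sketch of that expansion is given here. The helper name `make_training_pairs` is hypothetical, and the strings `"start_token"` and `"end_token"` are placeholders standing in for whatever special-token indices the project actually uses.

```python
def make_training_pairs(tokens):
    """Expand one token sequence into (prefix, next_token) pairs.

    Every prefix that starts at the start token becomes an input,
    and the token immediately after that prefix becomes its target.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Token sequence from the sample title; the special tokens are placeholders.
tokens = ["start_token", 9, 5, 3, 7, 5, 6, "end_token"]
pairs = make_training_pairs(tokens)

# Print the pairs in the same layout as the README table.
for prefix, target in pairs:
    print("Image + " + " + ".join(str(t) for t in prefix), "->", target)
```

Each pair would be fed to the decoder together with the same image features, so a title with n words yields n + 1 samples. In practice the prefixes would most likely also need padding to a fixed length before batching, which this sketch leaves out.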