
👋 Hello, we are '자소서 요약단' (the Cover Letter Summary Team).

Team members: 변가은, 김지환, 남경현, 이대헌, 이성민, 최민재
Team Notion page: Notion

🔎 Table of Contents

  • About
  • Main Functions
  • Architecture
  • Setting
  • Engine
    • General_Summarization
    • Keysentence_Extraction
    • Question_Answering
  • Code_Architecture

๐Ÿ“ About

๐Ÿ’ก์ฃผ์ œ: ์ธ๊ณต์ง€๋Šฅ ํ™œ์šฉ ์ž๊ธฐ์†Œ๊ฐœ์„œ ์š”์•ฝ ํŽ˜์ด์ง€
#NLP #Text_Summarization #Question_Answering #Flask

  • ๊ฐœ์š”

    • ๋งค ์ฑ„์šฉ์‹œ ์ง€์›์ž๋Š” ๋™์ผํ•œ ์งˆ๋ฌธ์„ ์ œํ•œ๋œ ๊ธ€์ž ์ˆ˜ ๋‚ด์—์„œ ๋‹ต๋ณ€์„ ํ•˜๊ณ  ์žˆ๋‹ค.
    • ์‹ ์ž…์‚ฌ์› ์ฑ„์šฉ์€ ์ด๋ ฅ์„œ๋ณด๋‹ค ์ž๊ธฐ์†Œ๊ฐœ์„œ๋ฅผ ์ค‘์š”ํ•˜๊ฒŒ ํ‰๊ฐ€
      • ๊ฒฝ๋ ฅ์ด ์—†๋Š” ์‹ ์ž…์‚ฌ์›์˜ ์—…๋ฌด ์ˆ˜ํ–‰๋Šฅ๋ ฅ์€ ๋น„์Šทํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ง€์›์ž์˜ ์ž ์žฌ๋ ฅ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•œ ๋„๊ตฌ๋กœ ์ž๊ธฐ์†Œ๊ฐœ์„œ๊ฐ€ ํฐ ๋น„์ค‘์„ ์ฐจ์ง€ํ•œ๋‹ค.

    โ‡’โ€˜์ž์†Œ์„œ ์š”์•ฝ๋‹จโ€™์€ ๊ธฐ์—… ์ž„์›์ง„์„ ๋Œ€์ƒ์œผ๋กœ ์ž๊ธฐ์†Œ๊ฐœ์„œ์˜ ๋‚ด์šฉ์„ ์š”์•ฝํ•˜์—ฌ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉ์ž์—๊ฒŒ ์ „๋‹ฌํ•˜๊ณ , ์ƒํ˜ธ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ธ๊ณต์ง€๋Šฅ ๊ธฐ๋ฐ˜์˜ ์ž๊ธฐ์†Œ๊ฐœ ์š”์•ฝ ํŽ˜์ด์ง€๋ฅผ ์ œ๊ณตํ•œ๋‹ค.


  • Landing page

(landing page screenshot)

  • Demo video
default.mp4

🚦 Main Functions

  • General Summarization: cover letter summarization using an NLP model
  • Keysentence Extraction: keyword-centered key sentence extraction
  • Question Answering: answers questions about the cover letter

🔧 Architecture

  • Frontend: HTML, CSS, JS
  • Engine: PyTorch
    • KoBART, textrankr, BERT
  • Backend: Flask

💾 Setting

  • Install modules
pip install -r requirements.txt
  • Execute
flask run

⚡ Engine

✅ General Summarization_kobart

https://github.com/SKT-AI/KoBART

  • BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text, and the model learns to restore it to the original text.
  1. BART uses the basic Transformer encoder-decoder architecture.
  2. Accordingly, the input passes through the encoder and then the decoder.
  3. The input data must therefore be prepared as separate encoder and decoder inputs.
  4. How the inputs are arranged determines how training and inference work for each task.
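The separate encoder/decoder inputs mentioned above can be sketched for the summarization case as follows. This is a minimal illustration with toy token IDs, not the actual KoBART tokenizer: the encoder receives the source tokens, while the decoder input is the target shifted right by one position (teacher forcing).

```python
# Toy sketch of seq2seq input preparation for a BART-style model.
BOS, EOS = 0, 1  # hypothetical special-token IDs

def prepare_inputs(source_ids, target_ids):
    encoder_input = list(source_ids)          # source text for the encoder
    decoder_input = [BOS] + list(target_ids)  # shifted right: starts with BOS
    labels = list(target_ids) + [EOS]         # at step t, predict labels[t]
    return encoder_input, decoder_input, labels

enc, dec, lab = prepare_inputs([5, 6, 7, 8], [5, 8])
# enc = [5, 6, 7, 8], dec = [0, 5, 8], lab = [5, 8, 1]
```

At each decoding step the model sees the previous target tokens (via `decoder_input`) and is trained to predict the next one (via `labels`).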
  • The Korean BART (KoBART) is a Korean encoder-decoder language model trained on more than 40 GB of Korean text using the Text Infilling noise function from the paper.
    The resulting KoBART-base model is publicly released. Training data included Korean Wikipedia as well as news, books, Modu Corpus v1.0 (dialogue, news, ...), Blue House national petitions, and other sources.

  • KoBART is the BART model released by Facebook, pre-trained by SKT on more than 40 GB of Korean text.
    BART is a denoising autoencoder (DAE) for pre-training seq2seq models: text is corrupted with an arbitrary noising function, and the model is trained to reconstruct the original text.
    BART combines the structures of BERT and GPT, giving it both BERT's bidirectional property and GPT's auto-regressive property. Thanks to this, BART is applicable to a wider range of tasks than earlier MLM-style models.
    Fig. 1: BART architecture

  • BART takes the corrupted text as input, encodes it with a bidirectional model, and computes the likelihood of the original text by decoding with an autoregressive model. BART defines the following five noising techniques, which are used to obtain the corrupted text.
    Fig. 2: Noising techniques
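Of the noising techniques, KoBART's pre-training uses Text Infilling. A minimal single-span sketch of that corruption step (toy tokens; the real scheme samples span lengths from a Poisson distribution):

```python
# Text Infilling sketch: a contiguous span of `length` tokens starting at
# `start` is replaced by a single mask token, so the model must also infer
# how many tokens were removed when reconstructing the original.
def text_infilling(tokens, start, length, mask="<mask>"):
    return tokens[:start] + [mask] + tokens[start + length:]

tokens = ["A", "B", "C", "D", "E"]
text_infilling(tokens, 1, 2)  # ['A', '<mask>', 'D', 'E']
```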

  • BART๋Š” ์ž๊ธฐํšŒ๊ท€ ๋””์ฝ”๋”๋ฅผ ๊ฐ–๊ธฐ ๋•Œ๋ฌธ์—, abstractive QA์™€ summarization๊ณผ ๊ฐ™์€ ์‹œํ€€์Šค ์ผ๋ฐ˜ํ™”(Sequence Generation) ํƒœ์Šคํฌ์— ์ง์ ‘์ ์œผ๋กœ ํŒŒ์ธํŠœ๋‹ ๋  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฒˆ ํ”„๋กœ์ ํŠธ์—์„œ๋Š” ์ด๋ ฅ์„œ ์š”์•ฝ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด KoBART๋ชจ๋ธ์— ์ฑ„์šฉ๋ฉด์ ‘ ๋ฐ์ดํ„ฐ๋กœ ํŒŒ์ธํŠœ๋‹์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.(๋ฐ์ดํ„ฐ์…‹: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=realm&dataSetSn=71592)


References
[1] Mike Lewis et al. (2019), "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", ACL
[2] Sudharsan Ravichandiran (2021), "Getting Started with Google BERT" (Korean edition), Hanbit Media

✅ Keysentence Extraction_Textrankr

  • The TextRank algorithm is based on Google's PageRank algorithm [1].
    The PageRank algorithm treats each collected web document as a node of a graph and the link information inside the documents as edges, building a directed graph with which each document's importance is computed [2]. Put more simply, PageRank ranks each web page by how many hyperlinks point to it; that is, a page's rank corresponds to the probability that its links will be followed.
  • The TextRank algorithm applies the idea of PageRank to natural language processing: it computes the importance of units such as sentences and words. If each sentence in a document is treated as a vertex of the graph, the important sentences can be selected, which enables document summarization. In short, TextRank replaces PageRank's notion of a page with that of a word or sentence, and computes how strongly each word in a text is related to the other sentences.

(sample graph of word relationships)
  • The image above is a sample graph of the relationships between the texts in a given passage; lines connect related words in each sentence.
    How, then, do we extract the key sentences? We build a sentence graph, as in the figure below, where each sentence becomes a node.

(sentence graph screenshot)
In 정석원 et al. (2017), "Hybrid Document Summarization Using the TextRank Algorithm and an Attention-Based Recurrent Neural Network", the importance of each sentence in TextRank is computed from a sentence correlation matrix.
  • Morphological analysis is performed on each sentence of the input document, and the TF-IDF of substantives and predicates is computed to build a sentence-term matrix. Multiplying this matrix by its transpose yields a matrix of correlations between sentences. This sentence correlation matrix can be viewed as a weighted graph over the sentences, and the TextRank algorithm then computes each sentence's importance. Sorting the sentences by importance and rearranging the top n sentences produces the summary [3].
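The pipeline described above can be sketched end to end. This is a simplified illustration, not the textrankr implementation: it uses plain term frequency instead of TF-IDF and whitespace splitting instead of morphological analysis.

```python
# Sketch of the TextRank variant described above: a sentence-term matrix A
# (term frequency here instead of TF-IDF), the sentence correlation matrix
# W = A·Aᵀ with the diagonal zeroed, and PageRank-style power iteration.
def textrank_scores(sentences, d=0.85, iters=30):
    vocab = sorted({w for s in sentences for w in s.split()})
    A = [[s.split().count(w) for w in vocab] for s in sentences]
    n = len(sentences)
    W = [[0 if i == j else sum(A[i][k] * A[j][k] for k in range(len(vocab)))
          for j in range(n)] for i in range(n)]
    out = [sum(row) or 1 for row in W]  # out-degree, guarded against isolated nodes
    scores = [1.0] * n
    for _ in range(iters):
        scores = [(1 - d) + d * sum(W[j][i] * scores[j] / out[j] for j in range(n))
                  for i in range(n)]
    return scores

sents = ["the cat sat on the mat", "the cat ran", "a mat on the floor"]
scores = textrank_scores(sents)
# the first sentence shares the most terms with the others and scores highest
```

Sorting the sentences by these scores and keeping the top n is exactly the summarization step described above.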

References
[1] 이상영 et al. (2023), "Brainstorming Using the TextRank Algorithm and Artificial Intelligence", JPEE: Journal of Practical Engineering Education, v.15 no.2, pp.509-517
[2] 배원식 and 차정원 (2010), "Document Categorization Using the TextRank Algorithm", Journal of KIISE: Computing Practices and Letters, v.16 no.1, pp.110-114
[3] 정석원 et al. (2017), "Hybrid Document Summarization Using the TextRank Algorithm and an Attention-Based Recurrent Neural Network", Proceedings of the 29th Annual Conference on Human and Cognitive Language Technology, pp.47-50

✅ Question Answering_Bert (principles)

  • A Question Answering model is a system that automatically outputs, in natural language, the answer to a specific query received from the user. [1]

  • The goal of a QA task is to understand a given question and context and present an answer. Depending on how the answer is found, QA is divided into extractive and abstractive approaches: the former locates the answer span within the given text, while the latter generates an answer to the question directly. For the QA task, we used BERT, an NLP model developed by Google.

  • BERT is a pre-trained language model: a general-purpose language model that performs well across all areas of NLP rather than being limited to a specific field. A distinctive feature is that it uses only the encoder of the Transformer architecture. Its structure is as follows.

1. Input
  • The input combines three embeddings: Token Embedding + Segment Embedding + Position Embedding.
    • Token Embedding: uses WordPiece embedding; text is embedded at the character level and then split into sub-words according to frequency.
    • Segment Embedding: a sentence-level embedding that groups the tokenized words back into sentences. The separator [SEP] marks the sentence boundary, and the two sentences are combined into a single segment.
    • Position Embedding: encodes each token's position in the sequence, since the encoder itself does not otherwise take token order into account.
2. Pre-Training
  • MLM (Masked Language Model): tokens in the input sentence are randomly masked out, and the model is trained to predict those tokens.

  • NSP (Next Sentence Prediction): given two sentences, the model predicts whether the second actually follows the first. Training on the relationship between two sentences prepares the model for fine-tuning on NLI and QA, where inter-sentence relationships matter.
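The MLM masking step above can be sketched as follows; this is a simplified version of BERT's published rule (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% stay unchanged), not the project's training code.

```python
import random

# Simplified MLM masking: the model is trained to recover the original
# token at every selected position, whatever it was replaced with.
def mlm_mask(tokens, vocab, p=0.15, seed=0):
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok              # prediction target: the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)     # kept as-is, but still predicted
        else:
            corrupted.append(tok)
    return corrupted, targets
```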

3. Fine-tuning

  • In the fine-tuning stage, BERT uses almost the same hyperparameters as in pre-training. For each NLP task, the pre-trained BERT model is transferred, fine-tuned, and its performance is then evaluated.

  • BERT-based QA models broadly use one of two approaches:

  1. Feed the question and the context together as a single input and predict the answer.
  2. Feed the question and the context separately, generate an embedding for each, and combine the embeddings in various ways to predict the answer.

  In this project, the question and the reference text are used as input at the same time to build an extractive question answering model.
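With the first approach, an extractive model scores every token as a possible answer start and end, and the predicted answer is the highest-scoring valid span. A sketch with made-up scores (not real model outputs):

```python
# Extractive answer-span selection: given per-token start and end scores
# from the model head, pick the valid pair (start <= end, bounded length)
# with the highest combined score.
def best_span(start_scores, end_scores, max_len=10):
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["i", "built", "the", "backend", "server"]
start = [0.1, 0.2, 0.3, 2.5, 0.4]  # made-up scores for illustration
end = [0.0, 0.1, 0.2, 0.6, 2.2]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # backend server
```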
References
[1] 권세린 et al. (2022), "A BERT-Based Korean QA Model for Medical Question Answering", Proceedings of KIIT Conference
[2] Devlin, Jacob, et al. (2018), "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805
[3] 송다연, 조상현, and 권혁철 (2021), "A BERT-Based Sequence Classification Model for a Korean Grammar QA System", Proceedings of the Korea Information Science Society Conference, pp.754-756

📑 Code Architecture

✔️ app.py

  • app.py performs the initial setup of the engine and configures the Flask server so that it can communicate with the frontend. It is the main entry point of the engine: when a request arrives at the backend, it carries out the corresponding task.
  • flask_cors is used to resolve CORS issues.
from flask import Flask,render_template,request, jsonify
from flask_cors import CORS
import torch
from sum_model import summarize_model
from ext import textrank_summarize
from qa_model import get_qa_model
	
app = Flask(__name__)
cors = CORS(app)
  • Create the Flask app and set up routing for the following paths:
    • '/': renders the home page
    • '/sum': renders the summary page
    • '/sum/gsummarize': performs general summarization
    • '/sum/key': performs keyword-centered sentence extraction
    • '/sum/qa': answers a question
#home
@app.route('/')	
def home():
	return render_template('home.html')
	
#summary page
@app.route('/sum')	
def index():
	return render_template('index.html')
	
#general summarization
@app.route('/sum/gsummarize', methods=['POST'])
def gsummarize():
	try:
		data = request.get_json(force=True)
		context = data['context']
		gsum = summarize_model(context)
		response = jsonify({'gsum': gsum})
	except Exception as e:
		response = jsonify({'error': str(e)})
	return response


# keysentence extraction
@app.route('/sum/key', methods=['POST'])
def key():
	try:
		data = request.get_json(force=True)
		context = data['context']
		keytext = textrank_summarize(context, 1)  # number of sentences may need tuning
		response = jsonify({'keytext': keytext})
	except Exception as e:
		response = jsonify({'error': str(e)})
	return response
		
#qa
@app.route('/sum/qa', methods=['POST'])
def qa_endpoint():
	try:
		data = request.get_json(force=True)
		context = data['context']
		question = data['question']
		if question == "":
			return jsonify({'error': 'Please enter a question.'})

		to_predict = [{"context": context, "qas": [{"question": question, "id": "0"}]}]
		result = qa_model.predict(to_predict)

		answer = result[0][0]['answer'][0]
		answer = "No suitable answer could be found." if answer == '' else answer
		response = jsonify({'answer': answer})
	except Exception as e:
		response = jsonify({'error': str(e)})
	return response
  • ์ง€์ •๋œ ๋ชจ๋ธ์˜ ๊ฒฝ๋กœ๋กœ๋ถ€ํ„ฐ CUDA๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ๋ชจ๋ธ์„ ๋กœ๋“œํ•˜๋ฉฐ, ํ•ด๋‹น ๋ชจ๋ธ์€ 'qa_model' ๋ณ€์ˆ˜์— ์ €์žฅ๋œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ํฌํŠธ '5000'์—์„œ flask ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์‹คํ–‰์‹œํ‚จ๋‹ค.
if __name__ == '__main__':
	model_path = 'model/checkpoint-1119-epoch-1'
	qa_model = get_qa_model(model_path, use_cuda=False)
	
	app.run(host='127.0.0.1',port=5000,debug=True)
	

โœ”๏ธ ext.py

  • ํ‚ค์›Œ๋“œ ์ค‘์‹ฌ์œผ๋กœ ์ค‘์š”๋ฌธ์žฅ์„ ์ถ”์ถœํ•˜๋Š”๋Š” ๋ชจ๋“ˆ์ด๋‹ค.
  • ์‚ฌ์šฉ๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ:
    konlpy: ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
    textrankr: TextRank ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™œ์šฉํ•œ ํ…์ŠคํŠธ ์š”์•ฝ์„ ์ œ๊ณตํ•˜๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
from typing import List
from konlpy.tag import Okt
from textrankr import TextRank
	
class OktTokenizer:
	okt: Okt = Okt()

	def __call__(self, text: str) -> List[str]:
		tokens: List[str] = self.okt.phrases(text)
		return tokens
	
def textrank_summarize(text: str, num_sentences: int, verbose: bool = True) -> str:
	mytokenizer: OktTokenizer = OktTokenizer()
	textrank: TextRank = TextRank(mytokenizer)
	summarized: str = textrank.summarize(text, num_sentences, verbose)
	return summarized

โœ”๏ธ qa_model.py

  • ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต๋ณ€์„ ์ œ๊ณตํ•˜๋Š” ๋ชจ๋“ˆ์ด๋‹ค.
  • ์‚ฌ์šฉ๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ:
    • simpletransformers: ๊ฐ„ํŽธํ•œ ์‚ฌ์šฉ์„ ์œ„ํ•œ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ ๋ž˜ํ•‘ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
from simpletransformers.question_answering import QuestionAnsweringModel
	
def get_qa_model(model_path, use_cuda=False):
	print('Loading model from', model_path)
	qa_model = QuestionAnsweringModel('bert', model_path, use_cuda=use_cuda)
	return qa_model

โœ”๏ธ sum_model.py

  • ์ž๊ธฐ์†Œ๊ฐœ์„œ๋ฅผ ํฌ๊ด„์ ์œผ๋กœ ์ƒ์„ฑ์š”์•ฝํ•˜๋Š” ๋ชจ๋“ˆ์ด๋‹ค.
  • ์‚ฌ์šฉ๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ:
    • transformers: Hugging Face์˜ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

def summarize_model(text: str, verbose: bool = True) -> str:
	tokenizer = PreTrainedTokenizerFast.from_pretrained('digit82/kobart-summarization')
	model = BartForConditionalGeneration.from_pretrained('digit82/kobart-summarization')
	input_ids = tokenizer.encode(text, return_tensors="pt")
	summary_text_ids = model.generate(
		input_ids=input_ids,
		bos_token_id=model.config.bos_token_id,
		eos_token_id=model.config.eos_token_id,
		length_penalty=2.0,
		max_length=102,
		min_length=20,
		num_beams=4,
	)
	summarized_text = tokenizer.decode(summary_text_ids[0], skip_special_tokens=True)
	return summarized_text