
πŸ‘‹ Hello, we are 'μžμ†Œμ„œ μš”μ•½λ‹¨' (the Cover Letter Summary Team).

Team members: 변가은, κΉ€μ§€ν™˜, λ‚¨κ²½ν˜„, μ΄λŒ€ν—Œ, 이성민, 졜민재
Team Notion page: Notion

πŸ”Ž Table of Contents

  • About
  • Main Functions
  • Architecture
  • Setup
  • Engine
    • General_Summarization
    • Keysentence_Extraction
    • Question_Answering
  • Code_Architecture

πŸ“ About

πŸ’‘μ£Όμ œ: 인곡지λŠ₯ ν™œμš© μžκΈ°μ†Œκ°œμ„œ μš”μ•½ νŽ˜μ΄μ§€
#NLP #Text_Summarization #Question_Answering #Flask

  • κ°œμš”

    • 맀 μ±„μš©μ‹œ μ§€μ›μžλŠ” λ™μΌν•œ μ§ˆλ¬Έμ„ μ œν•œλœ κΈ€μž 수 λ‚΄μ—μ„œ 닡변을 ν•˜κ³  μžˆλ‹€.
    • μ‹ μž…μ‚¬μ› μ±„μš©μ€ 이λ ₯μ„œλ³΄λ‹€ μžκΈ°μ†Œκ°œμ„œλ₯Ό μ€‘μš”ν•˜κ²Œ 평가
      • κ²½λ ₯이 μ—†λŠ” μ‹ μž…μ‚¬μ›μ˜ 업무 μˆ˜ν–‰λŠ₯λ ₯은 λΉ„μŠ·ν•˜κΈ° λ•Œλ¬Έμ— μ§€μ›μžμ˜ 잠재λ ₯을 νŒŒμ•…ν•˜κΈ° μœ„ν•œ λ„κ΅¬λ‘œ μžκΈ°μ†Œκ°œμ„œκ°€ 큰 비쀑을 μ°¨μ§€ν•œλ‹€.

    β‡’β€˜μžμ†Œμ„œ μš”μ•½λ‹¨β€™μ€ κΈ°μ—… μž„μ›μ§„μ„ λŒ€μƒμœΌλ‘œ μžκΈ°μ†Œκ°œμ„œμ˜ λ‚΄μš©μ„ μš”μ•½ν•˜μ—¬ 효과적으둜 μ‚¬μš©μžμ—κ²Œ μ „λ‹¬ν•˜κ³ , μƒν˜Έμž‘μš©ν•  수 μžˆλŠ” 인곡지λŠ₯ 기반의 μžκΈ°μ†Œκ°œ μš”μ•½ νŽ˜μ΄μ§€λ₯Ό μ œκ³΅ν•œλ‹€.


  • λžœλ”© νŽ˜μ΄μ§€

λžœλ”©νŽ˜μ΄μ§€@3x height='200'

  • μ‹œμ—° μ˜μƒ
default.mp4

🚦 Main Functions

  • General Summarization: cover letter summarization using an NLP model
  • Keysentence Extraction: keyword-centered key sentence extraction
  • Question Answering: provides answers to questions

πŸ”§ Architecture

  • Frontend: HTML, CSS, JS
  • Engine: PyTorch
    • KoBART, textrankr, BERT
  • Backend: Flask

πŸ’Ύ Setup

  • Install modules
pip install -r requirements.txt 
  • Execute
flask run
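
For reference, a minimal requirements.txt sketch implied by the modules imported in the code below (unpinned; the actual file may pin specific versions):

flask
flask-cors
torch
transformers
simpletransformers
konlpy
textrankr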

⚑ Engine

βœ… General Summarization_KoBART

https://github.com/SKT-AI/KoBART

  • BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text, and the model learns to restore the original text.
  1. BART uses the Transformer's basic encoder-decoder architecture.
  2. Accordingly, input passes through the encoder and then the decoder in turn.
  3. The input data must likewise be prepared separately as encoder input and decoder input.
  4. How the input is fed determines the training/inference procedure for each task (see the sketch after this list).
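
As a concrete illustration of items 3 and 4, here is a minimal sketch of preparing encoder and decoder inputs for the summarization task, assuming the Hugging Face transformers API and the public 'digit82/kobart-summarization' checkpoint that sum_model.py uses later (the example strings are placeholders):

from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('digit82/kobart-summarization')
model = BartForConditionalGeneration.from_pretrained('digit82/kobart-summarization')

document = "Full cover letter text goes here."   # source text for the encoder
summary = "Short reference summary goes here."   # target text for the decoder (training only)

# Encoder input: the source text.
enc = tokenizer(document, return_tensors='pt', truncation=True, max_length=512)

# Decoder side: for training, the target summary's token ids are passed as labels;
# the model shifts the labels right internally to build decoder_input_ids.
labels = tokenizer(summary, return_tensors='pt').input_ids

# One training step's loss for this (document, summary) pair.
loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss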
  • The Korean BART is a Korean encoder-decoder language model trained on more than 40GB of Korean text using the Text Infilling noise function from the original paper; the resulting KoBART-base is publicly released. Beyond Korean Wikipedia, the training data included news, books, Everyone's Corpus v1.0 (dialogue, news, ...), Blue House national petitions, and other diverse corpora.

  • KoBART is the BART model released by Facebook, further pre-trained by SKT on more than 40GB of Korean text.
    BART is a denoising autoencoder (DAE) for pre-training seq2seq models: the text is corrupted with an arbitrary noising function, and the model is trained to reconstruct the original text.
    BART combines the structures of BERT and GPT, so it has both BERT's bidirectional property and GPT's auto-regressive property. As a result, BART is more broadly applicable across many fields than earlier MLM-only models.
    (Fig. 1: BART architecture)

  • BARTλŠ” μ†μƒλœ Textλ₯Ό μž…λ ₯으둜 λ°›μ•„ Bidirectional λͺ¨λΈλ‘œ encodingν•˜κ³  μ •λ‹΅ Text에 λŒ€ν•œ likelihoodλ₯Ό autoregressive λͺ¨λΈλ‘œ decodingν•˜μ—¬ κ³„μ‚°ν•œλ‹€. BARTμ—μ„œλŠ” λ‹€μŒκ³Ό 같은 5가지 noising 기법이 μ‘΄μž¬ν•œλ©°, 이λ₯Ό 톡해 μ†μƒλœ Textλ₯Ό μ–»λŠ”λ‹€. image width=650 height=550
    Fig.2 Noising기법

  • BARTλŠ” μžκΈ°νšŒκ·€ 디코더λ₯Ό κ°–κΈ° λ•Œλ¬Έμ—, abstractive QA와 summarizationκ³Ό 같은 μ‹œν€€μŠ€ μΌλ°˜ν™”(Sequence Generation) νƒœμŠ€ν¬μ— μ§μ ‘μ μœΌλ‘œ νŒŒμΈνŠœλ‹ 될 수 μžˆλ‹€. 이번 ν”„λ‘œμ νŠΈμ—μ„œλŠ” 이λ ₯μ„œ μš”μ•½ κΈ°λŠ₯을 μˆ˜ν–‰ν•˜κΈ° μœ„ν•΄ KoBARTλͺ¨λΈμ— μ±„μš©λ©΄μ ‘ λ°μ΄ν„°λ‘œ νŒŒμΈνŠœλ‹μ„ μ§„ν–‰ν•˜μ˜€λ‹€.(데이터셋: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=realm&dataSetSn=71592)


References
[1] Mike Lewis et al. (2019), "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", ACL
[2] Sudharsan Ravichandiran (2021), "Getting Started with Google BERT" (Korean edition), Hanbit Media

βœ… Keysentence Extraction_textrankr

  • TextRank μ•Œκ³ λ¦¬μ¦˜μ€ 2004λ…„ κ΅¬κΈ€μ—μ„œ λ°œν‘œν•œ PageRank μ•Œκ³ λ¦¬μ¦˜μ„ 기반으둜 ν•œ μ•Œκ³ λ¦¬μ¦˜μ΄λ‹€[1].
    PageRank μ•Œκ³ λ¦¬μ¦˜μ€ μˆ˜μ§‘λœ 인터넷 λ¬Έμ„œ 각각을 κ·Έλž˜ν”„μ˜ λ…Έλ“œ, λ¬Έμ„œ λ‚΄λΆ€μ˜ 링크 정보λ₯Ό κ°„μ„ μœΌλ‘œ κ°€μ •ν•˜μ—¬ λ°©ν–₯성이 μžˆλŠ” κ·Έλž˜ν”„λ₯Ό λ§Œλ“€μ–΄ λ¬Έμ„œμ˜ μ€‘μš”λ„λ₯Ό κ³„μ‚°ν•œλ‹€[2]. 쑰금 더 μ‰½κ²Œ λ§ν•˜μžλ©΄ PageRankλŠ” 각 μ›ΉνŽ˜μ΄μ§€λ§ˆλ‹€ ν•˜μ΄νΌλ§ν¬κ°€ μžˆμ„ λ•Œ μ–Όλ§ˆλ‚˜ 링크λ₯Ό λ°›λŠλƒμ— 따라 μˆœμœ„λ₯Ό λ§€κΈ°λŠ” μ•Œκ³ λ¦¬μ¦˜μ„ λ§ν•œλ‹€. 즉, ν•΄λ‹Ή 링크λ₯Ό 클릭할 ν™•λ₯ λ‘œ κ·Έ μˆœμœ„λ₯Ό λ§€κΈ°λŠ” 것이닀.
  • TextRank μ•Œκ³ λ¦¬μ¦˜μ€ PageRank의 κ°œλ…μ„ μžμ—°μ–΄ μ²˜λ¦¬μ— μ‘μš©ν•œ κ²ƒμœΌλ‘œ λ¬Έμž₯, 단어와 같은 νŠΉμ • λ‹¨μœ„λ“€ κ°„μ˜ μ€‘μš”λ„λ₯Ό κ³„μ‚°ν•˜λŠ” μ•Œκ³ λ¦¬μ¦˜μ΄λ‹€. λ¬Έμ„œ λ‚΄μ˜ 각 λ¬Έμž₯을 κ·Έλž˜ν”„μ˜ 정점(vertex)으둜 κ°€μ •ν•˜λŠ” 경우 μ€‘μš”ν•œ λ¬Έμž₯듀을 선별할 수 있으며, 이λ₯Ό 톡해 λ¬Έμ„œ μš”μ•½μ΄ κ°€λŠ₯ν•˜λ‹€. κ²°κ΅­, TextRankλŠ” μ•žμ„œ PageRankμ—μ„œμ˜ νŽ˜μ΄μ§€ κ°œλ…μ„ λ‹¨μ–΄μ˜ κ°œλ…μœΌλ‘œ λ°”κΎΌ μ•Œκ³ λ¦¬μ¦˜μ΄λ‹€. ν…μŠ€νŠΈλ‘œ 이루어진 κΈ€μ—μ„œ νŠΉμ • 단어가 λ‹€λ₯Έ λ¬Έμž₯κ³Ό μ–Όλ§ˆλ§ŒνΌμ˜ 관계λ₯Ό λ§Ίκ³  μžˆλŠ”μ§€λ₯Ό κ³„μ‚°ν•œλ‹€.
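
For reference, the score update rules make the analogy precise. With damping factor $d$ (typically 0.85), PageRank scores a page $p_i$ and TextRank scores a vertex $V_i$ as:

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

$$WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

Here $M(p_i)$ is the set of pages linking to $p_i$, $L(p_j)$ is the number of outbound links of $p_j$, and $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$ (for sentences, typically a similarity score); $In$ and $Out$ are the incoming and outgoing neighbor sets.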
μŠ€ν¬λ¦°μƒ· 2024-02-20 μ˜€μ „ 11 22 32
  • μœ„ μ΄λ―Έμ§€λŠ” 주어진 글에 λŒ€ν•΄ ν…μŠ€νŠΈ κ°„ 관계λ₯Ό λ‚˜νƒ€λ‚Έ κ·Έλž˜ν”„λ‘œ κ·Έλ¦° μƒ˜ν”Œ 이미지이닀. 각 λ¬Έμž₯μ—μ„œ 단어 κ°„μ˜ 관계λ₯Ό μ„ μœΌλ‘œ μ—°κ²°ν•œ 것이닀.
    κ·Έλ ‡λ‹€λ©΄ 핡심 λ¬Έμž₯을 μΆ”μΆœν•˜κΈ° μœ„ν•΄μ„œλŠ” μ–΄λ–»κ²Œ ν•΄μ•Ό ν• κΉŒ? μ•„λž˜μ˜ 그림처럼 λ¬Έμž₯ κ·Έλž˜ν”„λ₯Ό λ§Œλ“€μ–΄μ•Ό ν•œλ‹€. 각 λ¬Έμž₯이 λ§ˆλ””κ°€ λ˜λŠ” 것이닀.
μŠ€ν¬λ¦°μƒ· 2024-02-20 μ˜€μ „ 11 28 52
정석원 μ™Έ(2017) "TextRank μ•Œκ³ λ¦¬μ¦˜κ³Ό 주의 집쀑 μˆœν™˜ 신경망을 μ΄μš©ν•œ ν•˜μ΄λΈŒλ¦¬λ“œ λ¬Έμ„œ μš”μ•½" λ…Όλ¬Έμ—μ„œλŠ” TextRank μ•Œκ³ λ¦¬μ¦˜μ€ 각 λ¬Έμž₯의 μ€‘μš”λ„λ₯Ό ꡬ할 λ•Œ, λ¬Έμž₯ κ°„ 상관행렬을 μ΄μš©ν•˜μ—¬ κ΅¬ν•˜μ˜€λ‹€. textrank1
  • μž…λ ₯ λ¬Έμ„œμ˜ 각 λ¬Έμž₯듀에 λŒ€ν•΄ ν˜•νƒœμ†Œ 뢄석을 μˆ˜ν–‰ν•˜κ³ , 체언λ₯˜μ™€ μš©μ–Έλ₯˜μ˜ TF-IDFλ₯Ό κ³„μ‚°ν•˜μ—¬ λ¬Έμž₯-단어 행렬을 μƒμ„±ν•œλ‹€. κ·Έ λ’€ μƒμ„±λœ λ¬Έμž₯-단어 ν–‰λ ¬μ˜ μ „μΉ˜ 행렬을 κ΅¬ν•˜μ—¬ μ„œλ‘œ κ³±ν•΄μ£Όλ©΄ λ¬Έμž₯ κ°„μ˜ 상관관계(correlation)을 λ‚˜νƒ€λ‚΄λŠ” 행렬을 얻을 수 μžˆλ‹€. μ΄λ ‡κ²Œ κ΅¬ν•œ λ¬Έμž₯ κ°„ 상관행렬은 λ¬Έμž₯ κ°„μ˜ κ°€μ€‘μΉ˜ κ·Έλž˜ν”„λ‘œ λ‚˜νƒ€λ‚Ό 수 있고, TextRank μ•Œκ³ λ¦¬μ¦˜μ„ 톡해 각 λ¬Έμž₯의 μ€‘μš”λ„λ₯Ό ꡬ할 수 μžˆλ‹€. μ΄λ ‡κ²Œ κ΅¬ν•œ μ€‘μš”λ„ 순으둜 λ¬Έμž₯듀을 μ •λ ¬ν•œ λ’€ μƒμœ„ n개의 λ¬Έμž₯듀을 μž¬λ°°μΉ˜ν•˜λ©΄ μš”μ•½ κ²°κ³Όλ₯Ό 얻을 수 μžˆλ‹€[3].

μ°Έκ³ λ¬Έν—Œ [1] μ΄μƒμ˜ μ™Έ(2023), "TextRank μ•Œκ³ λ¦¬μ¦˜ 및 인곡지λŠ₯을 ν™œμš©ν•œ λΈŒλ ˆμΈμŠ€ν† λ°", JPEE : Journal of practical engineering education = μ‹€μ²œκ³΅ν•™κ΅μœ‘λ…Όλ¬Έμ§€, v.15 no.2, pp.509 - 517
[2] 배원식과 차정원(2010), "TextRank μ•Œκ³ λ¦¬μ¦˜μ„ μ΄μš©ν•œ λ¬Έμ„œ λ²”μ£Όν™”", μ •λ³΄κ³Όν•™νšŒλ…Όλ¬Έμ§€. Journal of KIISE. μ»΄ν“¨νŒ…μ˜ μ‹€μ œ 및 λ ˆν„°, v.16 no.1, pp.110-114
[3] 정석원 μ™Έ(2017), "TextRank μ•Œκ³ λ¦¬μ¦˜κ³Ό μ£Όμ˜μ§‘μ€‘ μˆœν™˜ 신경망을 μ΄μš©ν•œ ν•˜μ΄λΈŒλ¦¬λ“œ λ¬Έμ„œ μš”μ•½", ν•œκ΅­μ–΄μ •λ³΄ν•™νšŒ 2017년도 제29회 ν•œκΈ€λ°ν•œκ΅­μ–΄μ •λ³΄μ²˜λ¦¬ν•™μˆ λŒ€νšŒ, pp.47 - 50

βœ… Question Answering_BERT (Principles)

  • A question answering (QA) model is a system that automatically outputs, in natural language, the answer to a specific query received from the user [1].

  • QA μž‘μ—…μ€ 주어진 질문과 λ¬Έλ§₯을 μ΄ν•΄ν•˜μ—¬ 닡변을 μ œμ‹œν•˜λŠ” 것이 λͺ©ν‘œμΈλ°, 닡변을 μ°ΎλŠ” 방식에 따라 μΆ”μΆœν˜•(extractive), μΆ”μƒν˜•(abstractive) 으둜 λ‚˜λ‰œλ‹€. λ¬Έμž₯ λ‚΄μ—μ„œ μ§ˆλ¬Έμ— ν•΄λ‹Ήν•˜λŠ” 닡변을 μ°Ύμ•„λ‚΄λŠ”μ§€ / 주어진 μ§ˆλ¬Έμ— λŒ€ν•œ 닡변을 직접 μƒμ„±ν•˜λŠ” λ°©μ‹μ˜ 차이이닀. QA μž‘μ—…μ„ μ§„ν–‰ν•˜κΈ° μœ„ν•΄, κ΅¬κΈ€μ—μ„œ κ°œλ°œν•œ NLP 처리 λͺ¨λΈμΈ BERT λͺ¨λΈμ„ ν™œμš©ν•˜μ˜€λ‹€.

  • BERT λͺ¨λΈμ€ 사전 ν›ˆλ ¨ μ–Έμ–΄ λͺ¨λΈλ‘œμ„œ, νŠΉμ • 뢄야에 κ΅­ν•œλœ 기술이 μ•„λ‹ˆλΌ λͺ¨λ“  μžμ—°μ–΄ 처리 λΆ„μ•Όμ—μ„œ 쒋은 μ„±λŠ₯을 λ‚΄λŠ” λ²”μš© Language Model이닀. Transformer architertureμ—μ„œ encoder만 μ‚¬μš©ν•œλ‹€λŠ” νŠΉμ§•μ΄ μžˆλ‹€.κ΅¬μ‘°λŠ” λ‹€μŒκ³Ό κ°™λ‹€.

1. Input
  • The input combines three embeddings: Token Embedding + Segment Embedding + Position Embedding (see the tokenizer sketch below).
    • Token Embedding: uses WordPiece embeddings. Text is embedded at the character level and then merged into sub-words according to frequency of occurrence.
    • Segment Embedding: a sentence-level embedding that groups the tokenized words back into sentences. The separator [SEP] marks the sentence boundary, and the two sentences are designated as one segment pair.
    • Position Embedding: encodes each token's position in the sequence, so the model retains the order of the input tokens.
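
A minimal sketch of what these inputs look like in practice, assuming the Hugging Face transformers tokenizer API; the multilingual checkpoint name is illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# Two "sentences" (here: a question and a context) encoded as one segment pair.
enc = tokenizer("Where did the applicant intern?",
                "The applicant interned at a startup for one year.",
                return_tensors='pt')

print(enc.input_ids)       # WordPiece token ids: [CLS] question [SEP] context [SEP]
print(enc.token_type_ids)  # segment ids: 0 for the first sentence, 1 for the second
# Position embeddings are added inside the model, indexed by token order.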
2. Pre-Training
  • MLM (Masked Language Model): random tokens in the input sentence are masked, and the model is trained to predict the masked tokens.

  • NSP (Next Sentence Prediction): given two sentences, the model is trained to predict whether the second sentence actually follows the first. This teaches the model the relation between sentence pairs, which is needed for fine-tuning on NLI and QA, where the relationship between two sentences must be considered.

3. Fine-tuning

  • BERT uses nearly the same hyperparameters in the fine-tuning stage as in pre-training. For each NLP task, the pre-trained BERT model is transferred and fine-tuned, and its performance is then evaluated.

  • BERT-based QA models largely use two approaches (a sketch of approach 1 follows this list):

  1. Feed the question and the context together as one input and predict the answer.
  2. Feed the question and the context separately, produce an embedding for each, and combine the embeddings in various ways to predict the answer.

  In this project, the question and the reference text are fed as simultaneous input to build an extractive question answering model.
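
A minimal extractive-QA sketch of approach 1, assuming the Hugging Face transformers API; the checkpoint name is illustrative, and the project itself wraps an equivalent fine-tuned model in simpletransformers (see qa_model.py below):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
# Note: a freshly loaded QA head is untrained; the project uses a fine-tuned checkpoint.
model = AutoModelForQuestionAnswering.from_pretrained('bert-base-multilingual-cased')

question = "Where did the applicant intern?"
context = "The applicant interned at a startup for one year."

# Question and context go in together as one segment pair.
enc = tokenizer(question, context, return_tensors='pt', truncation=True)
with torch.no_grad():
    out = model(**enc)

# The QA head scores every token position as a possible answer start/end;
# the predicted answer is the span between the argmax start and end.
start = out.start_logits.argmax()
end = out.end_logits.argmax()
print(tokenizer.decode(enc.input_ids[0][start:end + 1]))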
References
[1] κΆŒμ„Έλ¦° et al. (2022), "A BERT-based Korean QA model for medical question answering", Proceedings of KIIT Conference
[2] Devlin, Jacob, et al. (2018), "BERT: Pre-training of deep bidirectional transformers for language understanding", arXiv preprint arXiv:1810.04805
[3] 솑닀연, μ‘°μƒν˜„, and ꢌ혁철 (2021), "A BERT-based sequence classification model for a Korean grammar QA system", Proceedings of the KIISE Korea Computer Congress, pp.754-756

πŸ“‘ Code Architecture

βœ”οΈ app.py

  • app.py μ—μ„œλŠ” 엔진을 초기 μ„ΈνŒ…ν•˜κ³ , flask μ„œλ²„ 섀정을 톡해 ν”„λ‘ νŠΈμ™€ 톡신을 ν•  수 μžˆλ„λ‘ ν•œλ‹€. μ—”μ§„μ—μ„œ κ°€μž₯ μ€‘μ μœΌλ‘œ μ‹€ν–‰λ˜λŠ” μ½”λ“œ 파일이며, λ°±μ—”λ“œμ—μ„œ μš”μ²­μ΄ λ“€μ–΄μ˜¬ 경우, 그에 λ§žλŠ” μž‘μ—…μ„ μˆ˜ν–‰ν•˜μ—¬ μ²˜λ¦¬ν•œλ‹€.
  • flask_corsλ₯Ό ν™œμš©ν•˜μ—¬ CORS 이슈λ₯Ό ν•΄κ²°ν•œλ‹€.
from flask import Flask, render_template, request, jsonify
from flask_cors import CORS
import torch
from sum_model import summarize_model
from ext import textrank_summarize
from qa_model import get_qa_model

app = Flask(__name__)
cors = CORS(app)
  • The flask app is created, and routes are registered for the following paths:
    • '/': renders the home page
    • '/sum': renders the summary page
    • '/sum/gsummarize': performs general summarization
    • '/sum/key': performs keyword-centered sentence extraction
    • '/sum/qa': returns an answer to a question
#home
@app.route('/')
def home():
    return render_template('home.html')

#summary page
@app.route('/sum')
def index():
    return render_template('index.html')

#general summarization
@app.route('/sum/gsummarize', methods=['POST'])
def gsummarize():
    try:
        data = request.get_json(force=True)
        context = data['context']
        gsum = summarize_model(context)
        response = jsonify({'gsum': gsum})
    except Exception as e:
        response = jsonify({'error': str(e)})
    return response

# keysentence extraction
@app.route('/sum/key', methods=['POST'])
def key():
    try:
        data = request.get_json(force=True)
        context = data['context']
        keytext = textrank_summarize(context, 1)  # number of extracted sentences; may need tuning
        response = jsonify({'keytext': keytext})
    except Exception as e:
        response = jsonify({'error': str(e)})
    return response

#qa
@app.route('/sum/qa', methods=['POST'])
def qa_endpoint():
    try:
        data = request.get_json(force=True)
        context = data['context']
        question = data['question']
        if question == "":
            return jsonify({'error': 'Please enter a question.'})

        # simpletransformers expects a list of {context, qas} dicts.
        to_predict = [{"context": context, "qas": [{"question": question, "id": "0"}]}]
        result = qa_model.predict(to_predict)

        answer = result[0][0]['answer'][0]
        answer = "No suitable answer was found." if answer == '' else answer
        response = jsonify({'answer': answer})
    except Exception as e:
        response = jsonify({'error': str(e)})
    return response
  • μ§€μ •λœ λͺ¨λΈμ˜ κ²½λ‘œλ‘œλΆ€ν„° CUDAλ₯Ό μ‚¬μš©ν•˜μ§€ μ•Šκ³  λͺ¨λΈμ„ λ‘œλ“œν•˜λ©°, ν•΄λ‹Ή λͺ¨λΈμ€ 'qa_model' λ³€μˆ˜μ— μ €μž₯λœλ‹€. λ§ˆμ§€λ§‰μœΌλ‘œ, 포트 '5000'μ—μ„œ flask μ• ν”Œλ¦¬μΌ€μ΄μ…˜μ„ μ‹€ν–‰μ‹œν‚¨λ‹€.
if __name__ == '__main__':
    model_path = 'model/checkpoint-1119-epoch-1'
    qa_model = get_qa_model(model_path, use_cuda=False)

    app.run(host='127.0.0.1', port=5000, debug=True)
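
With the server running, the endpoints can be exercised from the frontend or, for a quick test, with any HTTP client; a hypothetical example using the requests library:

import requests

resp = requests.post('http://127.0.0.1:5000/sum/qa',
                     json={'context': 'The applicant led a two-year data analysis project.',
                           'question': 'What did the applicant lead?'})
print(resp.json())  # {'answer': ...} on success, {'error': ...} on failure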
	

βœ”οΈ ext.py

  • A module that extracts the key sentences centered on keywords.
  • Libraries used:
    • konlpy: a library for Korean natural language processing
    • textrankr: a library that provides TextRank-based text summarization
from typing import List
from konlpy.tag import Okt
from textrankr import TextRank

class OktTokenizer:
    okt: Okt = Okt()

    def __call__(self, text: str) -> List[str]:
        tokens: List[str] = self.okt.phrases(text)
        return tokens

def textrank_summarize(text: str, num_sentences: int, verbose: bool = True) -> str:
    mytokenizer: OktTokenizer = OktTokenizer()
    textrank: TextRank = TextRank(mytokenizer)
    summarized: str = textrank.summarize(text, num_sentences, verbose)
    return summarized
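
A hypothetical usage example (the input text is a placeholder):

text = "I majored in statistics. I led a data analysis club. I interned at a startup."
print(textrank_summarize(text, 1))  # prints the single highest-ranked sentence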

βœ”οΈ qa_model.py

  • μ§ˆλ¬Έμ— λŒ€ν•œ 닡변을 μ œκ³΅ν•˜λŠ” λͺ¨λ“ˆμ΄λ‹€.
  • μ‚¬μš©λœ 라이브러리:
    • simpletransformers: κ°„νŽΈν•œ μ‚¬μš©μ„ μœ„ν•œ 트랜슀포머 λͺ¨λΈ λž˜ν•‘ 라이브러리
from simpletransformers.question_answering import QuestionAnsweringModel

def get_qa_model(model_path, use_cuda=False):
    print('Loading model from', model_path)
    qa_model = QuestionAnsweringModel('bert', model_path, use_cuda=use_cuda)
    return qa_model
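
The returned model is used through simpletransformers' predict API, which takes a list of context/qas dicts (the same format app.py builds); a hypothetical example:

qa_model = get_qa_model('model/checkpoint-1119-epoch-1', use_cuda=False)
to_predict = [{"context": "The applicant interned at a startup for one year.",
               "qas": [{"question": "Where did the applicant intern?", "id": "0"}]}]
answers, probabilities = qa_model.predict(to_predict)
print(answers[0]['answer'][0])  # best candidate answer string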

βœ”οΈ sum_model.py

  • μžκΈ°μ†Œκ°œμ„œλ₯Ό ν¬κ΄„μ μœΌλ‘œ μƒμ„±μš”μ•½ν•˜λŠ” λͺ¨λ“ˆμ΄λ‹€.
  • μ‚¬μš©λœ 라이브러리:
    • transformers: Hugging Face의 트랜슀포머 λͺ¨λΈ 라이브러리
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

def summarize_model(text: str, verbose: bool = True) -> str:
    # Note: the tokenizer and model are loaded on every call.
    tokenizer = PreTrainedTokenizerFast.from_pretrained('digit82/kobart-summarization')
    model = BartForConditionalGeneration.from_pretrained('digit82/kobart-summarization')
    input_ids = tokenizer.encode(text, return_tensors="pt")
    summary_text_ids = model.generate(
        input_ids=input_ids,
        bos_token_id=model.config.bos_token_id,
        eos_token_id=model.config.eos_token_id,
        length_penalty=2.0,
        max_length=102,
        min_length=20,
        num_beams=4,
    )
    summarized_text = tokenizer.decode(summary_text_ids[0], skip_special_tokens=True)
    return summarized_text
