A utility for storing and reading files for Korean LM training 💾

monologg/ko_lm_dataformat

ko_lm_dataformat


  • A utility for storing and loading training data for Korean language models

    • Uses zstandard and ultrajson to speed up data loading and compression
    • Stores document metadata alongside the text
  • The code is based on lm_dataformat, which is used by EleutherAI

    • Fixes some bugs
    • Adds and adapts features for Korean (sentence splitter, text cleaner)

Installation

pip3 install ko_lm_dataformat

Usage

1. Write Data

1.1. Archive

import ko_lm_dataformat as kldf

ar = kldf.Archive("output_dir")
# Alternatively, pass a sentence splitter
ar = kldf.Archive("output_dir", sentence_splitter=kldf.KssV1SentenceSplitter())

1.2. Adding data

  • Metadata can be attached to each document (e.g. title, url)
  • Each call is assumed to receive a single document (if a List[str] is passed instead of a str, it is treated as a list of sentences)
  • If split_sent=True, the document is split into sentences and stored as a List[str]
  • If clean_sent=True, NFC normalization, control character removal, and whitespace cleanup are applied

for doc in doc_lst:
    ar.add_data(
        data=doc,
        meta={
          "source": "kowiki",
          "meta_key_1": ["other metadata", "other random stuff"],  # placeholder values
          "meta_key_2": True
        },
        split_sent=False,
        clean_sent=False,
    )

# remember to commit at the end!
ar.commit()
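
The three cleanup steps listed for clean_sent=True can be illustrated with the standard library alone. The sketch below is not the library's own cleaner: the function name clean_sentence is hypothetical, and the exact rules (which characters count as control characters, how whitespace is collapsed) are assumptions made for illustration.

```python
import re
import unicodedata


def clean_sentence(text: str) -> str:
    """Hypothetical stand-in for the kind of cleanup clean_sent=True describes."""
    # 1. NFC normalization: compose decomposed Hangul jamo into full syllables.
    text = unicodedata.normalize("NFC", text)
    # 2. Drop control characters (Unicode category Cc), but keep whitespace
    #    so that the next step can collapse it instead of gluing words together.
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # 3. Whitespace cleanup: collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Decomposed jamo plus a NUL byte and messy whitespace.
cleaned = clean_sentence("\u1112\u1161\u11ab\x00  \t\uae00\n")
print(cleaned)
```

NFC matters for Korean text in particular because the same syllable can be encoded either as one precomposed code point or as a sequence of jamo; normalizing before tokenization keeps both forms from being treated as different strings.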

2. Read Data

  • With rdr.stream_data(get_meta=True), each item is returned as a (doc, meta) tuple

import ko_lm_dataformat as kldf

rdr = kldf.Reader("output_dir")

for data in rdr.stream_data(get_meta=False):
  print(data)
  # "간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼...."


for data in rdr.stream_data(get_meta=True):
  print(data)
  # ("간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼....", {"source": "kowiki", ...})
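
Because get_meta=True yields (doc, meta) tuples, the metadata can be used to select documents while streaming. The following is a minimal sketch of that idea: filter_by_source is a hypothetical helper, not part of ko_lm_dataformat, and the sample list stands in for the iterator that rdr.stream_data(get_meta=True) would provide.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# A record is a (doc, meta) pair; doc may be a str or a List[str].
Record = Tuple[Any, Dict[str, Any]]


def filter_by_source(records: Iterable[Record], source: str) -> Iterator[Any]:
    """Yield only the documents whose meta["source"] equals `source`."""
    for doc, meta in records:
        if meta.get("source") == source:
            yield doc


# Stand-in data (assumption); with the library installed, this would be
# rdr.stream_data(get_meta=True) instead of a hand-written list.
sample = [
    ("first document", {"source": "kowiki"}),
    (["sentence one.", "sentence two."], {"source": "news"}),
]

kowiki_docs = list(filter_by_source(sample, "kowiki"))
print(kowiki_docs)
```

Writing the helper as a generator keeps memory use flat, since documents are filtered one at a time rather than loaded into a list first.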
