
Database of over 100k news titles from Poland for analysis and ML purposes.


avrland/polishNewsTitleDatabase


polishNewsTitleDatabase

Polish news titles database for analysis and ML purposes.

  • scraped between 2020 and 2022 (so yes, COVID and the war are included)
  • titles.txt - the titles database as raw text, one title per line
  • collected mostly from the Google News aggregator using the GoogleNews Python library
  • no obvious duplicates
Line 1: Polka straciła 36 tys. zł: napastnik wykiwał zarówno ją, jak i bank
Line 2: Chrome 86 na Androida pozwoli zaplanować pobieranie. Można już testować
Line 3: Poczta Polska i cyfrowa rewolucja. Identyfikacja RFID przyspieszy wysyłki
Line 4: GOG GALAXY 2.0 łączy siły z Epic Games Store. Jest wreszcie oficjalna integracja
...
File size: 8.677 MB
Number of titles: 114952
Total number of words: 1213788
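
The figures above can be recomputed from titles.txt with a few lines of Python. This is only a sketch, not the script used for this repository: it assumes the file is UTF-8 encoded, that a word is any whitespace-separated token, and that 1 MB means 10**6 bytes. The function names are made up for illustration.

```python
from pathlib import Path


def summarize_titles(lines):
    """Count non-empty titles and whitespace-separated words."""
    titles = [line.strip() for line in lines if line.strip()]
    word_count = sum(len(title.split()) for title in titles)
    return {"titles": len(titles), "words": word_count}


def summarize_file(path="titles.txt"):
    """Summarize a titles file (assumed UTF-8, one title per line)."""
    file_path = Path(path)
    with file_path.open(encoding="utf-8") as handle:
        stats = summarize_titles(handle)
    # Assumption: megabytes are decimal (10**6 bytes), not mebibytes.
    stats["size_mb"] = round(file_path.stat().st_size / 1_000_000, 3)
    return stats
```

Calling `summarize_file("titles.txt")` from the repository root should return a dictionary with the three values. Small differences from the numbers above are possible if words were counted differently.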

Some stats

News title length (characters) - boxplot charts


Outliers (1315) removed via IQR method.

News title length (words) - boxplot charts


Outliers (1691) removed via IQR method.

Characters vs words


Outliers removed via IQR method.
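
For readers who want to reproduce the outlier filtering, here is a minimal sketch of the IQR method applied to title lengths. The README does not state the multiplier or the quartile definition that was used, so the conventional 1.5 factor and the "inclusive" quartile method are assumptions, and the function names are hypothetical. The exact outlier counts may therefore differ from the ones reported above.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def split_outliers(titles, measure=len, k=1.5):
    """Split titles into (kept, removed) by an IQR rule on measure(title)."""
    sizes = [measure(title) for title in titles]
    low, high = iqr_bounds(sizes, k)
    kept = [t for t, s in zip(titles, sizes) if low <= s <= high]
    removed = [t for t, s in zip(titles, sizes) if not low <= s <= high]
    return kept, removed


def word_count(title):
    """Length of a title in whitespace-separated words."""
    return len(title.split())
```

`split_outliers(titles)` filters by character count, and `split_outliers(titles, measure=word_count)` filters by word count.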

Wordcloud

These are the 30 most frequently used words in the database. I manually removed short words that carry no meaning or context.
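
A word ranking like this one can be approximated with `collections.Counter`. The sketch below is an assumption about the approach, not the original code: it lowercases titles, treats every run of word characters as a word, and skips a small example stopword set. The actual list of manually removed words is not published here, so `EXAMPLE_STOPWORDS` is purely illustrative.

```python
import re
from collections import Counter

# Illustrative only: a few short Polish function words, not the author's list.
EXAMPLE_STOPWORDS = {"i", "w", "na", "z", "do", "o", "się", "nie", "to", "za"}


def top_words(titles, n=30, stopwords=EXAMPLE_STOPWORDS):
    """Return the n most common (word, count) pairs, ignoring stopwords."""
    counts = Counter()
    for title in titles:
        # \w matches Unicode letters, so Polish diacritics are preserved.
        for word in re.findall(r"\w+", title.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)
```

The resulting pairs can be passed to any word cloud tool that accepts word frequencies.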

Tags used to collect the news:

 newsTags = [ "swiat", "koronawirus", "pis", "polska", "sport", "apple", "samsung", "technologia", "COVID-19", "amazon", "wojna", "google", "gospodarka", "chiny", "rozrywka", "nauka"]
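
The scraping script itself is not part of this README, so the following is only a sketch of how a tag list like the one above could drive collection while avoiding obvious duplicates. `fetch_titles` is a hypothetical callable standing in for whatever wraps the GoogleNews library. Treating titles as duplicates when they match after whitespace and case normalization is also an assumption.

```python
def normalize(title):
    """Collapse whitespace and ignore case when comparing titles."""
    return " ".join(title.split()).casefold()


def collect_titles(tags, fetch_titles):
    """Gather titles for every tag, keeping the first copy of each title.

    fetch_titles is a hypothetical function: tag -> iterable of title strings.
    """
    seen = set()
    collected = []
    for tag in tags:
        for title in fetch_titles(tag):
            key = normalize(title)
            if key and key not in seen:
                seen.add(key)
                collected.append(" ".join(title.split()))
    return collected
```

Writing the result to disk with one title per line would give a file in the same shape as titles.txt.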
