Polish news titles database for analysis and ML purposes.
- scraped between 2020 and 2022 (yes, COVID and the war are included)
- titles.txt - the titles database, one title per line, in raw .txt
- collected mostly from the Google News aggregator using the GoogleNews Python library
- no obvious duplicates
Line 1: Polka straciła 36 tys. zł: napastnik wykiwał zarówno ją, jak i bank
Line 2: Chrome 86 na Androida pozwoli zaplanować pobieranie. Można już testować
Line 3: Poczta Polska i cyfrowa rewolucja. Identyfikacja RFID przyspieszy wysyłki
Line 4: GOG GALAXY 2.0 łączy siły z Epic Games Store. Jest wreszcie oficjalna integracja
...
File size: 8.677 MB
Number of titles: 114952
Total number of words: 1213788
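The numbers above can be recomputed with a few lines of Python. This is a minimal sketch, not the notebook code: the helper names `load_titles` and `basic_stats` are made up for illustration, and counting words by splitting on whitespace is an assumption, so the result may differ slightly from the reported total.

```python
from pathlib import Path


def load_titles(path):
    """Read one title per line and skip blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def basic_stats(titles):
    """Return the number of titles and the whitespace-separated word count."""
    return {
        "titles": len(titles),
        "words": sum(len(title.split()) for title in titles),
    }


# Two titles from the sample above, so the sketch runs without titles.txt.
sample = [
    "Polka straciła 36 tys. zł: napastnik wykiwał zarówno ją, jak i bank",
    "Chrome 86 na Androida pozwoli zaplanować pobieranie. Można już testować",
]
print(basic_stats(sample))
```

For the full file, call `basic_stats(load_titles("titles.txt"))`.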
Outliers (1315) removed via IQR method.
Outliers (1691) removed via IQR method.
Outliers removed via IQR method.
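For reference, here is a minimal sketch of IQR-based outlier removal. It assumes the measured value is the title length in characters and uses the common 1.5 * IQR fences; neither detail is stated above, so treat both as assumptions. The function name `remove_outliers_iqr` is hypothetical.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; return (kept, removed_count)."""
    # Quartiles with linear interpolation over the data range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    return kept, len(values) - len(kept)


# Assumption: the filtered quantity is the title length in characters.
lengths = [10, 11, 12, 13, 14, 15, 16, 100]
kept, removed = remove_outliers_iqr(lengths)
print(f"Outliers ({removed}) removed via IQR method.")
```

To filter the titles themselves, compute `len(title)` for each title, take the `low` and `high` fences the same way, and keep only titles whose length falls between them.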
These are the 30 most used words in the database. I manually removed short words that carry no meaning or context.
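A word ranking like this can be sketched with `collections.Counter`. The short words were removed by hand in the original analysis; the `min_len` cutoff below is only a rough stand-in for that manual step, and the function name `top_words` is hypothetical.

```python
import re
from collections import Counter


def top_words(titles, n=30, min_len=4):
    """Return the n most common lowercase words with at least min_len characters."""
    counts = Counter()
    for title in titles:
        # \w matches Unicode letters, so Polish diacritics stay inside words.
        for word in re.findall(r"\w+", title.lower()):
            if len(word) >= min_len and not word.isdigit():
                counts[word] += 1
    return counts.most_common(n)


demo = ["Polska wygrywa mecz", "Polska i mecz", "Polska na plus"]
print(top_words(demo, n=2))
```

A length cutoff also drops short but meaningful words (for example "pis"), so a hand-curated stop-word list is the closer match to what was done here.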
newsTags = [ "swiat", "koronawirus", "pis", "polska", "sport", "apple", "samsung", "technologia", "COVID-19", "amazon", "wojna", "google", "gospodarka", "chiny", "rozrywka", "nauka"]
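Below is a rough sketch of how such tags could drive collection with the GoogleNews library. It is not the notebook code: it assumes the client exposes `search()`, `results()` and `clear()` and that each result is a dict with a `"title"` key. The helper `collect_titles` and its case-insensitive duplicate check are illustrative assumptions.

```python
def collect_titles(client, tags):
    """Search every tag and return unique titles in first-seen order."""
    seen = set()
    titles = []
    for tag in tags:
        client.clear()  # drop results left over from the previous tag
        client.search(tag)
        for item in client.results():
            title = (item.get("title") or "").strip()
            key = title.casefold()  # case-insensitive duplicate check
            if title and key not in seen:
                seen.add(key)
                titles.append(title)
    return titles


# Usage sketch (needs `pip install GoogleNews` and network access):
# from GoogleNews import GoogleNews
# client = GoogleNews(lang="pl")
# titles = collect_titles(client, ["swiat", "koronawirus", "wojna"])
```

Passing the client in as an argument keeps the function easy to try with a stub object before hitting the network.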
- You can check how the data was fetched in the Google Colab notebook.
- You can also try working with the stats in the Google Colab notebook.