polishNewsTitleDatabase

Polish news titles database for analysis and ML purposes.

scraped between 2020 and 2022 (yeah, COVID and the war are included)
titles.txt - the titles database, one title per line, in raw .txt
collected mostly from the Google News aggregator using the GoogleNews Python library
no obvious duplicates
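
If you want to verify the "no obvious duplicates" claim on your own copy, a minimal sketch could look like the one below. It assumes that "duplicate" means an exact match after trimming surrounding whitespace; the helper name `find_duplicate_titles` is made up for this example and is not part of the repository.

```python
from collections import Counter


def find_duplicate_titles(lines):
    """Return {title: count} for every title that appears more than once."""
    # Assumption: two titles are duplicates when they are identical
    # after stripping leading/trailing whitespace. Blank lines are ignored.
    titles = [line.strip() for line in lines if line.strip()]
    counts = Counter(titles)
    return {title: n for title, n in counts.items() if n > 1}


# Small in-memory demo; the second "Chrome 86" line is a deliberate repeat.
sample = [
    "Chrome 86 na Androida pozwoli zaplanować pobieranie. Można już testować\n",
    "Poczta Polska i cyfrowa rewolucja. Identyfikacja RFID przyspieszy wysyłki\n",
    "Chrome 86 na Androida pozwoli zaplanować pobieranie. Można już testować\n",
]
duplicates = find_duplicate_titles(sample)
```

To run it on the real file, pass an open file handle, for example `find_duplicate_titles(open("titles.txt", encoding="utf-8"))`. UTF-8 is an assumption about the file encoding.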

Line 1: Polka straciła 36 tys. zł: napastnik wykiwał zarówno ją, jak i bank
Line 2: Chrome 86 na Androida pozwoli zaplanować pobieranie. Można już testować
Line 3: Poczta Polska i cyfrowa rewolucja. Identyfikacja RFID przyspieszy wysyłki
Line 4: GOG GALAXY 2.0 łączy siły z Epic Games Store. Jest wreszcie oficjalna integracja
...
File size: 8.677 MB
Number of titles: 114952
Total number of words: 1213788
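
The title and word counts above can be reproduced roughly as follows. This is a sketch, not the code from the stats notebook: it assumes words are separated by whitespace and that empty lines are not counted as titles. `title_stats` is a hypothetical name.

```python
def title_stats(lines):
    """Count non-empty titles and whitespace-separated words."""
    titles = [line.strip() for line in lines if line.strip()]
    # Assumption: a "word" is any run of non-whitespace characters,
    # so "36" and "tys." each count as one word.
    total_words = sum(len(title.split()) for title in titles)
    return {"titles": len(titles), "words": total_words}


stats = title_stats([
    "Poczta Polska i cyfrowa rewolucja.",
    "",
    "Chrome 86 na Androida",
])
```

A different tokenization (for example one that strips punctuation or numbers) would give a different word total, so small deviations from the figures above are to be expected.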

Some stats

News title length (characters) - boxplot charts

Outliers (1315) removed via IQR method.

News title length (words) - boxplot charts

Outliers (1691) removed via IQR method.

Characters vs words

Outliers removed via IQR method.
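
For reference, IQR-based outlier removal as used for the charts above can be sketched like this. It assumes the common 1.5 x IQR fences and linearly interpolated quartiles; the exact quartile method and multiplier used in the notebook are not stated here, so the outlier counts may differ slightly. The function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Return (kept_values, number_of_removed_outliers)."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    return kept, len(values) - len(kept)


# Title lengths in characters and in words for a few titles.
titles = ["Polka straciła 36 tys. zł", "Chrome 86 na Androida", "GOG GALAXY 2.0"]
char_lengths = [len(t) for t in titles]
word_lengths = [len(t.split()) for t in titles]
kept_chars, removed_chars = remove_outliers(char_lengths)
```

The same `remove_outliers` call works for `word_lengths`. For the characters-vs-words chart, one option is to keep only the titles that pass both filters.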

Wordcloud

These are the 30 most frequently used words in the database. I manually removed short words that carry no meaning or context.
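
A word-frequency count like the one behind the word cloud could be approximated as below. The filtering in the repository was done manually, so the `min_length` cutoff and the optional `stop_words` argument are stand-in assumptions, and `top_words` is a hypothetical helper.

```python
from collections import Counter


def top_words(titles, n=30, min_length=4, stop_words=()):
    """Return the n most common words as (word, count) pairs."""
    stop = {word.casefold() for word in stop_words}
    counts = Counter()
    for title in titles:
        for raw in title.split():
            # Trim surrounding punctuation and compare case-insensitively.
            word = raw.strip(".,:;!?\"'()[]-").casefold()
            # Assumption: short words carry little meaning, so skip them.
            if len(word) >= min_length and word not in stop:
                counts[word] += 1
    return counts.most_common(n)


ranking = top_words(
    ["Polska wygrywa mecz", "Polska i bank", "Bank, polska"],
    n=2,
)
```

The resulting pairs can be passed to a plotting or word-cloud library of your choice.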

Tags used to collect the news:

 newsTags = [ "swiat", "koronawirus", "pis", "polska", "sport", "apple", "samsung", "technologia", "COVID-19", "amazon", "wojna", "google", "gospodarka", "chiny", "rozrywka", "nauka"]
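
Collecting titles per tag might look roughly like the sketch below. It is not the notebook's code: `collect_titles` and `fetch_with_googlenews` are hypothetical helpers, and the `GoogleNews(lang=...)`, `search()` and `results()` calls reflect my understanding of the GoogleNews package, so check them against the version you install. The fetcher is passed in as a function so the merging logic can be tried without network access.

```python
def collect_titles(tags, fetch_titles):
    """Fetch titles for every tag and merge them in first-seen order."""
    seen = set()
    merged = []
    for tag in tags:
        for title in fetch_titles(tag):
            title = title.strip()
            # Skip blanks and titles already returned for an earlier tag.
            if title and title not in seen:
                seen.add(title)
                merged.append(title)
    return merged


def fetch_with_googlenews(tag, lang="pl"):
    """Assumed usage of the GoogleNews package; requires network access."""
    from GoogleNews import GoogleNews  # third-party: pip install GoogleNews

    client = GoogleNews(lang=lang)
    client.search(tag)
    # Assumption: each result is a dict that includes a "title" key.
    return [item["title"] for item in client.results()]


# Offline demo with a fake fetcher instead of a real network call.
fake_results = {"sport": ["A", "B"], "nauka": ["B", "C "]}
demo = collect_titles(["sport", "nauka"], lambda tag: fake_results[tag])
```

With network access, `collect_titles(newsTags, fetch_with_googlenews)` would run the real search once for each tag.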

You can check how the data was fetched in the Google Colab notebook (GoogleNews_scrapper_to_textfile.ipynb).
You can also try working with the stats in the Google Colab notebook (Stats_and_visualization.ipynb).

Repository: avrland/polishNewsTitleDatabase