Expand 4cat to other languages (Finnish) #470

Open
wants to merge 49 commits into base: master
49 commits
b3e43de
#5 added processors for wikipedia networks Finnish only
anitabraida Sep 13, 2024
d73c13a
#7 added link to the finnish version of spacy
anitabraida Sep 13, 2024
defbab0
#7 added processor for linguistic extraction for finnish text
anitabraida Sep 13, 2024
d43f790
#7 added processor get_entities for finnish text
anitabraida Sep 13, 2024
0330b1c
#7 added processor get_nouns for finnish text
anitabraida Sep 13, 2024
762b1d2
#6 added neologisms in finnish processor (added wordlist for issue #2…
anitabraida Sep 13, 2024
8afc6d7
#7 updated tokenisation processors to allow finnish wordlist in filte…
anitabraida Sep 13, 2024
34517f9
Update README.md
anitabraida Sep 13, 2024
eef33bc
Update README.md
anitabraida Sep 16, 2024
ac6c836
Update README.md
anitabraida Sep 16, 2024
1336fd9
#8 language identification method added
anitabraida Sep 27, 2024
7864294
Merge branch 'master' of github.com:uh-dcm/4cat_fi
anitabraida Sep 27, 2024
0ed43e1
#8 added option for user to modify the dataset language
anitabraida Sep 27, 2024
319ef30
#10 changed compatibility for linguistic extractor
anitabraida Sep 27, 2024
92e95ef
#10 changed language compatibility
anitabraida Sep 27, 2024
c7028f7
#10 changed compatibilites for presets
anitabraida Sep 27, 2024
ea6066c
#10 changed compatibilites for network processors
anitabraida Sep 27, 2024
2325c44
#2 hatespeech wordlist regex
anitabraida Sep 28, 2024
3dfede3
#1 lexical filter for finnish and #10 changed compatibility
anitabraida Sep 28, 2024
56282cc
#11 reference added
anitabraida Sep 28, 2024
01c106a
added "punkt_tab" to avoid errors
anitabraida Oct 8, 2024
480513a
#8 Errors with previous wrapper. Using fasttext (light) model directl…
anitabraida Oct 11, 2024
f521499
#8 children datasets inherit language from parent dataset
anitabraida Oct 11, 2024
2a22d5b
#1 added lexical filter with lemmatization
anitabraida Oct 11, 2024
c7091b4
#10 changed compatibilities for hatebase
anitabraida Oct 11, 2024
83b37a1
#6 added language
anitabraida Oct 11, 2024
f7a8e8a
#7 added lemmatization for finnish text and fixed compatibilities
anitabraida Oct 11, 2024
9f149d2
#8 changed to more reasonable number
anitabraida Oct 15, 2024
27e31fd
#8
anitabraida Oct 16, 2024
f4716b4
Merge remote-tracking branch 'upstream/master'
anitabraida Oct 17, 2024
801c112
SpaCy added again
anitabraida Oct 18, 2024
e691208
Updated to requirements: removed language detection, combined process…
anitabraida Nov 2, 2024
afe9804
#1 combined filter according to requirements and updated hatespeech w…
anitabraida Nov 4, 2024
c8c2e04
Merge remote-tracking branch 'upstream/master'
anitabraida Nov 4, 2024
2b3f832
Fixed some compatibility issues
anitabraida Nov 11, 2024
6d8267a
fixed descriptions
anitabraida Nov 13, 2024
0ff8c95
Merge remote-tracking branch 'upstream/master'
anitabraida Nov 15, 2024
74edaba
fixed tokenization error
anitabraida Nov 20, 2024
dd1a9ea
Fixed tokenise issues, modified lexical_filter
anitabraida Nov 20, 2024
0cb371c
Merge remote-tracking branch 'upstream/master'
anitabraida Nov 20, 2024
eac9e38
Merge branch 'digitalmethodsinitiative:master' into master
anitabraida Nov 28, 2024
e9e9313
Merge branch 'digitalmethodsinitiative:master' into master
anitabraida Dec 10, 2024
d68f28b
Merge branch 'digitalmethodsinitiative:master' into master
anitabraida Dec 11, 2024
6cff740
Update README.md
anitabraida Dec 11, 2024
87af9d6
Merge branch 'digitalmethodsinitiative:master' into master
anitabraida Dec 11, 2024
6bff0ed
Update README.md
anitabraida Dec 11, 2024
33adcc6
Update README.md
anitabraida Dec 11, 2024
73fced0
Remove language compatibility from top_hatebase.py
anitabraida Dec 12, 2024
7889935
Merge branch 'digitalmethodsinitiative:master' into master
anitabraida Dec 16, 2024
2 changes: 1 addition & 1 deletion backend/lib/processor.py
@@ -232,7 +232,7 @@ def after_process(self):

if not self.dataset.is_finished():
    self.dataset.finish()

self.dataset.remove_staging_areas()

# see if we have anything else lined up to run next
350 changes: 350 additions & 0 deletions common/assets/wordlists/finnish_hatespeech.txt
@@ -0,0 +1,350 @@
\bsuomi\b \bsuomalaisille\b
\blyhytkasvui
\bseksuaali .* \btaipumu
\bbabbe\b
\bryssitel
\bautismi
\bfag\b
gay(?!e)
pilakuv
kansallispu(?!isto)
\bmustalai
muslim
\bsynskada
\bjeesu
\byhteiskunnan\b \belät
\bkuulovamma
\bpois\b \bsuomesta\b
\bromsk
laestadi
\bhor
\bhelluntai
\bliikuntaestei
\brampa\b
\bfrikyrk
syndrooma
\blesta
\bortodoks
\bbimie
\btransihmi
\bdvärg
\bbinai
\b(?<!hanka)lahko
\bmenkää\b \bsinne\b \bmistä\b
\btakaisin\b \bsinne\b \bmistä\b
\bhomppel
\brasism
\bkristit
\bprofet
\btransmie
\bheikkolahjai
(?<!asenne)vammai
\blespo
manne(?!n\b|r|lin\b)
\bsukupuol .* \bilmaisu
\bmutakep
hetero
\bsyntyperä .* \bvuoksi\b
\betni
\bterrorist
\buskonlahko
\brullstol
\bjutku
\bkemali\b
\bsaame
\bsekt\b
\bliikuntaeste
\bmulatti
\bvähemmistö
\bvihariko
\bhindu
\bkuulapä
\bsvedu
\bähl
\bmoske
\bvenäläi .* \bhuor
\bsukupuol .* \bidentiteet
\bpride
\bmulatt
\bdöv
\bääri .* \bislam
\bseparei
\btoisuskoiset\b
\bateist
\bsexuell .* \bläggnin
\bteckenspråk
burka(?!r)
rotu(?!aari)
\bromani
\bimaami
\bbibel
\blesbo
\bmustolainen\b
\bjutsku
\bpolttopul
\btransperso
\bliikuntarajoit
\bOdin
\bkrishna
\blestoi
\badventis
\bcp .* \bvamma
\bmusse\b
\btvångsrörels
rollaattor
\bLold
\bslöja
\bklux\b
\bsomppu
\boireyhtym
\bblatte\b
\bmongo
\bwhite\b
\bmutakuon
\bkoranen
pyörätuol
\bsocialfall
\bänkyt
\bminorite
\bfags\b
rullatuol
\bpainu .* \bkotimaa
\bpatriot
\baivovamma
\butlänning .* \bhat
\bmustanaam
\bmormooni
\bjuutalai
\bbög
\bseta
\bmutapä
\binva
\bfunktionshindra
\bneger
\bfikus\b
\btaliban
\bihonväri
\blakrits
\bkehityshäiriö
\bjesus\b
\bmaitonaama\b
\bal\b \bqaida
\bmutuainen\b
\bapartheid
\barbeit\b \bmacht\b \bfrei
\bvirolai .* \bhuor
\bhuivipä
\bpilotti
\b(?<!sade)kuuro
\bbin\b \bladen
\bsomali
\bjehova
\bnaispari
\bjälkeenjään
\bburkha
\bliikuntavamma
\bliikuntakyvyt
\bkehar
jumala(?!uta\b)
\bvapaa \bajattelij
\brodun\b
\bmaahanmuuttaj
\bkristinusko
\bhurri
\bbi-seksu
\bnekru
\bsikhi
\btransseksu
puhevi(?!est)
invaliid
\btillbaka\b \btill\b \bhemland
\bturbaan
seurakun
\bantisem
\bsnedög
\bkarvakä
\basperger
\barab\b
\bfinnjäv
\bkinkke
\bmottagningscentral
\bfrämlingsfient
sompu(?!järv)
uskovai
\bvinosilm
\bapa\b
\bulkomaalaisvastai
huntu(?!s\b)
\bmenkää\b \bkotimaahan\b
\bhatbrott
\bflykting
\btransvesti
\bmiespari
\bnigger
mykkä(?!puhelu)
\bpelastusarmeija
\bcp .* \bskada
\bvenakko\b
\befterbliven\b
\bprofeet
\brenras
\bsukupuol .* \bvaihdo
lepak(?!komie)
\bhakkors
\bkatoli
\bsukupuol .* \bsuuntautu\b
kääpiö(?!villa|nautser|pin)
\bauschwitz\b
\bhintti
\bfatwa
\bhuonokuulo
\bkoraani
\brukou
\bpakkoliik
\brodullinen\b
\buskonto
\bfasis
\bmonikulttuuri
\bjud
\bvajaaälyi
\bturban
\bsyrjintä\b
\bennakkoluulo
transu
\bharhaop
\bsyjivä
\befterbliv
\bnatsi
\bautism
\brasist
\btattar
\basberger
\bideologi
\bgo\b \bhome\b
\binvandra
\brättipä
\barjalai
syndrom
\bviittomakiel
\bfördom
\bbisexu
\bvajuk
\bzigenar
\bkehitystaso
\bheil\b
jew(?!el|elry)
\bmutiai
\btorakka
\bapina
kinkki(?!nen)
\bpalata\b \bsinne\b \bmistä\b
\bkinuk
\bkuulolait
\bstammar\b
luterilai
\bregistrera .* \bparförh
\bmuukalaisviha
\bvääräuskoi
\bhitler
\bhakaris
\bvajak
\btakaisin\b \bafrikkaan\b
\bsosiaalipumm
\bdysfasia
\but\b \bur\b \bskåpet\b
\bvakaumu
\bmolotov \bkoktail
\bbananplockare\b
\bnegro
lestadio
\bvapaakirk
\bneeker
\bviharyhm
\btransnai
\bryssittel
\bhinttari
\braamat
\bstum\b
\bras\b
\bsvartskalle\b
\bkainalosauv
\bfrälsningsarm
\bvammaseksi\b
\breligion
transsu
\bnahkapä
\bhunnu
\btalfel\b
\bal\b \bgaida
\byöntimo\b
\berityislaps
\bbiseksu
\blessar
\bkryck
\bautist
\bsepari
\bkyynärsauv
\bopaskoir
\bskinnskall
\bvalkonaam
\bmustilai
\bberikare\b
\bnazist\b
\bdiskriminer
\bböne
\bsukupuol .* \bkorjau
\bbuddha
\bfolkdräkt
\bkehitysvamma
\bromaani
\bislam
\bnokikep
\bulkomaalai .* \bviha
\buskonno
\brekisteröi .* \bparisuh
\bpainua\b \bsinne\b \bmistä\b
\bsharia
\bhuononäkö
\bseksuaal .* \bsuuntautu
\bvajaamieli
\bnäkövamma
\bvalkolai
\bpakolai
\bortodox
\bflato\b
\bmolotov \bcocktail
\bpatriootti\b
\ballah
\bracis
\bqueer\b
\brollator
\bhudfärg
\bhomo
\bvastaanottokesku
\bmormon
pilapiirro
\b(?<!kyy)ryssä
\btransvest
\bulos\b \bkaapista\b
\bflata\b
\bblind
\bbapdis
\bosama\b
\bkristendom
troende
sokea
\bsyrjityksi\b
\bmetodis
\btakaisin .* \bkotimaa
\bkristilli
\bromane
\bvajaakuntoi
\brajoittei
\butvecklingsstörd
\bpuhevamma
invalid
\bpuppel
\bfägär
\bneukku\b
skin(?!nari)
\bvammanen\b
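
For reviewers who want to sanity-check individual entries, here is a minimal sketch of how a list in this format might be loaded and applied. It assumes each non-empty line is one regular expression and that matching is case-insensitive; neither assumption is confirmed by this diff, and the helper names `compile_wordlist` and `matching_patterns` are hypothetical, not part of 4CAT. The sample patterns are copied from the file above.

```python
import re

# A few entries copied verbatim from the wordlist, chosen because they
# carry lookaround exclusions that are easy to test.
SAMPLE_PATTERNS = [
    r"\b(?<!sade)kuuro",
    r"rotu(?!aari)",
    r"kääpiö(?!villa|nautser|pin)",
]


def compile_wordlist(lines):
    """Compile every non-empty line as its own regex (assumed format)."""
    compiled = []
    for line in lines:
        pattern = line.strip()
        if not pattern:
            continue
        # Case-insensitive matching is an assumption of this sketch.
        compiled.append(re.compile(pattern, re.IGNORECASE))
    return compiled


def matching_patterns(text, compiled):
    """Return the source patterns that match anywhere in the text."""
    return [p.pattern for p in compiled if p.search(text)]


compiled = compile_wordlist(SAMPLE_PATTERNS)

# "kääpiöpinseri" is excluded by the (?!...pin) lookahead, while
# "koirarotu" still matches the unanchored "rotu" entry.
print(matching_patterns("Kääpiöpinseri on koirarotu", compiled))

# Neither exclusion case matches: "sadekuuro" and "rotuaari".
print(matching_patterns("sadekuuro rotuaarilla", compiled))
```

Running a handful of known false positives and true positives through a helper like this is a cheap way to check that the negative lookaheads and lookbehinds in the list behave as intended before the file is merged.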