Several non-English words made it into the list #187

Open
lserni opened this issue Jan 6, 2024 · 4 comments

Comments

@lserni

lserni commented Jan 6, 2024

German words ending in -schen: boeschen, goschen, groschen, guldengroschen, hamantaschen, hanschen, kischen, leschen, mariengroschen, menschen, neugkroschen, neugroschen, silbergroschen.

German words ending in -ung: anschauung, aufklarung, delundung, aufklrung, gelandesprung, geldesprung, gelndesprung (the last four are also spelled wrongly), gterdmerung (probably ASCII-filtered from a wrongly-spelled "Götterdämmerung"), kaolikung, lautverschiebung, quellung, quersprung, sturmabteilung, verwanderung, vorstellung.

German-sounding place names that end in -berg: aaberg, amberg, arlberg, baden-wtemberg (should be Baden-Württemberg), bamberg, beberg, bemberg, berg, bloxberg, bromberg, bundaberg, clayberg, cohberg, desberg, drakensberg, dusenberg, egeberg, ehrenberg, eisenberg (probably Heisenberg), faberg, feinberg, fineberg, flamberg, floeberg, freberg, frederiksberg, freudberg, friedberg, fromberg, ginsberg, ginzberg, godesberg, goldberg, goldenberg, gomberg, greenberg, grosberg, gruenberg, grunberg, gutenberg, guttenberg, hamberg, hardenberg, hedberg, heidelberg, heisenberg, hertberg, herzberg, hollenberg, houlberg, ingaberg, ingeberg, inselberg, judenberg, kapfenberg, knigsberg (probably Konigsberg), koenigsberg, konigsberg, kornberg, landenberg, lansberg, lederberg, lemberg, lichtenberg, lindberg, lindeberg, lundberg, marshallberg, memberg, mengelberg, moberg, mollberg, mossberg, msterberg (probably Musterberg), muhlenberg, nberg (probably Nueberg), newberg, nyberg, nieberg, noonberg, nuremberg, oberg, overberg, ramberg, rehnberg, reichenberg, rydberg, romberg, rosenberg, rotberg, rothberg, rothenberg, schberg, schoenberg, schonberg, schulberg, shimberg, shinberg, shirberg, sjoberg, slosberg, solberg, spitzenberg, steinberg, sternberg, stormberg, strasberg, strindberg, stromberg, sundberg, svedberg, taberg, tamberg, tanberg, tannenberg, tuneberg, vandenberg, venusberg, vilberg, vorarlberg, waterberg, wattenberg, weinberg, weisberg, weissberg, westberg, wittenberg, wtemberg, wurttemberg.

Swedish-sounding names ending in -borg: aalborg, bjneborg, carlsborg, friborg, goteborg, gteborg, helsingborg, hsingborg, ingaborg, ingeborg, kreymborg, lindsborg, seaborg, swedeborg, swedenborg, valborg, viborg, vyborg, volborg, wiborg.

German words with -rsch- or -wasser: goldwasser, kirschwasser, beterschap, borsch, borsches, bursch, burschenschaft, burschenschaften, clairschach, clairschacher, dauerschlaf, hersch, herschel, herscher, hirsch, hirschfeld, kirsches, kirschner, kursch, lautverschiebung, moersch.

German words that might be considered controversial: sieg, heil, hitler, mein, fuehrer, fuhrer, gott, mit, uns, SchutzStaffel.

French sounding words containing "aux": aboideaux, aboiteaux, agneaux, auxf, auxier, auxil, auxvasse, bandeaux, bateaux, batteaux, beaux, beaux-arts, beaux-esprits, beauxite (should be bauxite), boyaux, boisseaux, bordereaux, boudreaux, capiteaux, carpeaux, castrop-rauxel, chalumeaux, chapeaux, chateaux, cheneaux, chevaux, chevaux-de-frise, ciseaux, clervaux, clitoridauxe, colauxe, coteaux, couteaux, cryptoglaux, cristineaux, dermatauxe, eaux, enterauxe, esquimaux, fabliaux, faux, fauxbourdon, faux-bourdon, faux-na, flambeaux, fricandeaux, gateaux, glaux, hanotaux, hemiauxin, hepatauxe, jambeaux, jouhaux, kastrop-rauxel, knisteneaux, kristinaux, lascaux, laux, malraux, mantappeaux, manteaux, margaux, margeaux, marivaux, mastauxe, maux, meraux, michaux, myelauxe, morceaux, moureaux, nephrauxe, nouveaux, oophorauxe, paravauxite, pauxi, plateaux, portmanteaux, proces-verbaux, prostatauxe, radeaux, raveaux, reseaux, rinceaux, roncevaux, rondeaux, rouleaux, salteaux, splenauxe, subbureaux, tableaux, thibodaux, tonneaux, torteaux, trichauxis, trousseaux, trumeaux, vassaux, vauxite, veneaux, vitraux, wibaux, bureaux.

Loan words that are probably okay but, strictly, still not English: brehmsstrahlung, weltanschauung, volkerwanderung, ubermensch, borscht, borschts, kirsch, meerschaum, meerschaums, Messerschmitt, Rorschach, bordeaux.

East Asian words: Wa-palaung, Telukbetung, bagong, Ronggeng.

Chinese city name: Tzekung, Kaolikung

Korean name: Kyung, Kyaung, Keung.

Kyrgyz name: Issyk-Kul

Icelandic name: Jokul (should be Jokull)

Not sure if this counts: mallangong (Australian name for the platypus), wobbegong (Australian name of the carpet shark)

@erondpowell

erondpowell commented Jan 6, 2024

@lserni What was your methodology for collecting this?

Funny timing, as I sat down today to write a script to remove every single non-word and non-English word from this list.

In addition to the language stuff you pointed out, I have noticed a number of other non-English words and 'artifacts' of English: guttural sounds, etc.

@lserni
Author

lserni commented Jan 11, 2024

I stumbled across one word due to a misspelling (it might have been 'kischen'). I realized it was German and started looking for that suffix ("-schen"); then one thing led to another (e.g. finding "kirschen" led me to look for "kirsch", and so on).
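
For anyone who wants to repeat that kind of suffix hunt with a script instead of by hand, here is a minimal sketch. The function name `group_by_suffix` and the sample words are made up for illustration and are not part of the repository. It only surfaces candidates: "young" also ends in -ung, so every hit still needs a manual look.

```python
from collections import defaultdict


def group_by_suffix(words, suffixes):
    """Return {suffix: sorted list of words ending with that suffix}.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    groups = defaultdict(list)
    for word in words:
        w = word.strip().lower()
        for suffix in suffixes:
            if w.endswith(suffix):
                groups[suffix].append(w)
    return {suffix: sorted(ws) for suffix, ws in groups.items()}


# Tiny in-memory sample; in practice the words would be read from the list file.
sample = ["groschen", "menschen", "kitchen", "vorstellung", "heidelberg", "young"]
found = group_by_suffix(sample, ["schen", "ung", "berg"])
for suffix, hits in found.items():
    print(f"-{suffix}: {', '.join(hits)}")
```

The suffixes here are the ones from the opening comment; swapping `endswith` for a substring check would cover the "-rsch-" and "aux" groups the same way.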

@erondpowell

erondpowell commented Jan 11, 2024

@lserni Gotcha. So from a processing/scripting standpoint, it was more or less a "manual/organic" search process?

I plan to programmatically parse this to get a pure list of "guaranteed English" words.

I've thought up two general strategies so far:

  1. Use ChatGPT to ask if xxxxx is an English word (I've tried this before and maybe my prompt was bad, but ChatGPT wasn't great at that task... English "borrows" a lot of words and ChatGPT seemed to not know where to draw the boundary.)

  2. Write a script to look up every single word in an existing dictionary (Google, Oxford, Wiktionary, etc) and keep the words that have entries.
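
A rough sketch of what strategy 2 could look like, assuming the reference dictionary is available as a local one-word-per-line text file rather than through any particular online API. The helper names and the file names in the comments are hypothetical. It also returns the dropped words, so borderline loan words can be reviewed instead of silently vanishing.

```python
def load_words(path):
    """Read one word per line from a UTF-8 text file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def split_by_reference(candidates, reference):
    """Split candidates into (kept, dropped) by membership in a reference list.

    Comparison is case-insensitive; the original spelling is preserved.
    """
    ref = {w.lower() for w in reference}
    kept, dropped = [], []
    for word in candidates:
        (kept if word.lower() in ref else dropped).append(word)
    return kept, dropped


# In-memory example; real use would be something like
#   split_by_reference(load_words("words.txt"), load_words("reference.txt"))
# where both file names are placeholders.
kept, dropped = split_by_reference(
    ["kitchen", "kischen", "Bordeaux"],
    ["kitchen", "bordeaux"],
)
print("kept:", kept)
print("dropped:", dropped)
```

The result is only as good as the reference list: a dictionary that includes loan words will keep them, and one that is missing real words will drop them.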

Please lmk if you have any ideas 💯

@thelabcat

I tried using some Python dictionary modules, but even the ones backed by online databases are missing some words I know are real.

3 participants