#!/usr/bin/env /home/chdavis/git/anki-cli/.venv/bin/python
"""Anki CLI - fetch online definitions and add cards to Anki vocabulary decks
Based on this API: https://github.com/FooSoft/anki-connect/
A note on searching for declined / conjugated forms of words:
It would be nice to confirm that the content fetched corresponds to the term
searched, rather than to a declined form. However, each dictionary provider
handles this differently, and not even consistently within a given language,
since it may depend on the part of speech of the term. So, the user simply
needs to be aware that if the definition shows a different canonical form, they
should re-search for the canonical form, and then add that term instead. (This
should be clearly visible when it happens, because the search term, if present,
will be highlighted in the displayed text.)
For example, in Dutch, searching for 'geoormerkt' (a past participle) will
return the definition for 'oormerken' (the infinitive). In that case, rather
than add that card, re-search for 'oormerken', now that you know what the base
form is, and add the latter as a new card instead.
A note regarding text-only (non-HTML) cards:
Using text-only cards (non-HTML) implies that, when you want to edit a card in
the Anki GUI, you should use the source editor (Ctrl-Shift-X) rather than the
WYSIWYG/rich-text editor.
Even with the source editor, if you ever edit a card in the Anki GUI and it
contains an ampersand `&`, eg `R&D`, in the front or back fields, then it'll be
automatically HTML-encoded anew as `R&amp;D` in the source. That means your
text searches for 'R&D' won't find that match.
If you re-view that card from this CLI, the source text can be fixed/updated
anew.
Cards should use the CSS style `white-space: pre-wrap;` to enable wrapping of
raw text.
Note, only the note type (model) called 'Basic' is supported. We assume that it
has the standard field names 'Front' and 'Back'. Any other cards won't be
displayed.
Note on duplicate detection:
The Android and Desktop apps detect dupes across the same note type, not the
same deck. Desktop will let you see what the dupes are, ie whether they're in a
different deck. Android doesn't, though, so you might create dupes there when
adding new (empty) cards. That's ok. Once you get back to the CLI and dequeue
the empties, the existing card will be detected.
"""
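The ampersand round-trip described above can be reproduced with the stdlib `html` module. This is a minimal sketch of the effect; Anki's rich-text editor may encode more characters than shown here.

```python
import html

# What the GUI editor writes back when a field contains a bare ampersand:
stored = html.escape('R&D', quote=False)
assert stored == 'R&amp;D'

# A plain-text search for 'R&D' then misses the stored source ...
assert 'R&D' not in stored

# ... until the entities are unescaped again (which re-viewing the card
# from this CLI effectively does):
assert html.unescape(stored) == 'R&D'
```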
# Backlog/TODO
# Put this into its own repo (so that vscode uses just one venv per workspace/repo)
# And enable the debugger
# See the extension that tries to work around duplicate detection across the note type:
# https://ankiweb.net/shared/info/1587955871
# But, What about Android? Can just add the duplicate, and let this script figure it out later when dequeueing empties ...
# TODO rethink/refactor the sanity of this state machine in the main loop
# Add support for wiktionary? (has more words, eg botvieren)
# (But doesn't always have IPA?) ?
# Note that a word might also list homonyms in other langs. How to restrict to a given lang?
# eg via ? https://github.com/Suyash458/WiktionaryParser
# Consider making that its own Anki add-on, independent of this CLI
# Consider putting the menu at the top of the screen, since I focus on the top left to see the words anyway
# But then I'd still need to keep the search/input line at the bottom, due to the sequence of printing
# TODO card type dependency on 'Basic' :
# But rather than depend on 'Front' and 'Back', maybe we could generalize this to use get_card()['question'] and ...['answer']
# Those are the rendered versions (full of HTML/CSS), containing whatever fields the card type defines.
# But the rendered version can't tell us if a card is already normalized, as it never will be. For that we'd have to check the raw field content in the note (eg Front or Back)
# logging.error() should also go to the screen, somehow ...
# Maybe wait until I think of a better way to manage the UI
# Text UI libs? eg ncurses, etc ?
# https://www.willmcgugan.com/blog/tech/post/building-rich-terminal-dashboards/
# Make the 'o' command open whatever the source URL was (not just woorden.org)
# BUG no NL results from FD (from FreeDictionary)
# Why does EN work when NL doesn't?
# If Woorden is often unavailable, make this configurable in the menu (rather than hard-coded)?
# Use this freeDictionary API, so as to need less regex parsing
# https://github.com/Max-Zhenzhera/python-freeDictionaryAPI/
# Background thread to keep cached data up-to-date, eg when cached values need to be uncached/refreshed.
# Else eg the deck screen has to make many slow API calls
# Or, is there a way/an API call to get all the counts of new/learning/reviewing from all decks in one call?
# See getDeckStats which gives new_count, learn_count, review_count and "name", for each deck object
# TODO make a class for a Card ?
# Or at least wrap it in my `dictd` class (from startup.py), so that it could be:
# card.fields.front.value (instead of the verbose syntax)
#
# Easiest to just use:
# https://docs.python.org/3/library/dataclasses.html
# from dataclasses import dataclass
# @dataclass
# class Card:
# front: str
# back: str
# ...
# So, we don't have to keep digging into card['fields']...
# But maybe I need some accessors ... or a constructor to break down the card['fields']... Or maybe a 'match' statement?
# Make a stringified version of the card, for logging, with just these fields:
# 'cardId' 'note' 'deckName' 'interval' ['fields']['front']['value']
# logging.debug(...)
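The dataclass idea sketched above could look roughly like this. This is a hypothetical illustration, not code the script uses; `from_api` assumes the key names seen elsewhere in this file for anki-connect card objects ('cardId', 'deckName', 'fields', 'interval').

```python
from dataclasses import dataclass

@dataclass
class Card:
    card_id: int
    deck: str
    front: str
    back: str
    interval: int = 0

    @classmethod
    def from_api(cls, card: dict) -> 'Card':
        # Flatten the verbose card['fields']['Front']['value'] access
        return cls(
            card_id=card['cardId'],
            deck=card['deckName'],
            front=card['fields']['Front']['value'],
            back=card['fields']['Back']['value'],
            interval=card.get('interval', 0),
        )

    def __str__(self) -> str:
        # Compact stringified version, eg for logging.debug(...)
        return f"card {self.card_id} ({self.deck}, ivl={self.interval}): {self.front!r}"
```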
# Also make a class to represent the ResultSet, so that we don't have to separately maintain card_ids_i ?
# Needs bidirectional/adhoc traversal, like a doubly linked list.
# (A iter() only allows forward traversal. And `deque` is for consuming elements out of the list.)
# I don't really need all the overhead of a doubly linked list (the list won't be modified, just deleted)
# I just need bidirectional iteration (maybe useful in general?)
# But a generator iterator might not work, since it's just freezing the state of a single function call.
# But I need to differentiate between next() vs prev()
# And what would happen if the underlying list were modified?
# Should I require the underlying DS to be a tuple for simplicity?
# Is there an existing PyPi for an iterator that has a prev() ?
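For the bidirectional traversal discussed above, a small cursor over a frozen tuple may be all that's needed. This is an illustration of the idea, not an existing PyPI package:

```python
class BiCursor:
    """Bidirectional cursor over an immutable sequence (clamps at both ends)."""

    def __init__(self, items):
        self._items = tuple(items)  # freeze, so later list mutations can't bite
        self._i = 0

    @property
    def current(self):
        return self._items[self._i]

    def next(self):
        self._i = min(self._i + 1, len(self._items) - 1)
        return self.current

    def prev(self):
        self._i = max(self._i - 1, 0)
        return self.current
```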
# The 'u' 'update' (for normalization) should also prompt with a diff, before making the change (since there's no undo)
# Just like the 'r' replace function already does.
# Logging:
# Modifying it to send WARNING level messages also to logging.StreamHandler()
# And INFO also to the StreamHandler when in debug mode
# In highlighter() highlight `query` and `term` in diff colors
# BUG why does beep() not beep unless in debug mode ?
# TODO Address Pylance issues,
# eg type hints
# And then define types for defs
# Consider adding a GPT command/prompt to ask adhoc questions about this / other cards ?
# Also when no results found.
# Could use the embeddings to find synonyms, for example (and note which I have locally?)
# Might have to use stop tokens to limit the response to one line ?
# Or rather than customize it for one service, make a command (!) to pipe to a shell command
# Doesn't vim also have something like that?
# And save the last command in readline (or read from bash history?)
# Then use the chatgpt.py script to receive content piped in, along with a question on the CLI
# Make sure to also include the term (since it's not part of the def content)
# TODO
# Think about how to add multiple web services for a single deck/lang (?)
# Like how we switch decks currently?
# Eg beyond a dictionary, what about extra (web) services for:
# synonyms, pronunciation, etymology, etc, or just allowing for multiple search providers
# Maybe just:
# { lang: en, dict: dictionary.com, syn/thes: somesynservice.com, ipa: some ipa service, etym: etymonline.com, ...}
# Get IPA from Wiktionary (rather than FreeDictionary)?
# And maybe later think about how to combine/concat these also to the same anki card ...
# Is there an API for FD? Doesn't seem like it.
# cf get_url()
# Add nl-specific etymology? (Wiktionary has some of this)
# https://etymologiebank.nl/
# using Wiktionary would enable mapping from conjugated forms eg postgevat because it links to the infinitives eg postvatten
# (but then search again locally first, since I might already have that verb)
# FR: Or use a diff source, eg TV5
# https://langue-francaise.tv5monde.com/decouvrir/dictionnaire/f/franc
# Add DWDS for better German defs (API?). But get IPA pronunciation elsewhere
# (eg FreeDictionary or Wiktionary)
# TODO consider switching to curses lib
# https://docs.python.org/3/howto/curses.html#curses-howto
# Alternative libs: urwid prompt_toolkit blessings npyscreen
# And then I can use something like a progress bar for showing the timeout on the http requests ...
# But, does anki-connect even process client requests async?
# Replace colors with `termcolor` lib?
# TODO consider colorama here?
# Run the queries needed to update the menu in a separate thread, update the UI quicker
# And maybe also the sync() since it should just be fire-and-forget (but then update empties count)
# TODO make menu rendering async, or just make all external queries async?
# Migrate from urllib to httpx (or aiohttp) to use async
# Since I'd also like to try to make formatted text versions for other
# languages, maybe regex-based rendering isn't the most sustainable approach.
# Replace regex doc parsing with eg
# https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/
# And use CSS selectors to extract content more robustly
# use BeautifulSoup?
# Convert HTML to Markdown?
# Consider library html2text
# Would an XSLT, per source, make sense for the HTML def content?
# eg this lib: https://lxml.de/
# https://www.w3schools.com/xml/xsl_intro.asp
# Consider alternative add-ons for Anki (for creating new cards using online dicts)
# https://ankiweb.net/shared/info/1807206748
# https://github.com/finalion/WordQuery
# All add-ons:
# https://ankiweb.net/shared/add-ons/
# Repo/Packaging:
# figure out how to package deps (eg readchar) and test it again after removing local install of readchar
# Move this dir to its own repo
# http://manpages.ubuntu.com/manpages/git-filter-repo
# or
# http://manpages.ubuntu.com/manpages/git-filter-branch
# Stemming for search?
# Or add the inflected forms to the card? as a new field?
# Most useful for langs that you don't know so well.
# (because those matches would be more important than just matching in the desc somewhere)
# Worst case, the online dictionary solves this anyway, so then I'll realize that I searched the wrong card.
# So, it's just one extra manual search. Maybe not worth optimizing. But more interesting for highlighting.
# Enable searching for plural forms on the back of cards:
# Find/remove/update all cards that have a pipe char | in the Verbuigingen/Vervoegingen:
# So that I can also search/find (not just highlight) eg bestek|ken without the pipe char
# Search: back:*Ver*gingen:*|* => 2585 cards
# Make a parser to grab and process it, like what's in the render() already, but then also replace it in the description.
# Maybe copy out some things from render() that should be permanent into its own def
# And then update the card (like we did before to remove HTML from 'front')
# TODO: store the 2-letter lang-code in the deck description in Anki, eg:
# lang=nl
# lang:nl
# See smaller, inline TODOs below ...
# Note that regex search in Anki is supported from 2.1.24+ onward
# https://apps.ankiweb.net/
# https://docs.ankiweb.net/searching.html
# https://docs.rs/regex/1.3.9/regex/#syntax
################################################################################
import argparse
import copy
import datetime
import difflib
import enum
import functools
import html
import json
import logging
import math
import os
import pprint
import random
import readline
import socket
import subprocess
import sys
import tempfile
import textwrap
import time
from typing import Optional
from urllib import request, parse
from urllib.error import HTTPError, URLError
# External dependencies
import addict
import autopage
import bs4 # BeautifulSoup
# NB, the pip package is called iso-639 (with "-").
# And it is DEPRECATED; pip warns:
# DEPRECATION: iso-639 is being installed using the legacy 'setup.py install'
# method, because it does not have a 'pyproject.toml' and the 'wheel' package is
# not installed. pip 23.1 will enforce this behavior change. A possible
# replacement is to enable the '--use-pep517' option. Discussion can be found at
# https://github.com/pypa/pip/issues/8559
# Alternatively, try: https://pypi.org/project/pycountry/
import iso639 # Map e.g. 'de' to 'german', as required by SnowballStemmer
import pyperclip
import readchar # For reading single key-press commands
# Overriding `re` with the `regex` module is necessary for wildcard searches,
# due to extra interpolation; otherwise 're' raises an exception. Search for
# 'regex' below.
# https://learnbyexample.github.io/py_regular_expressions/gotchas.html
# https://docs.python.org/3/library/re.html#re.sub
# "Unknown escapes of ASCII letters are reserved for future use and treated as
# errors."
import regex as re
import unidecode
from nltk.stem.snowball import SnowballStemmer
p = print
pp = pprint.PrettyPrinter(indent=4)
################################################################################
# Color codes:
# https://stackoverflow.com/a/33206814/256856
COLOR = dict()
COLOR['DN'] = "\033[0;00m" # default normal (depends on terminal color theme)
COLOR['DD'] = "\033[0;02m" # default dim
COLOR['DB'] = "\033[1;02m" # default bold (and dim)
COLOR['RN'] = "\033[0;31m" # red normal
COLOR['RB'] = "\033[1;31m" # red bold
COLOR['RU'] = "\033[4;31m" # red underline
COLOR['GN'] = "\033[0;32m" # green
COLOR['GB'] = "\033[1;32m"
COLOR['GU'] = "\033[4;32m"
COLOR['YN'] = "\033[0;33m" # yellow
COLOR['YB'] = "\033[1;33m"
COLOR['YU'] = "\033[4;33m"
COLOR['BN'] = "\033[0;34m" # blue
COLOR['BB'] = "\033[1;34m"
COLOR['BU'] = "\033[4;34m"
COLOR['MN'] = "\033[0;35m" # magenta
COLOR['MB'] = "\033[1;35m"
COLOR['CN'] = "\033[0;36m" # cyan
COLOR['CB'] = "\033[1;36m"
COLOR['WN'] = "\033[0;37m" # white
COLOR['WB'] = "\033[1;37m"
COLOR['YL'] = "\033[0;93m" # yellow light(er)
# Abstract colors concepts / use cases
COLOR['NONE'] = COLOR['DN'] # default normal (depends on terminal color theme)
COLOR['COMM'] = COLOR['WB'] # commands, menu items, hot keys/shortcuts
COLOR['FAIL'] = COLOR['RN'] # errors, failures, alerts, urgency
COLOR['WARN'] = COLOR['YB'] # warnings, attention, CTAs
COLOR['INFO'] = COLOR['DD'] # info, debug, low prio
COLOR['OKOK'] = COLOR['GB'] # ok, success, affirmation
COLOR['VALS'] = COLOR['GN'] # values, variables
COLOR['HIGH'] = COLOR['YL'] # highlights, search keywords,
# The addict.Dict allows dotted access to dict keys as attributes, eg C.WARN
C = addict.Dict(COLOR)
def W(color:str, string:str=''):
"""Wrap a color around an string, and then reset the color back to default
eg W(C.WARN, 'Warning message')
"""
return color + string + C.NONE
class Key(enum.StrEnum):
# The Esc key is doubled, since it is a modifier and isn't accepted solo
ESC_ESC = '\x1b\x1b'
CTRL_C = '\x03'
CTRL_D = '\x04'
CTRL_P = '\x10'
CTRL_W = '\x17'
UP = '\x1b[A'
DEL = '\x1b[3~'
def assert_anki(retry=True):
"""Ping anki-connect to check if it's running, else launch anki
NB, Anki is a singleton, so this wouldn't launch multiples
"""
port = 8765
host = 'localhost'
try:
socket.create_connection((host, port), timeout=1).close()
return True
except (ConnectionRefusedError, socket.timeout):
if not retry:
msg = (
'Failed to connect to Anki. '
'Make sure that Anki is running, '
'and using the anki-connect add-on.'
)
logging.warning(msg)
sys.exit(msg)
# If you used os.system to background Anki here, it would launch, but you
# couldn't redirect its stdout/stderr to a log file, and output from Anki or
# its add-ons would interfere with our CLI output on stdout.
cmd = ['env', 'ANKI_WAYLAND=1', 'anki']
dir_path = os.path.dirname(os.path.realpath(__file__))
with open(os.path.join(dir_path, 'anki.log'), 'a') as log_file:
subprocess.Popen(cmd, stdout=log_file, stderr=log_file)
time.sleep(1.0)
# Try one last time
return assert_anki(retry=False)
def invoke(action, **params):
"""Send a request to Anki desktop via the API for the anki-connect add-on
Details:
https://github.com/FooSoft/anki-connect/
"""
struct = { 'action': action, 'params': params, 'version': 6 }
reqJson = json.dumps(struct).encode('utf-8')
logging.debug(b'invoke:' + reqJson, stacklevel=2)
req = request.Request('http://localhost:8765', reqJson)
try:
response = json.load(request.urlopen(req))
if options.debug:
# Simplify some debug logging
result_log = copy.deepcopy(response['result'])
if isinstance(result_log, dict):
result_log = [ result_log ]
if isinstance(result_log, list):
if len(result_log) > 10:
result_log = 'len:' + str(len(result_log))
else:
for obj in result_log:
if not isinstance(obj, dict): continue
for field in ('question', 'answer', 'css'):
if field in obj: obj[field] = '<...>'
if 'fields' in obj and 'Back' in obj['fields']:
obj['fields']['Back']['value'] = '<...>'
logging.debug('result:\n' + pp.pformat(result_log), stacklevel=2)
error = response['error']
if error is not None:
beep(3)
logging.error('error:\n' + str(error), stacklevel=2)
logging.error('result:\n' + pp.pformat(response['result']), stacklevel=2)
return None
else:
return response['result']
except (ConnectionRefusedError, URLError) as e:
if assert_anki():
# Retry the request
return invoke(action, **params)
else:
return None
def get_deck_names():
names = sorted(invoke('deckNames'))
# Filter out sub-decks ?
names = [ i for i in names if i != 'Default' and not '::' in i]
return names
def renderer(
string,
query='',
*,
term='',
deck=None,
):
"""For displaying (normalized) definition entries on the console/CLI"""
# Prepend term in canonical format, for display only
if term:
hr = '─' * len(term)
string = '\n'.join(['', term, hr, string])
string = wrapper(string)
# Ensure one newline at the end
string = re.sub(r'\n*$', '\n', string)
string = highlighter(string, query, term=term, deck=deck)
return string
def normalizer(
string,
*,
term=None,
):
"""Converts HTML to text, for saving in Anki DB"""
# Specific to woorden.org
# Before unescaping HTML entities: replace (&lt; and &gt;) with ( and )
string = re.sub(r'&lt;|《', '(', string)
string = re.sub(r'&gt;|》', ')', string)
string = re.sub(r'&nbsp;', ' ', string)
# Other superfluous chars:
string = re.sub(r'《/?em》|«|»', '', string)
# Replace HTML entities with unicode chars (for IPA symbols, etc)
string = html.unescape(string)
# Remove tags that are usually inside the IPA/phonetic markup
string = re.sub(r'</?a\s+.*?>', '', string)
# Replace IPA stress marks that are not commonly represented in fonts.
# IPA Primary Stress Mark (Unicode U+02C8) ie the [ˈ] character => apostrophe [']
# IPA Secondary Stress Mark (Unicode U+02CC) ie the [ˌ] character => comma [,]
# IPA Long vowel length (Unicode U+02D0) ie the [ː] character => colon [:]
# eg for the NL word "apostrof", change the IPA: [ ˌapɔsˈtrɔf ] => [ ,apɔs'trɔf ]
string = re.sub(r'\u02C8', "'", string)
string = re.sub(r'\u02CC', ",", string)
string = re.sub(r'\u02D0', ":", string)
# Remove numeric references like [3]; we probably don't have the footnotes anyway
string = re.sub(r'\[\d+\]', '', string)
# NL-specific (or specific to woorden.org).
# Segregate topical category names e.g. 'informeel' .
# Definitions in plain text will often have the tags already stripped out.
# So, also use this manually curated list.
# spell-checker:disable
categories = [
*[]
# These are just suffixes that mean "study of a(ny) field"
,r'\S+kunde'
,r'\S+ografie'
,r'\S+ologie'
,r'\S+onomie'
,r'\S*techniek'
,r'financ[a-z]+'
,'algemeen'
,'ambacht'
,'anatomie'
,'architectuur'
,'cinema'
,'commercie'
,'computers?'
,'constructie'
,'culinair'
,'defensie'
,'educatie'
,'electriciteit'
,'electronica'
,'formeel'
,'geschiedenis'
,'handel'
,'informatica'
,'informeel'
,'internet'
,'juridisch'
,'kunst'
,'landbouw'
,'medisch'
,'metselen'
,'muziek'
,'ouderwets'
,'politiek'
,'religie'
,'scheepvaart'
,'slang'
,'speelgoed'
,'sport'
,'spreektaal'
,'taal'
,'technisch'
,'theater'
,'transport'
,'verkeer'
,'verouderd'
,'visserij'
,'vulgair'
]
# spell-checker:enable
# If we still have the HTML tags, then we can see if this topic category is
# new to us. Optionally, it can then be manually added to the list above.
# Otherwise, they wouldn't be detected in old cards, if it's not already in
# [brackets] .
for match in re.findall(r'<sup>([a-z]+?)</sup>', string) :
category = match
# logging.debug(f'{category=}')
# If this is a known category, just format it as such.
# (We're doing a regex match here; a category name might be a regex.)
string = re.sub(r'<sup>(\w+)</sup>', r'[\1]', string)
if any([ re.search(c, category, re.IGNORECASE) for c in categories ]):
...
else:
# Notify, so you can (manually) add this one to the 'categories'
# list above.
print(f"\nNew category [" + W(C.WARN, category) + "]\n",)
beep()
# time.sleep(5)
# Replace remaining <sup> tags
string = re.sub(r'<sup>', r'^', string)
# Specific to: PONS Großwörterbuch Deutsch als Fremdsprache
string = re.sub('<span class="illustration">', '\n', string)
# Specific to fr.thefreedictionary.com (Maxipoche 2014 © Larousse 2013)
string = re.sub('<span class="Ant">', '\nantonyme: ', string)
string = re.sub('<span class="Syn">', '\nsynonyme: ', string)
# Specific to en.thefreedictionary.com
# (American Heritage® Dictionary of the English Language)
string = re.sub(r'<span class="pron".*?</span>', '', string)
# Replace headings that just break up the word into syl·la·bles,
# since we get that from IPA already
string = re.sub(r'<h2>.*?·.*?</h2>', '', string)
# For each new part-of-speech block
string = re.sub(r'<div class="pseg">', '\n\n', string)
# Add spaces around em dash — for readability
string = re.sub(r'(\S)—(\S)', r'\1 — \2', string)
# HTML-specific:
# Remove span/font tags, so that the text can stay on one line
string = re.sub(r'<span\s+.*?>', '', string)
string = re.sub(r'<font\s+.*?>', '', string)
# These HTML tags <i> <b> <u> <em> are usually used inline and should not
# have a line break (below, we replace remaining tags with \n ...)
string = re.sub(r'<(i|b|u|em)>', '', string)
string = re.sub(r'<br\s*/?>', '\n\n', string)
string = re.sub(r'<hr.*?>', '\n\n___\n\n', string)
# Headings on their own line, by replacing the closing tag with \n
string = re.sub(r'</h\d>\s*', '\n', string)
# Tables, with \n\n between rows
string = re.sub(r'<td.*?>', '', string)
string = re.sub(r'<tr.*?>', '\n\n', string)
# Replace remaining opening tags with a newline, since usually a new section
string = re.sub(r'<[^/].*?>', '\n', string)
# Remove remaining (closing) tags
string = re.sub(r'<.*?>', '', string)
# Segregate pre-defined topical category names
# Wrap in '[]', the names of topical fields.
# (when it's last (and not first) on the line)
categories_re = '|'.join(categories)
string = re.sub(f'(?m)(?<!^)\\s+({categories_re})$', r' [\1]', string)
# Non-HTML-specific:
# Collapse sequences of space/tab chars
string = re.sub(r'\t', ' ', string)
string = re.sub(r' {2,}', ' ', string)
# NL-specific (or specific to woorden.org)
string = re.sub(r'Toon alle vervoegingen', '', string)
# Remove hover tip on IPA pronunciation
string = re.sub(r'(?s)<a class="?help"? .*?>', '', string)
# Ensure headings begin on their own line
# (also covers plural forms, eg "Synoniemen")
string = re.sub(
r'(?m)(?<!^)(Afbreekpatroon|Uitspraak|Vervoeging|Verbuiging|Synoniem|Antoniem)',
r'\n\1',
string
)
# NL-specific: Newlines (just one) before example `phrases in backticks`
# (but not *after*, else you'd get single commas on a line, etc)
string = re.sub(r'(?m)(?:\n*)(`.*?`)', r'\n\1', string)
# One, and only one, newline \n after colon :
# (but only if the colon : is not already inside of a (short) parenthetical)
string = re.sub(r'(?m):([\s\n]+)(?![^(]{,20}\))', r':\n', string)
# Remove separators in plurals (eg in the section: "Verbuigingen")
string = re.sub(r'\|', '', string)
# Ensure 1) and 2) sections start a new paragraph
string = re.sub(r'(?m)^(\d+\))', r'\n\n\1', string)
# Ensure new sections start a new paragraph, eg I. II. III. IV.
string = re.sub(r'(?m)^(I{1,3}V?\s+)', r'\n\n\1', string)
# DE-specific:
# Ensure new sections start a new paragraph, eg I. II. III. IV.
string = re.sub(r'\s+(I{1,3}V?\.)', r'\n\n\1', string)
# New paragraph for each definition on the card, marked by eg: ...; 1. ...
string = re.sub(r';\s*(\d+\. +)', r'\n\n\1', string)
string = re.sub(r'(?m)^\s*(\d+\. +)', r'\n\n\1', string)
# And sub-definitions, also indented, marked by eg: a) or b)
string = re.sub(r';?\s+([a-z]\) +)', r'\n \1', string)
# Newline after /slashes/ often used as context, if at the start of the line
string = re.sub(r'(?m)^\s*(/.*?/)\s*', r'\1\n', string)
# Max 2x newlines in a row
string = re.sub(r'(\s*\n\s*){3,}', '\n\n', string)
# Delete leading/trailing space on each line
string = re.sub(r'(?m)^ +', '', string)
string = re.sub(r'(?m) +$', '', string)
# Delete leading space on the entry as a whole
string = re.sub(r'^\s+', '', string)
# Strip redundant term at start of card, if it's a whole word, non-prefix
if term:
string = re.sub(r'^\s*' + term + r'\s+', r'', string)
# Delete trailing space, and add canonical final newline
string = re.sub(r'\s*$', '', string)
if string != '':
string = string + '\n'
return string
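The IPA stress/length substitutions in normalizer() can be checked in isolation; the stdlib `re` suffices for this snippet, which reuses the "apostrof" example from the comments above.

```python
import re

# [ ˌapɔsˈtrɔf ] => [ ,apɔs'trɔf ], as described for the NL word "apostrof"
ipa = '\u02CCap\u0254s\u02C8tr\u0254f'
ipa = re.sub('\u02C8', "'", ipa)  # primary stress mark => apostrophe
ipa = re.sub('\u02CC', ',', ipa)  # secondary stress mark => comma
ipa = re.sub('\u02D0', ':', ipa)  # long vowel mark => colon
assert ipa == ",ap\u0254s'tr\u0254f"
```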
def highlighter(
string,
query,
*,
term='',
deck=None,
):
# Map wildcard search chars to regex syntax
query = re.sub(r'[.]', r'\.', query)
query = re.sub(r'[_]', r'.', query)
# Even though this is a raw string, the '\' needs to be escaped, because
# the 're' module raises an exception for any escape sequences that are
# not valid in a standard string. (The 'regex' module doesn't.)
# https://learnbyexample.github.io/py_regular_expressions/gotchas.html
# https://docs.python.org/3/library/re.html#re.sub
# "Unknown escapes of ASCII letters are reserved for future use and treated
# as errors."
query = re.sub(r'[*]', r'[^ ]*', query)
# Terms to highlight
highlights = { query }
# Collapse double letters in the search term
# eg ledemaat => ledemat
# So that can now also match 'ledematen'
# This is because the examples in the 'back' field will include declined forms
collapsed = re.sub(r'(.)\1', r'\1', query)
if collapsed != query:
highlights.add(collapsed)
if term:
# Also highlight the canonical form, in case the search query was different
highlights.add(term)
term_or_query = term or query
if term_or_query != unidecode.unidecode(term_or_query):
highlights.add(unidecode.unidecode(term_or_query))
logging.debug(f'{term_or_query=}')
# TODO also factor out the stemming (separate from highlighting, since lang-specific)
# NB, this stemming isn't that reliable, eg
# fr/fendre => 'fendr' (but should be 'fend')
# Map e.g. 'de' to 'german', as required by SnowballStemmer
if deck and deck in iso639.languages.part1:
lang = iso639.languages.get(part1=deck).name.lower()
stemmer = SnowballStemmer(lang)
stem = stemmer.stem(term_or_query)
if stem != term_or_query:
highlights.add(stem)
logging.debug(f'{stem=}')
# Language/source-specific extraction of inflected forms
if deck == 'nl':
# Hack stemming, assuming -en suffix, but not for short words like 'een'
# For cases: verb infinitives, or plural nouns without singular
# eg ski-ën, hersen-en
highlights.add( re.sub(r'\b(.{2,})en\b', r'\1', term_or_query) )
# And adjectives/nouns like vicieus/vicieuze or reus/reuze or keus/keuze
if term_or_query.endswith('eus') :
highlights.add( re.sub(r'eus$', r'euz\\S*', term_or_query) )
# Find given inflections listed in the definition/entry
matches = []
# Theoretically, we could avoid a double loop here, but this makes it
# easier to read. There can be multiple inflections in one line (eg
# prijzen), so it's easier to have two loops.
inflections = re.findall(
r'(?m)^\s*(?:Vervoegingen|Verbuigingen):\s*(.*?)\s*$',
string
)
for inflection in inflections:
# There is not always a parenthetical part-of-speech after the
# inflection of plurals; sometimes the line just ends (eg "nederlaag").
# So, each inflection ends either with eol $ or an open paren (
match = re.findall(r'(?s)(?:\)|^)\s*(.+?)\s*(?:\(|$)', inflection)
matches += match
for match in matches:
# Remove separators, e.g. in "Verbuigingen: uitlaatgas|sen (...)"
match = re.sub(r'\|', '', match)
# If past participle, remove the 'is' or 'heeft'
# Sometimes as eg:
# uitrusten: 'is, heeft uitgerust' or 'heeft, is uitgerust'
match = re.sub(r'^(is|heeft)(,\s+(is|heeft))?\s+', '', match)
# And the reflexive portion 'zich' isn't necessary, eg: "begeven"
match = re.sub(r'\bzich\b', '', match)
# This is for descriptions with a placeholder char like:
# "kind": "Verbuigingen: -eren" => "kinderen"
# "homo": "'s" => "homo's"
match = re.sub(r"^[-'~]", term_or_query, match)
# plural nouns with multiple declensions, CSV
# eg waarde => waarden, waardes
if ',' in match:
highlights.update(re.split(r',\s*', match))
match = ''
# Collapse spaces, and trim
match = re.sub(r'\s+', ' ', match)
match = match.strip()
# Hack stemming for infinitive forms with a consonant change in
# simple past tense:
# dreef => drij(ven) => drij(f)
# koos => kie(zen) => kie(s)
if term_or_query.endswith('ven') and match.endswith('f'):
highlights.add( re.sub(r'ven$', '', term_or_query) + 'f' )
if term_or_query.endswith('zen') and match.endswith('s'):
highlights.add( re.sub(r'zen$', '', term_or_query) + 's' )
# Allow separable verbs to be separated, in both directions.
# ineenstorten => 'stortte ineen'
# BUG capture canonical forms that end with known prepositions
# (make a list)
# eg teruggaan op => ging terug op (doesn't work here)
# We should maybe just remove the trailing preposition
# (if it was also a trailing word in the 'front')
if separable := re.findall(r'^(\S+)\s+(\S+)$', match):
# NB, the `pre` is anchored with \b because the prepositions
# are short and there would otherwise be many false positive
# matches
# eg stortte, ineen
(conjugated, pre), = separable
highlights.add( f'{conjugated}.*?\\b{pre}\\b' )
highlights.add( f'\\b{pre}\\b.*?{conjugated}' )
# eg storten
base = re.sub(f'^{pre}', '', term_or_query)
highlights.add( f'{base}.*?\\b{pre}\\b' )
highlights.add( f'\\b{pre}\\b.*?{base}' )
# eg stort
stem = re.sub(r'en$', '', base)
highlights.add( f'{stem}.*?\\b{pre}\\b' )
highlights.add( f'\\b{pre}\\b.*?{stem}' )
match = ''
if match:
highlights.add(match)
elif deck == 'de':
# TODO irregular forms? where could we get them from?
# DE: <gehst, ging, ist gegangen> gehen
# Could also get the conjugations via the section (online):
# Collins German Verb Tables (and for French, English)
# Or try Verbix? (API? Other APIs online for inflected forms?)
...
elif deck == 'fr':
# Test on eg céder (since also the accent changes in conjugation)
highlights.add( re.sub(r'\b(.{2,})(er|re|ir)\b', r'\1', term_or_query) )
logging.debug(f'{highlights=}')
# Sort the highlight terms so that the longest are first.
# Since inflections might be prefixes.
# i.e. this will prefer matching 'kinderen' before 'kind'
highlight_re = '|'.join(sorted(highlights, key=len, reverse=True))
# Highlight accent-insensitive:
# Start on a copy without accents:
string_decoded = unidecode.unidecode(string)
# NB, the string length will be the same if accents are simply removed.
# However, chars like the German 'ß' could make the decoded longer.
# So, first test if it's safe to use this position-based approach:
if len(string) == len(string_decoded):
# And the terms to highlight need to be normalized then too:
highlight_re_decoded = unidecode.unidecode(highlight_re)
# Get all match position intervals (half-open intervals)
i = re.finditer(f"(?i:{highlight_re_decoded})", string_decoded)
spans = [m.span() for m in i]
chars = list(string)
# Process the string back-to-front, since inserting changes indexes
for x, y in reversed(spans):
# Insert at y before x, since we work back-to-front
chars.insert(y, C.NONE)
chars.insert(x, C.HIGH)
string = ''.join(chars)
else:
# We can't do accent-insensitive highlighting.
# Just do case-insensitive highlighting.
# NB, the (?i:...) doesn't create a group.
# That's why ({highlight_re}) needs its own parens here.
string = re.sub(
f"(?i:({highlight_re}))",
C.HIGH + r'\1' + C.NONE,
string
)
return string
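The span-based, accent-insensitive highlighting above can be distilled into a standalone sketch. Note the assumptions: HIGH/NONE stand in for the script's C.HIGH/C.NONE (assumed to be ANSI escape codes), and the tiny FOLD table stands in for unidecode.unidecode, kept strictly 1:1 so character positions stay aligned.

```python
import re

# Stand-ins for the script's C.HIGH / C.NONE markers (assumed ANSI codes)
HIGH, NONE = '\x1b[7m', '\x1b[0m'

# Minimal accent folding as a stand-in for unidecode.unidecode;
# a strictly 1:1 mapping keeps positions aligned with the original string
FOLD = str.maketrans('ëéèê', 'eeee')

def highlight(string, pattern):
    decoded = string.translate(FOLD)
    if len(decoded) != len(string):
        # Lengths differ (eg 'ß' => 'ss'): positions no longer line up,
        # so fall back to plain case-insensitive highlighting
        return re.sub(f'(?i:({pattern}))', HIGH + r'\1' + NONE, string)
    # Collect match spans on the accent-stripped copy ...
    spans = [m.span() for m in re.finditer(f'(?i:{pattern})', decoded)]
    chars = list(string)
    # ... and insert the markers back-to-front, so that earlier
    # indexes remain valid after each insertion
    for x, y in reversed(spans):
        chars.insert(y, NONE)
        chars.insert(x, HIGH)
    return ''.join(chars)

print(highlight('geëxploiteerd', 'exploit'))
```

The key design point mirrored here: matching happens on the folded copy, but the markers are spliced into the original string, so accents are preserved in the output.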
def get_url(term, *, lang):
"""Get a dict of source URL(s) for a given query term/expression"""
quoted = parse.quote(term) # URL quoting
# TODO could perhaps generalize this further into a list of (per-lang) providers
# that provide both source URLs and parsing rules for the responses
url = {}
url['google'] = f'https://google.com/search?q={quoted}'
url['freedictionary'] = f'https://{lang}.thefreedictionary.com/{quoted}'
url['wiktionary'] = f'https://{lang}.wiktionary.org/wiki/{quoted}'
# TODO add lang-specific dicts ?
# TODO add a default per language, eg 'nl' aliases to 'woorden'
return url
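A quick standalone illustration of the URL building above (the term and the 'nl' language code are arbitrary examples): multi-word terms must be URL-quoted before being templated into a source URL.

```python
from urllib import parse

# URL-quote a multi-word term before templating it into a source URL
term = 'op zich'
quoted = parse.quote(term)
url = f'https://nl.wiktionary.org/wiki/{quoted}'
print(url)  # https://nl.wiktionary.org/wiki/op%20zich
```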
def search(term, *, lang):
"""...
"""
obj = {}
if lang == 'nl':
content = search_woorden(term)
obj['definition'] = content
return obj
obj = search_thefreedictionary(term, lang=lang)
return obj
def search_anki(
query,
*,
deck,
wild=False,
field='front',
browse=False,
term='',
):
"""Local search of Anki"""
# If the term contains whitespace, we must either quote the whole thing
# or replace the spaces:
search_query = re.sub(r' ', '_', query) # For Anki searches
# TODO accent-insensitive search?
# eg exploit should find geëxploiteerd
# It should be possible with Anki's non-combining mode: nc:geëxploiteerd
# https://docs.ankiweb.net/#/searching
# But doesn't seem to work
# Or see how it's being done inside this add-on:
# https://ankiweb.net/shared/info/1924690148
search_terms = [search_query]
# Collapse double letters \p{L} into a disjunction, eg: (NL-specific)
# This implies that the user should, when in doubt, use double chars to search
# deck:nl (front:maaken OR front:maken)
# or use a re: (but that doesn't seem to work)
# BUG: this isn't a proper Combination (maths), so it misses some cases
# TODO consider a stemming library here?
if deck == 'nl':
while True:
next_term = re.sub(r'(\p{L})\1', r'\1', search_query, count=1)
if next_term == search_query:
break
search_terms += [next_term]
search_query = next_term
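The collapsing loop above can be sketched standalone. Note: `\p{L}` in the original assumes the third-party `regex` module (stdlib `re` rejects it); this sketch substitutes `[a-z]` to stay stdlib-only. As the BUG comment notes, collapsing one pair at a time, left to right, does not enumerate all combinations.

```python
import re

def collapse_variants(query):
    # Collapse one doubled letter per step, accumulating each variant.
    # [a-z] stands in for the regex module's \p{L} used in the original.
    variants = [query]
    while True:
        nxt = re.sub(r'([a-z])\1', r'\1', variants[-1], count=1)
        if nxt == variants[-1]:
            break
        variants.append(nxt)
    return variants

print(collapse_variants('maaken'))  # ['maaken', 'maken']
```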
if field:
if wild:
# Wrap *stars* around (each) term.
# Note, only necessary if using 'field', since it's default otherwise
search_terms = [f'*{x}*' for x in search_terms]
search_terms = [f'"{field}:{x}"' for x in search_terms]
# Regex search of declinations:
# This doesn't really work, since the text in the 'back' field isn't
# consistent. Sometimes there's a parenthetical expression after the
# declination, sometimes not. So, I can't anchor the end of it, which
# means it's the same as just a wildcard search across the whole back.
# eg 'Verbuigingen.*{term}', and that's not any more specific than just
# searching the whole back ...
# if field == 'front' and deck == 'nl':
# # Note, Anki needs the term in the query that uses "re:" to be wrapped in double quotes (also in the GUI)
# terms = [*terms, f'"back:re:(?s)(Verbuiging|Vervoeging)(en)?:( |\s|<.*?>|heeft|is)*{term}\\b"' ]