Example usage:
bash itc.sh
or
./itc.sh
or
export PYTHONSTARTUP=itc.py
python3
or
bash-3.2$ python3
Python 3.5.1 (default, Jan 22 2016, 08:54:32)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from itctk import *
>>> doc = itc()
>>> tidak_words = doc.find('tidak')
>>> for sent in doc:  # for each sentence in doc
...     for word in sent:  # for each word in that sentence
...         print(word)
Look for words together with their parts of speech.
Do not use upper case in lookup(), because sentences are in lower case.
When you call lookup_c() directly, the result is dumped automatically. Assign it to a variable instead, such as ss.
e.g. to search for "aku makan", "aku minum", etc. and show the results line by line
ss = lookup(r"aku/prp \w+/vb")
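To see what a pattern such as aku/prp \w+/vb matches, here is a minimal sketch that uses only Python's re module. It assumes each sentence can be rendered as lower-case word/tag tokens separated by single spaces. The sample sentences and the helper match_tagged are hypothetical and not part of itctk, whose lookup() may work differently.

```python
import re

# Hypothetical sample data: lower-case "word/tag" tokens, one sentence per string.
SAMPLE_SENTENCES = [
    "aku/prp makan/vb nasi/nn",
    "aku/prp minum/vb kopi/nn",
    "dia/prp tidak/neg makan/vb",
]


def match_tagged(pattern, sentences):
    """Return the sentences in which the regular expression is found."""
    regex = re.compile(pattern)
    return [s for s in sentences if regex.search(s)]


# "aku" tagged prp, followed by any single word tagged vb.
hits = match_tagged(r"aku/prp \w+/vb", SAMPLE_SENTENCES)
for hit in hits:
    print(hit)
```

The raw-string prefix (r"...") keeps \w from being read as a string escape before the pattern reaches the regex engine.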
When you call lookup_c() directly, the result is dumped automatically. Assign it to a variable instead, such as ss.
ss = lookup_c("NEG PRP VB")
Regular expressions can be used, e.g. look for constructions with 12 consecutive NNP tags and show the results line by line
ss = lookup_c("(NNP ){12}")
e.g. look for constructions that have any POS preceded by NEG and followed by VB, such as NEG PRP VB, NEG JJ VB etc.
ss = lookup_c(r"NEG \w+ VB")
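The POS-sequence patterns above behave like ordinary regular expressions over a space-separated tag string. The sketch below illustrates that with re alone. It assumes the tags of each sentence are joined with single spaces; the tag sequences and the helper match_pos are made up for illustration and are not taken from itctk.

```python
import re

# Hypothetical POS sequences, one string per sentence, tags joined by spaces.
SAMPLE_POS = [
    "NEG PRP VB",
    "NEG JJ VB NN",
    "PRP VB NN",
    "NNP NNP NNP VB",
]


def match_pos(pattern, pos_sequences):
    """Return the POS sequences that contain a match for the pattern."""
    regex = re.compile(pattern)
    return [seq for seq in pos_sequences if regex.search(seq)]


# Any single tag between NEG and VB.
neg_any_vb = match_pos(r"NEG \w+ VB", SAMPLE_POS)

# At least three consecutive NNP tags, each followed by a space
# (the same shape as "(NNP ){12}", with a smaller count for the sample data).
nnp_run = match_pos(r"(NNP ){3}", SAMPLE_POS)

print(neg_any_vb)
print(nnp_run)
```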
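e.g. look for adverb-adjective pairs in which the adverb precedes the adjective
import re
parts = set()  # collects each distinct match once
for sent in ss:
    # find an adverb (RB) immediately followed by an adjective (JJ)
    part = re.search(r"\w+/RB \w+/JJ", str(sent))
    if part:
        parts.add(part.group(0))
dump(parts)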
a_sentence.pos()
a_sentence.text()
doc.lexicon
doc.pos
POS_TAGSET
e.g. look for the description and examples of part-of-speech "CC":
POS_TAGSET["CC"].desc
POS_TAGSET["CC"].ex
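One way a tagset table with .desc and .ex attributes could be organised is a dictionary of named tuples. This is only a sketch: the names TagInfo and SKETCH_TAGSET and the entry text are assumptions for illustration and are not copied from itctk's actual POS_TAGSET.

```python
from collections import namedtuple

# Hypothetical record type; itctk's real implementation may differ.
TagInfo = namedtuple("TagInfo", ["desc", "ex"])

# Illustrative entry only, not the text shipped with itctk.
SKETCH_TAGSET = {
    "CC": TagInfo(desc="coordinating conjunction", ex="dan, atau, tetapi"),
}

print(SKETCH_TAGSET["CC"].desc)
print(SKETCH_TAGSET["CC"].ex)
```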
doc.pos_list()
doc.word_list()
e.g. look for words "penge...kan" such as "pengecekan"
doc.find("^penge.+kan$")
to show sentence by sentence
dump(doc.find("^penge.+kan$"))
e.g. look for words "penge...kan" such as "pengecekan"
doc.find_word("^penge.+kan$")
to show line by line
dump(doc.find_word("^penge.+kan$"))
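The pattern ^penge.+kan$ is anchored at both ends, so it only matches a word that starts with "penge" and ends with "kan". A minimal sketch with re, assuming the pattern is applied to one word at a time; the word list and the helper find_words are hypothetical and are not itctk's find_word().

```python
import re

# Hypothetical word list for illustration.
WORDS = ["pengecekan", "pengetikan", "pengenalan", "makan", "pengelola"]


def find_words(pattern, words):
    """Return the words that match the regular expression."""
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]


matches = find_words(r"^penge.+kan$", WORDS)
print(matches)
```

Without the ^ and $ anchors the same pattern would also match inside longer words, which is usually not what a whole-word search wants.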
print(doc.text())
.
|- LICENSE
|- README.md
|- TODO.md
|- docs
| |-- ...
|- requirements.txt
|- itctk
| |-- __init__.py
| |-- ...
|- test
| |-- __init__.py
| |-- ....
|- setup.py
Project maintainers: Le Tuan Anh, David Moeljadi