GitHub - saffsd/geniatagger: - part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text -

saffsd / geniatagger Public

Notifications You must be signed in to change notification settings
Fork 16
Star 22

- part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text -

www-tsujii.is.s.u-tokyo.ac.jp/genia/tagger/

View license

22 stars 16 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
models_chunking		models_chunking
models_medline		models_medline
models_named_entity		models_named_entity
morphdic		morphdic
LICENSE		LICENSE
Makefile		Makefile
README		README
bidir.cpp		bidir.cpp
chunking.cpp		chunking.cpp
common.h		common.h
main.cpp		main.cpp
maxent.cpp		maxent.cpp
maxent.h		maxent.h
morph.cpp		morph.cpp
namedentity.cpp		namedentity.cpp
postag.cpp		postag.cpp
tokenize.cpp		tokenize.cpp

Repository files navigation


                     GENIA Tagger

                   Tsujii laboratory
                  University of Tokyo


1. How to build 

  You need gcc to build the tagger.

  % make

  if you encounter errors with hash, try commenting out "#define USE_HASH_MAP"
  in "maxent.h".


2. How to use

  Prepare a text file containing one sentence per line, then

  % ./geniatagger < TEXTFILE > TAGGEDTEXT

  The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags,
  and named entity (NE) tags in the following tab-separated format.

  word1	base1	POStag1	chunktag1 NEtag1
  word2	base2	POStag2	chunktag2 NEtag2
    :     :        :       :

  Chunks and named entities are represented in the IOB2 format (B for BEGIN,
  I for INSIDE, and O for OUTSIDE). The named entity tagger is trained on
  the NLPBA data set [3], which contains five semantic classes (DNA, RNA,
  cell_line, cell_type, and PROTEIN).


 2.1 Tokenization

  This tagger tokenizes the sentence with almost the same policy as the
  upenn tokenizer (http://www.cis.upenn.edu/~treebank/tokenizer.sed). 

  If you don't want the tagger to perform tokenization, use -nt option.
  In that case, the input sentences should be already tokenized by your
  own tokenizer with white spaces. 


References

[1] Yoshimasa Tsuruoka and Jun'ichi Tsujii, Bidirectional Inference with
    the Easiest-First Strategy for Tagging Sequence Data, Proceedings of
    HLT/EMNLP 2005, pp. 467-474.

[2] Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John
    McNaught, Sophia Ananiadou, and Jun'ichi Tsujii, Developing a Robust
    Part-of-Speech Tagger for Biomedical Text, Advances in Informatics -
    10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005

[3] http://research.nii.ac.jp/~collier/workshops/JNLPBA04st.htm

Please send questions and comments to [email protected].
---------------------------------
Tsujii laboratory,
Department of Computer Science,
University of Tokyo
http://www-tsujii.is.s.u-tokyo.ac.jp/