Wednesday, March 14, 2018

Stanford CoreNLP Notes

Notes on using Stanford CoreNLP from the CLI and the API:

-Annotators
-Regexner (pattern matching over POS tags)
-CLI: examples for different cases
-Usage on a web server (singleton instance)
-Performance: model comparisons
-POS Tags (Universal)
-Dependency Tags (Universal)
-Other Notes (extension...)


Annotators

ssplit, tokenize, pos, lemma, depparse, parse, ner, regexner, dcoref, cleanxml, coref, mention

Full list of annotators and details

ssplit  : Splits the text into sentences.
ssplit.eolonly: true (treat the end of a line as the end of a sentence / new line)
Other ssplit options

tokenize : splits into tokens
tokenize.whitespace: true (when false, the sentences are listed separately in the JSON output [sexpr and parser...])
Other tokenizer options
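A minimal sketch of passing these properties in through the API (the input text is made up; requires the CoreNLP jars on the classpath):

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SsplitDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    // treat each line as exactly one sentence
    props.setProperty("ssplit.eolonly", "true");
    // split tokens on whitespace only
    props.setProperty("tokenize.whitespace", "true");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("First line here.\nSecond line here.");
    pipeline.annotate(doc);
  }
}
```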

pos  : part of speech
https://nlp.stanford.edu/software/tagger.shtml
https://nlp.stanford.edu/software/pos-tagger-faq.html

POS Tagger training example
http://renien.com/blog/training-stanford-pos-tagger/

POS tagger model alternatives:

Two different English POS taggers ship with the CoreNLP distribution:

A bi-directional dependency network tagger in edu/stanford/nlp/models/pos-tagger/english-left3words/english-bidirectional-distsim.tagger.
-Its accuracy was 97.32% on Penn Treebank WSJ secs. 22-24.

A model using only left second-order sequence information, and similar but fewer unknown word and lexical features than the previous model, in edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
-This tagger runs a lot faster, and is recommended for general use.
 -Its accuracy was 96.92% on Penn Treebank WSJ secs. 22-24.

Other models:
For English, there are models trained on WSJ PTB, which are useful for the purposes of academic comparisons.

There are also models titled "english" which are trained on WSJ with additional training data, which are more useful for general purpose text.

What is the difference between "english" and "wsj"?

The models with "english" in the name are trained on additional text corresponding to the same data the "english" parser models are trained on, with the exception of instead using WSJ 0-18.

The main class for users to run, train, and test the part of speech tagger.
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html

lemma : Gets the base (uninflected) form of a word. (tried / trying / tries : try)

true case:
Recognizes the “true” case of tokens (how it would be capitalized in well-edited text) where this information was lost, e.g., all upper case text.

parse (Constituency Parsing):
Models used: englishPCFG.ser, englishSR.ser, englishFactored.ser

depparse (Dependency Parsing): Neural Network Dependency Parser
Model used: english_UD.gz

ner: (Named Entity Recognition)
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC)
and numerical (MONEY, NUMBER, DATE, TIME, DURATION, SET) entities.
custom named entity recognition for finding idioms / expressions
http://nlp.stanford.edu/software/regexner.html
http://nlp.stanford.edu/software/CRF-NER.shtml
http://nlp.stanford.edu/software/crf-faq.html

coref
The CorefAnnotator finds mentions of the same entity in a text, such as when “Theresa May” and “she” refer to the same person.
The annotator implements both pronominal and nominal coreference resolution.
https://stanfordnlp.github.io/CoreNLP/coref.html
// coref uses a lot of RAM
The coreference module operates over an entire document.

dcoref (requires : edu/stanford/nlp/models/dcoref/demonyms.txt)
Implements mention detection and both pronominal and nominal coreference resolution

natlog
Marks quantifier scope and token polarity, according to natural logic semantics
For example, for the sentence “all cats have tails”, the annotator would mark all as a quantifier with subject scope [1, 2) and object scope [2, 4).
In addition, it would mark cats as a downward-polarity token, and all other tokens as upwards polarity.

openie
Extracts open-domain relation triples, representing a subject, a relation, and the object of the relation.
This is useful for (1) relation extraction tasks where there is limited or no training data, and it is easy to extract the information required from such open domain triples; and, (2) when speed is essential.
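A sketch of reading OpenIE triples through the API, following the pattern of the CoreNLP OpenIE demo (the sentence is made up; requires the CoreNLP jars on the classpath):

```java
import java.util.Properties;
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class OpenIeDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("Obama was born in Hawaii.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      // each triple is (subject, relation, object)
      for (RelationTriple t :
          sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class)) {
        System.out.println(t.subjectGloss() + " | " + t.relationGloss() + " | " + t.objectGloss());
      }
    }
  }
}
```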

relation
Stanford relation extractor is a Java implementation to find relations between two entities.

udfeats
Labels tokens with their Universal Dependencies universal part of speech (UPOS) and features.

Regexner

Used to query the list of tokens that match rules defined using POS tags.

https://stanfordnlp.github.io/CoreNLP/regexner.html
how to add a custom NER file (search form / normalized format)
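For reference, a TokensRegexNER mapping file is tab-separated: surface pattern, NER type, then optionally the types it may overwrite and a priority. The entries below are made up:

```
Bayern Munich	SPORTS_TEAM	ORGANIZATION	1.0
red panda	ANIMAL		1.0
```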

Stanford NER CRF FAQ
https://nlp.stanford.edu/software/crf-faq.html

https://nlp.stanford.edu/software/tokensregex.shtml

https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html


Rules can be defined in different ways.
One is a "phrase"-like definition where each line corresponds to a single rule.
The other is an explicit specification in JSON format:

Defining rules in JSON format
https://github.com/stanfordnlp/CoreNLP/issues/200
import java.util.List;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.util.CoreMap;

List<CoreMap> sentences = ...;
CoreMapExpressionExtractor<MatchedExpression> extractor =
    CoreMapExpressionExtractor.createExtractorFromFiles(
        TokenSequencePattern.getNewEnv(), file1, file2,...);
for (CoreMap sentence : sentences) {
  // expressions matched by the rules loaded from the files
  List<MatchedExpression> matched = extractor.extractExpressions(sentence);
  ...
}
https://stackoverflow.com/questions/31966910/tokensregex-using-a-group-captured-inside-an-annotation-as-an-argument-to-the/31973495#31973495


Other links
https://flystarhe.github.io/2016/11/07/stanford-tokens-regex/
http://stackoverflow.com/questions/14689717/is-it-possible-to-get-a-set-of-a-specific-named-entity-tokens-that-comprise-a-ph?rq=1
https://stackoverflow.com/questions/43942476/load-custom-ner-model-stanford-corenlp

https://nlp.stanford.edu/nlp/javadoc/javanlp-3.6.0/edu/stanford/nlp/ie/regexp/RegexNERSequenceClassifier.html

https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/ling/tokensregex/matcher/TrieMapMatcher.java
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatcher.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/MultiPatternMatcher.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/RegexNERAnnotator.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html

http://www.massapi.com/class/co/CoreMap-4.html
http://book2s.com/java/api/edu/stanford/nlp/ling/tokensregex/tokensequencepattern/compile-1.html
https://github.com/stanfordnlp/CoreNLP/blob/master/itest/src/edu/stanford/nlp/ling/tokensregex/TokenSequenceMatcherITest.java

Regex examples
[{tag:/NN|NNS/}] [{tag:/IN.*/} & !{word:/about/}] [{tag:/NN|NNS/}] > ([A-Za-z]+/NN) ([A-Za-z]+/IN) ([A-Za-z]+/NN)
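The first pattern above could also be written as an explicit rules file in the SequenceMatchRules format; the CONCEPT tag and the ner binding name below are made up for illustration:

```
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{
  ruleType: "tokens",
  pattern: ( [ { tag:/NN|NNS/ } ] [ { tag:/IN.*/ } & !{ word:"about" } ] [ { tag:/NN|NNS/ } ] ),
  action: ( Annotate($0, ner, "CONCEPT") )
}
```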

CLI Usage (command-line interface)

-Java must be installed as 64-bit. With all annotators in use, memory usage can climb to around 4GB.

-To write the output to a file, a file name can be appended at the end.

-Using it over the service with wget
D:\Projects\nlp\wget-1.17.1-win32>wget --post-data "The quick brown fox jumped over the lazy dog. This is another sentence to take care of." "localhost:9000/?properties={'tokenize.whitespace': 'true', 'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref', 'outputFormat': 'json'}"

-To use the end of a line as a sentence delimiter: sentenceDelimiter newline

-A different minimum memory limit is required depending on the annotators used.
For POS only, it can be set with -Xmx??m, e.g. -Xmx500m.
Dependency parsing (depparse, the neural network model) requires at least -Xmx1g; parse needs less.
More memory may be needed depending on the size of the processed document.

-The models must also be in the directory the library is run from.

-The "*" parameter loads all jar files in the directory; alternatively, only the required jar files can be specified.

-To point to a properties file instead of passing parameters one by one:
-props sampleProps.properties
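A sampleProps.properties along those lines might look like this (the property names are standard CoreNLP keys; the values are illustrative):

```
annotators = tokenize,ssplit,pos,lemma,ner,depparse
outputFormat = json
ssplit.eolonly = true
pos.model = edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
```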

Other details:

CLI examples:





Usage on a web server


Since loading the models takes a long time, a pre-initialized "singleton" class instance should be used for batch jobs or for requests in a web environment.
A class that prepares a different object instance depending on the annotators used:

https://gist.github.com/mehmetilker/451fdfd427cd13b2a081d3e5fcc39c48

-It should be registered as a singleton in the IoC container: "services.AddSingleton();"

-Since CoreNLP uses "pooling" internally, annotators created earlier are reused across different StanfordCoreNLP instances. For example, the model for the POS tagging annotator is not loaded into memory again.
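A rough illustration of that reuse: two pipelines are built with overlapping annotator lists, and the shared annotators come from the pool instead of being loaded twice (property values are illustrative; requires the CoreNLP jars on the classpath):

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PoolingDemo {
  public static void main(String[] args) {
    Properties posOnly = new Properties();
    posOnly.setProperty("annotators", "tokenize,ssplit,pos");
    StanfordCoreNLP first = new StanfordCoreNLP(posOnly);

    Properties withNer = new Properties();
    withNer.setProperty("annotators", "tokenize,ssplit,pos,ner");
    // tokenize/ssplit/pos are reused from the annotator pool; only ner is newly loaded
    StanfordCoreNLP second = new StanfordCoreNLP(withNer);
  }
}
```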

"An object for keeping track of Annotators. Typical use is to allow multiple pipelines to share any Annotators in common.
For example, if multiple pipelines exist, and they both need a ParserAnnotator, it would be bad to load two such Annotators into memory.
Instead, an AnnotatorPool will only create one Annotator and allow both pipelines to share it."
Details:

-With the class in the example, it is possible to skip running NER (named entity recognition) for some requests.

To change the settings at runtime:
https://stackoverflow.com/questions/29408588/changing-corenlp-settings-at-runtime/29412556#29412556

Performance

Is your tagger slow?

Some people also use the Stanford Parser (englishPCFG) as just a POS tagger. It's a quite accurate POS tagger, and so this is okay if you don't care about speed.

But, if you do, it's not a good idea. Use the Stanford POS tagger
https://nlp.stanford.edu/software/pos-tagger-faq.shtml

https://nlp.stanford.edu/software/lex-parser.shtml

In applications, we nearly always use the english-left3words-distsim.tagger model, and we suggest you do too. It's nearly as accurate (96.97% accuracy vs. 97.32% on the standard WSJ22-24 test set)
and is an order of magnitude faster.


Measuring speed and consistency:
Parsing 100,000 sentences (my own tests):
(including writing to SQL)
englishPCFG.ser.gz > (parse) 75 min
englishSR.ser.gz >   (parse) 45 min
nndep/english_UD.gz (depparse) > 31 min
1,000 sentences, parse only:
englishPCFG.ser.gz > (parse) 40-50 s > 500MB-650MB RAM
2 cores (parse.nthreads=2) 30 s > 600MB-850MB RAM
englishSR.ser.gz >   (parse) 17-22 s
nndep/english_UD.gz (depparse) > 6-8 s

Comparison of different models
english_UD, englishRNN, englishFactored, englishPCFG etc...
https://stackoverflow.com/questions/36844102/stanford-parser-models

POS Tags (Universal)

Full list
http://universaldependencies.org/docs/u/pos/index.html

NOUN example:
Universal: http://universaldependencies.org/docs/u/pos/NOUN.html
English: http://universaldependencies.org/docs/en/pos/NOUN.html
verb: http://universaldependencies.org/docs/en/pos/VERB.html

Features - for types that need definitions beyond the basic POS tags
http://universaldependencies.org/docs/u/feat/index.html
The features listed here distinguish additional lexical and grammatical properties of words, not covered by the POS tags.

Dependency Tags (Universal)

Universal Dependency Tags: http://universaldependencies.org/treebanks/en/index.html

S Tree (phrase structure trees) parse tags: http://web.mit.edu/6.863/www/PennTreebankTags.html
Clause level, phrase level, word level

Typed dependency (grammatical relations)
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/EnglishGrammaticalRelations.html

Other Notes

To find the collocations defined in WordNet
example usage: https://github.com/lihait/CollocationFinder
http://www-nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/CollocationFinder.html

//edu.stanford.nlp.trees.CollocationFinder a = new edu.stanford.nlp.trees.CollocationFinder();


