100 NLP Papers

  Reposted from: http://goo.gl/rX8M7

  This page lists 100 important NLP papers which I read in 2011.

Corpus Design
  1. Ide, N., Suderman, K. The American National Corpus First Release. Proc. of LREC 2004, pp.1681-84.

    – an American-English ver. of British National Corpus (BNC)
    – American National Corpus (ANC) – 3M written tokens and 8M spoken (transcription) tokens, made available in 2003; getting feedback, will be extended to 10M words
    – Encoded in XCES (XML Corpus Encoding Standard), a “stand-off” annotation + merged ver.
    – Tagset: CLAWS C5, C7, used GATE (Penn Tags) and Multext tagger (EAGLES)
    – Developing a query program AQS (ANC Query System) in Java
    – Set up a web site for authors to submit their text

Morphological Analysis and POS Tagging
  2. Shinsuke Mori and Makoto Nagao. Word extraction from corpora and its part-of-speech estimation using distributional analysis. COLING 1996.

    – Unknown word discovery & POS induction by distributional analysis
    — a word may belong to more than one POS
    – left-or-right one-character “environment” of the type (string) in a large corpus
    – compare the environment distribution of a substring with that of each POS: if the string is a word of that POS, the two distributions should be close
    – closeness (the minimum distance over POSs) is the “word measure” (see the sketch below)
    – POS categories for unknown words: サ変動詞, 名詞, ラ行動詞, 形容詞, 形容動詞
    – Evaluated on precision / recall of extracted words
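
    A rough sketch of the idea above (toy corpus, my own code, not the paper's model): collect the one-character left/right environments of a candidate string and of a known word, and compare the two distributions; a small distance suggests the candidate behaves like a word of that POS.

      from collections import Counter

      corpus = "彼は学校へ行く。私は学校で学ぶ。学校の前で会う。"

      def environment(text, target):
          """Counts of the characters immediately left/right of `target`."""
          env = Counter()
          i = text.find(target)
          while i != -1:
              if i > 0:
                  env["L:" + text[i - 1]] += 1
              j = i + len(target)
              if j < len(text):
                  env["R:" + text[j]] += 1
              i = text.find(target, i + 1)
          return env

      def distance(e1, e2):
          """L1 distance between the normalized environment distributions."""
          n1, n2 = sum(e1.values()) or 1, sum(e2.values()) or 1
          return sum(abs(e1[k] / n1 - e2[k] / n2) for k in set(e1) | set(e2))

      # A candidate whose environment is close to that of words of some POS
      # gets a good "word measure"; here we just compare two strings directly.
      print(distance(environment(corpus, "学校"), environment(corpus, "前")))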

  3. Hwee Tou Ng and Jin Kiat Low. Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?, EMNLP 2004.

    – one-at-a-time (segmentation -> PoS tagging)
    – segmentation
    — features: character (+- 3), character types (+- 3), 2-gram, word, and punctuation -> average F = 95.1%
    — dynamic programming decoding (for testing), can consider word-based features
    – PoS tagging
    — average F = only 84.1% for the word-based model; the character-based one achieved better accuracy
    – all-at-once (segmentation and tagging at the same time)
    — + character features is the best (seg: F=95.2% PoS: 91.9%), but one-at-a-time + character is a worthwhile compromise
    – (trained on CTB ver. 3.0 + maxent)

  4. Graham Neubig, Yosuke Nakata, Shinsuke Mori. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis, ACL 2011.

    – Previous: structured models assume interaction between words and PoS tags, but “gains provided by structured prediction can be largely recovered by using a richer feature set.”
    – Proposed: two-level pointwise classification (boundary 0/1) using char. n-gram and char. type n-gram, as well as dictionary features
    – + Domain adaptation by Partial annotation
    – Experiment on BCCWJ, comparing MeCab and proposed (by logistic regression)
    – 2-LR outperforms JOINT (MeCab) and is almost the same as 2-CRF
    – sentence-based and word-based partial annotation (choose ones with small top1 and top2 probability difference)

  5. Dipanjan Das, Slav Petrov. Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections, ACL 2011.

    – To build a PoS tagger (on universal PoS tag set) for resource scarce languages (using parallel data and English supervised PoS tagger)
    – a bilingual graph of trigrams (in foreign language) and word types (in English), using similarity statistics and word alignment
    – Projection and label propagation
    – PoS tagger was trained on feature-based HMM (Berg-Kirkpatrick et al. 2010)
    – Experiment: used 8 Indo-European languages from Europarl and ODS UN dataset
    – Full model achieved a 10.4 point increase over the state of the art and 16.7% over a vanilla HMM

  6. Yue Zhang and Stephen Clark. Joint Word Segmentation and POS Tagging using a Single Perceptron, ACL 2008.

    – segmentation and PoS tagging are performed simultaneously, using a single generalized averaged perceptron
    – perceptron: build a feature vector for a whole sentence
    – standard beam-search decoder resulted in low segmentation accuracy
    – possible cause: whether or not to assign PoS tags to partial (sentence-final) words
    – keep an agenda for each character, recording the B best segmentation candidates
    – Experiment: Chinese Treebank, higher segmentation F-value and tag accuracy
    – NN-VV, VV-NN, NR-NN mistakes are frequent

  7. Yue Zhang and Stephen Clark. Chinese Segmentation with a Word-Based Perceptron Algorithm, ACL 2007.
    – segmentation of known words can also be ambiguous and context dependent
    – (averaged) perceptron discriminative learning algorithm, creating a global feature vector for a sentence
    – lazy update optimization: only update the dimensions of the total parameter vector corresponding to the features seen in the sentence
    – decoding based on beam search (using source agenda and target agenda)
    – experiment on CTB3, accuracy was competitive even when B=4

Parsing
  8. Michael John Collins. A New Statistical Parser Based on Bigram Lexical Dependencies, ACL 1996.

    Model
    – head-modifier relations
    — statistical model P(T|S) = P(B, D|S) (B: baseNPs — by chunking, D: dependencies, T: tree, S: tagged sentence)
    — convert CFG tree to dependency graph (by simple rules on re-writing rules)
    — approximate word-to-word dependency probabilities by bigram co-occurrence probabilities of word/PoS pairs, with a distance feature (direction, punctuation, verb, etc.)
    — smoothing by deleted interpolation
    Experiment
    — normal chart parser, then beam search parsing (very effective with almost no loss of accuracy), performing as well as SPATTER

  9. Michael Collins. Three Generative, Lexicalised Models for Statistical Parsing, ACL 1997.

    – Model
    — Generative model P(S, T)
    — Model 1; generative model of (Collins 96), generate a head from a parent, then generate right & left modifiers by 0th-order Markov processes, or STOP. (+ distance feature of Collins 96)
    — Model 2; integrate complement/adjunct distinction to constituent by rules. (cf. http://www.ucl.ac.uk/internet-grammar/phfunc/compare.htm) + subcategorization
    — Model 3; traces (slots by relative clauses), formalized in a similar way to GPSG by adding a “gap” feature to constituents
    – Implementation
    — back-off interpolation smoothing
    — Model 1 already achieved a significantly better performance than (Collins 96)
    – Comparison to (Eisner 96) — the reason why it’s better than completely lexicalized dependency (example of hope + VP(TO) and require + S(TO))

  10. Eugene Charniak. A Maximum-Entropy-Inspired Parser, NAACL 2000.

    – Model
    — generative model Π p(t | l, H) p(h | t, l, H) p(e | l, t, h, H), where l is the constituent label, t the head’s preterminal, h the lexical head, e the expansion, and H the relevant history
    — Markov grammar approach based on 2 generated words previously (second-order)
    – Result
    — 90.1% precision/recall for sents <= 40 words, 89.5% precision/recall for sents <= 100 words
    — 2% increase by guessing the lexical head's pre-terminal (PoS) before guessing the lexical head (PoS estimate can be a clue for various rule expansion)

  11. Eugene Charniak. Statistical Parsing with a Context-free Grammar and Word Statistics , AAAI 1997

    – Generative model “head” rule: P(s=head | h=head of parent constituent, t=type, l=parent constituent), approximated by deleted interpolation and clusters of head words.
    — + expansion rule: P(r=grammar rule | h, t, l)
    – Parsing
    — Parse with normal PCFG (pulled from tree-bank) and fill the charts with probable constituents, then run the full Viterbi parse using the full distribution
    – Result
    — Full model: 86.8% LP (label precision) and PCFG: 75.3%
    — Comparison with (Magerman 1995) and (Collins 1996), 18% error reduction over two systems
    — Advantage: can compute P(s, Π), no explicit part-of-speech tagging process, can process valency properly

  12. Dan Klein, Christopher D. Manning: A* Parsing: Fast Exact Viterbi Parse Selection, HLT-NAACL 2003.

    A* search: guaranteed to find the best parse (unlike beam search)
    edges = grammar items = constituents+span
    best-first parsing:
    agenda edges are processed according to a priority: figure-of-merit (FOM), based on the approximation of P(e|s)
    no guarantee that the first parse discovered is the Viterbi parse
    A* parsing:
    a: estimates of alpha (outside) prob, b: estimate of beta (inside) prob., use a+b as the edge priority (proof: Klein and Manning 2002)
    context: X (constituent) and the rest of the terminals
    -> precompute the costs for context summary (projection function of a context)
    compared different summaries: NULL, SX, XBAR, F, TRUE
    or, use grammar projection (instead of context) F (filter)
    Results
    – Even the simplest S blocks 40% of the edges (with only 1 min precomputation), 97% max (about 60 min precomputation)

  13. Adam Pauls and Dan Klein: K-Best A* Parsing, ACL 2009.

    parsing as a weighted deduction
    1-best A* search
    prioritized weighted deduction rule over deduction items I(A, i, j) -> collapsed way of possible inside derivations of (A, i, j)
    deduction schemes -> instantiated for every grammar rule
    the scheme ‘IN’ adds a heuristic h(A, i, j) to the weight, which must satisfy consistency and monotonicity
    k-best A* search
    inside derivation items D(Ta, i, j) and combine them by the scheme ‘DERIV’ -> NAIVE
    -> efficient k-best A* search, use ‘delayed’ DERIV with exact Viterbi outside scores -> KA*
    Experiment
    — Compared the number of edges pushed onto the agenda: NAIVE degrades quickly, while KA* scales roughly linearly with the looseness of the heuristic
    — Also tested on tree transducer grammars and lexicalized grammars

  14. Ryan McDonald, Fernando Pereira, Kiril Ribarov and Jan Hajic. Non-projective Dependency Parsing using Spanning Tree Algorithms, EMNLP 2005.

    – non-projective dependency, where arcs cross each other (frequent in German, Dutch, Czech)
    – dependency parsing can be formalized as maximum spanning tree search; an efficient O(n^2) algorithm exists
    – formalize edge score as a dot product of weights and a feature vector
    – Chu-Liu-Edmonds algorithm for (nonrestricted) MST search
    – greedily select the incoming edge for each vertex. If the result is not a tree, there is a cycle; contract it and recursively apply the algorithm to the contracted graph
    – learn the weight vector by single-best MIRA or factored MIRA (related to the averaged perceptron — MIRA aggressively updates w to maximize the margin between the correct tree and incorrect ones)
    – Factored MIRA achieves the best dependency accuracy of 84.4 for Czech (see the sketch below)
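
    A minimal sketch of the MST formulation above, using networkx's maximum spanning arborescence (a Chu-Liu-Edmonds-style algorithm); the edge scores are made-up numbers standing in for the learned dot product w . f(i, j).

      import networkx as nx

      words = ["ROOT", "John", "saw", "Mary"]
      # Hypothetical head -> modifier scores (in the paper: w . f(i, j)).
      scores = {(0, 2): 10.0, (2, 1): 9.0, (2, 3): 9.0,
                (0, 1): 3.0, (0, 3): 3.0, (1, 2): 2.0,
                (3, 2): 2.0, (1, 3): 1.0, (3, 1): 1.0}

      G = nx.DiGraph()
      for (head, mod), s in scores.items():
          G.add_edge(head, mod, weight=s)

      # Maximum spanning arborescence = highest-scoring dependency tree,
      # naturally allowing non-projective (crossing) arcs.
      tree = nx.maximum_spanning_arborescence(G)
      for head, mod in sorted(tree.edges()):
          print(words[head], "->", words[mod])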

  15. Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, Jun’ichi Tsujii. Incremental Joint POS Tagging and Dependency Parsing in Chinese, IJCNLP 2011

    – joint application of part-of-speech tagging and dependency parsing
    – some Chinese POS tags require long-range syntactic information to be disambiguated
    – shift-reduce parsing
    – process characters incrementally, choosing either shift, reduce-right, or reduce-left, based on scores trained by the perceptron algorithm
    – order states by prefix score and inside score
    – proposed
    – shift -> shift(PoS)
    – delayed features enables the delayed evaluation of look-ahead PoS features
    – pipeline approach achieves 94.01% tag accuracy, 79.83% dep. accuracy vs. 81.33% (proposed)
    – Dramatically reduces the number of NN/VV and DEC/DEG confusions

  16. Jennifer Foster et al. From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0, IJCNLP 2011.

    – parser adaptation to noisy text from social media (tweets); degradation is largely due to PoS tagging accuracy
    – Four parsers compared: Brown > Berkeley > MST Parser > Malt
    – constructed Web 2.0 treebank from forum data and twitter
    — the parser trained from BBS data worked well on twitter data, but not vice versa (OOV rate 16.6% for TwitterDev)
    — self-training (McClosky et al. 2006) and up-training (the training set of a less accurate parser is augmented with trees for unannotated corpus sentences produced by a more accurate parser), achieving a 5.7% increase over the baseline
    — experiments with pure self-training have had mixed results over the years.

Named Entity & Multiword Expressions
  17. Michael Collins and Yoram Singer. Unsupervised Models for Named Entity Classification, EMNLP 1999.

    – settings
    — NYT data, (spelling, context) pairs for NNP or NNPS + appositive / PP relations
    + features: string, allcap, nonalpha, context
    – method 1 (DL-CoTrain)
    — unsupervised on decision lists: formalize (Yarowsky 95)’s unsupervised WSD
    — start from seed rules, gradually grow spelling rules using unlabeled data and context rules
    – method 2 (CoBoost)
    — learn two classifiers in parallel to minimize the number of disagreements on unlabeled examples
    — define the training error function over the unlabeled instances as well
    – method 3 (compared; EM-based method)
    — feature sets are generated by (unobserved) labels
    – better accuracy on both clean and noisy data, but on par with the Yarowsky-cautious algorithm (a variation of Yarowsky’s in which rules are added slowly)

  18. Fan Bu, Xiaoyan Zhu and Ming Li: Measuring the Non-compositionality of Multiword Expressions, COLING 2010

    – MED (Multiword Expression Distance)
    — defined by the “information distance” between the context of the bag of words in the MWE and the context of the conjunction of words in the MWE
    — context is defined as the set of returned Web pages (using Yahoo!)
    — inversely proportional to MED under the independence assumption (but MED doesn’t assume independence)
    – Evaluated on QA Post-processing and NER

Language Models, Lexicon, and Distributional Similarity
  19. Ronald Rosenfeld. Two Decades of Statistical Language Modeling: Where Do We Go from Here?, Proc. of IEEE, 2000

    – “most successful SLMs (like n-grams) use little knowledge of language”
    – definition of average-log-likelihood, cross-entropy, perplexity
    — a 5% reduction in perplexity is not significant; 10%-20% is noteworthy; 30%+ is significant and rare (see the perplexity sketch below)
    – review of n-gram language models, CFGs, log-linear models, decision tree, dependency models, and dimensionality reduction
    – domain adaptation
    — by interpolation with different coefficients, or by clustering the training corpus
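
    A quick worked example of the definitions above (toy probabilities, not from the paper): cross-entropy is the average negative log2 probability assigned to the test words, and perplexity is 2 raised to that value.

      import math

      # Hypothetical per-word probabilities p(w_i | h_i) from some LM.
      probs = [0.2, 0.05, 0.1, 0.01, 0.3]

      cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)  # bits/word
      perplexity = 2 ** cross_entropy
      print(f"H = {cross_entropy:.2f} bits/word, PP = {perplexity:.1f}")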

  20. Andreas Stolcke. Entropy-based Pruning of Backoff Language Models, arXiv:cs/0006025v1.

    – reduce large word n-gram model:
    — backoff language model: for unseen n-grams, P(w | h) = α(h) · P(w | h′), where h′ is the history with its most distant word dropped
    — absolute frequency thresholding
    model
    — set a threshold theta on the relative entropy change
    — compute relative entropy (motivated by relative change in perplexity) by pruning explicit estimates P(w | h), leading to variable length N-gram model
    evaluation
    – when theta = 10^-8, 26% of its original size, less than 6% perplexity increase
    – comparison with (Seymore and Rosenfeld 1996): over 99% of the chosen N-grams were the same

  21. Marti A. Hearst, Automatic Acquisition of Hyponyms from Large Text Corpora, COLING 1992.

    lexico-syntactic pattern recognition for hyponymy to general texts
    – NP0 such as {NP1, NP2, …, (and|or)} NPn -> hyponym(NPi, NP0)
    – NP {, NP}* {,} or other NP, including, especially, …
    – used a constituent analyzer to recognize patterns
    purpose
    – dictionaries are limited, fixed -> lexicon augmentation (esp. WordNet)
    – noun phrase semantics, semantic relatedness (e.g., distributional similarity)
    pattern discovery
    – decide a relation (e.g., IS-A), and a pair (e.g., England-country), then find environments in corpus
    Experiment
    – Ran on encyclopedia text (Grolier’s American Academic Encyclopedia) and NYT
    – found context-dependent hypernyms such as hyponym(“Washington”, “nationalist”) and hyponym(“aircraft”, “target”), and too-general hypernyms (see the regex sketch below)
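
    A minimal regex sketch of the first pattern above over plain text (the paper used a constituent analyzer rather than regexes; the example sentences are my own):

      import re

      text = ("He works for agencies such as the CIA, the FBI, and the NSA. "
              "She plays instruments such as the violin or the cello.")

      # "NP0 such as NP1, NP2, ... (and|or) NPn" -> hyponym(NPi, NP0)
      pattern = re.compile(r"(\w+) such as ([\w ,]+?)(?:[.;]|$)")

      for m in pattern.finditer(text):
          hypernym = m.group(1)
          # Split the coordinated NP list into individual hyponym candidates.
          for np in re.split(r"\s*,\s*|\s*\b(?:and|or)\b\s*", m.group(2)):
              np = np.strip()
              if np.startswith("the "):
                  np = np[4:]
              if np:
                  print(f"hyponym({np!r}, {hypernym!r})")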

  22. Patrick Pantel and Dekang Lin. Discovering Word Senses from Text. , ACM SIGKDD, 2002.

    discover word senses from text
    – problem: lexicons miss major senses, and include too minor senses
    – distributional hypothesis -> mixes senses
    CBC (Clustering By Committee)
    – compute centroid by averaging feature vectors of words, weighted by discounted PMI, sim = cosine similarity
    – find committees: clusters of words with high similarity to centroid
    – after a word is assigned to a cluster, overlapping features are removed so that minor senses can be discovered
    Evaluation
    – 1GB newspaper from TREC, analyzed by MINIPAR + grammatical relationships
    – precision based on WordNet senses, determined by LinSim; recall calculated on the union of senses extracted by several clustering algorithms
    – Prec, Rec, F, all higher than existing methods

  23. Rion Snow, Daniel Jurafsky, Andrew Y. Ng. Semantic Taxonomy Induction from Heterogeneous Evidence, ACL/COLING 2006.

    – using multiple classifiers to optimize the entire structure of a taxonomy (e.g., coordinate term <-> hyponym)
    – Probabilistic framework for taxonomy induction
    — Focuses on hyponym relation and coordinate relation (the concept of (m,n)-cousin)
    — T* = arg max_T P(E | T), where T is taxonomy, E is observed evidence
    — When adding a relation Rij, add the implied relations at the same time, and consider multiplicative change of P(E|T)
    — extend the relation so that it holds only between the correct sense pair, WSD being an implicit side-effect
    – Trained hypernym classifier using dependency features and logistic regression classifier
    – Trained (m,n)-cousin classifier using clustering-based (distributional) similarity and softmax regression
    – Evaluated by fine-grained precision (whether the hyponym is really true) and hypernym sense disambiguation, plus coarse-grained evaluation (like NER); added 10,000 synsets with 84% precision

  24. Georgiana Dinu and Mirella Lapata. Measuring Distributional Similarity in Context. EMNLP 2010.

    – a probability distribution model P(z|t) over latent senses (not a vector model as before)
    – given a context feature c, P(z|t, c) \propto P(z|t) P(c|z) (see the toy sketch below)
    – Components: 1) input co-occurrence matrix 2) algorithm to induce latent structure
    — NMF (Nonnegative Matrix Factorization) – Factorize matrix into M, H (with reduced-dimensionality) and LDA (mapping word -> documents, using \theta as the sense distribution)
    – evaluation sets: 353 pairs of words with human judgements (Finkelstein et al. 2002) and the SemEval 2007 Lexical Substitution Task; measure: correlation
    — baseline: SVD and [Lin 98]
    – Contextualized NMF and LDA were significantly better than other models
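
    A toy illustration of the contextualization rule above (made-up distributions, not the paper's learned model): the latent-sense distribution of a target t is reweighted by how well each sense predicts the observed context feature c.

      import numpy as np

      p_z_given_t = np.array([0.5, 0.3, 0.2])     # P(z | t) over K = 3 senses
      p_c_given_z = np.array([0.01, 0.20, 0.05])  # P(c | z) for the observed c

      unnorm = p_z_given_t * p_c_given_z          # P(z | t, c) ∝ P(z | t) P(c | z)
      p_z_given_tc = unnorm / unnorm.sum()
      print(p_z_given_tc)                         # the second sense now dominates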

  25. Jun’ichi Kazama, Stijn De Saeger, Kow Kuroda, Masaki Murata, and Kentaro Torisawa. A Bayesian Method for Robust Estimation of Distributional Similarities, ACL 2010.

    generalized form of semantic similarity: sim(w1, w2) = g( v(w1), v(w2) )
    – focus on how to robustly estimate v(w1), solve the sparseness issue by Bayesian framework
    – Bayesian Bhattacharyya coefficient (BC): Dirichlet prior, multinomial distribution, compared by the Bhattacharyya coefficient; the analytical solution can be computed efficiently
    — p(φ|D) = p(D|φ) p(φ) / p(D); with a Dirichlet prior, the posterior p(φ|D) is also Dirichlet
    – evaluated in similar word (siblings from EDR V3.0 dictionary and closed sets) retrieval task, MAP
    — dependency parse from (Shinzato, et al. 2008), context = verb+particle
    — BC with absolute discounting performed the best (BCb > BC > JC); JC tends to output high-frequency words (see the sketch below)
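
    A simplified sketch of the idea above (a plug-in approximation with a symmetric Dirichlet prior, not the paper's analytical Bayesian solution):

      import numpy as np

      def posterior_mean(counts, alpha=0.5):
          """Mean of the Dirichlet posterior Dir(alpha + counts)."""
          counts = np.asarray(counts, dtype=float)
          return (counts + alpha) / (counts.sum() + alpha * len(counts))

      def bhattacharyya(p, q):
          """BC(p, q) = sum_i sqrt(p_i q_i), in [0, 1]; higher = more similar."""
          return float(np.sum(np.sqrt(p * q)))

      # Hypothetical context-feature counts for two words over 5 context types;
      # smoothing fights the sparseness of raw relative frequencies.
      v1 = posterior_mean([10, 0, 3, 0, 1])
      v2 = posterior_mean([8, 1, 2, 0, 0])
      print(f"sim = {bhattacharyya(v1, v2):.3f}")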

Machine Learning
  26. Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103-134, 2000

    – Assuming data are generated from a Gaussian mixture, unlabeled data alone can recover all the model parameters, but not the classes
    – Train a naive Bayes classifier to label unlabeled documents, then re-train the classifier from the whole data, repeat this like EM
    – Extend naive Bayes to allow multiple mixture components per class (= a mixture of unigrams)
    – Extension
    — introduced a weighting factor for unlabeled data which is dynamically adjusted, and used multiple mixture components per class
    — used Laplace smoothing (see the EM-loop sketch below)
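
    A minimal sketch of the EM loop described above using scikit-learn's MultinomialNB (my own toy data and loop, not the paper's code): train on labeled docs, softly label the unlabeled ones via predict_proba, retrain with those soft labels as sample weights, and repeat.

      import numpy as np
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB

      labeled = ["cheap flights to paris", "hotel booking discount",
                 "goal scored in the final", "league match tonight"]
      y = np.array([0, 0, 1, 1])                  # 0 = travel, 1 = sports
      unlabeled = ["discount paris hotel", "final league goal",
                   "flights and hotel deals", "match final score"]

      vec = CountVectorizer()
      X_all = vec.fit_transform(labeled + unlabeled)
      X_l, X_u = X_all[:len(labeled)], vec.transform(unlabeled)

      clf = MultinomialNB().fit(X_l, y)
      for _ in range(10):                         # EM iterations
          resp = clf.predict_proba(X_u)           # E-step: soft labels
          # M-step: retrain on labeled + unlabeled, weighting each unlabeled
          # doc by its class responsibility (one weighted copy per class).
          X = np.vstack([X_l.toarray()] + [X_u.toarray()] * 2)
          y_all = np.concatenate([y, np.zeros(len(unlabeled)),
                                  np.ones(len(unlabeled))])
          w = np.concatenate([np.ones(len(y)), resp[:, 0], resp[:, 1]])
          clf = MultinomialNB().fit(X, y_all, sample_weight=w)
      print(clf.predict(vec.transform(["paris match"])))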

  27. Charles Sutton, Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning

    – Introduction of graphical models, (sequential models, general graph models), and discriminative & generative models
    — Naive Bayes vs logistic regression, HMM vs CRF
    – HMM to linear-chain CRF
    — compact representation by feature functions (ones corresponding to transitions, and ones corresponding to states/emissions)
    — Parameter estimation by maximizing the penalized conditional log likelihood (via L-BFGS, limited-memory BFGS)
    — Inference by forward-backward probabilities and dynamic programming (see the sketch below)
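
    A minimal sketch of linear-chain inference as described above (toy potentials, hypothetical numbers): the forward recursion sums over label sequences by dynamic programming to obtain the partition function.

      import numpy as np

      # Toy linear-chain potentials: trans[i, j] = exp(score of label i -> j),
      # emit[t, j] = exp(score of label j at position t). Hypothetical numbers.
      T, K = 4, 2
      rng = np.random.default_rng(0)
      trans = rng.uniform(0.5, 1.5, size=(K, K))  # transition potentials
      emit = rng.uniform(0.5, 1.5, size=(T, K))   # state (emission) potentials

      # Forward: alpha[t, j] = emit[t, j] * sum_i alpha[t-1, i] * trans[i, j]
      alpha = np.zeros((T, K))
      alpha[0] = emit[0]
      for t in range(1, T):
          alpha[t] = emit[t] * (alpha[t - 1] @ trans)

      Z = alpha[-1].sum()                         # partition function
      print(f"Z = {Z:.4f}")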

  28. Kevin Knight. Bayesian Inference with Tears

    – a tutorial on (non-parametric) Bayesian methods and (Gibbs) sampling methods with little math
    – Bayesian techniques beat plain EM — another turning point, according to the author
    – tree substitution grammar
    — grow a syntax tree by rewrites (substitutions), where plain EM fails because 1) num. of parameters and 2) overfitting (remembering every sentence as a rule)
    -> re-use is the key!: generative story: P(rule | root(rule) + count in cache (rules used so far))
    — choose the base distribution with beta, and weight by the length of history H (the longer, the more reliable) = Chinese Restaurant Process or Stick-Breaking Process
    — add-one smoothing is a special case of this Bayesian generative story
    – Chinese segmentation — history: the words used so far
    – English PoS tagging — history: tag bigram and tag->word emission counts
    – Gibbs sampling
    — start with an initial sample (e.g., random tagging), make a small change by a coin flip, and collect the counts
    — use exchangeability of past, present, and future events, and efficiently compute neighboring samples
    — burn-in: don’t collect the first N samples
    – Pitman-Yor process
    — each time we consult the base distribution, we raise the chance that we will consult it again in the future: keeping a long tail long (see the CRP sketch below)
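
    A minimal sketch of the Chinese Restaurant Process mentioned above (my own toy code): each new customer joins an existing table with probability proportional to its count, or opens a new table with probability proportional to the concentration parameter.

      import random

      def crp(n_customers, alpha=1.0, seed=0):
          """Sample a table assignment for each customer under CRP(alpha)."""
          rng = random.Random(seed)
          tables = []                       # tables[k] = customers at table k
          for i in range(n_customers):
              # P(existing table k) = tables[k] / (i + alpha)
              # P(new table)        = alpha     / (i + alpha)
              r = rng.uniform(0, i + alpha)
              acc = 0.0
              for k, c in enumerate(tables):
                  acc += c
                  if r < acc:
                      tables[k] += 1
                      break
              else:
                  tables.append(1)          # open a new table
          return tables

      print(crp(100))   # a few big tables and a long tail of small ones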

  29. Trond Grenager, Dan Klein, Christopher D. Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction, ACL 2005.

    Task
    – field segmentation / structured extraction from classified ads
    Experiment
    – bibliographic citations (author, title) and classified ad (size, rent, etc)
    Model
    – HMM, trained by normal EM Forward-Backward algorithms
    – baseline (chunk + cluster, supervised)
    – unsupervised (a non-constrained HMM performs very poorly; a diagonal transition HMM with self-loop probabilities)
    — hierarchical mixture emission model (by linearly mixing common word distribution)
    — boundary model by splitting states to non-final and final
    — semi-supervised learning by simply adding observed counts to the E-step

  30. Aria Haghighi and Dan Klein. Prototype-Driven Learning for Sequence Models. HLT-NAACL, pp. 320–327, 2006

    – provide “prototypes” – a few canonical examples of each target (classes) at first
    — In supervised learning, we have to carefully design the model, and this does not necessarily minimize total effort.
    — Grenager et al. (2005) information extraction from classifieds, modified unsupervised HMM
    – propagate prototypes by distributional similarity (DS; L and R 2 words, with position, + SVD) and include them as features in a log-linear generative model (an MRF P(x, y|\theta); cf. CRF). The parameters are tuned by L-BFGS

    – tested on part of speech tagging (in English and Chinese) and information extraction
    — WSJ, features: word type, suffix (up to 3 chars.), initial-capital, contains-hyphen, and contains-digit
    — DS features substantially reduced error rates (38.4% reduction in English PoS)
    — To achieve “stickiness” in IE, they removed direction and position info from the DS context (like LSI)

  31. Percy Liang and Dan Klein. Online EM for Unsupervised Models. NAACL, 2009

    – online EM characteristics: (1) converge to different local optima by online (2) EM likelihood is only loosely related to accuracy metrics
    – incremental EM
    — keep track of each s_i (responsibility × sufficient statistics), subtract the old one from the total statistics \mu, and add the new one
    — memory issue (have to store s_i)
    – stepwise EM (sEM)
    — stepsize \eta_k (k is the number of \mu updates so far), with stepwidth m
    — params m and \alpha greatly impact the accuracy
    – tasks: part-of-speech tagging, document classification, word segmentation, and word alignment
    — word segmentation: penalized unigram model, with a length penalty term so that it won’t converge to the trivial sentence = word solution (best published result)
    – sometimes (esp. PoS and classification) accuracy was better than batch EM (see the update sketch below)
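
    A minimal sketch of the stepwise update described above (toy sufficient statistics, hypothetical values): after each mini-batch's E-step, the running statistics mu are interpolated toward the fresh statistics s with stepsize eta_k = (k + 2)^(-alpha).

      import numpy as np

      def stepwise_update(mu, s, k, alpha=0.7):
          """One sEM update: mu <- (1 - eta_k) * mu + eta_k * s."""
          eta = (k + 2) ** (-alpha)
          return (1 - eta) * mu + eta * s

      # Hypothetical expected-count vectors from successive mini-batch E-steps.
      mu = np.array([1.0, 1.0, 1.0])
      batches = [np.array([4.0, 1.0, 0.5]), np.array([3.5, 1.5, 0.8]),
                 np.array([4.2, 0.9, 0.6])]
      for k, s in enumerate(batches):
          mu = stepwise_update(mu, s, k)
          print(k, mu.round(3))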

Machine Translation
  32. Kevin Knight and Philipp Koehn: What’s New in Statistical Machine Translation

    – Statistical model overview, IBM Model 4
    – Word alignment via the EM algorithm, improved word alignment
    – Phrases induction from word alignment
    – Syntax-based models (very brief introduction)
    – Language model
    – Decoding algorithm: beam search
    – Numbers, dates, names requiring special treatment, noun phrase reranking

  33. Antti-Veikko I. Rosti, Necip Fazil Ayan, Bing Xiang, Spyros Matsoukas, Richard Schwartz, Bonnie J. Dorr. Combining Outputs from Multiple Machine Translation Systems, NAACL HLT 2007.

    sentence-level combination
    – reranking of merged N-best lists, based on a generalized linear model using system- and hypothesis- dependent features, combined with LM scores
    phrase-level combination
    – run a phrasal decoder again with a new translation table (obtained from target-to-source phrase alignments; the more hypotheses “agree,” the higher the similarity)
    word-level combination
    – consensus network decoding (Mangu et al., 2000), which aligns the word with alternatives and creates a graph, but machine translation has re-ordering
    – minimum Bayes risk (MBR) decoding based on TER (Sim et al. 2007)
    experiment
    – TER (translation edit rate of ins, del, sub, shft to the reference) and BLEU
    – used Powell’s method to find optimal model parameters
    – sentence- and phrase-level combination yielded 0.5%-1.0% gain in TER and BLEU
    – The word-level combination yields about 2.0%-3.0% gain in TER and 2.0%-4.0% gain in BLEU

  34. Philipp Koehn: Statistical Machine Translation: the basic, the novel, and the speculative

    – Introduction of corpora, MT models (word-, phrase-, syntax-based), evaluation
    – Decoding examples, hypothesis operations, “pharaoh” examples
    – IBM Models, word alignment via the EM algorithm, phrase induction from word alignment
    – Discriminative training
    – Syntax-based models: string to tree (Yamada and Knight), Hierarchical phrase-model (Chiang 2005)

  35. David Chiang. A Hierarchical Phrase-Based Model for Statistical Machine Translation, ACL 2005.

    Model
    – hierarchical phrases (phrases that contain subphrases)
    – modeled as synchronous context-free grammar, using only one nonterminal X (as a phrase placeholder)
    – general log-linear model, with Pharaoh’s default feature set + word penalty + language model
    Training
    – Running GIZA++-based word alignment, phrase alignment heuristics, then generating re-writing rules recursively
    – Decoding by CYK algorithm + beam search (implemented in Python!)
    – 7.5% BLEU increase over Pharaoh (Mandarin-to-English, FBIS corpus, 7.2M+9.2M words)
    – limiting X to (real) constituents didn’t improve the result

  36. Masao Utiyama, Daisuke Kawahara, Keiji Yasuda and Eiichiro Sumita. Mining Parallel Texts from Mixed-Language Web Pages, MT Summit XII, 2009.

    – Mine mixed-language web pages (ones which contain parallel sentences within single pages), extract and align parallel sentences from them, and test the result on a real SMT system

  37. Yanhui Feng, Yu Hong, Zhenxiang Yan, Jianmin Yao, Qiaoming Zhu. A Novel Method for Bilingual Web Page Acquisition from Search Engine Web Records, COLING-2010.

    – Extract bilingual Web page records (snippet) by HTML tag path clustering
    – Classify whether records are bilingual Web pages or not by SVM using three features (word overlap, length, and keyword frequencies)

Transliteration
  38. Kevin Knight and Jonathan Graehl. Machine Transliteration. Computational Linguistics, 24(4), pp. 599-612, 1998.

    – Introduced the concept of “back-transliteration” from Japanese Katakana to English
    – Generative, modular (5 sub modules) learning approach — English word language model, pronunciation model, E2J pronunciation conversion model etc., by weighted finite-state transducers (WFSTs) and weighted finite-state acceptors (WFSAs)
    – Experiment: back-transliterate 100 US politician names into English, both by humans and by machine. The machine was 64% correct, while humans were 27%

  39. Kevin Knight and Richard Sproat. Writing Systems, Transliteration and Decipherment (Tutorial), NAACL 2009 Tutorial

    – Overview of writing systems, including Japanese Kana, Chinese Hanzi, Devanagari, and many others
    – Introduction of machine transliteration (largely the same as Knight 97 paper) and new approaches
    – Decipherment — examples of Old Persian, letter substitution ciphers, language ID, and many others

  40. Eric Brill and Robert C. Moore. An improved error model for noisy channel spelling correction. ACL 2000, pp. 286-293

    – Improve the channel model of the noisy channel model for spelling correction
    – Models the probability of generic string-to-string edits: P(alpha -> beta) (an extension of Damerau-Levenshtein edit distance)
    — Trained from a training set {s_i, w_i} of spelling errors s_i paired with correct spellings w_i
    – Experimented on 10,000 misspelled English words paired with correct spellings
    — Position info. gains 13%, full model = 52% relative error reduction
    — A better language model leads to better results, but the error model is still important (see the noisy-channel sketch below)
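
    A minimal noisy-channel sketch in the spirit of the setup above (uniform-cost channel and a toy unigram prior, nothing like the paper's learned string-to-string edit model): rank dictionary words w by P(w) * P(s | w).

      import math

      # Toy language model P(w); the channel P(s | w) below simply decays
      # with edit distance, whereas the paper learns P(alpha -> beta) edits.
      LEXICON = {"the": 0.05, "they": 0.01, "then": 0.008, "than": 0.006}

      def edit_distance(a, b):
          d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
               for i in range(len(a) + 1)]
          for i in range(1, len(a) + 1):
              for j in range(1, len(b) + 1):
                  d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
          return d[-1][-1]

      def correct(s):
          # argmax_w log P(w) + log P(s | w), with P(s | w) ~ 0.1 ** edits
          return max(LEXICON, key=lambda w: math.log(LEXICON[w])
                     + edit_distance(s, w) * math.log(0.1))

      print(correct("thay"))   # -> "they" (one substitution, decent prior)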

  41. Haizhou Li, Min Zhang, Jian Su. A Joint Source-Channel Model for Machine Transliteration. ACL 2004.

    – Comparison of phonetic-based noisy channel model (NCM) and direct orthographic mapping (DOM)
    — DOM has larger number of parameters, which leads to finer-grained representation of TM
    – Joint source-channel model
    — Generative model of n-gram of C-E chunk pairs
    — Transliteration is Viterbi decoding over possible alignments of chunks (alignment is integrated into the decoding process)
    — EM-based maximum likelihood training
    – Experiments
    — E2C transliteration: more than 2~5 fold error reduction compared to NCM
    — Better than ID3 decision tree approach

  42. Haizhou Li, A Kumaran, Min Zhang and Vladimir Pervouchine. Report of NEWS 2010 Transliteration Generation Shared Task, NEWS 2010.

    — English to 9 languages, 3 languages to English (back-transliteration), started as NEWS 2009
    — three specific requirements for the “transliteration task”
    — training data: 5K~32K, test data: 2K~3K, with a reference set, most diverse and largest corpora
    — 7 teams from 5 countries; popular task: English-to-Hindi; phrase-based, CRF, etc.

  43. Shane Bergsma and Grzegorz Kondrak. Alignment-Based Discriminative String Similarity, ACL 2007.

    – cognate (words with a common form and meaning across languages; transliterations and translations) identification
    – features
    — substrings from minimum-edit-distance aligned strings (like: Brill and Moore 2000)
    – experiment
    — discriminate E+ and E- among LCSR>=0.58 pairs (verified from histogram), used “false friends” as negative examples
    — filtered candidate by aligned bitext and bilingual dictionary
    — better than other similarity metrics (LCSR, Dice, etc.) (in 11-point AP)

  44. Dmitry Zelenko and Chinatsu Aone. Discriminative Methods for Transliteration, EMNLP 2006.

    do not require alignment, do not generate conditional or generative model
    learn linear classifiers predicting a letter from a history e[1,i-1], given a transliteration dictionary D = {(f, e)}
    – binary classifier given (e[1, i-1], f), learned by voted perceptron, decoded by beam search
    — features (local): char. ngrams, length diff, LM, start+end + features (global): char. ngram, length diff, etc.
    – Experiment
    — hand-crafted & web-mined transliteration pairs in Arabic, Korean, and Russian
    — compared with joint probabilistic models (converted Korean to Jamo romanization based on Unicode Standard)
    — consistently outperformed probabilistic-based models, different performance with different d (rel. dist.)
    — d=3 is best for Korean but d=1 for Arabic, because Jamo romanization is lengthy

  45. Alexandre Klementiev and Dan Roth. Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora, ACL 2006.

    – method (an extension of the authors’ earlier work, Klementiev and Roth 2006)
    — temporally aligned bilingual news articles + NE annotation (in source lang.)
    — combine phonetic similarity + temporal similarity (re-ranking by Discrete Fourier Transform; DFT and F-index) — better than other traditional sim. metrics.
    — iteratively refine the (phonetic) transliteration model
    – transliteration model by char-to-char alignment with relative positions -1 0 +1.
    — learned by perceptron, negative example drawn from random Russian words, initialized by 20 pairs
    – experiment
    — English-Russian news web site
    — accuracy goes up until ~20 iterations
    — DFT-based similarity being robust especially for “loosely” aligned corpus

Web Application
  46. Cory Barr, Rosie Jones, Moira Regelson. The Linguistic Structure of English Web-Search Queries. EMNLP-08.

    – Defined and PoS-tagged search queries, found that 71% are nouns or proper nouns (on 2,508 unique queries)
    – Non-standard capitalization accounts for 33% of queries.
    – Trained the Brill tagger and the Stanford tagger; performance was enhanced by considering (automatic/oracle) capitalization
    – Adding 7385 known trademarked terms as proper nouns increased the performance to 80.2%
    – Classified grammatical structure into 7 classes (like CFG rules), of which noun phrases make up 69.8%

  47. Ziv Bar-Yossef, Naama Kraus: Context-Sensitive Query Auto-Completion, WWW 2011.

    – different from query recommendation (user input: a full query vs. a prefix of a query)

    – basic principle: depend on the query log, show the MostPopularCompletion (am -> amazon, american express)
    — construct a query database from the query log and/or document collection, and match by prefix, infix, fuzzy matching, etc.
    — MRR is only 0.187
    – context sensitive query auto completion
    — sparsity of query history is a problem
    — focus on syntactic / semantic similarity
    – proposed: NearestCompletion
    — recursively apply a (black-box) query recommendation to form a recommendation tree
    — standard BoW + cosine similarity, so it can be implemented efficiently
    — when the context is relevant: MRR 48% higher, but fails when the context is irrelevant
    – evaluation: “hit” rate = same as what the user is looking for (see the MostPopularCompletion sketch below)
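
    A minimal sketch of the MostPopularCompletion baseline above (toy query log, my own code): index past queries by count and return the most frequent ones that extend the typed prefix.

      from collections import Counter

      # Hypothetical query log.
      log = ["amazon", "amazon prime", "american express", "amazon", "ebay",
             "amazon prime", "american airlines", "amazon"]
      counts = Counter(log)

      def most_popular_completion(prefix, k=3):
          cands = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
          return [q for q, _ in sorted(cands, key=lambda x: -x[1])[:k]]

      print(most_popular_completion("am"))
      # -> ['amazon', 'amazon prime', 'american express']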

  48. Samuel Brody and Nicholas Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs, EMNLP 2011

    tweets are short and informal, and differ greatly from text corpora
    word lengthening
    – emphasis, subjectivity and sentiment
    detection
    – collect variations which differ only in the number of times each letter is repeated, and choose a canonical (most common) form; 108k out of 6.5 million words are lengthened (see the sketch below)
    relation to sentiment
    – the more variations a word has, the more likely the word is subjective (compared to the sentiment lexicon (Wilson et al., 2005), which suffers from limited coverage of the target domain)
    adopting the sentiment lexicon
    – words which are lengthened can provide us with good candidates for inclusion in the lexicon
    – graph approach (sentiment is propagated, weighted by cosine similarity of context)
    – “Web” algorithm is more precise than “Review”
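
    A minimal sketch of the detection step above (toy word list, my own regex): collapse every run of a repeated letter so that all lengthened variants of a word share one key, then group the variants.

      import re
      from collections import defaultdict

      words = ["cool", "coooool", "cooooooolllll", "nice", "niiiice",
               "way", "waaaay"]

      def key(word):
          # Collapse each run of a repeated letter to one letter, so all
          # lengthened variants share the same key ("coooool" -> "col").
          return re.sub(r"(\w)\1+", r"\1", word)

      groups = defaultdict(set)
      for w in words:
          groups[key(w)].add(w)

      for k, variants in groups.items():
          if len(variants) > 1:
              # The paper picks the most frequent variant as canonical.
              print(k, "->", sorted(variants))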

Textual Entailment, Inference, SRL
  49. Milen Kouylekov and Bernardo Magnini: Recognizing Textual Entailment with Tree Edit Distance Algorithms, In Proc. of the PASCAL RTE Challenge, 2005.

    – Tree Edit Distance
    — Entailment: a sequence of edit operations (INS/DEL/SUB) on the dependency tree from T to H, where each operation has a cost
    — Based on (Zhang and Shasha 1990), but consider labels
    – Dep. analysis by Minipar, cost based on idf of node word, SUB cost = defined by similarity
    – Drawbacks: tree structure difference (Wal-Mart vs Wal-Mart inc.), paraphrase

  50. Bill MacCartney and Christopher D. Manning: Natural Logic for Textual Inference, Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing (RTE ’07), 2007.

    – natural logic – direct logical inference on natural language (English)
    — a middle way between precise but brittle FOL (first-order logic) and shallow but robust semantic overlap
    — inferences involving monotonicity (classify linguistic expressions into upward/downward monotone ones)
    — define the entailment relation over words, phrases, and sentences
    – NatLog system
    — tokenization, PoS tagging, and parsing, with Tregex patterns used to specify non-monotone and downward-monotone expressions
    — Align P and H into atomic edits (ADV/SUB/INS/DEL …) and build a decision tree to classify them into elementary entailment relations
    – Evaluation on the FraCaS test suite (346 problems, precision 60%); the system tends to answer “unk” to “yes” questions
    – Boosts performance on the PASCAL RTE Challenge when combined with the Stanford RTE system.

    Bill MacCartney and Christopher D. Manning. Modeling Semantic Containment and Exclusion in Natural Language Inference, COLING 2008.

    – natural logic
    — NLI: a middle way between shallow semantic representations and precise but brittle first-order logic
    — Develop a monotonicity calculus covering exclusion and unify it with implicatives
    — define seven entailment relations, up/down-ward monotonicity, and inference based on sequences of atomic edits
    – Implicatives
    — [Nairn et al.]: combinations of positive/negative/null in both positive and negative contexts; define which relation is generated when an implicative is deleted
    – The rest of the experimental framework is almost the same as MacCartney’s previous paper; achieved a 27% reduction in error.

  51. Matthew Gerber and Joyce Y. Chai. Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, ACL 2010.

    – extra-sentential, implicit arguments of nominal predicates are ignored by NomBank, but are pervasive and important for semantic role labeling
    – annotation
    — NomBank vs PropBank * freq(nominal) -> prevalence. 65% of role coverage increase after human annotation
    – implicit argument identification by classifier:
    — selected features: concatenation of VerbNet class/role and the mention’s class, PMI between mention+(implicit) argument and mention+(explicit) argument, etc.
    – Evaluation on 816 annotated predicate instances, with 650 implicit arguments
    — the proposed method increased F1 by 15.8 points (baseline = use the closest explicit arguments of the mention’s verbal form)

  52. Ryu Iida, Mamoru Komachi, Kentaro Inui and Yuji Matsumoto. Annotating a Japanese Text Corpus with Predicate-Argument and Coreference Relations, (Linguistic Annotation Workshop V)

    – co-reference – refer to the same entity in the world – identity-of-reference anaphora (IRA) <-> identity-of-sense anaphora (ISA)
    – existing: PropBank (only intra-sentential zero-anaphora), Kyoto Text Corpus v4.0 and GDA Tagged Corpus (no distinction bet. IRA and ISA)
    – proposed: NAIST Text Corpus (only IRA)
    — 38,384 sents, 106K predicates, 28K event nouns, classified into (a)-(e) depending on where they co-occur; 60% of “ga” cases were zero-anaphoric
    — predicate-argument relations as grammatical cases (Nominative -wa-, Accusative -o- and Dative -ni-)
    — zero-pronouns (exophoric use), can be either IRA or ISA.
    – event-noun (like: “growth”, “eikyo (influence)”)
    — only consider verbal nouns (sahen; e.g. denwa) and deverbal nouns (e.g., furumai)
    – issues
    — ambiguity of functional expression e.g., “toshite”
    — distinction between event-noun or a result expression
    — …

Others
  53. Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka and Kenji Araki: “CAO: A Fully Automatic Emoticon Analysis System”, Proc. of AAAI-10, pp. 1026-1032, 2010.

    – CAO – a system to detect, extract, and classify emoticons (by emotion)
    – based on emoticon dictionaries collected from the Web
    – tried “emoticon parsing” into eyes, mouth, boundaries, and some other parts
    – employs a 10-emotion classification of emoticons

  54. Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows: An Empirical Model of Multiword Expression Decomposability, Proc. of Workshop on Multiword Expressions, pp. 89-96, 2003.

    – three levels of MWEs – non-decomposable, idiosyncratically decomposable, and simple decomposable
    – past research classifies whether MWEs are endocentric/exocentric (whether the expressions are hyponyms of their head words)
    – latent semantic analysis was used for computing similarities between MWEs and (head) words, built on the within-sentence co-occurrence with content-bearing words
    – computed the correlation between LSA-based similarity and decomposability

  55. William D. Lewis. Haitian Creole: How to Build and Ship an MT Engine from Scratch in 4 days, 17 hours, & 30 minutes. EAMT 2010

    This paper is about developing a Haitian Creole–English MT system in a very short time span, in response to the 2010 Haiti earthquake. It was of the utmost importance to translate the flood of SMS relief information into English. The lack of bilingual corpora, language experts, and consistency in the orthography made the task extremely difficult. This is a very intriguing paper from both linguistic and practical perspectives.

  56. L. Venkata Subramaniam, Shourya Roy, Tanveer A. Faruquie, Sumit Negi. A Survey of Types of Text Noise and Techniques to Handle Noisy Text, AND 2009.

    – noise = “any kind of difference in the surface form of an electronic text from the intended, correct or original text”.
    – Noise is introduced by conversion processes or when text is produced in digital form
    – noisy text
    — spelling errors, deletion (message -> msg), substitution (to -> 2), abbreviation (lol),
    — OCR, speech recognition, machine translation output
    – Information retrieval
    — challenges: higher error rate, many OOVs, shorter queries
    – effect of noise on text classification (robust to noise), summarization (vulnerable), and information extraction

  57. Rolf Schwitter: Controlled Natural Languages for Knowledge Representation Proc. of COLING 2010.

    – survey paper on controlled natural languages (CNLs), subsets of English that look like natural language but are precisely defined
    – human-oriented CNLs (for simplification, specific purpose) and machine-oriented CNLs (for translation, Semantic Web)
    — Attempto Controlled English (ACE), supported by various tools http://attempto.ifi.uzh.ch/site/tools/
    — Processable English (PENG), supported by a lookahead editor (http://web.science.mq.edu.au/~rolfs/papers/CLAW03-ECOLE.pdf), with reasoning over TPTP (http://www.cs.miami.edu/~tptp/) problems
    — Computer Processable Language (CPL), whose interpreter uses heuristics for PP-attachment, WSD, etc.; CPL-Lite is a slimmed-down version
    — CLCE, Formalized-English, E2V
    – CNLs for the Semantic Web
    — ACE, SOS, Rabbit (compared in Schwitter et al. 2008)
    — Lookahead editor like ECOLE is the best writing support (Kuhn 2010)
