Papers I Read: Notes and Summaries

A Fast and Accurate Dependency Parser using Neural Networks

Introduction

  • The paper proposes a neural network classifier to perform transition-based dependency parsing, using dense vector representations of the features.
  • Earlier approaches used large, manually designed sets of sparse indicator features, which took a lot of time and effort to compute and were often incomplete.
  • Link to the paper

Description of the system

  • The system described in the paper uses the arc-standard system (a greedy, transition-based dependency parsing system).
  • Words, POS tags, and arc labels are represented as d-dimensional vectors.
  • S^w, S^t, S^l denote the sets of chosen words, POS tags, and arc labels respectively.
  • The neural network takes as input selected elements from the 3 sets and uses a single hidden layer followed by a softmax layer that models the different actions that can be chosen by the arc-standard system (a minimal sketch of this architecture follows the list).
  • A cube activation function is used to allow interactions between features coming from the words, POS tags, and labels in the first layer itself; these features come from different embeddings and are not otherwise related.
  • Using separate embeddings for POS tags and labels allows capturing regularities such as NN (singular noun) being closer to NNS (plural noun) than to DT (determiner).
  • The input to the network consists of words on the stack and buffer and their left and right children (read up on transition-based parsing), along with their POS tags and the corresponding arc labels.
  • The output of the network is the action to be taken (the transition to be performed) at each step of parsing the input.
  • This sequential and deterministic nature of the input-output mapping allows the problem to be modelled as a supervised learning problem, and a cross-entropy loss can be used.
  • An L2-regularization term is also added to the loss.
  • During inference, a greedy decoding strategy is used and the transition with the highest score is chosen at each step.
  • The paper mentions a pre-computation trick where the matrix products for the top 10,000 most frequent words are computed beforehand and cached (see the caching sketch after this list).
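
To make the architecture concrete, here is a minimal sketch (not the authors' code) of the scoring network and the greedy transition choice. The feature counts (18 words, 18 POS tags, 12 arc labels), embedding size d = 50, and 200 hidden units follow the configuration reported in the paper; the vocabulary sizes, random inputs, and the reduction to 3 unlabeled transitions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, hidden, n_transitions = 50, 200, 3      # 3 = SHIFT / LEFT-ARC / RIGHT-ARC (unlabeled, for illustration)
n_feats = 18 + 18 + 12                     # word + POS + arc-label features

# Embedding matrices for words, POS tags, and arc labels (toy vocabulary sizes).
E_word = rng.normal(scale=0.01, size=(10000, d))
E_pos = rng.normal(scale=0.01, size=(50, d))
E_label = rng.normal(scale=0.01, size=(50, d))

W1 = rng.normal(scale=0.01, size=(hidden, n_feats * d))    # input -> hidden
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(n_transitions, hidden))  # hidden -> transition scores

def transition_probs(word_ids, pos_ids, label_ids):
    """Concatenate the selected embeddings, apply the cube activation,
    and return softmax probabilities over the possible transitions."""
    x = np.concatenate([E_word[word_ids].ravel(),
                        E_pos[pos_ids].ravel(),
                        E_label[label_ids].ravel()])
    h = (W1 @ x + b1) ** 3                 # cube activation g(x) = x^3
    scores = W2 @ h
    p = np.exp(scores - scores.max())
    return p / p.sum()

# Greedy decoding step: pick the highest-scoring transition for the current configuration.
word_ids = rng.integers(0, 10000, size=18)
pos_ids = rng.integers(0, 50, size=18)
label_ids = rng.integers(0, 50, size=12)
best_transition = int(np.argmax(transition_probs(word_ids, pos_ids, label_ids)))
```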

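A hedged sketch of the pre-computation trick, under toy sizes (the paper caches the top 10,000 most frequent words): for frequent words, the product of their embedding with the corresponding columns of the input-to-hidden matrix is computed once and cached, so the word contribution to the hidden-layer pre-activation at parse time is just a sum of cached vectors. The variable names and the fallback path are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, top_k, n_word_feats = 50, 200, 1000, 18   # top_k is 10,000 in the paper

E_word = rng.normal(scale=0.01, size=(20000, d))
W1_word = rng.normal(scale=0.01, size=(hidden, n_word_feats * d))  # word part of the input->hidden matrix

# cache[w, j] = (columns of W1 for feature slot j) @ embedding(w), for the top_k frequent words
cache = np.empty((top_k, n_word_feats, hidden))
for j in range(n_word_feats):
    W_slice = W1_word[:, j * d:(j + 1) * d]
    cache[:, j, :] = E_word[:top_k] @ W_slice.T

def word_preactivation(word_ids):
    """Sum cached contributions when available; otherwise fall back to the full product."""
    total = np.zeros(hidden)
    for j, w in enumerate(word_ids):
        if w < top_k:
            total += cache[w, j]
        else:
            total += W1_word[:, j * d:(j + 1) * d] @ E_word[w]
    return total

h_word = word_preactivation(rng.integers(0, 20000, size=n_word_feats))
```
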
Experiments

  • Dataset
    • English Penn Treebank (PTB)
    • Chinese Penn Treebank (CTB)
  • Two dependency representations used:
    • CoNLL Syntactic Dependencies (CD)
    • Stanford Basic Dependencies (SD)
  • Metrics (illustrated in the sketch after this list):
    • Unlabeled Attachment Score (UAS)
    • Labeled Attachment Score (LAS)
  • Benchmarked against:
    • Greedy arc-eager parser
    • Greedy arc-standard parser
    • MaltParser
    • MSTParser
  • Results
    • The system proposed in the paper outperforms all the other parsers in both accuracy and speed.
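
The two metrics can be illustrated with a small sketch (assumed interface, not from the paper's code): UAS is the fraction of tokens whose predicted head is correct, while LAS additionally requires the predicted arc label to match.

```python
def attachment_scores(gold, pred):
    """gold/pred: lists of (head_index, arc_label) pairs, one per token."""
    assert len(gold) == len(pred)
    uas_hits = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return uas_hits / n, las_hits / n

gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)   # uas = 1.0, las = 2/3
```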

Analysis

  • The cube activation function gives a 0.8-1.2% improvement over tanh.
  • Pretrained embeddings give a 0.7-1.7% improvement over training the embeddings from scratch.
  • Using POS tags and arc labels gives improvements of 1.7% and 0.4% respectively.