Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Distributed representation of symbols is important in machine learning systems
  • Traditional word embeddings associate a separate vector with each word
  • Hash embeddings reduce memory footprint by representing each word as a summary of normalized word form, subword information and word shape
  • Technical report introduces embedding methods in spaCy and evaluates hash embedding architecture with multi-embeddings on Named Entity Recognition datasets

Paper Content

Introduction

  • SpaCy is a popular suite of Natural Language Processing software
  • It provides algorithms and models for common NLP tasks
  • It pays attention to stability, usability and documentation
  • It offers a fine-grained API for customizing and controlling training
  • It prioritizes run-time efficiency, long document efficiency, robustness to domain-shift, and the ability to fine-tune the model
  • It uses the hashing trick to reduce the search space in a lookup table
  • Word embeddings associate words with continuous vectors
  • They encode useful syntactic and semantic information
  • Collobert and Weston popularized the idea of using neural networks with pretrained word embeddings
  • Mikolov et al. made the pretraining phase cheaper
  • FastText computes word representations as the sum of word and subword vectors
  • SpaCy models can make use of pretrained embeddings and learn word embeddings by backpropagating errors
  • Hash embeddings represent a large number of words using a much smaller number of vectors

Embedding layers

  • Language processing deals with discrete symbols
  • Word embedding layer maps dictionary entries to vectors
  • Task-specific embedding functions are learned in an end-to-end fashion
  • General-purpose embeddings are trained in a self-supervised fashion
  • Input tokens are encoded and mapped to binary one-hot vectors
  • Embedding matrix is used to embed symbols as a lookup operation

Algorithm 2 create embedding table

  • Use a uniform random distribution to map dimensions and vocabulary
  • Use a threshold to determine the minimum frequency of a token to embed
  • Fix the size of the embedding table and only learn vectors for the top-k most frequent symbols
  • Map all symbols not in the embedding table to the special UNK symbol

Hash embedding layer

  • Hash embeddings reduce memory footprint by applying the hashing trick.
  • Inspired by Bloom filters, which are used to solve the membership problem.
  • Bloom filters have two operations: inserting an element and testing whether an element has already been inserted.
  • Hash embeddings are parametrized by the number of rows, width, and number of hash functions.

Collisions

  • Hash embeddings and Bloom filters are prone to collisions.
  • Probability of a symbol being mapped to any row is 1/n.
  • Probability of a row not being chosen by a single hash function is 1-1/n.
  • With 50,000 distinct words, 5,000 rows, and a single hash function, collision probability is 0.99995.
  • Using 4 independent hash functions with range 0 to n yields a collision probability of 5 x 10^-12.

Multi-embeddings with orthographic features

  • Tokenizer component in spaCy extracts various features from a token’s orthographic representation.
  • Features embedded include lowercased token with additional normalizations.
  • Learnable parameters of the network include matrices W1, W2, W3 and bias terms b1, b2, b3.
  • Maxout layer is implemented as a component-wise maximum over multiple linear layers.
  • MultiHashEmbed combines multi-embedding process with hash embeddings.
  • MultiEmbed is identical to MultiHashEmbed except it does not use the hashing trick.

Experimental setup

  • Goal of experiments: benchmark MultiHashEmbed against traditional word embeddings
  • Model architecture: tested on variety of named entity recognition datasets from multiple domains
  • Word embeddings: used vectors distributed with spaCy 3.4.3 (large) models
  • Unseen evaluation: separate evaluation for unseen test entities, consider each span as unseen entity if it does not appear verbatim in training set

Model architecture and training details

  • Named Entity Recognizer architecture uses transition-based model
  • BILUO sequence encoding scheme used to determine entity boundaries
  • Maxout network computes state vector and action probabilities
  • Dynamic oracle used with imitation learning objective
  • Model architecture and embeddings varied for experiments
  • 8-layer convolutional encoder with residual connections used
  • Layer normalization, dropout, AdamW optimizer, weight decay, gradient clipping applied

Results

  • MultiHashEmbed was tested in different benchmarking scenarios
  • Average F1-score was reported across three random seeds
  • Full results are included in tables in the Appendix

Comparing mu l t iem b e d and mu l t iha s hem b e d embedding strategies

  • MultiEmbed and MultiHashEmbed are compared with and without pretrained embeddings
  • Pretrained embeddings provide a consistent benefit across all datasets
  • MultiHashEmbed with default number of rows performs the same as MultiEmbed
  • MultiHashEmbed has memory savings compared to MultiEmbed
  • MultiEmbed and MultiHashEmbed perform similarly on unseen entities

Number of rows

  • Hash embeddings use a small amount of vectors to achieve good performance.
  • MultiHashEmbed performs comparably to MultiEmbed with half the number of vectors.
  • Tests show that MultiHashEmbed can achieve comparable performance to traditional embeddings with significantly less parameters.

Orthographic features

  • Hash embeddings compared to traditional embeddings
  • Evaluating contribution of orthographic features
  • Removing features one-by-one and measuring effect on performance
  • Tables 2 and 3 report relative error increase in F1-score for Dutch CoNLL 2002 and AnEM
  • Removing any of the features degrades performance
  • ORTH performs worst overall
  • Error decreased on seen entities, increased on unseen entities

Number of hash functions

  • Number of rows and number of independent hash functions can be used to control capacity of embedding layer.
  • Performance does not vary much with different number of hash functions.
  • Number of rows for PREFIX feature is too large for datasets with Latin scripts.

Discussion and conclusion

  • Word embeddings have a positive effect on NLP accuracy
  • spaCy uses hash embeddings as an alternative to traditional embeddings
  • Evaluated effectiveness of spaCy’s MultiHashEmbed
  • Found that hash embeddings are competitive with traditional embeddings
  • Orthographic features can improve performance
  • Benefit of additional orthographic features is subtle
  • Using more than one hash function does not lead to performance gains
  • Recommend spaCy users to inspect data before training, experiment with orthographic features, number of rows and hash functions, and make use of pretrained embeddings
  • MultiEmbed can outperform MultiHashEmbed on ConLL
  • Choice of 10 as minimum document frequency cutoff for MultiEmbed gives most consistent results
  • Performance of spaCy and fastText vectors is similar
  • Relative error increase on MultiHashEmbed given various combinations of orthographic features for CoNLL Dutch and AnEM datasets
  • Counts of the occurrence of orthographic features in the training sets