Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Distributed representation of symbols is important in machine learning systems
Traditional word embeddings associate a separate vector with each word
Hash embeddings reduce memory footprint by representing each word as a summary of normalized word form, subword information and word shape
Technical report introduces embedding methods in spaCy and evaluates hash embedding architecture with multi-embeddings on Named Entity Recognition datasets

Paper Content

Introduction

spaCy is a popular suite of Natural Language Processing software
It provides algorithms and models for common NLP tasks
It pays attention to stability, usability and documentation
It offers a fine-grained API for customizing and controlling training
It prioritizes run-time efficiency, long document efficiency, robustness to domain-shift, and the ability to fine-tune the model
It uses the hashing trick to reduce the memory footprint of its embedding lookup tables
Word embeddings associate words with continuous vectors
They encode useful syntactic and semantic information
Collobert and Weston popularized the idea of using neural networks with pretrained word embeddings
Mikolov et al. made the pretraining phase cheaper
FastText computes word representations as the sum of word and subword vectors
spaCy models can make use of pretrained embeddings and learn word embeddings by backpropagating errors
Hash embeddings represent a large number of words using a much smaller number of vectors

Embedding layers

Language processing deals with discrete symbols
Word embedding layer maps dictionary entries to vectors
Task-specific embedding functions are learned in an end-to-end fashion
General-purpose embeddings are trained in a self-supervised fashion
Input tokens are encoded and mapped to binary one-hot vectors
Embedding matrix is used to embed symbols as a lookup operation
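
The equivalence between the one-hot view and the lookup view can be checked with a minimal sketch. The matrix values and function names below are illustrative assumptions, not taken from the paper or from spaCy.

```python
# Toy embedding matrix: 4 symbols, 3 dimensions (values are arbitrary).
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Return a binary vector with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(index, table):
    """Multiply the one-hot row vector by the embedding matrix."""
    x = one_hot(index, len(table))
    width = len(table[0])
    return [sum(x[r] * table[r][c] for r in range(len(table))) for c in range(width)]

def embed_by_lookup(index, table):
    """Select the row directly; same result without the multiplication."""
    return list(table[index])
```

Both functions return the same vector for any valid index, which is why embedding layers are implemented as a row lookup rather than a matrix product.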

Algorithm 2 create embedding table

Initialize the embedding table from a uniform random distribution, with one row per vocabulary entry and one column per dimension
Use a threshold to determine the minimum frequency of a token to embed
Fix the size of the embedding table and only learn vectors for the top-k most frequent symbols
Map all symbols not in the embedding table to the special UNK symbol
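
The steps above can be sketched in plain Python. This is a rough illustration rather than the paper's Algorithm 2: the `<UNK>` spelling, the parameter names `min_freq` and `max_rows`, and the initialization range are all assumptions.

```python
import random
from collections import Counter

UNK = "<UNK>"  # hypothetical spelling of the special unknown symbol

def create_embedding_table(tokens, width, min_freq=1, max_rows=None, seed=0):
    """Return (vocab, table): vocab maps symbol -> row index, table is a list of rows."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    # Keep symbols at or above the frequency threshold, most frequent first.
    kept = [sym for sym, n in counts.most_common() if n >= min_freq]
    if max_rows is not None:
        # Fixed table size: reserve one row for UNK, keep the top-k symbols.
        kept = kept[: max(max_rows - 1, 0)]
    vocab = {UNK: 0}
    for sym in kept:
        vocab[sym] = len(vocab)
    # Uniform random initialization; the range is an arbitrary choice here.
    table = [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in vocab]
    return vocab, table

def lookup(symbol, vocab, table):
    """Symbols outside the vocabulary share the UNK row."""
    return table[vocab.get(symbol, vocab[UNK])]
```

Every out-of-vocabulary symbol ends up with the same vector, which is the limitation that hash embeddings are meant to soften.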

Hash embedding layer

Hash embeddings reduce memory footprint by applying the hashing trick.
Inspired by Bloom filters, which are used to solve the membership problem.
Bloom filters have two operations: inserting an element and testing whether an element has already been inserted.
Hash embeddings are parametrized by the number of rows, width, and number of hash functions.
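
A minimal sketch of a hash embedding with the three parameters named above (rows, width, number of hash functions). It assumes that the rows selected by the hash functions are summed, and it uses salted BLAKE2 hashes from the standard library as a stand-in for whatever hash function the actual implementation uses; the class and function names are hypothetical.

```python
import hashlib
import random

def hash_rows(symbol, n_rows, n_hashes):
    """Map a symbol to `n_hashes` row indices using differently salted hashes."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.blake2b(
            symbol.encode("utf8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        rows.append(int.from_bytes(digest, "little") % n_rows)
    return rows

class HashEmbed:
    """Toy hash embedding: no vocabulary, only a fixed-size table."""

    def __init__(self, n_rows, width, n_hashes=4, seed=0):
        rng = random.Random(seed)
        self.n_rows, self.width, self.n_hashes = n_rows, width, n_hashes
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)
        ]

    def __call__(self, symbol):
        # Sum the rows selected by each hash function (assumed combination rule).
        rows = hash_rows(symbol, self.n_rows, self.n_hashes)
        return [sum(self.table[r][c] for r in rows) for c in range(self.width)]
```

Because the row indices are computed from the symbol itself, any string receives a vector, including strings never seen during training, and the table size is independent of the vocabulary size.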

Collisions

Hash embeddings and Bloom filters are prone to collisions.
Probability of a symbol being mapped to any row is 1/n.
Probability of a row not being chosen by a single hash function is 1-1/n.
With 50,000 distinct words, 5,000 rows, and a single hash function, the probability that a given word collides with at least one other word is approximately 0.99995.
Using 4 independent hash functions with range 0 to n yields a collision probability of approximately 5 x 10^-12.
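
The single-hash figure can be reproduced with a short calculation, assuming uniform and independent hashing: each of the other words misses a given row with probability 1-1/n, so a given word avoids every collision with probability (1-1/n)^(m-1) for m distinct words. The function name is hypothetical. The multi-hash figure is not reproduced here because this summary does not state how a full collision is counted.

```python
def single_hash_collision_probability(n_words, n_rows):
    """P(a given word shares its row with at least one of the other words)."""
    p_other_misses = 1.0 - 1.0 / n_rows
    return 1.0 - p_other_misses ** (n_words - 1)

p = single_hash_collision_probability(50_000, 5_000)
print(f"{p:.5f}")
```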

Multi-embeddings with orthographic features

Tokenizer component in spaCy extracts various features from a token’s orthographic representation.
Features embedded include lowercased token with additional normalizations.
Learnable parameters of the network include matrices W1, W2, W3 and bias terms b1, b2, b3.
Maxout layer is implemented as a component-wise maximum over multiple linear layers.
MultiHashEmbed combines multi-embedding process with hash embeddings.
MultiEmbed is identical to MultiHashEmbed except it does not use the hashing trick.
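
A maxout layer as described above can be sketched in a few lines. This is an illustration under assumptions: the per-feature embeddings are taken to be concatenated before the maxout layer, and W1, W2, W3 with b1, b2, b3 are taken to be its three linear pieces. The function names are hypothetical.

```python
def linear(x, W, b):
    """Compute W x + b for a weight matrix W (one row per output) and bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def maxout(x, weights, biases):
    """Component-wise maximum over several linear layers applied to the same input."""
    pieces = [linear(x, W, b) for W, b in zip(weights, biases)]
    return [max(values) for values in zip(*pieces)]

def multi_embed(feature_vectors, weights, biases):
    """Concatenate per-feature embeddings, then mix them with a maxout layer."""
    concatenated = [v for vec in feature_vectors for v in vec]
    return maxout(concatenated, weights, biases)
```

In MultiHashEmbed each entry of `feature_vectors` would come from a hash embedding of one orthographic feature; in MultiEmbed it would come from an ordinary lookup table.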

Experimental setup

Goal of experiments: benchmark MultiHashEmbed against traditional word embeddings
Datasets: tested on a variety of named entity recognition datasets from multiple domains
Word embeddings: used vectors distributed with spaCy 3.4.3 (large) models
Unseen evaluation: separate evaluation for unseen test entities; a span counts as an unseen entity if it does not appear verbatim in the training set
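
The seen/unseen split can be sketched as a verbatim string match against the training entities. Whether the comparison also involves the entity label is not stated here, so this sketch assumes it uses the entity text only; the function name is hypothetical.

```python
def split_seen_unseen(train_entities, test_entities):
    """Partition test entity strings by verbatim presence in the training set."""
    seen_in_training = set(train_entities)
    seen = [e for e in test_entities if e in seen_in_training]
    unseen = [e for e in test_entities if e not in seen_in_training]
    return seen, unseen
```

Note that the match is case-sensitive, so "new york" in the test set would count as unseen even if "New York" appears in training.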

Model architecture and training details

Named Entity Recognizer architecture uses transition-based model
BILUO sequence encoding scheme used to determine entity boundaries
Maxout network computes state vector and action probabilities
Dynamic oracle used with imitation learning objective
Model architecture and embeddings varied for experiments
8-layer convolutional encoder with residual connections used
Layer normalization, dropout, AdamW optimizer, weight decay, gradient clipping applied
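
To illustrate the BILUO scheme mentioned above: B marks the first token of a multi-token entity, I an inner token, L the last token, U a single-token entity, and O a token outside any entity. The helper below is a hypothetical sketch that works on token indices and assumes non-overlapping spans; it is not spaCy's own conversion utility.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, to BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags

tokens = ["Alice", "visited", "New", "York", "City", "today"]
tags = spans_to_biluo(len(tokens), [(0, 1, "PER"), (2, 5, "LOC")])
```

A transition-based recognizer predicts actions that correspond to these tags one token at a time, which is how entity boundaries are determined.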

Results

MultiHashEmbed was tested in different benchmarking scenarios
Average F1-score was reported across three random seeds
Full results are included in tables in the Appendix

Comparing MultiEmbed and MultiHashEmbed embedding strategies

MultiEmbed and MultiHashEmbed are compared with and without pretrained embeddings
Pretrained embeddings provide a consistent benefit across all datasets
MultiHashEmbed with default number of rows performs the same as MultiEmbed
MultiHashEmbed has memory savings compared to MultiEmbed
MultiEmbed and MultiHashEmbed perform similarly on unseen entities

Number of rows

Hash embeddings use a small number of vectors to achieve good performance.
MultiHashEmbed performs comparably to MultiEmbed with half the number of vectors.
Tests show that MultiHashEmbed can achieve comparable performance to traditional embeddings with significantly fewer parameters.

Orthographic features

Hash embeddings compared to traditional embeddings
Evaluating contribution of orthographic features
Removing features one-by-one and measuring effect on performance
Tables 2 and 3 report relative error increase in F1-score for Dutch CoNLL 2002 and AnEM
Removing any of the features degrades performance
ORTH performs worst overall
Error decreased on seen entities, increased on unseen entities
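
One plausible reading of "relative error increase in F1-score" is the relative change in error, with error taken as 1 - F1, when a feature is removed. The definition and the function name below are assumptions, and the scores in the usage line are made up for illustration, not results from the paper.

```python
def relative_error_increase(f1_baseline, f1_ablated):
    """Relative change in error (1 - F1) when a feature is removed."""
    err_base = 1.0 - f1_baseline
    err_ablated = 1.0 - f1_ablated
    return (err_ablated - err_base) / err_base

# Hypothetical scores: baseline F1 of 0.80, F1 of 0.78 after removing a feature.
change = relative_error_increase(0.80, 0.78)
```

Under this definition a positive value means the ablation hurt performance and a negative value means the error decreased.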

Number of hash functions

Number of rows and number of independent hash functions can be used to control capacity of embedding layer.
Performance does not vary much with different number of hash functions.
Number of rows for PREFIX feature is too large for datasets with Latin scripts.

Discussion and conclusion

Word embeddings have a positive effect on NLP accuracy
spaCy uses hash embeddings as an alternative to traditional embeddings
Evaluated effectiveness of spaCy’s MultiHashEmbed
Found that hash embeddings are competitive with traditional embeddings
Orthographic features can improve performance
Benefit of additional orthographic features is subtle
Using more than one hash function does not lead to performance gains
Recommendation for spaCy users: inspect the data before training, experiment with orthographic features, number of rows and hash functions, and make use of pretrained embeddings
MultiEmbed can outperform MultiHashEmbed on CoNLL
Choice of 10 as minimum document frequency cutoff for MultiEmbed gives most consistent results
Performance of spaCy and fastText vectors is similar
Relative error increase on MultiHashEmbed given various combinations of orthographic features for CoNLL Dutch and AnEM datasets
Counts of the occurrence of orthographic features in the training sets
