Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Distributed representation of symbols is important in machine learning systems
- Traditional word embeddings associate a separate vector with each word
- Hash embeddings reduce memory footprint by representing each word as a summary of normalized word form, subword information and word shape
- Technical report introduces embedding methods in spaCy and evaluates hash embedding architecture with multi-embeddings on Named Entity Recognition datasets
Paper Content
Introduction
- spaCy is a popular suite of Natural Language Processing software
- It provides algorithms and models for common NLP tasks
- It pays attention to stability, usability and documentation
- It offers a fine-grained API for customizing and controlling training
- It prioritizes run-time efficiency, long document efficiency, robustness to domain-shift, and the ability to fine-tune the model
- It uses the hashing trick to reduce the search space in a lookup table
- Word embeddings associate words with continuous vectors
- They encode useful syntactic and semantic information
- Collobert and Weston popularized the idea of using neural networks with pretrained word embeddings
- Mikolov et al. made the pretraining phase cheaper
- FastText computes word representations as the sum of word and subword vectors
- spaCy models can make use of pretrained embeddings and learn word embeddings by backpropagating errors
- Hash embeddings represent a large number of words using a much smaller number of vectors
Embedding layers
- Language processing deals with discrete symbols
- Word embedding layer maps dictionary entries to vectors
- Task-specific embedding functions are learned in an end-to-end fashion
- General-purpose embeddings are trained in a self-supervised fashion
- Input tokens are encoded and mapped to binary one-hot vectors
- Embedding matrix is used to embed symbols as a lookup operation
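The last two bullets describe the same operation from two angles: multiplying a one-hot vector by the embedding matrix selects a single row, so in practice it is done as a lookup. A minimal sketch, assuming a made-up 4 x 3 matrix and hypothetical helper names:

```python
# Toy embedding matrix: 4 vocabulary entries, 3 dimensions each (values are made up).
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Binary vector with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector times matrix, written out explicitly."""
    n_cols = len(matrix[0])
    return [
        sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]

token_id = 2
via_product = vec_times_matrix(one_hot(token_id, len(E)), E)
via_lookup = E[token_id]  # same result, without the multiplication

print(via_product == via_lookup)  # True
```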
Algorithm 2: Create embedding table
- Initialize the embedding table, sized by vocabulary and vector dimension, from a uniform random distribution
- Use a threshold to determine the minimum frequency of a token to embed
- Fix the size of the embedding table and only learn vectors for the top-k most frequent symbols
- Map all symbols not in the embedding table to the special UNK symbol
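The steps above can be sketched roughly as follows. This is an illustration rather than the paper's algorithm: the names `build_vocab`, `create_table`, `lookup` and the `<UNK>` string are hypothetical, and the initialization range of -0.1 to 0.1 is an assumption.

```python
import random
from collections import Counter

UNK = "<UNK>"  # hypothetical spelling of the special unknown symbol

def build_vocab(tokens, min_freq=1, top_k=None):
    """Keep tokens seen at least `min_freq` times, optionally only the top-k."""
    counts = Counter(tokens)
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    if top_k is not None:
        kept = kept[:top_k]
    vocab = {UNK: 0}  # row 0 is reserved for unknown symbols
    for tok in kept:
        vocab[tok] = len(vocab)
    return vocab

def create_table(vocab, width, seed=0):
    """One uniformly initialized vector per vocabulary entry (range is assumed)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in vocab]

def lookup(vocab, token):
    """Row index for `token`, falling back to the UNK row."""
    return vocab.get(token, vocab[UNK])

corpus = "the cat sat on the mat the cat".split()
vocab = build_vocab(corpus, min_freq=2)
table = create_table(vocab, width=4)

print(vocab)                 # {'<UNK>': 0, 'the': 1, 'cat': 2}
print(lookup(vocab, "mat"))  # 0, because "mat" is below the frequency threshold
```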
Hash embedding layer
- Hash embeddings reduce memory footprint by applying the hashing trick.
- Inspired by Bloom filters, which are used to solve the membership problem.
- Bloom filters have two operations: inserting an element and testing whether an element has already been inserted.
- Hash embeddings are parametrized by the number of rows, width, and number of hash functions.
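A minimal sketch of a hash embedding with those three parameters. It assumes the rows selected by the hash functions are summed into one vector, and it uses salted `hashlib.blake2b` purely as a stand-in for whatever hash function a real implementation uses. All function names are hypothetical.

```python
import hashlib
import random

def hash_rows(symbol, n_rows, n_hashes):
    """Map a symbol to `n_hashes` row indices, one per salted hash function."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.blake2b(
            symbol.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % n_rows)
    return rows

def make_table(n_rows, width, seed=0):
    """Table of `n_rows` vectors of size `width` (initialization range is assumed)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)]

def hash_embed(symbol, table, n_hashes):
    """Sum the rows picked by each hash function (summation is an assumption)."""
    width = len(table[0])
    vector = [0.0] * width
    for row in hash_rows(symbol, len(table), n_hashes):
        for j in range(width):
            vector[j] += table[row][j]
    return vector

table = make_table(n_rows=50, width=4)
vec = hash_embed("spaCy", table, n_hashes=4)
print(hash_rows("spaCy", n_rows=50, n_hashes=4))
print(vec)
```

Because every string hashes to some set of rows, no vocabulary or UNK symbol is needed: any word, seen or unseen, receives a vector.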
Collisions
- Hash embeddings and Bloom filters are prone to collisions.
- Probability of a symbol being mapped to any row is 1/n.
- Probability of a row not being chosen by a single hash function is 1-1/n.
- With 50,000 distinct words, 5,000 rows, and a single hash function, collision probability is 0.99995.
- Using 4 independent hash functions with range 0 to n yields a collision probability of 5 x 10^-12.
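One reading that reproduces the 0.99995 figure for the single-hash case: if each symbol misses a given row with probability 1 - 1/n, then after hashing v symbols the row has been chosen at least once with probability 1 - (1 - 1/n)^v. The sketch below covers only this single-hash case, and the function name is hypothetical.

```python
def row_occupied_probability(n_symbols, n_rows):
    """Probability that a given row is chosen at least once when `n_symbols`
    symbols are hashed uniformly into `n_rows` rows by one hash function."""
    return 1.0 - (1.0 - 1.0 / n_rows) ** n_symbols

p = row_occupied_probability(n_symbols=50_000, n_rows=5_000)
print(round(p, 5))  # 0.99995
```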
Multi-embeddings with orthographic features
- Tokenizer component in spaCy extracts various features from a token’s orthographic representation.
- Features embedded include lowercased token with additional normalizations.
- Learnable parameters of the network include matrices W1, W2, W3 and bias terms b1, b2, b3.
- Maxout layer is implemented as a component-wise maximum over multiple linear layers.
- MultiHashEmbed combines multi-embedding process with hash embeddings.
- MultiEmbed is identical to MultiHashEmbed except it does not use the hashing trick.
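The maxout bullet can be illustrated with three linear layers, matching the W1, W2, W3 and b1, b2, b3 named above. This is a sketch with made-up weights and hypothetical helper names; in the multi-embedding setting the input `x` would be the combined feature embeddings.

```python
def linear(x, W, b):
    """Affine map: W holds one row of weights per output unit, b is the bias."""
    return [
        sum(w * x_j for w, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

def maxout(x, weights, biases):
    """Component-wise maximum over the outputs of several linear layers."""
    pieces = [linear(x, W, b) for W, b in zip(weights, biases)]
    return [max(values) for values in zip(*pieces)]

x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]    # identity
W2 = [[-1.0, 0.0], [0.0, -1.0]]  # negation
W3 = [[0.5, 0.5], [0.5, 0.5]]    # mean of the inputs in each unit
b1 = b2 = b3 = [0.0, 0.0]

y = maxout(x, [W1, W2, W3], [b1, b2, b3])
print(y)  # [1.0, 2.0]
```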
Experimental setup
- Goal of experiments: benchmark MultiHashEmbed against traditional word embeddings
- Model architecture: tested on a variety of named entity recognition datasets from multiple domains
- Word embeddings: used vectors distributed with spaCy 3.4.3 (large) models
- Unseen evaluation: separate evaluation for unseen test entities, consider each span as unseen entity if it does not appear verbatim in training set
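The unseen-entity split can be sketched as below. It assumes "verbatim" means an exact match on the span text, regardless of label. The `(text, label)` tuple layout and the example entities are made up for illustration.

```python
def split_seen_unseen(train_entities, test_entities):
    """Partition test entities by whether their text occurs in the training set.

    Entities are (text, label) pairs; matching is exact and case-sensitive.
    """
    train_texts = {text for text, _label in train_entities}
    seen = [ent for ent in test_entities if ent[0] in train_texts]
    unseen = [ent for ent in test_entities if ent[0] not in train_texts]
    return seen, unseen

train = [("Amsterdam", "LOC"), ("Philips", "ORG")]
test = [("Amsterdam", "LOC"), ("amsterdam", "LOC"), ("Rotterdam", "LOC")]

seen, unseen = split_seen_unseen(train, test)
print(seen)    # [('Amsterdam', 'LOC')]
print(unseen)  # [('amsterdam', 'LOC'), ('Rotterdam', 'LOC')]
```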
Model architecture and training details
- Named Entity Recognizer architecture uses transition-based model
- BILUO sequence encoding scheme used to determine entity boundaries
- Maxout network computes state vector and action probabilities
- Dynamic oracle used with imitation learning objective
- Model architecture and embeddings varied for experiments
- 8-layer convolutional encoder with residual connections used
- Layer normalization, dropout, AdamW optimizer, weight decay, gradient clipping applied
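For reference, the BILUO scheme tags each token as Beginning, Inside or Last of a multi-token entity, as a single-token Unit, or as Outside any entity. A small sketch that converts token-level spans into tags; the function name and the `(start, end, label)` span layout with an exclusive end are assumptions for illustration.

```python
def biluo_tags(n_tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, to BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"      # single-token entity
        else:
            tags[start] = f"B-{label}"      # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"      # tokens strictly inside
            tags[end - 1] = f"L-{label}"    # last token
    return tags

tokens = ["Alice", "moved", "to", "New", "York", "City"]
tags = biluo_tags(len(tokens), [(0, 1, "PER"), (3, 6, "LOC")])
print(tags)  # ['U-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC']
```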
Results
- MultiHashEmbed was tested in different benchmarking scenarios
- Average F1-score was reported across three random seeds
- Full results are included in tables in the Appendix
Comparing MultiEmbed and MultiHashEmbed embedding strategies
- MultiEmbed and MultiHashEmbed are compared with and without pretrained embeddings
- Pretrained embeddings provide a consistent benefit across all datasets
- MultiHashEmbed with default number of rows performs the same as MultiEmbed
- MultiHashEmbed has memory savings compared to MultiEmbed
- MultiEmbed and MultiHashEmbed perform similarly on unseen entities
Number of rows
- Hash embeddings use a small number of vectors to achieve good performance.
- MultiHashEmbed performs comparably to MultiEmbed with half the number of vectors.
- Tests show that MultiHashEmbed can achieve comparable performance to traditional embeddings with significantly fewer parameters.
Orthographic features
- Hash embeddings compared to traditional embeddings
- Evaluating contribution of orthographic features
- Removing features one-by-one and measuring effect on performance
- Tables 2 and 3 report relative error increase in F1-score for Dutch CoNLL 2002 and AnEM
- Removing any of the features degrades performance
- ORTH performs worst overall
- Error decreased on seen entities, increased on unseen entities
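A plausible way to compute a relative error increase, assuming error is defined as 1 minus the F1-score and the increase is measured relative to the baseline with all features. The definition and the numbers below are assumptions for illustration, not values from the paper's tables.

```python
def relative_error_increase(f1_baseline, f1_ablated):
    """Relative change in error (1 - F1) after removing a feature.

    Positive values mean the ablated model makes more errors.
    """
    err_base = 1.0 - f1_baseline
    err_ablated = 1.0 - f1_ablated
    return (err_ablated - err_base) / err_base

# Made-up scores: F1 drops from 0.80 to 0.78 when a feature is removed,
# so the error grows from 0.20 to 0.22, a relative increase of about 10%.
increase = relative_error_increase(f1_baseline=0.80, f1_ablated=0.78)
print(f"{increase:.1%}")
```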
Number of hash functions
- Number of rows and number of independent hash functions can be used to control capacity of embedding layer.
- Performance does not vary much with different numbers of hash functions.
- Number of rows for PREFIX feature is too large for datasets with Latin scripts.
Discussion and conclusion
- Word embeddings have a positive effect on NLP accuracy
- spaCy uses hash embeddings as an alternative to traditional embeddings
- Evaluated effectiveness of spaCy’s MultiHashEmbed
- Found that hash embeddings are competitive with traditional embeddings
- Orthographic features can improve performance
- Benefit of additional orthographic features is subtle
- Using more than one hash function does not lead to performance gains
- Recommends that spaCy users inspect their data before training, experiment with orthographic features, number of rows and hash functions, and make use of pretrained embeddings
- MultiEmbed can outperform MultiHashEmbed on CoNLL
- Choice of 10 as minimum document frequency cutoff for MultiEmbed gives most consistent results
- Performance of spaCy and fastText vectors is similar
- Relative error increase on MultiHashEmbed given various combinations of orthographic features for CoNLL Dutch and AnEM datasets
- Counts of the occurrence of orthographic features in the training sets