In natural language processing, a sentence embedding (or document embedding) is a representation of a natural language text as a vector of numbers which encodes meaningful semantic information.[1][2][3][4][5][6][7] The name stems from the approach's initial inability to embed sequences of text longer than a sentence, but this is no longer a limitation.
State-of-the-art embeddings are based on the learned hidden-layer representations of dedicated sentence transformer models. BERT pioneered an approach involving a dedicated [CLS] token prepended to each sentence fed into the model; the final hidden state vector of this token encodes information about the sentence and can be fine-tuned for use in sentence classification tasks. In practice, however, BERT’s sentence embedding with the [CLS] token achieves poor performance, often worse than simply averaging non-contextual word embeddings.[8] SBERT later achieved superior sentence embedding performance[8] by fine-tuning BERT’s [CLS] token embeddings using a siamese neural network architecture on the SNLI dataset.
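The following minimal sketch illustrates the pooling step described above. It assumes the encoder's final hidden states are already available as one vector per token, with the [CLS] token at position 0; the function names and toy values are hypothetical and do not come from BERT or SBERT. Mean pooling over non-padding tokens is included as a commonly used alternative to the [CLS] vector.

```python
from typing import List

Vector = List[float]


def cls_pooling(hidden_states: List[Vector]) -> Vector:
    """Use the final hidden state of the first ([CLS]) token as the sentence vector."""
    return list(hidden_states[0])


def mean_pooling(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the hidden states of real tokens, skipping padding (mask == 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            for i in range(dim):
                total[i] += vec[i]
            count += 1
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]


# Toy final-layer output for "[CLS] hello world [PAD]" with 2-dimensional states.
# Real models produce hundreds of dimensions per token.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pooling(hidden)          # vector of the [CLS] position only
mean_vec = mean_pooling(hidden, mask)  # average over the three non-padding tokens
```

Both functions map a variable-length sequence of token vectors to a single fixed-size vector; they differ only in which token states contribute to it.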
Other approaches are loosely based on the idea of distributional semantics applied to sentences. Skip-Thought trains an encoder-decoder structure to predict neighboring sentences; this has been shown to achieve worse performance than approaches such as InferSent or SBERT.
An alternative direction is to aggregate word embeddings, such as those returned by Word2vec, into sentence embeddings. The most straightforward approach is to simply compute the average of word vectors, known as continuous bag-of-words (CBOW).[9] However, more elaborate solutions based on word vector quantization have also been proposed. One such approach is the vector of locally aggregated word embeddings (VLAWE),[10] which demonstrated performance improvements in downstream text classification tasks.
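As an illustration of the averaging approach, the sketch below builds a sentence vector from a small lookup table of word vectors. The table, its three-dimensional values, and the function name are hypothetical; real Word2vec vectors are learned from a corpus and usually have a few hundred dimensions.

```python
from typing import Dict, List

# Hypothetical toy word vectors, for illustration only.
WORD_VECTORS: Dict[str, List[float]] = {
    "cats": [1.0, 0.0, 2.0],
    "chase": [0.0, 3.0, 1.0],
    "mice": [2.0, 0.0, 0.0],
}


def average_embedding(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of known tokens; out-of-vocabulary tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


sentence_vec = average_embedding("cats chase mice".split(), WORD_VECTORS)
reordered_vec = average_embedding("mice chase cats".split(), WORD_VECTORS)
```

Because averaging ignores word order, `sentence_vec` and `reordered_vec` are identical even though the two sentences mean different things, which is one motivation for the more elaborate aggregation schemes.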
Applications
In recent years, sentence embedding has seen growing interest due to its applications in natural-language-queryable knowledge bases, which use vector indexing for semantic search. LangChain, for instance, uses sentence transformers to index documents. In particular, an index is built by generating embeddings for chunks of documents and storing (document chunk, embedding) tuples. Given a query in natural language, an embedding for the query is then generated. A top-k similarity search between the query embedding and the document chunk embeddings retrieves the most relevant document chunks as context information for question answering tasks. This approach is also known formally as retrieval-augmented generation.[11]
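The retrieval step can be sketched as a brute-force top-k search over stored (document chunk, embedding) tuples. This is a minimal illustration, not LangChain's implementation: the embeddings are made-up toy values standing in for the output of a sentence encoder, cosine similarity is assumed as the similarity measure, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Vector, index: List[Tuple[str, Vector]], k: int) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(chunk, cosine_similarity(query, emb)) for chunk, emb in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# (document chunk, embedding) tuples with toy 3-dimensional embeddings.
index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The mitochondrion produces ATP.", [0.0, 0.2, 0.9]),
    ("France borders Spain and Italy.", [0.7, 0.6, 0.1]),
]
# Stands in for the embedding of "What is the capital of France?".
query_embedding = [1.0, 0.2, 0.0]

results = top_k(query_embedding, index, k=2)
```

Here `results` contains the two France-related chunks, with the sentence about Paris ranked first. Scoring every chunk is linear in the size of the index, so large collections typically rely on a dedicated vector index rather than this exhaustive scan.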
Though not as predominant as BERTScore, sentence embeddings are commonly used for sentence similarity evaluation. One common use is optimizing a large language model’s generation parameters, which is often performed by comparing candidate sentences against reference sentences. By using the cosine similarity of the sentence embeddings of candidate and reference sentences as the evaluation function, a grid-search algorithm can be used to automate hyperparameter optimization.[citation needed]
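A grid search of this kind might look like the sketch below. Everything model-specific is an assumption: `toy_generate` is a stand-in for a language model call, `toy_embed` is a stand-in for a sentence encoder, and the parameter names `temperature` and `top_p` are only examples of generation parameters.

```python
import itertools
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grid_search(
    grid: Dict[str, Sequence[float]],
    generate: Callable[[Dict[str, float]], str],
    embed: Callable[[str], Vector],
    reference: str,
) -> Tuple[Dict[str, float], float]:
    """Try every parameter combination; keep the one whose output is closest to the reference."""
    ref_vec = embed(reference)
    best_params: Dict[str, float] = {}
    best_score = -math.inf
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cosine_similarity(embed(generate(params)), ref_vec)
        if score > best_score:  # strict comparison: the first best combination wins ties
            best_params, best_score = params, score
    return best_params, best_score


def toy_embed(text: str) -> Vector:
    """Hypothetical encoder: counts of three marker words."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("cat", "sat", "mat")]


def toy_generate(params: Dict[str, float]) -> str:
    """Hypothetical model: only temperature 0.7 yields the full sentence."""
    if params["temperature"] == 0.7:
        return "the cat sat on the mat"
    return "the cat ran"


grid = {"temperature": [0.2, 0.7, 1.0], "top_p": [0.9, 1.0]}
best_params, best_score = grid_search(grid, toy_generate, toy_embed, "a cat sat on a mat")
```

With these stubs the search selects a temperature of 0.7. In a realistic setting the score would normally be averaged over many candidate and reference pairs rather than computed on a single sentence.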
Evaluation
Multiple approaches exist for evaluating the quality of sentence embeddings, typically covering one or more of the use cases of the models. Some seek to measure whether the embedding is semantically meaningful by testing whether semantically similar sentences appear closer together, while others test sentence similarity or whether embeddings reflect entailment, using corpora such as Sentences Involving Compositional Knowledge (SICK)[12] or the STS Benchmark.[13] Other approaches measure the quality of the embeddings by how well they perform in downstream use cases such as clustering, classification or semantic search. Comprehensive frameworks such as BEIR[14] or MTEB[15][16], which encapsulate several of these approaches, have since become the standard for evaluating the quality of embeddings across domains and/or languages.
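A similarity-based evaluation can be sketched as follows: for each sentence pair, the cosine similarity of the two embeddings is compared with a human similarity rating, and the agreement is summarized with Spearman's rank correlation, a statistic commonly reported for semantic textual similarity tasks. The embeddings and ratings below are made-up toy values, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    if spread == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / spread


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# (embedding of sentence A, embedding of sentence B, human rating): toy values.
pairs = [
    ([1.0, 0.0], [1.0, 0.1], 4.8),
    ([1.0, 0.0], [0.6, 0.8], 3.0),
    ([1.0, 0.0], [0.0, 1.0], 0.5),
]
predicted = [cosine_similarity(a, b) for a, b, _ in pairs]
gold = [rating for _, _, rating in pairs]
rho = spearman(predicted, gold)
```

In this toy case the predicted similarities order the pairs exactly as the human ratings do, so `rho` is 1 up to floating-point rounding; an embedding that ranked the pairs in the opposite order would score -1.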
External links
- InferSent sentence embeddings and training code
- Universal Sentence Encoder
- Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning
References
- ^ Barkan, Oren; Razin, Noam; Malkiel, Itzik; Katz, Ori; Caciularu, Avi; Koenigstein, Noam (2019). “Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding”. arXiv:1908.05161 [cs.LG].
- ^ The Current Best of Universal Word Embeddings and Sentence Embeddings
- ^ Cer, Daniel; Yang, Yinfei; Kong, Sheng-yi; Hua, Nan; Limtiaco, Nicole; John, Rhomni St.; Constant, Noah; Guajardo-Cespedes, Mario; Yuan, Steve; Tar, Chris; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray (2018). “Universal Sentence Encoder”. arXiv:1803.11175 [cs.CL].
- ^ Wu, Ledell; Fisch, Adam; Chopra, Sumit; Adams, Keith; Bordes, Antoine; Weston, Jason (2017). “StarSpace: Embed All the Things!”. arXiv:1709.03856 [cs.CL].
- ^ Sanjeev Arora, Yingyu Liang, and Tengyu Ma. “A simple but tough-to-beat baseline for sentence embeddings.”, 2016; openreview:SyK00v5xx.
- ^ Trifan, Mircea; Ionescu, Bogdan; Gadea, Cristian; Ionescu, Dan (2015). “A graph digital signal processing method for semantic analysis”. 2015 IEEE 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics. pp. 187–192. doi:10.1109/SACI.2015.7208196. ISBN 978-1-4799-9911-8. S2CID 17099431.
- ^ Basile, Pierpaolo; Caputo, Annalina; Semeraro, Giovanni (2012). “A Study on Compositional Semantics of Words in Distributional Spaces”. 2012 IEEE Sixth International Conference on Semantic Computing. pp. 154–161. doi:10.1109/ICSC.2012.55. ISBN 978-1-4673-4433-3. S2CID 552921.
- ^ a b Reimers, Nils; Gurevych, Iryna (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. arXiv:1908.10084 [cs.CL].
- ^ Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013-09-06). “Efficient Estimation of Word Representations in Vector Space”. arXiv:1301.3781 [cs.CL].
- ^ Ionescu, Radu Tudor; Butnaru, Andrei (2019). “Vector of Locally-Aggregated Word Embeddings (VLAWE)”. Proceedings of the 2019 Conference of the North. Minneapolis, Minnesota: Association for Computational Linguistics. pp. 363–369. doi:10.18653/v1/N19-1033. S2CID 85500146.
- ^ Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. arXiv:2005.11401 [cs.CL].
- ^ Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. “A SICK cure for the evaluation of compositional distributional semantic models.” In LREC, pp. 216–223. 2014.
- ^ Cer, Daniel; Diab, Mona; Agirre, Eneko; Lopez-Gazpio, Iñigo; Specia, Lucia (August 2017). Bethard, Steven; Carpuat, Marine; Apidianaki, Marianna; Mohammad, Saif M.; Cer, Daniel; Jurgens, David (eds.). “SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation”. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics: 1–14. doi:10.18653/v1/S17-2001.
- ^ Thakur, Nandan; Reimers, Nils; Rücklé, Andreas; Srivastava, Abhishek; Gurevych, Iryna (2021-08-29). “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models”.
- ^ Muennighoff, Niklas; Tazi, Nouamane; Magne, Loic; Reimers, Nils (May 2023). Vlachos, Andreas; Augenstein, Isabelle (eds.). “MTEB: Massive Text Embedding Benchmark”. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia: Association for Computational Linguistics: 2014–2037. doi:10.18653/v1/2023.eacl-main.148.
- ^ Enevoldsen, Kenneth; Chung, Isaac; Kerboua, Imene; Kardos, Márton; Mathur, Ashwin; Stap, David; Gala, Jay; Siblini, Wissam; Krzemiński, Dominik (2025-02-19). “MMTEB: Massive Multilingual Text Embedding Benchmark”. arXiv.org. Retrieved 2026-06-08.