Conference Paper: A simple but tough-to-beat baseline for sentence embeddings
Field | Value
---|---
Title | A simple but tough-to-beat baseline for sentence embeddings
Authors | Arora, Sanjeev; Liang, Yingyu; Ma, Tengyu
Issue Date | 2017
Citation | 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 2017
Abstract | The success of neural network methods for computing word embeddings has motivated methods for generating semantic embeddings of longer pieces of text, such as sentences and paragraphs. Surprisingly, Wieting et al. (ICLR'16) showed that such complicated methods are outperformed, especially in out-of-domain (transfer learning) settings, by simpler methods involving mild retraining of word embeddings and basic linear regression. The method of Wieting et al. requires retraining with a substantial labeled dataset such as the Paraphrase Database (Ganitkevitch et al., 2013). The current paper goes further, showing that the following completely unsupervised sentence embedding is a formidable baseline: use word embeddings computed with one of the popular methods on an unlabeled corpus such as Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD. This weighting improves performance by about 10% to 30% on textual similarity tasks and beats sophisticated supervised methods, including RNNs and LSTMs. It even improves Wieting et al.'s embeddings. This simple method should be used as the baseline to beat in the future, especially when labeled training data is scarce or nonexistent. The paper also gives a theoretical explanation of the success of the above unsupervised method using a latent variable generative model for sentences, which is a simple extension of the model in Arora et al. (TACL'16) with new “smoothing” terms that allow for words occurring out of context, as well as high probabilities for words like “and” and “not” in all contexts.
Persistent Identifier | http://hdl.handle.net/10722/341275
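The recipe summarized in the abstract (a weighted average of word vectors followed by a PCA/SVD correction) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's reference implementation: the frequency-based weight a / (a + p(w)), the function name `sif_embeddings`, and the inputs `word_vectors` and `word_freq` are choices made for the sketch.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """Weighted-average sentence embeddings with a first-principal-component correction.

    sentences    : list of whitespace-tokenizable strings
    word_vectors : dict mapping word -> 1-D numpy array (e.g. GloVe/word2vec vectors)
    word_freq    : dict mapping word -> estimated unigram probability p(w)
    a            : smoothing parameter for the weight a / (a + p(w)) (assumed default)
    """
    dim = len(next(iter(word_vectors.values())))
    embs = []
    for sent in sentences:
        words = [w for w in sent.lower().split() if w in word_vectors]
        if not words:
            embs.append(np.zeros(dim))
            continue
        # Down-weight frequent words: very common words such as "and" get weights near zero.
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        vecs = np.array([word_vectors[w] for w in words])
        embs.append(weights @ vecs / len(words))
    X = np.vstack(embs)
    # "Modify them a bit using PCA/SVD": subtract each sentence's projection
    # onto the first right singular vector of the embedding matrix.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```

For textual-similarity style evaluations, cosine similarity between the returned rows would then serve as the sentence-similarity score.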
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Arora, Sanjeev | - |
dc.contributor.author | Liang, Yingyu | - |
dc.contributor.author | Ma, Tengyu | - |
dc.date.accessioned | 2024-03-13T08:41:32Z | - |
dc.date.available | 2024-03-13T08:41:32Z | - |
dc.date.issued | 2017 | - |
dc.identifier.citation | 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 2017 | - |
dc.identifier.uri | http://hdl.handle.net/10722/341275 | - |
dc.description.abstract | The success of neural network methods for computing word embeddings has motivated methods for generating semantic embeddings of longer pieces of text, such as sentences and paragraphs. Surprisingly, Wieting et al. (ICLR'16) showed that such complicated methods are outperformed, especially in out-of-domain (transfer learning) settings, by simpler methods involving mild retraining of word embeddings and basic linear regression. The method of Wieting et al. requires retraining with a substantial labeled dataset such as the Paraphrase Database (Ganitkevitch et al., 2013). The current paper goes further, showing that the following completely unsupervised sentence embedding is a formidable baseline: use word embeddings computed with one of the popular methods on an unlabeled corpus such as Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD. This weighting improves performance by about 10% to 30% on textual similarity tasks and beats sophisticated supervised methods, including RNNs and LSTMs. It even improves Wieting et al.'s embeddings. This simple method should be used as the baseline to beat in the future, especially when labeled training data is scarce or nonexistent. The paper also gives a theoretical explanation of the success of the above unsupervised method using a latent variable generative model for sentences, which is a simple extension of the model in Arora et al. (TACL'16) with new “smoothing” terms that allow for words occurring out of context, as well as high probabilities for words like “and” and “not” in all contexts. | -
dc.language | eng | - |
dc.relation.ispartof | 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings | - |
dc.title | A simple but tough-to-beat baseline for sentence embeddings | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.scopus | eid_2-s2.0-85086639984 | - |