Article: What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
| Title | What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions |
|---|---|
| Authors | Zhang, Liyi; Li, Michael Y.; Thomas McCoy, R.; Sumers, Theodore R.; Zhu, Jian Qiao; Griffiths, Thomas L. |
| Issue Date | 2025 |
| Citation | Transactions on Machine Learning Research, 2025, v. July-2025 |
| Abstract | Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We show that the embeddings from autoregressive models correspond to predictive sufficient statistics. By identifying settings where the predictive sufficient statistics are interpretable distributions over latent variables, including exchangeable models and latent state models, we show that embeddings of autoregressive models encode these explainable quantities of interest. We conduct empirical probing studies to extract information from transformers about latent generating distributions. Furthermore, we show that these embeddings generalize to out-of-distribution cases, do not exhibit token memorization, and that the information we identify is more easily recovered than other related measures. Next, we extend our analysis of exchangeable models to more realistic scenarios where the predictive sufficient statistic is difficult to identify by focusing on an interpretable subcomponent of language, topics. We show that large language models encode topic mixtures inferred by latent Dirichlet allocation (LDA) in both synthetic datasets and natural corpora. |
| Persistent Identifier | http://hdl.handle.net/10722/367863 |
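The abstract describes probing studies that recover latent generating distributions from transformer embeddings. The sketch below is a minimal illustration of such a probe for an exchangeable Beta-Bernoulli source, not the authors' code: the choice of model (`gpt2`), the "H"/"T" prompt format, the Beta(2, 2) prior, and the Ridge regression probe are all assumptions made here for illustration, and the paper's actual setup may differ.

```python
# Minimal illustrative sketch (assumptions noted above, not the authors' code):
# probe a transformer's last-token embedding for the predictive sufficient
# statistic of an exchangeable Beta-Bernoulli source.
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def last_token_embedding(text: str) -> np.ndarray:
    """Return the final-layer hidden state of the last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0, -1].numpy()

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(200):
    theta = rng.beta(2.0, 2.0)                  # latent generating parameter
    flips = rng.binomial(1, theta, size=20)     # exchangeable observations
    text = " ".join("H" if f else "T" for f in flips)
    X.append(last_token_embedding(text))
    # Posterior mean under the Beta(2, 2) prior: a predictive sufficient statistic
    y.append((2.0 + flips.sum()) / (4.0 + len(flips)))

X, y = np.array(X), np.array(y)
probe = Ridge(alpha=1.0).fit(X[:150], y[:150])
print("held-out R^2:", r2_score(y[150:], probe.predict(X[150:])))
```

A high held-out R^2 in this kind of setup is the sort of evidence the abstract refers to when it says embeddings encode the latent generating distribution; the same pattern (embed sequences, fit a linear probe against an inferred latent quantity) extends to the LDA topic-mixture experiments mentioned there.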
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Zhang, Liyi | - |
| dc.contributor.author | Li, Michael Y. | - |
| dc.contributor.author | Thomas McCoy, R. | - |
| dc.contributor.author | Sumers, Theodore R. | - |
| dc.contributor.author | Zhu, Jian Qiao | - |
| dc.contributor.author | Griffiths, Thomas L. | - |
| dc.date.accessioned | 2025-12-19T08:00:03Z | - |
| dc.date.available | 2025-12-19T08:00:03Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Transactions on Machine Learning Research, 2025, v. July-2025 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367863 | - |
| dc.description.abstract | Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We show that the embeddings from autoregressive models correspond to predictive sufficient statistics. By identifying settings where the predictive sufficient statistics are interpretable distributions over latent variables, including exchangeable models and latent state models, we show that embeddings of autoregressive models encode these explainable quantities of interest. We conduct empirical probing studies to extract information from transformers about latent generating distributions. Furthermore, we show that these embeddings generalize to out-of-distribution cases, do not exhibit token memorization, and that the information we identify is more easily recovered than other related measures. Next, we extend our analysis of exchangeable models to more realistic scenarios where the predictive sufficient statistic is difficult to identify by focusing on an interpretable subcomponent of language, topics. We show that large language models encode topic mixtures inferred by latent Dirichlet allocation (LDA) in both synthetic datasets and natural corpora. | - |
| dc.language | eng | - |
| dc.relation.ispartof | Transactions on Machine Learning Research | - |
| dc.title | What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions | - |
| dc.type | Article | - |
| dc.description.nature | link_to_subscribed_fulltext | - |
| dc.identifier.scopus | eid_2-s2.0-105011731541 | - |
| dc.identifier.volume | July-2025 | - |
| dc.identifier.eissn | 2835-8856 | - |
