Machine-Learning-Based Data Analysis Method for Cell-Based Selection of DNA-Encoded Libraries

Hou, Rui; Xie, Chao; Gui, Yuhan; Li, Gang; Li, Xiaoyu

Links for fulltext

Publisher Website: 10.1021/acsomega.3c02152
Scopus: eid_2-s2.0-85161000286
WOS: WOS:001040521400001

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Chemistry: Journal/Magazine Articles

Title	Machine-Learning-Based Data Analysis Method for Cell-Based Selection of DNA-Encoded Libraries
Authors	Hou, Rui; Xie, Chao; Gui, Yuhan; Li, Gang; Li, Xiaoyu
Issue Date	15-May-2023
Publisher	American Chemical Society
Citation	ACS Omega, 2023, v. 8, n. 21, p. 19057-19071. DOI: http://dx.doi.org/10.1021/acsomega.3c02152
Abstract	DNA-encoded library (DEL) is a powerful ligand discovery technology that has been widely adopted in the pharmaceutical industry. DEL selections are typically performed with a purified protein target immobilized on a matrix or in solution phase. Recently, DELs have also been used to interrogate the targets in the complex biological environment, such as membrane proteins on live cells. However, due to the complex landscape of the cell surface, the selection inevitably involves significant nonspecific interactions, and the selection data are much noisier than the ones with purified proteins, making reliable hit identification highly challenging. Researchers have developed several approaches to denoise DEL datasets, but it remains unclear whether they are suitable for cell-based DEL selections. Here, we report the proof-of-principle of a new machine-learning (ML)-based approach to process cell-based DEL selection datasets by using a Maximum A Posteriori (MAP) estimation loss function, a probabilistic framework that can account for and quantify uncertainties of noisy data. We applied the approach to a DEL selection dataset, where a library of 7,721,415 compounds was selected against a purified carbonic anhydrase 2 (CA-2) and a cell line expressing the membrane protein carbonic anhydrase 12 (CA-12). The extended-connectivity fingerprint (ECFP)-based regression model using the MAP loss function was able to identify true binders and also reliable structure–activity relationship (SAR) from the noisy cell-based selection datasets. In addition, the regularized enrichment metric (known as MAP enrichment) could also be calculated directly without involving the specific machine-learning model, effectively suppressing low-confidence outliers and enhancing the signal-to-noise ratio. Future applications of this method will focus on de novo ligand discovery from cell-based DEL selections.
Persistent Identifier	http://hdl.handle.net/10722/331318
ISSN	2470-1343 (2023 Impact Factor: 3.7; 2023 SCImago Journal Rankings: 0.710)
ISI Accession Number ID	WOS:001040521400001

DC Field	Value	Language
dc.contributor.author	Hou, Rui	-
dc.contributor.author	Xie, Chao	-
dc.contributor.author	Gui, Yuhan	-
dc.contributor.author	Li, Gang	-
dc.contributor.author	Li, Xiaoyu	-
dc.date.accessioned	2023-09-21T06:54:40Z	-
dc.date.available	2023-09-21T06:54:40Z	-
dc.date.issued	2023-05-15	-
dc.identifier.citation	ACS Omega, 2023, v. 8, n. 21, p. 19057-19071	-
dc.identifier.issn	2470-1343	-
dc.identifier.uri	http://hdl.handle.net/10722/331318	-
dc.language	eng	-
dc.publisher	American Chemical Society	-
dc.relation.ispartof	ACS Omega	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.title	Machine-Learning-Based Data Analysis Method for Cell-Based Selection of DNA-Encoded Libraries	-
dc.type	Article	-
dc.identifier.doi	10.1021/acsomega.3c02152	-
dc.identifier.scopus	eid_2-s2.0-85161000286	-
dc.identifier.volume	8	-
dc.identifier.issue	21	-
dc.identifier.spage	19057	-
dc.identifier.epage	19071	-
dc.identifier.eissn	2470-1343	-
dc.identifier.isi	WOS:001040521400001	-
dc.identifier.issnl	2470-1343	-
