GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethyl silyl (TBDMS) derivatives for development of machine learning-based compound identification approaches

Data in Brief(2023)

引用 0|浏览13
暂无评分
摘要
In the field of environment and health studies, recent trends have focused on the identification of contaminants of emerging concern (CEC). This is a complex, challeng-ing task, as resources, such as compound databases (DBs) and mass spectral libraries (MSLs) concerning these com-pounds are very poor. This is particularly true for semi po-lar organic contaminants that have to be derivatized prior to gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which it is barely possible to find any records. In particular, there is a se-vere lack of datasets of GC-EI-MS spectra generated and made publicly available for the purpose of development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel cutting-edge machine learning (ML)-based approaches [1] . We set out to fill this gap and support the ma-chine learning-assisted compound identification, thus aiding cheminformatics-assisted identification of silylated deriva-tives in GC-MS laboratories working in the field of envi-ronment and health. To this end, we have generated 12 datasets of GC-EI-MS spectra, six of which contain GC-EI-MS spectra of trimethylsilyl (TMS) and six GC-EI-MS spectra of tert-butyldimethylsilyl (TBDMS) derivatives. Four of these datasets, named testing datasets, contain mass spectra ac-quired by the authors. They are available in full, together with corresponding metadata. Eight datasets, named training datasets, were derived from mass spectra in the NIST 17 Mass Spectral Library. For these, we have only made the metadata publicly available, due to licensing reasons. For each type of derivative, two testing datasets are gener-ated by acquiring and processing GC-EI-MS spectra, such that they include raw and processed GC-EI-MS spectra of TMS and TBDMS derivatives of CECs, along with their corresponding metadata. The metadata contains IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChemID, of each CEC and CEC-TMS or CEC-TBDMS derivative, where available. Eight GC-EI-MS training datasets are generated by using the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institute of Health (NIH) 17 Mass Spectral Library. For each derivative type (TMS and TBDMS), four datasets are given, each corresponding to an original dataset obtained from NIST/EPA/NIH 17 and three variants thereof, obtained after each of the filtering steps of the procedure described below. Only the metadata about the training datasets are available, describing the corresponding NIST/EPA/NIH 17 entires: These include the compound name, CAS Registry number, InChIKey, exact mass, Mw, NIST number and ID number. The datasets we present here were used to train and test pre-dictive models for identification of silylated derivatives built with ML approaches [4] . The models were built by using data curated from the NIST Mass Spectral Library 17 [2] and the machine learning approach of CSI:Output Kernel Regres-sion (CSI:OKR) [2] . Data from the NIST Mass Spectral Library 17 are commercially available from the National Institute of Standards and Technology (NIST)/U.S. Environmental Protec-tion Agency (EPA)/National Institute of Health (NIH) and thus cannot be made publicly available. This highlights the need for publicly available GC-EI-MS spectra, which we address by releasing in full the four testing datasets.(c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ )
更多
查看译文
关键词
Silylation,Derivative,Identification,Machine learning,GC-MS,Mass spectrometry
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要