Background: To develop hypotheses about unknown metabolic pathways, biochemists frequently rely on literature that describes functional groups or substructures in free text. In computational chemistry and cheminformatics, molecules are typically represented by chemical descriptors, i.e., vectors that summarize information on their various properties. However, these chemical descriptors are difficult to interpret because they are not directly linked to the functional-group and substructure terminology that biochemists use. Methods: In this study, we used the KEGG Chemical Function (KCF) format to computationally describe biochemical substructures in terms of seven attributes that resemble the way biochemists deal with substructures. Results: We established the KCF-S (KCF-and-Substructures) format, which adds structural information to KCF. Applying KCF-S to various datasets of molecules revealed substructures that appear specifically in each dataset and characterize it. Structure-based clustering of molecules using KCF-S produced clusters in which molecular weights and structures were less diverse than in those obtained with conventional chemical fingerprints. We further applied KCF-S to find pairs of molecules that may be converted to each other in enzymatic reactions, and KCF-S clearly improved predictive performance over previously reported results. Conclusions: KCF-S defines biochemical substructures while retaining interpretability, suggesting its potential for application in further studies in chemical bioinformatics. KCF and KCF-S can be generated automatically from the Molfile format, making it possible to handle molecules from any data source.
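To illustrate the general idea of comparing molecules through interpretable substructure descriptors, the following minimal Python sketch represents each molecule as a bag of named substructures and scores similarity with a count-based Tanimoto coefficient. It is an illustration only, not the authors' implementation: the substructure labels are hypothetical, human-readable placeholders rather than real KCF-S strings, the counts are hand-written for the example, and the choice of a count-based Tanimoto coefficient is an assumption made here.

```python
from collections import Counter


def tanimoto(a: Counter, b: Counter) -> float:
    """Count-based Tanimoto: sum of minimum counts over sum of maximum counts."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)  # a missing key counts as 0
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 0.0


# Hypothetical substructure labels and counts (assumed for illustration,
# not produced by any KCF-S converter).
ethanol = Counter({"C-C": 1, "C-O": 1, "hydroxyl": 1})
ethylene_glycol = Counter({"C-C": 1, "C-O": 2, "hydroxyl": 2})
acetaldehyde = Counter({"C-C": 1, "C=O": 1, "aldehyde": 1})

print(tanimoto(ethanol, ethylene_glycol))
print(tanimoto(ethanol, acetaldehyde))
```

Because every vector dimension is a named substructure, the keys that contribute to a high or low score can be read off directly, which is the kind of interpretability the abstract emphasizes.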
All Science Journal Classification (ASJC) codes
- Structural Biology
- Modelling and Simulation
- Molecular Biology
- Computer Science Applications
- Applied Mathematics