Volume 69, Issue 4 e17971
RESEARCH ARTICLE
Open Access

Graph machine learning for design of high-octane fuels

Jan G. Rittig

Process Systems Engineering (AVT.SVT), RWTH Aachen University, Aachen, Germany

Contribution: Conceptualization (equal), Data curation (equal), Formal analysis (equal), Funding acquisition (supporting), Investigation (equal), Methodology (equal), Software (equal), Validation (equal), Visualization (lead), Writing - original draft (lead), Writing - review & editing (supporting)

Martin Ritzert

Department of Computer Science, Aarhus University, Aarhus, Denmark

Contribution: Conceptualization (equal), Data curation (equal), Formal analysis (equal), Investigation (equal), Methodology (equal), Software (equal), Validation (equal), Visualization (supporting), Writing - original draft (supporting), Writing - review & editing (supporting)

Artur M. Schweidtmann

Department of Chemical Engineering, Delft University of Technology, Delft, The Netherlands

Contribution: Conceptualization (equal), Funding acquisition (supporting), Methodology (supporting), Supervision (supporting), Writing - review & editing (supporting)

Stefanie Winkler

Chair of Computer Science 7, RWTH Aachen University, Aachen, Germany

Contribution: Data curation (supporting), Formal analysis (supporting), Methodology (equal), Software (equal), Validation (supporting), Writing - review & editing (supporting)

Jana M. Weber

Delft Bioinformatics Lab, Intelligent Systems, Delft University of Technology, Delft, The Netherlands

Contribution: Methodology (supporting), Writing - review & editing (supporting)

Philipp Morsch

Chair of High Pressure Gas Dynamics, RWTH Aachen University, Aachen, Germany

Contribution: Investigation (equal), Writing - original draft (supporting), Writing - review & editing (supporting)

Karl Alexander Heufer

Chair of High Pressure Gas Dynamics, RWTH Aachen University, Aachen, Germany

Contribution: Funding acquisition (supporting), Supervision (supporting), Writing - review & editing (supporting)

Martin Grohe

Chair of Computer Science 7, RWTH Aachen University, Aachen, Germany

Contribution: Conceptualization (supporting), Funding acquisition (equal), Supervision (equal), Writing - review & editing (supporting)

Alexander Mitsos

Process Systems Engineering (AVT.SVT), RWTH Aachen University, Aachen, Germany

JARA-ENERGY, Aachen, Germany

Forschungszentrum Jülich GmbH, Institute for Energy and Climate Research IEK-10: Energy Systems Engineering, Jülich, Germany

Contribution: Conceptualization (supporting), Funding acquisition (equal), Supervision (equal), Writing - review & editing (supporting)

Manuel Dahmen

Corresponding Author

Forschungszentrum Jülich GmbH, Institute for Energy and Climate Research IEK-10: Energy Systems Engineering, Jülich, Germany

Correspondence

Manuel Dahmen, Forschungszentrum Jülich GmbH, Institute for Energy and Climate Research IEK-10: Energy Systems Engineering, Jülich 52425, Germany.

Email: [email protected]

Contribution: Conceptualization (supporting), Formal analysis (equal), Supervision (equal), Writing - review & editing (lead)

First published: 23 November 2022
Citations: 2

Jan G. Rittig and Martin Ritzert contributed equally to this study.

Funding information: Deutsche Forschungsgemeinschaft, Grant/Award Numbers: 466417970, EXC 2186, GRK 2236; Helmholtz-Gemeinschaft, Grant/Award Number: HDS-LEE (HIDSS-0004)

Abstract

Fuels with high knock resistance enable modern spark-ignition engines to achieve high efficiency and thus low CO2 emissions. Identification of molecules with desired autoignition properties indicated by a high research octane number and a high octane sensitivity is therefore of great practical relevance and can be supported by computer-aided molecular design (CAMD). Recent developments in the field of graph machine learning (graph-ML) provide novel, promising tools for CAMD. We propose a modular graph-ML CAMD framework that integrates generative graph-ML models with graph neural networks and optimization, enabling the design of molecules with desired ignition properties in a continuous molecular space. In particular, we explore the potential of Bayesian optimization and genetic algorithms in combination with generative graph-ML models. The graph-ML CAMD framework successfully identifies well-established high-octane components. It also suggests new candidates, one of which we experimentally investigate and use to illustrate the need for further autoignition training data.

1 INTRODUCTION

With a share of 23% of total CO2 emissions, transportation is a major CO2 emission source.1 Replacing fossil fuels with renewable alternatives may provide a path toward carbon neutrality for the transportation sector and is investigated actively.2-5 An important step toward renewable fuels is the search for suitable gasoline substitutes for use in advanced high compression, turbocharged spark-ignition (SI) engines. A property of paramount importance for a renewable SI engine fuel is knock resistance, traditionally indicated by the research octane number (RON),6 the motor octane number (MON),7 and more recently the octane sensitivity (OS), that is, the difference between RON and MON values. The weighted sum of RON and OS is referred to as the octane index (OI).8 For modern SI engines, fuels with both high RON and high OS, hence high OI, are desired as they enable engine operation at conditions associated with particularly high efficiency.9-15 To boost the OI of a fuel, chemical species with high RON and high OS such as ethanol and MTBE can be added.16, 17 Identification of further molecules providing octane boosting is of great practical relevance and is studied actively, for example, see references 17, 18. Herein, we aim to identify such promising candidates exhibiting both high RON and high OS by computer-aided molecular design (CAMD). In particular, we investigate the role of novel methods from the domain of graph machine learning (graph-ML).

Traditionally, the search for molecules with desired properties for a given application has been mostly guided by human experts and experimentation. CAMD can enhance this process by utilizing computational methods to efficiently pre-screen a large number of molecular structures so that experiments can be dedicated to the most promising candidates. A wide variety of methods and tools for CAMD has been proposed over the last decades; we refer the interested reader to review articles for a detailed CAMD overview.19-25 Generally, the CAMD process incorporates the computational generation of candidate structures and the model-based prediction of their physico-chemical properties. Well-established approaches for the generation of candidate structures include formulating optimization problems in which structural groups are pieced together to form molecules,24, 26 exhaustive generation of molecular structures in a sequential generate-and-test manner,27 and utilizing evolutionary theory to evolve molecular structures.28 For predicting application-relevant properties of the formed candidate structures, CAMD typically employs quantitative structure–property relationships (QSPRs).29 QSPRs first describe the molecular structure by so-called molecular descriptors, for example, atom counts, and secondly map those descriptors to a property of interest by linear or nonlinear models. Today, nonlinear ML models such as feedforward neural networks or random forests are often utilized in this regression step.30-32

For classical CAMD, a broad range of applications25 can be found in the process systems engineering (PSE) literature, covering the design of single molecules (e.g., ionic liquids,33 polymers22), the design of mixtures,34-36 as well as integrated product and process design.37, 38 Classical CAMD techniques have also been applied extensively in the context of fuel design.2, 39-43 For example, in two previous articles,40, 44 we used enumeration-based generation of oxygenated hydrocarbons and subsequently screened the obtained molecules via QSPR models with respect to engine-relevant properties. We previously also developed a generate-and-test approach where molecular candidates are generated by iteratively refunctionalizing bioderived intermediates based on pre-defined transformation rules.2 Also, Cai et al.45 proposed a gasoline design model that employs rule-based transformation of molecules in combination with QSPR for property prediction to identify molecules with desired fuel properties such as high RON.

ML has recently been utilized for molecular structure generation by means of generative ML models, leading to novel, fully ML-based CAMD approaches.25, 46 In generative ML for molecules, two main directions can be distinguished: String-based approaches, for example, based on SMILES strings,47 and graph-based approaches, the latter directly working on the molecular graph. For both directions, a range of models has been developed such as recurrent neural networks (RNNs), variational or adversarial autoencoders (VAEs/AAEs), generative adversarial networks (GANs), and reinforcement learning (RL).46, 48 The goal of such generative ML techniques is the unsupervised learning from a data set of molecular structures to generate new, chemically feasible structures that were not seen during training, thereby designing molecules. Specifically, generative ML models typically learn to encode molecules into a continuous space, the so-called latent space, and then decode samples from the latent space back to molecular structures. The continuous latent space is assumed to capture chemical information about molecules and embed molecules with similar structure or even similar properties close to each other.49 Depending on the model architecture, ML-based CAMD typically relies either on strategic sampling of molecules from the latent space of the generative model using optimization strategies, for example, with VAEs,50-52 or on direct generation of molecules with desired properties, for example, by GANs53, 54 or RL.55, 56 In contrast to classical CAMD, generative models in ML-based CAMD replace discrete molecule representations such as combinations of structural groups, molecular graphs, or SMILES strings with a continuous representation, thus enabling the use of continuous optimization approaches for molecular design.57

ML has also recently enabled end-to-end learning of physico-chemical properties from molecular structure by means of graph neural networks (GNNs).58-60 GNNs are graph-ML architectures that directly operate on the underlying graph structure of a molecule and thus circumvent the need for selecting meaningful molecular descriptors, a step that is inherent to all QSPR/QSAR approaches. Instead, GNNs enable a data-driven end-to-end learning framework for molecular property prediction.

Up to now, fully ML-driven CAMD has mainly focused on drug design.46, 61-63 A particular reason might be the availability of large training data sets and the incorporation of multiple drug design targets such as logP and drug-likeness in benchmarking platforms such as MOSES64 and GuacaMol.65 Such ML-driven CAMD approaches often combine molecule generation and property prediction (e.g., VAEs51, 52), and sometimes optimization (e.g., GANs53, 54 or RL56) in a single ML model which needs to be retrained once the design target property changes and typically requires large property data sets for training.

In contrast to drug design, PSE applications, in particular model-based fuel design, often take place in a data-scarce environment, making ML-based CAMD challenging. In fact, there is only one very recent study using generative ML for fuel design: Liu et al.66 employed a string-based VAE to generate a large database of non-oxygenated hydrocarbons for subsequent screening of candidates with respect to fuel properties, followed by sampling further candidates from the most promising regions of the VAE's latent space. However, ML-driven CAMD has not yet been utilized for fuel design focusing on high SI engine efficiency including oxygenated hydrocarbons. Moreover, graph-ML approaches have not yet been applied to computer-aided fuel design.

In the present contribution, we propose a modular graph-ML CAMD framework* that integrates state-of-the-art graph-based ML methods and tools from the ML and drug design community and apply our framework to computer-aided design of high-octane fuel components for SI engines. Our framework is depicted in Figure 1 and consists of three distinct modules: (1) molecule generation by generative graph-ML models that learn a continuous molecular space from which new molecules can be generated; (2) property prediction through our recently published GNN model for fuel ignition quality prediction68; (3) optimization for strategic sampling from the continuous space of the generative graph-ML models to identify vectors that correspond to molecules with high predicted RON and OS values. Our framework has a modular architecture requiring minimal changes to the model structures if an additional property shall be targeted, that is, only a new property model needs to be trained and added, but the molecule generation and optimization modules do not need to be altered. Thus, the modular setup enhances reusability and therefore reduces the training effort compared to a single ML model approach, as indicated by Winter et al.69

Figure 1. Schematic overview of the modular graph-ML CAMD framework for identification of high-octane fuels

We explore three different generative graph-ML models and two different optimization strategies. Importantly, we propose an applicability domain approach for GNN-based property prediction that allows us to focus the design process on molecules that presumably come with reliable predictions. We analyze the influence of the different ML methods on the structure and properties of the resulting molecules and compile a list of most promising high-octane fuel candidates. Finally, we perform an experimental investigation of one selected high-octane fuel candidate that emphasizes the importance of experimental validation of CAMD results and discuss potential pitfalls of the fully data-driven approach, particularly in a data-scarce environment.

The article is structured as follows: In Section 2, we briefly introduce the main principles behind graph-ML for molecules with regard to both molecule generation and property prediction. In Section 3, we present the modular graph-ML CAMD framework for design of high-octane fuels. The application of the framework in Section 4 includes a comparative analysis of the candidates obtained with different graph-ML modules and the experimental investigation of one particular candidate. Section 5 concludes our work.

2 PRELIMINARIES OF GRAPH MACHINE LEARNING

Graph-ML relies on a graph representation of molecules that can be utilized for generating molecular structures from a continuous space and for property prediction, as we briefly describe in the following. The interested reader is referred to references 70-72 for further details on graph-ML.

2.1 Molecular graph

The molecular graph of a molecule is an undirected graph $G_{\mathrm{mol}} = \{V, F_v, E, F_e\}$; the nodes $V$ represent the atoms; pairs of atoms $u, v \in V$ that share a bond are connected by edges $(u, v) \in E$. Additional features of nodes (e.g., type of atom, degree of hybridization) are stored in $F_v$, while additional features of edges (e.g., bond length or type) are stored in $F_e$.
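
As a minimal sketch of this definition, the four components of the molecular graph can be held in a small Python data structure. The class name `MolecularGraph`, the feature keys, and the helper `build_ethanol` are illustrative assumptions and are not taken from the article; hydrogens are left implicit.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass
class MolecularGraph:
    """Undirected molecular graph G_mol = {V, F_v, E, F_e} (illustrative)."""
    nodes: List[int]                              # V: one index per atom
    node_features: Dict[int, dict]                # F_v: per-atom features
    edges: Set[FrozenSet[int]]                    # E: undirected bonds {u, v}
    edge_features: Dict[FrozenSet[int], dict]     # F_e: per-bond features

    def neighbors(self, v: int) -> List[int]:
        """Return the one-hop neighborhood N(v) as a sorted list."""
        return sorted(u for e in self.edges if v in e for u in e if u != v)


def build_ethanol() -> MolecularGraph:
    # Heavy atoms of ethanol (SMILES: CCO); hydrogens are implicit.
    atoms = {0: "C", 1: "C", 2: "O"}
    edges = {frozenset(bond) for bond in [(0, 1), (1, 2)]}
    return MolecularGraph(
        nodes=list(atoms),
        node_features={i: {"element": s} for i, s in atoms.items()},
        edges=edges,
        edge_features={e: {"bond_type": "single"} for e in edges},
    )


ethanol = build_ethanol()
print(ethanol.neighbors(1))
```

Storing each bond as a `frozenset` makes the edge direction-free, which matches the undirected graph in the definition.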

2.2 Generative models

Generative ML, the unsupervised learning from input data to generate new data that is similar to the provided data, enables fully data-driven molecule generation and is an active research area.46, 48, 63, 73 Various works have developed string-based ML models in order to generate molecules with optimal properties based on SMILES,74-81 InChI,49 or SELFIES,76 the latter being a more robust string representation of molecules. In contrast, graph-ML directly works on the molecular graph, which is arguably the more natural representation of a molecule and provides permutation invariance,82 that is, there is exactly one molecular graph for each molecule (neglecting steric effects). In this article, we focus on two frequently employed generative graph-ML approaches46, 48, 63: VAEs and GANs. Both methods construct a latent space where molecules are encoded as high-dimensional continuous vectors, referred to as latent vectors (LVs), which we denote as $h_{\mathrm{LV}} \in \mathbb{R}^n$ with the dimension $n$ being a hyperparameter. We denote the encoding of a molecular graph into the latent space as a function
$$e_{\mathrm{GEN}}: G_{\mathrm{mol}} \rightarrow h_{\mathrm{LV}}. \qquad (1)$$

Autoencoders, and specifically VAEs, are a class of neural network architectures that employ an hourglass shape (cf. Figure 2A). They are trained to reproduce the input data at the output layer, a non-trivial task as the information has to be moved through some narrow layers in the middle of the network, that is, the hourglass shape forces VAEs to learn $h_{\mathrm{LV}}$ as a low-dimensional representation of the input data at the narrowest layer. The left part of the network (from input to the latent vector) is called the encoder and the right part (from the latent vector to the output) is referred to as the decoder. The main difference between a standard autoencoder and a variational autoencoder (VAE) is that the latter assumes an underlying distribution for the data that it tries to learn in the latent vector space, for example, a multivariate Gaussian distribution $h_{\mathrm{LV}} \sim \mathcal{N}(\mu, \Sigma)$ with parameters $\mu$ and $\Sigma$. VAEs can therefore be used to generate new data from presumably the same distribution as the input data. In the molecular context, VAEs map discrete molecule representations such as graphs to a continuous distribution from which new molecules can be sampled.

Figure 2. Schematic structure of (A) VAEs and (B) GANs

GANs generate objects from a latent representation in a different manner (cf. Figure 2B). Instead of trying to reproduce an input sample, a GAN consists of two neural networks, a generator and a discriminator, where the discriminator is trained to distinguish between output data produced by the generator and real data, that is, the training samples. The generator thus learns to produce output data that resembles the given training data based on random input vectors $h_{\mathrm{LV}}$ that are, for example, sampled from a Gaussian distribution, that is, $h_{\mathrm{LV}} \sim \mathcal{N}(\mu, \Sigma)$. In a GAN, the latent space therefore corresponds to the input space of the generator. We denote the decoding of the latent vector $h_{\mathrm{LV}}$ to the molecular graph for both types of generators, VAE and GAN, with the function
$$d_{\mathrm{GEN}}: h_{\mathrm{LV}} \rightarrow G_{\mathrm{mol}}. \qquad (2)$$
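
To illustrate how a latent vector might be drawn before it is passed to a decoder, the following sketch samples from a Gaussian with a diagonal covariance. The diagonal covariance, the function name `sample_latent`, and the dimension `n = 8` are simplifying assumptions for illustration only; the decoder itself is a trained model and is not reproduced here.

```python
import random
from typing import List, Sequence


def sample_latent(mu: Sequence[float], sigma: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Draw h_LV ~ N(mu, diag(sigma^2)), one independent Gaussian per dimension."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


n = 8                     # latent dimension (a hyperparameter)
rng = random.Random(0)    # fixed seed so the sketch is reproducible
h_lv = sample_latent([0.0] * n, [1.0] * n, rng)
print(len(h_lv))
```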

2.3 Graph-based property prediction

A GNN59, 60 is a type of neural network that operates directly on the graph structure and thus enables end-to-end learning in molecular property prediction. Thereby, GNNs avoid the need for the often subjective manual selection process of molecular descriptors in QSPR/QSAR modeling that requires intuition and experience of the modeler.

GNNs for molecular property prediction are typically structured into two parts, a message passing phase and a readout phase83, 84 (cf. Figure 3). In the message passing phase, structural information is extracted from a local neighborhood of atoms by means of graph convolutions. In each graph convolution, every node sends a message to all its neighbors and thus also receives a message from each of its neighbors. The node uses the received messages, typically in the form of a weighted sum, to update its current state (e.g., in GCN70 and GAT85). The update of the state $h_v^{l}$ of a node $v$ in a graph convolutional layer $l$ can then be written as
$$h_v^{l+1} = \sigma_{\mathrm{ReLU}}\left(h_v^{l} W_1 + \sum_{u \in N(v)} h_u^{l} W_2\right), \qquad (3)$$
where $W_1$, $W_2$ are trainable weight matrices, $N(v)$ is the one-hop neighborhood of $v$, and $\sigma_{\mathrm{ReLU}}$ denotes the elementwise application of the ReLU activation function. Many different update functions have been proposed in recent years, see, for example, references 71, 86, 87, to advance the basic Equation (3) into a more powerful model for extracting information from the graph during message passing.88 For instance, inter-atomic distances and angles between atom pairs89-92 are commonly considered. Higher-order GNNs93, 94 and approaches where the information exchange is also based on individual edges95 constitute further extensions to the basic GNN approach.
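
The update rule in Equation (3) can be sketched in plain Python as follows. This is an illustrative toy with hand-picked weights on a three-node path graph, not the higher-order GNN used in the article; the names `vec_mat` and `graph_conv` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def vec_mat(h: Vector, W: Matrix) -> Vector:
    """Row vector times matrix: h (1 x d_in) @ W (d_in x d_out)."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]


def graph_conv(states: Dict[int, Vector], neighbors: Dict[int, List[int]],
               W1: Matrix, W2: Matrix) -> Dict[int, Vector]:
    """One graph convolution: ReLU(h_v W1 + sum over u in N(v) of h_u W2)."""
    new_states = {}
    for v, h_v in states.items():
        own = vec_mat(h_v, W1)
        agg = [0.0] * len(own)
        for u in neighbors[v]:
            msg = vec_mat(states[u], W2)           # message sent by neighbor u
            agg = [a + m for a, m in zip(agg, msg)]
        new_states[v] = [max(0.0, o + a) for o, a in zip(own, agg)]  # ReLU
    return new_states


# Toy path graph 0 - 1 - 2 with two-dimensional node states.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
states = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W1 = [[1.0, 0.0], [0.0, -1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
updated = graph_conv(states, neighbors, W1, W2)
print(updated)
```

All new states are computed from the states of the previous layer, so the order in which nodes are visited does not affect the result.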

Figure 3. Schematic structure of a graph neural network for molecular property prediction

Subsequent to the message passing phase, a GNN employs a readout phase, where the molecular structure information that is stored in the nodes is aggregated into a single vector for the complete molecule, the so-called molecular fingerprint $h_{\mathrm{FP}}$. This aggregation, also called pooling, is typically performed by summing up the states of all nodes in the molecular graph after the last graph convolutional layer $L$, that is, $h_{\mathrm{FP}} = \sum_{v \in V} h_v^{L}$. We denote the GNN encoding of the molecular graph into the molecular fingerprint with the function
$$g_{\mathrm{GNN}}: G_{\mathrm{mol}} \rightarrow h_{\mathrm{FP}}. \qquad (4)$$
Note that although the molecular fingerprint $h_{\mathrm{FP}}$ in a GNN and the latent vector $h_{\mathrm{LV}}$ in a generative ML model both represent a molecule in a continuous space, they are not related. In the GNN, the molecular fingerprint $h_{\mathrm{FP}}$ is passed through a feedforward neural network (cf. Figure 3) to yield the property prediction $\hat{p} = \mathrm{MLP}(h_{\mathrm{FP}})$. Here, MLP denotes a multi-layer perceptron, one of the simplest and most frequently employed feedforward neural architectures. We denote the entire end-to-end prediction process of a GNN as a function $f_{\mathrm{GNN}}$ that maps the molecular graph to a property prediction, that is,
$$f_{\mathrm{GNN}}: G_{\mathrm{mol}} \rightarrow \hat{p}. \qquad (5)$$
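
The readout phase can be sketched in the same spirit: sum pooling produces the fingerprint, and a small MLP maps it to a scalar prediction. The weights, the single hidden layer, and the function names `sum_pool` and `mlp` are assumptions chosen for illustration, not the trained model from the article.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def sum_pool(states: Dict[int, Vector]) -> Vector:
    """Readout: h_FP is the elementwise sum of all final node states."""
    dim = len(next(iter(states.values())))
    return [sum(h[j] for h in states.values()) for j in range(dim)]


def mlp(h_fp: Vector, W_hidden: Matrix, b_hidden: Vector,
        w_out: Vector, b_out: float) -> float:
    """One hidden ReLU layer followed by a linear output neuron."""
    hidden = [
        max(0.0, sum(h_fp[i] * W_hidden[i][j] for i in range(len(h_fp))) + b_hidden[j])
        for j in range(len(b_hidden))
    ]
    return sum(h * w for h, w in zip(hidden, w_out)) + b_out


# Final node states of a toy three-atom molecule after the last layer L.
final_states = {0: [1.0, 0.5], 1: [1.0, 0.0], 2: [1.0, 0.0]}
h_fp = sum_pool(final_states)
p_hat = mlp(h_fp, W_hidden=[[1.0, -1.0], [2.0, 0.0]], b_hidden=[0.0, 0.0],
            w_out=[0.5, 1.0], b_out=1.0)
print(h_fp, p_hat)
```

Because summation is independent of node ordering, the fingerprint does not change when the atoms are renumbered.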

3 GRAPH-ML CAMD FRAMEWORK FOR HIGH-OCTANE FUELS

In this section, we propose a fully data-driven, modular graph-ML CAMD framework for identification of high-octane fuels. The framework utilizes recent methods from the field of generative graph-ML and GNNs to design molecules with high knock resistance for modern SI engines. Specifically, we set out to maximize the sum of RON and OS, hence the OI, as high-efficiency SI engines require both a high RON and a high OS.9-15 We show a high-level overview of our framework in Figure 1 and provide a detailed framework overview including our choices for algorithms and models in the three different modules in Figure 4. We combine the three modules to form an iterative molecular design loop: The optimization module proposes initial latent vectors from a continuous space, $h_{\mathrm{LV}}$, that are translated to corresponding molecules by the molecule generation module, cf. Equation (2). Then, the property prediction module performs the property evaluation, cf. Equation (5), and based on the property predictions, the optimization algorithm suggests new latent vectors to be tested. This iterative procedure is repeated until a pre-defined stopping criterion is met, for example, a certain number of molecules has been evaluated.

Figure 4. Detailed overview of the modular graph-ML CAMD framework for identification of high-octane fuels including methods for the individual modules

An important observation with the graph-ML CAMD framework, though, is that not all molecules come with physically reasonable predictions. For instance, we have observed a molecule with predicted OS > 400 and negative RON and negative MON. In fact, the optimization often exploits weak spots of the GNN prediction model. Those weak spots typically appear for molecules that are strongly dissimilar from the molecules used for training the GNN. To focus on molecules with more reasonable property predictions, we extend the iterative design loop by an applicability domain (AD) for the GNN property prediction model. To this end, we build upon the AD approach from our previous study96 where we proposed to use a one-class classification model to identify the AD of feedforward NNs. The classification model learns from the data on which the NN is trained to decide if a new data point is similar to the training data and thus considered within the input domain for which the NN presumably provides reliable predictions. To transfer the AD approach to GNNs, we apply the classification model to the molecular fingerprint that serves as input to the MLP part of the GNN (cf. Section 2.3). If the AD is included, GNN predictions considered unreliable by the AD are ignored and instead a penalty value (−1000) is returned to the optimization approach so that the corresponding molecules are assigned a low objective value.
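
The idea of checking whether a fingerprint lies close to the training data can be illustrated with a deliberately simple stand-in. The sketch below is not the one-class classification model referred to above; it swaps in a nearest-neighbor distance threshold, and the function name `ad_score`, the toy fingerprints, and the threshold value are all hypothetical. Only the sign convention (a non-negative score means inside the AD) follows the text.

```python
import math
from typing import Sequence


def ad_score(h_fp: Sequence[float],
             training_fps: Sequence[Sequence[float]],
             threshold: float) -> float:
    """Return threshold minus the distance to the nearest training fingerprint.

    A score >= 0 is read as "inside the applicability domain".
    """
    nearest = min(math.dist(h_fp, t) for t in training_fps)
    return threshold - nearest


# Toy two-dimensional fingerprints standing in for the GNN training set.
training_fps = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
inside = ad_score([0.9, 0.1], training_fps, threshold=0.5)
outside = ad_score([5.0, 5.0], training_fps, threshold=0.5)
print(inside >= 0, outside >= 0)
```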

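The design loop can be formulated as an optimization problem that aims to find the molecules with the highest predicted value of a certain target property $\hat{p}$ of interest, that is,
$$\begin{aligned} \max_{h_{\mathrm{LV}}} \quad & \hat{p} \\ \text{s.t.} \quad & G_{\mathrm{mol}} = d_{\mathrm{GEN}}(h_{\mathrm{LV}}), \\ & \hat{p} = f_{\mathrm{GNN}}(G_{\mathrm{mol}}), \\ & h_{\mathrm{FP}} = g_{\mathrm{GNN}}(G_{\mathrm{mol}}), \\ & \mathrm{AD}(h_{\mathrm{FP}}) \geq 0, \end{aligned} \qquad (6)$$
where the constraint $\mathrm{AD}(h_{\mathrm{FP}}) \geq 0$ denotes a positive decision by the AD model.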
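
One way to hand Equation (6) to a black-box optimizer is to wrap the constraints into a single penalized objective, using the penalty value of −1000 stated above. The sketch below assumes that the decoder, the GNN, and the AD model are available as plain callables; the toy lambdas are placeholders and not real models, and the name `make_objective` is hypothetical.

```python
from typing import Any, Callable, Sequence

PENALTY = -1000.0  # penalty value returned for predictions outside the AD


def make_objective(d_gen: Callable[[Sequence[float]], Any],
                   f_gnn: Callable[[Any], float],
                   g_gnn: Callable[[Any], Sequence[float]],
                   ad: Callable[[Sequence[float]], float],
                   penalty: float = PENALTY) -> Callable[[Sequence[float]], float]:
    """Compose decoder, GNN, and AD into one function of the latent vector."""
    def objective(h_lv: Sequence[float]) -> float:
        g_mol = d_gen(h_lv)        # G_mol = d_GEN(h_LV)
        h_fp = g_gnn(g_mol)        # h_FP = g_GNN(G_mol)
        if ad(h_fp) < 0:           # outside the applicability domain
            return penalty
        return f_gnn(g_mol)        # p_hat = f_GNN(G_mol)
    return objective


# Toy stand-ins (assumptions): the "molecule" is just a tuple of numbers.
objective = make_objective(
    d_gen=lambda h: tuple(h),
    f_gnn=lambda g: sum(g),
    g_gnn=lambda g: list(g),
    ad=lambda fp: 1.0 - max(abs(x) for x in fp),
)
inside_value = objective([0.25, 0.5])
outside_value = objective([2.0, 0.0])
print(inside_value, outside_value)
```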

Due to the high dimensionality of the search space, which corresponds to the latent space of the generator models (see Equation (6)), deterministic global optimization is computationally too costly and practically impossible with current methods (cf. Section 8 below). Instead, we employ black-box optimization approaches that direct a heuristic search toward molecules with high $\hat{p}$. Note that uncertainties in the prediction model prohibit a strict ranking of molecular candidates with similar $\hat{p}$ values. In practice, we therefore compile a list of molecules sampled by the optimizer and investigate the top candidates, that is, the molecules with the highest $\hat{p}$ values. Having multiple top candidates also allows additional desired properties to be taken into account in later investigations, for example, availability for procurement and low production costs.

In the following, we briefly describe the three generative graph-ML models used in this article for the generation module, the GNN model used for the property prediction module, the two optimization algorithms used in the optimization module, and our AD approach.

3.1 Molecule generation

We consider two graph VAE models as generators: The Junction-Tree VAE by Jin et al.51 (JT-VAE) and the Molecular Hypergraph Grammar VAE by Kajino52 (MHG-VAE). Furthermore, we employ MolGAN, a GAN for molecular graphs published by De Cao and Kipf.54 Those three models have close to 100% chemical validity, that is, almost 100% of the generated molecules are chemically feasible,51, 52, 54 a feature that earlier generative methods struggled with, cf. references 97, 98. Apart from achieving high validity, the three models have strong conceptual differences, presumably leading to molecules with somewhat different characteristics.

The JT-VAE51 utilizes two graph representations of a molecule in parallel: The molecular graph and its associated junction tree, which is a contracted cycle-free graph generated by merging cycles of atoms into a single node. For encoding, the JT-VAE learns molecular structure information, represented as high-dimensional vectors, from the molecular graph and the junction tree through graph convolutions (cf. Section 2). For decoding, first, the junction tree's latent vector is decoded resulting in the general molecular structure. Then, the molecular graph's latent vector is decoded to determine the characteristics of the nodes within the junction tree, that is, (re)generating the local structure of the molecule. Jin et al. report a molecule reconstruction rate of 76.7% and 100% chemical validity of the decoded molecules.51

The MHG-VAE52 generates a graph grammar from the given training molecules, which is used for the reconstruction of molecules. In this automatically generated graph grammar, terminal symbols can refer to either single atoms or complete functional groups, and the rules of the grammar describe how such atoms or partial molecules can be combined into a chemically valid molecule. During the generation of the grammar, MHG-VAE ensures that the grammar accounts for chemical feasibility constraints such as valency rules, which explains the validity of 100%.

MolGAN54 only partially relies on graphs. Its adaptation to our case of high-octane fuel design is illustrated in Figure 5. The generator tries to directly predict a molecular graph's adjacency matrix with corresponding atom and bond features by using an MLP with a fixed output size, that is, the maximal size of a molecule that can be predicted by MolGAN is bounded. On the other hand, the discriminator is a GNN. One conceptual difference to the VAEs is that MolGAN is able to focus the generation on molecules with desirable properties by using a “reward network”, that is, a third network that encourages the generator to output molecules with high RON and OS. We use our GNN model68 to provide RON and OS predictions such that, in contrast to the VAEs, the training of MolGAN partially depends on the property prediction module. De Cao and Kipf state that while MolGAN generates novel molecules with desirable properties and almost 100% chemical validity, it also outputs many duplicates with only about one in 10 molecules being unique.54

Figure 5. Adapted MolGAN for high-octane fuels, modified from reference 54. The reward network is coupled with our GNN68 for predicting RON and OS values.

3.2 Property prediction

We recently developed a GNN for predicting the RON, MON, and the derived cetane number (DCN) of a wide range of oxygenated and non-oxygenated hydrocarbons,68 for example, (cyclo-) alkanes, (cyclo-) alkenes, alcohols, esters, ethers, aromatics, and ketones. The model architecture is based on higher-order GNNs93 and additionally leverages the increased stability and accuracy of ensemble methods,99, 100 that is, the final property prediction is the average of multiple higher-order GNN predictions. Further, our GNN incorporates multitask learning101, 102 as it was trained on RON, MON, and DCN values simultaneously allowing the model to capture and exploit correlations between octane and cetane numbers.

As described in detail in reference 68, we compiled a data set comprising 335 RON, 318 MON, and 236 DCN values for 505 unique molecules in total to train the GNN. Of this data, 85% was used for training and validation, and 15% was used for testing. Note that for most molecules, both RON and MON values and thus OS were available. The mean absolute prediction error of the GNN model was 4.5 on the RON test set and 4.4 on the MON test set, indicating an overall high prediction quality on par with state-of-the-art QSPR- and ML-based RON and MON prediction models, cf. reference 68. The test sets also contain a few outliers: Six predictions for RON and seven predictions for MON have a deviation >10, which we attribute, similarly to vom Lehn et al.,103 to some of these molecules having unique characteristics that are not well represented in the training data, a relatively small number of data points available with low RON and MON values, and potential disruptive factors in experimental data assembled from different sources.
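
For readers who want to reproduce the two error statistics quoted above on their own data, a minimal sketch is given below. The numbers in the example are made up for illustration and are not values from the article's data set; the function names are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |measured - predicted| over all test molecules."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def count_outliers(y_true: Sequence[float], y_pred: Sequence[float],
                   tol: float = 10.0) -> int:
    """Number of predictions deviating by more than tol octane numbers."""
    return sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) > tol)


# Hypothetical measured and predicted RON values (illustration only).
ron_measured = [100.0, 90.0, 60.0, 110.0]
ron_predicted = [98.0, 93.0, 75.0, 109.0]
mae = mean_absolute_error(ron_measured, ron_predicted)
n_outliers = count_outliers(ron_measured, ron_predicted)
print(mae, n_outliers)
```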

3.3 Optimization

To sample molecules with high RON and OS from the latent space of the generative models, we employ numerical optimization using the RON + OS score predicted by the GNN model as objective function. Specifically, we seek to maximize $\hat{p} = \mathrm{RON} + \mathrm{OS} = 2\,\mathrm{RON} - \mathrm{MON}$ (cf. Equation (6)). We explore two derivative-free stochastic global optimization methods to perform the molecule sampling: A Bayesian optimization algorithm and a genetic algorithm.
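
Since OS is defined as the difference between RON and MON, the score RON + OS equals twice the RON minus the MON. A short sketch makes the arithmetic explicit; the input values are hypothetical and chosen only to keep the numbers round.

```python
def octane_sensitivity(ron: float, mon: float) -> float:
    """OS is the difference between RON and MON."""
    return ron - mon


def octane_objective(ron: float, mon: float) -> float:
    """Objective p_hat = RON + OS, which equals 2 * RON - MON."""
    return ron + octane_sensitivity(ron, mon)


# Hypothetical predictions for one candidate molecule.
score = octane_objective(ron=100.0, mon=90.0)
print(score)
```

In this form, a unit increase in predicted RON raises the score by two, whereas a unit increase in predicted MON lowers it by one.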

Bayesian optimization (BO) is a probabilistic approach for global optimization104 commonly used for optimization of black-box models that are costly to evaluate. Usage of BO is well-established in ML-based CAMD, see, for example, references 51, 52, 75, as well as in chemical engineering applications, for example, the design of experiments in automated reaction platforms.105-107 BO uses a surrogate model, typically a Gaussian process (GP), to map the input variables to the objective. Based on the surrogate model, an acquisition function locates input variable values that have a high potential of maximizing the objective by accounting for both exploitation and exploration. For running BO, the GP is initialized with a set of feasible points. Then, the following steps are repeated until a termination criterion is reached: The acquisition function is optimized to determine the next sampling points, the sampling points are evaluated with respect to the objective function, and the objective values are used to refine the surrogate model. Note that different optimization algorithms can be used for maximizing the acquisition function, cf. reference 104.
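The BO loop described above (initialize, maximize the acquisition function, evaluate, refine the surrogate) can be sketched with NumPy alone. This is a simplified illustration under several assumptions and is not the GPyOpt setup used in this work: the GP has a fixed RBF kernel without hyperparameter fitting, the acquisition function is an upper confidence bound, the acquisition is maximized over random candidates rather than with a gradient-based optimizer, and `toy_objective` stands in for the GNN-predicted RON + OS.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP surrogate."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)  # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def bayesian_optimization(objective, dim, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_init, dim))  # initial feasible points
    y = np.array([objective(p) for p in x])
    for _ in range(n_iter):
        candidates = rng.uniform(-1.0, 1.0, size=(500, dim))
        mean, std = gp_posterior(x, y - y.mean(), candidates)
        ucb = mean + 2.0 * std  # trades off exploitation and exploration
        x_next = candidates[np.argmax(ucb)]
        x = np.vstack([x, x_next])
        y = np.append(y, objective(x_next))  # new data refines the surrogate
    best = int(np.argmax(y))
    return x[best], float(y[best])


def toy_objective(z):
    """Hypothetical smooth objective with its maximum at z = 0.3."""
    return -float(np.sum((z - 0.3) ** 2))


best_x, best_y = bayesian_optimization(toy_objective, dim=2)
```

The linear solve in `gp_posterior` is the step whose cost grows cubically with the number of sampled points.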

A genetic algorithm (GA) is a meta-heuristic, population-based approach for global optimization that is inspired by evolutionary theory.108, 109 It is typically applied to optimization problems with cheap and fast evaluations of the objective function. In GAs, a set of feasible points is called a population. Each feasible point has genes corresponding to specific values for the input variables of the optimization problem and is assigned a fitness related to the objective value. To solve an optimization problem, an initial population evolves in an iterative manner over multiple generations by promoting points with high fitness and using evolutionary heuristics, for example, combining genes of high fitness points, to replace points with low fitness. We choose the fitness to be RON + OS to directly optimize for high-octane ratings.
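A minimal GA with selection, crossover, and mutation can be sketched as follows. It is an illustration under assumptions, not the `geneticalgorithm` package used in this work: the population size, mutation width, and the `toy_fitness` function (standing in for the predicted RON + OS of a decoded latent vector) are hypothetical.

```python
import random


def genetic_algorithm(fitness, dim, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # promote high-fitness points
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, dim) if dim > 1 else 0
            child = mother[:cut] + father[cut:]  # crossover: combine genes
            gene = rng.randrange(dim)
            child[gene] += rng.gauss(0.0, 0.1)  # mutation of one gene
            children.append(child)
        population = parents + children  # children replace low-fitness points
    return max(population, key=fitness)


def toy_fitness(genes):
    """Hypothetical fitness with its maximum at 0.3 in every gene."""
    return -sum((g - 0.3) ** 2 for g in genes)


best = genetic_algorithm(toy_fitness, dim=4)
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.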

A major challenge in ML-based CAMD is the high dimensionality of the generators' latent space which typically requires a large number of sampling points for optimization, for example, in case of our generative models, we have latent space dimensionalities of 56 (JT-VAE),51 72 (MHG-VAE),52 and 32 (MolGAN).54 BO, however, employs a GP as surrogate model that in standard form has cubic scaling in complexity with respect to the number of sampling data points. Following the strategy by Kajino,52 we thus use PCA to reduce the dimensions of both the JT-VAE and the MHG-VAE before performing BO. Since the execution time of the evolutionary-based heuristics in the GA does not suffer from a high number of sampling points, we run the GA without dimensionality reduction. Note that the effects of PCA-based dimensionality reduction on the obtained molecules as well as the use of other mitigation strategies, such as reduction of the latent dimension within the generator or modification of BO for high-dimensional problems, see, for example, references 104, 110-113, are beyond the scope of this work.
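The PCA step can be sketched as a projection onto the leading principal axes that together reach a target explained variance ratio. The snippet below is an assumption-laden illustration using NumPy's SVD: the function names, the 99.9% default, and the synthetic latent vectors are chosen here for demonstration and are not taken from the original code.

```python
import numpy as np


def fit_pca(latent, explained_variance=0.999):
    """Return the mean and the principal axes covering the requested ratio."""
    mean = latent.mean(axis=0)
    _, singular_values, axes = np.linalg.svd(latent - mean, full_matrices=False)
    ratio = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    n_components = int(np.searchsorted(ratio, explained_variance) + 1)
    return mean, axes[:n_components]


def encode(latent, mean, axes):
    """Project latent vectors onto the reduced space searched by BO."""
    return (latent - mean) @ axes.T


def decode(reduced, mean, axes):
    """Map reduced coordinates back to the generator's latent space."""
    return reduced @ axes + mean


rng = np.random.default_rng(0)
# Hypothetical latent vectors: 56 dimensions, but only 10 independent directions.
latent = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 56))
mean, axes = fit_pca(latent)
reduced = encode(latent, mean, axes)
restored = decode(reduced, mean, axes)
```

In a design loop of this kind, the optimizer would propose points in the reduced space, and `decode` would map them back before the generator's decoder produces a molecule.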

3.4 Applicability domain

The AD of a model is a well-established concept in QSPR/QSAR modeling and is based on the general assumption that the prediction model would provide most reliable predictions for molecules that are similar to the ones seen during training.114-117 Molecular similarity is usually assessed by means of a distance metric, for example, the Euclidean distance between the descriptor values of two molecules.115, 118 For molecular property prediction with GNNs, determination of the AD is largely unexplored. Only very recently, first approaches to quantify the AD of GNNs based on uncertainty quantification methods were proposed.119-121 Conceptually, defining the AD of a GNN requires handling the varying input sizes of molecular graphs and measuring the degree of similarity between different graphs. In this work, we address these challenges by extending our recently developed AD approach based on one-class support vector machines (SVMs)96 to GNNs. A one-class SVM is an ML model that can be used to identify outliers by classifying whether an input is similar or dissimilar to the training data. We train one-class SVMs on the molecular fingerprint of the GNN (cf. Figure 3) to determine the GNN's AD. We then restrict our molecular design loop to molecules that are accepted by the SVM (cf. Equation (6)), which formally means $\mathrm{AD}(\cdot) = \mathrm{SVM}_{\mathrm{AD}}(h_{\mathrm{FP,train}}) \geq 0$, where $h_{\mathrm{FP,train}}$ is the molecular fingerprint computed by the GNN and $\mathrm{SVM}_{\mathrm{AD}}$ denotes the trained SVM. The underlying idea for the AD is that the GNN computes similar molecular fingerprints whenever two molecules are structurally similar. Since our prediction model is an ensemble of multiple GNNs, we train one SVM for each GNN model in the ensemble and apply a majority vote. That is, each SVM $j$ evaluates $\mathrm{SVM}_{\mathrm{AD},j}(h_{\mathrm{FP}})$ and returns 1 if the molecule lies within the AD or −1 if not. Subsequently, we sum up the votes to decide if the prediction of the GNN ensemble (EL) for a new molecule is classified as reliable, that is, $\mathrm{SVM}_{\mathrm{AD\text{-}EL}}(h_{\mathrm{FP}}) = \sum_j \mathrm{SVM}_{\mathrm{AD},j}(h_{\mathrm{FP}}) \overset{!}{>} 0$. Note that further details on the AD are described in the Supporting Information S1.
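The majority vote over the ensemble can be sketched in a few lines. The function name `ensemble_in_domain` and the vote lists are hypothetical; in practice each vote would come from a one-class SVM evaluated on the fingerprint of the corresponding GNN.

```python
def ensemble_in_domain(votes):
    """votes: one entry per SVM, +1 (inside the AD) or -1 (outside)."""
    if any(v not in (1, -1) for v in votes):
        raise ValueError("each vote must be +1 or -1")
    return sum(votes) > 0  # positive sum means a majority accepts


# Hypothetical votes from a five-member ensemble.
print(ensemble_in_domain([1, 1, -1, 1, -1]))   # True: 3 of 5 accept
print(ensemble_in_domain([1, -1, -1, 1, -1]))  # False: 2 of 5 accept
```

With an even number of ensemble members, a tie sums to zero and is rejected under the strict inequality.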

3.5 Implementation and hyperparameters

We implement our graph-ML CAMD framework in Python with the cheminformatic package RDKit122 and the ML frameworks PyTorch and TensorFlow, accounting for the different implementations of the generators, and provide our code open-source, see reference 67. Moreover, we follow the implementation of the MHG-VAE by Kajino52 and use Luigi123 to automate computational experiments. For the three generators, JT-VAE,51 MHG-VAE,52 and MolGAN,54 we use the original implementations and hyperparameters as provided in the respective study and code repository and only extend the code to work in our framework. We train the molecule generation models on all HCO-molecules in the QM9 data set,124, 125 that is, all molecules within QM9 that contain exclusively hydrogen, carbon, and oxygen atoms. QM9 contains approximately 50,000 HCO-molecules from various molecular classes. We use the original implementation and model parameters of our GNN68 which is based on pytorch-geometric.126 The SVMs for the AD are implemented with scikit-learn127 building on our AD study.96 For BO, we use GPyOpt.128 Note that we did not attempt deterministic global optimization of the acquisition function within the BO, for example, by using our tool MeLOn,129, 130 due to the high dimensionality (cf. Section 3.3) and associated high computational cost. Thus, we use the local optimization algorithm L-BFGS131 implemented in GPyOpt.128 As GA, we use the Python package geneticalgorithm.132 For both BO and GA, we apply default settings. We follow the study of MHG-VAE by Kajino52 and reduce the dimensionality of the latent space within the VAEs by means of PCA aiming for an explained variance ratio of 99.9% (JT-VAE: from 56 to 41, MHG-VAE: from 72 to 38) before performing BO. Further details on the hyperparameter choice can be found in the Supporting Information S1. 
We run all computations on the HPC-cluster (CLAIX-2018) of RWTH Aachen University using one Supermicro 1029GQ-TVRT-01 node of an Intel Platinum 8160 core with 192 GB RAM, of which we used at most 8 GB, plus one NVIDIA Volta V100-SXM2 16 GB GPU. For reproducibility, we fixed random seeds for training the models and running the design loop that we provide with our code.

4 RESULTS AND DISCUSSION

We first present the computational results of our graph-based CAMD of high-octane fuels and then provide a discussion of the top candidates to demonstrate both strengths and potential weaknesses of the fully data-driven design approach.

4.1 CAMD results

We test all combinations of the three generator models (JT-VAE, MHG-VAE, and MolGAN) and the two optimization approaches (BO and GA) as well as two different stopping criteria (SC), that is, a limit on the number of candidate molecules generated (SC#molecs) and an upper limit on the wall-clock run time (SCtime). For SC#molecs, we consider both the number of unique molecules (1000) and the total number of molecules (2000) generated, as the number of duplicates can otherwise cause an unlimited run time. In the SC#molecs setting, the design loop will typically run for 0.5–8 h. The run time limit in SCtime is set to 12 h to investigate the effects of keeping the design loop running for a longer time. Furthermore, we distinguish between runs with and without the AD. Each design loop configuration is run five times (initialized with different random seeds) and the results are aggregated.
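The need for two limits in SC#molecs can be illustrated with a small loop. This is a sketch under assumptions: `run_design_loop` and `propose` are hypothetical names, and the cyclic generator stands in for a generator/optimizer pair that keeps proposing duplicates.

```python
import itertools


def run_design_loop(propose, max_unique=1000, max_total=2000):
    """Collect candidates until either molecule limit is reached."""
    unique, total = set(), 0
    while len(unique) < max_unique and total < max_total:
        unique.add(propose())
        total += 1
    return unique, total


# Hypothetical generator that only ever proposes three distinct SMILES strings.
stream = itertools.cycle(["CC", "C1CC1", "COC(C)(C)C"])
unique, total = run_design_loop(lambda: next(stream))
print(len(unique), total)  # 3 2000
```

Without the cap on the total count, this loop would never terminate, because the limit of 1000 unique molecules cannot be reached.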

The top 12 molecules identified with SC#molecs and active AD for the respective generators are shown in Figure 6 together with the predicted RON and OS values. The results demonstrate that the generators successfully propose molecules with high predicted RON and OS. Moreover, the top molecules are from a variety of different molecular classes, for example, ethers, alcohols, and ketones, some of which are known to contain promising SI engine fuel candidates.2 The majority of molecules has at least one oxygen atom. Almost all top molecules generated by MolGAN include a cyclic structure, often associated with a cyclopropane feature, which we attribute to the high RON and OS for components with a cyclopropane substructure in the training set of the GNN model.133 Most top molecules generated by the two VAE models include strongly branched non-cyclic components, often in combination with one or two oxygen atoms, which are also known for high RON and OS values. Both VAE models generate the popular octane enhancers MTBE and ETBE, and some related small, branched ether structures. The JT-VAE also identifies ethanol, the prototype biofuel for SI engines.

FIGURE 6. Top 12 candidates identified by the three different generator models with stopping criterion SC#molecs (max. 1000 unique molecules or max. 2000 total molecules) and applicability domain. RON and OS values are predicted by the graph neural network.68

Table 1 shows the statistics of all the runs with and without the AD, whereby each entry corresponds to the aggregated results over five runs. Both the maximum and the mean predicted RON + OS are typically lower if the AD is used. In most cases, the total number of molecules generated is also lower if the AD is considered. The observation that the AD often reduces the exploration performance is expected and in fact intended, as the AD prohibits the generators from exploring structures that are far from the training data and would thus require strong extrapolation of the GNN model. We want to emphasize that we find the generators to mainly produce chemically valid molecules. If a generated structure is invalid, for example, when MolGAN generates disconnected substructures, the generated molecule is dropped so that effectively no chemically invalid structures are provided to the GNN and AD. Note that generated molecules, which are considered highly dissimilar to the training molecules by the AD, can still be chemically valid. We show examples of such chemically valid molecules well outside the GNN's AD in Figure 7, where the top candidates identified by the two VAEs with SCtime are depicted; we refer to the Supporting Information S1 for further examples.

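TABLE 1. Results of optimization over five runs each (Max and Mean top 20 refer to the predicted RON + OS)

| Stopping criterion | Statistic | JT-VAE BO | JT-VAE BO + AD | JT-VAE GA | JT-VAE GA + AD | MHG-VAE BO | MHG-VAE BO + AD | MHG-VAE GA | MHG-VAE GA + AD | MolGAN BO | MolGAN BO + AD | MolGAN GA | MolGAN GA + AD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SC#molecs (1000 unique molecules, 2000 total), #runs: 5 | Max | 205 | 130 | 129 | 130 | 138 | 129 | 136 | 131 | 121 | 121 | 121 | 121 |
| | Mean top 20 | 181 | 125 | 125 | 126 | 131 | 125 | 132 | 128 | 110 | 111 | 116 | 116 |
| | # unique mol. | 2390 | 1347 | 3472 | 3712 | 4671 | 4308 | 4683 | 4427 | 21 | 21 | 46 | 46 |
| | # promising mol. | 117 | 10 | 15 | 19 | 45 | 9 | 52 | 30 | 0 | 0 | 0 | 0 |
| SCtime (12 h run time), #runs: 5 | Max | 205 | 130 | 187 | 131 | 138 | 129 | 145 | 131 | 121 | 121 | 121 | 121 |
| | Mean top 20 | 183 | 126 | 180 | 130 | 133 | 126 | 140 | 129 | 111 | 112 | 118 | 118 |
| | # unique mol. | 2996 | 1935 | 109,830 | 80,818 | 6710 | 7081 | 55,255 | 46,989 | 22 | 23 | 193 | 172 |
| | # promising mol. | 140 | 12 | 2096 | 376 | 104 | 15 | 678 | 142 | 0 | 0 | 0 | 0 |

Note: A molecule is considered promising if both RON > 110 and OS > 10. Runs with applicability domain are indicated by +AD.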
FIGURE 7. Top five candidates identified by the two VAE generator models with stopping criterion SCtime (12 h run time) and without applicability domain. All RON and OS values are GNN predictions.68

When visually inspecting the top molecules from the design runs without AD, we find that the obtained molecules are typically huge, strongly branched hydrocarbons, for example, with up to almost 50 carbon atoms. As such compounds are presumably solid at room temperature, they are not suitable as fuels. To avoid the formation of solids within the fuel blend, a constraint on the melting point could be included in the design loop. However, the melting point can only serve as a rough proxy for the suitability of a compound as an octane booster, since miscibility and volatility also depend on the composition of the base fuel and the blending ratio.42, 134 Some of the proposed large molecules might be soluble in a fuel blend, which could be evaluated in further investigations of mixture properties, but is beyond the scope of this work. Furthermore, the RON and OS predictions for the molecules identified with the JT-VAE without AD (cf. Figure 7A) are visibly higher than the maximum RON (of 120 for 1,3,5-trimethylbenzene68, 135) and the maximum OS (of 36 for 1,4-cyclohexadiene68, 135) of the data used to train the GNN prediction model, indicating strong extrapolation. In the following, we therefore present and discuss only those results that have been obtained with the AD.

We observe that the VAE generators predict molecules with a maximum RON + OS of about 130 while MolGAN achieves a maximum of only 121 (cf. Table 1). The maximum RON + OS values of slightly above 130 for the two VAE models are in good agreement with known high-octane fuels such as MTBE with its experimentally validated RON + OS of 135. The encouraging performance of both VAE generators thus shows the general feasibility of our graph-ML CAMD framework utilizing the SVM-based AD.

To further compare the different generator and optimization combinations, we analyze the number of distinct molecules generated as well as the number of molecules with promising ignition properties, that is, the molecules with both a predicted RON > 110 and a predicted OS > 10. Both VAEs find a large number of distinct molecules irrespective of the employed stopping criteria (cf. Table 1). Specifically for SC#molecs, both VAEs generate more than 3500 unique molecules out of 5000 maximally possible unique molecules (1000 unique molecules each over five runs). This means that not only do the VAEs find a large number of distinct molecules in each run, but the identified molecules also vary greatly between different runs, thus leading to an overall small number of duplicates. In contrast, MolGAN mainly generates duplicates of which none are considered promising (cf. Table 1). Comparing the results for SC#molecs and SCtime (cf. Table 1), it can be seen that the VAE-GA combinations significantly increase the number of both explored and promising candidates with longer run time. Apparently, this observation does not extend to BO, with one possible explanation being that BO becomes inherently slower as more data points are added to the surrogate model, thereby reducing the number of predictions per unit time, whereas the corresponding rate remains unchanged in the GA (cf. Section 3.3).
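The criterion for promising molecules used in this comparison can be written as a simple filter. The sketch below uses made-up predictions with placeholder keys; the name `is_promising` and all numerical values are hypothetical and only illustrate the two strict thresholds.

```python
def is_promising(ron, octane_sensitivity):
    """Promising: predicted RON > 110 and predicted OS > 10 (both strict)."""
    return ron > 110 and octane_sensitivity > 10


# Hypothetical (RON, OS) predictions for three placeholder molecules.
predictions = {
    "mol_a": (115.2, 14.1),
    "mol_b": (108.0, 12.0),  # fails the RON threshold
    "mol_c": (112.0, 9.5),   # fails the OS threshold
}
promising = [name for name, (ron, os_) in predictions.items()
             if is_promising(ron, os_)]
print(promising)  # ['mol_a']
```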

The predicted RON and OS values of all promising molecules obtained with the two stopping criteria are shown in Figure 8. We also highlight those molecules identified in the SC#molecs setting that are commercially available at chemical suppliers. Commercial availability was assessed by a manual search on Sigma-Aldrich136 and Chemspider137 websites without imposing a price limit but only including those molecules with an explicitly stated price; we did not search for the lowest price on different websites. For SCtime, Figure 8B, the effort for a manual search was considered disproportionate due to the high number of promising candidates. We further indicate molecules with high predicted RON + OS in the QM9 database124, 125 that is used for training the generative models; additional QM9 statistics are provided in the Supporting Information S1. Figure 8 demonstrates that the graph-ML CAMD framework is able to generate molecules with high predicted RON and high predicted OS that are not in the QM9 database. This observation is even more pronounced for SCtime (cf. Figure 8B). The ability of the generator models to generalize therefore allows novel molecules to be explored for further investigation.

FIGURE 8. Promising candidates (predicted RON > 110 and OS > 10). Commercial availability (red crosses) determined by manual search on Sigma-Aldrich and Chemspider websites136, 137

4.2 Discussion of top candidates

In the discussion of the top molecules, we restrict our analysis to the promising molecules (RON > 110 and OS > 10) generated using SC#molecs, as the number of molecules generated with SCtime is very large; we refer to the Supporting Information S1 for a detailed list of all generated promising molecules. The top molecules that are also commercially available are listed in Table 2, including RON and OS predictions, literature values for RON and OS (where available), price category, and the respective combinations of generator and optimizer that identified the molecule.

TABLE 2. All 16 commercially available molecules with predicted RON > 110 and OS > 10 (identified in SC#molecs setting and active applicability domain)

| Class | SMILES | Name | RON | OS | Price category | Generator (optimizer) |
|---|---|---|---|---|---|---|
| Alkanes | C1CC1 | cyclopropane | 110 | 16 | Medium | JT (BO, GA), MHG (BO, GA) |
| | CC | ethane | 110 (111, ref. 135) | 12 (11, ref. 135) | Low | JT (BO, GA) |
| Aromatics | CCc1cccc(C)c1 | 3-ethyltoluene | 110 (112, ref. 135) | 11 (12, ref. 135) | High | JT (GA) |
| Ethers | COC(C)(C)C | MTBE | 115 (118, ref. 138) | 14 (17, ref. 138) | Low | JT (BO, GA), MHG (BO, GA) |
| | CCOC(C)(C)C | ETBE | 114 (118, ref. 139) | 14 (16, ref. 139) | Medium | JT (BO, GA), MHG (GA) |
| | CC(C)OC(C)(C)C | tert-butyl isopropyl ether | 114 | 13 | High | JT (GA), MHG (GA) |
| Aldehydes | CC(C=O)C(C)(C)C | 2,3,3-trimethylbutanal | 111 | 12 | High | MHG (GA) |
| | CC(C)(C)C=O | trimethylacetaldehyde | 111 | 11 | Medium | JT (GA), MHG (BO) |
| Polyfunctional (aldehyde + ether) | CC(C)(C)OCC=O | tert-butoxyacetaldehyde | 116 | 15 | High | MHG (GA) |
| | CCOC(C)(C)C=O | 2-ethoxy-2-methylpropanal | 114 | 13 | High | MHG (GA) |
| | COC(C)(C)C=O | 2-methoxy-2-methylpropanal | 116 | 11 | High | MHG (GA) |
| | CC(C)OC(C)(C)C=O | 2-methyl-2-propan-2-yloxypropanal | 114 | 12 | High | MHG (GA) |
| | COC(C)C=O | 2-methoxypropanal | 112 | 11 | High | MHG (BO, GA) |
| Polyfunctional (ketone + ether) | COC(C)(C)C(C)=O | 3-methoxy-3-methyl-2-butanone | 113 | 11 | High | MHG (GA) |
| | COC(C)C(=O)C(C)(C)C | 4-methoxy-2,2-dimethylpentan-3-one | 111 | 12 | High | MHG (GA) |
| Acetals | COC(C)(C)OC | 2,2-dimethoxypropane | 116 | 14 | Low | JT (GA) |

Note: RON and OS data available in the literature are stated in parentheses. Prices are categorized based on data from different chemical suppliers136, 140-142: ≤1000 $/l (low), >1000 $/l and ≤10,000 $/l (medium), >10,000 $/l (high).

4.2.1 Promising classes of molecules

We find both pure hydrocarbons and oxygenated hydrocarbons (cf. Table 2), including molecules already in use as octane boosters and molecules that constitute interesting candidates for further experimental investigation. The two identified alkanes, ethane and cyclopropane, are gaseous under ambient conditions, whereas the one aromatic hydrocarbon, 3-ethyltoluene, is liquid. The known RON + OS scores from literature for ethane and 3-ethyltoluene of 122 and 124, respectively, are in good agreement with the GNN predictions. We want to emphasize that gaseous compounds, such as ethane and cyclopropane, are difficult to implement as octane boosters. To prevent gases within the candidate list, one could include boiling point constraints in the design loop. However, the normal boiling point is, similar to the melting point discussed at the beginning of this section, only a rough preselection criterion, since the miscibility and volatility of a potential octane booster in a fuel blend strongly depend on the overall blend composition. Besides alkanes, three ethers are identified, including methyl tert-butyl ether (MTBE) and ethyl tert-butyl ether (ETBE) that are used as octane boosters in practical applications.16, 17 Their experimentally determined RON + OS scores of 135 and 134 (references 138, 139) are slightly higher than the predicted scores. Furthermore, molecules from the class of aldehydes are identified. It has been found, however, that the formation of aldehydes during the combustion process of high-octane, oxygenated hydrocarbons results in increased exhaust emissions,143 indicating a lower suitability of aldehydes as fuels. Polyfunctional molecules with an aldehyde and an ether group are generated as well, which also entail the problem of aldehyde emissions. 
Further polyfunctional molecules containing an ether group and a ketone group are generated, with ketones being prominent high-octane fuels.144, 145 Most of the molecules containing an ether, a ketone, and/or an aldehyde functionality have a compact, branched structure with similarities to MTBE and ETBE, making them interesting high-octane fuel candidates; however, they also have a high price, hindering experimental investigation.

The last top candidate in Table 2, namely 2,2-dimethoxypropane (2,2-DMP), belongs to the class of acetals. It is a compact structure similar to ETBE, with the difference being that one carbon atom is replaced by a second oxygen atom. 2,2-DMP also has a low price, making it an attractive target for experimental investigation. A DCN measurement of 31 is known from literature146 which, however, is not suggestive of a very high RON, as molecules with RON > 110 typically correspond to DCN values below 10, cf. references 2, 147. Our high RON + OS prediction (cf. Table 2), however, is consistent with the RON + OS value of 143 stated in a recent study by Li et al.18 who used an ML-QSPR prediction model combining both ML and a group contribution approach. Another ML-based QSPR model for RON and OS recently developed by vom Lehn et al.103 likewise predicts a high RON + OS value of 156.

4.2.2 Comparison to previous fuel design studies

Our commercially available top candidates (cf. Table 2) generally match the molecular classes identified in previous fuel design/screening studies for SI engine fuels, for example, in references 2, 18, 40, 148. Specifically, prominent molecular classes from previous studies include the herein identified groups of ethers,2, 18, 148 ketones,2, 18, 40, 148 aromatics,148 aldehydes,18, 40 alkanes,148 and acetals.18 Interestingly, our top candidates do not include any esters, alcohols, and furans that have often been identified in the literature.2, 18, 148 When inspecting all molecules generated in our design loop runs with SC#molecs and with AD, we indeed find esters (e.g., methyl acetate), alcohols (e.g., ethanol and methanol), as well as furans (e.g., 2-methylfuran). However, these are not considered top candidates as predicted OS is below 10 for most esters and predicted RON is slightly below 110 in case of furans and alcohols. Such RON and OS predictions are generally in accordance with the literature values for representative molecules of these classes, cf. references 42, 68, 135, 149, 150.

The polyfunctional molecules identified in our study are hardly discussed in the literature. It should be noted that the availability of experimental RON and MON values for polyfunctional molecules is very limited, indicating a high uncertainty in the GNN predictions.

The generated top candidate of acetals, 2,2-DMP, has also been identified in the fuel screening by Li et al.18 and will be investigated experimentally in the following.

4.2.3 Experimental assessment of 2,2-DMP

Experimental investigation of the RON and MON of 2,2-DMP was conducted by an external company in dedicated test engines according to the DIN EN ISO 5164 (reference 151) and DIN EN ISO 5163 (reference 152) standards, respectively. Measurement of RON and MON of pure 2,2-DMP, however, could not be performed. Instead, blends of 2,2-DMP with 90%, 80%, and 60% (v/v) of gasoline were investigated. The extrapolation to pure component values yielded a RON of 91.75 (±0.25) and a MON of 87.27 (±0.3), hence a RON + OS score of about 96, indicating a strong misprediction by our GNN model as well as the models by Li et al.18 and by vom Lehn et al.103 To further clarify the ignition properties of 2,2-DMP, we experimentally measured ignition delay times (IDT) in a rapid compression machine (RCM)153, 154 and compared the chemical reactivity of 2,2-DMP to that of a typical RON95E10 pump station fuel. IDT measurements for 2,2-DMP were performed at an end-of-compression pressure of 20 bar for a stoichiometric mixture and with a nitrogen-to-oxygen dilution ratio of 3.762 in the temperature range of 647–793 K. Details on the RCM measurements can be found in the Supporting Information S1. The ignition took place via a two-stage process in the investigated temperature regime, indicating strong low-temperature chemistry (cf. Figure 9) that is not representative of a high-octane fuel. Compared to the RON95E10 fuel, 2,2-DMP shows a distinctly higher reactivity between 647 and 750 K, pointing toward a lower knock resistance and thus a lower RON value. The RCM results suggest a slightly worse knock resistance of 2,2-DMP compared to RON95E10 pump station fuel, supporting the extrapolated RON and MON measurements.
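As a worked check of the score quoted above, the extrapolated pure-component values can be inserted into OS = RON − MON and compared with the GNN prediction for 2,2-DMP from Table 2. The subscripts exp and pred are labels introduced here for illustration, the measurement uncertainties are ignored, and the predicted values are the rounded table entries.

```latex
\begin{align*}
\mathrm{OS}_{\mathrm{exp}} &= \mathrm{RON} - \mathrm{MON} = 91.75 - 87.27 = 4.48,\\
(\mathrm{RON} + \mathrm{OS})_{\mathrm{exp}} &= 91.75 + 4.48 = 96.23,\\
(\mathrm{RON} + \mathrm{OS})_{\mathrm{pred}} &= 116 + 14 = 130.
\end{align*}
```

The predicted score thus exceeds the experimentally derived one by roughly 34 units.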

FIGURE 9. Measured ignition delay time in a rapid compression machine for 2,2-dimethoxypropane and a commercially available RON95E10 pump station fuel. The error bars indicate ±20% scatter of the measured ignition delay time. The yellow line corresponds to a threefold Arrhenius model fit to the RON95E10 ignition delay times.155

The case of 2,2-DMP shows the potential weaknesses of a fully data-driven approach in a data-scarce environment. We attribute the large prediction error of our GNN as well as those of the models by Li et al.18 and by vom Lehn et al.103 to the comparatively little training data available for RON and MON modeling. Specifically, our RON and MON training database includes just five ethers, a single acetal (not 2,2-DMP), no aldehydes, eight ketones, and only two molecules with more than one type of oxygen functionality (cf. reference 68). Similar data limitations apply to the other RON and MON prediction models,18, 103 explaining their similarly poor predictions in case of 2,2-DMP. Furthermore, we want to stress the fact that no RON and MON values for aldehydes are included in the training data, so our GNN may not sufficiently distinguish between aldehydes and ketones. The RON + OS predictions of the identified molecules with an aldehyde group are therefore considered subject to large uncertainty. In the case of 2,2-DMP, a DCN data point was available and used in the training of our multitask GNN for simultaneous RON, MON, and DCN prediction (cf. Section 3.2). As expected, our AD approach based on majority voting (cf. Section 3.4) considers 2,2-DMP within the region of reliable predictions as it was part of the training data. Yet, only 31 out of 40 SVMs voted for 2,2-DMP. Increasing the AD consensus level, for example, to 80% instead of 50%, may provide some protection against such strong mispredictions, at the cost of a smaller search space. A systematic investigation of the relationship between the AD consensus level and the prediction accuracy for molecules proposed by the design loop, however, is beyond the scope of this work. The weak spots of prediction models for fuel ignition quality remain a huge challenge for model-based fuel design, even when utilizing state-of-the-art ML68, 103 and an applicability domain. 
Therefore, acquiring more training data is absolutely crucial.
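The effect of the consensus level on the 2,2-DMP case can be checked with a short sketch. The function `within_domain` is hypothetical, and treating the consensus level as a strict lower bound on the fraction of accepting SVMs is an assumption made here for illustration.

```python
def within_domain(n_accept, n_models, consensus=0.5):
    """Accept if more than `consensus` of the SVMs vote 'inside the AD'."""
    return n_accept / n_models > consensus


# 31 of 40 SVMs accepted 2,2-DMP, that is, 77.5% of the ensemble.
print(within_domain(31, 40, consensus=0.5))  # True
print(within_domain(31, 40, consensus=0.8))  # False
```

Under this reading, an 80% consensus level would have excluded 2,2-DMP from the design loop, whereas the simple majority vote accepts it.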

5 CONCLUSION

We propose a fully data-driven CAMD approach based on recent methods from graph-ML for the identification of molecules with desired ignition characteristics for modern SI engines. Our graph-ML CAMD framework utilizes a representation of molecules as graphs and incorporates three modules for building a molecular design loop: (1) molecule generation from a continuous molecular space with generative graph-ML, (2) molecular property prediction through GNNs, and (3) optimization for strategic sampling from the continuous molecular space to find molecules with high predicted RON + OS. The modular structure enables the exploration of different ML models in combination with different optimization approaches. We additionally present a novel approach to identify the applicability domain (AD) of GNN models for molecular property prediction. By predicting promising high-octane fuel molecules in a fully data-driven fashion, our study exemplifies how recent developments in ML can be utilized for CAMD and its automation.

The top molecular candidates identified with our graph-ML CAMD framework are from well-known molecular classes for high-octane fuels, for example, ethers and ketones, and include both well-established components like MTBE and ETBE as well as new promising candidates for further experimental investigation. The comparison of different generative graph-ML models, namely JT-VAE,51 MHG-VAE,52 and MolGAN,54 in combination with different optimization approaches, BO and GA, shows that the choice of the generative model and optimization strategy influences the number and type of identified candidate molecules. Both VAEs provide a diverse continuous molecular space with a large number of potential molecules, while MolGAN generates a comparatively low number of candidates and yields lower target property values. We conclude that the GA is well suited for exploring large portions of the continuous molecular space of the generative models, especially when working with high dimensions where BO struggles but still finds some promising candidates. Our AD approach additionally enables us to focus the exploration on candidates with presumably more accurate predictions. The experimental investigation of one candidate within the AD, namely 2,2-dimethoxypropane, shows lower RON and OS values than predicted by our GNN model, demonstrating the limitations of CAMD in a comparatively data-scarce environment. We thereby highlight the importance of experimental validation to fuel design and the need for further RON and OS training data. Furthermore, the correlation between the AD threshold, that is, the consensus level, and the prediction accuracy for molecules proposed by the design loop should be investigated.

Future work could include additional physical and chemical properties in the design, for example, melting point, boiling point, vapor pressure, toxicity, or viscosity, similar to previous studies.2, 18, 40, 148 The framework, in principle, is not bound to fuel design as application but could also be applied to other CAMD applications such as drug discovery, design of catalysts, pesticides, and so forth.

AUTHOR CONTRIBUTIONS

Jan G. Rittig: Conceptualization (equal); data curation (equal); formal analysis (equal); funding acquisition (supporting); investigation (equal); methodology (equal); software (equal); validation (equal); visualization (lead); writing – original draft (lead); writing – review and editing (supporting). Martin Ritzert: Conceptualization (equal); data curation (equal); formal analysis (equal); investigation (equal); methodology (equal); software (equal); validation (equal); visualization (supporting); writing – original draft (supporting); writing – review and editing (supporting). Artur M. Schweidtmann: Conceptualization (equal); funding acquisition (supporting); methodology (supporting); supervision (supporting); writing – review and editing (supporting). Stefanie Winkler: Data curation (supporting); formal analysis (supporting); methodology (equal); software (equal); validation (supporting); writing – review and editing (supporting). Jana M. Weber: Methodology (supporting); writing – review and editing (supporting). Philipp Morsch: Investigation (equal); writing – original draft (supporting); writing – review and editing (supporting). K. Alexander Heufer: Funding acquisition (supporting); supervision (supporting); writing – review and editing (supporting). Martin Grohe: Conceptualization (supporting); funding acquisition (equal); supervision (equal); writing – review and editing (supporting). Alexander Mitsos: Conceptualization (supporting); funding acquisition (equal); supervision (equal); writing – review and editing (supporting). Manuel Dahmen: Conceptualization (supporting); formal analysis (equal); supervision (equal); writing – review and editing (lead).

ACKNOWLEDGMENTS

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—466417970—within the Priority Programme “SPP 2331: Machine Learning in Chemical Engineering.” It was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy—Cluster of Excellence 2186 “The Fuel Science Center.” This work was also performed as part of the Helmholtz School for Data Science in Life, Earth and Energy (HDS-LEE). Further, this work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the GRK 2236 UnRAVel. Simulations were performed with computing resources granted by RWTH Aachen University under project “rwth0664.” The authors thank Florian vom Lehn for providing RON and OS predictions for 2,2-dimethoxypropane by his model. MD received funding from the Helmholtz Association of German Research Centres. Open Access funding enabled and organized by Projekt DEAL.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

Endnotes

  1. Code is openly available, see reference 67.
  2. Found by GA when optimizing for OS only.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in our GitLab repository “Graph machine learning for design of high-octane fuels” at https://git.rwth-aachen.de/avt-svt/public/graph_ML_fuel_design, reference 67.