A machine learning framework for X-ray scattering data analysis

21 October 2022, 08:09

Magnus Röding, Piotr Tomaszewski, Shun Yu, Markus Borg, Jerk Rönnols

Across the sciences, data from many different measurement and characterization techniques are analyzed by fitting computationally heavy models to curves. One example is small-angle X-ray scattering (SAXS), a commonly used technique for characterizing materials at the nanometer scale and a critical part of the toolbox in RISE’s involvement with large-scale research infrastructures (LSRI) such as MAX IV in Lund.

Conventionally, the user must either use simplified models with a limited computational workload or accept lengthy run times for data analysis. However, there is a third route: using machine learning methods to accelerate the analysis and circumvent computationally heavy curve fitting. In the recent paper Machine learning-accelerated small-angle X-ray scattering analysis of disordered two- and three-phase materials, we develop a proof of concept for predicting material parameters using machine learning-based regression.

Statistical models for random materials

We investigate a class of computer models for disordered (random) material structures, based on so-called Gaussian random fields, a type of statistical model frequently used in materials science to represent a wide range of random materials, for example polymer coatings on pharmaceutical tablets, lithium-ion batteries, and porous metal alloys for energy storage.
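To make the idea concrete, here is a minimal sketch of how a random two-phase structure could be derived from a Gaussian random field. It is an illustration under stated assumptions, not the model from the paper: the 2D periodic grid, the Gaussian-shaped spectral filter, the thresholding step, and the names gaussian_random_field, two_phase_structure, length_scale and solid_fraction are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_random_field(n=64, length_scale=8.0, seed=0):
    """Sample a periodic 2D Gaussian random field by filtering white noise.

    Assumption: a Gaussian-shaped spectral filter; length_scale is in pixels.
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((n, n))
    # Angular wavenumbers of the FFT grid (radians per pixel).
    k = 2.0 * np.pi * np.fft.fftfreq(n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    # Smooth the noise in Fourier space; larger length_scale gives coarser features.
    filt = np.exp(-0.25 * k2 * length_scale**2)
    field = np.fft.ifft2(np.fft.fft2(white) * filt).real
    # Standardise to zero mean and unit variance.
    return (field - field.mean()) / field.std()


def two_phase_structure(field, solid_fraction=0.5):
    """Threshold the field so that about solid_fraction of the pixels are solid."""
    level = np.quantile(field, 1.0 - solid_fraction)
    return field > level
```

Calling two_phase_structure(gaussian_random_field(), solid_fraction=0.3) would give a boolean map in which roughly 30 % of the pixels belong to the solid phase. Changing the seed gives a different but statistically similar structure.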

Simulation of SAXS data

For most random materials, the model for the SAXS data is not analytically tractable. Therefore, the SAXS curve must be computed numerically using, for example, Fourier methods based on the fast Fourier transform (FFT). The generation of the material model is also performed using FFT-based methods; therefore, both steps can be heavily accelerated using GPUs. Some typical examples of simulated, realistic SAXS curves (with no measurement noise added) for different parameters are shown below.
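As a rough illustration of the Fourier route, the sketch below estimates a scattering curve as the radially averaged power spectrum of a phase map. This is a simplified stand-in, assuming a 2D periodic structure, unit contrast between phases, and arbitrary intensity units; the function name scattering_curve and the parameter n_bins are hypothetical, and the paper's actual SAXS model may differ.

```python
import numpy as np


def scattering_curve(structure, n_bins=20):
    """Radially averaged power spectrum of a 2D phase map (arbitrary units)."""
    density = np.asarray(structure, dtype=float)
    density = density - density.mean()  # remove the q = 0 contribution
    power = np.abs(np.fft.fft2(density)) ** 2 / density.size
    n0, n1 = density.shape
    # Magnitude of the scattering vector on the FFT grid (radians per pixel).
    qx = 2.0 * np.pi * np.fft.fftfreq(n0)
    qy = 2.0 * np.pi * np.fft.fftfreq(n1)
    q = np.hypot(*np.meshgrid(qx, qy, indexing="ij"))
    # Bin |q| into equal-width shells up to the Nyquist value pi.
    edges = np.linspace(0.0, np.pi, n_bins + 1)
    idx = np.digitize(q.ravel(), edges) - 1
    keep = (idx >= 0) & (idx < n_bins)
    sums = np.bincount(idx[keep], weights=power.ravel()[keep], minlength=n_bins)
    counts = np.bincount(idx[keep], minlength=n_bins)
    q_mid = 0.5 * (edges[:-1] + edges[1:])
    # Average the power within each shell; empty shells are left at zero.
    return q_mid, sums / np.maximum(counts, 1)
```

For a striped test pattern with a period of 16 pixels, the curve peaks in the shell containing q = 2π/16, which is a quick sanity check that the binning works as intended.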

This model can represent materials with multiple phases, such as porous materials with two phases, a solid part and a pore network (filled with vacuum, air, gas, or fluid), or with three phases, as in the figure above, where the blue phase, for example water vapor, is adsorbed at the surface of the gray, solid phase.

The fact that the entire two-step model (material model plus SAXS model) can execute efficiently on GPUs makes it possible to generate very large synthetic datasets for machine learning. Indeed, on an NVIDIA A40 GPU one evaluation of the two-step model takes about 1 s, which makes it feasible to generate millions of synthetic SAXS curves.

Conventional curve fitting vs machine learning

Conventionally, a method such as least squares fitting would be used to fit this kind of model curve to real experimental data. This requires hundreds of online model evaluations as part of the estimation procedure itself, and hence hundreds of seconds of execution time. In contrast, we train a boosted tree-based regression model (XGBoost) to predict material parameters based on offline model evaluations (the synthetic training data). Boosted trees are generally among the best-performing models for non-image data and can also benefit greatly from GPU acceleration. Although both the data generation and the training of the XGBoost models take time, the execution time for the actual prediction is on the order of milliseconds.
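The difference between the two routes can be sketched with a toy example. Everything here is an assumption made for illustration: toy_saxs is a cheap closed-form stand-in for the expensive simulator, the q grid and parameter range are arbitrary, the conventional fit is a plain least-squares scan over candidates, and LookupRegressor is a nearest-neighbour stand-in for the trained XGBoost model, chosen only to keep the example dependency-free. The point is where the model evaluations happen, not the choice of regressor.

```python
import numpy as np

Q = np.linspace(0.01, 1.0, 50)  # hypothetical q grid


def toy_saxs(length_scale, q=Q):
    """Cheap closed-form stand-in for the expensive two-step simulator."""
    return 1.0 / (1.0 + (q * length_scale) ** 2) ** 2


def fit_least_squares(curve, candidates):
    """Conventional route: each candidate costs one online model evaluation."""
    errors = [np.sum((np.log(toy_saxs(c)) - np.log(curve)) ** 2) for c in candidates]
    best = float(candidates[int(np.argmin(errors))])
    return best, len(candidates)


class LookupRegressor:
    """Nearest-neighbour stand-in for the trained regression model."""

    def fit(self, curves, params):
        self.features = np.log(curves)
        self.params = np.asarray(params, dtype=float)
        return self

    def predict(self, curve):
        # No simulator call here: only a comparison against stored training data.
        dist = np.sum((self.features - np.log(curve)) ** 2, axis=1)
        return float(self.params[int(np.argmin(dist))])


# Offline: generate synthetic training data once and "train" the model.
rng = np.random.default_rng(0)
train_params = rng.uniform(2.0, 20.0, size=2000)
train_curves = np.stack([toy_saxs(p) for p in train_params])
model = LookupRegressor().fit(train_curves, train_params)

# Online: analyse one noise-free "measured" curve with both routes.
measured = toy_saxs(10.0)
fitted, n_online_evals = fit_least_squares(measured, np.linspace(2.0, 20.0, 300))
predicted = model.predict(measured)
```

Both routes recover a length scale close to the true value of 10, but the least-squares scan calls the simulator 300 times during the analysis, whereas the prediction step calls it zero times. With a simulator that takes about 1 s per evaluation, that is the gap between minutes and milliseconds described above.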

A case study is shown in the figure above. A simulated material (center) yields a corresponding simulated SAXS curve with realistic experimental noise (left, blue). The XGBoost model is then used to predict the material parameters, from which the “fitted” SAXS curve (left, red) is obtained (there is no actual fitting) and an example of a structure with the predicted parameters (right) can be produced. Note that the materials are only statistically similar; since the materials are random, the aim is not an exact reproduction, only to capture parameters such as length scale.

Conclusions

This proof of concept, a collaboration between multiple parts of RISE, illustrates the usefulness of a machine learning-based approach for predicting material parameters. Because it vastly accelerates the data analysis once the measurement is done, experimenters can obtain predictions of the material parameters more quickly. The concept is easily extended to other experimental settings and other statistical models for materials. In addition to the article, all data and code are openly available on Zenodo to facilitate further development in this field.