πŸš€ Getting started ================== .. note:: This section introduces the framework, its core concepts, and the main components of its workflow. It provides a starting point for understanding the framework and prepares you for the complete end-to-end :ref:`tutorial`. For advanced usage, consult the :doc:`/api`. Introduction ------------ At its core, |aidsorb| automates the end-to-end workflow of training deep learning models for porous materials. The process starts from a directory of molecular structures together with a ``labels.csv`` file containing the target properties. The structures are first converted into one of the built-in representations. Alternatively, users can supply their own precomputed representations stored as ``.npy`` files. The resulting data are then split into training, validation, and test sets, after which the entire training pipeline is orchestrated through a single ``.yaml`` configuration file. The general workflow is illustrated below. .. image:: images/aidsorb_workflow.png :align: center Representations --------------- .. tip:: The representations described below are built into |aidsorb|, but you are not limited to them. You can train models using your own representations, as long as they are stored as ``.npy`` files (see :func:`numpy.save`) in a directory. Point clouds ^^^^^^^^^^^^ *What is a point cloud?* A point cloud is a *set of 3D data points*, i.e. a *set of 3D coordinates and (optionally) associated features*. More formally: .. math:: \mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_N\} \quad \text{and} \quad \mathbf{p}_i \in \mathbb{R}^{3+C} where :math:`N` is the number of points in the point cloud and :math:`C` is the number of (per-point) features. In |aidsorb|, a point cloud is represented as a :class:`~numpy.ndarray` or :class:`~torch.Tensor` of shape ``(N, 3+C)``: .. math:: \mathcal{P} = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_N \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & z_1 & f_{1}^1 & \dots & f_1^C \\ x_2 & y_2 & z_2 & f_{2}^1 & \dots & f_2^C \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ x_N & y_N & z_N & f_{N}^1 & \dots & f_N^C \\ \end{bmatrix} *What is a molecular point cloud?* It is a point cloud where coordinates correspond to **atomic positions**, and features correspond to **atomic numbers and any additional information**. In |aidsorb|, a molecular point cloud is represented as :class:`~numpy.ndarray` or :class:`~torch.Tensor` of shape ``(N, 4+C)``, where ``N`` is the number of atoms, ``pcd[:, :3]`` are the **atomic coordinates**, ``pcd[:, 3]`` are the **atomic numbers** and ``pcd[:, 4:]`` any **additional features**. If ``C == 0``, then the only features are the atomic numbers. *Why molecular point clouds?* A fast, generic, and flexible representation that can be applied to a wide range of molecular and material systems. It enables deep learning directly from raw structural information, but typically requires more training data than more specialized representations. .. dropdown:: Molecular point cloud example .. raw:: html :file: images/pcd_plotly.html The above point cloud represents IRMOF-1. You can hover πŸ–±οΈ over the figure to play with it. Energy voxels ^^^^^^^^^^^^^ *What are energy voxels?* It is the **voxelized potential energy surface** of the material, that is a **3D energy image**, representing the landscape of host-guest interactions. In |aidsorb|, energy voxels are represented as :class:`~numpy.ndarray` or :class:`~torch.Tensor` of shape ``(C, D, H, W)`` (multi-channel image) or ``(D, H, W)`` (single-channel image). *Why energy voxels?* A physics-informed representation tailored for adsorption in porous materials. By explicitly encoding host–guest interaction energies, it often achieves good predictive performance with less training data than more generic representations, at the cost of reduced generality. .. dropdown:: Energy voxels example .. raw:: html :file: images/pes_plotly.html The above energy image represents IRMOF-1. You can hover πŸ–±οΈ over the figure to play with it. .. _tutorial: Tutorial -------- This tutorial demonstrates a complete workflow using molecular point clouds. .. tip:: The overall workflow is essentially the same for all built-in and custom representations. Only the representation-specific preprocessing and model configuration need to be adapted. Before starting, the following components are needed: * A directory containing files of **molecular structures**. * A ``.csv`` file containing the **labels of the molecular structures**. * A ``.yaml`` **configuration file** for orchestrating the DL part. .. note:: You are solely responsible for these 3 components. Data preparation ^^^^^^^^^^^^^^^^ .. rubric:: Create and store the point clouds Assuming your molecular structures are stored under a directory named ``structures``: .. tab-set:: .. tab-item:: CLI .. code-block:: console $ aidsorb create points path/to/structures path/to/pcd_data --features='[en_pauling]' $ aidsorb create --config=config.yaml # Recommended for reproducibility .. tab-item:: config.yaml .. code-block:: yaml dirname: 'path/to/structures' outname: 'path/to/pcd_data' features: ['en_pauling'] .. tab-item:: Python .. code-block:: python from aidsorb.utils import pcd_from_dir # Add electronegativity as additional feature. pcd_from_dir( dirname='path/to/structures', outname='path/to/pcd_data', features=['en_pauling'], ) .. rubric:: Split point clouds into train, validation and test sets .. tab-set:: .. tab-item:: CLI .. code-block:: console $ aidsorb prepare path/to/pcd_data --split_ratio='[0.7, 0.1, 0.2]' --seed=42 $ aidsorb prepare --config=config.yaml # Recommended for reproducibility .. tab-item:: config.yaml .. code-block:: yaml source: 'path/to/pcd_data' split_ratio: [0.7, 0.1, 0.2] seed: 42 .. tab-item:: Python .. code-block:: python from aidsorb.data import prepare_data # Split the data into (train, val, test). prepare_data( source='path/to/pcd_data', split_ratio=(0.7, 0.1, 0.2), seed=1, ) After creating and splitting the point clouds: .. code-block:: console project_root β”œβ”€β”€ pcd_data β”‚ β”œβ”€β”€ foo.npy β”‚ β”œβ”€β”€ ... β”‚ └── bar.npy β”œβ”€β”€ test.json β”œβ”€β”€ train.json └── validation.json * Each ``.npy`` file under ``pcd_data`` corresponds to a point cloud. * The ``.json`` files store the point cloud names for training, validation and testing. .. note:: The names stored in the ``.json`` files must match the entries in the ``index_col`` column of ``labels.csv``, without the ``.npy`` suffix (e.g. ``foo.npy`` β†’ ``foo``). .. tip:: You can visualize an input representation with: .. code-block:: console $ aidsorb visualize path/to/input.npy Train and test ^^^^^^^^^^^^^^ All you need is a ``.yaml`` configuration file and some keystrokes: .. tab-set:: .. tab-item:: Train .. code-block:: console $ aidsorb-lit fit --config=config.yaml .. tab-item:: Test .. code-block:: console $ aidsorb-lit test --config=config.yaml --ckpt_path=path/to/ckpt .. tab-item:: config.yaml You can generate and start customizing a configuration file as following:: $ aidsorb-lit fit --print_config > config.yaml Below is a dummy configuration file for multi-output regression using PointNet: .. warning:: The following configuration file is for illustration purposes only. **Adjust it as needed!** .. literalinclude:: examples/config.yaml :language: yaml .. tab-item:: labels.csv .. literalinclude:: examples/labels.csv :language: yaml .. seealso:: The documentation for the `LightningCLI `_, in case you are not familiar with |lightning| and YAML. Summing up ^^^^^^^^^^ .. code-block:: console $ aidsorb create points path/to/structures path/to/pcd_data # Create point clouds $ aidsorb prepare path/to/pcd_data # Split point clouds $ aidsorb-lit fit --config=path/to/config.yaml # Train $ aidsorb-lit test --config=path/to/config.yaml --ckpt_path=path/to/ckpt # Test .. _api_tutorial: Using the Python API ^^^^^^^^^^^^^^^^^^^^ Although you are encouraged to use the :doc:`cli`, for more flexibility you can also use |aidsorb| with plain |pytorch| or |lightning|. .. tab-set:: .. tab-item:: PyTorch .. code-block:: python import torch from torch.utils.data import DataLoader from aidsorb.data import Dataset, get_names # Create the datasets. train_set = Dataset( names=get_names('path/to/project_root/train.json'), path_to_X='path/to/input_data/', path_to_Y='path/to/labels.csv', ... ) val_set = Dataset( names=get_names('path/to/project_root/validation.json'), path_to_X='path/to/input_data/', path_to_Y='path/to/labels.csv', ... ) # Create the dataloaders. train_loader = DataLoader(train_set, ...) val_loader = DataLoader(val_set, ...) # Create the model. model = SomeModule(...) # Your code goes here. ... .. tab-item:: PyTorch Lightning .. code-block:: python import torch from lightning.pytorch import Trainer, seed_everything from aidsorb.datamodules import DataModule from aidsorb.litmodules import LitModule # For reproducibility. seed_everything(42) # Create the datamodule. dm = DataModule( path_to_X='path/to/input_data', path_to_Y='path/to/labels.csv', ..., ) # Create the litmodel. litmodel = LitModule( model=SomeModule(...), criterion=torch.nn.MSELoss(...), ... ) # Create the trainer. trainer = Trainer( max_epochs=10, accelerator='gpu', ...) # Train the model. trainer.fit(litmodel, datamodule=dm) # Test the model. trainer.test(litmodel, datamodule=dm) .. seealso:: * :class:`aidsorb.data.Dataset` * :class:`aidsorb.modules` * :class:`aidsorb.datamodules.DataModule` * :class:`aidsorb.litmodules.LitModule` Questions --------- We warmly encourage you to share any questions or ideas in the |discussions|. Before asking *how to do X?*, please read the documentation carefully.