Tutorial

Note

This tutorial covers the most common use cases of AIdsorb. For more advanced usage, you should consult the API Reference.

Introduction

What is a point cloud?

A point cloud is a set of 3D data points, i.e. a set of 3D coordinates and (optionally) associated features. More formally:

\[\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_N\} \quad \text{and} \quad \mathbf{p}_i \in \mathbb{R}^{3+C}\]

where \(N\) is the number of points in the point cloud and \(C\) is the number of (per-point) features.

In AIdsorb, a point cloud is represented as an ndarray or Tensor of shape (N, 3+C):

\[\begin{split}\mathcal{P} = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_N \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & z_1 & f_{1}^1 & \dots & f_1^C \\ x_2 & y_2 & z_2 & f_{2}^1 & \dots & f_2^C \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ x_N & y_N & z_N & f_{N}^1 & \dots & f_N^C \\ \end{bmatrix}\end{split}\]
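
As a concrete illustration of the (N, 3+C) layout, here is a minimal NumPy sketch. The sizes and the random values are arbitrary choices for the example and do not come from AIdsorb.

```python
import numpy as np

N, C = 5, 2  # 5 points, 2 per-point features (arbitrary example sizes)
rng = np.random.default_rng(seed=0)

coords = rng.normal(size=(N, 3))    # columns x, y, z
features = rng.normal(size=(N, C))  # columns f^1, ..., f^C

# Stack the columns so that each row is one point p_i in R^(3+C).
pcd = np.hstack([coords, features])

print(pcd.shape)  # (5, 5)
```

Each row of pcd is one point, so pcd[i, :3] gives its coordinates and pcd[i, 3:] its features.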

What is a molecular point cloud?

A molecular point cloud is a point cloud whose coordinates correspond to atomic positions and whose features are the atomic numbers plus any additional information.

In AIdsorb, a molecular point cloud (pcd) is represented as an ndarray or Tensor of shape (N, 4+C), where N is the number of atoms, pcd[:, :3] are the atomic coordinates, pcd[:, 3] are the atomic numbers and pcd[:, 4:] are any additional features. If C == 0, then the only features are the atomic numbers.
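
The slicing convention above can be tried on a hand-written array. The sketch below builds a water molecule with one extra feature (C == 1); the coordinates are approximate and purely illustrative, and the array is written by hand rather than produced by AIdsorb.

```python
import numpy as np

# Water: one O atom and two H atoms (approximate coordinates in angstrom).
# Columns: x, y, z, atomic number, Pauling electronegativity.
pcd = np.array([
    [0.000, 0.000, 0.000, 8.0, 3.44],
    [0.757, 0.586, 0.000, 1.0, 2.20],
    [-0.757, 0.586, 0.000, 1.0, 2.20],
])

coords = pcd[:, :3]         # atomic coordinates, shape (3, 3)
atomic_numbers = pcd[:, 3]  # atomic numbers, shape (3,)
extra = pcd[:, 4:]          # additional features, shape (3, 1)

print(coords.shape, atomic_numbers.shape, extra.shape)
```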

Deep learning on molecular point clouds

The following components are needed:

  • A directory containing files of molecular structures.

  • A .csv file containing the labels of the molecular structures.

  • A .yaml configuration file for orchestrating the DL part.

Note

You are solely responsible for providing these three components.

Data preparation

Create and store the point clouds

Assuming your molecular structures are stored under a directory named structures:

From the command line:

$ aidsorb create path/to/structures path/to/pcd_data --features="[en_pauling]"

From the command line with a configuration file:

$ aidsorb create --config=config.yaml  # Recommended for reproducibility

where config.yaml contains:

dirname: 'path/to/structures'
outname: 'path/to/pcd_data'
features: ['en_pauling']

From Python:

from aidsorb.utils import pcd_from_dir

# Add electronegativity as additional feature.
pcd_from_dir(
    dirname='path/to/structures',
    outname='path/to/pcd_data',
    features=['en_pauling'],
)

Split point clouds into train, validation and test sets

From the command line:

$ aidsorb prepare path/to/pcd_data --split_ratio="[0.7, 0.1, 0.2]" --seed=1

From the command line with a configuration file:

$ aidsorb prepare --config=config.yaml  # Recommended for reproducibility

where config.yaml contains:

source: 'path/to/pcd_data'
split_ratio: [0.7, 0.1, 0.2]
seed: 1

From Python:

from aidsorb.data import prepare_data

# Split the data into (train, val, test).
prepare_data(
    source='path/to/pcd_data',
    split_ratio=(0.7, 0.1, 0.2),
    seed=1,
)
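
To build intuition for what split_ratio and seed control, here is a small standard-library sketch of a seeded three-way split. The helper split_names is hypothetical and written only for this illustration; it is not part of AIdsorb, and the procedure AIdsorb actually uses may differ.

```python
import random


def split_names(names, split_ratio=(0.7, 0.1, 0.2), seed=1):
    """Shuffle names reproducibly and cut them into (train, val, test)."""
    names = sorted(names)               # fixed starting order
    random.Random(seed).shuffle(names)  # the same seed gives the same shuffle

    n_train = round(split_ratio[0] * len(names))
    n_val = round(split_ratio[1] * len(names))

    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]      # whatever remains
    return train, val, test


names = [f'pcd_{i}' for i in range(10)]
train, val, test = split_names(names)

print(len(train), len(val), len(test))  # 7 1 2
```

Calling split_names again with the same names and seed returns the same three lists, which is the point of fixing the seed.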

After creating and splitting the point clouds:

project_root
├── pcd_data
│   ├── foo.npy
│   ├── ...
│   └── bar.npy
├── test.json
├── train.json
└── validation.json

  • Each .npy file under pcd_data corresponds to a point cloud.

  • The .json files store the point cloud names for training, validation and testing.

Tip

You can visualize a point cloud with:

$ aidsorb visualize path/to/structure_or_pcd  # Structure (.xyz, .cif, etc) or .npy

Train and test

All you need is a .yaml configuration file and some keystrokes:

$ aidsorb-lit fit --config=config.yaml
$ aidsorb-lit test --config=config.yaml --ckpt_path=path/to/ckpt

You can generate a configuration file and start customizing it as follows:

$ aidsorb-lit fit --print_config > config.yaml

Below is a dummy configuration file for multi-output regression using PointNet:

Warning

The following configuration file is for illustration purposes only. Adjust it as needed!

seed_everything: 1  # Workers are seeded as well

# (Optional) Here you set up the Trainer
trainer:
  max_epochs: 2
  accelerator: 'gpu'

# Here you set up the DataModule (PCDDataModule)
# For more information 👉 aidsorb.datamodules
data:
  # Relative paths are resolved from the directory where aidsorb-lit is called
  # Consider using absolute paths
  path_to_X: 'path/to/pcd_data'
  path_to_Y: 'path/to/labels.csv'
  index_col: 'id'
  labels: ['y1', 'y3']
  train_transform_x:
    class_path: torchvision.transforms.v2.Compose
    init_args:
      transforms:
      - class_path: aidsorb.transforms.Center
      # Data augmentation
      - class_path: aidsorb.transforms.RandomJitter
        init_args:
          std: 0.3
      - class_path: aidsorb.transforms.RandomRotation
  eval_transform_x:
    class_path: aidsorb.transforms.Center
  train_size: Null  # Use all training data
  train_batch_size: 2
  eval_batch_size: 2
  shuffle: True
  config_dataloaders:
    collate_fn:
      class_path: aidsorb.data.Collator
      init_args:
        channels_first: True

# Here you set up the LightningModule (PCDLit)
# For more information 👉 aidsorb.litmodules
model:
  criterion:
    class_path: torch.nn.MSELoss
  metric:
    class_path: torchmetrics.MetricCollection
    init_args:
      metrics:
        r2: {class_path: torchmetrics.R2Score}
        mae: {class_path: torchmetrics.MeanAbsoluteError}
  model:
    # You can also pass a custom architecture
    class_path: aidsorb.modules.PointNet
    init_args:
      head:
        class_path: aidsorb.modules.PointNetClsHead
        init_args:
          n_outputs: 2
          dropout_rate: 0.1

# (Optional) Here you set up the optimizer
# If not specified, Adam will be used with default hyperparameters
optimizer:
  class_path: torch.optim.SGD
  init_args:
    lr: 0.001
    momentum: 0.0

# (Optional) Here you set up the learning rate scheduler
# If not specified, no scheduler will be applied
lr_scheduler:
  class_path: torch.optim.lr_scheduler.StepLR
  init_args:
    step_size: 10
    gamma: 0.1

A matching dummy labels.csv:

id,y1,y2,y3
ZnMOF-74,1.0,2.1,3.2
IRMOF-1,1.1,2.4,3.4
Cu-BTC,9.6,2.7,3.3
COF-5,0.1,0.5,0.4
ala_phe_ala,5.2,1.0,8.2
ZIF-1,2.2,0.4,3.2
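
To make the roles of index_col and labels in the configuration concrete, the sketch below selects the same columns from the dummy labels using only the standard library. It illustrates the column selection only and is not how AIdsorb reads the file internally.

```python
import csv
import io

# The dummy labels from above, inlined so that the example is self-contained.
labels_csv = """\
id,y1,y2,y3
ZnMOF-74,1.0,2.1,3.2
IRMOF-1,1.1,2.4,3.4
Cu-BTC,9.6,2.7,3.3
COF-5,0.1,0.5,0.4
ala_phe_ala,5.2,1.0,8.2
ZIF-1,2.2,0.4,3.2
"""

index_col = 'id'       # column holding the structure names
labels = ['y1', 'y3']  # columns used as regression targets

# Map each structure name to its selected target values.
targets = {}
for row in csv.DictReader(io.StringIO(labels_csv)):
    targets[row[index_col]] = [float(row[name]) for name in labels]

print(targets['IRMOF-1'])  # [1.1, 3.4]
```

Two columns are selected here, which is consistent with n_outputs: 2 in the model head of the configuration.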

See also

The documentation for the LightningCLI, in case you are not familiar with PyTorch Lightning and YAML.

Summing up

$ aidsorb create path/to/structures path/to/pcd_data  # Create point clouds
$ aidsorb prepare path/to/pcd_data  # Split point clouds
$ aidsorb-lit fit --config=path/to/config.yaml  # Train
$ aidsorb-lit test --config=path/to/config.yaml --ckpt_path=path/to/ckpt  # Test

Questions

Using point clouds not created with AIdsorb?

Yes! The only requirement is to store them under a directory in .npy format (see numpy.save()) and respect the shapes described in Introduction. Then, you can proceed as described earlier (omitting the point clouds creation part).
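
As a rough sketch of that requirement, the snippet below stores a hand-written molecular point cloud of shape (N, 4) with numpy.save() and loads it back. The molecule (carbon monoxide), the file name co.npy and the temporary directory are illustrative assumptions, not AIdsorb conventions.

```python
import os
import tempfile

import numpy as np

# Carbon monoxide: columns x, y, z, atomic number (C == 0).
pcd = np.array([
    [0.0, 0.0, 0.000, 6.0],  # C
    [0.0, 0.0, 1.128, 8.0],  # O
])

# Store the point cloud under a directory in .npy format.
outdir = tempfile.mkdtemp()
path = os.path.join(outdir, 'co.npy')
np.save(path, pcd)

# Reload it to check that the shape survived the round trip.
loaded = np.load(path)
print(loaded.shape)  # (2, 4)
```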

Deep learning without the CLI?

Of course! Although you are encouraged to use the CLI, you can also use AIdsorb with plain PyTorch or PyTorch Lightning.

With plain PyTorch:

from torch.utils.data import DataLoader
from aidsorb.data import PCDDataset, Collator, get_names
from aidsorb.modules import PointNet

# Create the datasets.
train_set = PCDDataset(
    pcd_names=get_names('path/to/project_root/train.json'),
    path_to_X='path/to/pcd_data/',
    path_to_Y='path/to/labels.csv',
    ...
    )
val_set = PCDDataset(
    pcd_names=get_names('path/to/project_root/validation.json'),
    path_to_X='path/to/pcd_data/',
    path_to_Y='path/to/labels.csv',
    ...
    )

# Create the dataloaders.
train_loader = DataLoader(train_set, ..., collate_fn=Collator(channels_first=True))
val_loader = DataLoader(val_set, ..., collate_fn=Collator(channels_first=True))

# Create the model.
model = PointNet(...)

# Your code goes here.
...

With PyTorch Lightning:

import lightning as L
from aidsorb.data import Collator
from aidsorb.datamodules import PCDDataModule
from aidsorb.modules import PointNet
from aidsorb.litmodules import PCDLit

# Create the datamodule.
dm = PCDDataModule(
    path_to_X='path/to/pcd_data',
    path_to_Y='path/to/labels.csv',
    ...,
    config_dataloaders=dict(collate_fn=Collator(channels_first=True), ...),
    )

# Create the litmodel.
litmodel = PCDLit(model=PointNet(...), ...)

# Create the trainer.
trainer = L.Trainer(...)

# Your code goes here.
...

Predicting directly from the CLI?

Currently, this feature is not available (see TODO).

Further questions

We warmly encourage you to share any questions or ideas in the Discussions.

Note

Before asking "how to do X?", please read the documentation carefully.