Tutorial

Note

This tutorial covers the most common use cases of AIdsorb. For more advanced usage, you should consult the API Documentation.

Introduction

What is a point cloud?

A point cloud is a set of 3D data points, i.e. a set of 3D coordinates and (optionally) associated features. More formally:

\[\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_N\} \quad \text{and} \quad \mathbf{p}_i \in \mathbb{R}^{3+C}\]

where \(N\) is the number of points in the point cloud and \(C\) is the number of (per-point) features.

In AIdsorb, a point cloud is represented as a numpy.ndarray of shape (N, 3+C):

\[\begin{split}\mathcal{P} = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_N \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & z_1 & f_{1}^1 & \dots & f_1^C \\ x_2 & y_2 & z_2 & f_{2}^1 & \dots & f_2^C \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ x_N & y_N & z_N & f_{N}^1 & \dots & f_N^C \\ \end{bmatrix}\end{split}\]
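
As a minimal sketch of this layout, the snippet below builds an array of shape (N, 3+C) from random values. The values of N and C and the variable names are arbitrary choices for illustration, not anything AIdsorb requires.

```python
import numpy as np

N, C = 5, 2  # 5 points, 2 per-point features (arbitrary illustrative values)
rng = np.random.default_rng(0)

coords = rng.uniform(-1.0, 1.0, size=(N, 3))   # x, y, z
features = rng.uniform(0.0, 1.0, size=(N, C))  # f^1, ..., f^C

# Each row is one point p_i = (x_i, y_i, z_i, f_i^1, ..., f_i^C).
pcd = np.hstack([coords, features])
print(pcd.shape)  # (5, 5)
```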

What is a molecular point cloud?

It is a point cloud where coordinates correspond to atomic positions, and features correspond to atomic numbers and any additional information.

In AIdsorb, a molecular point cloud pcd is represented as a numpy.ndarray of shape (N, 4+C), where N is the number of atoms, pcd[:, :3] are the atomic coordinates, pcd[:, 3] are the atomic numbers and pcd[:, 4:] are any additional features. If C == 0, then the only features are the atomic numbers.
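
For illustration, here is a hand-written molecular point cloud for a water molecule with one additional feature (Pauling electronegativity), so N = 3 and C = 1. The coordinates are approximate and serve only to show the slicing; the array was typed by hand, not produced by AIdsorb.

```python
import numpy as np

# Columns: x, y, z, atomic number, Pauling electronegativity.
# Coordinates (in angstroms) are approximate and only for illustration.
pcd = np.array([
    [0.000,  0.000,  0.117, 8, 3.44],  # O
    [0.000,  0.757, -0.469, 1, 2.20],  # H
    [0.000, -0.757, -0.469, 1, 2.20],  # H
])

coords = pcd[:, :3]          # atomic coordinates, shape (3, 3)
atomic_numbers = pcd[:, 3]   # atomic numbers, shape (3,)
extra_features = pcd[:, 4:]  # additional features, shape (3, 1)

print(pcd.shape)  # (3, 5)
```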

Tip

You can visualize a molecular point cloud with:

$ aidsorb visualize path/to/structure

Deep learning on molecular point clouds

For creating molecular point clouds and performing deep learning, the following components are needed:

A directory containing files of molecular structures.
A .csv file containing the labels of the molecular structures.
A .yaml configuration file for orchestrating the DL part.

Note

You are solely responsible for these 3 components.

Data preparation

Create and store the point clouds

Assuming your molecular structures are stored under path/to/structures and the directory path/to/pcds already exists:

CLI

$ aidsorb create path/to/structures path/to/pcds/pcds.npz -f "['en_pauling']"

Python

from aidsorb.utils import pcd_from_dir

# We add electronegativity as additional feature.
pcd_from_dir(
    dirname='path/to/structures',
    outname='path/to/pcds/pcds.npz',
    features=['en_pauling'],
)

Split data into train, validation and test sets

CLI

$ aidsorb prepare path/to/pcds/pcds.npz --split_ratio "(0.7, 0.1, 0.2)" --seed 1

Python

from aidsorb.data import prepare_data

# Split the data into (train, val, test).
prepare_data(
    source='path/to/pcds/pcds.npz',
    split_ratio=(0.7, 0.1, 0.2),
    seed=1,
)

Now the path/to/pcds directory is populated with the following files:

$ tree path/to/pcds
pcds/
├── pcds.npz
├── test.json
├── train.json
└── validation.json

The pcds.npz file which stores the point clouds.
Three .json files which store the names of the structures for training, validation and testing.
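
To make the role of split_ratio and seed concrete, the sketch below shuffles a list of structure names and splits it into three parts. It is an independent illustration using only the standard library; split_names is a hypothetical helper, and AIdsorb's own splitting logic may differ in its details.

```python
import random

def split_names(names, split_ratio=(0.7, 0.1, 0.2), seed=1):
    """Shuffle names and split them into (train, validation, test) lists."""
    names = sorted(names)  # Fixed starting order, so the seed alone decides the split.
    random.Random(seed).shuffle(names)
    n_train = round(split_ratio[0] * len(names))
    n_val = round(split_ratio[1] * len(names))
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]  # The remainder goes to the test set.
    return train, val, test

names = [f'structure_{i}' for i in range(10)]
train, val, test = split_names(names)
print(len(train), len(val), len(test))  # 7 1 2
```

Re-running with the same seed reproduces the same three lists, which is why a fixed seed is passed in the commands above.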

Train and test

🎉 All you need is a .yaml and some… ⌨️ keystrokes!

Train

$ aidsorb-lit fit --config=config.yaml

Test

$ aidsorb-lit test --config=config.yaml --ckpt_path=path/to/ckpt

config.yaml

seed_everything: 1  # Workers are seeded as well.

# Here you set up the Trainer.
trainer:
  max_epochs: 2
  accelerator: 'gpu'

# Here you set up the DataModule (PCDDataModule).
# For more information 👉 aidsorb.datamodules
data:
  # Relative paths are resolved from where aidsorb-lit is called.
  # Consider using absolute paths.
  path_to_X: 'path/to/pcds/pcds.npz'
  path_to_Y: 'path/to/labels.csv'
  index_col: 'id'
  labels: ['y1', 'y3']
  train_transform_x:
    # Here you can pass transformations for augmentation.
    class_path: aidsorb.transforms.Center
  eval_transform_x:
    class_path: aidsorb.transforms.Center
  train_size: Null  # Use all training data.
  train_batch_size: 2
  eval_batch_size: 2
  shuffle: True
  config_dataloaders:
    collate_fn:
      class_path: aidsorb.data.Collator

# Here you set up the LightningModule (PointLit).
# For more information 👉 aidsorb.litmodels
model:
  loss:
    class_path: torch.nn.MSELoss
  metric:
    class_path: torchmetrics.MetricCollection
    init_args:
      metrics:
        r2: {class_path: torchmetrics.R2Score}
        mae: {class_path: torchmetrics.MeanAbsoluteError}
  model:
    class_path: aidsorb.models.PointNet
    init_args:
      head:
        class_path: aidsorb.modules.PointNetClsHead
        init_args:
          dropout_rate: 0.7

# Here you set up the optimizer.
optimizer:
  class_path: torch.optim.SGD
  init_args:
    lr: 0.001
    momentum: 0.0

# Here you set up the learning rate scheduler.
lr_scheduler:
  class_path: torch.optim.lr_scheduler.StepLR
  init_args:
    step_size: 1  # StepLR needs a positive integer (epochs between decays); adjust as needed.
    gamma: 0.1
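
The configuration above applies aidsorb.transforms.Center to every point cloud. As a rough sketch of what such a centering step amounts to, assuming it subtracts the centroid from the coordinates and leaves the features untouched, consider the following. The center function is a hypothetical stand-in written for this example, not AIdsorb's implementation.

```python
import numpy as np

def center(pcd):
    """Return a copy of pcd whose coordinates have zero mean; features are untouched."""
    out = pcd.astype(float)  # astype returns a copy, so the input is not modified
    out[:, :3] -= out[:, :3].mean(axis=0)  # subtract the centroid from x, y, z
    return out

# Two points with one feature column (atomic number).
pcd = np.array([
    [1.0, 2.0, 3.0, 6.0],
    [3.0, 2.0, 1.0, 8.0],
])
centered = center(pcd)
print(centered[:, :3].mean(axis=0))  # coordinates now average to zero
print(centered[:, 3])                # feature column is unchanged
```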

labels.csv

id,y1,y2,y3
ZnMOF-74,10,20,30
IRMOF-1,1,2,3
Cu-BTC,9,-2,3
COF-5,100,200,300
ala_phe_ala,-50,-150,-38
ZIF-1,20,40,-32
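
The index_col and labels entries of the configuration refer to columns of this file. The sketch below shows one plausible reading of that relationship using only the standard library: rows are keyed by the index_col column and only the columns listed in labels are kept as targets. It illustrates the file layout and is not AIdsorb's loading code.

```python
import csv
import io

# First rows of the labels file, inlined so the example is self-contained.
csv_text = """id,y1,y2,y3
ZnMOF-74,10,20,30
IRMOF-1,1,2,3
Cu-BTC,9,-2,3
"""

index_col = 'id'
labels = ['y1', 'y3']

# Map each structure name to its selected target values.
targets = {
    row[index_col]: [float(row[name]) for name in labels]
    for row in csv.DictReader(io.StringIO(csv_text))
}
print(targets['IRMOF-1'])  # [1.0, 3.0]
```

Note that each entry of labels must match a column header exactly, and each value in the index column should match the name of a stored structure.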

Summing up

$ aidsorb create path/to/inp path/to/out  # Create point clouds
$ aidsorb prepare path/to/out  # Split point clouds
$ aidsorb-lit fit --config=path/to/config.yaml  # Train
$ aidsorb-lit test --config=path/to/config.yaml --ckpt_path=path/to/ckpt  # Test

Questions

Can I use point clouds not created with AIdsorb?

Yes! The only requirement is to store them in a .npz file (see numpy.savez()) and respect the shapes described in Introduction. Then, you can proceed as described earlier (omitting the point cloud creation step).
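
A minimal sketch of storing custom point clouds with numpy.savez() follows. It assumes that each array is saved under the name of its structure; the structure names, atom counts and random values are hypothetical placeholders.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical structures: name -> array of shape (N, 4+C), here with C = 0.
pcds = {}
for name, n_atoms in [('structure_a', 12), ('structure_b', 30)]:
    coords = rng.uniform(0.0, 10.0, size=(n_atoms, 3))
    atomic_numbers = rng.integers(1, 31, size=(n_atoms, 1))
    pcds[name] = np.hstack([coords, atomic_numbers])

outname = os.path.join(tempfile.mkdtemp(), 'pcds.npz')
np.savez(outname, **pcds)  # One array per structure, keyed by name (assumption).

# Reload the file to check names and shapes.
with np.load(outname) as data:
    shapes = {name: data[name].shape for name in data.files}
print(shapes)
```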

Can I do DL without the CLI?

Of course! Although you are encouraged to use the CLI, you can also use AIdsorb with plain PyTorch or PyTorch Lightning.

Can I predict directly from the CLI?

Currently, this feature is not available (see TODO).

What’s next?