aidsorb.data

Helper functions and classes for creating datasets and handling point clouds of variable sizes.

class aidsorb.data.Collator(*, channels_first, mode='upsample', return_mask=False)[source]

Bases: object

Collate a sequence of samples into a batch.

Point clouds are padded before collation, so they can form a batch.

Shapes

Input: sequence of samples
Each sample is a tuple of (pcd, label).
- pcd tensor of shape (N_i, C).
- label tensor of shape (n_outputs,), () or None.
Output: tuple
If return_mask=False, then output is (x, y), else ((x, mask), y).
- x tensor of shape (B, C, T) if channels_first=True, else (B, T, C).
- y tensor of shape (B, n_outputs), (B,) or None.
- mask boolean tensor of shape (B, T) where True indicates padding.

B is the batch size and T is the size of the largest point cloud in the sequence.
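
The shapes above can be illustrated with a minimal pure-Python sketch of zero-padding together with a padding mask. It uses nested lists in place of tensors and the channels-last layout (B, T, C). The helper name zeropad_with_mask is hypothetical and is not part of aidsorb; the library itself operates on tensors and may be implemented differently.

```python
def zeropad_with_mask(pcds):
    """Zero-pad point clouds (lists of rows) to the size of the largest one.

    Hypothetical helper for illustration only, not the aidsorb implementation.
    Returns (batch, mask) where mask[i][j] is True if row j of cloud i is padding.
    """
    n_channels = len(pcds[0][0])          # C, assumed equal for all clouds
    max_len = max(len(p) for p in pcds)   # T, size of the largest cloud

    batch, mask = [], []
    for pcd in pcds:
        n_pad = max_len - len(pcd)
        # Original points first, then rows of zeros up to length T.
        batch.append([list(row) for row in pcd] + [[0] * n_channels for _ in range(n_pad)])
        # False marks a real point, True marks a padded one.
        mask.append([False] * len(pcd) + [True] * n_pad)
    return batch, mask


pcd1 = [[4, 2, 1, 4], [2, 0, 0, 1]]  # N_1 = 2 points, C = 4
pcd2 = [[1, 2, 3, 1]]                # N_2 = 1 point,  C = 4
batch, mask = zeropad_with_mask([pcd1, pcd2])
```

Here batch has B = 2 entries of T = 2 rows each, and only the second row of the second point cloud is flagged as padding.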

Parameters:

channels_first (bool)
mode ({'zeropad', 'upsample'}, default='upsample')
return_mask (bool, default=False)

See also

pad_pcds(): For a description of the parameters.

Examples

>>> sample1 = (torch.tensor([[1, 4, 5, 2]]), torch.tensor([1., 2.]))
>>> sample2 = (torch.tensor([[0, 4, 0, 2], [2, 4, 1, 8]]), torch.tensor([7., 3.]))

>>> collate_fn = Collator(channels_first=True)
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 1],
         [4, 4],
         [5, 5],
         [2, 2]],

        [[0, 2],
         [4, 4],
         [0, 1],
         [2, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])

>>> collate_fn = Collator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 4, 5, 2],
         [0, 0, 0, 0]],

        [[0, 4, 0, 2],
         [2, 4, 1, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])

>>> # Label has shape (), i.e. is scalar.
>>> sample1 = (torch.tensor([[3, 4, 3, 2]]), torch.tensor(0))
>>> sample2 = (torch.tensor([[2, 4, 8, 2], [9, 4, 1, 8]]), torch.tensor(1))
>>> collate_fn = Collator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[3, 4, 3, 2],
         [0, 0, 0, 0]],

        [[2, 4, 8, 2],
         [9, 4, 1, 8]]])
>>> y
tensor([0, 1])

>>> # Label is None, i.e. unlabeled data.
>>> sample1 = (torch.tensor([[1., 0., 1., 0.]]), None)
>>> sample2 = (torch.tensor([[5., 2., 2., 0.], [9., 0., 0., 1.]]), None)
>>> collate_fn = Collator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1., 0.],
         [0., 0.],
         [1., 0.],
         [0., 0.]],

        [[5., 9.],
         [2., 0.],
         [2., 0.],
         [0., 1.]]])
>>> y

>>> # Collate and return padding mask.
>>> sample1 = (torch.tensor([[4, 2, 1, 4], [2, 0, 0, 1]]), torch.tensor(1))
>>> sample2 = (torch.tensor([[1, 2, 3, 1]]), torch.tensor(4))
>>> collate_fn = Collator(channels_first=False, mode='zeropad', return_mask=True)
>>> (x, mask), y = collate_fn((sample1, sample2))
>>> x
tensor([[[4, 2, 1, 4],
         [2, 0, 0, 1]],

        [[1, 2, 3, 1],
         [0, 0, 0, 0]]])
>>> y
tensor([1, 4])
>>> mask
tensor([[False, False],
        [False,  True]])

>>> # Batch a single unlabeled sample.
>>> sample = (torch.tensor([[2, 3, 4]]), None)
>>> collate_fn = Collator(channels_first=False)
>>> x, y = collate_fn([sample])
>>> x
tensor([[[2, 3, 4]]])
>>> y

>>> # Batch a single labeled sample.
>>> sample = (torch.tensor([[1, 1, 2]]), torch.tensor(10))
>>> collate_fn = Collator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn([sample])
>>> x
tensor([[[1],
         [1],
         [2]]])
>>> y
tensor([10])

class aidsorb.data.PCDDataset(pcd_names, path_to_X, *, path_to_Y=None, index_col=None, labels=None, transform_x=None, transform_y=None)[source]

Bases: Dataset

Dataset for point clouds.

Indexing the dataset returns (x, None) if the data are unlabeled (i.e. path_to_Y=None), otherwise (x, y), where x and y are the outputs of transform_x and transform_y, respectively.

Note

All data (i.e. the point cloud and its label) are converted to Tensor before being passed to the transforms. As such, transform_x and transform_y must accept a Tensor as input.
y has shape (len(labels),) if transform_y=None.
A comma (,) is assumed as the field separator in the .csv file.

Parameters:

pcd_names (sequence) – Point cloud names.
path_to_X (str) – Absolute or relative path to the directory holding the point clouds.
path_to_Y (str, optional) – Absolute or relative path to the .csv file holding the labels of the point clouds.
index_col (str, optional) – Column name of the .csv file to be used for indexing. This column must include pcd_names. No effect if path_to_Y=None.
labels (list, optional) – List of column names from the .csv file containing the properties to be predicted. No effect if path_to_Y=None.
transform_x (callable, optional) – Transformation to apply to point cloud.
transform_y (callable, optional) – Transformation to apply to label. No effect if path_to_Y=None.
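
How index_col and labels relate to the .csv file can be sketched with the standard library alone. This is only an illustration under stated assumptions: the column names (name, y1, y2), the point cloud names and the helper load_labels are hypothetical, and aidsorb may read the file differently.

```python
import csv
import io

# Hypothetical contents of the .csv file; the comma is the field separator.
csv_text = "\n".join([
    "name,y1,y2",
    "foo,1.5,2.0",
    "bar,0.3,4.1",
])


def load_labels(fileobj, index_col, labels):
    """Map each point cloud name to the values of the requested label columns.

    Illustrative helper, not part of aidsorb.
    """
    reader = csv.DictReader(fileobj)  # uses ',' as the delimiter by default
    return {row[index_col]: [float(row[col]) for col in labels] for row in reader}


table = load_labels(io.StringIO(csv_text), index_col="name", labels=["y1", "y2"])
y_foo = table["foo"]  # one value per entry in labels, i.e. length len(labels)
```

Each looked-up label has as many entries as there are names in labels, which mirrors the (len(labels),) shape stated in the note above.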

See also

upsample_pcd(): For a description of 'upsample' mode.
torch.nn.utils.rnn.pad_sequence(): For a description of 'zeropad' mode.

Examples

>>> x1 = torch.tensor([[1, 2, 3, 4]])
>>> x2 = torch.tensor([[2, 5, 3, 8], [0, 2, 8, 9]])

>>> batch = pad_pcds((x1, x2), channels_first=False)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])

>>> batch = pad_pcds((x1, x2), channels_first=True)
>>> batch
tensor([[[1, 1],
         [2, 2],
         [3, 3],
         [4, 4]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])

>>> batch = pad_pcds((x1, x2), channels_first=False, mode='zeropad')
>>> batch
tensor([[[1, 2, 3, 4],
         [0, 0, 0, 0]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])

>>> batch = pad_pcds((x1, x2), channels_first=True, mode='zeropad')
>>> batch
tensor([[[1, 0],
         [2, 0],
         [3, 0],
         [4, 0]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])

>>> # Pad and return padding mask (useful for attention-based architectures).
>>> batch, mask = pad_pcds((x1, x2), channels_first=False, return_mask=True)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> mask
tensor([[False,  True],
        [False, False]])

>>> # Pad a single point cloud.
>>> pad_pcds([x1], channels_first=False, mode='zeropad')
tensor([[[1, 2, 3, 4]]])
>>> pad_pcds([x1], channels_first=True, mode='upsample')
tensor([[[1],
         [2],
         [3],
         [4]]])

aidsorb.data.prepare_data(source, split_ratio=None, seed=1)[source]

Split point clouds into train, validation and test sets.

Each created .json file stores the names of the point clouds that will be used for training, validation or testing.

Warning

All .json files are stored under the parent directory of source.
Splitting doesn’t support stratification. If your dataset is small and you want to perform classification, consider using scikit-learn’s train_test_split.
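
To make the stratification caveat concrete, the following is a minimal standard-library sketch of a stratified two-way split. It is a stand-in written for illustration, not train_test_split itself and not part of aidsorb. The helper stratified_split, the point cloud names and the class labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(names, classes, test_fraction=0.2, seed=1):
    """Split names into (train, test) so each class keeps roughly the same share.

    Illustrative helper only. Note that a class with a single member ends up
    entirely in the test set, because at least one member per class is held out.
    """
    by_class = defaultdict(list)
    for name, cls in zip(names, classes):
        by_class[cls].append(name)

    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(by_class):
        members = sorted(by_class[cls])  # fixed order before shuffling
        rng.shuffle(members)
        n_test = max(1, round(test_fraction * len(members)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


names = [f"pcd_{i}" for i in range(10)]
classes = [0] * 8 + [1] * 2  # imbalanced: 8 samples of class 0, 2 of class 1
train, test = stratified_split(names, classes, test_fraction=0.5)
```

With this imbalanced toy dataset, the minority class is represented in both subsets, which a purely random split does not guarantee.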

Parameters:

source (str) – Absolute or relative path to the directory holding the point clouds.
split_ratio (sequence, default=None) – Absolute sizes or fractions of splits of the form (train, val, test). If None, it is set to (0.8, 0.1, 0.1).
seed (int, default=1) – Seed of the random number generator (RNG) used for splitting.

Return type:

None
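
The role of split_ratio and seed can be sketched as follows. This is a minimal illustration using Python's random module and fractional ratios only; the helper split_names is hypothetical, and prepare_data may use a different RNG, so the actual assignment of names to splits will generally differ.

```python
import random


def split_names(names, split_ratio=(0.8, 0.1, 0.1), seed=1):
    """Shuffle names reproducibly and cut them into train/validation/test.

    Illustrative helper, not part of aidsorb. Handles fractions only.
    """
    names = sorted(names)               # fixed starting order, so the seed alone decides the result
    random.Random(seed).shuffle(names)  # local RNG, leaves the global random state untouched

    n_train = round(split_ratio[0] * len(names))
    n_val = round(split_ratio[1] * len(names))
    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # whatever remains goes to the test set
    }


all_names = [f"pcd_{i}" for i in range(10)]
splits = split_names(all_names)
```

Calling split_names again with the same seed reproduces the same three lists, while a different seed generally produces a different partition of the same names.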

Examples

Before the split:

project_root
└── source
    ├── foo.npy
    ├── ...
    └── bar.npy

>>> prepare_data('path/to/source')

After the split:

project_root
├── source
│   ├── foo.npy
│   ├── ...
│   └── bar.npy
├── test.json
├── train.json
└── validation.json