aidsorb.data

This module provides helper functions and classes for creating datasets and handling point clouds of variable sizes.

class aidsorb.data.Collator(channels_first=True, mode='upsample')

Bases: object

Collate a sequence of samples into a batch.

Because point clouds may differ in size, they are padded to a common size before collation so that they can be stacked into a batch.

Shapes

  • Input: sequence of samples

    Each sample is a tuple of tensors (pcd, label), where pcd has shape (N_i, C) and label has shape (n_outputs,) or ().

  • Output: tuple of length 2

    • batch[0] == x with shape (B, C, T) if channels_first=True, otherwise (B, T, C). B is the batch size and T is the size of the largest point cloud in the sequence.

    • batch[1] == y with shape (B, n_outputs) or (B,).

Tip

If your model is PointNet, use an instance of this class as collate_fn with channels_first=True.

Todo

Add functionality for collating only point clouds (useful when the dataset is unlabeled).

Parameters:
  • channels_first (bool, default=True)

  • mode ({'zeropad', 'upsample'}, default='upsample')

See also

pad_pcds()

For a description of the parameters.

upsample_pcd()

For a description of 'upsample' mode.

Examples

>>> sample1 = (torch.tensor([[1, 4, 5, 2]]), torch.tensor([1., 2.]))
>>> sample2 = (torch.tensor([[0, 4, 0, 2], [2, 4, 1, 8]]), torch.tensor([7., 3.]))
>>> collate_fn = Collator()
>>> x, y = collate_fn((sample1, sample2))
>>> x.shape
torch.Size([2, 4, 2])
>>> y.shape
torch.Size([2, 2])
>>> x
tensor([[[1, 1],
         [4, 4],
         [5, 5],
         [2, 2]],

        [[0, 2],
         [4, 4],
         [0, 1],
         [2, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])
>>> collate_fn = Collator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 4, 5, 2],
         [0, 0, 0, 0]],

        [[0, 4, 0, 2],
         [2, 4, 1, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])
>>> # Label has shape (), i.e. is scalar.
>>> sample1 = (torch.tensor([[3, 4, 3, 2]]), torch.tensor(0))
>>> sample2 = (torch.tensor([[2, 4, 8, 2], [9, 4, 1, 8]]), torch.tensor(1))
>>> collate_fn = Collator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[3, 4, 3, 2],
         [0, 0, 0, 0]],

        [[2, 4, 8, 2],
         [9, 4, 1, 8]]])
>>> y
tensor([0, 1])
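
The sketch below illustrates how a collate function of this kind plugs into a PyTorch DataLoader. It does not use Collator itself: zeropad_collate is a hypothetical stand-in that mimics mode='zeropad' with torch.nn.utils.rnn.pad_sequence, and the toy dataset is made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def zeropad_collate(samples, channels_first=True):
    """Hypothetical stand-in for Collator(mode='zeropad')."""
    pcds, labels = zip(*samples)
    # pad_sequence zero-pads along the point dimension: (B, T, C).
    x = pad_sequence(list(pcds), batch_first=True)
    if channels_first:
        x = x.transpose(1, 2)  # (B, C, T)
    y = torch.stack(list(labels))
    return x, y


# Toy dataset: point clouds with 1, 2 and 3 points, 4 channels each.
dataset = [(torch.ones(n, 4), torch.tensor(float(n))) for n in (1, 2, 3)]
loader = DataLoader(dataset, batch_size=3, collate_fn=zeropad_collate)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([3, 4, 3])
print(y.shape)  # torch.Size([3])
```

With the real class, the same DataLoader call would presumably take collate_fn=Collator() instead of the stand-in.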

class aidsorb.data.PCDDataset(pcd_names, path_to_X, path_to_Y=None, index_col=None, labels=None, transform_x=None, transform_y=None)

Bases: Dataset

Dataset for point clouds.

Tip

For implementing your own transforms, have a look at the transforms tutorial. For more flexibility, consider implementing them as callable instances of classes.

Parameters:
  • pcd_names (list) – List containing the names of the point clouds.

  • path_to_X (str) – Absolute or relative path to the .npz file holding the point clouds.

  • path_to_Y (str, optional) –

    Absolute or relative path to the .csv file holding the labels of the point clouds.

    Warning

    The comma , is assumed as the field separator.

  • index_col (str, optional) – Column name of the .csv file to be used as row labels. The names (values) under this column must follow the same naming scheme as in pcd_names.

  • labels (list, optional) – List containing the names of the properties to be predicted. No effect if path_to_Y=None.

  • transform_x (callable, optional) – Transforms applied to input, i.e. to each point cloud.

  • transform_y (callable, optional) – Transforms applied to output. No effect if path_to_Y=None.

See also

aidsorb.transforms

For available point cloud transformations.

property pcd_names

The names of the point clouds.
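
To make the roles of pcd_names, index_col and labels concrete, here is a minimal sketch using only the standard library. It is not how PCDDataset is implemented; the file contents, the column names (name, y1, y2, y3) and the point cloud names are all hypothetical.

```python
import csv
import io

# Hypothetical label file; 'name' plays the role of index_col.
csv_text = """name,y1,y2,y3
pcd_a,1.0,2.0,3.0
pcd_b,4.0,5.0,6.0
pcd_c,7.0,8.0,9.0
"""

pcd_names = ['pcd_c', 'pcd_a']  # Must match the values under index_col.
index_col = 'name'
labels = ['y1', 'y3']  # Properties to be predicted.

# Index the rows by the value stored under index_col.
rows = {row[index_col]: row for row in csv.DictReader(io.StringIO(csv_text))}

# For each requested point cloud, keep only the requested label columns.
targets = {
    name: [float(rows[name][label]) for label in labels]
    for name in pcd_names
}
print(targets)  # {'pcd_c': [7.0, 9.0], 'pcd_a': [1.0, 3.0]}
```

A name in pcd_names that is missing from the index column raises a KeyError in this sketch, which is why the naming scheme must match.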

aidsorb.data.get_names(filename)

Return names stored in a .json file.

Parameters:

filename (str) – The name of the file from which names will be retrieved.

Returns:

names

Return type:

list
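
A round trip with the standard json module shows the kind of file this function reads. The sketch assumes the file holds a plain JSON array of names; the names and the temporary file are placeholders, and get_names() itself is not called.

```python
import json
import os
import tempfile

names = ['pcd_a', 'pcd_b', 'pcd_c']

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'train.json')

    # Store the names as a JSON array (assumed file layout).
    with open(filename, 'w') as fhand:
        json.dump(names, fhand)

    # Read them back as a Python list.
    with open(filename) as fhand:
        loaded = json.load(fhand)

print(loaded)  # ['pcd_a', 'pcd_b', 'pcd_c']
```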

aidsorb.data.pad_pcds(pcds, channels_first=True, mode='upsample')

Pad a sequence of variable-size point clouds.

Each point cloud must have shape (N_i, C).

Parameters:
  • pcds (sequence of tensors)

  • mode ({'zeropad', 'upsample'}, default='upsample')

  • channels_first (bool, default=True)

Returns:

batch – If channels_first=False, then batch has shape (B, T, C), where B == len(pcds) is the batch size and T is the size of the largest point cloud in pcds. Otherwise, (B, C, T).

Return type:

tensor of shape (B, T, C) or (B, C, T)

See also

upsample_pcd()

For a description of 'upsample' mode.

torch.nn.utils.rnn.pad_sequence()

For a description of 'zeropad' mode.

Examples

>>> x1 = torch.tensor([[1, 2, 3, 4]])
>>> x2 = torch.tensor([[2, 5, 3, 8], [0, 2, 8, 9]])
>>> batch = pad_pcds((x1, x2), channels_first=False)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=True)
>>> batch
tensor([[[1, 1],
         [2, 2],
         [3, 3],
         [4, 4]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=False, mode='zeropad')
>>> batch
tensor([[[1, 2, 3, 4],
         [0, 0, 0, 0]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=True, mode='zeropad')
>>> batch
tensor([[[1, 0],
         [2, 0],
         [3, 0],
         [4, 0]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])

aidsorb.data.prepare_data(source, split_ratio=(0.8, 0.1, 0.1), seed=1)

Split a source of point clouds into train, validation and test sets.

Each .json file that is created stores the names of the point clouds that will be used for training, validation or testing, respectively.

Warning

  • No directory is created by prepare_data(). All .json files are stored under the directory containing source.

  • Splitting doesn’t support stratification. If your dataset is small and you want to perform classification, consider using train_test_split.

Parameters:
  • source (str) – Absolute or relative path to the file holding the point clouds.

  • split_ratio (sequence, default=(0.8, 0.1, 0.1)) –

    The sizes or fractions of splits to be produced.

    • split_ratio[0] == train.

    • split_ratio[1] == validation.

    • split_ratio[2] == test.

  • seed (int, default=1) – Seed of the random number generator used for splitting.

Examples

Before the split:

pcd_data
└──source.npz
>>> prepare_data('path/to/pcd_data/source.npz')  

After the split:

pcd_data
├──source.npz
├──train.json
├──validation.json
└──test.json
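
The splitting step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the implementation of prepare_data(): split_names is a hypothetical helper, it handles fractional ratios only, and it uses Python's random module rather than whatever random number generator aidsorb uses, so the resulting subsets will differ from those of prepare_data().

```python
import json
import os
import random
import tempfile


def split_names(names, split_ratio=(0.8, 0.1, 0.1), seed=1):
    """Shuffle names reproducibly and cut them into three lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)

    n_train = round(split_ratio[0] * len(shuffled))
    n_val = round(split_ratio[1] * len(shuffled))

    return {
        'train': shuffled[:n_train],
        'validation': shuffled[n_train:n_train + n_val],
        'test': shuffled[n_train + n_val:],  # Whatever remains.
    }


names = [f'pcd_{i}' for i in range(10)]  # Placeholder names.
splits = split_names(names)

# Write one .json file per subset, all in the same directory.
with tempfile.TemporaryDirectory() as tmpdir:
    for mode, subset in splits.items():
        with open(os.path.join(tmpdir, f'{mode}.json'), 'w') as fhand:
            json.dump(subset, fhand)
    created = sorted(os.listdir(tmpdir))

print(created)  # ['test.json', 'train.json', 'validation.json']
print([len(s) for s in splits.values()])  # [8, 1, 1]
```

Fixing the seed makes the split reproducible: calling split_names again with the same arguments returns the same three lists.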

aidsorb.data.upsample_pcd(pcd, size)

Upsample pcd to a new size by sampling with replacement from pcd.

Parameters:
  • pcd (tensor of shape (N, C)) – The original point cloud of size N.

  • size (int) – The size of the new point cloud.

Returns:

new_pcd

Return type:

tensor of shape (size, C)

Examples

>>> pcd = torch.tensor([[2, 4, 5, 6]])
>>> upsample_pcd(pcd, 3)
tensor([[2, 4, 5, 6],
        [2, 4, 5, 6],
        [2, 4, 5, 6]])
>>> # New points must come from pcd.
>>> pcd = torch.randn(10, 4)
>>> new_pcd = upsample_pcd(pcd, 20)
>>> (new_pcd[-1] == pcd).all(1).any()  # Check for last point.
tensor(True)
>>> # No upsampling.
>>> pcd = torch.randn(100, 4)
>>> new_pcd = upsample_pcd(pcd, len(pcd))
>>> torch.equal(pcd, new_pcd)
True
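
One way to obtain the behavior shown in the examples above is to keep the original points and append rows drawn at random with replacement. The sketch below is a guess at the idea, not the library's code: upsample_sketch is a hypothetical name, and it assumes size >= len(pcd).

```python
import torch


def upsample_sketch(pcd, size):
    """Hypothetical re-implementation: keep pcd, append resampled points."""
    n_extra = size - len(pcd)  # Assumes size >= len(pcd).
    # Draw row indices uniformly at random, with replacement.
    indices = torch.randint(len(pcd), (n_extra,))
    return torch.cat((pcd, pcd[indices]))


pcd = torch.randn(10, 4)
new_pcd = upsample_sketch(pcd, 25)

print(new_pcd.shape)  # torch.Size([25, 4])

# Every row of new_pcd is also a row of pcd.
is_from_pcd = (new_pcd[:, None] == pcd[None]).all(-1).any(-1)
print(is_from_pcd.all())  # tensor(True)
```

When size equals len(pcd), no indices are drawn and the sketch returns a tensor equal to pcd, matching the "No upsampling" case.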