aidsorb.data

Helper functions and classes for creating datasets and collating data.

class aidsorb.data.Dataset(names, path_to_X, *, path_to_Y=None, index_col=None, labels=None, transform_x=None, transform_y=None)

Bases: Dataset

Dataset for supervised/unsupervised learning.

Indexing the dataset returns (x, None) if data are unlabeled, i.e. path_to_Y=None, else (x, y), where x and y are the results of transform_x and transform_y, respectively.

Note

All data (i.e. the input and its label) are converted to Tensor before being passed to the transforms. As such, transform_x and transform_y must accept a Tensor as input.
y has shape (len(labels),) if transform_y=None.
A comma (,) is assumed to be the field separator in the .csv file.

Parameters:

names (sequence) – Names of the materials.
path_to_X (str) – Absolute or relative path to the directory holding the inputs.
path_to_Y (str, optional) – Absolute or relative path to the .csv file holding the labels of the inputs.
index_col (str, optional) – Column name of the .csv file to be used for indexing. This column must contain every entry of names. No effect if path_to_Y=None.
labels (list, optional) – List of column names from the .csv file containing the properties to be predicted. No effect if path_to_Y=None.
transform_x (callable, optional) – Transformation to apply to input.
transform_y (callable, optional) – Transformation to apply to label. No effect if path_to_Y=None.
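
The way index_col and labels work together can be sketched with the standard library alone. This is not AIdsorb's implementation: the helper lookup_labels, the column names and the sample values are all hypothetical. It only illustrates how a comma-separated file might be indexed by material name and reduced to the requested columns, so that each material ends up with one value per entry of labels.

```python
import csv
import io

def lookup_labels(csv_text, names, index_col, labels):
    """Return {name: [label values]} for the requested materials (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))  # comma is the default separator
    table = {row[index_col]: row for row in reader}
    # A KeyError here means a requested name is missing from the index column.
    return {name: [float(table[name][col]) for col in labels] for name in names}

# Hypothetical .csv content; column names and values are made up for illustration.
csv_text = "id,y1,y2\nfoo,1.0,2.0\nbar,7.0,3.0\n"

ys = lookup_labels(csv_text, names=["bar", "foo"], index_col="id", labels=["y1", "y2"])
print(ys["bar"])  # one value per entry in `labels`
```

The lookup is keyed by name rather than by row position, so the order of rows in the file does not need to match the order of names.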

See also

pad_pcds(): For a description of the parameters.

Examples

>>> sample1 = (torch.tensor([[1, 4, 5, 2]]), torch.tensor([1., 2.]))
>>> sample2 = (torch.tensor([[0, 4, 0, 2], [2, 4, 1, 8]]), torch.tensor([7., 3.]))

>>> collate_fn = PCDCollator(channels_first=True)
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 1],
         [4, 4],
         [5, 5],
         [2, 2]],

        [[0, 2],
         [4, 4],
         [0, 1],
         [2, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])

>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 4, 5, 2],
         [0, 0, 0, 0]],

        [[0, 4, 0, 2],
         [2, 4, 1, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])

>>> # Label has shape (), i.e. is scalar.
>>> sample1 = (torch.tensor([[3, 4, 3, 2]]), torch.tensor(0))
>>> sample2 = (torch.tensor([[2, 4, 8, 2], [9, 4, 1, 8]]), torch.tensor(1))
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[3, 4, 3, 2],
         [0, 0, 0, 0]],

        [[2, 4, 8, 2],
         [9, 4, 1, 8]]])
>>> y
tensor([0, 1])

>>> # Label is None, i.e. unlabeled data.
>>> sample1 = (torch.tensor([[1., 0., 1., 0.]]), None)
>>> sample2 = (torch.tensor([[5., 2., 2., 0.], [9., 0., 0., 1.]]), None)
>>> collate_fn = PCDCollator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1., 0.],
         [0., 0.],
         [1., 0.],
         [0., 0.]],

        [[5., 9.],
         [2., 0.],
         [2., 0.],
         [0., 1.]]])
>>> y

>>> # Collate and return padding mask.
>>> sample1 = (torch.tensor([[4, 2, 1, 4], [2, 0, 0, 1]]), torch.tensor(1))
>>> sample2 = (torch.tensor([[1, 2, 3, 1]]), torch.tensor(4))
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad', return_mask=True)
>>> (x, mask), y = collate_fn((sample1, sample2))
>>> x
tensor([[[4, 2, 1, 4],
         [2, 0, 0, 1]],

        [[1, 2, 3, 1],
         [0, 0, 0, 0]]])
>>> y
tensor([1, 4])
>>> mask
tensor([[False, False],
        [False,  True]])

>>> # Batch a single unlabeled sample.
>>> sample = (torch.tensor([[2, 3, 4]]), None)
>>> collate_fn = PCDCollator(channels_first=False)
>>> x, y = collate_fn([sample])
>>> x
tensor([[[2, 3, 4]]])
>>> y

>>> # Batch a single labeled sample.
>>> sample = (torch.tensor([[1, 1, 2]]), torch.tensor(10))
>>> collate_fn = PCDCollator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn([sample])
>>> x
tensor([[[1],
         [1],
         [2]]])
>>> y
tensor([10])

aidsorb.data.get_names(filename)

Return names stored in a .json file.

Parameters:

filename (str) – Absolute or relative path to the file.

Returns:

names

Return type:

tuple
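
As a rough illustration of the round trip, the sketch below writes a made-up names file and reads it back as a tuple. It assumes the .json file holds a flat array of names, which the text above does not state explicitly. read_names is a hypothetical stand-in written for this example, not the library function.

```python
import json
import os
import tempfile

def read_names(filename):
    """Load a JSON array of names and return it as a tuple (hypothetical stand-in)."""
    with open(filename, "r") as fhand:
        return tuple(json.load(fhand))

# Write a small, made-up names file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "train.json")
    with open(path, "w") as fhand:
        json.dump(["foo", "bar"], fhand)
    names = read_names(path)
```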

aidsorb.data.pad_pcds(pcds, *, channels_first, mode='upsample', return_mask=False)

Pad a sequence of variable size point clouds.

Each point cloud must have shape (N_i, C).

Shapes

batch: tensor of shape (B, T, C) if channels_first=False, else (B, C, T).
mask: boolean tensor of shape (B, T), where True indicates padding.

B is the batch size and T is the size of the largest point cloud in the sequence.
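
The shape conventions above can be traced by hand with nested lists. The following is a pure-Python sketch of zero-padding only, not the library's tensor-based implementation. The helper zeropad is hypothetical, and 'upsample' mode is deliberately left out.

```python
def zeropad(pcds, channels_first=False):
    """Zero-pad nested-list point clouds of shape (N_i, C); return (batch, mask)."""
    T = max(len(pcd) for pcd in pcds)  # size of the largest point cloud
    C = len(pcds[0][0])                # number of channels per point
    batch, mask = [], []
    for pcd in pcds:
        n_pad = T - len(pcd)
        padded = [list(point) for point in pcd] + [[0] * C for _ in range(n_pad)]
        if channels_first:
            # Transpose (T, C) -> (C, T).
            padded = [list(col) for col in zip(*padded)]
        batch.append(padded)
        mask.append([False] * len(pcd) + [True] * n_pad)  # True marks padding
    return batch, mask

x1 = [[1, 2, 3, 4]]
x2 = [[2, 5, 3, 8], [0, 2, 8, 9]]

batch, mask = zeropad([x1, x2])                        # layout (B, T, C)
batch_cf, _ = zeropad([x1, x2], channels_first=True)   # layout (B, C, T)
```

Here B=2, T=2 and C=4. The mask has shape (B, T) in both layouts, because it refers to points rather than channels.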

Parameters:

pcds (sequence of tensors)
channels_first (bool)
mode ({'zeropad', 'upsample'}, default='upsample')
return_mask (bool, default=False)

Returns:

batch if return_mask=False, else (batch, mask).

Return type:

tensor or tuple of tensors

See also

upsample_pcd(): For a description of 'upsample' mode.
torch.nn.utils.rnn.pad_sequence(): For a description of 'zeropad' mode.

Examples

>>> x1 = torch.tensor([[1, 2, 3, 4]])
>>> x2 = torch.tensor([[2, 5, 3, 8], [0, 2, 8, 9]])

>>> batch = pad_pcds((x1, x2), channels_first=False)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])

>>> batch = pad_pcds((x1, x2), channels_first=True)
>>> batch
tensor([[[1, 1],
         [2, 2],
         [3, 3],
         [4, 4]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])

>>> batch = pad_pcds((x1, x2), channels_first=False, mode='zeropad')
>>> batch
tensor([[[1, 2, 3, 4],
         [0, 0, 0, 0]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])

>>> batch = pad_pcds((x1, x2), channels_first=True, mode='zeropad')
>>> batch
tensor([[[1, 0],
         [2, 0],
         [3, 0],
         [4, 0]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])

>>> # Pad and return padding mask (useful for attention-based architectures).
>>> batch, mask = pad_pcds((x1, x2), channels_first=False, return_mask=True)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> mask
tensor([[False,  True],
        [False, False]])

>>> # Pad a single point cloud.
>>> pad_pcds([x1], channels_first=False, mode='zeropad')
tensor([[[1, 2, 3, 4]]])
>>> pad_pcds([x1], channels_first=True, mode='upsample')
tensor([[[1],
         [2],
         [3],
         [4]]])

aidsorb.data.prepare_data(source, split_ratio=None, seed=1)

Split materials into train, validation and test sets.

Each created .json file stores the names of the materials to be used for training, validation or testing, respectively.

Warning

All .json files are stored under the parent directory of source.
Splitting doesn’t support stratification. If your dataset is small and you want to perform classification, consider using scikit-learn’s train_test_split instead.
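
To make the stratification caveat concrete, the sketch below splits each class separately so that every class appears in the held-out set. It is a standard-library illustration only and is not part of AIdsorb. The helper stratified_split, the material names and the class labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(name_to_class, test_fraction=0.25, seed=1):
    """Split names into (train, test) while keeping class proportions (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for name, cls in sorted(name_to_class.items()):
        by_class[cls].append(name)
    train, test = [], []
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))  # per-class share of the test set
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Made-up data: eight materials, four in each of two classes.
name_to_class = {f"m{i}": i % 2 for i in range(8)}
train, test = stratified_split(name_to_class)
```

With a plain random split of a dataset this small, one class could easily be missing from the test set. Splitting per class rules that out.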

Parameters:

source (str) – Absolute or relative path to the directory holding the inputs.
split_ratio (sequence, default=None) – Absolute sizes or fractions of splits of the form (train, val, test). If None, it is set to (0.8, 0.1, 0.1).
seed (int, default=1) – Seed of the random number generator used for splitting.

Return type:

None

Examples

Before the split:

project_root
└── source
    ├── foo.npy
    ├── ...
    └── bar.npy

>>> prepare_data('path/to/source')

After the split:

project_root
├── source
│   ├── foo.npy
│   ├── ...
│   └── bar.npy
├── test.json
├── train.json
└── validation.json
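
The directory layout above can be reproduced with a small standard-library sketch. It is not the library's implementation: split_names is a hypothetical helper that handles fractional ratios only (not absolute sizes), and the shuffling strategy is an assumption. Material names are taken to be the file names without their extension.

```python
import json
import os
import random
import tempfile

def split_names(names, split_ratio=(0.8, 0.1, 0.1), seed=1):
    """Shuffle names and cut them into train/validation/test lists (hypothetical helper)."""
    names = sorted(names)  # fixed starting order, so the seed alone decides the split
    random.Random(seed).shuffle(names)
    n_train = int(split_ratio[0] * len(names))
    n_val = int(split_ratio[1] * len(names))
    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }

with tempfile.TemporaryDirectory() as project_root:
    source = os.path.join(project_root, "source")
    os.mkdir(source)
    for i in range(10):
        open(os.path.join(source, f"mat_{i}.npy"), "w").close()  # empty placeholder files

    names = [os.path.splitext(fname)[0] for fname in os.listdir(source)]
    splits = split_names(names)

    # One .json file per split, stored next to `source`, not inside it.
    for key, members in splits.items():
        with open(os.path.join(project_root, f"{key}.json"), "w") as fhand:
            json.dump(members, fhand)

    created = sorted(f for f in os.listdir(project_root) if f.endswith(".json"))
```

Sorting the names before shuffling keeps the result independent of the order in which the file system lists the directory, so the same seed yields the same split.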