aidsorb.data

Helper functions and classes for creating datasets and collating data.

class aidsorb.data.Dataset(names, path_to_X, *, path_to_Y=None, index_col=None, labels=None, transform_x=None, transform_y=None)

Bases: Dataset

Dataset for supervised/unsupervised learning.

Indexing the dataset returns (x, y), where x and y are the outputs of transform_x and transform_y, respectively. If the data are unlabeled, i.e. path_to_Y=None, it returns (x, None) instead.

Note

  • All data (i.e. the input and its label) are converted to Tensor before being passed to the transforms. As such, transform_x and transform_y must accept a Tensor as input.

  • y has shape (len(labels),) if transform_y=None.

  • A comma (,) is assumed to be the field separator in the .csv file.

Parameters:
  • names (sequence) – Names of the materials.

  • path_to_X (str) – Absolute or relative path to the directory holding the inputs.

  • path_to_Y (str, optional) – Absolute or relative path to the .csv file holding the labels of the inputs.

  • index_col (str, optional) – Column name of the .csv file to be used for indexing. This column must contain the entries of names. No effect if path_to_Y=None.

  • labels (list, optional) – List of column names from the .csv file containing the properties to be predicted. No effect if path_to_Y=None.

  • transform_x (callable, optional) – Transformation to apply to input.

  • transform_y (callable, optional) – Transformation to apply to label. No effect if path_to_Y=None.

See also

aidsorb.transforms

For available input transformations.

Y

Dataframe for the labels. The columns follow the order in labels.

property names: tuple

Names of the materials.
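
As a rough illustration of how index_col and labels select a label from the .csv file, the sketch below uses only the standard library. It is not the library's implementation (the Y attribute is a dataframe); the file contents, the column names and the helper lookup_label are hypothetical.

```python
import csv
import io

# Hypothetical contents of the labels file (comma-separated).
CSV_TEXT = """id,co2_uptake,surface_area
foo,1.5,1200.0
bar,0.7,850.0
"""


def lookup_label(csv_text, name, index_col, labels):
    """Return the values of the `labels` columns for the row whose `index_col` equals `name`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = {row[index_col]: row for row in reader}  # index rows by material name
    row = rows[name]  # raises KeyError if the name is missing from index_col
    return [float(row[col]) for col in labels]  # same order as `labels`


y = lookup_label(CSV_TEXT, "bar", index_col="id", labels=["surface_area", "co2_uptake"])
print(y)  # [850.0, 0.7], one value per entry in labels
```

The returned list has len(labels) entries, mirroring the (len(labels),) shape noted above for transform_y=None.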

class aidsorb.data.PCDCollator(*, channels_first, mode='upsample', return_mask=False)

Bases: object

Collator for point clouds.

Point clouds are padded to a common size before collation, so that they can be stacked into a batch.

Shapes

  • Input: sequence of samples

    Each sample is a tuple of (pcd, label).

    • pcd tensor of shape (N_i, C).

    • label tensor of shape (n_outputs,), () or None.

  • Output: tuple

    If return_mask=False, then output is (x, y), else ((x, mask), y).

    • x tensor of shape (B, C, T) if channels_first=True, else (B, T, C).

    • y tensor of shape (B, n_outputs), (B,) or None.

    • mask boolean tensor of shape (B, T) where True indicates padding.

B is the batch size and T is the size of the largest point cloud in the sequence.
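
One way to picture the padding mask is as a comparison between each position along T and the original size N_i of the corresponding point cloud. The pure-Python sketch below is illustrative only (the collator returns a boolean tensor, not nested lists), and padding_mask is a hypothetical helper.

```python
def padding_mask(sizes):
    """Build a (B, T) nested list where True marks padded positions."""
    t = max(sizes)  # T: size of the largest point cloud
    return [[j >= n for j in range(t)] for n in sizes]


# Two point clouds with 2 and 1 points, respectively.
mask = padding_mask([2, 1])
print(mask)  # [[False, False], [False, True]]
```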

Parameters:
  • channels_first (bool)

  • mode ({'zeropad', 'upsample'}, default='upsample')

  • return_mask (bool, default=False)

See also

pad_pcds()

For a description of the parameters.

Examples

>>> sample1 = (torch.tensor([[1, 4, 5, 2]]), torch.tensor([1., 2.]))
>>> sample2 = (torch.tensor([[0, 4, 0, 2], [2, 4, 1, 8]]), torch.tensor([7., 3.]))
>>> collate_fn = PCDCollator(channels_first=True)
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 1],
         [4, 4],
         [5, 5],
         [2, 2]],

        [[0, 2],
         [4, 4],
         [0, 1],
         [2, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 4, 5, 2],
         [0, 0, 0, 0]],

        [[0, 4, 0, 2],
         [2, 4, 1, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])
>>> # Label has shape (), i.e. is scalar.
>>> sample1 = (torch.tensor([[3, 4, 3, 2]]), torch.tensor(0))
>>> sample2 = (torch.tensor([[2, 4, 8, 2], [9, 4, 1, 8]]), torch.tensor(1))
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[3, 4, 3, 2],
         [0, 0, 0, 0]],

        [[2, 4, 8, 2],
         [9, 4, 1, 8]]])
>>> y
tensor([0, 1])
>>> # Label is None, i.e. unlabeled data.
>>> sample1 = (torch.tensor([[1., 0., 1., 0.]]), None)
>>> sample2 = (torch.tensor([[5., 2., 2., 0.], [9., 0., 0., 1.]]), None)
>>> collate_fn = PCDCollator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1., 0.],
         [0., 0.],
         [1., 0.],
         [0., 0.]],

        [[5., 9.],
         [2., 0.],
         [2., 0.],
         [0., 1.]]])
>>> y
>>> # Collate and return padding mask.
>>> sample1 = (torch.tensor([[4, 2, 1, 4], [2, 0, 0, 1]]), torch.tensor(1))
>>> sample2 = (torch.tensor([[1, 2, 3, 1]]), torch.tensor(4))
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad', return_mask=True)
>>> (x, mask), y = collate_fn((sample1, sample2))
>>> x
tensor([[[4, 2, 1, 4],
         [2, 0, 0, 1]],

        [[1, 2, 3, 1],
         [0, 0, 0, 0]]])
>>> y
tensor([1, 4])
>>> mask
tensor([[False, False],
        [False,  True]])
>>> # Batch a single unlabeled sample.
>>> sample = (torch.tensor([[2, 3, 4]]), None)
>>> collate_fn = PCDCollator(channels_first=False)
>>> x, y = collate_fn([sample])
>>> x
tensor([[[2, 3, 4]]])
>>> y
>>> # Batch a single labeled sample.
>>> sample = (torch.tensor([[1, 1, 2]]), torch.tensor(10))
>>> collate_fn = PCDCollator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn([sample])
>>> x
tensor([[[1],
         [1],
         [2]]])
>>> y
tensor([10])

aidsorb.data.get_names(filename)

Return names stored in a .json file.

Parameters:

filename (str) – Absolute or relative path to the file.

Returns:

names

Return type:

tuple
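
A minimal standard-library sketch of the same idea follows. It assumes the .json file holds a JSON array of strings, which is an assumption about the file layout rather than something stated here; read_names is a hypothetical stand-in, not the library function.

```python
import json
import os
import tempfile


def read_names(filename):
    """Load a JSON array of names and return it as a tuple."""
    with open(filename, "r") as fhand:
        return tuple(json.load(fhand))


# Write a small hypothetical file, then read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "train.json")
    with open(path, "w") as fhand:
        json.dump(["foo", "bar"], fhand)
    names = read_names(path)

print(names)  # ('foo', 'bar')
```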

aidsorb.data.pad_pcds(pcds, *, channels_first, mode='upsample', return_mask=False)

Pad a sequence of variable-size point clouds.

Each point cloud must have shape (N_i, C).

Shapes

  • batch tensor of shape (B, T, C) if channels_first=False, else (B, C, T).

  • mask boolean tensor of shape (B, T) where True indicates padding.

B is the batch size and T is the size of the largest point cloud in the sequence.

Parameters:
  • pcds (sequence of tensors)

  • channels_first (bool)

  • mode ({'zeropad', 'upsample'}, default='upsample')

  • return_mask (bool, default=False)

Returns:

batch if return_mask=False, else (batch, mask).

Return type:

tensor or tuple of tensors

See also

upsample_pcd()

For a description of 'upsample' mode.

torch.nn.utils.rnn.pad_sequence()

For a description of 'zeropad' mode.
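
For intuition about the (B, T, C) and (B, C, T) layouts, here is a pure-Python sketch of zero-padding on nested lists. It only mimics the shapes described above and is not how the library pads tensors; zeropad is a hypothetical helper.

```python
def zeropad(pcds, channels_first):
    """Zero-pad point clouds given as nested lists of shape (N_i, C)."""
    t = max(len(pcd) for pcd in pcds)  # T: size of the largest point cloud
    c = len(pcds[0][0])  # C: number of channels, assumed equal for all clouds
    # Append all-zero points until every cloud has T points: (B, T, C).
    batch = [pcd + [[0] * c for _ in range(t - len(pcd))] for pcd in pcds]
    if channels_first:
        # Swap the last two axes: (B, T, C) -> (B, C, T).
        batch = [[list(channel) for channel in zip(*pcd)] for pcd in batch]
    return batch


x1 = [[1, 2, 3, 4]]
x2 = [[2, 5, 3, 8], [0, 2, 8, 9]]
print(zeropad([x1, x2], channels_first=False)[0])  # [[1, 2, 3, 4], [0, 0, 0, 0]]
print(zeropad([x1, x2], channels_first=True)[0])   # [[1, 0], [2, 0], [3, 0], [4, 0]]
```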

Examples

>>> x1 = torch.tensor([[1, 2, 3, 4]])
>>> x2 = torch.tensor([[2, 5, 3, 8], [0, 2, 8, 9]])
>>> batch = pad_pcds((x1, x2), channels_first=False)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=True)
>>> batch
tensor([[[1, 1],
         [2, 2],
         [3, 3],
         [4, 4]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=False, mode='zeropad')
>>> batch
tensor([[[1, 2, 3, 4],
         [0, 0, 0, 0]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=True, mode='zeropad')
>>> batch
tensor([[[1, 0],
         [2, 0],
         [3, 0],
         [4, 0]],

        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])
>>> # Pad and return padding mask (useful for attention-based architectures).
>>> batch, mask = pad_pcds((x1, x2), channels_first=False, return_mask=True)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],

        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> mask
tensor([[False,  True],
        [False, False]])
>>> # Pad a single point cloud.
>>> pad_pcds([x1], channels_first=False, mode='zeropad')
tensor([[[1, 2, 3, 4]]])
>>> pad_pcds([x1], channels_first=True, mode='upsample')
tensor([[[1],
         [2],
         [3],
         [4]]])

aidsorb.data.prepare_data(source, split_ratio=None, seed=1)

Split materials into train, validation and test sets.

Each created .json file stores the names of the materials that will be used for training, validation or testing.

Warning

  • All .json files are stored under the parent directory of source.

  • Splitting doesn’t support stratification. If your dataset is small and you want to perform classification, consider using train_test_split.

Parameters:
  • source (str) – Absolute or relative path to the directory holding the inputs.

  • split_ratio (sequence, default=None) – Absolute sizes or fractions of splits of the form (train, val, test). If None, it is set to (0.8, 0.1, 0.1).

  • seed (int, default=1) – Seed for the random number generator used for splitting.

Return type:

None
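
The kind of seeded, non-stratified split described above can be sketched with the standard library. This is an illustration under stated assumptions, not the library's algorithm: it handles only fractional split_ratio values, truncates the split sizes with int(), and gives any remainder to the test set. split_names is a hypothetical helper.

```python
import random


def split_names(names, split_ratio=(0.8, 0.1, 0.1), seed=1):
    """Shuffle names with a seeded RNG and cut them into train/val/test lists."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    shuffled = sorted(names)  # sort first so the result does not depend on input order
    rng.shuffle(shuffled)

    n_train = int(split_ratio[0] * len(shuffled))
    n_val = int(split_ratio[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


names = [f"material_{i}" for i in range(10)]
train, val, test = split_names(names)
print(len(train), len(val), len(test))  # 8 1 1
```

Because the shuffle is driven only by seed, calling split_names again with the same names and seed reproduces the same three lists.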

Examples

Before the split:

project_root
└── source
    ├── foo.npy
    ├── ...
    └── bar.npy

>>> prepare_data('path/to/source')

After the split:

project_root
├── source
│   ├── foo.npy
│   ├── ...
│   └── bar.npy
├── test.json
├── train.json
└── validation.json