aidsorb.data
Helper functions and classes for creating datasets and collating data.
- class aidsorb.data.Dataset(names, path_to_X, *, path_to_Y=None, index_col=None, labels=None, transform_x=None, transform_y=None)
Bases: Dataset
Dataset for supervised/unsupervised learning.
Indexing the dataset returns (x, None) if data are unlabeled, i.e. path_to_Y=None, else (x, y), where x and y are the results of transform_x and transform_y, respectively.
- Parameters:
names (sequence) – Names of the materials.
path_to_X (str) – Absolute or relative path to the directory holding the inputs.
path_to_Y (str, optional) – Absolute or relative path to the .csv file holding the labels of the inputs.
index_col (str, optional) – Column name of the .csv file to be used for indexing. This column must include names. No effect if path_to_Y=None.
labels (list, optional) – List of column names from the .csv file containing the properties to be predicted. No effect if path_to_Y=None.
transform_x (callable, optional) – Transformation to apply to input.
transform_y (callable, optional) – Transformation to apply to label. No effect if path_to_Y=None.
See also
aidsorb.transforms: For available input transformations.
- Y
Dataframe for the labels. The columns follow the order in labels.
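The indexing contract described for Dataset can be illustrated with a small stand-in class. This is only a sketch in plain Python: ToyDataset, its in-memory dictionaries, and the sample values are hypothetical, and it does not reflect how aidsorb.data.Dataset loads data from path_to_X and path_to_Y.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics only the documented return values."""

    def __init__(self, names, inputs, labels=None, transform_x=None, transform_y=None):
        self.names = list(names)
        self.inputs = inputs            # assumption: name -> raw input, held in memory
        self.labels = labels            # name -> raw label, or None if unlabeled
        self.transform_x = transform_x
        self.transform_y = transform_y

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        x = self.inputs[name]
        if self.transform_x is not None:
            x = self.transform_x(x)
        if self.labels is None:
            return x, None              # unlabeled data: the label slot is None
        y = self.labels[name]
        if self.transform_y is not None:
            y = self.transform_y(y)
        return x, y


inputs = {"foo": [[1.0, 0.0, 0.0, 6.0]], "bar": [[0.0, 2.0, 0.0, 8.0]]}
labels = {"foo": 1.5, "bar": 3.0}

unlabeled = ToyDataset(["foo", "bar"], inputs)
labeled = ToyDataset(["foo", "bar"], inputs, labels=labels, transform_y=lambda y: 2 * y)

x0, y0 = unlabeled[0]   # y0 is None
x1, y1 = labeled[1]     # y1 is the transformed label of "bar"
```

Returning a tuple in both cases means downstream code can always unpack two values and check the second one for None.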
- class aidsorb.data.PCDCollator(*, channels_first, mode='upsample', return_mask=False)
Bases: object
Collator for point clouds.
Point clouds are padded before collation, so they can form a batch.
Shapes
Input: sequence of samples
Each sample is a tuple of (pcd, label).
pcd tensor of shape (N_i, C).
label tensor of shape (n_outputs,), () or None.
Output: tuple
B is the batch size and T is the size of the largest point cloud in the sequence.
- Parameters:
See also
pad_pcds(): For a description of the parameters.
Examples
>>> import torch
>>> from aidsorb.data import PCDCollator
>>> sample1 = (torch.tensor([[1, 4, 5, 2]]), torch.tensor([1., 2.]))
>>> sample2 = (torch.tensor([[0, 4, 0, 2], [2, 4, 1, 8]]), torch.tensor([7., 3.]))
>>> collate_fn = PCDCollator(channels_first=True)
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 1],
         [4, 4],
         [5, 5],
         [2, 2]],
        [[0, 2],
         [4, 4],
         [0, 1],
         [2, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1, 4, 5, 2],
         [0, 0, 0, 0]],
        [[0, 4, 0, 2],
         [2, 4, 1, 8]]])
>>> y
tensor([[1., 2.],
        [7., 3.]])
>>> # Label has shape (), i.e. is scalar.
>>> sample1 = (torch.tensor([[3, 4, 3, 2]]), torch.tensor(0))
>>> sample2 = (torch.tensor([[2, 4, 8, 2], [9, 4, 1, 8]]), torch.tensor(1))
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[3, 4, 3, 2],
         [0, 0, 0, 0]],
        [[2, 4, 8, 2],
         [9, 4, 1, 8]]])
>>> y
tensor([0, 1])
>>> # Label is None, i.e. unlabeled data.
>>> sample1 = (torch.tensor([[1., 0., 1., 0.]]), None)
>>> sample2 = (torch.tensor([[5., 2., 2., 0.], [9., 0., 0., 1.]]), None)
>>> collate_fn = PCDCollator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn((sample1, sample2))
>>> x
tensor([[[1., 0.],
         [0., 0.],
         [1., 0.],
         [0., 0.]],
        [[5., 9.],
         [2., 0.],
         [2., 0.],
         [0., 1.]]])
>>> y
>>> # Collate and return padding mask.
>>> sample1 = (torch.tensor([[4, 2, 1, 4], [2, 0, 0, 1]]), torch.tensor(1))
>>> sample2 = (torch.tensor([[1, 2, 3, 1]]), torch.tensor(4))
>>> collate_fn = PCDCollator(channels_first=False, mode='zeropad', return_mask=True)
>>> (x, mask), y = collate_fn((sample1, sample2))
>>> x
tensor([[[4, 2, 1, 4],
         [2, 0, 0, 1]],
        [[1, 2, 3, 1],
         [0, 0, 0, 0]]])
>>> y
tensor([1, 4])
>>> mask
tensor([[False, False],
        [False, True]])
>>> # Batch a single unlabeled sample.
>>> sample = (torch.tensor([[2, 3, 4]]), None)
>>> collate_fn = PCDCollator(channels_first=False)
>>> x, y = collate_fn([sample])
>>> x
tensor([[[2, 3, 4]]])
>>> y
>>> # Batch a single labeled sample.
>>> sample = (torch.tensor([[1, 1, 2]]), torch.tensor(10))
>>> collate_fn = PCDCollator(channels_first=True, mode='zeropad')
>>> x, y = collate_fn([sample])
>>> x
tensor([[[1],
         [1],
         [2]]])
>>> y
tensor([10])
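The label handling visible in the PCDCollator examples (labels gathered into one batch, or None for unlabeled data) can be sketched in plain Python. The helper collate_labels is hypothetical and written only for illustration; it works on lists rather than tensors and is not taken from the library's implementation. Rejecting a batch that mixes labeled and unlabeled samples is an assumption of this sketch.

```python
def collate_labels(samples):
    """Gather the labels of a batch, or return None if the batch is unlabeled.

    Assumption: either every sample carries a label or none does.
    """
    labels = [label for _, label in samples]
    if all(label is None for label in labels):
        return None
    if any(label is None for label in labels):
        raise ValueError("cannot mix labeled and unlabeled samples")
    return labels


labeled = [([[3, 4, 3, 2]], 0), ([[2, 4, 8, 2], [9, 4, 1, 8]], 1)]
unlabeled = [([[1.0, 0.0, 1.0, 0.0]], None)]

y_labeled = collate_labels(labeled)
y_unlabeled = collate_labels(unlabeled)
```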
- aidsorb.data.pad_pcds(pcds, *, channels_first, mode='upsample', return_mask=False)
Pad a sequence of variable-size point clouds.
Each point cloud must have shape (N_i, C).
Shapes
batch tensor of shape (B, T, C) if channels_first=False, else (B, C, T).
mask boolean tensor of shape (B, T) where True indicates padding.
B is the batch size and T is the size of the largest point cloud in the sequence.
- Parameters:
- Returns:
batch if return_mask=False, else (batch, mask).
- Return type:
tensor or tuple of tensors
See also
upsample_pcd(): For a description of 'upsample' mode.
torch.nn.utils.rnn.pad_sequence(): For a description of 'zeropad' mode.
Examples
>>> import torch
>>> from aidsorb.data import pad_pcds
>>> x1 = torch.tensor([[1, 2, 3, 4]])
>>> x2 = torch.tensor([[2, 5, 3, 8], [0, 2, 8, 9]])
>>> batch = pad_pcds((x1, x2), channels_first=False)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],
        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=True)
>>> batch
tensor([[[1, 1],
         [2, 2],
         [3, 3],
         [4, 4]],
        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=False, mode='zeropad')
>>> batch
tensor([[[1, 2, 3, 4],
         [0, 0, 0, 0]],
        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> batch = pad_pcds((x1, x2), channels_first=True, mode='zeropad')
>>> batch
tensor([[[1, 0],
         [2, 0],
         [3, 0],
         [4, 0]],
        [[2, 0],
         [5, 2],
         [3, 8],
         [8, 9]]])
>>> # Pad and return padding mask (useful for attention-based architectures).
>>> batch, mask = pad_pcds((x1, x2), channels_first=False, return_mask=True)
>>> batch
tensor([[[1, 2, 3, 4],
         [1, 2, 3, 4]],
        [[2, 5, 3, 8],
         [0, 2, 8, 9]]])
>>> mask
tensor([[False, True],
        [False, False]])
>>> # Pad a single point cloud.
>>> pad_pcds([x1], channels_first=False, mode='zeropad')
tensor([[[1, 2, 3, 4]]])
>>> pad_pcds([x1], channels_first=True, mode='upsample')
tensor([[[1],
         [2],
         [3],
         [4]]])
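To make the shapes and the meaning of the mask concrete, the padding logic can be approximated in plain Python on nested lists. This is a rough sketch, not the library's code: pad_point_clouds and its seed argument are hypothetical, and the 'upsample' branch assumes that upsampling repeats randomly chosen existing points, which may differ from what upsample_pcd() actually does.

```python
import random


def pad_point_clouds(pcds, *, channels_first, mode="upsample", return_mask=False, seed=0):
    """Pad point clouds (lists of points, each a list of C values) to a common size T."""
    if mode not in ("upsample", "zeropad"):
        raise ValueError(f"unknown mode: {mode!r}")
    rng = random.Random(seed)
    size = max(len(pcd) for pcd in pcds)   # T: size of the largest point cloud
    n_channels = len(pcds[0][0])           # C: number of values per point

    batch, mask = [], []
    for pcd in pcds:
        n_pad = size - len(pcd)
        if mode == "zeropad":
            extra = [[0] * n_channels for _ in range(n_pad)]
        else:
            # Assumption: upsampling repeats randomly chosen existing points.
            extra = [list(rng.choice(pcd)) for _ in range(n_pad)]
        padded = [list(point) for point in pcd] + extra
        if channels_first:
            padded = [list(column) for column in zip(*padded)]  # (T, C) -> (C, T)
        batch.append(padded)
        mask.append([False] * len(pcd) + [True] * n_pad)        # True marks padding

    return (batch, mask) if return_mask else batch


x1 = [[1, 2, 3, 4]]
x2 = [[2, 5, 3, 8], [0, 2, 8, 9]]

batch, mask = pad_point_clouds((x1, x2), channels_first=False, mode="zeropad", return_mask=True)
upsampled = pad_point_clouds((x1, x2), channels_first=True)
```

In this sketch the mask is built the same way in both modes, mirroring the documented example in which the mask flags the added points even when mode='upsample'.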
- aidsorb.data.prepare_data(source, split_ratio=None, seed=1)
Split materials into train, validation and test sets.
Each created .json file stores the names of the materials assigned to one split: training, validation, or testing.
Warning
All .json files are stored under the parent directory of source.
Splitting doesn't support stratification. If your dataset is small and you want to perform classification, consider using train_test_split.
- Parameters:
source (str) – Absolute or relative path to the directory holding the inputs.
split_ratio (sequence, default=None) – Absolute sizes or fractions of splits of the form (train, val, test). If None, it is set to (0.8, 0.1, 0.1).
seed (int, default=1) – Controls randomness of the rng used for splitting.
- Return type:
None
Examples
Before the split:
project_root
└── source
    ├── foo.npy
    ├── ...
    └── bar.npy
>>> from aidsorb.data import prepare_data
>>> prepare_data('path/to/source')
After the split:
project_root
├── source
│   ├── foo.npy
│   ├── ...
│   └── bar.npy
├── test.json
├── train.json
└── validation.json
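The general idea of a seeded, non-stratified split written to .json files next to the source directory can be sketched with the standard library alone. Everything here is an assumption made for illustration: split_names and write_splits are hypothetical helpers, the material name is taken to be the file stem, each .json file is assumed to hold a plain list of names, and the shuffling need not match the rng used by prepare_data.

```python
import json
import random
import tempfile
from pathlib import Path


def split_names(names, split_ratio=(0.8, 0.1, 0.1), seed=1):
    """Shuffle names with a seeded rng and cut them into three splits."""
    names = sorted(names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)

    n_total = len(names)
    if all(isinstance(size, int) for size in split_ratio):
        n_train, n_val, n_test = split_ratio        # absolute sizes
    else:
        n_train = int(split_ratio[0] * n_total)     # fractions of the dataset
        n_val = int(split_ratio[1] * n_total)
        n_test = n_total - n_train - n_val          # remainder goes to test
    if n_train + n_val + n_test > n_total:
        raise ValueError("split sizes exceed the number of materials")

    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:n_train + n_val + n_test],
    }


def write_splits(source, splits):
    """Store one .json file per split under the parent directory of source."""
    parent = Path(source).parent
    for stage, stage_names in splits.items():
        (parent / f"{stage}.json").write_text(json.dumps(stage_names))


with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "source"
    source.mkdir()
    for name in ("foo", "bar", "baz", "qux", "quux"):
        (source / f"{name}.npy").touch()

    names = [path.stem for path in source.iterdir()]
    splits = split_names(names, split_ratio=(3, 1, 1))
    write_splits(source, splits)

    created = sorted(path.name for path in Path(root).glob("*.json"))
    reloaded = json.loads((Path(root) / "train.json").read_text())
```

Because the shuffle ignores class labels, a small classification dataset can end up with a class missing from one split, which is the situation the warning about stratification refers to.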