Poison Dataset Wrapper
======================

The ``Poison Dataset Wrapper`` is implemented in ``BackdoorMBTI/utils/data.py``. It provides the ``BadSet`` class for managing datasets that contain poisoned samples. The wrapper relabels a controlled fraction of samples with a chosen target label, which makes it useful for building backdoor attack datasets in machine learning.

.. code-block:: python

    class BadSet(Dataset):
        def __init__(
            self,
            benign_set,
            poison_set_path,
            type,
            dataset,
            attack,
            num_classes=None,
            mislabel=None,
            target_label=0,
            poison_rate=0.1,
            seed=0,
            mode=Literal["train", "test"],
            pop=False,
        ) -> None:
            ...

Class Parameters
----------------

- **benign_set**: The original, clean dataset.
- **poison_set_path**: Path to the poisoned dataset.
- **type**: Data type (e.g., ``"image"``, ``"text"``, ``"audio"``, ``"video"``).
- **dataset**: The name of the dataset being used.
- **attack**: The type of attack for which the poison data was created.
- **num_classes**: (Optional) Number of classes in the dataset.
- **mislabel**: (Optional) If ``True``, randomly mislabels some benign samples.
- **target_label**: Default ``0``. The label assigned to poisoned samples. It can be set to any label in the dataset.
- **poison_rate**: Default ``0.1``. The proportion of samples to poison. For example, ``poison_rate=0.2`` means 20% of samples will be poisoned.
- **seed**: Default ``0``. Random seed for reproducibility.
- **mode**: The dataset mode, either ``"train"`` or ``"test"``.
- **pop**: If ``True``, removes classes that do not match the target label from the poisoned set.

Main Methods
------------

**_get_poison_dataset**

Loads the poisoned dataset from ``poison_set_path``. Raises ``FileNotFoundError`` if the file is missing.

.. code-block:: python

    def _get_poison_dataset(self):
        """Load the poisoned dataset from a given path."""
        # File name is built from the data type, attack name, and mode.
        data_path = self.poison_set_path / f"{self.type}_{self.attack}_poison_{self.mode}_set.pt"
        if not data_path.exists():
            raise FileNotFoundError(f"No such file: {data_path}")
        return torch.load(data_path)

**_pop**

Removes classes other than the target label from the dataset. This focuses the dataset on the label targeted by the poisoning attack.

.. code-block:: python

    def _pop(self):
        """Remove classes other than the target label from the dataset."""

**_mis_label**

Randomly assigns a different label to a non-poisoned sample. This is mainly useful during training to add label noise.

.. code-block:: python

    def _mis_label(self, target, num_classes):
        """Randomly mislabel non-poisoned samples."""
        # Note: random.randint is inclusive on both ends, so an offset equal
        # to num_classes maps the sample back to its original label.
        return (target + random.randint(1, num_classes)) % num_classes

**get_poisoned_index**

Selects which samples to poison based on ``poison_rate`` and ``seed``. It returns a dictionary whose keys are the indices of the poisoned samples.

.. code-block:: python

    def get_poisoned_index(self, length, seed, rate):
        """Calculate the indices for poisoned samples."""
        n = round(length * rate)             # number of samples to poison
        torch.manual_seed(seed)              # make the permutation reproducible
        indices = torch.randperm(length)[:n]
        return {int(idx): 1 for idx in indices}

**__getitem__**

Retrieves a data sample by index. If the index is in the poisoned index, or if the mode is ``"test"``, the sample is returned with ``target_label`` as its label. Image, text, audio, and video data are supported.

.. code-block:: python

    def __getitem__(self, index):
        """Retrieve a sample, applying target label if poisoned."""
        if index in self.poison_index or self.mode == "test":
            # Apply target label if poisoned
            return ...  # Logic for poisoned data
        else:
            return ...  # Logic for benign data

**__len__**

Returns the total length of the poisoned dataset.

.. code-block:: python

    def __len__(self):
        """Return the length of the poisoned dataset."""
        return len(self.poison_set)

Example Usage
-------------

Initialize ``BadSet`` with custom ``target_label`` and ``poison_rate`` values.

.. code-block:: python

    from BackdoorMBTI.utils.data import BadSet

    badset = BadSet(
        benign_set=original_dataset,
        poison_set_path="path/to/poison_set",
        type="image",
        dataset="example_dataset",
        attack="attack_type",
        target_label=1,    # Custom target label
        poison_rate=0.2,   # Custom poison rate (20% of data)
        mode="train",      # Either "train" or "test"
    )

In this example, ``target_label`` is ``1`` and ``poison_rate`` is ``0.2``, so 20% of the samples are poisoned and labeled ``1`` to simulate a poisoning attack.
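
To make the roles of ``poison_rate`` and ``seed`` more concrete, the following is a minimal, standalone sketch of index selection and mislabeling. It is not the ``BadSet`` implementation: the function names ``select_poisoned_indices`` and ``mislabel`` are hypothetical, and it uses Python's standard ``random`` module in place of ``torch.randperm``, so the selected indices will not match those produced by the library. The mislabeling variant assumes ``num_classes >= 2`` and draws its offset from ``1`` to ``num_classes - 1``, which guarantees that the returned label differs from the original.

.. code-block:: python

    import random


    def select_poisoned_indices(length, rate, seed):
        """Pick round(length * rate) distinct indices, reproducibly per seed."""
        n = round(length * rate)
        rng = random.Random(seed)  # local RNG; global random state is untouched
        return {idx: 1 for idx in rng.sample(range(length), n)}


    def mislabel(target, num_classes, rng):
        """Return a label that always differs from target (num_classes >= 2)."""
        offset = rng.randint(1, num_classes - 1)  # inclusive on both ends
        return (target + offset) % num_classes


    poison_index = select_poisoned_indices(length=1000, rate=0.2, seed=0)
    print(len(poison_index))  # 200

Storing the indices as dictionary keys keeps the ``index in poison_index`` check in a ``__getitem__``-style lookup at constant time, and calling the function again with the same ``seed`` yields the same set of indices.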