fibber.datasets.dataset_utils module

This module provides utility functions and classes to handle fibber’s datasets.

  • To load a dataset, use the get_dataset function. For example, to load AG’s news dataset, run:

    trainset, testset = get_dataset("ag")
    
  • The trainset and testset are both dicts. Each dict looks like:

    {
      "label_mapping": [
        "World",
        "Sports",
        "Business",
        "Sci/Tech"
      ],
      "data": [
        {
          "label": 1,
          "text0": "Boston won the NBA championship in 2008."
        },
        {
          "label": 3,
          "text0": "Apple releases its latest cell phone."
        },
        ...
      ]
    }
    
  • To sub-sample 100 examples from the training set, run:

    subsampled_dataset = subsample_dataset(trainset, 100)
    
  • To convert a dataset dict to a torch.IterableDataset for a BERT model, run:

    iterable_dataset = DatasetForTransformers(trainset, "bert-base-cased", batch_size=32)
    

For more details, see https://dai-lab.github.io/fibber/
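
Putting these pieces together, a minimal end-to-end sketch (assuming the AG’s news dataset has already been downloaded to the fibber root directory) looks like this:

    from fibber.datasets.dataset_utils import (
        DatasetForTransformers, get_dataset, subsample_dataset)

    trainset, testset = get_dataset("ag")
    print(trainset["label_mapping"])          # ['World', 'Sports', 'Business', 'Sci/Tech']

    # Keep a small subset for quick experiments.
    subsampled_dataset = subsample_dataset(trainset, 100)
    print(len(subsampled_dataset["data"]))    # 100

    # Wrap the subset as an infinite batch iterator for a BERT model.
    iterable_dataset = DatasetForTransformers(
        subsampled_dataset, "bert-base-cased", batch_size=32)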

class fibber.datasets.dataset_utils.DatasetForTransformers(*args, **kwds)[source]

Bases: torch.utils.data.dataset.IterableDataset

Create a torch.IterableDataset for a BERT model.

This class is an iterator that yields an infinite stream of batches from the dataset. To construct a batch, we randomly sample a few examples of similar length, pad all of them to the same length L, and pack them into a tuple of 4 or 5 tensors. All tensors are on the CPU.

Each example starts with [CLS], and ends with [SEP]. If there are two parts in the input, the two parts are separated by [SEP].

__iter__(self):

Yields

A tuple of tensors (optionally followed by a list of str).

  • The first tensor is an int tensor of size (batch_size, L), representing word ids. Each row of this tensor corresponds to one example in the dataset. If masked_lm == True, the tensor stores the masked text.

  • The second tensor is an int tensor of size (batch_size, L), indicating which positions contain text. Each entry is 1 if the corresponding position is text, and 0 if it is padding.

  • The third tensor is an int tensor of size (batch_size, L), representing the token type. The token type is 0 if the current position is in the first part of the input text, and 1 if it is in the second part. For padding positions, the token type is 0.

  • The fourth tensor is an int tensor of size (batch_size,), representing the classification label.

  • (optional) If masked_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry in this tensor is either -100 if the position is not masked, or the correct word if the position is masked. Note that a masked position is not always a [MASK] token in the first tensor. With 80% probability it is [MASK], with 10% probability it is the original word, and with 10% probability it is a random word.

  • (optional) If autoregressive_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry in this tensor is -100 if the corresponding position is a [CLS] or [PAD] token, and the word id at that position otherwise.

  • (optional) If include_raw_text == True, the last item is a list of str.
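
For example, a minimal sketch of consuming the iterator with the default flags, so each batch is a 4-tuple of CPU tensors (the variable names below are descriptive labels chosen for this sketch, not part of the API):

    from fibber.datasets.dataset_utils import DatasetForTransformers, get_dataset

    trainset, _ = get_dataset("ag")
    batch_iter = iter(DatasetForTransformers(trainset, "bert-base-cased", batch_size=32))

    for _ in range(3):                        # the iterator is infinite, so bound the loop
        word_ids, text_mask, token_types, labels = next(batch_iter)
        print(word_ids.shape, labels.shape)   # (batch_size, L) and (batch_size,)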

Initialize.

Parameters
  • dataset (dict) – a dataset dict.

  • model_init (str) – the pre-trained model name. Select from ['bert-base-cased', 'bert-base-uncased', 'bert-large-cased', 'bert-large-uncased'].

  • batch_size (int) – the batch size in each step.

  • exclude (int) – exclude one category from the data. Use -1 (default) to include all categories.

  • masked_lm (bool) – whether to randomly replace words with mask tokens.

  • masked_lm_ratio (float) – the ratio of random masks. Ignored when masked_lm is False.

  • select_field (None or str) – select a single text field. Use None to include all available fields.
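
A hedged sketch of constructing the dataset for masked language modeling; the masked_lm_ratio value below is an arbitrary choice, not a documented default:

    from fibber.datasets.dataset_utils import DatasetForTransformers, get_dataset

    trainset, _ = get_dataset("ag")
    mlm_dataset = DatasetForTransformers(
        trainset, "bert-base-uncased", batch_size=16,
        masked_lm=True, masked_lm_ratio=0.15)

    # With masked_lm=True, the fifth tensor holds masked-LM labels:
    # -100 for unmasked positions, the original word id for masked ones.
    word_ids, text_mask, token_types, labels, lm_labels = next(iter(mlm_dataset))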

reinforce_type(expected_type)

Reinforce the type for this DataPipe instance. The expected_type is required to be a subtype of the original type hint, so as to restrict the type requirement of the DataPipe instance.

fibber.datasets.dataset_utils.clip_sentence(dataset, model_init, max_len)[source]

Clip the sentences in the dataset in place so that each has at most max_len tokens.
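
A usage sketch, assuming max_len counts tokens produced by the tokenizer of the given pre-trained model; the value 128 is arbitrary:

    from fibber.datasets.dataset_utils import clip_sentence, get_dataset

    trainset, testset = get_dataset("ag")
    clip_sentence(trainset, "bert-base-cased", 128)   # modifies trainset in place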

fibber.datasets.dataset_utils.get_dataset(dataset_name)[source]

Load dataset from fibber root directory.

Users should make sure the data is downloaded to the datasets folder in the fibber root dir (default: ~/.fibber/datasets). Otherwise, an assertion error is raised.

Parameters

dataset_name (str) – the name of the dataset. See https://dai-lab.github.io/fibber/ for a full list of built-in datasets.

Returns

A tuple of two dicts, representing the training set and the test set respectively.

Return type

(dict, dict)
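
For example, the assertion error can be caught to give a friendlier message when the data has not been downloaded yet:

    from fibber.datasets.dataset_utils import get_dataset

    try:
        trainset, testset = get_dataset("ag")
    except AssertionError:
        # Raised when the dataset files are missing from ~/.fibber/datasets.
        print("Please download the dataset to the fibber root directory first.")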

fibber.datasets.dataset_utils.get_demo_dataset()[source]

Download the demo dataset.

Returns

trainset and testset.

Return type

(dict, dict)
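
For example:

    from fibber.datasets.dataset_utils import get_demo_dataset

    # The demo dataset is downloaded by this call, so no manual setup is needed.
    trainset, testset = get_demo_dataset()
    print(trainset["label_mapping"])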

fibber.datasets.dataset_utils.subsample_dataset(dataset, n, offset=0)[source]

Sub-sample a dataset to n examples.

Data is selected evenly from each category in a deterministic pseudo-random order: the data in each category is sorted by its md5 hash value, and the top (n // k) examples from each category are included in the sub-sampled dataset, where k is the number of categories.

If n is not divisible by k, one additional example is sampled from each of the first (n % k) categories.

If the dataset has fewer than n examples, a copy of the original dataset is returned.

Parameters
  • dataset (dict) – a dataset dict.

  • n (int) – the size of the sub-sampled dataset.

  • offset (int) – dataset offset.

Returns

a sub-sampled dataset as a dict.

Return type

(dict)
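
For example, a sketch drawing two sub-samples; the exact semantics of offset are assumed here (skipping examples within each category's md5-sorted order):

    from fibber.datasets.dataset_utils import get_dataset, subsample_dataset

    trainset, _ = get_dataset("ag")

    # 100 examples: 25 per category for the 4 AG's news categories.
    dev_subset = subsample_dataset(trainset, 100)

    # Another 100 examples starting at offset 100 (assumed to be disjoint from dev_subset).
    extra_subset = subsample_dataset(trainset, 100, offset=100)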

fibber.datasets.dataset_utils.text_md5(x)[source]

Computes and returns the md5 hash of a str.
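
A minimal sketch of an equivalent computation, assuming the hash is the hexadecimal md5 digest of the UTF-8 encoded string:

    import hashlib

    def text_md5_sketch(x):
        # Hypothetical stand-in for fibber.datasets.dataset_utils.text_md5.
        return hashlib.md5(x.encode("utf-8")).hexdigest()

    print(text_md5_sketch("Apple releases its latest cell phone."))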

fibber.datasets.dataset_utils.verify_dataset(dataset)[source]

Verify that the dataset dict contains the necessary fields.

An assertion error is raised if any field is missing or incorrect.

Parameters

dataset (dict) – a dataset dict.
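
For example, verifying a hand-built dataset dict in the format shown at the top of this page:

    from fibber.datasets.dataset_utils import verify_dataset

    custom_dataset = {
        "label_mapping": ["negative", "positive"],
        "data": [
            {"label": 1, "text0": "A delightful and well-paced film."},
            {"label": 0, "text0": "The plot never comes together."},
        ],
    }

    # Raises an assertion error if required fields are missing or incorrect.
    verify_dataset(custom_dataset)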