fibber.datasets.dataset_utils module

This module provides utility functions and classes to handle fibber’s datasets.

  • To load a dataset, use the get_dataset function. For example, to load AG’s news dataset, run:

    trainset, testset = get_dataset("ag")
    
  • The trainset and testset are both dicts. Each dict looks like:

    {
      "label_mapping": [
        "World",
        "Sports",
        "Business",
        "Sci/Tech"
      ],
      "data": [
        {
          "label": 1,
          "text0": "Boston won the NBA championship in 2008."
        },
        {
          "label": 3,
          "text0": "Apple releases its latest cell phone."
        },
        ...
      ]
    }
    
  • To sub-sample 100 examples from the training set, run:

    subsampled_dataset = subsample_dataset(trainset, 100)
    
  • To convert a dataset dict to a torch.IterableDataset for a BERT model, run:

    iterable_dataset = DatasetForTransformers(trainset, "bert-base-cased", batch_size=32)
    

For more details, see https://dai-lab.github.io/fibber/
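
Putting these pieces together, a minimal end-to-end sketch (assuming the AG’s news dataset has already been downloaded to the fibber root directory) looks like this:

    from fibber.datasets.dataset_utils import (
        DatasetForTransformers, get_dataset, subsample_dataset)

    trainset, testset = get_dataset("ag")
    print(trainset["label_mapping"])          # ['World', 'Sports', 'Business', 'Sci/Tech']

    # Keep a small subset for quick experiments.
    subsampled_dataset = subsample_dataset(trainset, 100)
    print(len(subsampled_dataset["data"]))    # 100

    # Wrap the subset as an infinite batch iterator for a BERT model.
    iterable_dataset = DatasetForTransformers(
        subsampled_dataset, "bert-base-cased", batch_size=32)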

class fibber.datasets.dataset_utils.DatasetForTransformers(*args, **kwds)[source]

Bases: torch.utils.data.dataset.IterableDataset

Create a torch.IterableDataset for a BERT model.

This class is an iterator that yields an infinite stream of batches from the dataset. To construct a batch, we randomly sample a few examples of similar length, pad all of them to the same length L, and pack them into a tuple of 4 or 5 tensors. All tensors are on the CPU.

Each example starts with [CLS], and ends with [SEP]. If there are two parts in the input, the two parts are separated by [SEP].

__iter__(self):

Yields

A tuple of tensors (optionally followed by a list of str).

  • The first tensor is an int tensor of size (batch_size, L), representing word ids. Each row of this tensor corresponds to one example in the dataset. If masked_lm == True, the tensor stores the masked text.

  • The second tensor is an int tensor of size (batch_size, L), indicating which positions contain text. Each entry is 1 if the corresponding position is text, and 0 if it is padding.

  • The third tensor is an int tensor of size (batch_size, L), representing the token type. The token type is 0 if the current position is in the first part of the input text, and 1 if it is in the second part. For padding positions, the token type is 0.

  • The fourth tensor is an int tensor of size (batch_size,), representing the classification label.

  • (optional) If masked_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry in this tensor is either -100 if the position is not masked, or the correct word if the position is masked. Note that a masked position is not always a [MASK] token in the first tensor. With 80% probability it is [MASK], with 10% probability it is the original word, and with 10% probability it is a random word.

  • (optional) If autoregressive_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry in this tensor is -100 if the corresponding position is a [CLS] or [PAD] token, and the word id at that position otherwise.

  • (optional) If include_raw_text == True, the last item is a list of str.
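
For example, a minimal sketch of consuming the iterator with the default flags, so each batch is a 4-tuple of CPU tensors (the variable names below are descriptive labels chosen for this sketch, not part of the API):

    from fibber.datasets.dataset_utils import DatasetForTransformers, get_dataset

    trainset, _ = get_dataset("ag")
    batch_iter = iter(DatasetForTransformers(trainset, "bert-base-cased", batch_size=32))

    for _ in range(3):                        # the iterator is infinite, so bound the loop
        word_ids, text_mask, token_types, labels = next(batch_iter)
        print(word_ids.shape, labels.shape)   # (batch_size, L) and (batch_size,)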

Initialize.

Parameters
  • dataset (dict) – a dataset dict.

  • model_init (str) – the pre-trained model name. Select from ['bert-base-cased', 'bert-base-uncased', 'bert-large-cased', 'bert-large-uncased'].

  • batch_size (int) – the batch size in each step.

  • exclude (int) – exclude one category from the data. Use -1 (default) to include all categories.

  • masked_lm (bool) – whether to randomly replace words with mask tokens.

  • masked_lm_ratio (float) – the ratio of random masks. Ignored when masked_lm is False.

  • select_field (None or str) – select a single text field. Use None to include all available fields.
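
A hedged sketch of constructing the dataset for masked language modeling; the masked_lm_ratio value below is an arbitrary choice, not a documented default:

    from fibber.datasets.dataset_utils import DatasetForTransformers, get_dataset

    trainset, _ = get_dataset("ag")
    mlm_dataset = DatasetForTransformers(
        trainset, "bert-base-uncased", batch_size=16,
        masked_lm=True, masked_lm_ratio=0.15)

    # With masked_lm=True, the fifth tensor holds masked-LM labels:
    # -100 for unmasked positions, the original word id for masked ones.
    word_ids, text_mask, token_types, labels, lm_labels = next(iter(mlm_dataset))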

reinforce_type(expected_type)

Reinforce the type for this DataPipe instance. The expected_type is required to be a subtype of the original type hint, so as to restrict the type requirement of the DataPipe instance.

fibber.datasets.dataset_utils.clip_sentence(dataset, model_init, max_len)[source]

Clip the sentences in the dataset in place so that each has at most max_len tokens.
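
A usage sketch, assuming max_len counts tokens produced by the tokenizer of the given pre-trained model; the value 128 is arbitrary:

    from fibber.datasets.dataset_utils import clip_sentence, get_dataset

    trainset, testset = get_dataset("ag")
    clip_sentence(trainset, "bert-base-cased", 128)   # modifies trainset in place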

fibber.datasets.dataset_utils.get_dataset(dataset_name)[source]

Load dataset from fibber root directory.

Users should make sure the data is downloaded to the datasets folder in the fibber root dir (default: ~/.fibber/datasets). Otherwise, an assertion error is raised.

Parameters

dataset_name (str) – the name of the dataset. See https://dai-lab.github.io/fibber/ for a full list of built-in datasets.

Returns

A tuple of two dicts, representing the training set and the test set respectively.

Return type

(dict, dict)
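
For example, the assertion error can be caught to give a friendlier message when the data has not been downloaded yet:

    from fibber.datasets.dataset_utils import get_dataset

    try:
        trainset, testset = get_dataset("ag")
    except AssertionError:
        # Raised when the dataset files are missing from ~/.fibber/datasets.
        print("Please download the dataset to the fibber root directory first.")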

fibber.datasets.dataset_utils.get_demo_dataset()[source]

Download the demo dataset.

Returns

trainset and testset.

Return type

(dict, dict)
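
For example:

    from fibber.datasets.dataset_utils import get_demo_dataset

    # The demo dataset is downloaded by this call, so no manual setup is needed.
    trainset, testset = get_demo_dataset()
    print(trainset["label_mapping"])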

fibber.datasets.dataset_utils.subsample_dataset(dataset, n, offset=0)[source]

Sub-sample a dataset to n examples.

Data is selected evenly from each category in a deterministic pseudo-random order: the data in each category is sorted by its md5 hash value, and the top (n // k) examples from each category are included in the sub-sampled dataset, where k is the number of categories.

If n is not divisible by k, one additional example is sampled from each of the first (n % k) categories.

If the dataset has fewer than n examples, a copy of the original dataset is returned.

Parameters
  • dataset (dict) – a dataset dict.

  • n (int) – the size of the sub-sampled dataset.

  • offset (int) – dataset offset.

Returns

a sub-sampled dataset as a dict.

Return type

(dict)
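
For example, a sketch drawing two sub-samples; the exact semantics of offset are assumed here (skipping examples within each category's md5-sorted order):

    from fibber.datasets.dataset_utils import get_dataset, subsample_dataset

    trainset, _ = get_dataset("ag")

    # 100 examples: 25 per category for the 4 AG's news categories.
    dev_subset = subsample_dataset(trainset, 100)

    # Another 100 examples starting at offset 100 (assumed to be disjoint from dev_subset).
    extra_subset = subsample_dataset(trainset, 100, offset=100)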

fibber.datasets.dataset_utils.text_md5(x)[source]

Computes and returns the md5 hash of a str.
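
A minimal sketch of an equivalent computation, assuming the hash is the hexadecimal md5 digest of the UTF-8 encoded string:

    import hashlib

    def text_md5_sketch(x):
        # Hypothetical stand-in for fibber.datasets.dataset_utils.text_md5.
        return hashlib.md5(x.encode("utf-8")).hexdigest()

    print(text_md5_sketch("Apple releases its latest cell phone."))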

fibber.datasets.dataset_utils.verify_dataset(dataset)[source]

Verify that the dataset dict contains the necessary fields.

An assertion error is raised if any field is missing or incorrect.

Parameters

dataset (dict) – a dataset dict.
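
For example, verifying a hand-built dataset dict in the format shown at the top of this page:

    from fibber.datasets.dataset_utils import verify_dataset

    custom_dataset = {
        "label_mapping": ["negative", "positive"],
        "data": [
            {"label": 1, "text0": "A delightful and well-paced film."},
            {"label": 0, "text0": "The plot never comes together."},
        ],
    }

    # Raises an assertion error if required fields are missing or incorrect.
    verify_dataset(custom_dataset)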