fibber.datasets package

Module contents

class fibber.datasets.DatasetForTransformers(*args, **kwds)[source]

Bases: torch.utils.data.dataset.IterableDataset

Create a torch.IterableDataset for a BERT model.

This class is an iterator that yields an infinite stream of batches from the dataset. To construct a batch, we randomly sample a few examples of similar length, pad all selected examples to the same length L, and assemble them into a tuple of 4 or 5 tensors. All tensors are on the CPU.

Each example starts with [CLS], and ends with [SEP]. If there are two parts in the input, the two parts are separated by [SEP].

__iter__(self):

Yields

A tuple of tensors (the last item may be a list).

  • The first tensor is an int tensor of size (batch_size, L), representing word ids. Each row of this tensor corresponds to one example in the dataset. If masked_lm == True, the tensor stores the masked text.

  • The second tensor is an int tensor of size (batch_size, L), marking which positions contain text. Each entry is 1 if the corresponding position is text, and 0 if the position is padding.

  • The third tensor is an int tensor of size (batch_size, L), representing the token type: 0 if the position is in the first part of the input text, 1 if it is in the second part, and 0 for padding positions.

  • The fourth tensor is an int tensor of size (batch_size,), representing the classification label.

  • (optional) If masked_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry in this tensor is either -100 if the position is not masked, or the correct word id if the position is masked. Note that a masked position is not always a [MASK] token in the first tensor: with 80% probability it is [MASK], with 10% probability it is the original word, and with 10% probability it is a random word.

  • (optional) If autoregressive_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry is -100 if the corresponding position is [CLS] or [PAD], and the original word id otherwise.

  • (optional) If include_raw_text == True, the last item is a list of str.

Initialize.

Parameters
  • dataset (dict) – a dataset dict.

  • model_init (str) – the pre-trained model name. Select from ['bert-base-cased', 'bert-base-uncased', 'bert-large-cased', 'bert-large-uncased'].

  • batch_size (int) – the batch size in each step.

  • exclude (int) – exclude one category from the data. Use -1 (default) to include all categories.

  • masked_lm (bool) – whether to randomly replace words with mask tokens.

  • masked_lm_ratio (float) – the ratio of random masks. Ignored when masked_lm is False.

  • select_field (None or str) – select one field; use None to include all available fields.
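
A minimal usage sketch. It assumes the documented parameters are accepted as keyword arguments, and uses the demo dataset only for illustration:

    from fibber.datasets import DatasetForTransformers, get_demo_dataset

    trainset, testset = get_demo_dataset()

    data = DatasetForTransformers(
        trainset,
        model_init="bert-base-cased",
        batch_size=32,
        masked_lm=True,
        masked_lm_ratio=0.15)

    # The iterator yields batches forever; take one batch and stop.
    # With masked_lm == True, the batch is a tuple of 5 tensors.
    word_ids, text_mask, token_types, labels, lm_labels = next(iter(data))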

reinforce_type(expected_type)

Reinforce the type for this DataPipe instance. expected_type must be a subtype of the original type hint, restricting the type requirement of the DataPipe instance.

fibber.datasets.clip_sentence(dataset, model_init, max_len)[source]

Clip the sentences in the dataset, in place, to at most max_len tokens.
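
A hedged usage example, assuming max_len counts tokens under the model_init tokenizer:

    from fibber.datasets import clip_sentence, get_demo_dataset

    trainset, testset = get_demo_dataset()
    # Truncate every sentence in trainset in place; 128 is an arbitrary cap.
    clip_sentence(trainset, model_init="bert-base-cased", max_len=128)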

fibber.datasets.get_dataset(dataset_name)[source]

Load a dataset from the fibber root directory.

Users should make sure the data is downloaded to the datasets folder in the fibber root directory (default: ~/.fibber/datasets). Otherwise, an assertion error is raised.

Parameters

dataset_name (str) – the name of the dataset. See https://dai-lab.github.io/fibber/ for a full list of built-in datasets.

Returns

a tuple of two dicts, representing the training set and the test set, respectively.

Return type

(dict, dict)
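
For example (the dataset name "ag" and the "data" key are illustrative assumptions; see the link above for the actual list of built-in datasets):

    from fibber.datasets import get_dataset

    # "ag" is an illustrative name; pick any built-in dataset that has been
    # downloaded to ~/.fibber/datasets.
    trainset, testset = get_dataset("ag")
    print(len(trainset["data"]), len(testset["data"]))  # "data" key assumed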

fibber.datasets.get_demo_dataset()[source]

Download the demo dataset.

Returns

trainset and testset.

Return type

(dict, dict)

fibber.datasets.subsample_dataset(dataset, n, offset=0)[source]

Sub-sample a dataset to n examples.

Examples are selected evenly from each category, in a deterministic pseudo-random order: within each category, examples are sorted by their md5 hash values, and the top (n // k) examples from each category are included in the sub-sampled dataset, where k is the number of categories.

If n is not divisible by k, one additional example is taken from each of the first (n % k) categories.

If the dataset has fewer than n examples, a copy of the original dataset is returned.
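
The selection strategy can be summarized with a short sketch. This is an illustration of the documented behavior, not the library's implementation; the "label_mapping" and "data" keys, the hashing of the serialized example, and the interpretation of offset as a per-category starting index are all assumptions:

    import hashlib
    import json

    def subsample_sketch(dataset, n, offset=0):
        k = len(dataset["label_mapping"])   # number of categories (assumed key)
        buckets = {label: [] for label in range(k)}
        for example in dataset["data"]:     # assumed key
            buckets[example["label"]].append(example)

        selected = []
        for label in range(k):
            # Deterministic pseudo-random order: sort by the md5 hash of
            # the serialized example (the library may hash a specific field).
            bucket = sorted(
                buckets[label],
                key=lambda ex: hashlib.md5(
                    json.dumps(ex, sort_keys=True).encode("utf-8")).hexdigest())
            # n // k per category, one extra for the first n % k categories.
            quota = n // k + (1 if label < n % k else 0)
            selected.extend(bucket[offset:offset + quota])

        subsampled = dict(dataset)
        subsampled["data"] = selected
        return subsampled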

Parameters
  • dataset (dict) – a dataset dict.

  • n (int) – the size of the sub-sampled dataset.

  • offset (int) – dataset offset.

Returns

a sub-sampled dataset as a dict.

Return type

(dict)

fibber.datasets.verify_dataset(dataset)[source]

Verify that a dataset dict contains the necessary fields.

An assertion error is raised if fields are missing or incorrect.

Parameters

dataset (dict) – a dataset dict.
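
For orientation, a minimal sketch of the kind of checks implied here. The required fields ("label_mapping", "cased", "paraphrase_field", "data", and per-example "label" plus the paraphrase field) are assumptions about the dataset dict layout, not a verbatim copy of the library's checks:

    def verify_dataset_sketch(dataset):
        # Top-level fields (assumed layout of a fibber dataset dict).
        for field in ["label_mapping", "cased", "paraphrase_field", "data"]:
            assert field in dataset, "missing field: %s" % field

        num_labels = len(dataset["label_mapping"])
        for example in dataset["data"]:
            # Each example needs a valid label and the field to paraphrase.
            assert 0 <= example["label"] < num_labels
            assert dataset["paraphrase_field"] in example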