fibber.datasets.dataset_utils module¶
This module provides utility functions and classes to handle fibber’s datasets.
To load a dataset, use the get_dataset function. For example, to load AG's News dataset, run:

trainset, testset = get_dataset("ag")
The trainset and testset are both dicts. Each dict looks like:

{
    "label_mapping": ["World", "Sports", "Business", "Sci/Tech"],
    "data": [
        {"label": 1, "text0": "Boston won the NBA championship in 2008."},
        {"label": 3, "text0": "Apple releases its latest cell phone."},
        ...
    ]
}
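As a minimal sketch of how this structure is consumed (using a toy dict in the same shape, not a real fibber dataset), the integer label in each record resolves to a category name through label_mapping:

```python
# Toy dataset dict in the shape described above (illustration only).
dataset = {
    "label_mapping": ["World", "Sports", "Business", "Sci/Tech"],
    "data": [
        {"label": 1, "text0": "Boston won the NBA championship in 2008."},
        {"label": 3, "text0": "Apple releases its latest cell phone."},
    ],
}

def label_name(example, label_mapping):
    """Resolve an example's integer label to its category name."""
    return label_mapping[example["label"]]

names = [label_name(ex, dataset["label_mapping"]) for ex in dataset["data"]]
# names is ["Sports", "Sci/Tech"]
```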
To sub-sample 100 examples from the training set, run:
subsampled_dataset = subsample_dataset(trainset, 100)
To convert a dataset dict to a torch.IterableDataset for a BERT model, run:

iterable_dataset = DatasetForTransformers(trainset, "bert-base-cased", batch_size=32)
For more details, see https://dai-lab.github.io/fibber/
- class fibber.datasets.dataset_utils.DatasetForTransformers(*args, **kwds)[source]¶
Bases: torch.utils.data.dataset.IterableDataset

Create a torch.IterableDataset for a BERT model.

The module is an iterator that yields infinite batches from the dataset. To construct a batch, we randomly sample a few examples with similar lengths, then pad all selected examples to the same length L. We then construct a tuple of 4 or 5 tensors. All tensors are on CPU.

Each example starts with [CLS] and ends with [SEP]. If the input has two parts, the two parts are separated by [SEP].

__iter__(self):
- Yields
A tuple of tensors (or tensors and a list).

The first tensor is an int tensor of size (batch_size, L), representing word ids. Each row of this tensor corresponds to one example in the dataset. If masked_lm == True, the tensor stores the masked text.

The second tensor is an int tensor of size (batch_size, L), representing the text mask: each entry is 1 if the corresponding position is text, and 0 if the position is padding.

The third tensor is an int tensor of size (batch_size, L), representing the token type. The token type is 0 if the current position is in the first part of the input text, and 1 if it is in the second part. For padding positions, the token type is 0.

The fourth tensor is an int tensor of size (batch_size,), representing the classification label.

(optional) If masked_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry is -100 if the position is not masked, or the correct word id if the position is masked. Note that a masked position does not always hold a [MASK] token in the first tensor: with 80% probability it is [MASK], with 10% probability it is the original word, and with 10% probability it is a random word.

(optional) If autoregressive_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry is -100 if the position is [CLS] or [PAD], and the word id at that position otherwise.

(optional) If include_raw_text == True, the last item is a list of str.
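The 80/10/10 masking rule described above can be sketched in plain Python (no torch or transformers dependency; MASK_ID and VOCAB_SIZE are made-up placeholders for illustration, not fibber's actual values):

```python
import random

MASK_ID = 103      # placeholder [MASK] token id (assumption for this sketch)
VOCAB_SIZE = 1000  # placeholder vocabulary size (assumption for this sketch)

def mask_tokens(token_ids, ratio, rng):
    """Return (masked_ids, labels) under the 80/10/10 masking scheme.

    labels[i] is -100 for unmasked positions, and the original word id for
    masked positions, regardless of what replaces the word in masked_ids.
    """
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= ratio:
            continue                  # position stays unmasked
        labels[i] = tok               # remember the correct word id
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK_ID       # 80%: replace with [MASK]
        elif r < 0.9:
            pass                      # 10%: keep the original word
        else:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random word
    return masked, labels
```

A label tensor built this way is exactly what the fifth tensor's description calls for: -100 everywhere except masked positions, which carry the original word id.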
Initialize.
- Parameters
dataset (dict) – a dataset dict.
model_init (str) – the pre-trained model name. Select from ['bert-base-cased', 'bert-base-uncased', 'bert-large-cased', 'bert-large-uncased'].
batch_size (int) – the batch size in each step.
exclude (int) – exclude one category from the data. Use -1 (default) to include all categories.
masked_lm (bool) – whether to randomly replace words with mask tokens.
masked_lm_ratio (float) – the ratio of random masks. Ignored when masked_lm is False.
select_field (None or str) – select one field. None to use all available fields.
- reinforce_type(expected_type)¶
Reinforce the type of the DataPipe instance. expected_type is required to be a subtype of the original type hint, restricting the type requirement of the DataPipe instance.
- fibber.datasets.dataset_utils.clip_sentence(dataset, model_init, max_len)[source]¶
Clip the sentences in a dataset, in place, to at most max_len tokens.
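The in-place clipping can be sketched as follows, with whitespace tokenization standing in for the subword tokenizer selected by model_init (the stand-in tokenizer and the helper name are assumptions of this sketch, not fibber's implementation):

```python
def clip_sentence_sketch(dataset, max_len):
    """Clip every text field of every example, in place, to max_len tokens.

    Whitespace tokenization is a stand-in for the real subword tokenizer
    chosen by model_init in fibber's clip_sentence.
    """
    for example in dataset["data"]:
        for key in example:
            if key.startswith("text"):
                tokens = example[key].split()
                example[key] = " ".join(tokens[:max_len])
    return dataset

toy = {"data": [{"label": 0, "text0": "a b c d e f"}]}
clip_sentence_sketch(toy, 3)
# toy["data"][0]["text0"] is now "a b c"
```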
- fibber.datasets.dataset_utils.get_dataset(dataset_name)[source]¶
Load a dataset from the fibber root directory.

Users should make sure the data is downloaded to the datasets folder in the fibber root dir (default: ~/.fibber/datasets). Otherwise, an assertion error is raised.

- Parameters
dataset_name (str) – the name of the dataset. See https://dai-lab.github.io/fibber/ for a full list of built-in datasets.
- Returns
a tuple of two dicts, representing the training set and the test set respectively.
- Return type
(dict, dict)
- fibber.datasets.dataset_utils.get_demo_dataset()[source]¶
Download the demo dataset.
- Returns
trainset and testset.
- Return type
(dict, dict)
- fibber.datasets.dataset_utils.subsample_dataset(dataset, n, offset=0)[source]¶
Sub-sample a dataset to n examples.

Data is selected evenly and randomly from each category. Data in each category is sorted by its md5 hash value. The top (n // k) examples from each category are included in the sub-sampled dataset, where k is the number of categories.

If n is not divisible by k, one more example is sampled from each of the first (n % k) categories.

If the dataset has fewer than n examples, a copy of the original dataset is returned.

- Parameters
dataset (dict) – a dataset dict.
n (int) – the size of the sub-sampled dataset.
offset (int) – dataset offset.
- Returns
a sub-sampled dataset as a dict.
- Return type
(dict)
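The selection rule described above can be sketched in a few lines (ignoring the offset parameter; the function name and the use of text0 as the hashed field are assumptions of this sketch, not fibber's actual code):

```python
import hashlib

def subsample_sketch(dataset, n):
    """Sketch of the described sub-sampling rule.

    Group examples by label, sort each group by the md5 hash of its text,
    keep the top (n // k) per category, and take one extra example from
    each of the first (n % k) categories.
    """
    if len(dataset["data"]) < n:
        return dict(dataset)  # fewer than n examples: return a copy

    groups = {}
    for example in dataset["data"]:
        groups.setdefault(example["label"], []).append(example)

    k = len(groups)
    selected = []
    for idx, label in enumerate(sorted(groups)):
        quota = n // k + (1 if idx < n % k else 0)
        ordered = sorted(
            groups[label],
            key=lambda ex: hashlib.md5(ex["text0"].encode()).hexdigest(),
        )
        selected.extend(ordered[:quota])

    return {"label_mapping": dataset.get("label_mapping", []), "data": selected}

toy = {
    "label_mapping": ["a", "b"],
    "data": [{"label": i % 2, "text0": f"example {i}"} for i in range(8)],
}
out = subsample_sketch(toy, 5)  # 5 = 2*2 + 1, so category 0 contributes 3
```

With k = 2 categories and n = 5, each category contributes n // k = 2 examples, and the first category contributes one extra, matching the (n % k) rule.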