fibber.datasets package¶
Submodules¶
Module contents¶
class fibber.datasets.DatasetForTransformers(*args, **kwds)[source]¶
Bases: torch.utils.data.dataset.IterableDataset
Create a torch.IterableDataset for a BERT model.

This class is an iterator that yields infinite batches from the dataset. To construct a batch, we randomly sample a few examples of similar length, pad all selected examples to the same length L, and build a tuple of 4 or 5 tensors. All tensors are on CPU.

Each example starts with [CLS] and ends with [SEP]. If the input has two parts, the parts are separated by [SEP].

__iter__(self):
- Yields
A tuple of tensors (or a list).

The first tensor is an int tensor of size (batch_size, L), representing word ids. Each row of this tensor corresponds to one example in the dataset. If masked_lm == True, the tensor stores the masked text.

The second tensor is an int tensor of size (batch_size, L), representing the text length. Each entry is 1 if the corresponding position is text, and 0 if the position is padding.

The third tensor is an int tensor of size (batch_size, L), representing the token type. The token type is 0 if the current position is in the first part of the input text, and 1 if it is in the second part. For padding positions, the token type is 0.

The fourth tensor is an int tensor of size (batch_size,), representing the classification label.

(optional) If masked_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry is -100 if the position is not masked, or the correct word id if the position is masked. Note that a masked position does not always hold a [MASK] token in the first tensor: with 80% probability it is [MASK], with 10% probability it is the original word, and with 10% probability it is a random word.

(optional) If autoregressive_lm == True, the fifth tensor is a tensor of size (batch_size, L). Each entry is -100 if the position is [CLS] or [PAD], and the correct word id otherwise.

(optional) If include_raw_text == True, the last item is a list of str.
Initialize.

- Parameters
dataset (dict) – a dataset dict.
model_init (str) – the pre-trained model name. Select from ['bert-base-cased', 'bert-base-uncased', 'bert-large-cased', 'bert-large-uncased'].
batch_size (int) – the batch size in each step.
exclude (int) – exclude one category from the data. Use -1 (default) to include all categories.
masked_lm (bool) – whether to randomly replace words with mask tokens.
masked_lm_ratio (float) – the ratio of random masks. Ignored when masked_lm is False.
select_field (None or str) – select one field. Use None to use all available fields.
reinforce_type(expected_type)¶
Reinforce the type for the DataPipe instance. expected_type is required to be a subtype of the original type hint, to restrict the type requirement of the DataPipe instance.
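A minimal usage sketch (the keyword names follow the parameter list above; get_demo_dataset, documented below, supplies a compatible dataset dict):

    from fibber.datasets import DatasetForTransformers, get_demo_dataset

    trainset, testset = get_demo_dataset()
    data = DatasetForTransformers(
        trainset, model_init="bert-base-cased", batch_size=32,
        masked_lm=True, masked_lm_ratio=0.15)

    # The iterator yields infinite batches, so stop after a few steps.
    # With masked_lm=True each batch has 5 tensors, as described above.
    for step, (text_ids, text_mask, token_type, label, lm_label) in enumerate(data):
        print(text_ids.shape)  # (batch_size, L); L varies per batch
        if step >= 2:
            break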
fibber.datasets.get_dataset(dataset_name)[source]¶
Load a dataset from the fibber root directory.
Users should make sure the data is downloaded to the datasets folder in the fibber root directory (default: ~/.fibber/datasets). Otherwise, an assertion error is raised.

- Parameters
dataset_name (str) – the name of the dataset. See https://dai-lab.github.io/fibber/ for a full list of built-in datasets.
- Returns
a tuple of two dicts, representing the training set and the test set respectively.
- Return type
(dict, dict)
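For example (the dataset name "ag" is illustrative only; use any name from the built-in list):

    from fibber.datasets import get_dataset

    # The data must already be downloaded to ~/.fibber/datasets,
    # otherwise an assertion error is raised.
    trainset, testset = get_dataset("ag")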
fibber.datasets.get_demo_dataset()[source]¶
Download the demo dataset.
- Returns
trainset and testset.
- Return type
(dict, dict)
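Unlike get_dataset, no manual download step is needed:

    from fibber.datasets import get_demo_dataset

    trainset, testset = get_demo_dataset()  # fetches the demo data into the fibber root dir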
fibber.datasets.subsample_dataset(dataset, n, offset=0)[source]¶
Sub-sample a dataset to n examples.
Data is selected evenly and randomly from each category: data in each category is sorted by its md5 hash value, and the top (n // k) examples from each category are included in the sub-sampled dataset, where k is the number of categories.

If n is not divisible by k, one more example is sampled from each of the first (n % k) categories.

If the dataset has fewer than n examples, a copy of the original dataset is returned.

- Parameters
dataset (dict) – a dataset dict.
n (int) – the size of the sub-sampled dataset.
offset (int) – dataset offset.
- Returns
a sub-sampled dataset as a dict.
- Return type
(dict)
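A short sketch (the "data" key used below is an assumption about the dataset dict layout):

    from fibber.datasets import get_demo_dataset, subsample_dataset

    trainset, _ = get_demo_dataset()
    small = subsample_dataset(trainset, 100)
    print(len(small["data"]))  # expected: 100 ("data" key is assumed)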