fibber.paraphrase_strategies.asrs_utils_wpe module¶
- class fibber.paraphrase_strategies.asrs_utils_wpe.WordPieceDataset(*args, **kwds)[source]¶
  Bases: torch.utils.data.dataset.Dataset
- fibber.paraphrase_strategies.asrs_utils_wpe.get_wordpiece_emb(dataset_name, trainset, tokenizer, device, steps=5000, bs=1000, lr=1, lr_halve_steps=1000)[source]¶
  Transfer GloVe embeddings to the BERT wordpiece vocabulary.
  The transferred embeddings are stored at ~/.fibber/wordpiece_emb_conterfited/wordpiece_emb_<dataset>_<steps>.pt.
- Parameters
  - dataset_name (str) – the dataset name.
  - trainset (dict) – the dataset dict.
  - tokenizer (transformers.PreTrainedTokenizer) – the tokenizer that defines the wordpieces.
  - device (torch.device) – the device on which to train the model.
  - steps (int) – number of transfer steps.
  - bs (int) – transfer batch size.
  - lr (float) – transfer learning rate.
  - lr_halve_steps (int) – number of steps between halvings of the learning rate.
- Returns
  an array of size (300, N), where N is the vocabulary size of a bert-base model.
- Return type
  np.array
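Conceptually, transferring word embeddings to a wordpiece vocabulary amounts to fitting one vector per wordpiece so that, for each word, the vectors of its wordpieces combine to approximate that word's original embedding. The toy sketch below illustrates this idea with plain gradient descent on a three-word vocabulary; the tiny embedding table, the tokenization map, the sum-pooling of pieces, and all names are illustrative assumptions for exposition, not fibber's actual implementation.

```python
import numpy as np

# Hypothetical toy data (NOT fibber's): a tiny "GloVe" table and a
# tokenizer that splits each word into known wordpieces.
word_emb = {
    "playing": np.array([1.0, 0.0]),
    "played":  np.array([0.8, 0.2]),
    "play":    np.array([0.9, 0.1]),
}
tokenize = {
    "playing": ["play", "##ing"],
    "played":  ["play", "##ed"],
    "play":    ["play"],
}
vocab = ["play", "##ing", "##ed"]
idx = {wp: i for i, wp in enumerate(vocab)}

# Wordpiece embeddings to be learned, randomly initialized.
rng = np.random.default_rng(0)
wp_emb = rng.normal(scale=0.1, size=(len(vocab), 2))

def loss_and_grad(wp_emb):
    """Squared error between each word's embedding and the sum of its pieces."""
    grad = np.zeros_like(wp_emb)
    loss = 0.0
    for word, target in word_emb.items():
        pieces = [idx[p] for p in tokenize[word]]
        pred = wp_emb[pieces].sum(axis=0)  # word vector = sum of its pieces
        err = pred - target
        loss += (err ** 2).sum()
        for p in pieces:                   # each piece receives the word's error
            grad[p] += 2 * err
    return loss, grad

lr = 0.1
for step in range(500):
    loss, grad = loss_and_grad(wp_emb)
    wp_emb -= lr * grad                    # plain gradient descent step
```

After training, the learned vector for "play" lands near its GloVe vector, while "##ing" and "##ed" absorb the residual differences between the inflected forms. The real function works at a much larger scale (a 300-dimensional GloVe table, a full BERT vocabulary, minibatches, and learning-rate halving), but the fitting objective follows the same shape.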