fibber.paraphrase_strategies.asrs_utils_wpe module

class fibber.paraphrase_strategies.asrs_utils_wpe.WordPieceDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

fibber.paraphrase_strategies.asrs_utils_wpe.get_wordpiece_emb(dataset_name, trainset, tokenizer, device, steps=5000, bs=1000, lr=1, lr_halve_steps=1000)[source]

Transfer GloVe embeddings to BERT vocabulary.

The transferred embeddings will be stored at ~/.fibber/wordpiece_emb_conterfited/wordpiece_emb_<dataset>_<steps>.pt.

Parameters
  • dataset_name (str) – dataset name.

  • trainset (dict) – the dataset dict.

  • tokenizer (transformers.PreTrainedTokenizer) – the tokenizer that specifies wordpieces.

  • device (torch.device) – the device to train the model on.

  • steps (int) – number of transfer steps.

  • bs (int) – transfer batch size.

  • lr (float) – transfer learning rate.

  • lr_halve_steps (int) – number of steps between each halving of the learning rate.

Returns

an array of size (300, N), where N is the vocabulary size of a bert-base model.

Return type

np.ndarray
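The idea behind the transfer can be sketched as follows: each word's GloVe vector is modeled as a combination (here, the mean) of the embeddings of its wordpieces, and the wordpiece embedding matrix is fit by gradient descent with a periodically halved learning rate. The toy vocabulary, segmentation, and 8-dimensional vectors below are hypothetical stand-ins for the real GloVe vectors and the HuggingFace tokenizer; this is a minimal illustration of the optimization, not the fibber implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "GloVe" vectors for whole words (dimension 8 instead of 300).
dim = 8
word_vecs = {"playing": rng.normal(size=dim), "played": rng.normal(size=dim)}

# Hypothetical wordpiece segmentation (the real code queries the tokenizer).
pieces = ["play", "##ing", "##ed"]
segmentation = {"playing": ["play", "##ing"], "played": ["play", "##ed"]}
piece_idx = {p: i for i, p in enumerate(pieces)}

# Wordpiece embedding matrix to be learned, shape (dim, vocab_size).
emb = rng.normal(scale=0.1, size=(dim, len(pieces)))

lr, lr_halve_steps = 1.0, 200
for step in range(1, 601):
    grad = np.zeros_like(emb)
    loss = 0.0
    for word, vec in word_vecs.items():
        idx = [piece_idx[p] for p in segmentation[word]]
        pred = emb[:, idx].mean(axis=1)  # word vector ~ mean of its pieces
        err = pred - vec
        loss += float((err ** 2).sum())
        for i in idx:
            grad[:, i] += err / len(idx)  # gradient of the squared error
    emb -= lr * grad
    if step % lr_halve_steps == 0:
        lr /= 2  # mirrors the lr_halve_steps parameter above
```

After training, the mean of the piece embeddings for "playing" reconstructs its word vector, and pieces shared between words (here "play") receive a single consistent embedding.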