fibber.metrics.bert_lm_utils module¶
fibber.metrics.bert_lm_utils.compute_lm_loss(lm_model, seq, mask, tok_type, lm_label, stats)[source]¶
Compute masked language model training loss.
- Parameters
lm_model (transformers.BertForMaskedLM) – a BERT language model.
seq (torch.Tensor) – an int tensor of size (batch_size, length) representing the word pieces.
mask (torch.Tensor) – an int tensor of size (batch_size, length) representing the attention mask.
tok_type (torch.Tensor) – an int tensor of size (batch_size, length) representing the token type id.
lm_label (torch.Tensor) – an int tensor of size (batch_size, length) representing the label for each position. Use -100 if the loss is not computed for that position.
stats (dict) – a dictionary storing training stats.
- Returns
(torch.Scalar) a scalar loss value.
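Example (a minimal, hypothetical sketch; it assumes compute_lm_loss wraps the standard transformers masked-LM cross-entropy, and the keys written into stats are not documented here):

    import torch
    from transformers import BertForMaskedLM, BertTokenizerFast
    from fibber.metrics.bert_lm_utils import compute_lm_loss

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    lm_model = BertForMaskedLM.from_pretrained("bert-base-cased")

    batch = tokenizer(["The quick brown fox jumps over the lazy dog."],
                      return_tensors="pt")
    seq = batch["input_ids"].clone()
    mask = batch["attention_mask"]
    tok_type = batch["token_type_ids"]

    # Label every position with -100 (ignored), except one masked position.
    lm_label = torch.full_like(seq, -100)
    lm_label[0, 4] = seq[0, 4]           # the original token becomes the target
    seq[0, 4] = tokenizer.mask_token_id  # replace it with [MASK]

    stats = {}  # training statistics are accumulated here by the call
    loss = compute_lm_loss(lm_model, seq, mask, tok_type, lm_label, stats)
    loss.backward()  # usable as a regular training loss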
fibber.metrics.bert_lm_utils.fine_tune_lm(output_dir, trainset, filter, device, model_init='bert-base-cased', lm_steps=5000, lm_bs=32, lm_opt='adamw', lm_lr=0.0001, lm_decay=0.01, lm_period_summary=100, lm_period_save=5000, as_masked_lm=True, select_field=None)[source]¶
Returns a finetuned BERT language model on a given dataset.
The language model will be stored at <output_dir>/lm_all if filter is -1, or at <output_dir>/lm_filter_? if filter is not -1. If filter is not -1, the pretrained language model will first be pretrained on the whole dataset, then finetuned on the data excluding the filter category.
- Parameters
output_dir (str) – a directory to store the pretrained language model.
trainset (DatasetForTransformers) – the training set for finetuning the language model.
filter (int) – a category to exclude from finetuning.
device (torch.Device) – a device to train the model.
model_init (str) – the backbone bert model.
lm_steps (int) – finetuning steps.
lm_bs (int) – finetuning batch size.
lm_opt (str) – optimizer name. Choose from [“sgd”, “adam”, “adamW”].
lm_lr (float) – learning rate.
lm_decay (float) – weight decay for the optimizer.
lm_period_summary (int) – number of steps to write training summary.
lm_period_save (int) – number of steps to save the finetuned model.
as_masked_lm (bool) – use BERT as a masked language model. If False, use BERT as an auto-regressive language model.
select_field (None or str) – select one field for the language model.
- Returns
a finetuned language model.
- Return type
(BertForMaskedLM)
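Example (an illustrative call, not a verified recipe; my_trainset stands in for a dataset in fibber's format, and the output path and hyperparameters are placeholders):

    import torch
    from fibber.metrics.bert_lm_utils import fine_tune_lm

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # filter=-1 fine-tunes on the whole dataset; per the docs above, the
    # model is then stored under exp/lm_demo/lm_all.
    lm_model = fine_tune_lm(
        output_dir="exp/lm_demo",  # placeholder path
        trainset=my_trainset,      # assumed DatasetForTransformers instance
        filter=-1,
        device=device,
        lm_steps=5000,
        lm_bs=32,
    )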
fibber.metrics.bert_lm_utils.get_lm(lm_option, dataset_name, trainset, device, model_init='bert-base-cased', filter=-1, lm_steps=5000, lm_bs=32, lm_opt='adamw', lm_lr=0.0001, lm_decay=0.01, lm_period_summary=100, lm_period_save=5000, select_field=None)[source]¶
Returns a BERT language model or a list of language models on a given dataset.
The language model will be stored at <output_dir>/lm_all if lm_option is finetune, or at <output_dir>/lm_filter_? if lm_option is adv. If filter is not -1, the pretrained language model will first be pretrained on the whole dataset, then finetuned on the data excluding the filter category.
- Parameters
lm_option (str) – choose from [“pretrain”, “finetune”, “adv”]. “pretrain” means the pretrained BERT model without fine-tuning on the current dataset. “finetune” means fine-tuning the BERT model on the current dataset. “adv” means adversarial tuning on the current dataset.
dataset_name (str) – the name of the dataset; used to determine the directory where the language model is stored.
trainset (dict) – the training set for finetuning the language model.
device (torch.Device) – a device to train the model.
model_init (str) – the backbone bert model.
lm_steps (int) – finetuning steps.
lm_bs (int) – finetuning batch size.
lm_opt (str) – optimizer name. Choose from [“sgd”, “adam”, “adamW”].
lm_lr (float) – learning rate.
lm_decay (float) – weight decay for the optimizer.
lm_period_summary (int) – number of steps to write training summary.
lm_period_save (int) – number of steps to save the finetuned model.
select_field (str or None) – train the language model on one specific field.
- Returns
the tokenizer for the language model. (BertForMaskedLM): a finetuned language model if lm_option is pretrain or finetune. ([BertForMaskedLM]): a list of finetuned language models if lm_option is adv. The i-th language model in the list is fine-tuned on data not having label i.
- Return type
(BertTokenizerFast)
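Example (a hedged sketch based on the Returns description above, which suggests a (tokenizer, model-or-list) tuple; "demo_dataset" and trainset are placeholders):

    import torch
    from fibber.metrics.bert_lm_utils import get_lm

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # "finetune": one model, fine-tuned on the whole dataset.
    tokenizer, lm = get_lm("finetune", "demo_dataset", trainset, device)

    # "adv": assuming the return is a list of models, lm_list[i] is
    # fine-tuned on the data that does not have label i.
    tokenizer, lm_list = get_lm("adv", "demo_dataset", trainset, device)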