fibber.metrics.bert_lm_utils module

fibber.metrics.bert_lm_utils.compute_lm_loss(lm_model, seq, mask, tok_type, lm_label, stats)[source]

Compute masked language model training loss.

Parameters
  • lm_model (transformers.BertForMaskedLM) – a BERT language model.

  • seq (torch.Tensor) – an int tensor of size (batch_size, length) representing the word pieces.

  • mask (torch.Tensor) – an int tensor of size (batch_size, length) representing the attention mask.

  • tok_type (torch.Tensor) – an int tensor of size (batch_size, length) representing the token type id.

  • lm_label (torch.Tensor) – an int tensor of size (batch_size, length) representing the label for each position. Use -100 if the loss is not computed for that position.

  • stats (dict) – a dictionary storing training stats.

Returns

(torch.Tensor) a scalar loss value.
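
For illustration, here is a minimal sketch of how compute_lm_loss could be called. The tokenizer and model names, the random masking scheme, and the 15% masking fraction are illustrative assumptions, not part of this module.

    import torch
    from transformers import BertForMaskedLM, BertTokenizerFast

    from fibber.metrics.bert_lm_utils import compute_lm_loss, new_stats

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    lm_model = BertForMaskedLM.from_pretrained("bert-base-cased")

    batch = tokenizer(["a toy sentence", "another toy sentence"],
                      padding=True, return_tensors="pt")
    seq = batch["input_ids"].clone()
    mask = batch["attention_mask"]
    tok_type = batch["token_type_ids"]

    # Randomly mask ~15% of the non-padding positions (an illustrative choice).
    # Positions that are not masked get label -100, so they are excluded from the loss.
    lm_label = torch.full_like(seq, -100)
    masked = (torch.rand(seq.shape) < 0.15) & mask.bool()
    lm_label[masked] = seq[masked]
    seq[masked] = tokenizer.mask_token_id

    stats = new_stats()
    loss = compute_lm_loss(lm_model, seq, mask, tok_type, lm_label, stats)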

fibber.metrics.bert_lm_utils.fine_tune_lm(output_dir, trainset, filter, device, model_init='bert-base-cased', lm_steps=5000, lm_bs=32, lm_opt='adamw', lm_lr=0.0001, lm_decay=0.01, lm_period_summary=100, lm_period_save=5000, as_masked_lm=True, select_field=None)[source]

Returns a BERT language model finetuned on a given dataset.

The language model will be stored at <output_dir>/lm_all if filter is -1, or <output_dir>/lm_filter_? if filter is not -1.

If filter is not -1, the pretrained language model will first be trained on the whole dataset, then finetuned on the data excluding the filter category.

Parameters
  • output_dir (str) – a directory to store pretrained language model.

  • trainset (DatasetForTransformers) – the training set for finetuning the language model.

  • filter (int) – a category to exclude from finetuning.

  • device (torch.device) – a device to train the model.

  • model_init (str) – the backbone BERT model.

  • lm_steps (int) – finetuning steps.

  • lm_bs (int) – finetuning batch size.

  • lm_opt (str) – optimizer name. Choose from [“sgd”, “adam”, “adamW”].

  • lm_lr (float) – learning rate.

  • lm_decay (float) – weight decay for the optimizer.

  • lm_period_summary (int) – number of steps to write training summary.

  • lm_period_save (int) – number of steps to save the finetuned model.

  • as_masked_lm (bool) – use BERT as a masked language model. If False, use it as an autoregressive language model.

  • select_field (None or str) – select one field to train the language model on.

Returns

a finetuned language model.

Return type

(BertForMaskedLM)
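
A hedged usage sketch follows. The output directory name is illustrative, and trainset is assumed to be a DatasetForTransformers prepared elsewhere (its construction is not part of this module).

    import torch

    from fibber.metrics.bert_lm_utils import fine_tune_lm

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # `trainset` must be a DatasetForTransformers built beforehand (placeholder here).
    lm_model = fine_tune_lm(
        output_dir="exp-demo",         # illustrative path; the model is saved under exp-demo/lm_all
        trainset=trainset,
        filter=-1,                     # -1: fine-tune on the whole dataset
        device=device,
        model_init="bert-base-cased",
        lm_steps=5000)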

fibber.metrics.bert_lm_utils.get_lm(lm_option, dataset_name, trainset, device, model_init='bert-base-cased', filter=-1, lm_steps=5000, lm_bs=32, lm_opt='adamw', lm_lr=0.0001, lm_decay=0.01, lm_period_summary=100, lm_period_save=5000, select_field=None)[source]

Returns a BERT language model or a list of language models on a given dataset.

The language model will be stored at <output_dir>/lm_all if lm_option is finetune, or at <output_dir>/lm_filter_? if lm_option is adv.

If filter is not -1, the pretrained language model will first be trained on the whole dataset, then finetuned on the data excluding the filter category.

Parameters
  • lm_option (str) – choose from [“pretrain”, “finetune”, “adv”]. “pretrain” means the pretrained BERT model without fine-tuning on the current dataset. “finetune” means fine-tuning the BERT model on the current dataset. “adv” means adversarial tuning on the current dataset.

  • dataset_name (str) – the name of the dataset, which determines the directory where the pretrained language model is stored.

  • trainset (dict) – the training set for finetuning the language model.

  • device (torch.device) – a device to train the model.

  • model_init (str) – the backbone BERT model.

  • lm_steps (int) – finetuning steps.

  • lm_bs (int) – finetuning batch size.

  • lm_opt (str) – optimizer name. Choose from [“sgd”, “adam”, “adamW”].

  • lm_lr (float) – learning rate.

  • lm_decay (float) – weight decay for the optimizer.

  • lm_period_summary (int) – number of steps to write training summary.

  • lm_period_save (int) – number of steps to save the finetuned model.

  • select_field (str or None) – train the language model on one specific field.

Returns

the tokenizer for the language model, together with a finetuned language model if lm_option is pretrain or finetune, or a list of finetuned language models if lm_option is adv. The i-th language model in the list is fine-tuned on data not having label i.

Return type

(BertTokenizerFast, BertForMaskedLM or [BertForMaskedLM])
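
A hedged usage sketch follows. The dataset loading call (fibber.datasets.get_dataset) and the dataset name "ag_news" are assumptions used only for illustration.

    import torch

    from fibber.datasets import get_dataset
    from fibber.metrics.bert_lm_utils import get_lm

    trainset, testset = get_dataset("ag_news")   # assumed dataset loader and name
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # "finetune": a single language model fine-tuned on the whole training set.
    tokenizer, lm = get_lm("finetune", "ag_news", trainset, device, lm_steps=5000)

    # "adv": one language model per label; lms[i] is fine-tuned on data not having label i.
    tokenizer, lms = get_lm("adv", "ag_news", trainset, device)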

fibber.metrics.bert_lm_utils.new_stats()[source]

Create a new stats dict.

fibber.metrics.bert_lm_utils.write_summary(stats, summary, global_step)[source]

Save language model training summary.
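
A hedged sketch of how new_stats and write_summary might fit into a training loop. The use of torch.utils.tensorboard.SummaryWriter and the summary period are assumptions.

    from torch.utils.tensorboard import SummaryWriter

    from fibber.metrics.bert_lm_utils import new_stats, write_summary

    summary = SummaryWriter("exp-demo/summary")   # illustrative log directory
    stats = new_stats()

    for global_step in range(1, 1001):
        # ... one training step that calls
        # compute_lm_loss(lm_model, seq, mask, tok_type, lm_label, stats) ...
        if global_step % 100 == 0:
            write_summary(stats, summary, global_step)
            stats = new_stats()   # start a fresh stats dict for the next period (assumption)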