tkitAutoMask package

Submodules

tkitAutoMask.attention_mask module

tkitAutoMask.mask module

class tkitAutoMask.mask.autoMask(mask_prob=0.15, replace_prob=0.9, num_tokens=None, random_token_prob=0.0, mask_token_id=103, pad_token_id=-100, mask_ignore_token_ids=[], probabilitis=[0.9, 0.05, 0.05])[source]

Bases: torch.nn.modules.module.Module

Dynamically masks data for masked language modeling.

Example

>>> from transformers import BertTokenizer
>>> from tkitAutoMask.mask import autoMask
>>> tokenizer = BertTokenizer.from_pretrained("uer/chinese_roberta_L-2_H-128")
>>> tomask = autoMask(
...     mask_token_id = tokenizer.mask_token_id,  # the token id reserved for masking
...     pad_token_id = tokenizer.pad_token_id,    # the token id for padding
...     mask_prob = 0.05,                         # masking probability for masked language modeling
...     replace_prob = 0.90,                      # ~10% probability that a token will not be replaced by [MASK] but is still included in the loss, as detailed in the paper
...     mask_ignore_token_ids = [tokenizer.cls_token_id, tokenizer.sep_token_id]  # other tokens to exclude from masking, e.g. [CLS] and [SEP]
... )

The defaults for pad_token_id and mask_token_id follow https://huggingface.co/uer/chinese_roberta_L-2_H-128/blob/main/vocab.txt; override them when using a different vocabulary.

forward(input, indices=False, **kwargs)[source]

indices: whether to also return the indices of the masked positions

training: bool
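The masking procedure sketched by the constructor arguments above can be illustrated in pure Python. This is an illustrative sketch only, not the package's tensor implementation: the function name `auto_mask_sketch` and the list-based types are assumptions for clarity, while the parameter names mirror the documented `autoMask` signature.

```python
import random

def auto_mask_sketch(input_ids, mask_prob=0.15, replace_prob=0.9,
                     mask_token_id=103, pad_token_id=-100,
                     ignore_ids=(101, 102), seed=0):
    """Sketch of dynamic MLM masking over a list of token ids.

    Returns (masked_ids, labels): labels hold the original token at
    masked positions and pad_token_id elsewhere, so the loss only
    covers masked tokens.
    """
    rng = random.Random(seed)
    masked_ids = list(input_ids)
    labels = [pad_token_id] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in ignore_ids:
            continue  # never mask special tokens like [CLS]/[SEP]
        if rng.random() < mask_prob:
            labels[i] = tok  # include this position in the loss
            if rng.random() < replace_prob:
                masked_ids[i] = mask_token_id  # replace with [MASK]
            # otherwise keep the original token but still predict it
    return masked_ids, labels
```

With `mask_prob=1.0` and `replace_prob=1.0`, every non-special token is replaced by the mask token, and the labels keep the originals for the loss.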
tkitAutoMask.mask.get_mask_subset_with_prob(mask, prob)[source]
tkitAutoMask.mask.get_mask_subset_with_prob_diagonal(mask, prob, subset_with_prob=True)[source]

Optimized version: diagonal-matrix masking. Successive iterations apply random masks automatically, producing runs of consecutive masked tokens; used for predicting span content.

# Adds proportional masking via subset_with_prob
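The effect this diagonal variant aims for, masking several consecutive positions as one span, can be sketched in pure Python. This is a hedged illustration under assumptions (the name `span_mask_sketch` and the boolean-list output are invented here; the real helper operates on a tensor mask):

```python
import random

def span_mask_sketch(length, prob, seed=0):
    """Mask a contiguous span covering roughly prob * length positions,
    with the span's start chosen at random."""
    rng = random.Random(seed)
    span_len = max(1, round(length * prob))
    start = rng.randrange(0, length - span_len + 1)
    return [start <= i < start + span_len for i in range(length)]
```

Because the masked positions form one run, the model must predict the span's content rather than isolated tokens.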

tkitAutoMask.mask.get_mask_subset_with_prob_tri(mask, prob, subset_with_prob=True)[source]

Optimized version: triangular dynamic masking. A portion of the masked positions is dropped, and the upper or lower triangle is selected at random.

# Adds proportional masking via subset_with_prob

tkitAutoMask.mask.mask_with_tokens(t, token_ids)[source]

Builds a mask marking the positions whose tokens appear in token_ids.
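The logic of marking positions whose tokens appear in a given id set can be sketched in pure Python (the real helper works on tensors; the name `mask_with_tokens_sketch` and list types here are assumptions):

```python
def mask_with_tokens_sketch(t, token_ids):
    """Return a boolean list: True wherever an element of t is one of
    token_ids, e.g. to flag special tokens that must never be masked."""
    token_ids = set(token_ids)  # O(1) membership tests
    return [tok in token_ids for tok in t]
```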

tkitAutoMask.mask.prob_mask_like(t, prob)[source]
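A helper of this shape typically draws an independent Bernoulli sample per position. A pure-Python sketch under that assumption (the tensor version would sample uniforms in one shot; `prob_mask_like_sketch` is an invented name):

```python
import random

def prob_mask_like_sketch(t, prob, seed=None):
    """Return a boolean mask shaped like t where each position is True
    independently with probability prob."""
    rng = random.Random(seed)
    return [rng.random() < prob for _ in t]
```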

tkitAutoMask.spanmask module

Module contents