tkitAutoTokenizerPosition package
Submodules
tkitAutoTokenizerPosition.AutoPos module
tkitAutoTokenizerPosition.AutoTokenizerPosition module
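- class tkitAutoTokenizerPosition.AutoTokenizerPosition.AutoTokenizerPosition(tokenizer)
Bases:
object
Processes NER data in which only the keywords are given, locating their start positions.
Either of the following tokenizers can be passed in; using one of these avoids unnecessary trouble:
tokenizer = BertTokenizer.from_pretrained("clue/albert_chinese_tiny")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
## Installation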
> pip install tkitAutoTokenizerPosition
# or
> pip install git+https://github.com/napoler/tkit-AutoTokenizerPosition
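The class description above is about finding where a keyword sits once the text has been tokenized. The sketch below illustrates that idea on plain token lists, without loading a tokenizer. The function name `find_token_spans` is hypothetical and is not part of tkitAutoTokenizerPosition; the exclusive `end` index is an assumption taken from the `tag` entries shown in the bio module example.

```python
from typing import Dict, Iterator, List


def find_token_spans(tokens: List[str], word_tokens: List[str]) -> Iterator[Dict[str, int]]:
    """Yield {'start', 'end'} for every place word_tokens occurs in tokens.

    Hypothetical helper for illustration only. 'end' is exclusive.
    """
    n = len(word_tokens)
    if n == 0:
        return  # an empty keyword matches nothing
    for i in range(len(tokens) - n + 1):
        # Compare a window of the token list with the keyword's tokens.
        if tokens[i:i + n] == word_tokens:
            yield {"start": i, "end": i + n}


tokens = ["早", "期", "[PAD]", "er", "##cp", "[PAD]", "(", "24"]
print(list(find_token_spans(tokens, ["er", "##cp"])))
# [{'start': 3, 'end': 5}]
```

Matching on tokens rather than characters keeps the positions aligned with the tokenizer's output, which matters when a word such as ERCP is split into several word pieces.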
- E_trans_to_C(string)
Converts Chinese punctuation to English punctuation.
- Parameters
string ([type]) – [description]
- Returns
[description]
- Return type
[type]
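As a rough illustration of the punctuation conversion described above, the following sketch maps a handful of full-width Chinese punctuation marks to their ASCII counterparts with `str.translate`. The mapping table and the name `to_english_punctuation` are assumptions made for this example; the table the library actually uses is not shown in this documentation.

```python
# Illustrative mapping only; not the library's actual table.
_CN_TO_EN = str.maketrans({
    "\uff0c": ",",   # full-width comma
    "\u3002": ".",   # ideographic full stop
    "\uff1b": ";",   # full-width semicolon
    "\uff1a": ":",   # full-width colon
    "\uff1f": "?",   # full-width question mark
    "\uff01": "!",   # full-width exclamation mark
    "\uff08": "(",   # full-width left parenthesis
    "\uff09": ")",   # full-width right parenthesis
})


def to_english_punctuation(text: str) -> str:
    """Replace the Chinese punctuation marks listed above with ASCII ones."""
    return text.translate(_CN_TO_EN)


print(to_english_punctuation("你好\uff0c世界\uff01"))
# 你好,世界!
```

Each mapped character is replaced by exactly one ASCII character, so character offsets computed before and after the conversion stay the same.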
- autoTypeWord(text, word, wType=None, startList=[])
[summary]
- Parameters
text ([type]) – [description]
word ([type]) – [description]
wType ([type], optional) – [description]. Defaults to None.
startList (list, optional) – [description]. Defaults to [].
- filterPunctuation(x)
Converts Chinese punctuation to English punctuation.
- Parameters
x ([type]) – [description]
- Returns
[description]
- Return type
[type]
- findAll(text, word)
Gets every start position of a word within the text.
- Parameters
text ([type]) – [description]
word ([type]) – [description]
- Yields
[type] – [description]
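A generator that yields every start position of a word can be sketched with repeated calls to `str.find`. This is a minimal illustration of the behavior described above, not the library's own code; the name `find_all` and the choice to report overlapping matches are assumptions.

```python
from typing import Iterator


def find_all(text: str, word: str) -> Iterator[int]:
    """Yield the start index of every occurrence of word in text."""
    if not word:
        return  # avoid matching the empty string at every index
    start = text.find(word)
    while start != -1:
        yield start
        # Resume one character later so overlapping matches are also reported.
        start = text.find(word, start + 1)


text = "急性胰腺炎@有研究显示,胆汁性急性胰腺炎患者"
print(list(find_all(text, "急性胰腺炎")))
# [0, 15]
```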
tkitAutoTokenizerPosition.bio module
Builds BIO-structured datasets.
- class tkitAutoTokenizerPosition.bio.autoBIO
Bases:
object
Used to build BIO-structured datasets.
- bulid(it)
The input format is as follows:
it = {'text': '急性胰腺炎@有研究显示,进行早期 ERCP (24 小时内)可以降低梗阻性胆总管结石患者的并发症发生率和死亡率; 但是,对于无胆总管梗阻的胆汁性急性胰腺炎患者,不需要进行早期 ERCP。',
 'wordList': ['急', '性', '胰', '腺', '炎', '@', '有', '研', '究', '显', '示', ',', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '[PAD]', '(', '24', '[PAD]', '小', '时', '内', ')', '可', '以', '降', '低', '梗', '阻', '性', '胆', '总', '管', '结', '石', '患', '者', '的', '并', '发', '症', '发', '生', '率', '和', '死', '亡', '率', ';', '[PAD]', '但', '是', ',', '对', '于', '无', '胆', '总', '管', '梗', '阻', '的', '胆', '汁', '性', '急', '性', '胰', '腺', '炎', '患', '者', ',', '不', '需', '要', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '。'],
 'tag': [{'start': 0, 'end': 5, 'type': '疾病'}, {'start': 69, 'end': 74, 'type': '疾病'}, {'start': 17, 'end': 19, 'type': '检查'}, {'start': 85, 'end': 87, 'type': '检查'},
The return format is as follows:
{'text': '急性胰腺炎@有研究显示,进行早期 ERCP (24 小时内)可以降低梗阻性胆总管结石患者的并发症发生率和死亡率; 但是,对于无胆总管梗阻的胆汁性急性胰腺炎患者,不需要进行早期 ERCP。',
 'wordList': ['急', '性', '胰', '腺', '炎', '@', '有', '研', '究', '显', '示', ',', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '[PAD]', '(', '24', '[PAD]', '小', '时', '内', ')', '可', '以', '降', '低', '梗', '阻', '性', '胆', '总', '管', '结', '石', '患', '者', '的', '并', '发', '症', '发', '生', '率', '和', '死', '亡', '率', ';', '[PAD]', '但', '是', ',', '对', '于', '无', '胆', '总', '管', '梗', '阻', '的', '胆', '汁', '性', '急', '性', '胰', '腺', '炎', '患', '者', ',', '不', '需', '要', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '。'],
 'tag': [{'start': 0, 'end': 5, 'type': '疾病'}, {'start': 69, 'end': 74, 'type': '疾病'}, {'start': 17, 'end': 19, 'type': '检查'}, {'start': 85, 'end': 87, 'type': '检查'}],
 'data': {'text': '急性胰腺炎@有研究显示,进行早期 ERCP (24 小时内)可以降低梗阻性胆总管结石患者的并发症发生率和死亡率; 但是,对于无胆总管梗阻的胆汁性急性胰腺炎患者,不需要进行早期 ERCP。', 'spo_list': [{'Combined': False, 'predicate': '影像学检查', 'subject': '急性胰腺炎', 'subject_type': '疾病', 'object': {'@value': 'ERCP'}, 'object_type': {'@value': '检查'}}]},
 'tagList': ['B-疾病', 'M-疾病', 'M-疾病', 'M-疾病', 'E-疾病', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-检查', 'E-检查', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-疾病', 'M-疾病', 'M-疾病', 'M-疾病', 'E-疾病', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-检查', 'E-检查', 'O']}
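The relationship between the `tag` spans and the `tagList` in the example above can be sketched as follows: each span marks its first token `B-`, its last token `E-`, and anything in between `M-`, while every other token is `O`. This is an illustration inferred from the sample output, not the library's implementation. The name `spans_to_tags` is hypothetical, and the handling of single-token spans (tagged `B-` only) is an assumption, since the example contains none.

```python
from typing import Dict, List


def spans_to_tags(length: int, spans: List[Dict]) -> List[str]:
    """Build a B-/M-/E-/O tag list from spans with exclusive 'end' indices."""
    tags = ["O"] * length
    for span in spans:
        start, end, label = span["start"], span["end"], span["type"]
        if not (0 <= start < end <= length):
            continue  # skip spans that fall outside the token list
        tags[start] = f"B-{label}"
        for i in range(start + 1, end - 1):
            tags[i] = f"M-{label}"
        if end - start > 1:
            tags[end - 1] = f"E-{label}"
        # Assumption: a single-token span keeps only its B- tag.
    return tags


spans = [
    {"start": 0, "end": 5, "type": "疾病"},
    {"start": 6, "end": 8, "type": "检查"},
]
print(spans_to_tags(9, spans))
# ['B-疾病', 'M-疾病', 'M-疾病', 'M-疾病', 'E-疾病', 'O', 'B-检查', 'E-检查', 'O']
```

When spans overlap, later spans overwrite earlier ones in this sketch; how the library resolves overlaps is not stated in the documentation.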
tkitAutoTokenizerPosition.span module
- class tkitAutoTokenizerPosition.span.autoSpan(labelsList=[])
Bases:
object
Generates data in the training format used by span models.
- autoSpan(datas, maxLen=128)
Automatically builds a span dataset.
The input format is shown under the datas parameter.
- Parameters
datas ([type]) – [description]
> datas = [{'text': '骨性关节炎@在其他关节(如踝关节和腕关节),骨性关节炎比较少见,并且一般有潜在的病因(如结晶性关节病、创伤)。',
> 'wordList': ['骨', '性', '关', '节', '炎', '@', '在', '其', '他', '关', '节', '(', '如', '踝', '关', '节', '和', '腕', '关', '节', ')', ',', '骨', '性', '关', '节', '炎', '比', '较', '少', '见', ',', '并', '且', '一', '般', '有', '潜', '在', '的', '病', '因', '(', '如', '结', '晶', '性', '关', '节', '病', '、', '创', '伤', ')', '。'],
> 'tag': [{'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 2, 'end': 4, 'type': '部位'}, {'start': 9, 'end': 11, 'type': '部位'}, {'start': 14, 'end': 16, 'type': '部位'}, {'start': 18, 'end': 20, 'type': '部位'}, {'start': 24, 'end': 26, 'type': '部位'}, {'start': 47, 'end': 49, 'type': '部位'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 13, 'end': 16, 'type': '部位'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 17, 'end': 20, 'type': '部位'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 44, 'end': 50, 'type': '社会学'}, {'start': 40, 'end': 42, 'type': '关系'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 51, 'end': 53, 'type': '社会学'}, {'start': 40, 'end': 42, 'type': '关系'}]}]
maxLen (int, optional) – [description]. Defaults to 128.
- Returns
[description]
- Return type
[type]