tkitAutoTokenizerPosition package

Submodules

tkitAutoTokenizerPosition.AutoPos module

tkitAutoTokenizerPosition.AutoTokenizerPosition module

class tkitAutoTokenizerPosition.AutoTokenizerPosition.AutoTokenizerPosition(tokenizer)

Bases: object

Processes NER data that provides only keywords.

Start position

tokenizer = BertTokenizer.from_pretrained("clue/albert_chinese_tiny")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

Using this tokenizer avoids unnecessary trouble.

## Installation

```
pip install tkitAutoTokenizerPosition

# or

pip install git+https://github.com/napoler/tkit-AutoTokenizerPosition
```

E_trans_to_C(string)

Converts Chinese punctuation to English punctuation.

Parameters

string ([type]) – [description]

Returns

[description]

Return type

[type]
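
The conversion described above can be sketched with a translation table. This is a minimal illustration, not the package's implementation: the function name `to_ascii_punctuation` and the contents of the mapping are assumptions, since the documentation does not list which characters are converted.

```python
# Hypothetical mapping from full-width (Chinese) punctuation to ASCII.
# The real table used by the package is not shown in its documentation.
PUNCTUATION_TABLE = str.maketrans({
    "\uff0c": ",",  # full-width comma
    "\u3002": ".",  # ideographic full stop
    "\uff1b": ";",  # full-width semicolon
    "\uff1a": ":",  # full-width colon
    "\uff08": "(",  # full-width left parenthesis
    "\uff09": ")",  # full-width right parenthesis
    "\uff1f": "?",  # full-width question mark
    "\uff01": "!",  # full-width exclamation mark
})


def to_ascii_punctuation(text: str) -> str:
    """Replace each mapped full-width punctuation mark with its ASCII form."""
    return text.translate(PUNCTUATION_TABLE)


converted = to_ascii_punctuation("你好\uff0c世界\u3002")
print(converted)
```

Because every replacement is one character for one character, the string length is unchanged, so character offsets computed before the conversion remain valid after it.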

autoLen(text)

Gets the positions of the text after tokenization.

Parameters

text ([type]) – [description]

autoTypeWord(text, word, wType=None, startList=[])

[summary]

Parameters
  • text ([type]) – [description]

  • word ([type]) – [description]

  • wType ([type], optional) – [description]. Defaults to None.

  • startList (list, optional) – [description]. Defaults to [].

clear(text)

Cleans up Chinese-specific issues in the text.

Args: text ([type]): [description]

filterPunctuation(x)

Converts Chinese punctuation to English punctuation.

Parameters

x ([type]) – [description]

Returns

[description]

Return type

[type]

findAll(text, word)

Gets every start position of a word in the text.

Parameters
  • text ([type]) – [description]

  • word ([type]) – [description]

Yields

[type] – [description]
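
A generator of this kind can be sketched with repeated calls to `str.find`. The name `find_all` and the choice to report overlapping matches are assumptions made for illustration; the package's own behavior for overlaps is not documented.

```python
from typing import Iterator


def find_all(text: str, word: str) -> Iterator[int]:
    """Yield the character offset of every occurrence of word in text."""
    if not word:
        return  # an empty word would match everywhere; yield nothing
    start = text.find(word)
    while start != -1:
        yield start
        # Advance by one character so overlapping matches are also found.
        start = text.find(word, start + 1)


positions = list(find_all("急性胰腺炎患者,急性胰腺炎", "胰腺炎"))
print(positions)  # [2, 10]
```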

fixPosition(text, word, startList=[])

Automatically gets the start positions after tokenization, matching every occurrence.

Passing positions in startList restricts the search to those positions.

Parameters
  • text ([type]) – [description]

  • word ([type]) – [description]

  • startList (list, optional) – [description]. Defaults to [].

Yields

[type] – [description]
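
One way to locate a word at the token level is to tokenize both the text and the word, then search for the word's tokens as a contiguous sub-list. The sketch below works on token lists that are already prepared, so it does not depend on any tokenizer. The name `find_token_spans`, the `(start, end)` return shape, and the reading of `start_list` as a set of permitted token indices are all assumptions, not the package's documented behavior.

```python
from typing import Iterator, List, Optional, Tuple


def find_token_spans(
    text_tokens: List[str],
    word_tokens: List[str],
    start_list: Optional[List[int]] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token indices, end exclusive, of each match."""
    width = len(word_tokens)
    if width == 0:
        return
    # Assumption: when start_list is given, only those start indices count.
    allowed = set(start_list) if start_list else None
    for start in range(len(text_tokens) - width + 1):
        if allowed is not None and start not in allowed:
            continue
        if text_tokens[start:start + width] == word_tokens:
            yield start, start + width


tokens = ["进", "行", "早", "期", "[PAD]", "er", "##cp", "[PAD]", "(", "24"]
print(list(find_token_spans(tokens, ["er", "##cp"])))  # [(5, 7)]
```

The sketch uses `None` as the default for `start_list` rather than a mutable `[]`, which avoids sharing one list object across calls.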

getText(wordList)

getWordList(text)

Returns the list of tokens for the text.

Args: text ([type]): [description]

tkitAutoTokenizerPosition.bio module

Builds BIO-format datasets.

class tkitAutoTokenizerPosition.bio.autoBIO

Bases: object

Used to build BIO-format datasets.

bulid(it)


it={'text': '急性胰腺炎@有研究显示,进行早期 ERCP (24 小时内)可以降低梗阻性胆总管结石患者的并发症发生率和死亡率; 但是,对于无胆总管梗阻的胆汁性急性胰腺炎患者,不需要进行早期 ERCP。', 'wordList': ['急', '性', '胰', '腺', '炎', '@', '有', '研', '究', '显', '示', ',', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '[PAD]', '(', '24', '[PAD]', '小', '时', '内', ')', '可', '以', '降', '低', '梗', '阻', '性', '胆', '总', '管', '结', '石', '患', '者', '的', '并', '发', '症', '发', '生', '率', '和', '死', '亡', '率', ';', '[PAD]', '但', '是', ',', '对', '于', '无', '胆', '总', '管', '梗', '阻', '的', '胆', '汁', '性', '急', '性', '胰', '腺', '炎', '患', '者', ',', '不', '需', '要', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '。'], 'tag': [{'start': 0, 'end': 5, 'type': '疾病'}, {'start': 69, 'end': 74, 'type': '疾病'}, {'start': 17, 'end': 19, 'type': '检查'}, {'start': 85, 'end': 87, 'type': '检查'},

The return format is as follows:

{'text': '急性胰腺炎@有研究显示,进行早期 ERCP (24 小时内)可以降低梗阻性胆总管结石患者的并发症发生率和死亡率; 但是,对于无胆总管梗阻的胆汁性急性胰腺炎患者,不需要进行早期 ERCP。', 'wordList': ['急', '性', '胰', '腺', '炎', '@', '有', '研', '究', '显', '示', ',', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '[PAD]', '(', '24', '[PAD]', '小', '时', '内', ')', '可', '以', '降', '低', '梗', '阻', '性', '胆', '总', '管', '结', '石', '患', '者', '的', '并', '发', '症', '发', '生', '率', '和', '死', '亡', '率', ';', '[PAD]', '但', '是', ',', '对', '于', '无', '胆', '总', '管', '梗', '阻', '的', '胆', '汁', '性', '急', '性', '胰', '腺', '炎', '患', '者', ',', '不', '需', '要', '进', '行', '早', '期', '[PAD]', 'er', '##cp', '。'], 'tag': [{'start': 0, 'end': 5, 'type': '疾病'}, {'start': 69, 'end': 74, 'type': '疾病'}, {'start': 17, 'end': 19, 'type': '检查'}, {'start': 85, 'end': 87, 'type': '检查'}], 'data': {'text': '急性胰腺炎@有研究显示,进行早期 ERCP (24 小时内)可以降低梗阻性胆总管结石患者的并发症发生率和死亡率; 但是,对于无胆总管梗阻的胆汁性急性胰腺炎患者,不需要进行早期 ERCP。', 'spo_list': [{'Combined': False, 'predicate': '影像学检查', 'subject': '急性胰腺炎', 'subject_type': '疾病', 'object': {'@value': 'ERCP'}, 'object_type': {'@value': '检查'}}]}, 'tagList': ['B-疾病', 'M-疾病', 'M-疾病', 'M-疾病', 'E-疾病', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-检查', 'E-检查', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-疾病', 'M-疾病', 'M-疾病', 'M-疾病', 'E-疾病', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-检查', 'E-检查', 'O']}
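
The relationship between the `tag` spans and the resulting `tagList` in the example can be sketched as follows. In the example, `end` is exclusive and each span is labeled with a B- prefix on its first token, M- on interior tokens, and E- on its last token. The function name `build_tag_list` is hypothetical, and two behaviors are assumptions not confirmed by the documentation: a single-token span receives only the B- tag, and later spans overwrite earlier ones where they overlap.

```python
from typing import Dict, List


def build_tag_list(num_tokens: int, tags: List[Dict]) -> List[str]:
    """Turn {'start', 'end', 'type'} spans into one B/M/E/O tag per token."""
    tag_list = ["O"] * num_tokens
    for tag in tags:
        start, end, label = tag["start"], tag["end"], tag["type"]
        # Skip spans that are empty or fall outside the token list.
        if not (0 <= start < end <= num_tokens):
            continue
        tag_list[start] = f"B-{label}"
        for index in range(start + 1, end - 1):
            tag_list[index] = f"M-{label}"
        if end - start > 1:
            tag_list[end - 1] = f"E-{label}"
    return tag_list


example_tags = [
    {"start": 0, "end": 5, "type": "疾病"},
    {"start": 6, "end": 8, "type": "检查"},
]
print(build_tag_list(9, example_tags))
```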

tkitAutoTokenizerPosition.span module

class tkitAutoTokenizerPosition.span.autoSpan(labelsList=[])

Bases: object

Produces data in the training format used by span models.

autoSpan(datas, maxLen=128)

Automatically builds a Span dataset.

The input format is as follows.

Parameters

datas ([type]) – [description]

> datas=[{'text': '骨性关节炎@在其他关节(如踝关节和腕关节),骨性关节炎比较少见,并且一般有潜在的病因(如结晶性关节病、创伤)。', 'wordList': ['骨', '性', '关', '节', '炎', '@', '在', '其', '他', '关', '节', '(', '如', '踝', '关', '节', '和', '腕', '关', '节', ')', ',', '骨', '性', '关', '节', '炎', '比', '较', '少', '见', ',', '并', '且', '一', '般', '有', '潜', '在', '的', '病', '因', '(', '如', '结', '晶', '性', '关', '节', '病', '、', '创', '伤', ')', '。'], 'tag': [{'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 2, 'end': 4, 'type': '部位'}, {'start': 9, 'end': 11, 'type': '部位'}, {'start': 14, 'end': 16, 'type': '部位'}, {'start': 18, 'end': 20, 'type': '部位'}, {'start': 24, 'end': 26, 'type': '部位'}, {'start': 47, 'end': 49, 'type': '部位'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 13, 'end': 16, 'type': '部位'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 17, 'end': 20, 'type': '部位'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 44, 'end': 50, 'type': '社会学'}, {'start': 40, 'end': 42, 'type': '关系'}, {'start': 0, 'end': 5, 'type': '疾病'}, {'start': 22, 'end': 27, 'type': '疾病'}, {'start': 51, 'end': 53, 'type': '社会学'}, {'start': 40, 'end': 42, 'type': '关系'}]}]

maxLen (int, optional): [description]. Defaults to 128.

Returns

[description]

Return type

[type]

bulidSpanMatrix(data, maxLen=128)

Builds span data.

> data=[{'start': 65, 'end': 70, 'type': '疾病'}]

Parameters
  • data ([type]) – [description]

  • maxLen (int, optional) – [description]. Defaults to 128.

Returns

[description]

Return type

[type]
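
The documentation does not describe the layout of the returned span data. A common encoding for span-based NER, shown here purely as an illustration, is a `maxLen` by `maxLen` matrix in which the cell at row `start` and column `end - 1` holds a label id. Everything in this sketch is an assumption: the name `build_span_matrix`, the label ids (0 for "no entity", then 1-based positions in a labels list), the inclusive end column, and the choice to skip spans that do not fit.

```python
from typing import Dict, List


def build_span_matrix(
    data: List[Dict], labels: List[str], max_len: int = 128
) -> List[List[int]]:
    """Build a max_len x max_len matrix marking each span with a label id."""
    # 0 means "no entity"; label ids start at 1 (an assumed convention).
    label_ids = {label: index + 1 for index, label in enumerate(labels)}
    matrix = [[0] * max_len for _ in range(max_len)]
    for span in data:
        start, end = span["start"], span["end"]
        # Skip spans that are empty or extend past max_len tokens.
        if not (0 <= start < end <= max_len):
            continue
        # Row = first token, column = last token (end is exclusive).
        matrix[start][end - 1] = label_ids[span["type"]]
    return matrix


matrix = build_span_matrix(
    [{"start": 65, "end": 70, "type": "疾病"}], labels=["疾病", "检查"]
)
print(matrix[65][69])  # 1
```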

Module contents