update corpus and add infer.

xuming06 2018-04-11 12:45:27 +08:00
parent 77a383e645
commit a1829942e3
8 changed files with 112 additions and 216 deletions

View File

@@ -11151,25 +11151,6 @@
<ERROR start_off="37" end_off="37" type="S"></ERROR> <ERROR start_off="37" end_off="37" type="S"></ERROR>
</DOC> </DOC>
<DOC>
<TEXT id="11163200405109523100546_2_1x2">
还有,如果有人在公共场所抽烟,那么周围的人很容易影响到抽烟的坏处,尤其是对生长的青少年来说这样的问题很甚,不仅害他们的身体,还害他们的精神。因为他们是还没成了完全的好精神,他们很容易模仿抽烟的人。
</TEXT>
<CORRECTION>
还有,如果有人在公共场所抽烟,那么周围的人很容易受抽烟的坏处影响,尤其是对生长的青少年来说这样的问题很严重,不仅损害他们的身体,还伤害他们的精神。因为他们是还没形成完整的好的思想,他们很容易模仿抽烟的人。
</CORRECTION>
<ERROR start_off="25" end_off="25" type="M"></ERROR>
<ERROR start_off="27" end_off="27" type="R"></ERROR>
<ERROR start_off="25" end_off="32" type="W"></ERROR>
<ERROR start_off="30" end_off="32" type="R"></ERROR>
<ERROR start_off="52" end_off="52" type="S"></ERROR>
<ERROR start_off="56" end_off="56" type="S"></ERROR>
<ERROR start_off="64" end_off="64" type="S"></ERROR>
<ERROR start_off="78" end_off="79" type="S"></ERROR>
<ERROR start_off="80" end_off="81" type="S"></ERROR>
<ERROR start_off="84" end_off="86" type="S"></ERROR>
</DOC>
<DOC> <DOC>
<TEXT id="11182200112576525100017_2_11x3"> <TEXT id="11182200112576525100017_2_11x3">
”我现在也忘不了那时感到的幸福感。这样以来,如果碰到困难时,我觉得以后一定打破这个困难,我相信我一定可以办好。本来我对什么事情也觉得乐观,碰到什么困难也总有一天一定可以克服。 ”我现在也忘不了那时感到的幸福感。这样以来,如果碰到困难时,我觉得以后一定打破这个困难,我相信我一定可以办好。本来我对什么事情也觉得乐观,碰到什么困难也总有一天一定可以克服。
@@ -70200,23 +70181,6 @@
<ERROR start_off="13" end_off="13" type="S"></ERROR> <ERROR start_off="13" end_off="13" type="S"></ERROR>
</DOC> </DOC>
<DOC>
<TEXT id="73191200210529529251002_2_2x3">
我家里的问题幸好没有过得解不可治疗。一个方法可以解决代沟是很容易,只唯有“爱”什么都会有办法了。我妈妈担心姐姐的安全,会有这样说。
</TEXT>
<CORRECTION>
我家里的问题幸好没有解决不了的。一个可以很容易解决代沟的方法,只要有“爱”什么都会有办法了。我妈妈担心姐姐的安全,才会这样说。
</CORRECTION>
<ERROR start_off="11" end_off="12" type="R"></ERROR>
<ERROR start_off="15" end_off="18" type="S"></ERROR>
<ERROR start_off="30" end_off="30" type="R"></ERROR>
<ERROR start_off="22" end_off="33" type="W"></ERROR>
<ERROR start_off="34" end_off="34" type="M"></ERROR>
<ERROR start_off="36" end_off="36" type="S"></ERROR>
<ERROR start_off="61" end_off="61" type="M"></ERROR>
<ERROR start_off="62" end_off="62" type="R"></ERROR>
</DOC>
<DOC> <DOC>
<TEXT id="73208200405204525200337_2_5x1"> <TEXT id="73208200405204525200337_2_5x1">
因为对他们来说,虽然吸烟对身体不好,但实际上是他们的爱好、娱乐和放松的方法。 因为对他们来说,虽然吸烟对身体不好,但实际上是他们的爱好、娱乐和放松的方法。

View File

@@ -1,3 +1,4 @@
<ROOT>
<DOC> <DOC>
<TEXT id="200305533533250052_2_2x1"> <TEXT id="200305533533250052_2_2x1">
而他们所讨论的就是佛教和西方社会对于“破坏性的情感”和近几年在美国非常流行的“情绪管理”有何而为而有着不同的说法和见解。 而他们所讨论的就是佛教和西方社会对于“破坏性的情感”和近几年在美国非常流行的“情绪管理”有何而为而有着不同的说法和见解。
@@ -25672,7 +25673,7 @@ A和尚慢慢地了解B和尚的心情因此做出了决定。
<ERROR start_off="4" end_off="5" type="W"></ERROR> <ERROR start_off="4" end_off="5" type="W"></ERROR>
</DOC> </DOC>
<DOC> <DOC>
<TEXT id="200105109525200822_2_4x3"> <TEXT id="200105109525200822_2_4x3">
如果充分的钱,我们似乎可以得到什么东西、什么服务。但是,这是真好的事情吗?如果我们失去实干苦干的独立精神,我们会变成笨头笨脑的人。 如果充分的钱,我们似乎可以得到什么东西、什么服务。但是,这是真好的事情吗?如果我们失去实干苦干的独立精神,我们会变成笨头笨脑的人。
</TEXT> </TEXT>
<CORRECTION> <CORRECTION>
@@ -31742,7 +31743,6 @@ <TEXT id="200105109525200822_2_4x3">
第一,我认为生命是上帝和父母给我们的,谁也不准随便放弃,了断生命。第二,选择“安乐死”就是逃避生活,放弃人生的意思。有这样的想法所以我不同意“安乐死”。 第一,我认为生命是上帝和父母给我们的,谁也不准随便放弃,了断生命。第二,选择“安乐死”就是逃避生活,放弃人生的意思。有这样的想法所以我不同意“安乐死”。
</CORRECTION> </CORRECTION>
<ERROR start_off="10" end_off="11" type="S"></ERROR> <ERROR start_off="10" end_off="11" type="S"></ERROR>
<ERROR start_off="29" end_off="
<ERROR start_off="46" end_off="46" type="R"></ERROR> <ERROR start_off="46" end_off="46" type="R"></ERROR>
<ERROR start_off="47" end_off="50" type="W"></ERROR> <ERROR start_off="47" end_off="50" type="W"></ERROR>
<ERROR start_off="66" end_off="66" type="R"></ERROR> <ERROR start_off="66" end_off="66" type="R"></ERROR>
@@ -39739,6 +39739,7 @@ <TEXT id="200105109525200822_2_4x3">
</TEXT> </TEXT>
<CORRECTION> <CORRECTION>
因为现在的中国发展越来越快,所以将来需要能说汉语的人。现在也是很多日本的公司依靠中国的公司。 因为现在的中国发展越来越快,所以将来需要能说汉语的人。现在也是很多日本的公司依靠中国的公司。
</CORRECTION>
<ERROR start_off="11" end_off="11" type="M"></ERROR> <ERROR start_off="11" end_off="11" type="M"></ERROR>
<ERROR start_off="8" end_off="12" type="W"></ERROR> <ERROR start_off="8" end_off="12" type="W"></ERROR>
<ERROR start_off="29" end_off="37" type="W"></ERROR> <ERROR start_off="29" end_off="37" type="W"></ERROR>
@@ -42457,7 +42458,7 @@ <TEXT id="200105109525200822_2_4x3">
<ERROR start_off="98" end_off="99" type="W"></ERROR> <ERROR start_off="98" end_off="99" type="W"></ERROR>
</DOC> </DOC>
<DOC> <DOC>
<TEXT id="200307217523100105_2_2x1"> <TEXT id="200307217523100105_2_2x1">
但发明了“绿色食品”以后,产生了一个很大的问题,那就是随着“绿色食品”的增加却减小了农作物总产量,而现在世界上还有几亿人因缺少粮食而挨饿。 但发明了“绿色食品”以后,产生了一个很大的问题,那就是随着“绿色食品”的增加却减小了农作物总产量,而现在世界上还有几亿人因缺少粮食而挨饿。
</TEXT> </TEXT>
<CORRECTION> <CORRECTION>
@@ -103412,21 +103413,21 @@ Title一封写给父母的信我来到汉城已经很久了。
<CORRECTION> <CORRECTION>
我希望这个世界再变成先想他人再想自己的那种社会。吸烟多么不好自己应该都知道,但就是戒不了,对于不吸烟的人来说,会多么烦啊!对公众利益的影响多么大啊!现在空气污染太严重太严重了,怎么清洁也是干净不了了。这些事情该政府出面解决了,应该采取严厉的惩罚措施,让吸烟者们受到处罚,让吸毒者们进监狱,这样的话就会保持干净。 我希望这个世界再变成先想他人再想自己的那种社会。吸烟多么不好自己应该都知道,但就是戒不了,对于不吸烟的人来说,会多么烦啊!对公众利益的影响多么大啊!现在空气污染太严重太严重了,怎么清洁也是干净不了了。这些事情该政府出面解决了,应该采取严厉的惩罚措施,让吸烟者们受到处罚,让吸毒者们进监狱,这样的话就会保持干净。
</CORRECTION> </CORRECTION>
<ERROR start_off="9"end_off="11"type="S"></ERROR> <ERROR start_off="9" end_off="11" type="S"></ERROR>
<ERROR start_off="32"end_off="35"type="W"></ERROR> <ERROR start_off="32" end_off="35" type="W"></ERROR>
<ERROR start_off="37"end_off="38"type="S"></ERROR> <ERROR start_off="37" end_off="38" type="S"></ERROR>
<ERROR start_off="71"end_off="72"type="R"></ERROR> <ERROR start_off="71" end_off="72" type="R"></ERROR>
<ERROR start_off="85"end_off="85"type="S"></ERROR> <ERROR start_off="85" end_off="85" type="S"></ERROR>
<ERROR start_off="88"end_off="88"type="S"></ERROR> <ERROR start_off="88" end_off="88" type="S"></ERROR>
<ERROR start_off="93"end_off="94"type="S"></ERROR> <ERROR start_off="93" end_off="94" type="S"></ERROR>
<ERROR start_off="110"end_off="111"type="S"></ERROR> <ERROR start_off="110" end_off="111" type="S"></ERROR>
<ERROR start_off="118"end_off="118"type="M"></ERROR> <ERROR start_off="118" end_off="118" type="M"></ERROR>
<ERROR start_off="120"end_off="120"type="M"></ERROR> <ERROR start_off="120" end_off="120" type="M"></ERROR>
<ERROR start_off="130"end_off="131"type="S"></ERROR> <ERROR start_off="130" end_off="131" type="S"></ERROR>
<ERROR start_off="132"end_off="133"type="S"></ERROR> <ERROR start_off="132" end_off="133" type="S"></ERROR>
<ERROR start_off="140"end_off="141"type="R"></ERROR> <ERROR start_off="140" end_off="141" type="R"></ERROR>
<ERROR start_off="148"end_off="149"type="R"></ERROR> <ERROR start_off="148" end_off="149" type="R"></ERROR>
<ERROR start_off="156"end_off="157"type="R"></ERROR> <ERROR start_off="156" end_off="157" type="R"></ERROR>
</DOC> </DOC>
<DOC> <DOC>
<TEXT id="200505205533200167_2_3x1"> <TEXT id="200505205533200167_2_3x1">
@@ -110061,3 +110062,4 @@ Title一封写给父母的信我来到汉城已经很久了。
<ERROR start_off="91" end_off="91" type="R"></ERROR> <ERROR start_off="91" end_off="91" type="R"></ERROR>
<ERROR start_off="119" end_off="119" type="M"></ERROR> <ERROR start_off="119" end_off="119" type="M"></ERROR>
</DOC> </DOC>
</ROOT>

View File

@@ -1,116 +0,0 @@
<ROOT>
<DOC>
<TEXT id="200405109523200554_2_1x1">
他们知不道吸烟对未成年年的影响会造成的各种害处。
</TEXT>
<CORRECTION>
他们不知道吸烟对未成年人会造成的各种伤害。
</CORRECTION>
<ERROR start_off="3" end_off="4" type="W"></ERROR>
<ERROR start_off="12" end_off="12" type="S"></ERROR>
<ERROR start_off="13" end_off="15" type="R"></ERROR>
<ERROR start_off="22" end_off="23" type="S"></ERROR>
</DOC>
<DOC>
<TEXT id="200505109634201470_2_4x2">
从此,父母亲就会教咱们爬行、走路、叫爸爸妈妈。到我们长大了,我们开始从妈妈爸爸身上模仿行为,譬如学习爸妈走路时的高雅步姿,坐姿、礼貌、习惯……渐渐地你会发觉一直好象是自己好奇,觉得有趣才会照着做,模仿着双亲,但不知不觉间他们影响到孩子们的不再是表面的行为,思想,心态、待人接物上我们都领受了不少,这的确会影响我们的成长。
</TEXT>
<CORRECTION>
从此,父母亲就会教我们爬行、走路、叫爸爸妈妈。到我们长大了,我们开始从妈妈爸爸身上模仿行为,譬如学习爸妈走路时的高雅步姿,坐姿、礼貌、习惯……渐渐地你会发觉好象是自己一直好奇,觉得有趣才会照着做,模仿着双亲,但不知不觉间他们影响到孩子们的不再是表面的行为,思想,心态、待人接物上我们都领受了不少,这的确会影响我们的成长。
</CORRECTION>
<ERROR start_off="10" end_off="11" type="S"></ERROR>
<ERROR start_off="79" end_off="85" type="W"></ERROR>
</DOC>
<DOC>
<TEXT id="200510111523200014_2_1x1">
有些不喜欢流行歌曲的人也说流行歌曲能引起不好的作用。
</TEXT>
<CORRECTION>
有些不喜欢流行歌曲的人也说流行歌曲能引起不好的后果。
</CORRECTION>
<ERROR start_off="24" end_off="25" type="S"></ERROR>
</DOC>
<DOC>
<TEXT id="200505204525200257_2_2x2">
如果它呈出不太香的颜色,那就意味着它颜色的来源——你,就是教给它呈出那样的味道的。如果你将一块“白布”成功地染上的话,会出什么样的颜色呢?
</TEXT>
<CORRECTION>
如果它呈出不太美的颜色,那就意味着它颜色的来源——你,就是教给它呈出那样的颜色的人没教好。如果你将一块“白布”成功地染上颜色的话,会出什么样的颜色呢?
</CORRECTION>
<ERROR start_off="8" end_off="8" type="S"></ERROR>
<ERROR start_off="34" end_off="35" type="S"></ERROR>
<ERROR start_off="41" end_off="41" type="M"></ERROR>
<ERROR start_off="57" end_off="57" type="M"></ERROR>
</DOC>
<DOC>
<TEXT id="200405109525200464_2_5x1">
这都是他们自己引起的,埋怨什么呢?
</TEXT>
<CORRECTION>
这都是他们自己造成的,埋怨什么呢?
</CORRECTION>
<ERROR start_off="8" end_off="9" type="S"></ERROR>
</DOC>
<DOC>
<TEXT id="200310576525200063_2_6x1">
他长大以后突然产生要跟女孩交玩儿的念头。
</TEXT>
<CORRECTION>
他长大以后突然产生要跟女孩儿玩耍的念头。
</CORRECTION>
<ERROR start_off="14" end_off="16" type="S"></ERROR>
</DOC>
<DOC>
<TEXT id="200307271523200070_2_2x1">
可是我觉得饥饿是用科学机技来能解决问题,所以我认为吃“绿色食品”是还是重要的问题。
</TEXT>
<CORRECTION>
可是我觉得饥饿是用科学技术能解决的问题,所以我认为吃“绿色食品”是更重要的问题。
</CORRECTION>
<ERROR start_off="12" end_off="13" type="S"></ERROR>
<ERROR start_off="14" end_off="14" type="R"></ERROR>
<ERROR start_off="18" end_off="18" type="M"></ERROR>
<ERROR start_off="34" end_off="35" type="S"></ERROR>
</DOC>
<DOC>
<TEXT id="200405109523100546_2_1x1">
在韩国最近很流行不允许的电视节目,这节目说公共场所抽烟是不道德的行为。
</TEXT>
<CORRECTION>
在韩国最近不允许抽烟的电视节目很流行,这些节目说在公共场所抽烟是不道德的行为。
</CORRECTION>
<ERROR start_off="6" end_off="16" type="W"></ERROR>
<ERROR start_off="12" end_off="12" type="M"></ERROR>
<ERROR start_off="19" end_off="19" type="M"></ERROR>
<ERROR start_off="22" end_off="22" type="M"></ERROR>
</DOC>
<DOC>
<TEXT id="200510302523100195_2_9x2">
如果他喜欢听什么,就能听什么。因为这种现象,因韩流而得到的经济利益也很多。
</TEXT>
<CORRECTION>
他喜欢听什么,就能听什么。因为这种现象,韩国因韩流而得到的经济利益也很多。
</CORRECTION>
<ERROR start_off="1" end_off="2" type="R"></ERROR>
<ERROR start_off="23" end_off="23" type="M"></ERROR>
</DOC>
<DOC>
<TEXT id="200505109922201218_2_2x1">
从环境里小孩子能快速地学或模仿他所见到的事物。
</TEXT>
<CORRECTION>
小孩子能从环境里快速地学或模仿他所见到的事物。
</CORRECTION>
<ERROR start_off="1" end_off="8" type="W"></ERROR>
</DOC>
<DOC>
<TEXT id="200505522522250013_2_7x1">
认识到结婚过程不满六个月,也可以说我的故事中我是主动的。
</TEXT>
<CORRECTION>
认识到结婚的过程不满六个月,也可以说在我的故事中我是主动的。
</CORRECTION>
<ERROR start_off="6" end_off="6" type="M"></ERROR>
<ERROR start_off="18" end_off="18" type="M"></ERROR>
</DOC>
</ROOT>
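The `start_off`/`end_off` attributes in the `<ERROR>` entries above are 1-based, inclusive character positions into the `<TEXT>` sentence; the preprocessing below converts them with `range(int(start_off) - 1, int(end_off))`. A minimal sketch of that mapping, using the first deleted `<DOC>` above:

```python
# Sketch: how 1-based, inclusive <ERROR> offsets index into <TEXT>,
# mirroring the locate_dict loop in the preprocessing code below.
# Sentence and offsets are taken from the first deleted <DOC> above.
text = "他们知不道吸烟对未成年年的影响会造成的各种害处。"
errors = [(3, 4, "W"), (12, 12, "S"), (13, 15, "R"), (22, 23, "S")]

locate_dict = {}
for start_off, end_off, error_type in errors:
    # offsets are 1-based and inclusive, so shift the start down by one
    for i in range(start_off - 1, end_off):
        locate_dict[i] = error_type

# characters outside any error span keep an "O" (here: stand-in) label
labels = [locate_dict.get(i, "O") for i in range(len(text))]
print(list(zip(text, labels))[:5])  # 知/不 carry W, a word-order error
```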

View File

@@ -69,15 +69,5 @@ def get_max_len(word_ids):
return max(len(line) for line in word_ids) return max(len(line) for line in word_ids)
def test_reader(path): def load_test_id(dict_path):
print('Loading test data from %s' % path) return [''.join(line.strip().split()) for idx, line in enumerate(open(dict_path, 'r', encoding='utf-8').readlines())]
sids = []
contents = []
with open(path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip().split('\t')
sids = line[0].replace('(sid=', '').replace(')', '')
text = [w for w in line[1]]
sids.append(sids)
contents.append(text)
return sids, contents
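The new `load_test_id` replaces `test_reader`, which was also buggy: it rebound the result list `sids` to a string and then called `.append` on it. The per-line logic of the replacement (strip, split on any whitespace, re-join the tokens into one id) can be sketched on in-memory lines; the second sample id is made up to show the whitespace collapsing:

```python
# Per-line logic of load_test_id: strip the line, split on any whitespace,
# and join the tokens back into a single id string.
sample_lines = ["200405109523200554_2_1x1 \n", " 2004 0510 9634\n"]
test_ids = [''.join(line.strip().split()) for line in sample_lines]
print(test_ids)  # internal whitespace is collapsed away
```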

View File

@@ -6,15 +6,17 @@ import rnn_crf_config as config
from data_reader import get_max_len from data_reader import get_max_len
from data_reader import load_dict from data_reader import load_dict
from data_reader import load_reverse_dict from data_reader import load_reverse_dict
from data_reader import load_test_id
from data_reader import pad_sequence from data_reader import pad_sequence
from data_reader import vectorize_data from data_reader import vectorize_data
from rnn_crf_model import load_model from rnn_crf_model import load_model
def infer(save_model_path, test_word_path, test_label_path, def infer(save_model_path, test_id_path, test_word_path, test_label_path,
word_dict_path=None, label_dict_path=None, save_pred_path=None, word_dict_path=None, label_dict_path=None, save_pred_path=None,
batch_size=64, embedding_dim=100, rnn_hidden_dim=200): batch_size=64, embedding_dim=100, rnn_hidden_dim=200):
# load dict # load dict
test_ids = load_test_id(test_id_path)
word_ids_dict, ids_word_dict = load_dict(word_dict_path), load_reverse_dict(word_dict_path) word_ids_dict, ids_word_dict = load_dict(word_dict_path), load_reverse_dict(word_dict_path)
label_ids_dict, ids_label_dict = load_dict(label_dict_path), load_reverse_dict(label_dict_path) label_ids_dict, ids_label_dict = load_dict(label_dict_path), load_reverse_dict(label_dict_path)
# read data to index # read data to index
@@ -30,14 +32,14 @@ def infer(save_model_path, test_word_path, test_label_path,
probs = model.predict(word_seq, batch_size=batch_size).argmax(-1) probs = model.predict(word_seq, batch_size=batch_size).argmax(-1)
assert len(probs) == len(label_seq) assert len(probs) == len(label_seq)
print('probs.shape:', probs.shape) print('probs.shape:', probs.shape)
save_preds(probs, ids_word_dict, label_ids_dict, ids_label_dict, word_seq, save_pred_path) save_preds(probs, test_ids, ids_word_dict, label_ids_dict, ids_label_dict, word_seq, save_pred_path)
def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out_path): def save_preds(preds, test_ids , ids_word_dict, label_ids_dict, ids_label_dict, X_test, out_path):
with open(out_path, 'w', encoding='utf-8') as f: with open(out_path, 'w', encoding='utf-8') as f:
for i in range(len(X_test)): for i in range(len(X_test)):
sent = X_test[i] sent = X_test[i]
# sent = sid_test[i] sid = test_ids[i]
sentence = ''.join([ids_word_dict[i] for i in sent if i > 0]) sentence = ''.join([ids_word_dict[i] for i in sent if i > 0])
label = [] label = []
for j in range(len(sent)): for j in range(len(sent)):
@@ -45,7 +47,6 @@ def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out
label.append(preds[i][j]) label.append(preds[i][j])
error_flag = False error_flag = False
is_correct = False is_correct = False
current_error = 0 current_error = 0
start_pos = 0 start_pos = 0
for k in range(len(label)): for k in range(len(label)):
@@ -61,7 +62,7 @@ def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out
label[k] != label_ids_dict['R'] and label[k] != label_ids_dict['S'] and \ label[k] != label_ids_dict['R'] and label[k] != label_ids_dict['S'] and \
label[k] != label_ids_dict['M'] and label[k] != label_ids_dict['W']): label[k] != label_ids_dict['M'] and label[k] != label_ids_dict['W']):
end_pos = k end_pos = k
f.write('%s, %d, %d, %s\n' % (sentence, start_pos, end_pos, ids_label_dict[current_error])) f.write('%s, %d, %d, %s\n' % (sid, start_pos, end_pos, ids_label_dict[current_error]))
error_flag = False error_flag = False
current_error = 0 current_error = 0
@@ -70,17 +71,18 @@ def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out
label[k] == label_ids_dict['R'] or label[k] == label_ids_dict['S'] or \ label[k] == label_ids_dict['R'] or label[k] == label_ids_dict['S'] or \
label[k] == label_ids_dict['M'] or label[k] == label_ids_dict['W']): label[k] == label_ids_dict['M'] or label[k] == label_ids_dict['W']):
end_pos = k end_pos = k
f.write('%s, %d, %d, %s\n' % (sentence, start_pos, end_pos, ids_label_dict[current_error])) f.write('%s, %d, %d, %s\n' % (sid, start_pos, end_pos, ids_label_dict[current_error]))
start_pos = k + 1 start_pos = k + 1
current_error = label[k] current_error = label[k]
if not is_correct: if not is_correct:
f.write('%s, correct\n' % (sentence)) f.write('%s, correct\n' % (sid))
print('done, size: %d' % len(X_test)) print('done, infer data size: %d' % len(X_test))
if __name__ == '__main__': if __name__ == '__main__':
infer(config.save_model_path, infer(config.save_model_path,
config.test_id_path,
config.test_word_path, config.test_word_path,
config.test_label_path, config.test_label_path,
word_dict_path=config.word_dict_path, word_dict_path=config.word_dict_path,
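The rewritten `save_preds` scans each predicted label sequence and now writes one `sid, start, end, type` line per contiguous run of an error label (R/S/M/W), keyed by sentence id instead of sentence text. The run-grouping idea can be sketched in isolation; the label list here is illustrative, not model output:

```python
# Sketch: collapsing per-character labels into (start, end, type) spans,
# the same idea as the label-scanning loop in save_preds.
labels = ["O", "S", "S", "O", "W", "O"]
ERROR_TYPES = {"R", "S", "M", "W"}

spans = []
start = None
for k in range(len(labels) + 1):  # one extra step flushes a trailing run
    run_ended = start is not None and (k == len(labels) or labels[k] != labels[start])
    if run_ended:
        spans.append((start, k - 1, labels[start]))
        start = None
    if k < len(labels) and labels[k] in ERROR_TYPES and start is None:
        start = k
print(spans)  # [(1, 2, 'S'), (4, 4, 'W')]
```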

View File

@@ -7,17 +7,19 @@ import rnn_crf_config as config
from utils.text_utils import segment from utils.text_utils import segment
def load_corpus_data(data_path): def parse_xml_file(path):
word_lst, label_lst = [], [] print('Parse data from %s' % path)
with open(data_path, 'r', encoding='utf-8') as f: id_lst, word_lst, label_lst = [], [], []
dom_tree = minidom.parse(f) with open(path, 'r', encoding='utf-8') as f:
dom_tree = minidom.parse(path)
docs = dom_tree.documentElement.getElementsByTagName('DOC') docs = dom_tree.documentElement.getElementsByTagName('DOC')
for doc in docs: for doc in docs:
# Input the text # Input the text
sentence = doc.getElementsByTagName('TEXT')[0]. \ text = doc.getElementsByTagName('TEXT')[0]. \
childNodes[0].data.strip() childNodes[0].data.strip()
text_id = doc.getElementsByTagName('TEXT')[0].getAttribute('id')
errors = doc.getElementsByTagName('ERROR') errors = doc.getElementsByTagName('ERROR')
# Find the error position and error type # Locate the error position and error type
locate_dict = {} locate_dict = {}
for error in errors: for error in errors:
start_off = error.getAttribute('start_off') start_off = error.getAttribute('start_off')
@@ -26,7 +28,7 @@ def load_corpus_data(data_path):
for i in range(int(start_off) - 1, int(end_off)): for i in range(int(start_off) - 1, int(end_off)):
locate_dict[i] = error_type locate_dict[i] = error_type
# Segment with pos # Segment with pos
word_seq, pos_seq = segment(sentence, cut_type='char', pos=True) word_seq, pos_seq = segment(text, cut_type='char', pos=True)
word_arr, label_arr = [], [] word_arr, label_arr = [], []
for i in range(len(word_seq)): for i in range(len(word_seq)):
if i in locate_dict: if i in locate_dict:
@@ -37,10 +39,61 @@ def load_corpus_data(data_path):
word_arr.append(word_seq[i]) word_arr.append(word_seq[i])
# Fill with pos tag # Fill with pos tag
label_arr.append(pos_seq[i]) label_arr.append(pos_seq[i])
id_lst.append(text_id)
word_lst.append(word_arr) word_lst.append(word_arr)
label_lst.append(label_arr) label_lst.append(label_arr)
return id_lst, word_lst, label_lst
return word_lst, label_lst
def parse_txt_file(input_path, truth_path):
print('Parse data from %s and %s' % (input_path, truth_path))
id_lst, word_lst, label_lst = [], [], []
# read truth file
truth_dict = {}
with open(truth_path, 'r', encoding='utf-8') as truth_f:
for line in truth_f:
parts = line.strip().split(',')
# Locate the error position
locate_dict = {}
if len(parts) == 4:
text_id = parts[0]
start_off = parts[1]
end_off = parts[2]
error_type = parts[3]
for i in range(int(start_off) - 1, int(end_off)):
locate_dict[i] = error_type
if text_id in truth_dict:
truth_dict[text_id].append(locate_dict)
else:
truth_dict[text_id] = [locate_dict]
# read input file and get tokenize
with open(input_path, 'r', encoding='utf-8') as input_f:
for line in input_f:
parts = line.strip().split('\t')
text_id = parts[0].replace('(sid=', '').replace(')', '')
text = parts[1]
# Segment with pos
word_seq, pos_seq = segment(text, cut_type='char', pos=True)
word_arr, label_arr = [], []
if text_id in truth_dict:
for locate_dict in truth_dict[text_id]:
for i in range(len(word_seq)):
if i in locate_dict:
word_arr.append(word_seq[i])
# Fill with error type
label_arr.append(locate_dict[i])
else:
word_arr.append(word_seq[i])
# Fill with pos tag
label_arr.append(pos_seq[i])
else:
word_arr = word_seq
label_arr = pos_seq
id_lst.append(text_id)
word_lst.append(word_arr)
label_lst.append(label_arr)
return id_lst, word_lst, label_lst
def transform_corpus_data(data_list, data_path): def transform_corpus_data(data_list, data_path):
@@ -56,17 +109,19 @@ if __name__ == '__main__':
# train data # train data
train_words, train_labels = [], [] train_words, train_labels = [], []
for path in config.train_paths: for path in config.train_paths:
word_list, label_list = load_corpus_data(path) _, word_list, label_list = parse_xml_file(path)
train_words.extend(word_list) train_words.extend(word_list)
train_labels.extend(label_list) train_labels.extend(label_list)
transform_corpus_data(train_words, config.train_word_path) transform_corpus_data(train_words, config.train_word_path)
transform_corpus_data(train_labels, config.train_label_path) transform_corpus_data(train_labels, config.train_label_path)
# test data # test data
test_words, test_labels = [], [] test_ids, test_words, test_labels = [], [],[]
for path in config.test_paths: for input_path, truth_path in config.test_paths.items():
word_list, label_list = load_corpus_data(path) id_list, word_list, label_list = parse_txt_file(input_path, truth_path)
test_ids.extend(id_list)
test_words.extend(word_list) test_words.extend(word_list)
test_labels.extend(label_list) test_labels.extend(label_list)
transform_corpus_data(test_ids, config.test_id_path)
transform_corpus_data(test_words, config.test_word_path) transform_corpus_data(test_words, config.test_word_path)
transform_corpus_data(test_labels, config.test_label_path) transform_corpus_data(test_labels, config.test_label_path)
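The new `parse_txt_file` pairs each Test_Input line of the form `(sid=...)\ttext` with the error spans listed for that sid in the Truth file. The sid/text extraction it performs can be sketched on one line (the sid is a real corpus id, but the pairing of sid and sentence here is made up for illustration):

```python
# Sketch: pulling the sentence id and text out of a CGED test-input line,
# exactly as parse_txt_file does.
line = "(sid=200405109523200554_2_1x1)\t他们知不道吸烟对未成年年的影响会造成的各种害处。\n"
parts = line.strip().split('\t')
text_id = parts[0].replace('(sid=', '').replace(')', '')
text = parts[1]
print(text_id)  # 200405109523200554_2_1x1
```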

View File

@@ -2,30 +2,30 @@
# Author: XuMing <xuming624@qq.com> # Author: XuMing <xuming624@qq.com>
# Brief: # Brief:
import os import os
output_dir = './output' output_dir = './output'
# CGED chinese corpus # CGED chinese corpus
train_paths = ['../data/cn/CGED/CGED18_HSK_TrainingSet.xml', train_paths = ['../data/cn/CGED/CGED18_HSK_TrainingSet.xml',
# '../data/cn/CGED/CGED17_HSK_TrainingSet.xml', '../data/cn/CGED/CGED17_HSK_TrainingSet.xml',
# '../data/cn/CGED/CGED16_HSK_TrainingSet.xml' '../data/cn/CGED/CGED16_HSK_TrainingSet.xml']
]
train_word_path = output_dir + '/train_words.txt' train_word_path = output_dir + '/train_words.txt'
train_label_path = output_dir + '/train_labels.txt' train_label_path = output_dir + '/train_labels.txt'
test_paths = ['../data/cn/CGED/CGED18_HSK_TestingSet.xml', test_paths = {'../data/cn/CGED/CGED16_HSK_Test_Input.txt': '../data/cn/CGED/CGED16_HSK_Test_Truth.txt',
# '../data/cn/CGED/CGED17_HSK_TestingSet.xml', '../data/cn/CGED/CGED17_HSK_Test_Input.txt': '../data/cn/CGED/CGED17_HSK_Test_Truth.txt'}
# '../data/cn/CGED/CGED16_HSK_TestingSet.xml'
]
test_word_path = output_dir + '/test_words.txt' test_word_path = output_dir + '/test_words.txt'
test_label_path = output_dir + '/test_labels.txt' test_label_path = output_dir + '/test_labels.txt'
test_id_path = output_dir + '/test_ids.txt'
# vocab # vocab
word_dict_path = output_dir + '/word_dict.txt' word_dict_path = output_dir + '/word_dict.txt'
label_dict_path = output_dir + '/label_dict.txt' label_dict_path = output_dir + '/label_dict.txt'
# config # config
batch_size = 64 batch_size = 64
epoch = 1 epoch = 10
embedding_dim = 100 embedding_dim = 100
rnn_hidden_dim = 200 rnn_hidden_dim = 200
cutoff_frequency = 0 cutoff_frequency = 5
save_model_path = output_dir + '/rnn_crf_model.h5' # Path of the model saved, default is output_path/model save_model_path = output_dir + '/rnn_crf_model.h5' # Path of the model saved, default is output_path/model
# infer # infer
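With `test_paths` changed from a list of XML paths to a dict, the preprocessing loop iterates input/truth pairs via `.items()`. A minimal sketch of the new shape (paths copied from the config above):

```python
# Sketch: test_paths now maps each Test_Input file to its Test_Truth file,
# so preprocessing can parse them together as pairs.
test_paths = {
    '../data/cn/CGED/CGED16_HSK_Test_Input.txt': '../data/cn/CGED/CGED16_HSK_Test_Truth.txt',
    '../data/cn/CGED/CGED17_HSK_Test_Input.txt': '../data/cn/CGED/CGED17_HSK_Test_Truth.txt',
}
for input_path, truth_path in test_paths.items():
    # each input file is parsed together with its matching truth file
    print(input_path.split('/')[-1], '->', truth_path.split('/')[-1])
```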

View File

@@ -8,7 +8,6 @@ import rnn_crf_config as config
from data_reader import build_dict from data_reader import build_dict
from data_reader import get_max_len from data_reader import get_max_len
from data_reader import load_dict from data_reader import load_dict
from data_reader import load_reverse_dict
from data_reader import pad_sequence from data_reader import pad_sequence
from data_reader import vectorize_data from data_reader import vectorize_data
from rnn_crf_model import create_model from rnn_crf_model import create_model