update corpus and add infer.
parent 77a383e645
commit a1829942e3
|
@ -11151,25 +11151,6 @@
|
||||||
<ERROR start_off="37" end_off="37" type="S"></ERROR>
|
<ERROR start_off="37" end_off="37" type="S"></ERROR>
|
||||||
</DOC>
|
</DOC>
|
||||||
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="11163200405109523100546_2_1x2">
|
|
||||||
还有,如果有人在公共场所抽烟,那么周围的人很容易影响到抽烟的坏处,尤其是对生长的青少年来说这样的问题很甚,不仅害他们的身体,还害他们的精神。因为他们是还没成了完全的好精神,他们很容易模仿抽烟的人。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
还有,如果有人在公共场所抽烟,那么周围的人很容易受抽烟的坏处影响,尤其是对生长的青少年来说这样的问题很严重,不仅损害他们的身体,还伤害他们的精神。因为他们是还没形成完整的好的思想,他们很容易模仿抽烟的人。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="25" end_off="25" type="M"></ERROR>
|
|
||||||
<ERROR start_off="27" end_off="27" type="R"></ERROR>
|
|
||||||
<ERROR start_off="25" end_off="32" type="W"></ERROR>
|
|
||||||
<ERROR start_off="30" end_off="32" type="R"></ERROR>
|
|
||||||
<ERROR start_off="52" end_off="52" type="S"></ERROR>
|
|
||||||
<ERROR start_off="56" end_off="56" type="S"></ERROR>
|
|
||||||
<ERROR start_off="64" end_off="64" type="S"></ERROR>
|
|
||||||
<ERROR start_off="78" end_off="79" type="S"></ERROR>
|
|
||||||
<ERROR start_off="80" end_off="81" type="S"></ERROR>
|
|
||||||
<ERROR start_off="84" end_off="86" type="S"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
|
|
||||||
<DOC>
|
<DOC>
|
||||||
<TEXT id="11182200112576525100017_2_11x3">
|
<TEXT id="11182200112576525100017_2_11x3">
|
||||||
”我现在也忘不了那时感到的幸福感。这样以来,如果碰到困难时,我觉得以后一定打破这个困难,我相信我一定可以办好。本来我对什么事情也觉得乐观,碰到什么困难也总有一天一定可以克服。
|
”我现在也忘不了那时感到的幸福感。这样以来,如果碰到困难时,我觉得以后一定打破这个困难,我相信我一定可以办好。本来我对什么事情也觉得乐观,碰到什么困难也总有一天一定可以克服。
|
||||||
|
@ -70200,23 +70181,6 @@
|
||||||
<ERROR start_off="13" end_off="13" type="S"></ERROR>
|
<ERROR start_off="13" end_off="13" type="S"></ERROR>
|
||||||
</DOC>
|
</DOC>
|
||||||
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="73191200210529529251002_2_2x3">
|
|
||||||
我家里的问题幸好没有过得解不可治疗。一个方法可以解决代沟是很容易,只唯有“爱”什么都会有办法了。我妈妈担心姐姐的安全,会有这样说。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
我家里的问题幸好没有解决不了的。一个可以很容易解决代沟的方法,只要有“爱”什么都会有办法了。我妈妈担心姐姐的安全,才会这样说。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="11" end_off="12" type="R"></ERROR>
|
|
||||||
<ERROR start_off="15" end_off="18" type="S"></ERROR>
|
|
||||||
<ERROR start_off="30" end_off="30" type="R"></ERROR>
|
|
||||||
<ERROR start_off="22" end_off="33" type="W"></ERROR>
|
|
||||||
<ERROR start_off="34" end_off="34" type="M"></ERROR>
|
|
||||||
<ERROR start_off="36" end_off="36" type="S"></ERROR>
|
|
||||||
<ERROR start_off="61" end_off="61" type="M"></ERROR>
|
|
||||||
<ERROR start_off="62" end_off="62" type="R"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
|
|
||||||
<DOC>
|
<DOC>
|
||||||
<TEXT id="73208200405204525200337_2_5x1">
|
<TEXT id="73208200405204525200337_2_5x1">
|
||||||
因为对他们来说,虽然吸烟对身体不好,但实际上是他们的爱好、娱乐和放松的方法。
|
因为对他们来说,虽然吸烟对身体不好,但实际上是他们的爱好、娱乐和放松的方法。
|
||||||
|
|
|
@ -1,3 +1,4 @@
|
||||||
|
<ROOT>
|
||||||
<DOC>
|
<DOC>
|
||||||
<TEXT id="200305533533250052_2_2x1">
|
<TEXT id="200305533533250052_2_2x1">
|
||||||
而他们所讨论的就是佛教和西方社会对于“破坏性的情感”和近几年在美国非常流行的“情绪管理”有何而为而有着不同的说法和见解。
|
而他们所讨论的就是佛教和西方社会对于“破坏性的情感”和近几年在美国非常流行的“情绪管理”有何而为而有着不同的说法和见解。
|
||||||
|
@ -25672,7 +25673,7 @@ A和尚慢慢地了解B和尚的心情因此做出了决定。
|
||||||
<ERROR start_off="4" end_off="5" type="W"></ERROR>
|
<ERROR start_off="4" end_off="5" type="W"></ERROR>
|
||||||
</DOC>
|
</DOC>
|
||||||
<DOC>
|
<DOC>
|
||||||
XT id="200105109525200822_2_4x3">
|
<TEXT id="200105109525200822_2_4x3">
|
||||||
如果充分的钱,我们似乎可以得到什么东西、什么服务。但是,这是真好的事情吗?如果我们失去实干苦干的独立精神,我们会变成笨头笨脑的人。
|
如果充分的钱,我们似乎可以得到什么东西、什么服务。但是,这是真好的事情吗?如果我们失去实干苦干的独立精神,我们会变成笨头笨脑的人。
|
||||||
</TEXT>
|
</TEXT>
|
||||||
<CORRECTION>
|
<CORRECTION>
|
||||||
|
@ -31742,7 +31743,6 @@ XT id="200105109525200822_2_4x3">
|
||||||
第一,我认为生命是上帝和父母给我们的,谁也不准随便放弃,了断生命。第二,选择“安乐死”就是逃避生活,放弃人生的意思。有这样的想法所以我不同意“安乐死”。
|
第一,我认为生命是上帝和父母给我们的,谁也不准随便放弃,了断生命。第二,选择“安乐死”就是逃避生活,放弃人生的意思。有这样的想法所以我不同意“安乐死”。
|
||||||
</CORRECTION>
|
</CORRECTION>
|
||||||
<ERROR start_off="10" end_off="11" type="S"></ERROR>
|
<ERROR start_off="10" end_off="11" type="S"></ERROR>
|
||||||
<ERROR start_off="29" end_off="
|
|
||||||
<ERROR start_off="46" end_off="46" type="R"></ERROR>
|
<ERROR start_off="46" end_off="46" type="R"></ERROR>
|
||||||
<ERROR start_off="47" end_off="50" type="W"></ERROR>
|
<ERROR start_off="47" end_off="50" type="W"></ERROR>
|
||||||
<ERROR start_off="66" end_off="66" type="R"></ERROR>
|
<ERROR start_off="66" end_off="66" type="R"></ERROR>
|
||||||
|
@ -39739,6 +39739,7 @@ XT id="200105109525200822_2_4x3">
|
||||||
</TEXT>
|
</TEXT>
|
||||||
<CORRECTION>
|
<CORRECTION>
|
||||||
因为现在的中国发展越来越快,所以将来需要能说汉语的人。现在也是很多日本的公司依靠中国的公司。
|
因为现在的中国发展越来越快,所以将来需要能说汉语的人。现在也是很多日本的公司依靠中国的公司。
|
||||||
|
</CORRECTION>
|
||||||
<ERROR start_off="11" end_off="11" type="M"></ERROR>
|
<ERROR start_off="11" end_off="11" type="M"></ERROR>
|
||||||
<ERROR start_off="8" end_off="12" type="W"></ERROR>
|
<ERROR start_off="8" end_off="12" type="W"></ERROR>
|
||||||
<ERROR start_off="29" end_off="37" type="W"></ERROR>
|
<ERROR start_off="29" end_off="37" type="W"></ERROR>
|
||||||
|
@ -42457,7 +42458,7 @@ XT id="200105109525200822_2_4x3">
|
||||||
<ERROR start_off="98" end_off="99" type="W"></ERROR>
|
<ERROR start_off="98" end_off="99" type="W"></ERROR>
|
||||||
</DOC>
|
</DOC>
|
||||||
<DOC>
|
<DOC>
|
||||||
XT id="200307217523100105_2_2x1">
|
<TEXT id="200307217523100105_2_2x1">
|
||||||
但发明了“绿色食品”以后,产生了一个很大的问题,那就是随着“绿色食品”的增加却减小了农作物总产量,而现在世界上还有几亿人因缺少粮食而挨饿。
|
但发明了“绿色食品”以后,产生了一个很大的问题,那就是随着“绿色食品”的增加却减小了农作物总产量,而现在世界上还有几亿人因缺少粮食而挨饿。
|
||||||
</TEXT>
|
</TEXT>
|
||||||
<CORRECTION>
|
<CORRECTION>
|
||||||
|
@ -110061,3 +110062,4 @@ Title一封写给父母的信我来到汉城已经很久了。
|
||||||
<ERROR start_off="91" end_off="91" type="R"></ERROR>
|
<ERROR start_off="91" end_off="91" type="R"></ERROR>
|
||||||
<ERROR start_off="119" end_off="119" type="M"></ERROR>
|
<ERROR start_off="119" end_off="119" type="M"></ERROR>
|
||||||
</DOC>
|
</DOC>
|
||||||
|
</ROOT>
|
|
@ -1,116 +0,0 @@
|
||||||
<ROOT>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200405109523200554_2_1x1">
|
|
||||||
他们知不道吸烟对未成年年的影响会造成的各种害处。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
他们不知道吸烟对未成年人会造成的各种伤害。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="3" end_off="4" type="W"></ERROR>
|
|
||||||
<ERROR start_off="12" end_off="12" type="S"></ERROR>
|
|
||||||
<ERROR start_off="13" end_off="15" type="R"></ERROR>
|
|
||||||
<ERROR start_off="22" end_off="23" type="S"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200505109634201470_2_4x2">
|
|
||||||
从此,父母亲就会教咱们爬行、走路、叫爸爸妈妈。到我们长大了,我们开始从妈妈爸爸身上模仿行为,譬如学习爸妈走路时的高雅步姿,坐姿、礼貌、习惯……渐渐地你会发觉一直好象是自己好奇,觉得有趣才会照着做,模仿着双亲,但不知不觉间他们影响到孩子们的不再是表面的行为,思想,心态、待人接物上我们都领受了不少,这的确会影响我们的成长。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
从此,父母亲就会教我们爬行、走路、叫爸爸妈妈。到我们长大了,我们开始从妈妈爸爸身上模仿行为,譬如学习爸妈走路时的高雅步姿,坐姿、礼貌、习惯……渐渐地你会发觉好象是自己一直好奇,觉得有趣才会照着做,模仿着双亲,但不知不觉间他们影响到孩子们的不再是表面的行为,思想,心态、待人接物上我们都领受了不少,这的确会影响我们的成长。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="10" end_off="11" type="S"></ERROR>
|
|
||||||
<ERROR start_off="79" end_off="85" type="W"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200510111523200014_2_1x1">
|
|
||||||
有些不喜欢流行歌曲的人也说流行歌曲能引起不好的作用。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
有些不喜欢流行歌曲的人也说流行歌曲能引起不好的后果。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="24" end_off="25" type="S"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200505204525200257_2_2x2">
|
|
||||||
如果它呈出不太香的颜色,那就意味着它颜色的来源——你,就是教给它呈出那样的味道的。如果你将一块“白布”成功地染上的话,会出什么样的颜色呢?
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
如果它呈出不太美的颜色,那就意味着它颜色的来源——你,就是教给它呈出那样的颜色的人没教好。如果你将一块“白布”成功地染上颜色的话,会出什么样的颜色呢?
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="8" end_off="8" type="S"></ERROR>
|
|
||||||
<ERROR start_off="34" end_off="35" type="S"></ERROR>
|
|
||||||
<ERROR start_off="41" end_off="41" type="M"></ERROR>
|
|
||||||
<ERROR start_off="57" end_off="57" type="M"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200405109525200464_2_5x1">
|
|
||||||
这都是他们自己引起的,埋怨什么呢?
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
这都是他们自己造成的,埋怨什么呢?
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="8" end_off="9" type="S"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200310576525200063_2_6x1">
|
|
||||||
他长大以后突然产生要跟女孩交玩儿的念头。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
他长大以后突然产生要跟女孩儿玩耍的念头。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="14" end_off="16" type="S"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200307271523200070_2_2x1">
|
|
||||||
可是我觉得饥饿是用科学机技来能解决问题,所以我认为吃“绿色食品”是还是重要的问题。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
可是我觉得饥饿是用科学技术能解决的问题,所以我认为吃“绿色食品”是更重要的问题。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="12" end_off="13" type="S"></ERROR>
|
|
||||||
<ERROR start_off="14" end_off="14" type="R"></ERROR>
|
|
||||||
<ERROR start_off="18" end_off="18" type="M"></ERROR>
|
|
||||||
<ERROR start_off="34" end_off="35" type="S"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200405109523100546_2_1x1">
|
|
||||||
在韩国最近很流行不允许的电视节目,这节目说公共场所抽烟是不道德的行为。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
在韩国最近不允许抽烟的电视节目很流行,这些节目说在公共场所抽烟是不道德的行为。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="6" end_off="16" type="W"></ERROR>
|
|
||||||
<ERROR start_off="12" end_off="12" type="M"></ERROR>
|
|
||||||
<ERROR start_off="19" end_off="19" type="M"></ERROR>
|
|
||||||
<ERROR start_off="22" end_off="22" type="M"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200510302523100195_2_9x2">
|
|
||||||
如果他喜欢听什么,就能听什么。因为这种现象,因韩流而得到的经济利益也很多。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
他喜欢听什么,就能听什么。因为这种现象,韩国因韩流而得到的经济利益也很多。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="1" end_off="2" type="R"></ERROR>
|
|
||||||
<ERROR start_off="23" end_off="23" type="M"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200505109922201218_2_2x1">
|
|
||||||
从环境里小孩子能快速地学或模仿他所见到的事物。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
小孩子能从环境里快速地学或模仿他所见到的事物。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="1" end_off="8" type="W"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
<DOC>
|
|
||||||
<TEXT id="200505522522250013_2_7x1">
|
|
||||||
认识到结婚过程不满六个月,也可以说我的故事中我是主动的。
|
|
||||||
</TEXT>
|
|
||||||
<CORRECTION>
|
|
||||||
认识到结婚的过程不满六个月,也可以说在我的故事中我是主动的。
|
|
||||||
</CORRECTION>
|
|
||||||
<ERROR start_off="6" end_off="6" type="M"></ERROR>
|
|
||||||
<ERROR start_off="18" end_off="18" type="M"></ERROR>
|
|
||||||
</DOC>
|
|
||||||
</ROOT>
|
|
|
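The corpus hunks above all share one layout: each `DOC` holds a `TEXT` with an `id`, a `CORRECTION`, and zero or more `ERROR` elements whose `start_off`/`end_off` are 1-based, inclusive character offsets into the `TEXT`. The following is a minimal sketch of turning one such record into per-character labels. The function name `char_labels` and the default label `"O"` are assumptions made for illustration; the preprocessing code later in this diff fills non-error positions with POS tags instead.

```python
from xml.dom import minidom

# One record copied from the corpus shown in this diff.
SAMPLE = """<ROOT>
<DOC>
<TEXT id="200405109525200464_2_5x1">
这都是他们自己引起的,埋怨什么呢?
</TEXT>
<CORRECTION>
这都是他们自己造成的,埋怨什么呢?
</CORRECTION>
<ERROR start_off="8" end_off="9" type="S"></ERROR>
</DOC>
</ROOT>"""


def char_labels(xml_text, default="O"):
    """Return (text_id, text, labels) for every DOC in xml_text.

    `default` is a placeholder label chosen for this sketch only.
    """
    dom = minidom.parseString(xml_text)
    result = []
    for doc in dom.documentElement.getElementsByTagName("DOC"):
        text_node = doc.getElementsByTagName("TEXT")[0]
        text = text_node.childNodes[0].data.strip()
        labels = [default] * len(text)
        for err in doc.getElementsByTagName("ERROR"):
            start = int(err.getAttribute("start_off"))
            end = int(err.getAttribute("end_off"))
            # Offsets are 1-based and inclusive; clamp to the text length.
            for i in range(start - 1, min(end, len(text))):
                labels[i] = err.getAttribute("type")
        result.append((text_node.getAttribute("id"), text, labels))
    return result
```

In the sample, offsets 8 to 9 cover the two characters that the `CORRECTION` replaces, so only those two positions receive the label `S`.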
@ -69,15 +69,5 @@ def get_max_len(word_ids):
|
||||||
return max(len(line) for line in word_ids)
|
return max(len(line) for line in word_ids)
|
||||||
|
|
||||||
|
|
||||||
def test_reader(path):
|
def load_test_id(dict_path):
|
||||||
print('Loading test data from %s' % path)
|
return [''.join(line.strip().split()) for idx, line in enumerate(open(dict_path, 'r', encoding='utf-8').readlines())]
|
||||||
sids = []
|
|
||||||
contents = []
|
|
||||||
with open(path, 'r', encoding='utf-8') as f:
|
|
||||||
for line in f:
|
|
||||||
line = line.strip().split('\t')
|
|
||||||
sids = line[0].replace('(sid=', '').replace(')', '')
|
|
||||||
text = [w for w in line[1]]
|
|
||||||
sids.append(sids)
|
|
||||||
contents.append(text)
|
|
||||||
return sids, contents
|
|
|
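The new `load_test_id` reads one ID per line and removes all whitespace inside the line, but it opens the file without closing it. A possible variant that uses a context manager is sketched below. The names `load_ids` and `demo`, and the sample IDs, are hypothetical and are not part of the commit.

```python
import os
import tempfile


def load_ids(path):
    # Hypothetical counterpart of load_test_id: one ID per line,
    # with every whitespace character inside the line removed.
    with open(path, "r", encoding="utf-8") as f:
        return ["".join(line.split()) for line in f]


def demo():
    # Write two made-up IDs, one of them space-separated, and read them back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_ids.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("2 0 0 4 0 5 _ 2 _ 1 x 1\n200505_2_2x1\n")
        return load_ids(path)
```

Iterating over the file object directly also avoids building the intermediate list that `readlines()` creates.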
@ -6,15 +6,17 @@ import rnn_crf_config as config
|
||||||
from data_reader import get_max_len
|
from data_reader import get_max_len
|
||||||
from data_reader import load_dict
|
from data_reader import load_dict
|
||||||
from data_reader import load_reverse_dict
|
from data_reader import load_reverse_dict
|
||||||
|
from data_reader import load_test_id
|
||||||
from data_reader import pad_sequence
|
from data_reader import pad_sequence
|
||||||
from data_reader import vectorize_data
|
from data_reader import vectorize_data
|
||||||
from rnn_crf_model import load_model
|
from rnn_crf_model import load_model
|
||||||
|
|
||||||
|
|
||||||
def infer(save_model_path, test_word_path, test_label_path,
|
def infer(save_model_path, test_id_path, test_word_path, test_label_path,
|
||||||
word_dict_path=None, label_dict_path=None, save_pred_path=None,
|
word_dict_path=None, label_dict_path=None, save_pred_path=None,
|
||||||
batch_size=64, embedding_dim=100, rnn_hidden_dim=200):
|
batch_size=64, embedding_dim=100, rnn_hidden_dim=200):
|
||||||
# load dict
|
# load dict
|
||||||
|
test_ids = load_test_id(test_id_path)
|
||||||
word_ids_dict, ids_word_dict = load_dict(word_dict_path), load_reverse_dict(word_dict_path)
|
word_ids_dict, ids_word_dict = load_dict(word_dict_path), load_reverse_dict(word_dict_path)
|
||||||
label_ids_dict, ids_label_dict = load_dict(label_dict_path), load_reverse_dict(label_dict_path)
|
label_ids_dict, ids_label_dict = load_dict(label_dict_path), load_reverse_dict(label_dict_path)
|
||||||
# read data to index
|
# read data to index
|
||||||
|
@ -30,14 +32,14 @@ def infer(save_model_path, test_word_path, test_label_path,
|
||||||
probs = model.predict(word_seq, batch_size=batch_size).argmax(-1)
|
probs = model.predict(word_seq, batch_size=batch_size).argmax(-1)
|
||||||
assert len(probs) == len(label_seq)
|
assert len(probs) == len(label_seq)
|
||||||
print('probs.shape:', probs.shape)
|
print('probs.shape:', probs.shape)
|
||||||
save_preds(probs, ids_word_dict, label_ids_dict, ids_label_dict, word_seq, save_pred_path)
|
save_preds(probs, test_ids, ids_word_dict, label_ids_dict, ids_label_dict, word_seq, save_pred_path)
|
||||||
|
|
||||||
|
|
||||||
def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out_path):
|
def save_preds(preds, test_ids , ids_word_dict, label_ids_dict, ids_label_dict, X_test, out_path):
|
||||||
with open(out_path, 'w', encoding='utf-8') as f:
|
with open(out_path, 'w', encoding='utf-8') as f:
|
||||||
for i in range(len(X_test)):
|
for i in range(len(X_test)):
|
||||||
sent = X_test[i]
|
sent = X_test[i]
|
||||||
# sent = sid_test[i]
|
sid = test_ids[i]
|
||||||
sentence = ''.join([ids_word_dict[i] for i in sent if i > 0])
|
sentence = ''.join([ids_word_dict[i] for i in sent if i > 0])
|
||||||
label = []
|
label = []
|
||||||
for j in range(len(sent)):
|
for j in range(len(sent)):
|
||||||
|
@ -45,7 +47,6 @@ def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out
|
||||||
label.append(preds[i][j])
|
label.append(preds[i][j])
|
||||||
error_flag = False
|
error_flag = False
|
||||||
is_correct = False
|
is_correct = False
|
||||||
|
|
||||||
current_error = 0
|
current_error = 0
|
||||||
start_pos = 0
|
start_pos = 0
|
||||||
for k in range(len(label)):
|
for k in range(len(label)):
|
||||||
|
@ -61,7 +62,7 @@ def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out
|
||||||
label[k] != label_ids_dict['R'] and label[k] != label_ids_dict['S'] and \
|
label[k] != label_ids_dict['R'] and label[k] != label_ids_dict['S'] and \
|
||||||
label[k] != label_ids_dict['M'] and label[k] != label_ids_dict['W']):
|
label[k] != label_ids_dict['M'] and label[k] != label_ids_dict['W']):
|
||||||
end_pos = k
|
end_pos = k
|
||||||
f.write('%s, %d, %d, %s\n' % (sentence, start_pos, end_pos, ids_label_dict[current_error]))
|
f.write('%s, %d, %d, %s\n' % (sid, start_pos, end_pos, ids_label_dict[current_error]))
|
||||||
|
|
||||||
error_flag = False
|
error_flag = False
|
||||||
current_error = 0
|
current_error = 0
|
||||||
|
@ -70,17 +71,18 @@ def save_preds(preds, ids_word_dict, label_ids_dict, ids_label_dict, X_test, out
|
||||||
label[k] == label_ids_dict['R'] or label[k] == label_ids_dict['S'] or \
|
label[k] == label_ids_dict['R'] or label[k] == label_ids_dict['S'] or \
|
||||||
label[k] == label_ids_dict['M'] or label[k] == label_ids_dict['W']):
|
label[k] == label_ids_dict['M'] or label[k] == label_ids_dict['W']):
|
||||||
end_pos = k
|
end_pos = k
|
||||||
f.write('%s, %d, %d, %s\n' % (sentence, start_pos, end_pos, ids_label_dict[current_error]))
|
f.write('%s, %d, %d, %s\n' % (sid, start_pos, end_pos, ids_label_dict[current_error]))
|
||||||
|
|
||||||
start_pos = k + 1
|
start_pos = k + 1
|
||||||
current_error = label[k]
|
current_error = label[k]
|
||||||
if not is_correct:
|
if not is_correct:
|
||||||
f.write('%s, correct\n' % (sentence))
|
f.write('%s, correct\n' % (sid))
|
||||||
print('done, size: %d' % len(X_test))
|
print('done, infer data size: %d' % len(X_test))
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
infer(config.save_model_path,
|
infer(config.save_model_path,
|
||||||
|
config.test_id_path,
|
||||||
config.test_word_path,
|
config.test_word_path,
|
||||||
config.test_label_path,
|
config.test_label_path,
|
||||||
word_dict_path=config.word_dict_path,
|
word_dict_path=config.word_dict_path,
|
||||||
|
|
|
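After this change, `save_preds` writes one row per predicted error span in the form `sid, start, end, type`, or `sid, correct` when no error label is predicted. The idea of collapsing a label sequence into spans can be sketched as follows. This is an illustration under assumptions, not the commit's implementation: it works on string labels rather than the integer label IDs used in `save_preds`, it reports 1-based inclusive offsets to match the corpus convention, and the names `error_spans` and `format_rows` are invented here.

```python
from itertools import groupby

# The four error types that appear in the corpus and in save_preds.
ERROR_TYPES = {"R", "S", "M", "W"}


def error_spans(labels):
    """Collapse a per-character label list into (start, end, type) spans.

    Offsets are 1-based and inclusive (an assumption of this sketch).
    """
    spans = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label in ERROR_TYPES:
            spans.append((pos + 1, pos + length, label))
        pos += length
    return spans


def format_rows(sid, labels):
    """Format spans as 'sid, start, end, type' rows, or 'sid, correct'."""
    spans = error_spans(labels)
    if not spans:
        return ["%s, correct" % sid]
    return ["%s, %d, %d, %s" % (sid, s, e, t) for s, e, t in spans]
```

Grouping consecutive equal labels keeps adjacent spans of different types separate, which is the same distinction the `current_error` bookkeeping in `save_preds` makes.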
@ -7,17 +7,19 @@ import rnn_crf_config as config
|
||||||
from utils.text_utils import segment
|
from utils.text_utils import segment
|
||||||
|
|
||||||
|
|
||||||
def load_corpus_data(data_path):
|
def parse_xml_file(path):
|
||||||
word_lst, label_lst = [], []
|
print('Parse data from %s' % path)
|
||||||
with open(data_path, 'r', encoding='utf-8') as f:
|
id_lst, word_lst, label_lst = [], [], []
|
||||||
dom_tree = minidom.parse(f)
|
with open(path, 'r', encoding='utf-8') as f:
|
||||||
|
dom_tree = minidom.parse(path)
|
||||||
docs = dom_tree.documentElement.getElementsByTagName('DOC')
|
docs = dom_tree.documentElement.getElementsByTagName('DOC')
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
# Input the text
|
# Input the text
|
||||||
sentence = doc.getElementsByTagName('TEXT')[0]. \
|
text = doc.getElementsByTagName('TEXT')[0]. \
|
||||||
childNodes[0].data.strip()
|
childNodes[0].data.strip()
|
||||||
|
text_id = doc.getElementsByTagName('TEXT')[0].getAttribute('id')
|
||||||
errors = doc.getElementsByTagName('ERROR')
|
errors = doc.getElementsByTagName('ERROR')
|
||||||
# Find the error position and error type
|
# Locate the error position and error type
|
||||||
locate_dict = {}
|
locate_dict = {}
|
||||||
for error in errors:
|
for error in errors:
|
||||||
start_off = error.getAttribute('start_off')
|
start_off = error.getAttribute('start_off')
|
||||||
|
@ -26,7 +28,7 @@ def load_corpus_data(data_path):
|
||||||
for i in range(int(start_off) - 1, int(end_off)):
|
for i in range(int(start_off) - 1, int(end_off)):
|
||||||
locate_dict[i] = error_type
|
locate_dict[i] = error_type
|
||||||
# Segment with pos
|
# Segment with pos
|
||||||
word_seq, pos_seq = segment(sentence, cut_type='char', pos=True)
|
word_seq, pos_seq = segment(text, cut_type='char', pos=True)
|
||||||
word_arr, label_arr = [], []
|
word_arr, label_arr = [], []
|
||||||
for i in range(len(word_seq)):
|
for i in range(len(word_seq)):
|
||||||
if i in locate_dict:
|
if i in locate_dict:
|
||||||
|
@ -37,10 +39,61 @@ def load_corpus_data(data_path):
|
||||||
word_arr.append(word_seq[i])
|
word_arr.append(word_seq[i])
|
||||||
# Fill with pos tag
|
# Fill with pos tag
|
||||||
label_arr.append(pos_seq[i])
|
label_arr.append(pos_seq[i])
|
||||||
|
id_lst.append(text_id)
|
||||||
word_lst.append(word_arr)
|
word_lst.append(word_arr)
|
||||||
label_lst.append(label_arr)
|
label_lst.append(label_arr)
|
||||||
|
return id_lst, word_lst, label_lst
|
||||||
|
|
||||||
return word_lst, label_lst
|
|
||||||
|
def parse_txt_file(input_path, truth_path):
|
||||||
|
print('Parse data from %s and %s' % (input_path, truth_path))
|
||||||
|
id_lst, word_lst, label_lst = [], [], []
|
||||||
|
# read truth file
|
||||||
|
truth_dict = {}
|
||||||
|
with open(truth_path, 'r', encoding='utf-8') as truth_f:
|
||||||
|
for line in truth_f:
|
||||||
|
parts = line.strip().split(',')
|
||||||
|
# Locate the error position
|
||||||
|
locate_dict = {}
|
||||||
|
if len(parts) == 4:
|
||||||
|
text_id = parts[0]
|
||||||
|
start_off = parts[1]
|
||||||
|
end_off = parts[2]
|
||||||
|
error_type = parts[3]
|
||||||
|
for i in range(int(start_off) - 1, int(end_off)):
|
||||||
|
locate_dict[i] = error_type
|
||||||
|
if text_id in truth_dict:
|
||||||
|
truth_dict[text_id].append(locate_dict)
|
||||||
|
else:
|
||||||
|
truth_dict[text_id] = [locate_dict]
|
||||||
|
|
||||||
|
# read input file and get tokenize
|
||||||
|
with open(input_path, 'r', encoding='utf-8') as input_f:
|
||||||
|
for line in input_f:
|
||||||
|
parts = line.strip().split('\t')
|
||||||
|
text_id = parts[0].replace('(sid=', '').replace(')', '')
|
||||||
|
text = parts[1]
|
||||||
|
# Segment with pos
|
||||||
|
word_seq, pos_seq = segment(text, cut_type='char', pos=True)
|
||||||
|
word_arr, label_arr = [], []
|
||||||
|
if text_id in truth_dict:
|
||||||
|
for locate_dict in truth_dict[text_id]:
|
||||||
|
for i in range(len(word_seq)):
|
||||||
|
if i in locate_dict:
|
||||||
|
word_arr.append(word_seq[i])
|
||||||
|
# Fill with error type
|
||||||
|
label_arr.append(locate_dict[i])
|
||||||
|
else:
|
||||||
|
word_arr.append(word_seq[i])
|
||||||
|
# Fill with pos tag
|
||||||
|
label_arr.append(pos_seq[i])
|
||||||
|
else:
|
||||||
|
word_arr = word_seq
|
||||||
|
label_arr = pos_seq
|
||||||
|
id_lst.append(text_id)
|
||||||
|
word_lst.append(word_arr)
|
||||||
|
label_lst.append(label_arr)
|
||||||
|
return id_lst, word_lst, label_lst
|
||||||
|
|
||||||
|
|
||||||
def transform_corpus_data(data_list, data_path):
|
def transform_corpus_data(data_list, data_path):
|
||||||
|
@ -56,17 +109,19 @@ if __name__ == '__main__':
|
||||||
# train data
|
# train data
|
||||||
train_words, train_labels = [], []
|
train_words, train_labels = [], []
|
||||||
for path in config.train_paths:
|
for path in config.train_paths:
|
||||||
word_list, label_list = load_corpus_data(path)
|
_, word_list, label_list = parse_xml_file(path)
|
||||||
train_words.extend(word_list)
|
train_words.extend(word_list)
|
||||||
train_labels.extend(label_list)
|
train_labels.extend(label_list)
|
||||||
transform_corpus_data(train_words, config.train_word_path)
|
transform_corpus_data(train_words, config.train_word_path)
|
||||||
transform_corpus_data(train_labels, config.train_label_path)
|
transform_corpus_data(train_labels, config.train_label_path)
|
||||||
|
|
||||||
# test data
|
# test data
|
||||||
test_words, test_labels = [], []
|
test_ids, test_words, test_labels = [], [],[]
|
||||||
for path in config.test_paths:
|
for input_path, truth_path in config.test_paths.items():
|
||||||
word_list, label_list = load_corpus_data(path)
|
id_list, word_list, label_list = parse_txt_file(input_path, truth_path)
|
||||||
|
test_ids.extend(id_list)
|
||||||
test_words.extend(word_list)
|
test_words.extend(word_list)
|
||||||
test_labels.extend(label_list)
|
test_labels.extend(label_list)
|
||||||
|
transform_corpus_data(test_ids, config.test_id_path)
|
||||||
transform_corpus_data(test_words, config.test_word_path)
|
transform_corpus_data(test_words, config.test_word_path)
|
||||||
transform_corpus_data(test_labels, config.test_label_path)
|
transform_corpus_data(test_labels, config.test_label_path)
|
|
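In `parse_txt_file` as shown above, the loop over `truth_dict[text_id]` encloses the per-character loop, so a sentence with more than one truth row appears to have its characters appended once per row. Splitting on `','` alone would also leave a leading space on each field if the truth file separates fields with `', '`, as the rows written by `save_preds` do. One way to avoid both, sketched under those assumptions, is to collect all spans per ID first and then label the sentence in a single pass. The names `read_truth` and `label_chars` are hypothetical.

```python
from collections import defaultdict


def read_truth(lines):
    """Map text_id -> list of (start, end, type) from 'id, start, end, type' rows.

    Rows with any other number of fields (such as 'id, correct') are skipped.
    """
    spans = defaultdict(list)
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 4:
            text_id, start, end, err = parts
            spans[text_id].append((int(start), int(end), err))
    return spans


def label_chars(text, pos_tags, spans):
    """Start from the POS tags and overwrite positions covered by error spans."""
    labels = list(pos_tags)
    for start, end, err in spans:
        # Offsets are treated as 1-based and inclusive, as in the XML corpus.
        for i in range(start - 1, min(end, len(labels))):
            labels[i] = err
    return list(text), labels
```

With this shape, each character is emitted exactly once regardless of how many truth rows refer to the sentence, so the word and label lists stay the same length as the input.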
@ -2,30 +2,30 @@
|
||||||
# Author: XuMing <xuming624@qq.com>
|
# Author: XuMing <xuming624@qq.com>
|
||||||
# Brief:
|
# Brief:
|
||||||
import os
|
import os
|
||||||
|
|
||||||
output_dir = './output'
|
output_dir = './output'
|
||||||
|
|
||||||
# CGED chinese corpus
|
# CGED chinese corpus
|
||||||
train_paths = ['../data/cn/CGED/CGED18_HSK_TrainingSet.xml',
|
train_paths = ['../data/cn/CGED/CGED18_HSK_TrainingSet.xml',
|
||||||
# '../data/cn/CGED/CGED17_HSK_TrainingSet.xml',
|
'../data/cn/CGED/CGED17_HSK_TrainingSet.xml',
|
||||||
# '../data/cn/CGED/CGED16_HSK_TrainingSet.xml'
|
'../data/cn/CGED/CGED16_HSK_TrainingSet.xml']
|
||||||
]
|
|
||||||
train_word_path = output_dir + '/train_words.txt'
|
train_word_path = output_dir + '/train_words.txt'
|
||||||
train_label_path = output_dir + '/train_labels.txt'
|
train_label_path = output_dir + '/train_labels.txt'
|
||||||
test_paths = ['../data/cn/CGED/CGED18_HSK_TestingSet.xml',
|
test_paths = {'../data/cn/CGED/CGED16_HSK_Test_Input.txt': '../data/cn/CGED/CGED16_HSK_Test_Truth.txt',
|
||||||
# '../data/cn/CGED/CGED17_HSK_TestingSet.xml',
|
'../data/cn/CGED/CGED17_HSK_Test_Input.txt': '../data/cn/CGED/CGED17_HSK_Test_Truth.txt'}
|
||||||
# '../data/cn/CGED/CGED16_HSK_TestingSet.xml'
|
|
||||||
]
|
|
||||||
test_word_path = output_dir + '/test_words.txt'
|
test_word_path = output_dir + '/test_words.txt'
|
||||||
test_label_path = output_dir + '/test_labels.txt'
|
test_label_path = output_dir + '/test_labels.txt'
|
||||||
|
test_id_path = output_dir + '/test_ids.txt'
|
||||||
# vocab
|
# vocab
|
||||||
word_dict_path = output_dir + '/word_dict.txt'
|
word_dict_path = output_dir + '/word_dict.txt'
|
||||||
label_dict_path = output_dir + '/label_dict.txt'
|
label_dict_path = output_dir + '/label_dict.txt'
|
||||||
|
|
||||||
# config
|
# config
|
||||||
batch_size = 64
|
batch_size = 64
|
||||||
epoch = 1
|
epoch = 10
|
||||||
embedding_dim = 100
|
embedding_dim = 100
|
||||||
rnn_hidden_dim = 200
|
rnn_hidden_dim = 200
|
||||||
cutoff_frequency = 0
|
cutoff_frequency = 5
|
||||||
save_model_path = output_dir + '/rnn_crf_model.h5' # Path of the model saved, default is output_path/model
|
save_model_path = output_dir + '/rnn_crf_model.h5' # Path of the model saved, default is output_path/model
|
||||||
|
|
||||||
# infer
|
# infer
|
||||||
|
|
|
@ -8,7 +8,6 @@ import rnn_crf_config as config
|
||||||
from data_reader import build_dict
|
from data_reader import build_dict
|
||||||
from data_reader import get_max_len
|
from data_reader import get_max_len
|
||||||
from data_reader import load_dict
|
from data_reader import load_dict
|
||||||
from data_reader import load_reverse_dict
|
|
||||||
from data_reader import pad_sequence
|
from data_reader import pad_sequence
|
||||||
from data_reader import vectorize_data
|
from data_reader import vectorize_data
|
||||||
from rnn_crf_model import create_model
|
from rnn_crf_model import create_model
|
||||||
|
|