Requirements

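  • Use Python with pyahocorasick to match keywords; each keyword is roughly 10-20 Chinese characters long.
  • The keywords used to build the Aho-Corasick automaton are read from a local file, key_word, in the following format: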
Keyword Keyword
母婴专区<辅食<面仔/面条:婴幼儿,幼儿,婴儿,儿童,宝宝 面条,细面,粗面,手工面,蔬菜面,营养面,碎面,挂面,面仔
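
As a rough illustration of the format above, here is a minimal sketch of how one such line could be split into a category path and keyword groups. It assumes that ':' separates the category path from the keywords, '<' separates category levels, a space separates keyword groups, and ',' separates keywords inside a group. The helper name parse_keyword_line is hypothetical and not part of pyahocorasick.

```python
from typing import List, Tuple


def parse_keyword_line(line: str) -> Tuple[List[str], List[List[str]]]:
    """Split one line into (category path, keyword groups).

    Assumed layout (inferred from the sample line, not confirmed):
    level1<level2<level3:kw,kw,kw kw,kw,kw
    """
    path_part, _, keyword_part = line.strip().partition(':')
    category_path = path_part.split('<')
    # Groups are separated by whitespace, keywords inside a group by commas.
    groups = [
        [word for word in group.split(',') if word]
        for group in keyword_part.split()
    ]
    return category_path, groups


sample = ('母婴专区<辅食<面仔/面条:婴幼儿,幼儿,婴儿,儿童,宝宝 '
          '面条,细面,粗面,手工面,蔬菜面,营养面,碎面,挂面,面仔')
category_path, groups = parse_keyword_line(sample)
print(category_path)
for group in groups:
    print(group)
```

Keeping the two keyword groups separate leaves room to require a hit from each group later, instead of treating every keyword as equivalent.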

Reference code:

import re

import ahocorasick

A = ahocorasick.Automaton()
titles = ['Hello Kitty3色蔬菜细面300克 婴儿幼儿营养面条宝宝辅食面条']
word_dict = {}  # category -> list of keywords

with open('categories.csv', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if ':' not in line:
            continue  # skip blank or malformed lines
        word_key, _, raw_words = line.partition(':')
        # Assumption: keywords may be separated by '|', ',' or whitespace.
        # The original split on '|' only, but the sample line uses ',' and spaces.
        words = [w for w in re.split(r'[|,\s]+', raw_words) if w]
        word_dict[word_key] = words
        for word in words:
            A.add_word(word, word)
A.make_automaton()

for title in titles:
    # A.iter yields (end_index, value); keep each matched keyword once,
    # in order of first appearance.
    ret = list(dict.fromkeys(v for _, v in A.iter(title)))
    if not ret:
        continue  # no keyword matched this title
    # There are too many keywords, so only the first matched keyword
    # is used to pick the category.
    category = [key for key, words in word_dict.items() if ret[0] in words]
    # print(ret[0], category)
    if category:
        print(category[0])
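
The category lookup at the end of the reference code scans every category for a single matched keyword. One possible alternative, sketched here under stated assumptions, is to build an inverted index from keyword to categories once and then let every matched keyword vote. The helper names build_keyword_index and best_category are hypothetical, the second category is made-up sample data, and the matched list is written by hand to stand in for the values that A.iter(title) would yield.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional


def build_keyword_index(word_dict: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert category -> keywords into keyword -> categories."""
    index = defaultdict(list)
    for category, words in word_dict.items():
        for word in words:
            if word:
                index[word].append(category)
    return dict(index)


def best_category(matched: Iterable[str],
                  index: Dict[str, List[str]]) -> Optional[str]:
    """Return the category supported by the most distinct matched keywords."""
    votes = Counter()
    # dict.fromkeys removes duplicates while preserving first-seen order.
    for word in dict.fromkeys(matched):
        for category in index.get(word, []):
            votes[category] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


word_dict = {
    '母婴专区<辅食<面仔/面条': ['婴儿', '宝宝', '面条', '细面'],
    '粮油调味<挂面': ['挂面', '面条'],  # hypothetical second category
}
index = build_keyword_index(word_dict)

# Stand-in for the keyword values yielded by A.iter(title).
matched = ['细面', '婴儿', '面条', '宝宝', '面条']
result = best_category(matched, index)
print(result)
```

With the index in place, each lookup is a dictionary access rather than a scan over all categories, and a title that hits several keywords of one category is no longer decided by whichever keyword happened to match first.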