n-gram in Practice

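For hardened (packed) applications, one can build a model from the API call sequences obtained by exhaustively exercising the possible operations in a sandbox. N-gram can then be used to analyze the semantics of those API sequences.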

There seem to be very few write-ups in Chinese that explain how to actually use it, so I am recording the details here.

After n-gram processing, each sentence has effectively been split into a sequence of grams.

Take 2-grams (bigrams) as the example:

'''
class Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset())
'''

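from gensim import corpora
from gensim.models.phrases import Phrases

# Toy corpus: two tokenized sentences (我 = I, 爱 = love, 你 = you, 也 = also)
corpusList = [['我','爱','你','爱','你','我','爱'], ['我','也','爱','你']]
# gensim 3.x takes a bytes delimiter; gensim 4.x expects a str ('~')
bigram = Phrases(corpusList, min_count=1, threshold=0.01, delimiter=b'~')

# Apply the learned bigram model to every sentence
texts = [bigram[line] for line in corpusList]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Show the bag-of-words result
corpus
>>[[(0, 1), (1, 2), (2, 1)], [(2, 1), (3, 1), (4, 1)]]

# Show the token-to-id mapping
dictionary.token2id

>>{'也': 3, '你': 0, '我': 4, '我~爱': 1, '爱~你': 2}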

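Processing steps
① For every sentence (text) in the corpus corpusList (the document collection), slide a window of size 2 over the tokens:

Sentence 1: 我爱 爱你 你爱 爱你 你我 我爱

Sentence 2: 我也 也爱 爱你

Count how often each pair occurs across the whole corpus:
我爱: 2  爱你: 3  你爱: 1  你我: 1  我也: 1  也爱: 1

(The delimiter=b'~' parameter sets how a 2-gram is named: the two words are joined with ~. The connector is omitted below to keep things readable.)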
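The counting in step ① can be reproduced without gensim. The following is only a minimal sketch using the standard library; the helper name `count_bigrams` is made up for illustration and is not part of gensim.

```python
from collections import Counter

corpus_list = [
    ['我', '爱', '你', '爱', '你', '我', '爱'],
    ['我', '也', '爱', '你'],
]

def count_bigrams(sentences):
    """Count adjacent token pairs over the whole corpus (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        # zip(sentence, sentence[1:]) yields every window of size 2
        counts.update(zip(sentence, sentence[1:]))
    return counts

bigram_counts = count_bigrams(corpus_list)
for (a, b), n in bigram_counts.most_common():
    print(a + '~' + b, n)
```

Pairs never cross a sentence boundary, because each sentence is windowed separately.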

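② min_count: Ignore all words and bigrams with total collected count lower than this value.

In practice this is the frequency a 2-gram must strictly exceed: the default score subtracts min_count from the pair count, so a 2-gram whose count equals min_count scores 0 and cannot pass a positive threshold. With min_count=1 here, every 2-gram with a frequency of 1 or less is not used for merging, which leaves
我爱: 2  爱你: 3

threshold: Represent a score threshold for forming the phrases (higher means fewer phrases)

The scoring method is given in Appendix [1]. threshold is set to the very small value 0.01 here so that the threshold effectively plays no role.

So 你~爱, 你~我, 我~也 and 也~爱 are ignored, while 我~爱 and 爱~你 are kept.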
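To make the filtering in step ② concrete, here is a rough sketch that applies the default score formula from the appendix to the toy corpus. It assumes that N counts every distinct unigram and bigram seen in the corpus; that is an assumption about how the vocabulary size is measured, and `default_score` is an illustrative name, not a gensim function.

```python
from collections import Counter

corpus_list = [
    ['我', '爱', '你', '爱', '你', '我', '爱'],
    ['我', '也', '爱', '你'],
]
MIN_COUNT = 1
THRESHOLD = 0.01

unigram_counts = Counter(tok for s in corpus_list for tok in s)
bigram_counts = Counter(pair for s in corpus_list for pair in zip(s, s[1:]))

# Assumption: N is the number of distinct unigrams plus distinct bigrams.
vocab_size = len(unigram_counts) + len(bigram_counts)

def default_score(a, b):
    """(count(a followed by b) - min_count) * N / (count(a) * count(b))"""
    return ((bigram_counts[(a, b)] - MIN_COUNT) * vocab_size
            / (unigram_counts[a] * unigram_counts[b]))

scores = {pair: default_score(*pair) for pair in bigram_counts}
# A pair becomes a phrase only if its score is strictly above the threshold.
kept = {pair for pair, score in scores.items() if score > THRESHOLD}
print(kept)
```

Whatever the exact value of N, pairs that occur only once score exactly 0 with min_count=1, so only 我~爱 and 爱~你 pass.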

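③ Merge
Sentence 1: 我爱你爱你我爱 → 我~爱 / 你 / 爱~你 / 我~爱
Sentence 2: 我也爱你 → 我 / 也 / 爱~你

Tokens that are not merged into a 2-gram are counted as single characters.

After doc2bow converts each sentence into a bag of words, the frequency lists are

Sentence 1: 我爱: 2, 爱你: 1, 你: 1

Sentence 2: 爱你: 1, 我: 1, 也: 1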
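The merge in step ③ can be sketched as a greedy left-to-right pass. This is only an approximation written for illustration: it assumes the accepted pairs are already known, and `merge_bigrams` is a hypothetical helper rather than gensim's own implementation.

```python
from collections import Counter

corpus_list = [
    ['我', '爱', '你', '爱', '你', '我', '爱'],
    ['我', '也', '爱', '你'],
]
# Pairs that survived min_count and threshold in step ②
phrases = {('我', '爱'), ('爱', '你')}

def merge_bigrams(sentence, phrases, delimiter='~'):
    """Greedily join adjacent tokens that form an accepted pair."""
    merged = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens are consumed by the 2-gram
        else:
            merged.append(sentence[i])
            i += 1  # token stays a single word
    return merged

texts = [merge_bigrams(s, phrases) for s in corpus_list]
bags = [Counter(t) for t in texts]
print(texts)
print(bags)
```

The resulting counts agree with the doc2bow output: sentence 1 contains 我~爱 twice, 爱~你 once and 你 once.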

Appendix

[1] The score can be computed with one of two formulas. The default uses the one from Efficient Estimation of Word Representations in Vector Space:

$\frac{(\mathrm{count}(worda\ \text{followed by}\ wordb) - min\_count) \times N}{\mathrm{count}(worda) \times \mathrm{count}(wordb)} > threshold$, where $N$ is the total vocabulary size.

The other is npmi (normalized pointwise mutual information, from "Normalized (Pointwise) Mutual...").

The scoring method is selected with the scoring parameter.
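For comparison, normalized pointwise mutual information is commonly defined as ln(p(a,b) / (p(a) p(b))) / -ln(p(a,b)). The sketch below implements that textbook definition; estimating the probabilities as counts divided by the total number of words is an assumption made here, and the function name and arguments are illustrative only.

```python
import math

def npmi(count_a, count_b, count_ab, total_words):
    """Textbook NPMI for a word pair; requires 0 < count_ab < total_words."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    # PMI divided by -ln p(a, b) maps the score into [-1, 1]
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Two words that only ever appear together score about 1
print(npmi(count_a=2, count_b=2, count_ab=2, total_words=10))
# Two words that co-occur exactly as often as chance predicts score about 0
print(npmi(count_a=10, count_b=10, count_ab=1, total_words=100))
```

Because the result is bounded, a threshold for this score would normally be chosen between -1 and 1 rather than on the scale used by the default formula.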
