【R干货】电影《王的盛宴》豆瓣短评浅析（含全部实现程序）

ppvke • 2014年8月26日 am12:00 • R语言

【R干货】电影《王的盛宴》豆瓣短评浅析（含全部实现程序）

本文大纲：

问题背景
数据获取
数据信息描述
分词分析
短评文本词汇关联分析
存在的问题

用的到R package有：

Rwordseg：中文分词

wordcloud：词频可视化

arules & arules ：关联分析及可视化

1. 问题背景

《王的盛宴》上映后，网络评论呈现两极化趋势,而负责该片宣传方则认为这其中暗藏“水军”搅局，为了挽回口碑，雇佣水军在豆瓣刷分。双方水军对战如何，只有获取到原始数据才能一探究竟。本文获取到豆瓣关于《王的盛宴》影评部分数据，并作简要分析。2. 数据获取数据的获取采用RCurl解析豆瓣的html网页，获取时间是2012-12-16，近期豆瓣有改版，解析程序需要修改才能适合新版豆瓣，程序就不再贴出。获取到短评文本的时间范围为:2011-08-19~2012-12-16,共9047条。豆瓣影评分为两种，一种是长篇大论，看上去很专业，占少数；一种是短评，几句话的点评而已，这类用户较多。在评分上，其中只评分无评论的用户占大多数，这类用户的数据比较难以获取。3. 数据信息描述

library(Rwordseg)

library(wordcloud)

library(arules)library(arulesViz)short <-read.csv(“kingdom.short.info.csv”)

复制代码

评论日期与评论数量趋势

times <- as.Date(short$comment_time)

par(bg = “grey”)

plot(table(as.Date(times)), xlab = “评论日期”, ylab = “评论数量”, main = “《王的盛宴》豆瓣短评评论趋势”, col = 2:5)

复制代码

<ignore_js_op>

在9047条评论中，来自9045个用户，其中11位用户已注销，其他用户都有对应的主页。一共有8391位用户给出了评分：其中5星684位，4星1042位，3星2329位，2星2040位，1星2296位。

rating <- short$rating

rting <- sort(table(rating), decreasing = T)

rate <- rting/sum(rting)

par(mar = c(0, 1, 2, 1))

pie(rate, labels = paste(names(rate), “星 “, format(rate * 100, digits = 3),”%”, sep = “”), col = rainbow(5))

title(main = “《王的盛宴》豆瓣短评五种评分用户占比”)

复制代码

<ignore_js_op>

4. 分词分析

本文只分析有评分用户的短评且短评长度大于1(含标点),共8354篇。短评文本长度（含标点）,大多数评论低于50个字，有5829篇，占69.77%，低于10个字的有1504篇，占18.0%。短评文本提取：

comment <- as.character(short$comment)

short <- short[!is.na(short$rating) & nchar(comment) > 1, ]

comment <- as.character(short$comment)

cmt.len <- nchar(comment)

# s1<-sort(table(cmt.len),decreasing=T);s2<-as.integer(names(s1))

复制代码

短评文本长度分布直方图：

par(mar = c(5, 2, 2, 1))

hist(cmt.len, freq = F, ylim = c(0, 0.025), col = “goldenrod2″, xlab = “短评文本的长度”,main = “短评长度分布直方图”)

lines(density(cmt.len), col = “tomato”)

复制代码

<ignore_js_op>

利用Rwordseg的segmentCN函数分词,词语长度至少为2。Rwordseg是中科院分词系统ictclas的开源版本Ansi的R接口。

f_cut <- function(x) {

library(Rwordseg)

unlist(strsplit(segmentCN(x, nature = T), ” “))

}

word_cut <- function(x, n = 1) {

x <- gsub(“[a-z]|\.”, “”, x)

x[nchar(x) > n]

}

comment.words <- lapply(comment, f_cut)

words <- lapply(comment.words, word_cut, 1) #8354

复制代码

去掉words词汇量为0的项,有效短评8061篇，其中最长的短评有55个词汇，其中只有一个词汇的有699篇，低于10个词汇的有4810篇。

# 去掉words词汇量为0的文本

cw.len <- unlist(lapply(words, length)) #8354

short2 <- short[cw.len > 0, ]

rating <- short2$rating

words2 <- words[cw.len > 0]

cw.len <- cw.len[cw.len > 0] #8028

# ss1<-sort(table(cw.len),decreasing=T);ss2<-as.integer(names(ss1))

短评词汇数量分布直方图：par(mar = c(5, 2, 2, 1))

hist(cw.len, freq = F, ylim = c(0, 0.096), col = “chocolate2″, main = “短评词汇数量分布”, xlab = “短评词汇数量”)

lines(density(cw.len), col = “red”)

复制代码

<ignore_js_op>

总共得到词语11627个,共出现频率92981，其中前500个占60.87%，前100个占35.22%，前300占52.21%,比二八定律更集中。长度至少为3的词语2920个，共出现9047，前100个占47.92%。

# 词频统计

all.words <- unlist(words2)

freq <- sort(table(all.words), decreasing = T)

words.name <- names(freq)

words.freq <- freq[]

sum(words.freq[1:500])/sum(words.freq)

## 词长至少为3

w3 <- all.words[nchar(all.words) > 2]

f3 <- sort(table(w3), decreasing = T)

w3.name <- names(f3)

w3.freq <- f3[]

复制代码

词长最小为2或3频率最高的200个词语，利用wordcloud绘制其词频标签云图分别为：

par(mar = c(0, 0, 3, 0), bg = “black”)

wordcloud(words.name, words.freq, scale = c(5, 1.5), min.freq = 1, max.words = 200, colors = rainbow(130))

title(main = “短评文本出现频率最高的200个词汇”, col.main = “orange”)

par(mar = c(0, 0, 3, 0), bg = “white”)

wordcloud(w3.name, w3.freq, scale = c(6, 1.5), min.freq = 1, max.words = 200, colors = rainbow(150))

title(main = “短评文本出现词汇长度至少为3频率最高的200个词汇”, col.main = “orange”)

复制代码

<ignore_js_op><ignore_js_op><ignore_js_op>不同评分的短评词频标签云图：

gp.cloud <- function(i, maxwords = 150, a = 1) {

gp_words <- words2[rating == i]

gp <- unlist(gp_words)

gpfreq <- sort(table(gp), decreasing = T)

gp.name <- names(gpfreq)

gp.freq <- gpfreq[]

png(paste(“gp0″, i, “.png”, sep = “”), width = 900 * a, height = 900 * a)

par(mar = c(0, 0, 4, 0), bg = “black”)

wordcloud(gp.name, gp.freq, scale = c(6, 1.5), min.freq = 2, max.words = maxwords,

colors = rainbow(ceiling(maxwords * 0.8)))

title(main = paste(“评分为”, i, “星的短评文本出现频率最高的”, maxwords,

“个词汇”), col.main = “white”)

dev.off()

}

gp.cloud(1, a = 0.8)

gp.cloud(2, a = 0.8)

gp.cloud(3, a = 0.8)

gp.cloud(4, a = 0.8)

复制代码

<ignore_js_op><ignore_js_op><ignore_js_op><ignore_js_op><ignore_js_op>

评分为1星的贬义词比较多，而评分为5星的褒义词比较突出。5. 短评文本词汇关联分析

对8061篇的词汇进行apriori关联分析，挖掘频繁项集，首先要对每篇短评的词汇去除重复。在最小支持度为0.008下，得到频繁项集416个,项集大于2的185个。

words_s <- lapply(words2, as.factor)

# 去除重复

words_s <- lapply(words2, unique)

trans <- as(words_s, “transactions”)

items <- apriori(trans, parameter = list(supp = 0.008, conf = 0.05, minlen = 1,

target = “frequent itemsets”), control = list(verbose = F))

# as(sort(items)[1:50], “data.frame”)

plot(items[size(items) > 1], method = “graph”, control = list(type = “items”, main = “短评的词汇关系，最小项集为2″))

复制代码

<ignore_js_op>

对不同评分的短评进行关联分析，其中supp = 0.01, conf = 0.05, minlen = 1：

gp.items <- function(i) {

gp_words <- words2[rating == i]

gp_words_s <- lapply(gp_words, as.factor)

gp_words_s <- lapply(gp_words, function(x) {

names(x) <- NULL

x

})

gp_words_s <- lapply(gp_words_s, unique)

gp.trans <- as(gp_words_s, “transactions”)

gp.trans

}

trans01 <- gp.items(1)

items01 <- apriori(trans01, parameter = list(supp = 0.01, conf = 0.05, minlen = 1,

target = “frequent itemsets”), control = list(verbose = F))

plot(items01, method = “graph”, control = list(type = “items”, main = “评分为1星的短评的词汇关系”))

#######################################

trans05 <- gp.items(5)

items05 <- apriori(trans05, parameter = list(supp = 0.01, conf = 0.05, minlen = 1,

target = “frequent itemsets”), control = list(verbose = F))

plot(items05, method = “graph”, control = list(type = “items”, main = “评分为5星的短评的词汇关系”))

复制代码

<ignore_js_op><ignore_js_op><ignore_js_op>

6. 存在的问题

在进行分析的过程中，发现不少问题：

1. 数据完整性问题。要判断是否有水军，需要评分用户比较详尽的信息，比如注册时间、看过多少部电影、进行过多少次评分，单独获取一部电影的评分用户难度比较大。

2. 分词问题。虽然使用Rwordseg能够得到较好的分词效果，但是包含着不少没有实际意义的词汇，这些词汇没有立场倾向，比如这样、那样。

3. 词汇的词性问题。虽然segmentCN能给出每个词语的词性，但是一个词语有多个词性，去除无意义词汇比较困难，需根据上下文判断，segmentCN的词性包括 “n”,“v”,“nr”, “r”,“a”,“m” , “d” ,“c”,“ns” ,“i”,“f”,“vn” ,“l”,“t” , “p” ,“ad” “b”,“s” ,“u” , “z” , “nz” ,“j” ,“o” , “mq” ,“an” ,“y”,“q”,“e” ,“nt”,“vd” ,“vq”,“rr”。

4. 用户聚类问题。本文最初试图利用词频对用户进行聚类，而词频矩阵十分稀疏，常见的聚类算法像kmeans、cmeans甚至集成聚类等无法得到有意义的结果，利用词频计算文本之间的相似度，即使取前300个词汇，PC的内存难以承受，最后放弃。当然，也许有文本挖掘专属方法可以解决这样的问题。

Hi，感谢你对PPV课的关注，PPV课是一个大数据学习平台，我们致力于挖掘大数据价值，探索大数据应用，培养大数据人才！请回复以下数字或字母获取你想要的内容。”CDA“： CDA数据分析师认证 ”DSH“ R语言读书会”SS“ 送书活动”1“：最新咨询”2“：最新网络课程”3“：最新培训”5“：沙龙活动”9“：获取帮助”0“：返回导航菜单

原文始发于微信公众号（PPV课数据科学社区）：【R干货】电影《王的盛宴》豆瓣短评浅析（含全部实现程序）

原创文章，作者：ppvke，如若转载，请注明出处：http://www.ppvke.com/archives/29425

数据分析数据挖掘数据科学

赞 (0)

0

你不是寂寞，你只是不想学习

« 上一篇 2014年8月25日 am12:00

【资讯】关注公益：一桶冰水能拯救多少人？——信息图

下一篇 » 2014年8月27日 am12:00

联系我们

4000-51-9191

在线咨询：

工作时间：周一至周五，9:30-18:30，节假日休息