Introduction to the Amazon Review Dataset
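The Amazon Review Dataset records user reviews of products sold on Amazon and is a classic benchmark for recommender systems. It has been updated several times; in chronological order there are three releases:
- 2013 release: http://snap.stanford.edu/data/web-Amazon-links.html
- 2014 release: http://jmcauley.ucsd.edu/data/amazon/index_2014.html (if this link redirects straight to the 2018 release, use http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ instead)
- 2018 release: https://nijianmo.github.io/amazon/index.html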
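The dataset is split by product category into subsets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each subset contains two kinds of information.
Taking the 2014 release as an example: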
- Product metadata
  - asin: product ID
  - title: product name
  - price: price
  - imUrl: URL of the product image
  - related: related products (also bought, also viewed, bought together)
  - salesRank: sales rank information
  - brand: brand name
  - categories: list of categories the product belongs to

  Official example:

      {
        "asin": "0000031852",
        "title": "Girls Ballet Tutu Zebra Hot Pink",
        "price": 3.17,
        "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
        "related": {
          "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
          "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
          "bought_together": ["B002BZX8Z6"]
        },
        "salesRank": {"Toys & Games": 211836},
        "brand": "Coxlures",
        "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
      }
- User rating records
  - reviewerID: user ID
  - asin: product ID
  - reviewerName: user name
  - helpful: helpfulness rating of the review, e.g. [2, 3] for 2/3
  - reviewText: text of the review
  - overall: rating given to the product
  - summary: summary of the review
  - unixReviewTime: time of the review (Unix timestamp)
  - reviewTime: time of the review (raw string)

  Official example:

      {
        "reviewerID": "A2SUAM1J3GNN3B",
        "asin": "0000013714",
        "reviewerName": "J. McDonald",
        "helpful": [2, 3],
        "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
        "overall": 5.0,
        "summary": "Heavenly Highway Hymns",
        "unixReviewTime": 1252800000,
        "reviewTime": "09 13, 2009"
      }
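To make the review fields more concrete, here is a minimal sketch that parses a shortened copy of the official example and derives two values from it: a helpfulness ratio from helpful and a calendar date from unixReviewTime. The helper name summarize_review is hypothetical, and reading helpful as [helpful votes, total votes] is an interpretation of the "2/3" example above rather than something the field list spells out.

```python
import json
from datetime import datetime, timezone

# Shortened copy of the official example record (long text fields omitted).
sample = """{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714",
"helpful": [2, 3], "overall": 5.0, "unixReviewTime": 1252800000}"""


def summarize_review(record):
    """Hypothetical helper: derive a few convenient values from one review."""
    # Assumption: helpful is [helpful votes, total votes].
    helpful_votes, total_votes = record["helpful"]
    ratio = helpful_votes / total_votes if total_votes else None
    # Interpret the Unix timestamp in UTC to get a stable calendar date.
    date = datetime.fromtimestamp(record["unixReviewTime"], tz=timezone.utc).date()
    return {
        "user": record["reviewerID"],
        "item": record["asin"],
        "rating": record["overall"],
        "helpful_ratio": ratio,
        "date": date.isoformat(),
    }


review = json.loads(sample)
summary = summarize_review(review)
print(summary["date"])  # 2009-09-13, matching reviewTime "09 13, 2009"
```

Reviews with no votes have a total of zero, so the sketch returns None for the ratio instead of dividing by zero.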
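Reading the Amazon dataset:
The downloaded data comes as JSON files, which are awkward to work with directly. This section shows how to convert them to CSV, using the 2014 Amazon Electronics subset as an example: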
Reading the product metadata

    import ast
    import pandas as pd

    file_path = 'meta_Electronics.json'
    useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields to drop

    records = []
    with open(file_path, 'r') as fin:
        for line in fin:
            # Each line holds one record; literal_eval parses it like eval
            # would, but without executing arbitrary code.
            d = ast.literal_eval(line)
            for s in useless_col:
                d.pop(s, None)  # drop the field if it is present
            records.append(d)

    df = pd.DataFrame(records)
    df.to_csv('meta_Electronics.csv', index=False)
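One detail worth noting about the metadata conversion: categories is a list of lists, so writing it to CSV as-is stores the Python string form of that nested list. As a rough sketch of one possible workaround, the hypothetical helper flatten_categories below joins each category path into a single string before export. The separators " > " and "|" are arbitrary choices for illustration, not part of the dataset format.

```python
def flatten_categories(categories, path_sep=" > ", list_sep="|"):
    """Turn [["A", "B"], ["C"]] into "A > B|C" (hypothetical helper)."""
    return list_sep.join(path_sep.join(path) for path in categories)


# Record trimmed from the official metadata example.
record = {
    "asin": "0000031852",
    "brand": "Coxlures",
    "categories": [["Sports & Outdoors", "Other Sports", "Dance"]],
}

# Apply this to each record before building the DataFrame.
record["categories"] = flatten_categories(record["categories"])
print(record["categories"])  # Sports & Outdoors > Other Sports > Dance
```

If a category name could itself contain one of the separators, a different delimiter or a JSON-encoded column would be the safer choice.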
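Reading the user rating records

    import ast
    import pandas as pd

    file_path = 'Electronics_10.json'
    useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields to drop

    records = []
    with open(file_path, 'r') as fin:
        for line in fin:
            # Same approach as for the metadata: parse one record per line
            # with literal_eval instead of eval.
            d = ast.literal_eval(line)
            for s in useless_col:
                d.pop(s, None)  # drop the field if it is present
            records.append(d)

    df = pd.DataFrame(records)
    df.to_csv('Electronics_10.csv', index=False)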
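Both scripts above hold every record in memory before writing the CSV, which can be heavy for the larger categories. The following is a minimal streaming sketch using only the standard library, assuming the file has one dict-like record per line and may be gzip-compressed. The function names parse and convert_to_csv, the demo file names, and the second demo record are all hypothetical.

```python
import ast
import csv
import gzip
import os
import tempfile


def parse(path):
    """Yield one dict per non-empty line from a plain or .gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            if line.strip():
                # Assumption: each line is a Python-style dict literal.
                # Strict JSON containing true/false/null needs json.loads.
                yield ast.literal_eval(line)


def convert_to_csv(src_path, dst_path, keep_cols):
    """Stream records into a CSV that keeps only keep_cols; return row count."""
    count = 0
    with open(dst_path, "w", newline="", encoding="utf-8") as fout:
        # extrasaction="ignore" silently drops fields not listed in keep_cols.
        writer = csv.DictWriter(fout, fieldnames=keep_cols, extrasaction="ignore")
        writer.writeheader()
        for record in parse(src_path):
            writer.writerow(record)
            count += 1
    return count


# Demo on a tiny temporary file; the second record is made up for illustration.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo_reviews.json.gz")
dst = os.path.join(tmp_dir, "demo_reviews.csv")
with gzip.open(src, "wt", encoding="utf-8") as f:
    f.write("{'reviewerID': 'A2SUAM1J3GNN3B', 'asin': '0000013714', "
            "'overall': 5.0, 'summary': 'Heavenly Highway Hymns'}\n")
    f.write("{'reviewerID': 'HYPOTHETICAL_USER', 'asin': '0000013714', "
            "'overall': 4.0, 'summary': 'made-up row'}\n")

n_rows = convert_to_csv(src, dst, ["reviewerID", "asin", "overall"])
print(n_rows)  # 2
```

Here the columns to keep are listed explicitly instead of listing the ones to drop, because csv.DictWriter needs the header before the first row is written.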