This document describes a project called FARROT that filters and aggregates Amazon product review ratings and totals by state or time period. It ingests review and member location data from Stanford and Illinois datasets, transforms the data using MapReduce, and stores aggregated results in an HBase database with schemas organized by product, state, and time dimension for efficient querying. The design uses HBase and bucketing to optimize for reads, scans and scalability at the cost of additional storage.
4. Data set
Stanford SNAP Amazon reviews
• 35GB
• 35M reviews
University of Illinois Amazon member info
• 142MB
• Member location information
• Example member record (state: OH): joeme 92 5/26 Cleveland, OH United States Joseph M. Kotow B00006HAXW
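As a rough illustration of how a state code could be pulled out of a member location field, here is a minimal Python sketch. It assumes locations look like "City, ST Country", as in the example record; the function name `extract_state` and the regular expression are hypothetical and are not taken from the project.

```python
import re

# Assumption: locations look like "City, ST United States".
# Anything that does not contain ", XX" (two capital letters) yields None.
_STATE_RE = re.compile(r",\s*([A-Z]{2})\b")


def extract_state(location):
    """Return the two-letter code that follows the comma, or None if absent."""
    match = _STATE_RE.search(location)
    return match.group(1) if match else None


print(extract_state("Cleveland, OH United States"))  # OH
```

The sketch does not check the code against a list of valid US states, so a real cleaning step would probably need that extra filter.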
7. Pipeline
ImportTsv
• SNAP REVIEWS, in 10 rows per review
• UIC MEMBER LOCATION
• Example review: B00006HAXW Rock Rhythm & Doo Wop Greatest Early Rock unknown A1RSDE90N6RSZF Joseph M Kotow 9/9 5.0 1042502400 Pittsburgh – Home of the OLDIES I have all of the doo wop DVD’s and this one is as good or better than the 1st ones. Rem…
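The conversion from 10 rows per review to one tab-separated row, which ImportTsv needs, can be sketched as follows. The project did this step with Java MapReduce; the plain Python below is only an illustration. It assumes each review is a block of "key: value" lines separated by a blank line, with the ten field names listed in `FIELDS`. Those field names, the function names, and the sample values are assumptions for the example.

```python
# Assumed field order for one review block (ten "key: value" lines).
FIELDS = [
    "product/productId", "product/title", "product/price",
    "review/userId", "review/profileName", "review/helpfulness",
    "review/score", "review/time", "review/summary", "review/text",
]


def _to_row(record):
    # Tabs inside a value would break the TSV layout, so replace them.
    return "\t".join(record.get(f, "").replace("\t", " ") for f in FIELDS)


def records_to_tsv(lines):
    """Yield one tab-separated row per review from multi-row input lines."""
    record = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current review block.
            if record:
                yield _to_row(record)
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            record[key] = value
    if record:  # last block may not be followed by a blank line
        yield _to_row(record)


sample = [
    "product/productId: B00006HAXW\n",
    "review/userId: HYPOTHETICAL_USER\n",  # placeholder, not real data
    "review/score: 5.0\n",
    "review/time: 1042502400\n",
    "\n",
]
for row in records_to_tsv(sample):
    print(row.split("\t"))
```

Missing fields become empty columns, so every output row has the same ten columns regardless of what the input block contained.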
8. Pipeline
Pig to CLEAN, JOIN and AGGREGATE review ratings and totals
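The clean, join and aggregate step can be pictured with the small Python sketch below. It is not the project's Pig script. The slides do not say which field the join used, so the sketch assumes a hypothetical `user_key` shared by the review data and the member data, and all sample values are made up.

```python
from collections import defaultdict


def aggregate_by_product_state(reviews, member_states):
    """Join reviews to member states, then count and average ratings.

    reviews: iterable of (product_id, user_key, rating)
    member_states: dict mapping user_key -> state code
    Returns {(product_id, state): (total_reviews, avg_rating)}.
    """
    totals = defaultdict(lambda: [0, 0.0])  # [review count, rating sum]
    for product_id, user_key, rating in reviews:
        state = member_states.get(user_key)
        if state is None:
            continue  # clean: drop reviews with no known location
        bucket = totals[(product_id, state)]
        bucket[0] += 1
        bucket[1] += float(rating)
    return {
        key: (count, rating_sum / count)
        for key, (count, rating_sum) in totals.items()
    }


# Made-up sample data for illustration only.
reviews = [("P1", "u1", 5.0), ("P1", "u2", 3.0), ("P1", "u3", 4.0)]
member_states = {"u1": "OH", "u2": "OH"}
print(aggregate_by_product_state(reviews, member_states))
```

Reviews from members with no location are dropped, which corresponds to an inner join; whether the project kept or dropped such reviews is an assumption here.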
10. HBase Schema
Table Schemas (row key pattern: columns):
• PRODUCTID_STATE: TOTAL REVIEWS, AVG RATING
• PRODUCTID_STATE_BYYEAR_EPOCH: TOTAL REVIEWS, AVG RATING
• PRODUCTID_STATE_BYMONTH_EPOCH: TOTAL REVIEWS, AVG RATING
• PRODUCTID_STATE_BYDAY_EPOCH: TOTAL REVIEWS, AVG RATING
• Example: B00003CWT6_CA_BYMONTH_1008115200000
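A minimal Python sketch of composing and parsing keys in this layout is shown below. The helper names `make_row_key` and `parse_row_key` are hypothetical. The sketch assumes product IDs and state codes never contain an underscore, and it takes the epoch value as given, because the slides do not show how the epoch was derived for each bucket.

```python
BUCKETS = ("BYYEAR", "BYMONTH", "BYDAY")


def make_row_key(product_id, state, bucket=None, epoch_ms=None):
    """Build PRODUCTID_STATE or PRODUCTID_STATE_<BUCKET>_<EPOCH>."""
    if bucket is None:
        return f"{product_id}_{state}"
    if bucket not in BUCKETS:
        raise ValueError(f"unknown bucket: {bucket}")
    if epoch_ms is None:
        raise ValueError("a time bucket needs an epoch value")
    return f"{product_id}_{state}_{bucket}_{epoch_ms}"


def parse_row_key(row_key):
    """Split a key back into (product_id, state, bucket, epoch_ms)."""
    parts = row_key.split("_")
    if len(parts) == 2:
        return parts[0], parts[1], None, None
    if len(parts) == 4 and parts[2] in BUCKETS:
        return parts[0], parts[1], parts[2], int(parts[3])
    raise ValueError(f"unexpected row key layout: {row_key}")


key = make_row_key("B00003CWT6", "CA", "BYMONTH", 1008115200000)
print(key)
print(parse_row_key(key))
```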
11. Retrospective
Design Considerations
• HBase was chosen because it is optimized for reads, range scans, and scalability
• Data was bucketed by state and by time interval (year, month, day) so that queries read precomputed aggregates instead of recalculating them, at the expense of extra storage
• Java MapReduce was used to convert multi-row reviews to tabular format
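To show why bucketed row keys suit range scans, the sketch below imitates a prefix scan over lexicographically sorted keys with Python's `bisect` module. It is an in-memory stand-in, not HBase client code. The function name `prefix_scan` is hypothetical, and every key except the example from the schema slide is made up.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix from a sorted list of keys."""
    start = bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any character used in these ASCII keys,
    # so this finds the first key past the prefix range.
    stop = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:stop]


# Made-up keys for illustration; only the first BYMONTH key is from the slides.
ROW_KEYS = sorted([
    "B00003CWT6_CA",
    "B00003CWT6_CA_BYDAY_1008115200000",
    "B00003CWT6_CA_BYMONTH_1008115200000",
    "B00003CWT6_CA_BYMONTH_1010793600000",
    "B00003CWT6_NY_BYMONTH_1008115200000",
])

monthly = prefix_scan(ROW_KEYS, "B00003CWT6_CA_BYMONTH_")
print(monthly)
```

Because all monthly aggregates for one product and state sit next to each other in key order, one contiguous scan returns them without touching the raw reviews. That is the read-side benefit bought with the extra storage.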
Future
• Scrape Amazon for new reviews
• Filter and display reviews
12. About me – Andy Lai
UC Berkeley (B.S. Electrical Engineering & Computer Science)
SJSU (M.S. Engineering)
Software Engineer (DB2, relational databases)
Interests: