Big Data - architectural concerns for the new age

A brief introduction to Big Data and why you should care about polyglot storage

  1. Big Data - architectural concerns for the new age
  2. Debasish Ghosh, CTO (a Nomura Research Institute group company)
  3. @debasishg on Twitter | code @ http://github.com/debasishg | blog @ Ruminations of a Programmer (http://debasishg.blogspot.com)
  4. some numbers ..
  5. Facebook reaches 1 billion active users
  6. (image-only slide)
  7. (image-only slide)
  8. some more numbers ..
  9. • Walmart handles 1M transactions per hour
     • Google processes 24PB of data per day
     • AT&T transfers 30PB of data per day
     • 90 trillion emails are sent every year
     • World of Warcraft uses 1.3PB of storage
  10. Big Data - the positive feedback cycle:
      1. new technologies make using big data efficient
      2. more adoption of big data
      3. generation of more big data
  11. new technologies .. new architectural concerns
  12. new ways to store data
  13. new techniques to retrieve data
  14. new ways to scale reads & writes
  15. transparent to the application
  16. new ways to consume data
  17. new techniques to analyze data
  18. new ways to visualize data
  19. at Web scale
  20. The Database Landscape so far ..
      • relational database - the bedrock of enterprise data
      • irrespective of application development paradigm
      • object-relational mapping considered to be the panacea for impedance mismatch
  21. "Object Relational Mapping is the Vietnam of Computer Science" - Ted Neward (2006), blogger, big geek and architectural consultant
  22. RDBMS & Big Data
      • once the data volume crosses the limit of a single server, you shard / partition
      • sharding implies a lookup node for the hash code => SPOF
      • cross shard joins, transactions don't scale
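
Not part of the deck, but a minimal sketch of the naive hash-based sharding slide 22 alludes to; the shard count and node names are invented. It shows why the routing step is a single point of failure and why resizing the cluster reshuffles keys, which is what consistent hashing (slide 38) later addresses.

```python
# Hypothetical sketch (not from the deck): naive hash sharding.
# A single routing table maps hash buckets to shard servers - that table is the
# SPOF, and changing the shard count moves almost every key.

NUM_SHARDS = 4
SHARDS = [f"db-node-{i}" for i in range(NUM_SHARDS)]   # assumed node names

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing modulo the shard count."""
    return SHARDS[hash(key) % NUM_SHARDS]

if __name__ == "__main__":
    print(shard_for("customer:42"))
    # Adding a 5th node changes `hash(key) % N` for most keys,
    # so nearly all data would have to be re-partitioned.
```
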
  23. RDBMS & Big Data
      • Cost of distributed transactions
      • synchronization overhead
      • 2 phase commit is a blocking protocol (can block indefinitely)
      • as slow as the slowest DB node + network latency
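
A rough sketch (again, not from the deck; participant names and delays are made up) of why 2 phase commit is only as fast as its slowest participant and blocks if the decision never arrives.

```python
# Hypothetical 2-phase-commit sketch: the coordinator blocks until every
# participant has voted, so the round runs at the speed of the slowest node.
import time
from concurrent.futures import ThreadPoolExecutor

def prepare(name: str, delay: float) -> bool:
    """Simulated prepare/vote round trip; `delay` stands in for node + network latency."""
    time.sleep(delay)
    return True  # vote to commit

def two_phase_commit(participants: dict) -> str:
    # Phase 1: send prepare to everyone in parallel, then block until ALL votes arrive.
    with ThreadPoolExecutor() as pool:
        votes = list(pool.map(lambda p: prepare(*p), participants.items()))
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (broadcasting the decision) omitted; if the coordinator died here,
    # participants would stay blocked holding their locks indefinitely.
    return decision

if __name__ == "__main__":
    start = time.time()
    print(two_phase_commit({"db-a": 0.1, "db-b": 0.2, "slow-db-c": 1.0}))
    print(f"took {time.time() - start:.1f}s - gated by the slowest participant")
```
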
  24. RDBMS & Big Data
      • Master/Slave replication
      • synchronous replication => slow
      • asynchronous replication => can lose data
      • writing to master is a bottleneck and SPOF
  25. Need Distributed Databases
      • data is automatically partitioned
      • transparent to the application
      • add capacity without downtime
      • failure tolerant
  26. 2 famous papers ..
      • Bigtable: A Distributed Storage System for Structured Data, 2006
      • Dynamo: Amazon's Highly Available Key-value Store, 2007
  27. Addressing 2 approaches
      • Bigtable: "how can we build a distributed database on top of GFS?"
      • Dynamo: "how can we build a distributed hash table appropriate for the data center?"
  28. Big Data recommendations
      • reduce accidental complexity in processing data
      • be less rigid (no rigid schema)
      • store data in a format closer to the domain model
      • hence no universal data model ..
  29. Polyglot Storage
      • unfortunately came to be known as NoSQL databases
      • document oriented (MongoDB, CouchDB)
      • key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
      • data structure based (Redis)
      • graph based (Neo4J)
  30. • reduced impedance mismatch
      • richer modeling capabilities
      • closer to domain model
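
A small, hypothetical illustration (entity names invented) of what "closer to the domain model" means: the order aggregate that needs three relational tables and joins maps to a single self-contained document of the kind MongoDB or CouchDB stores as-is.

```python
# Hypothetical example: the same order aggregate, relational vs. document-shaped.

# Relational shape: three tables, reassembled with joins at read time.
orders      = [{"order_id": 1, "customer_id": 7}]
order_lines = [{"order_id": 1, "sku": "A-100", "qty": 2},
               {"order_id": 1, "sku": "B-220", "qty": 1}]
customers   = [{"customer_id": 7, "name": "Acme Corp"}]

# Document shape: one aggregate, close to the domain object,
# storable directly in a document-oriented database.
order_document = {
    "order_id": 1,
    "customer": {"customer_id": 7, "name": "Acme Corp"},
    "lines": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-220", "qty": 1},
    ],
}
```
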
  31. Asynchronous Replication to RDBMS using Message Oriented Middleware
  32. Hybrid Oracle / MongoDB storage over a messaging backbone
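
A toy sketch of the arrangement in slides 31-32. Assumptions: an in-process queue stands in for the message-oriented middleware, plain dicts stand in for the MongoDB and Oracle stores, and the function names are made up.

```python
# Hypothetical sketch: asynchronous replication over a messaging backbone.
import queue

backbone = queue.Queue()     # stand-in for the message-oriented middleware
mongo_store = {}             # fast, domain-shaped primary store
oracle_store = {}            # relational replica, updated lazily

def write(doc_id: str, doc: dict) -> None:
    mongo_store[doc_id] = doc              # synchronous write to the document store
    backbone.put(("upsert", doc_id, doc))  # publish a change event, don't wait

def replicate_once() -> None:
    """One consumer step: apply a pending change event to the RDBMS side."""
    op, doc_id, doc = backbone.get()
    if op == "upsert":
        oracle_store[doc_id] = doc         # in reality: an INSERT/UPDATE statement

if __name__ == "__main__":
    write("trade:1", {"symbol": "NRI", "qty": 100})
    replicate_once()
    print(oracle_store)
```
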
  33. Relational Database is just another option, not the only option, when the data set is BIG and semantically rich
  34. 10 things never to do with a Relational Database
      • Search
      • Media Repository
      • Recommendation
      • Email
      • High Frequency Trading
      • Classified Ads
      • Product Cataloging
      • Time Series / Forecasting
      • User Groups / ACLs
      • Log Analysis
      Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
  35. Scalability, Availability ..
      • ACID => BASE
      • CAP Theorem & Eventual Consistency
      • Consistent Hashing
      • Vector Clocks
      • Anti-entropy
      • Gossip Protocol
      • Hinted Hand-off & Read Repair
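
One of the slide-35 items, vector clocks, sketched minimally (not from the deck): Dynamo-style stores use them to tell causally ordered updates from concurrent ones that need reconciliation.

```python
# Hypothetical vector clock sketch: one counter per node. Clock `a` "descends from"
# clock `b` if it is >= component-wise; if neither descends from the other,
# the two writes were concurrent and conflict.

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen everything clock `b` has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a: dict, b: dict) -> dict:
    """Pairwise maximum, used after the application reconciles a conflict."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

if __name__ == "__main__":
    v1 = {"node-a": 2, "node-b": 1}
    v2 = {"node-a": 1, "node-b": 2}
    print(descends(v1, v2), descends(v2, v1))  # False False -> concurrent writes
    print(merge(v1, v2))                       # {'node-a': 2, 'node-b': 2}
```
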
  36. CAP Theorem
      • Consistency, Availability & Partition Tolerance
      • you can have only 2 of these in a distributed system
      • Eric Brewer postulated this back in 2000
  37. ACID => BASE
      • Basically Available, Soft state, Eventual consistency
      • rather than requiring consistency after every transaction, it's enough for the database to eventually be in a consistent state
      • it's ok to use stale data and it's ok to give approximate answers
  38. Consistent Hashing
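
A compact sketch of the consistent hashing idea this slide illustrates (assumptions: MD5 as the hash, a handful of virtual nodes per server): keys and nodes live on the same ring, a key belongs to the first node clockwise, so adding or removing a node only moves the keys in that node's arc.

```python
# Hypothetical consistent hash ring using bisect over hashed (virtual) node positions.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes: int = 8):
        # Place `vnodes` points per physical node on the ring for smoother balance.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node point."""
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._points[idx][1]

if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.node_for("customer:42"))
```
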
  39. Big Data in the wild
      • Hadoop
      • started as a batch processing engine (HDFS & Map/Reduce)
      • with bigger and bigger data, you need to make it available to users in near real time
      • stream processing, CEP ..
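
A toy word count in plain Python (not Hadoop code) just to show the shape of the Map/Reduce programming model slide 39 refers to: map emits key/value pairs, a shuffle groups them by key, and reduce folds each group.

```python
# Hypothetical word-count in the Map/Reduce style; Hadoop runs the same shape of
# program distributed over HDFS blocks instead of an in-memory list of lines.
from collections import defaultdict

def map_phase(line: str):
    for word in line.split():
        yield word.lower(), 1          # emit (key, value) pairs

def reduce_phase(word: str, counts):
    return word, sum(counts)           # fold all values for one key

def word_count(lines):
    groups = defaultdict(list)         # the "shuffle": group values by key
    for line in lines:
        for word, one in map_phase(line):
            groups[word].append(one)
    return dict(reduce_phase(w, c) for w, c in groups.items())

if __name__ == "__main__":
    print(word_count(["big data is big", "data about data"]))
```
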
  40. Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems, complementing Map/Reduce
      Pig: a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs in Hadoop, coupled with infrastructure for evaluating these programs
      Cloudera Impala: real time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing
  41. Real time queries in Hadoop
      • currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop
      • expensive, and may need lots of data movement between the database & the Hadoop clusters
  42. .. and the Hadoop ecosystem continues to grow, with lots of real time tools being actively developed that are compatible with the current base ..
  43. Shark from UC Berkeley
      • a large scale data warehouse system for Spark, compatible with Hive
      • supports HiveQL, Hive data formats and user defined functions
      • in addition, Shark can be used to query data in HDFS, HBase and Amazon S3
  44. BI and Analytics
      • making Big Data available to developers
      • API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
      • analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
  45. Machine Learning
      • personalization
      • social network analysis
      • pattern discovery - click patterns, recommendations, ratings
      • apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
  46. Summary
      • Big Data will grow bigger - we need to embrace the changes in architecture
      • an RDBMS is NOT the panacea - pick the data model that's closest to your domain
      • it's economical to limit data movement - process data in place and utilize the multiple cores of your hardware
  47. Summary
      • go for decentralized architectures, avoid SPOFs
      • with big volumes of data, streaming is your friend
  48. Thank You!
  49. http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
      http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm
      http://www.emich.edu/chhs/about-researchMETHODS.html
      http://docs.basho.com/riak/latest/references/appendices/concepts/