SlideShare une entreprise Scribd logo
1  sur  39
Large-scale Messaging at IMVU Jon Watte Technical Director, IMVU Inc @jwatte
Presentation Overview Describe the problem Low-latency game messaging and state distribution Survey available solutions Quick mention of also-rans Dive into implementation Erlang! Discuss gotchas Speculate about the future
From Chat to Games
Context Caching Web Servers HTTP Load Balancers Databases Caching Long Poll Load Balancers Game Servers HTTP
What Do We Want? Any-to-any messaging with ad-hoc structure Chat; Events; Input/Control Lightweight (in-RAM) state maintenance Scores; Dice; Equipment
New Building Blocks Queues provide a sane view of distributed state for developers building games Two kinds of messaging: Events (edge triggered, “messages”) State (level triggered, “updates”) Integrated into a bigger system
From Long-poll to Real-time Caching Web Servers Load Balancers Databases Caching Long Poll Load Balancers Game Servers Connection Gateways Message Queues Today’s Talk
Functions Game Server HTTP Validation users/requests Notification Client Connect Listen message/state/user Send message/state Create/delete     queue/mount Join/remove user Send message/state Queue
Performance Requirements Simultaneous user count: 80,000 when we started 150,000 today 1,000,000 design goal Real-time performance (the main driving requirement) Lower than 100ms end-to-end through the system Queue creates and join/leaves (kill a lot of contenders) >500,000 creates/day when started >20,000,000 creates/day design goal
Also-rans: Existing Wheels AMQP, JMS: Qpid, Rabbit, ZeroMQ, BEA, IBM etc Poor user and authentication model Expensive queues IRC Spanning Tree; Netsplits; no state XMPP / Jabber Protocol doesn’t scale in federation Gtalk, AIM, MSN Msgr, Yahoo Msgr If only we could buy one of these!
Our Wheel is Rounder! Inspired by the 1,000,000-user mochiweb app http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 A purpose-built general system Written in Erlang
Section: Implementation Journey of a message Anatomy of a queue Scaling across machines Erlang
The Journey of a Message
Queue Node Gateway The Journey of a Message Gateway for User Queue Node Queue Process Message in Queue: /room/123 Mount: chat Data: Hello, World! Find node for /room/123 Find queue /room/123 List of subscribers Gateway Validation Gateway Gateway for User Forward message
Anatomy of a Queue Queue Name: /room/123 Mount Type: message Name: chat User A: I win. User B: OMG Pwnies! User A: Take that! … Subscriber List User A @ Gateway C User B @ Gateway B Mount Type: state Name: scores User A: 3220  User B: 1200
A Single Machine Isn’t Enough 1,000,000 users, 1 machine? 25 GB/s memory bus 40 GB memory (40 kB/user) Touched twice per message one message per is 3,400 ms
Scale Across Machines Gateway Queues Gateway Queues Internet Gateway Queues Consistent Hashing Gateway Queues
Consistent Hashing The Gateway maps queue name -> node This is done using a fixed hash function A prefix of the output bits of the hash function is used as a look-up into a table, with a minimum of 8 buckets per node Load differential is 8:9 or better (down to 15:16) Updating the map of buckets -> nodes is managed centrally Hash(“/room/123”) = 0xaf5… Node A Node B Node C Node D Node E Node F
Consistent Hash Table Update Minimizes amount of traffic moved If nodes have more than 8 buckets, steal 1/N of all buckets from those with the most and assign to new target If not, split each bucket, then steal 1/N of all buckets and assign to new target
Erlang Developed in ‘80s by Ericsson for phone switches Reliability, scalability, and communications Prolog-based functional syntax (no braces!) 25% the code of equivalent C++ Parallel Communicating Processes Erlang processes much cheaper than C++ threads (Almost) No Mutable Data No data race conditions Each process separately garbage collected
Example Erlang Process % spawn process MyCounter = spawn(my_module, counter, [0]). % increment counter MyCounter! {add, 1}. % get valueMyCounter! {get, self()}; receive     {value, MyCounter, Value} ->           Value end. % stop processMyCounter! stop. counter(stop) ->  stopped; counter(Value) ->NextValue = receive    {get, Pid} ->Pid! {value, self(), Value},          Value;    {add, Delta} ->          Value + Delta;    stop ->           stop;    _ ->          Valueend,  counter(NextValue).  % tail recursion
Section: Details Load Management Marshalling RPC / Call-outs Hot Adds and Fail-over The Boss! Monitoring
Load Management Gateway Queues Internet Gateway Queues HAProxy HAProxy Gateway Queues Consistent Hashing Gateway Queues
Marshalling message MsgG2cResult {     required uint32 op_id = 1;     required uint32 status = 2;     optional string error_message = 3; }
RPC Web Server PHP HTTP + JSON admin Gateway Message Queue Erlang
Call-outs Web Server PHP HTTP + JSON Message Queue Gateway Erlang Mount Credentials Rules
Management Gateway Queues The Boss Gateway Queues Gateway Queues Consistent Hashing Gateway Queues
Monitoring Example counters: Number of connected users Number of queues Messages routed per second Round trip time for routed messages Distributed clock work-around! Disconnects and other error events
Hot Add Node
Section: Problem Cases User goes silent Second user connection Node crashes Gateway crashes Reliable messages Firewalls Build and test
User Goes Silent Some TCP connections will stop(bad WiFi, firewalls, etc) We use a ping message Both ends separately detect ping failure This means one end detects it before the other
Second User Connection Currently connected usermakes a new connection To another gateway because of load balancing Auser-specific queue arbitrates Queues are serializedthere is always a winner
State is ephemeralit’s lost when machine is lost A user “management queue”contains all subscription state If the home queue node dies, the user is logged out If a queue the user is subscribed to dies, the user is auto-unsubscribed (client has to deal) Node Crashes
Gateway Crashes When a gateway crashesclient will reconnect Historyallow us to avoid re-sending for quick reconnects The application above the queue API doesn’t notice Erlang message send does not report error Monitor nodes to remove stale listeners
Reliable Messages “If the user isn’t logged in, deliver the next log-in.” Hidden at application server API level, stored in database Return “not logged in” Signal to store message in database Hook logged-in call-out Re-check the logged in state after storing to database (avoids a race)
Firewalls HTTP long-poll has one main strength: It works if your browser works Message Queue uses a different protocol We still use ports 80 (“HTTP”) and 443 (“HTTPS”) This makes us horriblepeople We try a configured proxy with CONNECT We reach >99% of existing customers Future improvement: HTTP Upgrade/101
Build and Test Continuous Integration and Continuous Deployment Had to build our own systems ErlangIn-place Code Upgrades Too heavy, designed for “6 month” upgrade cycles Use fail-over instead (similar to Apache graceful) Load testing at scale “Dark launch” to existing users
Section: Future Replication Similar to fail-over Limits of Scalability (?) M x N (Gateways x Queues) stops at some point Open Source We would like to open-source what we can Protobuf for PHP and Erlang? IMQ core? (not surrounding application server)
Q&A Survey If you found this helpful, please circle “Excellent” If this sucked, don’t circle “Excellent” Questions? @jwatte jwatte@imvu.com IMVU is a great place to work, and we’re hiring!

Contenu connexe

Tendances

Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionJoel Koshy
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)StreamNative
 
RedisConf18 - Writing modular & encapsulated Redis code
RedisConf18 - Writing modular & encapsulated Redis codeRedisConf18 - Writing modular & encapsulated Redis code
RedisConf18 - Writing modular & encapsulated Redis codeRedis Labs
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIALa Cuisine du Web
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_finalRamya Sunil
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQLKonstantin Gredeskoul
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips confluent
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionStreamNative
 
Realtime classroom analytics powered by apache druid
Realtime classroom analytics powered by apache druid Realtime classroom analytics powered by apache druid
Realtime classroom analytics powered by apache druid Karthik Deivasigamani
 
Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...Frank Kelly
 
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...StreamNative
 
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...confluent
 
Pulsar Summit Asia - Running a secure pulsar cluster
Pulsar Summit Asia -  Running a secure pulsar clusterPulsar Summit Asia -  Running a secure pulsar cluster
Pulsar Summit Asia - Running a secure pulsar clusterShivji Kumar Jha
 
Tool Academy: Web Archiving
Tool Academy: Web ArchivingTool Academy: Web Archiving
Tool Academy: Web Archivingnullhandle
 
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated Hadoop
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated HadoopHadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated Hadoop
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated HadoopYafang Chang
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wilddatamantra
 
Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudDataWorks Summit
 
Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...Marco Tusa
 

Tendances (20)

Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 
Troubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolutionTroubleshooting Kafka's socket server: from incident to resolution
Troubleshooting Kafka's socket server: from incident to resolution
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)
 
RedisConf18 - Writing modular & encapsulated Redis code
RedisConf18 - Writing modular & encapsulated Redis codeRedisConf18 - Writing modular & encapsulated Redis code
RedisConf18 - Writing modular & encapsulated Redis code
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIA
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_final
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless Evolution
 
Realtime classroom analytics powered by apache druid
Realtime classroom analytics powered by apache druid Realtime classroom analytics powered by apache druid
Realtime classroom analytics powered by apache druid
 
Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...Streaming millions of Contact Center interactions in (near) real-time with Pu...
Streaming millions of Contact Center interactions in (near) real-time with Pu...
 
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
 
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
The Easiest Way to Configure Security for Clients AND Servers (Dani Traphagen...
 
Pulsar Summit Asia - Running a secure pulsar cluster
Pulsar Summit Asia -  Running a secure pulsar clusterPulsar Summit Asia -  Running a secure pulsar cluster
Pulsar Summit Asia - Running a secure pulsar cluster
 
Tool Academy: Web Archiving
Tool Academy: Web ArchivingTool Academy: Web Archiving
Tool Academy: Web Archiving
 
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated Hadoop
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated HadoopHadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated Hadoop
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated Hadoop
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
 
Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloud
 
Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...
 

Similaire à Message Queuing on a Large Scale: IMVUs stateful real-time message queue for chat and games

The experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare networkThe experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare networkgeorge.james
 
Network Programming in Java
Network Programming in JavaNetwork Programming in Java
Network Programming in JavaTushar B Kute
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systemsSri Prasanna
 
Network programming in Java
Network programming in JavaNetwork programming in Java
Network programming in JavaTushar B Kute
 
Interoperable Web Services with JAX-WS and WSIT
Interoperable Web Services with JAX-WS and WSITInteroperable Web Services with JAX-WS and WSIT
Interoperable Web Services with JAX-WS and WSITCarol McDonald
 
WebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect Toolbox
WebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect ToolboxWebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect Toolbox
WebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect ToolboxWebCamp
 
Network programming in Java
Network programming in JavaNetwork programming in Java
Network programming in JavaTushar B Kute
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetrypphaal
 
21. Application Development and Administration in DBMS
21. Application Development and Administration in DBMS21. Application Development and Administration in DBMS
21. Application Development and Administration in DBMSkoolkampus
 
Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...
Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...
Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...Ontico
 
When Web Services Go Bad
When Web Services Go BadWhen Web Services Go Bad
When Web Services Go BadSteve Loughran
 
The experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare networkThe experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare networkgeorge.james
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesDatabricks
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
 
Taking a Quantum Leap with Html 5 WebSocket
Taking a Quantum Leap with Html 5 WebSocketTaking a Quantum Leap with Html 5 WebSocket
Taking a Quantum Leap with Html 5 WebSocketShahriar Hyder
 
Don't call us - we'll push - cross tier push architecture (JavaOne 2011)
Don't call us - we'll push - cross tier push architecture (JavaOne 2011)Don't call us - we'll push - cross tier push architecture (JavaOne 2011)
Don't call us - we'll push - cross tier push architecture (JavaOne 2011)Lucas Jellema
 

Similaire à Message Queuing on a Large Scale: IMVUs stateful real-time message queue for chat and games (20)

The experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare networkThe experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare network
 
Real-time ASP.NET with SignalR
Real-time ASP.NET with SignalRReal-time ASP.NET with SignalR
Real-time ASP.NET with SignalR
 
Network Programming in Java
Network Programming in JavaNetwork Programming in Java
Network Programming in Java
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systems
 
Network programming in Java
Network programming in JavaNetwork programming in Java
Network programming in Java
 
Velocity 2010 - ATS
Velocity 2010 - ATSVelocity 2010 - ATS
Velocity 2010 - ATS
 
Interoperable Web Services with JAX-WS and WSIT
Interoperable Web Services with JAX-WS and WSITInteroperable Web Services with JAX-WS and WSIT
Interoperable Web Services with JAX-WS and WSIT
 
Thaker q3 2008
Thaker q3 2008Thaker q3 2008
Thaker q3 2008
 
WebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect Toolbox
WebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect ToolboxWebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect Toolbox
WebCamp 2016: PHP.Алексей Петров.PHP at Scale: System Architect Toolbox
 
Network programming in Java
Network programming in JavaNetwork programming in Java
Network programming in Java
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
21. Application Development and Administration in DBMS
21. Application Development and Administration in DBMS21. Application Development and Administration in DBMS
21. Application Development and Administration in DBMS
 
Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...
Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...
Проксирование HTTP-запросов web-акселератором / Александр Крижановский (Tempe...
 
Introduction To Cloud Computing
Introduction To Cloud ComputingIntroduction To Cloud Computing
Introduction To Cloud Computing
 
When Web Services Go Bad
When Web Services Go BadWhen Web Services Go Bad
When Web Services Go Bad
 
The experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare networkThe experiences of migrating a large scale, high performance healthcare network
The experiences of migrating a large scale, high performance healthcare network
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Taking a Quantum Leap with Html 5 WebSocket
Taking a Quantum Leap with Html 5 WebSocketTaking a Quantum Leap with Html 5 WebSocket
Taking a Quantum Leap with Html 5 WebSocket
 
Don't call us - we'll push - cross tier push architecture (JavaOne 2011)
Don't call us - we'll push - cross tier push architecture (JavaOne 2011)Don't call us - we'll push - cross tier push architecture (JavaOne 2011)
Don't call us - we'll push - cross tier push architecture (JavaOne 2011)
 

Dernier

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Dernier (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

Message Queuing on a Large Scale: IMVUs stateful real-time message queue for chat and games

  • 1. Large-scale Messaging at IMVU Jon Watte Technical Director, IMVU Inc @jwatte
  • 2. Presentation Overview Describe the problem Low-latency game messaging and state distribution Survey available solutions Quick mention of also-rans Dive into implementation Erlang! Discuss gotchas Speculate about the future
  • 3. From Chat to Games
  • 4. Context Caching Web Servers HTTP Load Balancers Databases Caching Long Poll Load Balancers Game Servers HTTP
  • 5. What Do We Want? Any-to-any messaging with ad-hoc structure Chat; Events; Input/Control Lightweight (in-RAM) state maintenance Scores; Dice; Equipment
  • 6. New Building Blocks Queues provide a sane view of distributed state for developers building games Two kinds of messaging: Events (edge triggered, “messages”) State (level triggered, “updates”) Integrated into a bigger system
  • 7. From Long-poll to Real-time Caching Web Servers Load Balancers Databases Caching Long Poll Load Balancers Game Servers Connection Gateways Message Queues Today’s Talk
  • 8. Functions Game Server HTTP Validation users/requests Notification Client Connect Listen message/state/user Send message/state Create/delete queue/mount Join/remove user Send message/state Queue
  • 9. Performance Requirements Simultaneous user count: 80,000 when we started 150,000 today 1,000,000 design goal Real-time performance (the main driving requirement) Lower than 100ms end-to-end through the system Queue creates and join/leaves (kill a lot of contenders) >500,000 creates/day when started >20,000,000 creates/day design goal
  • 10. Also-rans: Existing Wheels AMQP, JMS: Qpid, Rabbit, ZeroMQ, BEA, IBM etc Poor user and authentication model Expensive queues IRC Spanning Tree; Netsplits; no state XMPP / Jabber Protocol doesn’t scale in federation Gtalk, AIM, MSN Msgr, Yahoo Msgr If only we could buy one of these!
  • 11. Our Wheel is Rounder! Inspired by the 1,000,000-user mochiweb app http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 A purpose-built general system Written in Erlang
  • 12. Section: Implementation Journey of a message Anatomy of a queue Scaling across machines Erlang
  • 13. The Journey of a Message
  • 14. Queue Node Gateway The Journey of a Message Gateway for User Queue Node Queue Process Message in Queue: /room/123 Mount: chat Data: Hello, World! Find node for /room/123 Find queue /room/123 List of subscribers Gateway Validation Gateway Gateway for User Forward message
  • 15. Anatomy of a Queue Queue Name: /room/123 Mount Type: message Name: chat User A: I win. User B: OMG Pwnies! User A: Take that! … Subscriber List User A @ Gateway C User B @ Gateway B Mount Type: state Name: scores User A: 3220 User B: 1200
  • 16. A Single Machine Isn’t Enough 1,000,000 users, 1 machine? 25 GB/s memory bus 40 GB memory (40 kB/user) Touched twice per message one message per is 3,400 ms
  • 17. Scale Across Machines Gateway Queues Gateway Queues Internet Gateway Queues Consistent Hashing Gateway Queues
  • 18. Consistent Hashing The Gateway maps queue name -> node This is done using a fixed hash function A prefix of the output bits of the hash function is used as a look-up into a table, with a minimum of 8 buckets per node Load differential is 8:9 or better (down to 15:16) Updating the map of buckets -> nodes is managed centrally Hash(“/room/123”) = 0xaf5… Node A Node B Node C Node D Node E Node F
  • 19. Consistent Hash Table Update Minimizes amount of traffic moved If nodes have more than 8 buckets, steal 1/N of all buckets from those with the most and assign to new target If not, split each bucket, then steal 1/N of all buckets and assign to new target
  • 20. Erlang Developed in ‘80s by Ericsson for phone switches Reliability, scalability, and communications Prolog-based functional syntax (no braces!) 25% the code of equivalent C++ Parallel Communicating Processes Erlang processes much cheaper than C++ threads (Almost) No Mutable Data No data race conditions Each process separately garbage collected
  • 21. Example Erlang Process % spawn process MyCounter = spawn(my_module, counter, [0]). % increment counter MyCounter! {add, 1}. % get valueMyCounter! {get, self()}; receive {value, MyCounter, Value} -> Value end. % stop processMyCounter! stop. counter(stop) -> stopped; counter(Value) ->NextValue = receive {get, Pid} ->Pid! {value, self(), Value}, Value; {add, Delta} -> Value + Delta; stop -> stop; _ -> Valueend, counter(NextValue). % tail recursion
  • 22. Section: Details Load Management Marshalling RPC / Call-outs Hot Adds and Fail-over The Boss! Monitoring
  • 23. Load Management Gateway Queues Internet Gateway Queues HAProxy HAProxy Gateway Queues Consistent Hashing Gateway Queues
  • 24. Marshalling message MsgG2cResult { required uint32 op_id = 1; required uint32 status = 2; optional string error_message = 3; }
  • 25. RPC Web Server PHP HTTP + JSON admin Gateway Message Queue Erlang
  • 26. Call-outs Web Server PHP HTTP + JSON Message Queue Gateway Erlang Mount Credentials Rules
  • 27. Management Gateway Queues The Boss Gateway Queues Gateway Queues Consistent Hashing Gateway Queues
  • 28. Monitoring Example counters: Number of connected users Number of queues Messages routed per second Round trip time for routed messages Distributed clock work-around! Disconnects and other error events
  • 30. Section: Problem Cases User goes silent Second user connection Node crashes Gateway crashes Reliable messages Firewalls Build and test
  • 31. User Goes Silent Some TCP connections will stop(bad WiFi, firewalls, etc) We use a ping message Both ends separately detect ping failure This means one end detects it before the other
  • 32. Second User Connection Currently connected usermakes a new connection To another gateway because of load balancing Auser-specific queue arbitrates Queues are serializedthere is always a winner
  • 33. State is ephemeralit’s lost when machine is lost A user “management queue”contains all subscription state If the home queue node dies, the user is logged out If a queue the user is subscribed to dies, the user is auto-unsubscribed (client has to deal) Node Crashes
  • 34. Gateway Crashes When a gateway crashesclient will reconnect Historyallow us to avoid re-sending for quick reconnects The application above the queue API doesn’t notice Erlang message send does not report error Monitor nodes to remove stale listeners
  • 35. Reliable Messages “If the user isn’t logged in, deliver the next log-in.” Hidden at application server API level, stored in database Return “not logged in” Signal to store message in database Hook logged-in call-out Re-check the logged in state after storing to database (avoids a race)
  • 36. Firewalls HTTP long-poll has one main strength: It works if your browser works Message Queue uses a different protocol We still use ports 80 (“HTTP”) and 443 (“HTTPS”) This makes us horriblepeople We try a configured proxy with CONNECT We reach >99% of existing customers Future improvement: HTTP Upgrade/101
  • 37. Build and Test Continuous Integration and Continuous Deployment Had to build our own systems ErlangIn-place Code Upgrades Too heavy, designed for “6 month” upgrade cycles Use fail-over instead (similar to Apache graceful) Load testing at scale “Dark launch” to existing users
  • 38. Section: Future Replication Similar to fail-over Limits of Scalability (?) M x N (Gateways x Queues) stops at some point Open Source We would like to open-source what we can Protobuf for PHP and Erlang? IMQ core? (not surrounding application server)
  • 39. Q&A Survey If you found this helpful, please circle “Excellent” If this sucked, don’t circle “Excellent” Questions? @jwatte jwatte@imvu.com IMVU is a great place to work, and we’re hiring!