SlideShare une entreprise Scribd logo
1  sur  43
Télécharger pour lire hors ligne
No bid left behind
My day to day handling a resilient real time bidding platform in a JVM environment. 
Marc de Palol
Trovit
Hey hi,
• Studied here (good to be back)
• Some research on supercomputing
• Moved to London, discovered Hadoop & intensive
data systems.
• Came back, still in the ‘Data Engineering’ stuff.
A classified search engine for property, jobs, cars, products and holiday rentals
• 180 Million ads,
• 170 Tb in the cluster
• 65 Million uniques / 170 Million visits
• 10 apps (iOS, Android)
• Cool office in Barcelona.
have a look at http://www.trovit.es
Real Time Bidding
It’s about selling ads.
• Per impression basis.
• Programmatic instantaneous auction
We are using ‘DoubleClick Ad Exchange’ (Google)
• Response under 100 ms.
• If 15% of our responses are invalid or timed out,
we stop getting bid requests progressively
Currently 10.000 QPS.
This system, literally, spends money. So, it must be rock solid.
Our system is coded carefully, with love and tests.
Still, sh*t happens.*t Happens
Resiliency
The ability to recover from unexpected errors.
The ability to sleep at night.
Detect Recover Warn
Detect Recover Warn
Monitoring
Resiliency
Patterns
Notifications
Monitoring, in a sensible way
• Logging with ‘mailAppender’
log4j.appender.mail=org.apache.log4j.net.SMTPAppender
log4j.appender.mail.SMTPHost=localhost
log4j.appender.mail.From=Error <error-bla@trovit.com>
log4j.appender.mail.To=tech@trovit.com, ceo@trovit.com
log4j.appender.mail.Subject=[ERROR] WE ARE GOING TO DIE
log4j.appender.mail.layout=org.apache.log4j.PatternLayout
log4j.appender.mail.threshold=ERROR
• Logging with ‘mailAppender’
Probably, no e-mail when you’ve got an OOM.
log4j.appender.mail=org.apache.log4j.net.SMTPAppender
log4j.appender.mail.SMTPHost=localhost
log4j.appender.mail.From=Error <error-bla@trovit.com>
log4j.appender.mail.To=tech@trovit.com, ceo@trovit.com
log4j.appender.mail.Subject=[ERROR] WE ARE GOING TO DIE
log4j.appender.mail.layout=org.apache.log4j.PatternLayout
log4j.appender.mail.threshold=ERROR
Let’s talk about OOM for a
minute.
Let’s talk about OOM for a
minute.
ps ax | grep java
Let’s talk about OOM for a
minute.
ps ax | grep java
JVMOpts=“-
XX:OnOutOfMemoryError=
/usr/local/bin/slack-msg.sh"
🚫
👍
Some cool ideas for improving memory usage
• byte[] serialization in objects ❗
• Varying Memory Conditions ❗
• Logging with ‘mailAppender’
• Bad when OOM.
• Logging with ‘mailAppender’
• Bad when OOM.
• Heartbeat
• Doing some real work
• Logging with ‘mailAppender’
• Bad when OOM.
• Heartbeat
• Doing some real work
• Supervision with actors
• If you’re using Akka
• control flow != data flow
Our Monitoring:
• Nagios.
• Logging (to Sentry)
• Heartbeats with real work.
• graphite comparison
Our Monitoring:
• Nagios.
• Logging (to Sentry)
• Heartbeats with real work.
• graphite comparison
Have graphs
Now we know that something
is going wrong.
Recovery
Bad data in the system
or / and
Errors in the system
Data errors.
Roll back (when possible)
• Keeping different versions in the DB.
• Keep the old version around.
• Know how to do a rollback.
Data errors.
Roll back (when possible)
• Keeping different versions in the DB.
• Keep the old version around.
• Know how to do a rollback.
Checks & Asserts with google guava.
checkArgument(i >= 0,
"Argument was %s but expected nonnegative", i);
checkArgument(i < j,
"Expected i < j, but %s > %s", i, j);
checkNotNull(myList,
"List should not be null")
checkState(object.isValid(),
"Object is not valid")
System errors
These happen mostly between system integrations.
• Your code and the DB.
• Your code and the 3rd party library.
• Your code and the queue.
DBs, a necessary supervillain
• Lost connection.
• Timeouts
• Can give you corrupted data.
• Can give you 0 data.
• Can give you too much data.
Circuit Breaker and his friend,
the Bulkhead Pattern.
Circuit Breaker
Our Beloved
CircuitBreakers
Bulkhead
Once the circuit breaker is open,
• Notify
• Try again! maybe.
• Try to avoid DOS your own system.
• Exponential retry.
• Failover
• Restart
Some other bits and pieces:
• Tight coupling leads to fast propagation of errors.
• Event driven stuff
• Complete parameter checking
• Avoid SPF’s. Pretty please.
• Stateless is better.
• Bounded queues!
Your turn.
mdepalol@trovit.com
@lant
[]
http://www.maxisciences.com/destruction/wallpaper

Contenu connexe

En vedette

En vedette (6)

Competing to be unique
Competing to be uniqueCompeting to be unique
Competing to be unique
 
Hfile
HfileHfile
Hfile
 
High Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and SolutionsHigh Performance Erlang - Pitfalls and Solutions
High Performance Erlang - Pitfalls and Solutions
 
Erlang containers
Erlang containersErlang containers
Erlang containers
 
State of the art introduction
State of the art introductionState of the art introduction
State of the art introduction
 
Netty from the trenches
Netty from the trenchesNetty from the trenches
Netty from the trenches
 

Similaire à No bid left behind: Building a resilient real time bidding platform

Trending with Purpose
Trending with PurposeTrending with Purpose
Trending with PurposeJason Dixon
 
Big Data Berlin - Criteo
Big Data Berlin - CriteoBig Data Berlin - Criteo
Big Data Berlin - CriteoSofian Djamaa
 
How Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with SnowplowHow Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with SnowplowGiuseppe Gaviani
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012MapR Technologies
 
Breaking the oracle tie
Breaking the oracle tieBreaking the oracle tie
Breaking the oracle tieagiamas
 
Rubyslava + PyVo #48
Rubyslava + PyVo #48Rubyslava + PyVo #48
Rubyslava + PyVo #48Jozef Képesi
 
MySQL Performance Monitoring
MySQL Performance MonitoringMySQL Performance Monitoring
MySQL Performance Monitoringspil-engineering
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebula Project
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installationsNETWAYS
 
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tDefcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tpseudor00t overflow
 
Prophet - Beijing Perl Workshop
Prophet - Beijing Perl WorkshopProphet - Beijing Perl Workshop
Prophet - Beijing Perl WorkshopJesse Vincent
 
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Nick Galbreath
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)
BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)
BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)Clancy Childs
 
Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Andy Sykes
 
The Big Data Journey at Connexity - Big Data Day LA 2015
The Big Data Journey at Connexity - Big Data Day LA 2015The Big Data Journey at Connexity - Big Data Day LA 2015
The Big Data Journey at Connexity - Big Data Day LA 2015Will Gage
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajSri Ambati
 

Similaire à No bid left behind: Building a resilient real time bidding platform (20)

Trending with Purpose
Trending with PurposeTrending with Purpose
Trending with Purpose
 
Big Data Berlin - Criteo
Big Data Berlin - CriteoBig Data Berlin - Criteo
Big Data Berlin - Criteo
 
How Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with SnowplowHow Gousto is moving to just-in-time personalization with Snowplow
How Gousto is moving to just-in-time personalization with Snowplow
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
 
Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
Breaking the oracle tie
Breaking the oracle tieBreaking the oracle tie
Breaking the oracle tie
 
Rubyslava + PyVo #48
Rubyslava + PyVo #48Rubyslava + PyVo #48
Rubyslava + PyVo #48
 
Your app works slowly. Now what?
Your app works slowly. Now what?Your app works slowly. Now what?
Your app works slowly. Now what?
 
MySQL Performance Monitoring
MySQL Performance MonitoringMySQL Performance Monitoring
MySQL Performance Monitoring
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installations
 
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00tDefcon 21-pinto-defending-networks-machine-learning by pseudor00t
Defcon 21-pinto-defending-networks-machine-learning by pseudor00t
 
Prophet - Beijing Perl Workshop
Prophet - Beijing Perl WorkshopProphet - Beijing Perl Workshop
Prophet - Beijing Perl Workshop
 
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)
BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)
BYO/DIY Analytics Platform (MeasureCamp Presentation by Clancy Childs)
 
Message passing
Message passingMessage passing
Message passing
 
Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)
 
The Big Data Journey at Connexity - Big Data Day LA 2015
The Big Data Journey at Connexity - Big Data Day LA 2015The Big Data Journey at Connexity - Big Data Day LA 2015
The Big Data Journey at Connexity - Big Data Day LA 2015
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 

Dernier

Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsDEEPRAJ PATHAK
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 

Dernier (20)

Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 

No bid left behind: Building a resilient real time bidding platform

  • 1. No bid left behind My day to day handling a resilient real time bidding platform in a JVM environment.  Marc de Palol Trovit
  • 2. Hey hi, • Studied here (good to be back) • Some research on supercomputing • Moved to London, discovered Hadoop & intensive data systems. • Came back, still in the ‘Data Engineering’ stuff.
  • 3. A classified search engine for property, jobs, cars, products and holiday rentals • 180 Million ads, • 170 Tb in the cluster • 65 Million uniques / 170 Million visits • 10 apps (iOS, Android) • Cool office in Barcelona. have a look at http://www.trovit.es
  • 4. Real Time Bidding It’s about selling ads. • Per impression basis. • Programmatic instantaneous auction
  • 5. We are using ‘DoubleClick Ad Exchange’ (Google) • Response under 100 ms. • If 15% of our responses are invalid or timed out, we stop getting bid requests progressively
  • 7. This system, literally, spends money. So, it must be rock solid. Our system is coded carefully, with love and tests.
  • 9. Resiliency The ability to recover from unexpected errors. The ability to sleep at night.
  • 10.
  • 11.
  • 12.
  • 15. Monitoring, in a sensible way
  • 16. • Logging with ‘mailAppender’ log4j.appender.mail=org.apache.log4j.net.SMTPAppender log4j.appender.mail.SMTPHost=localhost log4j.appender.mail.From=Error <error-bla@trovit.com> log4j.appender.mail.To=tech@trovit.com, ceo@trovit.com log4j.appender.mail.Subject=[ERROR] WE ARE GOING TO DIE log4j.appender.mail.layout=org.apache.log4j.PatternLayout log4j.appender.mail.threshold=ERROR
  • 17. • Logging with ‘mailAppender’ Probably, no e-mail when you’ve got an OOM. log4j.appender.mail=org.apache.log4j.net.SMTPAppender log4j.appender.mail.SMTPHost=localhost log4j.appender.mail.From=Error <error-bla@trovit.com> log4j.appender.mail.To=tech@trovit.com, ceo@trovit.com log4j.appender.mail.Subject=[ERROR] WE ARE GOING TO DIE log4j.appender.mail.layout=org.apache.log4j.PatternLayout log4j.appender.mail.threshold=ERROR
  • 18. Let’s talk about OOM for a minute.
  • 19. Let’s talk about OOM for a minute. ps ax | grep java
  • 20. Let’s talk about OOM for a minute. ps ax | grep java JVMOpts=“- XX:OnOutOfMemoryError= /usr/local/bin/slack-msg.sh" 🚫 👍
  • 21. Some cool ideas for improving memory usage • byte[] serialization in objects ❗ • Varying Memory Conditions ❗
  • 22. • Logging with ‘mailAppender’ • Bad when OOM.
  • 23. • Logging with ‘mailAppender’ • Bad when OOM. • Heartbeat • Doing some real work
  • 24. • Logging with ‘mailAppender’ • Bad when OOM. • Heartbeat • Doing some real work • Supervision with actors • If you’re using Akka • control flow != data flow
  • 25. Our Monitoring: • Nagios. • Logging (to Sentry) • Heartbeats with real work. • graphite comparison
  • 26. Our Monitoring: • Nagios. • Logging (to Sentry) • Heartbeats with real work. • graphite comparison
  • 28. Now we know that something is going wrong.
  • 30. Bad data in the system or / and Errors in the system
  • 31. Data errors. Roll back (when possible) • Keeping different versions in the DB. • Keep the old version around. • Know how to do a rollback.
  • 32. Data errors. Roll back (when possible) • Keeping different versions in the DB. • Keep the old version around. • Know how to do a rollback.
  • 33. Checks & Asserts with google guava. checkArgument(i >= 0, "Argument was %s but expected nonnegative", i); checkArgument(i < j, "Expected i < j, but %s > %s", i, j); checkNotNull(myList, "List should not be null") checkState(object.isValid(), "Object is not valid")
  • 34. System errors These happen mostly between system integrations. • Your code and the DB. • Your code and the 3rd party library. • Your code and the queue.
  • 35. DBs, a necessary supervillain • Lost connection. • Timeouts • Can give you corrupted data. • Can give you 0 data. • Can give you too much data.
  • 36. Circuit Breaker and his friend, the Bulkhead Pattern.
  • 37.
  • 41. Once the circuit breaker is open, • Notify • Try again! maybe. • Try to avoid DOS your own system. • Exponential retry. • Failover • Restart
  • 42. Some other bits and pieces: • Tight coupling leads to fast propagation of errors. • Event driven stuff • Complete parameter checking • Avoid SPF’s. Pretty please. • Stateless is better. • Bounded queues!