Hotels.com's Journey to Becoming an Algorithmic Business...Exponential Growth in Data Science Whilst Migrating to Apache Spark+Cloud All at the Same Time with Matt Fryer
In the last year Hotels.com has begun its journey to becoming an algorithmic business. Matt will talk about their experience of exponential growth in data science algorithms while, at the same time, the team migrated from SAS / SQL to Spark as its core underlying architecture, migrated from on-premise to the cloud, and transformed the capability of the data science function. He will also highlight the key enablers that have made this successful, including CEO support, the internal concept of organic intelligence, and how Databricks has helped make this happen, as well as the pitfalls on the journey.
1. Confidential - do not distribute
Hotels.com’s journey to becoming
an Algorithmic Business
Matthew Fryer
VP, Chief Data Science Officer
mfryer@hotels.com
2. Confidential - do not distribute
Part of Expedia, Inc. family
>385,000 properties
89 countries
39 languages
>30m Hotels.com Rewards Members
Home of Captain Obvious
Billions of Recommendations, based on real-time Data per day
Hotels.com
5. Confidential - do not distribute
Data Science Engineering Front End Development
6. Confidential - do not distribute
“Artificial Intelligence Will Be
Travel’s Next Big Thing”
Barry Diller
Chairman & Senior Executive,
Expedia, Inc.
3M’s are disruptive
technology
Mobile
Messaging / NLP
Machine Learning
8. Confidential - do not distribute 8
Core Elements of our Data Science Cloud Platform
Databricks Unified Platform
Maestro – Our Internally Developed
Platform on AWS
(EMR, Spark, RStudio, IntelliJ, SBT, Jupyter,
Zeppelin, Unit / QA, Metastore, Apache Airflow,
Keras, TensorFlow)
Proof of Concept on Google
Cloud, Beam, Spark &
Tensorflow
9. Confidential - do not distribute
Databricks Unified Platform
Chart is in 1 hour blocks, y axis = number of 32 core instances
• Key asset to the success of data science at Hotels.com
• Key in driving up data scientist productivity / efficiency / flexibility
• Makes our data science lifecycle much easier and faster to
operate, improving speed to market
• Reliable / secure + facilitates ‘Highly Elastic’ workflows exploiting
cost-effective spot instances on AWS.
10. Confidential - do not distribute
The hidden secret of data science and AI
Typically, data scientists invest large amounts of
time in feature / data engineering, areas which are ripe for
a technology solution
11. Confidential - do not distribute 11
ALPs – Algorithm Lifecycle Pipeline Service
The end to end ML Platform
12. Confidential - do not distribute
[Diagram: ALPs pipeline between Site Data and Hotels.com, in two stages: Training; Scoring & serving]
Diagram labels: Clickstream; Ingestion; Data pipelines; Data set generation, feature extraction; Training; Train & deploy model; Calculate scores; Store & serve scores; Cache; Service; Real-time scoring / bandit; Experiment; Assign variant; Reporting; Update feedback loop with CTR, GP etc.
Supporting layers: Data capture; Accessible data; Data pipelines; Frameworks & Platforms; Lifecycle / Deploy
Aims: Develop and maintain ML / AI pipelines; Methods to research & exploit ML & AI innovation; Implement ML / AI in production
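The "Real-time scoring / bandit", "Assign variant" and "Update feedback loop with CTR" labels on this slide can be illustrated with a toy epsilon-greedy bandit. This is only a minimal sketch under stated assumptions: the slides do not say which bandit algorithm ALPs uses, and the class name, variant names, click rates and epsilon value below are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Toy epsilon-greedy bandit; not the ALPs implementation (assumption)."""

    def __init__(self, variants, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {v: 0 for v in variants}
        self.clicks = {v: 0 for v in variants}

    def ctr(self, variant):
        # Observed click-through rate; 0.0 until the variant has been shown.
        shown = self.shows[variant]
        return self.clicks[variant] / shown if shown else 0.0

    def assign_variant(self):
        # Explore with probability epsilon, otherwise exploit the best CTR so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))
        return max(self.shows, key=self.ctr)

    def update(self, variant, clicked):
        # Feedback loop: record one impression and whether it was clicked.
        self.shows[variant] += 1
        self.clicks[variant] += int(clicked)


def simulate(true_ctr, rounds=5000, seed=0):
    """Run the bandit against hypothetical 'true' click rates per variant."""
    bandit = EpsilonGreedyBandit(list(true_ctr), epsilon=0.1, seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        variant = bandit.assign_variant()
        bandit.update(variant, env.random() < true_ctr[variant])
    return bandit


if __name__ == "__main__":
    result = simulate({"ranking_a": 0.02, "ranking_b": 0.20})
    for name in result.shows:
        print(name, result.shows[name], round(result.ctr(name), 3))
```

In a setup like the one on the slide, `update` would be driven by clickstream events rather than a simulated environment, and the reward could be CTR, GP or another business metric.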
13. Confidential - do not distribute
Reference: The Influence of Visuals in Online Hotel Research and Booking Behaviour
Images are an important factor while choosing a hotel
[Bar chart: "Factors other than price/location", share of respondents rating each factor Important or Very Important, on a 0% to 80% axis. Factors shown: Loyalty Program, Reviews, Hotel Brand, Star Rating, Destination Info, Images, Hotel Info]
14. Confidential - do not distribute
Computer Vision problems we try to tackle
Near Duplicate Detection
Scene Classification
Image Ranking
16. Confidential - do not distribute 16
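GPUs quickly became key; it took a large effort to optimize using
Keras + TensorFlow (Inception v3 + ResNet)
[Chart: training time in days, log axis from 1 to 1000, by hardware configuration: 12-CPU; 1-GPU; 1-GPU + limited cache; 16-GPU + limited cache; 16-GPU + full cache. Values shown, in chart order: 493, 67, 20, 7, 4. Legend: CIFAR2, Expedia Small]
[Chart: training time in days, axis from 0 to 20, for 16-GPU + full cache and Optimized. Values shown, in chart order: 15, 2.5]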
17. Confidential - do not distribute
Near Duplicate Detection: Real world examples
Non-Duplicates – probability 100%
Non-Duplicates – probability 95.91%
Duplicates – probability 97.98%
Duplicates – probability 98.43%
18. Confidential - do not distribute
Using the model: Real world examples
ROOM/BATHROOM
EXTERIOR/HOTEL
INTERIOR/SEATING_LOBBY
ROOM/LIVING_ROOM
ROOM/GUESTROOM
FACILITIES/DINING
INTERIOR/SEATING_LOBBY
FACILITIES/POOL
19. Confidential - do not distribute
Accuracy & Confusion Matrix
• After many manual, long-winded
iterations of regularization and
hyperparameter tuning
• We achieved good
accuracy and low
confusion between classes
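For readers unfamiliar with the two metrics named on this slide, they can be computed from true and predicted labels as sketched below. The class names are taken from the scene labels shown earlier in the deck; the label data is made up purely for illustration and is not Hotels.com's evaluation data.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Illustrative (made-up) labels for three scene classes.
labels = ["ROOM/BATHROOM", "ROOM/GUESTROOM", "FACILITIES/POOL"]
y_true = ["ROOM/BATHROOM", "ROOM/BATHROOM", "ROOM/GUESTROOM",
          "ROOM/GUESTROOM", "ROOM/GUESTROOM", "FACILITIES/POOL"]
y_pred = ["ROOM/BATHROOM", "ROOM/GUESTROOM", "ROOM/GUESTROOM",
          "ROOM/GUESTROOM", "ROOM/BATHROOM", "FACILITIES/POOL"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:16s} {row}")
print("accuracy:", round(accuracy(matrix), 3))
```

Off-diagonal cells show which classes the model mixes up (here, bathrooms and guest rooms), which is often more actionable than the single accuracy figure.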
20. Confidential - do not distribute
Optimizing the photo order for improved customer
experiences
Original Model
Reference: Radisson Blu Edwardian Berkshire Hotel, London
21. Confidential - do not distribute
Finding the right hotel in our marketplace is core to
our customers' needs.
22. Confidential - do not distribute
As an example, different user segments like to stay in
different locations
[Map labels: Kensington, Bloomsbury, Heathrow, Canary Wharf, Paddington, Westminster, London City Airport, Chelsea, Battersea, Wimbledon, Wembley, City of London]
23. Confidential - do not distribute 23
[Chart: Utility (three panels) vs. Intent (click); annotations: "just browsing!", "BOOK!"]
24. Confidential - do not distribute
Thank you
mfryer@hotels.com
https://uk.linkedin.com/in/matthewfryer
@mattfryer
Editor's notes
Comments
I checked the IR website for the latest data that we have made public (including Annual Report)
Was planning to only briefly linger on this slide, will call out a few data points especially recommendation volume + loyalty member etc
Comments (This slide has a build on it, you can see it by slideshow view)
General thank you to Spark Summit and Databricks for inviting me
Share the goal of the presentation, e.g. highlight the focus on transforming customer experiences with algorithms at Hotels.com, our move to Spark / cloud in the last year, and some of the interesting things we are doing
Link to the slide: highlight that it used to feel like there was data everywhere; the size of the torch is growing every day
Comments
Creation / build of a new data science function, move to public cloud (mainly AWS + some Azure / GCP) from on-prem, and move to Spark from SAS / core Hadoop, all in the last year
As per the title, comment that we are entering a golden age of data science where we can now use data to find patterns and build algorithms to help customer experiences
Imagine the world when we enter adulthood, aka maturity
Given the potential, I think we are all toddlers with so much more to learn and figure out
Better to be fast first (example of testing and freedom to innovate); often being correct is a bonus!
Comments
It has taken complete teamwork from across the business to deliver success and well-aligned pipelines
i) Built by creating a data science function in the last 2 years; it is a team effort, and data science / algorithms sit on the back of the workhorse of engineering
ii) This allows algorithms to make choices and understand patterns to optimize for customer experiences rather than limited optimization.
iii) Part of the secret has been matching data scientists with dedicated data, network, devops and software engineers on the platform
iv) Create a community (big group hug) to share approaches and work together for success
Overall we have >20 amazing data scientists / >15 dedicated data science engineers + growing fast
+ 100s of analysts and engineers
Comments
machine learning and artificial intelligence will combine to manage companies’ big data troves and there will be layers of innovation “tacked onto distribution systems.”
Key has been support from the very top of the company, which has been vital to move forward at pace with wider organization alignment
Dara's comment from the last earnings call on the 3 Ms and organic intelligence
I think AI is a good deal down the road. I think right now, we are more dependent on OI, organic intelligence, here; of folks here at the company. I think as far as disruptive technology, I do like to talk about the 3 Ms, and it's not disruptive. It's just happening. One is mobile for us. And right now with most brands, over 1/3 of our transactions are mobile. Over half of our traffic is mobile. And the cool thing about mobile is it's always on and it gives you location context. The second M for us that's emerging especially in the APAC markets, are messaging. And what messaging does for us is it allows two-way communication at any time, but it also combines identity with that communication. And once you have identity, you can start communicating with someone on a one-to-one basis. Most of our systems right now are built to serve the average. This is a consumer where you come to Expedia, most of our systems are built to serve the average consumer. Now more and more we can optimize to the specific customer, and you combine that with a third M, which is machine learning, it is only possible to optimize to the individual based on very significant amounts of data, very significant amounts of interaction so that you can start treating every single customer in a different way. You can go back to the olden days when your travel agent knew exactly what you wanted. This is going to be disruptive, but it's going to be a slow disruption as we learn more
Comments
i) Hotels.com / Expedia, in the first travel revolution, empowered consumers globally with transparency of price, variety, choice and content
ii) Machine learning / algorithms are creating the next travel revolution, transforming consumer experiences and effectively powering the return of the travel agent
iii) The future is having the conversation with the travel agent again, with a modern twist via messaging. 20 years ago we re-invented and democratised travel and information (when green screens were around); now, with data science and Spark, we can power personalised experiences and give customers access to the best of both.
iv) Machine learning is now at the strategic core of the growth and future of Hotels.com
Databricks
Optimised for Data Science
Easy to use UI (Notebook)
Advanced Job Scheduling
Spot pool capability
Great for algorithm development & feature engineering (aka ETL)
Awesome support from Spark Engineers
Maestro – AWS Platform
Integrated platform for large model development and deployment
Advanced cluster support including Maven / Artifactory
Maestro Framework (Internal extension to Spark ML)
Individual environment per Data Scientist
Fast to R&D / Fully ephemeral
Google
Launched PoC on Google Cloud
Evaluate Google approach to AI / Machine Learning including
Tensorflow
GPU
NLP / Vision APIs
ML Engine
Datalab notebooks
Apache Beam / Dataflow
It is the code-base responsible for building Machine Learning models for HCOM.
It is developed in-house using Scala 2.11 & Spark 2.0 (migrating to 2.1)
It is a ML framework which:
Standardizes and speeds up the way we build models.
Provides all the necessary tools for training, testing & validating models
Google PoC
Facilitated use of ‘extreme elasticity’ incl spot instances
Saving on cost whilst using huge compute power
Speed to market dramatically increases
Spot instance costs ~10-20% of On Demand cost
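To make the cost point concrete, the arithmetic behind "spot instances cost ~10-20% of On Demand" can be sketched as follows. The hourly price, instance count and job duration are hypothetical placeholders, not Hotels.com figures; only the 10-20% range comes from the notes above.

```python
def spot_cost_range(on_demand_hourly, instances, hours, spot_fraction=(0.10, 0.20)):
    """Return (on-demand total, low spot estimate, high spot estimate).

    spot_fraction is the assumed spot price as a share of the on-demand
    price; the default uses the ~10-20% range quoted in the notes.
    """
    on_demand_total = on_demand_hourly * instances * hours
    low = on_demand_total * spot_fraction[0]
    high = on_demand_total * spot_fraction[1]
    return on_demand_total, low, high


# Hypothetical job: 100 instances for 3 hours at 1.50 per instance-hour.
on_demand, spot_low, spot_high = spot_cost_range(1.50, instances=100, hours=3)
print(f"on-demand: {on_demand:.2f}")
print(f"spot:      {spot_low:.2f} to {spot_high:.2f}")
print(f"saving:    {on_demand - spot_high:.2f} to {on_demand - spot_low:.2f}")
```

The same function can be used to compare a short, very wide cluster with a long, narrow one: at a fixed number of instance-hours the cost is the same, which is what makes "extreme elasticity" attractive for speed to market.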
Databricks is making things easy and doing image classification across the Expedia portfolio
Highlight the value proposition you get with Databricks over open source Spark
Works out of the box. Elasticity, ease of use, notebooks, etc.
Across millions of hotel and user submitted images
Critical use case on mobile
Comment
Highlight that images have historically not been algorithmically optimized
+ Now have 100s of thousands of user photos to categorise and sort
Built in Spark and TensorFlow, using a convolutional neural net approach with some surprisingly good accuracy; would recommend everyone try their hand at deep learning.
Target: Detect near-duplicate images on the PDP.
Dataset: A synthetic dataset produced by applying transformations on hotel photos. Size ~ 6 million images.
Network: A custom Siamese network on top of the Scene Detection classifier.
Results: 99.97% accuracy on the synthetic dataset. Validated on real world images.
It is important to use your own data: we obtained 82% from off-the-shelf deep learning APIs (such as the Google Vision API).
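The synthetic dataset idea described above (near-duplicates produced by transforming hotel photos) can be sketched in a few lines. This is an illustrative assumption rather than the actual pipeline: the notes do not say which transformations were used, so the flip, brightness shift and border crop below are stand-ins, and images are represented as small grayscale pixel grids to keep the example dependency-free.

```python
import random


def flip_horizontal(img):
    return [list(reversed(row)) for row in img]


def adjust_brightness(img, delta=20):
    # Clamp to the valid 0-255 pixel range.
    return [[min(255, max(0, px + delta)) for px in row] for row in img]


def crop_border(img, margin=1):
    return [row[margin:len(row) - margin] for row in img[margin:len(img) - margin]]


# Hypothetical transformation set; the real one is not given in the notes.
TRANSFORMS = [flip_horizontal, adjust_brightness, crop_border]


def make_pairs(images, rng):
    """Return (image_a, image_b, label) triples.

    label 1: image_b is a transformed copy of image_a (near-duplicate).
    label 0: image_b is a different source image (non-duplicate).
    Requires at least two source images.
    """
    pairs = []
    for i, img in enumerate(images):
        transform = rng.choice(TRANSFORMS)
        pairs.append((img, transform(img), 1))
        j = rng.choice([k for k in range(len(images)) if k != i])
        pairs.append((img, images[j], 0))
    return pairs


def random_image(rng, size=4):
    return [[rng.randrange(256) for _ in range(size)] for _ in range(size)]


rng = random.Random(0)
images = [random_image(rng) for _ in range(3)]
pairs = make_pairs(images, rng)
print(len(pairs), "pairs,", sum(label for _, _, label in pairs), "near-duplicates")
```

Pairs labelled this way are the kind of input a Siamese network is trained on: both images pass through the same encoder and the network learns to place near-duplicates close together.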
Linking in customer feedback loops with the neural nets to begin optimizing the most relevant sort of images for different customers.
Comments
Personalisation, especially micro-segmentation, is crucial, taking the maximum signals; Spark has enabled us to cope with the scale of data
Popularity balanced with diversity / quality and niche customer needs
All in the context of linking the searches of users, who typically do 4-9 different searches.
Size increase of 20x the data, covering attribution of all customer clicks (typically 4-9 searches per user)
10x the data columns
Facilitates personalization / microsegments