FeatureByte
Simplify Feature Engineering in your Data Warehouse
Table of contents
What is a Machine Learning Feature?
Feature Engineering
Feature Engineering Challenges
Feature Engineering with FeatureByte
System Architecture
FeatureByte Catalog
Intuitive Data modeling
Track Assets in Catalog
Compute Historical Features
Serving Features
What’s coming next?
Visit us to find out more!
Features make up the input data used to train machine learning models and compute predictions.
Each row in the data is an example, which contains:
Features: model inputs or predictors (X)
Target (for training only): model output (Y)
Features and targets are functions of an observation, which is defined by:
Time of the observation (point-in-time)
Entities involved in the observation, e.g. Customer, Product, Employee, Transaction
Ex X1 X2 X3 Y
1 1.1 0 0.9 1
2 0.7 1 0.1 0
3 2.9 0 0.5 0
X -> Y
What is a Machine Learning Feature?
Example: Use ML to recommend products to existing customers
Train model to predict future purchases based on past observations
Require examples with different targets (purchase / no purchase)
Observation 1 (point-in-time 2020-05-23, Customer, Product)
Customer features: Age 23; purchases in past 2 weeks: 3
Product features: Color Pink; sales in past 2 weeks: 368
Customer + product features: Time since last purchase: 11d
Target: Purchase True
Observation 2 (point-in-time 2020-05-23, Customer, Product)
Customer features: Age 47; purchases in past 2 weeks: 1
Product features: Color Gray; sales in past 2 weeks: 150
Customer + product features: Time since last purchase: 63d
Target: Purchase False
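The customer features in the example above can be sketched in plain Python. This is a minimal illustration rather than FeatureByte code: the `Purchase` record, the `customer_features` helper and the purchase history are assumptions made up for the sketch; only the feature ideas and the point-in-time come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Purchase:
    customer: str
    product: str
    timestamp: datetime


def customer_features(purchases, customer, product, point_in_time):
    """Compute features from purchases made strictly before point_in_time."""
    past = [
        p for p in purchases
        if p.customer == customer and p.timestamp < point_in_time
    ]
    window_start = point_in_time - timedelta(weeks=2)
    recent = [p for p in past if p.timestamp >= window_start]
    same_product = [p.timestamp for p in past if p.product == product]
    days_since_last = (
        (point_in_time - max(same_product)).days if same_product else None
    )
    return {
        "purchases_past_2_weeks": len(recent),
        "days_since_last_purchase": days_since_last,
    }


# Hypothetical purchase history for one customer.
purchases = [
    Purchase("c1", "p1", datetime(2020, 5, 12)),
    Purchase("c1", "p2", datetime(2020, 5, 15)),
    Purchase("c1", "p3", datetime(2020, 5, 20)),
    Purchase("c1", "p1", datetime(2020, 5, 25)),  # after the point-in-time: ignored
]

features = customer_features(purchases, "c1", "p1", datetime(2020, 5, 23))
print(features)
```

The purchase dated after the point-in-time is excluded, which is what keeps information from the future out of the features.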
Feature Engineering
Process to transform raw data to create features for machine learning
Training pipeline
Populate features and target for a set of observations
Large number of points in time (captures seasonality, improves generalization)
Large number of observations
Fit ML model
Serving pipeline
Populate features for a set of observations
Usually one point-in-time
Fewer observations, low latency required for some use cases
Make prediction using ML model
Feature Engineering Challenges
Feature formulation and materialization
Data wrangling tools not tailored for feature engineering
Time-awareness handling needs to be implemented
Easy to introduce bugs and target leakage
Can be computationally / memory expensive if not optimized
Transferring large datasets for experimentation

import pandas as pd

# read observations and transactions table
transactions_df = pd.read_parquet("transactions.parquet")
observation_df = pd.read_parquet("observations.parquet")

# pair each unique observation with all transactions of the same account
df = observation_df.drop_duplicates(
    ["AccountID", "POINT_IN_TIME"]
).merge(
    transactions_df, on="AccountID", how="inner"
)
# keep transactions in the 7 days before the point-in-time;
# each comparison needs its own parentheses because & binds tighter than <
mask = (
    ((df.POINT_IN_TIME - df.Timestamp) < pd.Timedelta("7d"))
    & (df.POINT_IN_TIME > df.Timestamp)
)
features_df = df[mask].groupby(
    ["AccountID", "POINT_IN_TIME"]
)["Amount"].sum()

# attach the aggregated feature back to the observations
observation_df = observation_df.merge(
    features_df,
    on=["AccountID", "POINT_IN_TIME"],
    how="left",
)
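One bug that is easy to introduce in window masks like the one above: in Python, `&` binds more tightly than comparison operators, so each comparison has to be wrapped in its own parentheses. A small sketch with plain integers (the values are arbitrary and chosen only for illustration) shows the difference; the same precedence rule applies to pandas Series.

```python
age_days = 3      # hypothetical age of a record, in days
window_days = 7   # window size, in days
is_past = True    # the record is older than the point-in-time

# Without parentheses this parses as: age_days < (window_days & is_past)
unparenthesized = age_days < window_days & is_past

# With parentheses each comparison is evaluated first, then combined
parenthesized = (age_days < window_days) & is_past

print(unparenthesized, parenthesized)
```

With pandas objects the unparenthesized form typically fails with a type error or silently produces the wrong mask, depending on the operand types.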
Training vs serving consistency
Features computed at serving time may not be consistent with training due to imperfect data availability
Unrealistic training accuracy
Impact can be severe if model depends heavily on very recent data not available during serving
Serving pipeline may require separate implementation
Longer time-to-production
Inconsistency with training

# Is this realistic in serving?
mask = (
    ((df.POINT_IN_TIME - df.Timestamp) < pd.Timedelta("7d"))
    & (df.POINT_IN_TIME > df.Timestamp)
)

# Only consider records at least 30 min old?
shifted_PIT = df.POINT_IN_TIME - pd.Timedelta("30min")
mask = (
    ((shifted_PIT - df.Timestamp) < pd.Timedelta("7d"))
    & (shifted_PIT > df.Timestamp)
)
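A sketch of the idea behind the shifted point-in-time, assuming a fixed data-availability delay (called `blind_spot` here, a name chosen for this illustration): the same windowed sum is used for training and serving, and records newer than `point_in_time - blind_spot` are excluded in both. The events and amounts are made up.

```python
from datetime import datetime, timedelta


def windowed_sum(events, point_in_time, window, blind_spot=timedelta(0)):
    """Sum amounts of events with cutoff - window < timestamp < cutoff,
    where cutoff = point_in_time - blind_spot."""
    cutoff = point_in_time - blind_spot
    start = cutoff - window
    return sum(amount for ts, amount in events if start < ts < cutoff)


# Hypothetical (timestamp, amount) records for one account.
events = [
    (datetime(2020, 5, 20, 12, 0), 10.0),
    (datetime(2020, 5, 22, 23, 0), 20.0),
    (datetime(2020, 5, 22, 23, 50), 5.0),  # arrives too late to be used in serving
]
pit = datetime(2020, 5, 23, 0, 0)

naive = windowed_sum(events, pit, timedelta(days=7))
realistic = windowed_sum(events, pit, timedelta(days=7), timedelta(minutes=30))
print(naive, realistic)
```

Training on the `naive` value would let the model rely on a record that the serving pipeline cannot see yet; applying the same delay in both pipelines removes that mismatch.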
Isolation, inconsistency and redundancy
Sources, features can be used for different use cases
Consistent data semantics, data cleaning in different
projects
Sharing and reuse of features
Sharing code -> code duplication and redundant
computation
Propagation of bug fixes
Feature stores are an emerging trend
Single producer multiple consumer
Addresses many problems above, introduces new
challenges
Sharing limited to feature retrieval, some solutions
manage materialization
Feature Engineering with FeatureByte
Open Source Feature Platform
Centralized platform for feature engineering
Manage, share, reuse and track assets (tables, features, featurelists etc)
Experiment with feature engineering quickly
Create, save and retrieve features
Create feature lists using existing + new features
Compute historical features for training
Deploy feature lists for serving
Feature-centric Design
Features are self-contained assets
Data sources, cleaning operations, definition and refresh cadence
Contains everything needed to support training + serving pipelines
Immutable
Track dependencies, usage, status and changes
Get historical features and deploy for online serving easily
Track request outputs with provenance
most_freq_weekday_28d = catalog.get_feature(
    "Most Frequent weekday Over the Last 28d"
)
# Get feature info and explicit code
most_freq_weekday_28d.info()
most_freq_weekday_28d.definition
Materialization in the Data Warehouse
Access source databases in the warehouse
Store feature cache and output tables in the warehouse
Manage storage + compute using SQL
Reduce security risks by avoiding bulk data export / duplication / exposure
Supports Spark, Databricks, Snowflake
Storage and computation optimization
Cache partial aggregates for more efficient computation
Store cache and online values instead of all historical
feature values
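To illustrate why cached partial aggregates help, here is a minimal sketch; it is not FeatureByte's implementation. Amounts are pre-aggregated into daily buckets once (the bucket size, function names and data are assumptions for this example), and a windowed sum for any point-in-time is then assembled from the buckets instead of rescanning raw events.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def build_daily_buckets(events):
    """Pre-aggregate (timestamp, amount) events into per-day partial sums."""
    buckets = defaultdict(float)
    for ts, amount in events:
        buckets[ts.date()] += amount
    return dict(buckets)


def sum_last_n_days(buckets, point_in_date, n_days):
    """Sum the n_days complete daily buckets that end before point_in_date."""
    days = (point_in_date - timedelta(days=i) for i in range(1, n_days + 1))
    return sum(buckets.get(day, 0.0) for day in days)


# Hypothetical raw events.
events = [
    (datetime(2020, 5, 10, 9, 0), 100.0),
    (datetime(2020, 5, 20, 8, 0), 10.0),
    (datetime(2020, 5, 20, 18, 0), 5.0),
    (datetime(2020, 5, 22, 23, 0), 20.0),
]
buckets = build_daily_buckets(events)

spend_7d = sum_last_n_days(buckets, date(2020, 5, 23), 7)
spend_14d = sum_last_n_days(buckets, date(2020, 5, 23), 14)
print(spend_7d, spend_14d)
```

The same buckets serve both the 7-day and the 14-day window, so adding a window or a new point-in-time costs a few lookups rather than another pass over the raw table.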
Python SDK for feature creation
Built-in time-awareness for lookups and joins
Windowed aggregations based on request point-in-time
Automatically emulate serving time behavior in historical
features to minimize train / test inconsistency
Scalable compute in warehouse with optimized SQL
import featurebyte as fb

# get table views
credit_card = catalog.get_view("CREDITCARD")
card_transactions = catalog.get_view("CARDTRANSACTIONS")

# join tables
card_transactions = card_transactions.join(credit_card)

# define spending features
cust_spend_features = card_transactions.groupby(
    "BankCustomerID"
).aggregate_over(
    value_column="Amount",
    method=fb.AggFunc.SUM,
    windows=["7d"],
    feature_names=["total_spend_7d"]
)

# preview features
cust_spend_features.preview(observation_set=observation_set)
System Architecture
Component | Packaging | Purpose
Python SDK | Python package | Connects to the API service to provide feature authoring and management functionality through Python classes and functions.
API Service | Docker container | REST API service that validates and executes requests, queries data warehouses, and stores data.
Worker | Docker container | Executes asynchronous or scheduled tasks.
MongoDB | Docker container | Stores metadata for created assets.
Redis | Docker container | Broker and queue for workers; messenger service for publishing progress updates.
Query Graph Transpiler | Python package | Constructs data transformation steps as a query graph, which can be transpiled to platform-specific SQL.
Source Tables | Data warehouse | Tables used as data sources for feature engineering.
Feature Store | Data warehouse | Database that stores data used to support feature serving.
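The query graph idea can be sketched with a tiny chain of nodes that is turned into SQL recursively. This is only an illustration of the concept: the node classes, the `to_sql` function and the generic SQL it emits are assumptions for this example and do not reflect FeatureByte's actual transpiler or any specific warehouse dialect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    table: str


@dataclass(frozen=True)
class Filter:
    parent: object
    condition: str


@dataclass(frozen=True)
class GroupSum:
    parent: object
    key: str
    value: str


def to_sql(node):
    """Recursively transpile a node and its ancestors into a SQL string."""
    if isinstance(node, Source):
        return f"SELECT * FROM {node.table}"
    if isinstance(node, Filter):
        return f"SELECT * FROM ({to_sql(node.parent)}) AS t WHERE {node.condition}"
    if isinstance(node, GroupSum):
        return (
            f"SELECT {node.key}, SUM({node.value}) AS total "
            f"FROM ({to_sql(node.parent)}) AS t GROUP BY {node.key}"
        )
    raise TypeError(f"unsupported node: {node!r}")


# Transformation steps recorded as nodes, then rendered as one SQL statement.
graph = GroupSum(Filter(Source("CARDTRANSACTIONS"), "Amount > 0"), "AccountID", "Amount")
sql = to_sql(graph)
print(sql)
```

Because the steps are stored as data rather than executed immediately, the same graph could in principle be rendered for different SQL platforms by swapping the rendering function.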
FeatureByte Catalog
A catalog stores tables, entities, features and other ML assets that can be reused, tracked and shared.
Intuitive Data modeling
Information about data model is captured during table registration and entity tagging
# register SCD table from the warehouse
credit_card = data_source.get_source_table(
    "DATASETS", "CREDITCARD", "CREDITCARD"
).create_scd_table(
    "CREDITCARD",
    natural_key_column="AccountID",
    effective_timestamp_column="ValidFrom",
    end_timestamp_column="ValidTo",
)

# register event table from the warehouse
card_transactions = data_source.get_source_table(
    "DATASETS", "CREDITCARD", "CARDTRANSACTIONS"
).create_event_table(
    "CARDTRANSACTIONS",
    event_id_column="CardTransactionId",
    event_timestamp_column="Timestamp",
)

# tag entities in table columns
credit_card.BankCustomerId.as_entity("Customer")
credit_card.AccountID.as_entity("Account")
card_transactions.AccountID.as_entity("Account")
Entity Relationships
Child-parent for feature serving
Table Relationships:
Primary and foreign keys
Table types
Event, Item, Slowly Changing, Dimension
Determine time-awareness in joins, enforce guardrails
Column Semantics
Timestamp, primary key, timezone offset
Supports timezone handling, smart joins
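What a time-aware join against a Slowly Changing Dimension table has to do can be sketched as follows: for each event, pick the dimension row that was effective at the event's timestamp, not the latest row. The `scd_lookup` helper and the sample versions are hypothetical and only illustrate the lookup rule; in FeatureByte the join is generated from the registered table types.

```python
from bisect import bisect_right
from datetime import datetime


def scd_lookup(versions, event_time):
    """Return the value effective at event_time, or None if none applies.

    versions: list of (effective_timestamp, value), sorted by timestamp.
    """
    effective_times = [ts for ts, _ in versions]
    idx = bisect_right(effective_times, event_time) - 1
    return versions[idx][1] if idx >= 0 else None


# Hypothetical history of one account's card tier.
versions = [
    (datetime(2020, 1, 1), "Basic"),
    (datetime(2020, 5, 1), "Gold"),
]

tier_in_march = scd_lookup(versions, datetime(2020, 3, 15))
tier_in_may = scd_lookup(versions, datetime(2020, 5, 23))
tier_before_history = scd_lookup(versions, datetime(2019, 12, 1))
print(tier_in_march, tier_in_may, tier_before_history)
```

Joining on the latest row instead would attach "Gold" to the March event, leaking a later state into an earlier observation.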
Track Assets in Catalog
Saved tables
Saved features
Compute Historical Features
Features can be accessed from the warehouse or downloaded as a Parquet file
my_favorite_features.compute_historical_features(
    observation_set=observation_df
)
Serving Features
Deploy a featurelist for serving
Retrieve features using REST-API request
Track feature job status
# Create and enable new deployment
deployment = my_favorite_features.deploy()
deployment.enable()

curl -X POST \
  -H 'Content-Type: application/json' \
  -H 'active-catalog-id: 64708919ea4c4876a77d2b80' \
  -d '{"entity_serving_names": [{"GROCERYINVOICEGUID": "d1b5d3ae-f37b-4864-a56d-d70d81641577"}]}' \
  http://featurebyte_service/api/v1/deployment/6478b57bb68c91fb84f1e156/online_features
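The same online-features request can be prepared from Python using only the standard library. This sketch reuses the URL, headers and payload from the curl example above; sending it is left commented out because it needs a reachable FeatureByte service, and the shape of the response is not shown in the slides.

```python
import json
import urllib.request

url = (
    "http://featurebyte_service/api/v1/deployment/"
    "6478b57bb68c91fb84f1e156/online_features"
)
payload = {
    "entity_serving_names": [
        {"GROCERYINVOICEGUID": "d1b5d3ae-f37b-4864-a56d-d70d81641577"}
    ]
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "active-catalog-id": "64708919ea4c4876a77d2b80",
    },
    method="POST",
)

# Sending requires a running service, so it is not executed here:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```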
What’s coming next?
User Defined Functions
Register functions in the data warehouse with the SDK for expanded functionality
Access functions outside of the warehouse (e.g. pre-trained models) using external functions
Target Creation
Define, save and reuse targets
Materialize along with features
Low Latency Serving
Deployments maintain jobs for feature refresh but serving is not scalable or low latency
Low-latency serving will use key-value stores and in-memory processing for on-demand computation
Automated Feature Discovery
Recommend features based on semantics and relationships
Visit us to find out more!
https://github.com/featurebyte/featurebyte
Documentation · Website

Contenu connexe

Similaire à Simplify Feature Engineering in Your Data Warehouse

Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store
Accelerating the ML Lifecycle with an Enterprise-Grade Feature StoreAccelerating the ML Lifecycle with an Enterprise-Grade Feature Store
Accelerating the ML Lifecycle with an Enterprise-Grade Feature StoreDatabricks
 
Supercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuerySupercharge your data analytics with BigQuery
Supercharge your data analytics with BigQueryMárton Kodok
 
BigdataConference Europe - BigQuery ML
BigdataConference Europe - BigQuery MLBigdataConference Europe - BigQuery ML
BigdataConference Europe - BigQuery MLMárton Kodok
 
KFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreKFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreDatabricks
 
Kevin Bengtson Portfolio
Kevin Bengtson PortfolioKevin Bengtson Portfolio
Kevin Bengtson PortfolioKbengt521
 
Developing Next-Gen Enterprise Web Application
Developing Next-Gen Enterprise Web ApplicationDeveloping Next-Gen Enterprise Web Application
Developing Next-Gen Enterprise Web ApplicationMark Gu
 
Data Science in the Elastic Stack
Data Science in the Elastic StackData Science in the Elastic Stack
Data Science in the Elastic StackRochelle Sonnenberg
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
MVC Design Pattern in JavaScript by ADMEC Multimedia Institute
MVC Design Pattern in JavaScript by ADMEC Multimedia InstituteMVC Design Pattern in JavaScript by ADMEC Multimedia Institute
MVC Design Pattern in JavaScript by ADMEC Multimedia InstituteRavi Bhadauria
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
 
Transforming Feature Ideas into Machine Learning Inputs
Transforming Feature Ideas into Machine Learning InputsTransforming Feature Ideas into Machine Learning Inputs
Transforming Feature Ideas into Machine Learning InputsFeatureByte
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Christopher Gutknecht
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming AppsWSO2
 
KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0WSO2
 

Similaire à Simplify Feature Engineering in Your Data Warehouse (20)

Venu-Sage X3-resume
Venu-Sage  X3-resumeVenu-Sage  X3-resume
Venu-Sage X3-resume
 
Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store
Accelerating the ML Lifecycle with an Enterprise-Grade Feature StoreAccelerating the ML Lifecycle with an Enterprise-Grade Feature Store
Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store
 
Supercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuerySupercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuery
 
BigdataConference Europe - BigQuery ML
BigdataConference Europe - BigQuery MLBigdataConference Europe - BigQuery ML
BigdataConference Europe - BigQuery ML
 
KFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature StoreKFServing, Model Monitoring with Apache Spark and a Feature Store
KFServing, Model Monitoring with Apache Spark and a Feature Store
 
Kevin Bengtson Portfolio
Kevin Bengtson PortfolioKevin Bengtson Portfolio
Kevin Bengtson Portfolio
 
Developing Next-Gen Enterprise Web Application
Developing Next-Gen Enterprise Web ApplicationDeveloping Next-Gen Enterprise Web Application
Developing Next-Gen Enterprise Web Application
 
Data Science in the Elastic Stack
Data Science in the Elastic StackData Science in the Elastic Stack
Data Science in the Elastic Stack
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
MVC Design Pattern in JavaScript by ADMEC Multimedia Institute
MVC Design Pattern in JavaScript by ADMEC Multimedia InstituteMVC Design Pattern in JavaScript by ADMEC Multimedia Institute
MVC Design Pattern in JavaScript by ADMEC Multimedia Institute
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
 
Transforming Feature Ideas into Machine Learning Inputs
Transforming Feature Ideas into Machine Learning InputsTransforming Feature Ideas into Machine Learning Inputs
Transforming Feature Ideas into Machine Learning Inputs
 
Resume
ResumeResume
Resume
 
Resume (1)
Resume (1)Resume (1)
Resume (1)
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
 
Gui Report Studio in java
Gui Report Studio in javaGui Report Studio in java
Gui Report Studio in java
 
JKSQL
JKSQLJKSQL
JKSQL
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
 
KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0
 

Dernier

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 

Dernier (20)

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 

Simplify Feature Engineering in Your Data Warehouse

  • 2. Table of contents What is a Machine Learning Feature? Feature Engineering Feature Engineering Challenges Feature Engineering with FeatureByte System Architecture FeatureByte Catalog Intuitive Data modeling Track Assets in Catalog Compute Historical Features Serving Features What’s coming next? Visit us to find out more!
  • 3. What is a Machine Learning Feature?
      Features make up the input data used to train machine learning models and compute predictions. Each row in the data is an example, which contains:
      - Features: model inputs or predictors (X)
      - Target (for training only): model output (Y)
      Features and targets are functions of an observation, which consists of:
      - The time of the observation (point-in-time)
      - The entities involved in the observation, e.g. Customer, Product, Employee, Transaction
      Example data (X -> Y):
      Ex  X1   X2  X3   Y
      1   1.1  0   0.9  1
      2   0.7  1   0.1  0
      3   2.9  0   0.5  0
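As a small illustration, the sketch below stores the three example rows from this slide and splits them into model inputs X and target Y. The list-of-dicts layout and the `split_features_target` helper are illustrative assumptions, not part of FeatureByte.

```python
# The three example rows from the slide: X1..X3 are features, Y is the target.
examples = [
    {"X1": 1.1, "X2": 0, "X3": 0.9, "Y": 1},
    {"X1": 0.7, "X2": 1, "X3": 0.1, "Y": 0},
    {"X1": 2.9, "X2": 0, "X3": 0.5, "Y": 0},
]


def split_features_target(rows, target="Y"):
    """Hypothetical helper: separate model inputs (X) from the target (Y)."""
    X = [[value for name, value in row.items() if name != target] for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_target(examples)
```

At serving time only `X` would be populated, since the target is available for training only.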
  • 8. What is a Machine Learning Feature?
      Example: use ML to recommend products to existing customers.
      - Train a model to predict future purchases based on past observations
      - Requires examples with different targets (purchase / no purchase)
      Each observation consists of a point-in-time, a Customer and a Product. Two example observations with their features and target:
      Group                  Field                      Example 1    Example 2
      Observation            Point-in-time              2020-05-23   2020-05-23
      Customer Features      Age                        23           47
      Customer Features      Purchases past 2 weeks     3            1
      Product Features       Color                      Pink         Gray
      Product Features       Sales past 2 weeks         368          150
      Cust + Prod Features   Time since last purchase   11d          63d
      Target                 Purchase                   True         False
  • 9. Feature Engineering: the process of transforming raw data to create features for machine learning
  • 10. Feature Engineering: the process of transforming raw data to create features for machine learning
      Training pipeline:
      - Populate features and target for a set of observations
      - Large number of points-in-time (captures seasonality, better generalization)
      - Large number of observations
      - Fit the ML model
      Serving pipeline:
      - Populate features for a set of observations
      - Usually one point-in-time
      - Fewer observations; low latency required for some use cases
      - Make predictions using the ML model
  • 11. Feature Engineering Challenges: feature formulation and materialization
      - Data wrangling tools are not tailored for feature engineering
      - Time-awareness handling needs to be implemented
      - Easy to introduce bugs and target leakage
      - Can be computationally / memory expensive if not optimized
      - Transferring large datasets for experimentation
      Example (pandas):
          import pandas as pd

          # read observations and transactions table
          transactions_df = pd.read_parquet("transactions.parquet")
          observation_df = pd.read_parquet("observations.parquet")

          # pair every unique observation with all transactions of the same account
          df = observation_df.drop_duplicates(
              ["AccountID", "POINT_IN_TIME"]
          ).merge(
              transactions_df, on="AccountID", how="inner"
          )
          # keep transactions from the 7 days before the point-in-time;
          # each comparison needs its own parentheses because & binds tighter than <
          mask = (
              ((df.POINT_IN_TIME - df.Timestamp) < pd.Timedelta("7d")) &
              (df.POINT_IN_TIME > df.Timestamp)
          )
          # total amount per observation
          features_df = df[mask].groupby(
              ["AccountID", "POINT_IN_TIME"]
          )["Amount"].sum()

          # attach the feature to the observations
          observation_df = observation_df.merge(
              features_df,
              on=["AccountID", "POINT_IN_TIME"],
              how="left",
          )
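To make the time-awareness pitfalls on this slide concrete, here is a self-contained toy version of the same 7-day aggregation. The in-memory data is hypothetical and stands in for the parquet files, and the output column name `total_spend_7d` is chosen for illustration. Note the explicit parentheses around each comparison: in Python, `&` binds more tightly than `<`.

```python
import pandas as pd

# Hypothetical toy data standing in for the parquet files on the slide.
transactions_df = pd.DataFrame({
    "AccountID": ["A", "A", "A", "B"],
    "Timestamp": pd.to_datetime(
        ["2020-05-10", "2020-05-20", "2020-05-23", "2020-05-22"]
    ),
    "Amount": [10.0, 20.0, 40.0, 5.0],
})
observation_df = pd.DataFrame({
    "AccountID": ["A", "B", "C"],
    "POINT_IN_TIME": pd.to_datetime(["2020-05-23"] * 3),
})

keys = ["AccountID", "POINT_IN_TIME"]
df = observation_df.drop_duplicates(keys).merge(
    transactions_df, on="AccountID", how="inner"
)

# Keep transactions strictly before the point-in-time and less than 7 days old.
age = df["POINT_IN_TIME"] - df["Timestamp"]
mask = (age < pd.Timedelta("7d")) & (age > pd.Timedelta(0))

features_df = (
    df[mask].groupby(keys)["Amount"].sum().rename("total_spend_7d").reset_index()
)
# Left join so observations without matching transactions are kept (as NaN).
result = observation_df.merge(features_df, on=keys, how="left")
```

Account A's transaction stamped exactly at the point-in-time is excluded by the strict inequality, which is the guard against target leakage; account C has no transactions and ends up with a missing value rather than being dropped.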
  • 12. Feature Engineering Challenges: training vs serving consistency
      - Features computed at serving time may not be consistent with training due to imperfect data availability
          - Unrealistic training accuracy
          - Impact can be severe if the model depends heavily on very recent data that is not available during serving
      - Serving pipeline may require a separate implementation
          - Longer time-to-production
          - Inconsistency with training
      Example (pandas):
          # Is this realistic in serving?
          mask = (
              ((df.POINT_IN_TIME - df.Timestamp) < pd.Timedelta("7d")) &
              (df.POINT_IN_TIME > df.Timestamp)
          )

          # Only consider records at least 30 min old?
          shifted_PIT = df.POINT_IN_TIME - pd.Timedelta("30min")
          mask = (
              ((shifted_PIT - df.Timestamp) < pd.Timedelta("7d")) &
              (shifted_PIT > df.Timestamp)
          )
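The effect of shifting the point-in-time can be sketched with plain Python. The `window_sum` helper and its `blind_spot` parameter are hypothetical names used only for this illustration; they model records that have not yet landed in the warehouse when a prediction is requested.

```python
from datetime import datetime, timedelta


def window_sum(events, point_in_time, window, blind_spot=timedelta(0)):
    """Sum amounts of events with cutoff - window < timestamp < cutoff,
    where cutoff = point_in_time - blind_spot (hypothetical parameter)."""
    cutoff = point_in_time - blind_spot
    return sum(
        amount for timestamp, amount in events
        if cutoff - window < timestamp < cutoff
    )


pit = datetime(2020, 5, 23, 12, 0)
events = [
    (datetime(2020, 5, 20, 9, 0), 20.0),
    (datetime(2020, 5, 23, 11, 50), 40.0),  # only 10 minutes old at the point-in-time
]

# Training-style computation that assumes every record is already available.
optimistic = window_sum(events, pit, timedelta(days=7))
# Same feature, ignoring records younger than 30 minutes.
realistic = window_sum(
    events, pit, timedelta(days=7), blind_spot=timedelta(minutes=30)
)
```

Here `optimistic` includes the 10-minute-old record while `realistic` does not, so a model trained on the first variant would see values it may never receive in serving.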
  • 13. Feature Engineering Challenges: isolation, inconsistency and redundancy
      - Sources and features can be used for different use cases
          - Consistent data semantics and data cleaning across projects
      - Sharing and reuse of features
          - Sharing code -> code duplication and redundant computation
          - Propagation of bug fixes
      - Feature stores are an emerging trend
          - Single producer, multiple consumers
          - Address many of the problems above but introduce new challenges
          - Sharing is limited to feature retrieval; some solutions manage materialization
  • 14. Feature Engineering with FeatureByte: Open Source Feature Platform
      - Centralized platform for feature engineering
          - Manage, share, reuse and track assets (tables, features, feature lists etc.)
      - Experiment with feature engineering quickly
          - Create, save and retrieve features
          - Create feature lists using existing + new features
          - Compute historical features for training
          - Deploy feature lists for serving
  • 15. Feature Engineering with FeatureByte: feature-centric design
      - Features are self-contained assets
          - Data sources, cleaning operations, definition and refresh cadence
          - Contain everything needed to support training + serving pipelines
          - Immutable
      - Track dependencies, usage, status and changes
      - Get historical features and deploy for online serving easily
      - Track request outputs with provenance
      Example:
          most_freq_weekday_28d = catalog.get_feature(
              "Most Frequent weekday Over the Last 28d"
          )
          # Get feature info and explicit code
          most_freq_weekday_28d.info()
          most_freq_weekday_28d.definition
  • 16. Feature Engineering with FeatureByte: materialization in the data warehouse
      - Access source databases in the warehouse
      - Store feature cache and output tables in the warehouse
      - Manage storage + compute using SQL
      - Reduce security risks by avoiding bulk data export / duplication / exposure
      - Supports Spark, Databricks, Snowflake
      Storage and computation optimization:
      - Cache partial aggregates for more efficient computation
      - Store cache and online values instead of all historical feature values
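The idea of caching partial aggregates can be sketched as follows. This is one common approach (per-day partial sums that are combined at request time), shown under the assumption of daily buckets; the slides do not describe FeatureByte's actual cache layout, and both helper functions are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_daily_partials(transactions):
    """Pre-aggregate (day, amount) pairs into per-day partial sums."""
    partials = defaultdict(float)
    for day, amount in transactions:
        partials[day] += amount
    return dict(partials)


def window_sum(partials, end_day, days):
    """Combine cached partials for the `days` days before `end_day` (exclusive)."""
    return sum(
        partials.get(end_day - timedelta(days=offset), 0.0)
        for offset in range(1, days + 1)
    )


transactions = [
    (date(2020, 5, 16), 7.0),
    (date(2020, 5, 20), 20.0),
    (date(2020, 5, 20), 5.0),
    (date(2020, 5, 22), 10.0),
]
partials = build_daily_partials(transactions)

# Both windows reuse the same cached partials; raw rows are not rescanned.
spend_7d = window_sum(partials, date(2020, 5, 23), 7)
spend_7d_next = window_sum(partials, date(2020, 5, 24), 7)
```

Because a sum is decomposable, moving the window forward by one day only changes which cached buckets are combined, which is what makes the cache cheaper than recomputing from raw transactions.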
  • 17. Feature Engineering with FeatureByte: Python SDK for feature creation
      - Built-in time-awareness for lookups and joins
      - Windowed aggregations based on request point-in-time
      - Automatically emulate serving-time behavior in historical features to minimize train / test inconsistency
      - Scalable compute in warehouse with optimized SQL
      Example (assumes an active `catalog` and an existing `observation_set`):
          import featurebyte as fb

          # get table views
          credit_card = catalog.get_view("CREDITCARD")
          card_transactions = catalog.get_view("CARDTRANSACTIONS")

          # join tables
          card_transactions = card_transactions.join(credit_card)

          # define spending features
          cust_spend_features = card_transactions.groupby(
              "BankCustomerID"
          ).aggregate_over(
              value_column="Amount",
              method=fb.AggFunc.SUM,
              windows=["7d"],
              feature_names=["total_spend_7d"]
          )

          # preview features
          cust_spend_features.preview(observation_set=observation_set)
  • 18. System Architecture
      Component              | Packaging        | Purpose
      Python SDK             | Python package   | Connects to the API service to provide feature authoring and management functionality through Python classes and functions.
      API Service            | Docker container | REST API service that validates and executes requests, queries data warehouses, and stores data.
      Worker                 | Docker container | Executes asynchronous or scheduled tasks.
      MongoDB                | Docker container | Stores metadata for created assets.
      Redis                  | Docker container | Broker and queue for workers; messenger service for publishing progress updates.
      Query Graph Transpiler | Python package   | Constructs data transformation steps as a query graph, which can be transpiled to platform-specific SQL.
      Source Tables          | Data warehouse   | Tables used as data sources for feature engineering.
      Feature Store          | Data warehouse   | Database that stores data used to support feature serving.
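To give a feel for what "a query graph transpiled to platform-specific SQL" could mean, here is a deliberately tiny sketch. The node classes, the `to_sql` function and the generated SQL shape are all hypothetical and are not FeatureByte's transpiler; the only platform difference modelled is identifier quoting (double quotes for Snowflake, backticks for Spark), and filter conditions are passed through verbatim.

```python
from dataclasses import dataclass


@dataclass
class Source:
    table: str


@dataclass
class Filter:
    parent: object
    condition: str  # raw SQL condition, passed through unchanged


@dataclass
class Aggregate:
    parent: object
    key: str
    column: str


# Identifier quote character per target platform.
QUOTES = {"snowflake": '"', "spark": "`"}


def to_sql(node, dialect):
    """Recursively turn a graph of nodes into a nested SQL query."""
    q = QUOTES[dialect]
    if isinstance(node, Source):
        return f"SELECT * FROM {q}{node.table}{q}"
    if isinstance(node, Filter):
        return f"SELECT * FROM ({to_sql(node.parent, dialect)}) AS t WHERE {node.condition}"
    if isinstance(node, Aggregate):
        key, col = f"{q}{node.key}{q}", f"{q}{node.column}{q}"
        return (
            f"SELECT {key}, SUM({col}) AS total FROM "
            f"({to_sql(node.parent, dialect)}) AS t GROUP BY {key}"
        )
    raise TypeError(f"unknown node type: {type(node).__name__}")


graph = Aggregate(Filter(Source("CARDTRANSACTIONS"), "1 = 1"), "AccountID", "Amount")
snowflake_sql = to_sql(graph, "snowflake")
spark_sql = to_sql(graph, "spark")
```

The same `graph` object yields two SQL strings, which is the property that lets feature definitions stay independent of the warehouse they run on.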
  • 19. FeatureByte Catalog: a catalog stores tables, entities, features and other ML assets that can be reused, tracked and shared.
  • 20. Intuitive Data modeling: information about the data model is captured during table registration and entity tagging
          # register SCD table from the warehouse
          credit_card = data_source.get_source_table(
              "DATASETS", "CREDITCARD", "CREDITCARD"
          ).create_scd_table(
              "CREDITCARD",
              natural_key_column="AccountID",
              effective_timestamp_column="ValidFrom",
              end_timestamp_column="ValidTo",
          )

          # register event table from the warehouse
          card_transactions = data_source.get_source_table(
              "DATASETS", "CREDITCARD", "CARDTRANSACTIONS"
          ).create_event_table(
              "CARDTRANSACTIONS",
              event_id_column="CardTransactionId",
              event_timestamp_column="Timestamp",
          )

          # tag entities in table columns
          credit_card.BankCustomerId.as_entity("Customer")
          credit_card.AccountID.as_entity("Account")
          card_transactions.AccountID.as_entity("Account")
  • 23. Intuitive Data modeling: information about the data model is captured during table registration and entity tagging
      - Entity relationships: child-parent, for feature serving
      - Table relationships: primary and foreign keys
      - Table types: Event, Item, Slowly Changing, Dimension
          - Determine time-awareness in joins, enforce guardrails
      - Column semantics: timestamp, primary key, timezone offset
          - Support timezone handling, smart joins
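Why the table type matters for joins can be illustrated with plain pandas. The sketch below joins an event table to a slowly changing dimension so that each event picks up the dimension record that was effective at the event's timestamp. It uses `pandas.merge_asof` as a stand-in technique; the toy data is hypothetical and this is not how FeatureByte implements its joins.

```python
import pandas as pd

# Hypothetical slowly changing dimension: the customer linked to an account over time.
credit_card = pd.DataFrame({
    "AccountID": ["A", "A"],
    "ValidFrom": pd.to_datetime(["2020-01-01", "2020-05-01"]),
    "BankCustomerID": ["cust_1", "cust_2"],
})
# Hypothetical event table.
card_transactions = pd.DataFrame({
    "AccountID": ["A", "A"],
    "Timestamp": pd.to_datetime(["2020-04-15", "2020-05-20"]),
    "Amount": [10.0, 20.0],
})

# For each event, take the latest dimension row with ValidFrom <= Timestamp
# for the same account. Both inputs must be sorted on their time keys.
joined = pd.merge_asof(
    card_transactions.sort_values("Timestamp"),
    credit_card.sort_values("ValidFrom"),
    left_on="Timestamp",
    right_on="ValidFrom",
    by="AccountID",
    direction="backward",
)
```

A plain equality join on `AccountID` would instead attach both customers to every transaction, which is the kind of mistake the table-type guardrails on this slide are meant to prevent.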
  • 24. Track Assets in Catalog: saved tables, saved features
  • 25. Compute Historical Features: features can be accessed from the warehouse or downloaded to a parquet file
          my_favorite_features.compute_historical_features(
              observation_set=observation_df
          )
  • 26. Serving Features
      - Deploy a feature list for serving
      - Retrieve features using a REST API request
      - Track feature job status
      Create and enable a deployment:
          # Create and enable new deployment
          deployment = my_favorite_features.deploy()
          deployment.enable()
      Request online features:
          curl -X POST \
            -H 'Content-Type: application/json' \
            -H 'active-catalog-id: 64708919ea4c4876a77d2b80' \
            -d '{"entity_serving_names": [{"GROCERYINVOICEGUID": "d1b5d3ae-f37b-4864-a56d-d70d81641577"}]}' \
            http://featurebyte_service/api/v1/deployment/6478b57bb68c91fb84f1e156/online_features
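The curl request on this slide can also be issued from Python. The sketch below builds the same request with the standard library, copying the URL, headers and payload from the slide; the `fetch_online_features` helper is a hypothetical wrapper, and the shape of the response is not shown in the slides, so it is simply decoded as JSON.

```python
import json
import urllib.request

# Values copied from the curl example on this slide.
url = (
    "http://featurebyte_service/api/v1/deployment/"
    "6478b57bb68c91fb84f1e156/online_features"
)
payload = {
    "entity_serving_names": [
        {"GROCERYINVOICEGUID": "d1b5d3ae-f37b-4864-a56d-d70d81641577"}
    ]
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "active-catalog-id": "64708919ea4c4876a77d2b80",
    },
    method="POST",
)


def fetch_online_features(req):
    """Hypothetical helper: send the request and decode the JSON response.
    Requires a reachable FeatureByte service, so it is not called here."""
    with urllib.request.urlopen(req) as response:
        return json.load(response)
```

Calling `fetch_online_features(request)` against a running service would perform the same POST as the curl command.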
  • 27. What’s coming next?
      - User Defined Functions
          - Register functions in the data warehouse with the SDK for expanded functionality
          - Access functions outside of the warehouse (e.g. pre-trained models) using external functions
      - Target Creation
          - Define, save and reuse targets
          - Materialize along with features
      - Low Latency Serving
          - Deployments maintain jobs for feature refresh, but serving is not scalable or low latency
          - Low-latency serving will use key-value stores and in-memory processing for on-demand computation
      - Automated Feature Discovery
          - Recommend features based on semantics and relationships
  • 28. Visit us to find out more! https://github.com/featurebyte/featurebyte Documentation · Website