SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Code Cleanup
A Data Scientist’s Guide to
✨Sparkling Code✨
Corrie Bartelheimer
Senior Data Scientist
“Why should I care about clean code?”
– a young data scientist
“Why don’t they care about clean code?”
– an engineer working with data scientists
Lies we tell ourselves
● “I’m only gonna run this query once”
● “This won’t change”
● “No one's gonna need this again”
● “I don’t think we’ll run this analysis
again”
● “We can just deploy the notebook to
production”
Reality
● “Can you check what happens if ….?”
● “Let me add just this one small thing”
● “Can you have a look at these old notebook
from your former coworker?”
● “Probably not!”
We read
more code
than we
write!
“Dirty” Code Slows Down:
● Takes more time to
understand
● Harder to change
● Often hides bugs
● Makes reproducibility
difficult
What makes code easy to read?
Meaningful
Names
def transform(data):
for k, v in data.items():
data[k] = round(v * 0.9, 2)
return data
Meaningful
Names
def transform(data):
for k, v in data.items():
# calculate new price
data[k] = round(v * 0.9, 2)
return data
Meaningful
Names
● Use descriptive names
○ Longer > shorter
○ Searchable
● Explain code through
naming
● Avoid generic names like
transform, item, ..
etc
def discount_prices(room_prices, discount=0.1):
for room, price in room_prices.items():
new_price = price * (1 - discount)
room_prices[room] = round(new_price, 2)
return room_prices
Use Short
Functions!
def compute_features(df_map, other_data):
destination_cols = ["destination_id", "cdestination_name",
"latitude","longitude",]
keep_cols = ["some_id","another_id","hotel_id"]
# keep only unique hotels
data = df_map[[*keep_cols, *destination_cols]].drop_duplicates(
subset=["some_id", "destination_id"]
)
# mean of features
mean_features = (
df_map.groupby("some_id")[["stars","review_score"]].mean().reset_index()
)
# add prices
df = (
df_map[["some_id", "other_id"]]
.dropna(subset=["some_id", "other_id"])
.drop_duplicates()
)
room_price = ...
min_price = (
pd.merge(room_price, df, on="some_id", how="inner")
.groupby(["other_id", "currency"])
.agg(min_price=("price", np.min))
.assign(min_price_usd=lambda x: x["min_price"].copy())
.reset_index()
)
# align index
data = pd.merge(hotel_df, min_price, on="some_id", how="left")
# add room count
rooms_count = (
df_map.groupby("some_id")["other_id"].nunique().rename("rooms_count")
)
data = pd.merge(data, rooms_count.reset_index(), on="some_id", how="left")
Use Short
Functions!
● Structure Long code into
shorter functions
Also, most comments are
bad
● Comments tend to go
stale
● Instead of comments,
explain the what in the
function name
● Keep comments to
explain the why
def compute_features(df_map, other_data):
destination_cols = ["destination_id", "cdestination_name",
"latitude","longitude",]
keep_cols = ["some_id","another_id","hotel_id"]
# keep only unique hotels
data = df_map[[*keep_cols, *destination_cols]].drop_duplicates(
subset=["some_id", "destination_id"]
)
# mean of features
mean_features = (
df_map.groupby("some_id")[["stars","review_score"]].mean().reset_index()
)
# add prices
df = (
df_map[["some_id", "other_id"]]
.dropna(subset=["some_id", "other_id"])
.drop_duplicates()
)
room_price = ...
min_price = (
pd.merge(room_price, df, on="some_id", how="inner")
.groupby(["other_id", "currency"])
.agg(min_price=("price", np.min))
.assign(min_price_usd=lambda x: x["min_price"].copy())
.reset_index()
)
# align index
data = pd.merge(hotel_df, min_price, on="some_id", how="left")
# add room count
rooms_count = (
df_map.groupby("some_id")["other_id"].nunique().rename("rooms_count")
)
data = pd.merge(data, rooms_count.reset_index(), on="some_id", how="left")
def compute_features(df_map, other_data):
data = get_unique_hotels(df_map)
mean_features = get_mean_features(df_map)
data = add_mean_price(data, mean_features)
data = add_room_count(data, other_data)
return data
Avoid Mixing
Abstraction
Layers
Don’t put low-level
operations on the same level
as high-level functions
bad_ids = [hotel_ + str(id_) for id_ in [123, 42, 888]]
df = df[~df.hotel_name.isin(bad_ids)]
features = compute_features(df)
Avoid Mixing
Abstraction
Layers
● Group operations
together that have the
same level of
abstraction
● Structure code into
abstraction hierarchies
by using functions
● Same for variable
names
bad_ids = [hotel_ + str(id_) for id_ in [123, 42, 888]]
df = df[~df.hotel_name.isin(bad_ids)]
features = compute_features(df)
df_filtered = filter_out_bad_ids(df)
features = compute_features(df_filtered)
Better:
Ain’t got no time for this
Measure Prioritize
Visualize
Or someone else…
Measuring
Code Complexity
Function Length
● Checks length of a
function
● Checks number of input
and output parameter
def some_long_function(
first_parameter: int,
second_parameter: int,
third_parameter: int,
):
first_parameter = (
first_parameter +
second_parameter +
third_parameter
)
first_parameter = (
first_parameter -
second_parameter +
third_parameter
)
first_parameter = (
first_parameter +
second_parameter -
third_parameter
)
first_parameter = (
first_parameter
second_parameter
third_parameter
)
return first_parameter
Cognitive
Complexity
● Increments for breaks
in the flow
○ loops &
conditionals
○ catch, switch
statements
○ breaks, continue
● Increments for nested
structures
def f(a, b):
if a:
for i in range(b):
if b:
return 1
Abstraction Layer
● Only allow generic
variable names in small
functions
○ items, var, variables,
result, …
○ The more generic,
the smaller the
function
def foo(variables)
items = []
for var in variables:
items += [var]
return items
Expression
Complexity
● Measures complexity of
expressions
● Rule of Thumb: if
expression goes over
multiple lines, consider
splitting it
if (df.sold_out.any() and
df[~(df.is_new_hotel & ~df.is_covid)
| df.price_value.isna()].any()):
do_something(df)
All of these
available as flake8
plugin:
- flake8-cognitive-complexity
- flake8-adjustable-complexity
- flake8-expression-complexity
- flake8-functions
Now on to collect some
data…
Compute
Complexity
Heuristics
● Used the flake8
implementation to compute
complexity heuristics
● Depending on needs, decide
how often to run
● Jupyter notebook might be
good enough 🤪
● Store in your favorite
database
❤
Visualize & Prioritize
How are your repositories
developing over time?
Visualize & Prioritize
● How are your repositories
developing over time?
● Which files/functions should
we tackle first?
● Set aside some time to focus
on improving your code base.
● As little as 2hrs a months
every month can have a big
impact
Good candidate to
refactor
Visualize & Prioritize
● How are your repositories
developing over time?
● Which files/functions should
we tackle first?
● Set aside some time to focus
on improving your code base.
● As little as 2hrs a months
every month can have a big
impact
After Refactoring:
👏👏👏
In the end, it’s about
culture
How to foster a clean code culture in your team:
● Regularly sync & discuss which code needs
improvement
● Fixed time per month to work on code
quality
● Use pair programming for additional
knowledge sharing
● Learn about best practices through e.g.
reading groups
Thank you for your time!
@corrieaar corrie-bartelheimer www.samples-of-thoughts.com
github.com/corriebar/code-complexity
Resources
● Clean Code [ videos] by Uncle Bob
○ All in Java but still worth a read
○ Recommended Chapters: 1-5 and 17. Chp 6-10 go deeper into data structures,
classes etc. 14-16 are Java very specific and less interesting for python
developers
● Notebook to compute complexity

Contenu connexe

Tendances

Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe
 Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe
Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’EuropeASCAME
 
AirBNB - Belonging in the City
AirBNB - Belonging in the CityAirBNB - Belonging in the City
AirBNB - Belonging in the Citymiramarcus
 
Présentation yelloh! village
Présentation yelloh! villagePrésentation yelloh! village
Présentation yelloh! villageVENICA Céline
 
☆Lx pantos company profile (eng) huawei_20220127
☆Lx pantos company profile (eng) huawei_20220127☆Lx pantos company profile (eng) huawei_20220127
☆Lx pantos company profile (eng) huawei_20220127eghaesa
 
Complete Research Guide on Airbnb like App Development: History, Market Poten...
Complete Research Guide on Airbnb like App Development: History, Market Poten...Complete Research Guide on Airbnb like App Development: History, Market Poten...
Complete Research Guide on Airbnb like App Development: History, Market Poten...RipenApps Technologies Pvt. Ltd.
 
Systèmes multi agents concepts et mise en oeuvre avec le middleware jade
Systèmes multi agents concepts et mise en oeuvre avec le middleware jadeSystèmes multi agents concepts et mise en oeuvre avec le middleware jade
Systèmes multi agents concepts et mise en oeuvre avec le middleware jadeENSET, Université Hassan II Casablanca
 
ZenLedger.io Early Seed Deck
ZenLedger.io Early Seed DeckZenLedger.io Early Seed Deck
ZenLedger.io Early Seed DeckZenLedger.io
 

Tendances (10)

Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe
 Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe
Tanger: Porte du Maroc sur l’Afrique, l’Amérique et l’Europe
 
AirBNB - Belonging in the City
AirBNB - Belonging in the CityAirBNB - Belonging in the City
AirBNB - Belonging in the City
 
Presentation technologie
Presentation technologiePresentation technologie
Presentation technologie
 
Présentation yelloh! village
Présentation yelloh! villagePrésentation yelloh! village
Présentation yelloh! village
 
☆Lx pantos company profile (eng) huawei_20220127
☆Lx pantos company profile (eng) huawei_20220127☆Lx pantos company profile (eng) huawei_20220127
☆Lx pantos company profile (eng) huawei_20220127
 
Airbnb case study
Airbnb case studyAirbnb case study
Airbnb case study
 
Complete Research Guide on Airbnb like App Development: History, Market Poten...
Complete Research Guide on Airbnb like App Development: History, Market Poten...Complete Research Guide on Airbnb like App Development: History, Market Poten...
Complete Research Guide on Airbnb like App Development: History, Market Poten...
 
Systèmes multi agents concepts et mise en oeuvre avec le middleware jade
Systèmes multi agents concepts et mise en oeuvre avec le middleware jadeSystèmes multi agents concepts et mise en oeuvre avec le middleware jade
Systèmes multi agents concepts et mise en oeuvre avec le middleware jade
 
Staffbus Pitch Deck
Staffbus Pitch DeckStaffbus Pitch Deck
Staffbus Pitch Deck
 
ZenLedger.io Early Seed Deck
ZenLedger.io Early Seed DeckZenLedger.io Early Seed Deck
ZenLedger.io Early Seed Deck
 

Similaire à Code Cleanup: A Data Scientist's Guide to Sparkling Code

A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMHolden Karau
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable CodeBaidu, Inc.
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScriptJorg Janke
 
DataWeave 2.0 Language Fundamentals
DataWeave 2.0 Language FundamentalsDataWeave 2.0 Language Fundamentals
DataWeave 2.0 Language FundamentalsJoshua Erney
 
Spark Structured APIs
Spark Structured APIsSpark Structured APIs
Spark Structured APIsKnoldus Inc.
 
Webinar: Avoiding Sub-optimal Performance in your Retail Application
Webinar: Avoiding Sub-optimal Performance in your Retail ApplicationWebinar: Avoiding Sub-optimal Performance in your Retail Application
Webinar: Avoiding Sub-optimal Performance in your Retail ApplicationMongoDB
 
Lotusphere 2007 AD505 DevBlast 30 LotusScript Tips
Lotusphere 2007 AD505 DevBlast 30 LotusScript TipsLotusphere 2007 AD505 DevBlast 30 LotusScript Tips
Lotusphere 2007 AD505 DevBlast 30 LotusScript TipsBill Buchan
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Robust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time ChecksRobust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time ChecksStoyan Nikolov
 
How and Where in GLORP
How and Where in GLORPHow and Where in GLORP
How and Where in GLORPESUG
 
Glorp Tutorial Guide
Glorp Tutorial GuideGlorp Tutorial Guide
Glorp Tutorial GuideESUG
 
Sharable of qualities of clean code
Sharable of qualities of clean codeSharable of qualities of clean code
Sharable of qualities of clean codeEman Mohamed
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Jonathan Felch
 
Php melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddyPhp melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddyDouglas Reith
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Structured web programming
Structured web programmingStructured web programming
Structured web programmingahfast
 

Similaire à Code Cleanup: A Data Scientist's Guide to Sparkling Code (20)

A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScript
 
DataWeave 2.0 Language Fundamentals
DataWeave 2.0 Language FundamentalsDataWeave 2.0 Language Fundamentals
DataWeave 2.0 Language Fundamentals
 
Spark Structured APIs
Spark Structured APIsSpark Structured APIs
Spark Structured APIs
 
Webinar: Avoiding Sub-optimal Performance in your Retail Application
Webinar: Avoiding Sub-optimal Performance in your Retail ApplicationWebinar: Avoiding Sub-optimal Performance in your Retail Application
Webinar: Avoiding Sub-optimal Performance in your Retail Application
 
Lotusphere 2007 AD505 DevBlast 30 LotusScript Tips
Lotusphere 2007 AD505 DevBlast 30 LotusScript TipsLotusphere 2007 AD505 DevBlast 30 LotusScript Tips
Lotusphere 2007 AD505 DevBlast 30 LotusScript Tips
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Robust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time ChecksRobust C++ Task Systems Through Compile-time Checks
Robust C++ Task Systems Through Compile-time Checks
 
How and Where in GLORP
How and Where in GLORPHow and Where in GLORP
How and Where in GLORP
 
Clean Code 2
Clean Code 2Clean Code 2
Clean Code 2
 
Glorp Tutorial Guide
Glorp Tutorial GuideGlorp Tutorial Guide
Glorp Tutorial Guide
 
Sharable of qualities of clean code
Sharable of qualities of clean codeSharable of qualities of clean code
Sharable of qualities of clean code
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
 
Php melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddyPhp melb cqrs-ddd-predaddy
Php melb cqrs-ddd-predaddy
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Structured web programming
Structured web programmingStructured web programming
Structured web programming
 

Dernier

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Dernier (20)

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Code Cleanup: A Data Scientist's Guide to Sparkling Code

  • 1. Code Cleanup A Data Scientist’s Guide to ✨Sparkling Code✨ Corrie Bartelheimer Senior Data Scientist
  • 2. “Why should I care about clean code?” – a young data scientist
  • 3. “Why don’t they care about clean code?” – an engineer working with data scientists
  • 4. Lies we tell ourselves ● “I’m only gonna run this query once” ● “This won’t change” ● “No one's gonna need this again” ● “I don’t think we’ll run this analysis again” ● “We can just deploy the notebook to production” Reality ● “Can you check what happens if ….?” ● “Let me add just this one small thing” ● “Can you have a look at these old notebook from your former coworker?” ● “Probably not!”
  • 5. We read more code than we write! “Dirty” Code Slows Down: ● Takes more time to understand ● Harder to change ● Often hides bugs ● Makes reproducibility difficult
  • 6. What makes code easy to read?
  • 7. Meaningful Names def transform(data): for k, v in data.items(): data[k] = round(v * 0.9, 2) return data
  • 8. Meaningful Names def transform(data): for k, v in data.items(): # calculate new price data[k] = round(v * 0.9, 2) return data
  • 9. Meaningful Names ● Use descriptive names ○ Longer > shorter ○ Searchable ● Explain code through naming ● Avoid generic names like transform, item, .. etc def discount_prices(room_prices, discount=0.1): for room, price in room_prices.items(): new_price = price * (1 - discount) room_prices[room] = round(new_price, 2) return room_prices
  • 10. Use Short Functions! def compute_features(df_map, other_data): destination_cols = ["destination_id", "cdestination_name", "latitude","longitude",] keep_cols = ["some_id","another_id","hotel_id"] # keep only unique hotels data = df_map[[*keep_cols, *destination_cols]].drop_duplicates( subset=["some_id", "destination_id"] ) # mean of features mean_features = ( df_map.groupby("some_id")[["stars","review_score"]].mean().reset_index() ) # add prices df = ( df_map[["some_id", "other_id"]] .dropna(subset=["some_id", "other_id"]) .drop_duplicates() ) room_price = ... min_price = ( pd.merge(room_price, df, on="some_id", how="inner") .groupby(["other_id", "currency"]) .agg(min_price=("price", np.min)) .assign(min_price_usd=lambda x: x["min_price"].copy()) .reset_index() ) # align index data = pd.merge(hotel_df, min_price, on="some_id", how="left") # add room count rooms_count = ( df_map.groupby("some_id")["other_id"].nunique().rename("rooms_count") ) data = pd.merge(data, rooms_count.reset_index(), on="some_id", how="left")
  • 11. Use Short Functions! ● Structure Long code into shorter functions Also, most comments are bad ● Comments tend to go stale ● Instead of comments, explain the what in the function name ● Keep comments to explain the why def compute_features(df_map, other_data): destination_cols = ["destination_id", "cdestination_name", "latitude","longitude",] keep_cols = ["some_id","another_id","hotel_id"] # keep only unique hotels data = df_map[[*keep_cols, *destination_cols]].drop_duplicates( subset=["some_id", "destination_id"] ) # mean of features mean_features = ( df_map.groupby("some_id")[["stars","review_score"]].mean().reset_index() ) # add prices df = ( df_map[["some_id", "other_id"]] .dropna(subset=["some_id", "other_id"]) .drop_duplicates() ) room_price = ... min_price = ( pd.merge(room_price, df, on="some_id", how="inner") .groupby(["other_id", "currency"]) .agg(min_price=("price", np.min)) .assign(min_price_usd=lambda x: x["min_price"].copy()) .reset_index() ) # align index data = pd.merge(hotel_df, min_price, on="some_id", how="left") # add room count rooms_count = ( df_map.groupby("some_id")["other_id"].nunique().rename("rooms_count") ) data = pd.merge(data, rooms_count.reset_index(), on="some_id", how="left") def compute_features(df_map, other_data): data = get_unique_hotels(df_map) mean_features = get_mean_features(df_map) data = add_mean_price(data, mean_features) data = add_room_count(data, other_data) return data
  • 12. Avoid Mixing Abstraction Layers Don’t put low-level operations on the same level as high-level functions bad_ids = [hotel_ + str(id_) for id_ in [123, 42, 888]] df = df[~df.hotel_name.isin(bad_ids)] features = compute_features(df)
  • 13. Avoid Mixing Abstraction Layers ● Group operations together that have the same level of abstraction ● Structure code into abstraction hierarchies by using functions ● Same for variable names bad_ids = [hotel_ + str(id_) for id_ in [123, 42, 888]] df = df[~df.hotel_name.isin(bad_ids)] features = compute_features(df) df_filtered = filter_out_bad_ids(df) features = compute_features(df_filtered) Better:
  • 14. Ain’t got no time for this
  • 18. Function Length ● Checks length of a function ● Checks number of input and output parameter def some_long_function( first_parameter: int, second_parameter: int, third_parameter: int, ): first_parameter = ( first_parameter + second_parameter + third_parameter ) first_parameter = ( first_parameter - second_parameter + third_parameter ) first_parameter = ( first_parameter + second_parameter - third_parameter ) first_parameter = ( first_parameter second_parameter third_parameter ) return first_parameter
  • 19. Cognitive Complexity ● Increments for breaks in the flow ○ loops & conditionals ○ catch, switch statements ○ breaks, continue ● Increments for nested structures def f(a, b): if a: for i in range(b): if b: return 1
  • 20. Abstraction Layer ● Only allow generic variable names in small functions ○ items, var, variables, result, … ○ The more generic, the smaller the function def foo(variables) items = [] for var in variables: items += [var] return items
  • 21. Expression Complexity ● Measures complexity of expressions ● Rule of Thumb: if expression goes over multiple lines, consider splitting it if (df.sold_out.any() and df[~(df.is_new_hotel & ~df.is_covid) | df.price_value.isna()].any()): do_something(df)
  • 22. All of these available as flake8 plugin: - flake8-cognitive-complexity - flake8-adjustable-complexity - flake8-expression-complexity - flake8-functions
  • 23. Now on to collect some data…
  • 24. Compute Complexity Heuristics ● Used the flake8 implementation to compute complexity heuristics ● Depending on needs, decide how often to run ● Jupyter notebook might be good enough 🤪 ● Store in your favorite database ❤
  • 25. Visualize & Prioritize How are your repositories developing over time?
  • 26. Visualize & Prioritize ● How are your repositories developing over time? ● Which files/functions should we tackle first? ● Set aside some time to focus on improving your code base. ● As little as 2hrs a months every month can have a big impact Good candidate to refactor
  • 27. Visualize & Prioritize ● How are your repositories developing over time? ● Which files/functions should we tackle first? ● Set aside some time to focus on improving your code base. ● As little as 2hrs a months every month can have a big impact After Refactoring: 👏👏👏
  • 28. In the end, it’s about culture How to foster a clean code culture in your team: ● Regularly sync & discuss which code needs improvement ● Fixed time per month to work on code quality ● Use pair programming for additional knowledge sharing ● Learn about best practices through e.g. reading groups
  • 29. Thank you for your time! @corrieaar corrie-bartelheimer www.samples-of-thoughts.com github.com/corriebar/code-complexity
  • 30. Resources ● Clean Code [ videos] by Uncle Bob ○ All in Java but still worth a read ○ Recommended Chapters: 1-5 and 17. Chp 6-10 go deeper into data structures, classes etc. 14-16 are Java very specific and less interesting for python developers ● Notebook to compute complexity