The history of
fire escapes
Tanya Reilly (@whereistanya)
This talk was a keynote at DevOps Days New York in 2018. If you like watching
videos more than reading, the video is at
https://www.youtube.com/watch?v=02KEKtc-5Dc.
Abstract:
When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug,
we usually have a contingency plan. We reduce damage, redirect traffic, page someone,
drop low-priority requests, follow documented procedures. But why do many failures still
come as a surprise? In this talk, we look at some real life analogs to preventing and
managing software failures. Fire partitions. Public safety campaigns. Smoke alarms.
Sprinkler systems. Doors that say “This is not an exit”. And fire escapes. What can we
learn from the real world about expecting failure and designing for it?
“ When we first dropped our
bags on apartment floors…
Welcome To New York
Taylor Swift
My name is Tanya and I'm an immigrant here -- I moved here from Ireland ten years
ago -- and one of the things I love about New York City is that you move here, and it’s
immediately your city. The number one criterion for being a New Yorker is wanting to
be a New Yorker. I love that. The greatest city in the world, Lin-Manuel Miranda said
so :-)
I'm a Site Reliability Engineer and I’m especially interested in what happens when
things fail, and the contingency plans we use to recover when something breaks. And
a few months ago I was thinking about that a lot and walking around the city -- which
is *beautiful* in September/October, the soft light makes the buildings look gorgeous
-- and I started really noticing the fire escapes. They’re a contingency plan too.
They’re for incident response. You don’t use them until all of your regular methods of
getting out of the building have failed.
So I started reading about fire escapes.
Content Warning
Fire and deaths caused by
fire.
Before I talk about fire escapes, let’s talk content. I'm looking at disaster prevention
and disaster recovery in software, by looking at parallels in building fires. This will
include stories of some of the worst fires in the history of the city.
I've intentionally kept this talk as low on vivid details as possible, but we'll be looking
at the reasons fires started, the stuff that helped them spread and how people died.
There are also some pictures of buildings on fire. Nothing lurid, but there are pictures.
If you have raw feelings related to recent fires like Ghost Ship or the
devastating fire in the Bronx just after Christmas, this could be rough.
If you'd be more comfortable skipping this one, you should do that with my blessing.
On the next slide, I'll even tell you what I'm going to say, so you don't miss anything:
Tony Fischer
CC BY-2.0
● focus on better buildings,
not better fire escapes
● focus on better software,
not better incident response
● software needs a fire code
tl;dr
Here's my thesis:
● fire escapes are a hacky bit of afterthought tacked on to the outside of a
building. If you're using fire escapes, it's worth making them as good as
possible, but you’ll prevent more fires if you build better buildings.
● Similarly, incident response is often a hacky bit of afterthought tacked on long
after software is released. Again, great incident response can help you recover
faster than if you don’t have it but… you’ll prevent more outages if you build
better software.
● Finally, buildings have an extremely detailed fire code, but we don't really have
an extremely detailed systems engineering code for software, and maybe we
should have.
If you'd rather not read more about fires, that is ok. Stop here! Otherwise, keep going.
Image: Tony Fischer. CC BY-2.0. https://flic.kr/p/72Lhz1
Chinatown
David Ohmer
CC BY 2.0
Fire escapes were really only built in New York City for a hundred years. They weren't
common until the 1860s, and in the 1960s they stopped being allowed on new
construction. But in that time, they thoroughly changed the face of the city. And they
seem to have really captured the imagination of people who live here. If you search
for 'fire escapes nyc' on flickr you get 13000 pictures.
Image: David Ohmer. CC BY 2.0. https://flic.kr/p/oybSWP
Claudia Heidelberger
CC BY-ND 2.0
Greenwich
Village
There's some debate now about whether we should start removing them in places
where the building has been upgraded, or whether they should be preserved as part
of the city's history. I think at least some of them should be preserved. Look how
beautiful that is!
Image: Claudia Heidelberger. CC BY-ND 2.0. https://flic.kr/p/oqYYv1
East
Village
Dan DeLuca
CC BY-2.0
Here's another lovely one. They made an effort to have it match the style of the
building, not feel like a separate thing tacked on at the end. And I think that's key.
Image: Dan DeLuca. CC BY 2.0. https://flic.kr/p/76Jmb2
“ fire escapes were haphazardly attached
to the most elaborately designed
facades, with no consideration given to
the relationship between the two. The
facade was within the realm of
architecture and the fire escape in the
realm of the law
Richard Plunz, A History of
Housing in New York City
But most of the time, the people adding the fire escape didn't think of it as part of the
building. It was an afterthought. As this quote says, the facade of the building was
architecture but the fire escape was law.
It was an external contingency plan, not part of the main structure. And I think that's
part of why fire escapes ended up not being successful.
Quote: https://books.google.com/books?id=fcKlDAAAQBAJ&pg=PA24
A brief history
of NYC fires!
(With apologies to actual historians)
But I'm jumping to the end. Let's look at the evolution of New York City's fire code.
By the way, I'm not an expert on buildings or fire escapes or the history of New York
City. I read a lot to prepare this talk, and I'll link references along the way, but you
should not consider this a reliable source of historical information. There may be
errors.
Financial
District
1835
On to the history. We’re skipping the great fire of 1776, and jumping straight to 1835
and the Financial District.
This was a commercial, not residential area, and as a result the number of fatalities
was comparatively low -- two people -- I mean, still, two people, but this is mostly
remembered as a fire that cost a LOT of money. The total cost of the damage was
$20 million, which to put it into context, was three times the cost of the entire Erie
Canal, which had opened ten years earlier. Almost 700 buildings were destroyed.
The city had 26 fire insurance companies. This fire put 23 of them out of business.
Image: Library of Congress. Public domain.
https://en.wikipedia.org/wiki/Great_Fire_of_New_York#/media/File:The_Great_Fire_of_the_Cit
y_of_New_York_Dec_16_1835.jpg
What happened?
● contingency plans failed
● no failure domains
● exhausted incident responders
1835
A gas pipe burst in a maze of warehouses. These warehouses were full of extremely
expensive, extremely flammable things for sale: lace, silks, musical instruments, and
so on. It was winter with gale force winds and the fire spread very quickly through the
wooden buildings. Inside two hours it covered 17 city blocks (or 13 acres), most of the
financial district.
The city's water supplies were low and it was a freezing night in December. Before
the fire fighters could pull water from the rivers they had to cut through ice.
At the time it was common to use gunpowder to level buildings and stop the fire
spreading. But there had been a fire two days earlier and they were out of
gunpowder. That fire had been bad: it involved the entire fire department of 1500
people, and they were still exhausted. I’ve seen no literature that says they did badly
-- fire fighters tend to be complete badasses -- but nobody does their best work when
tired. Still, they fought the fire for 15 hours until marines from the Brooklyn Navy Yard
arrived with more gunpowder and made a barrier by blowing up some buildings along
Wall street.
So, two contingency plans failed. When there's a fire, we'll spray water on it and use
gunpowder. But there wasn't enough of either. And having no gunpowder meant no
failure domains, nothing to stop the fire spreading.
Outcome: better incident response
● a bigger, better, non-volunteer fire department
● reliable water: Croton dam and aqueduct
1835
As a result of the fire, the number of firefighters was increased, and they got better
equipment. They stopped using volunteer fire fighters and hired only professionals. And they
built the Croton Dam and Aqueduct. It was built because of the fire, but a
reliable water source is good for lots of reasons!
● they rebuilt in stone
Outcome: better buildings
1835
As well as better incident response, they took the opportunity to make
a more resilient city. The fire spread fast because the buildings were made
of wood. They rebuilt with stone and brick.
And this paid off, ten years later, when there was another enormous fire. The
great fire of 1845 was very bad -- thirty people died -- but it didn’t spread
as far or as fast, because it slowed down when it hit those new brick buildings.
1860 Tenement Fires
The population of New York City doubled every decade between 1800 and 1880.
Maybe you've seen this with teams and software systems: when you're growing
rapidly, it's easy to build some culture problems and some technical debt. This was
certainly true in this case: landlords made more accommodation by splitting big rooms
into many smaller ones, called tenements, mostly without light or ventilation. In the
1860s, more than half the city -- nearly 500,000 people -- lived in tenements.
These were horrible places to live. They were filthy, and riddled with crime and
disease, and every report about them mentioned that they were fire traps. A New
York Times article in 1860 said “If a skillful man, with a deadly hatred of his race in his
heart, sat down to plan a human residence in which to entrap and destroy those who
should dwell in it, it is extremely probable that if he had seen these houses in West
Forty-fifth-street he would take them as a model.”
Image: Moncrief. Public domain.
https://commons.wikimedia.org/wiki/File:LowerEastSideTenements.JPG
Quote:
http://www.nytimes.com/1860/03/29/news/destructive-fires-four-tenement-houses-destroyed-t
wo-mothers-eight-children.html?pagewanted=all
What happened?
● bakery in the basement
● clutter
● no isolation
● obsolete contingency plans
1860
In 1860, two bad tenement fires happened back to back, killing at least twenty people.
The first one, on Elm Street (now Lafayette), started in the bakery on the ground floor
of a six storey building. The wooden stairway burned away, trapping people on the
top floor. They could get to the roof, but this building was four storeys higher than its
neighbours, so there was nowhere to go. The baker was storing a lot of hay and wood
shavings and when they burned, they made dense smoke, which killed some of the
people who lived on the top floors before the fire even got up there.
A month later, on West 45th Street, four houses burned. All four of these had roof
hatches called scuttles, which would have let people escape across the roofs, but
they were missing their ladders so people couldn't get up there. The roofs were
canvas covered in pitch, so when the fire reached them, it spread quickly across the
buildings. No isolation.
These escape plans -- the ladders and scuttles and escaping across the roof -- had
worked fine for a previous iteration of shorter NYC buildings, but they hadn't been
updated for the new shape of the city. I'm sure people had noticed, but until there was
a disaster, it didn't get priority.
Outcome: better buildings
● An Act to Provide Against Unsafe Buildings in the City of
New York
● fire-proof stairs
1860
The city immediately passed a law to make the tenements more robust against fire.
They even put an injunction on new tenement construction until the law was passed.
Now houses for more than eight families (kind of specific) had to have fire-proof stairs
either inside or outside the building.
What’s frustrating about this is that four years earlier a commission had reported that,
if there was a fire, tenants on the 6th and 7th floors of tenements had basically zero
chance of survival. They recommended fireproof stairs. But nothing happened until a
bunch of people died.
1867 The Tenement
House Act
Tenements
must have fire
escapes!
What does that
mean?
¯_(ツ)_/¯
Seven years later, the Tenement House Act was passed. This act had extremely good
goals. It was extremely unsuccessful.
The act said that tenements had to have fire escapes, but it didn't really spell out what
that meant. Buildings had to have a fire escape, but they didn't have to make anyone
safer! So landlords put up fire escapes that couldn’t hold the number of people in the
house, or that weren’t well attached to the walls, or that fed into tiny spaces that
couldn’t hold all the people. And what even was a fire escape? Did a rusty ladder
count? Absolutely!
Let's take a diversion and look at some fire escape patents. I will admit that these are
not especially relevant to devops but they're delightful, so humour me.
Image: Detroit Publishing Co., publisher. Public domain.
https://commons.wikimedia.org/wiki/File:New_York,_N.Y.,_yard_of_tenement_LOC_det.4a185
86.jpg
Things that are fire
escapes
William
Houghton,
1891
This is a ladder with a counterweight. Imagine climbing down from the 7th floor of
your building on one of these. With your six children. In a dress that went to your
ankles.
Image: Scientific American. Public domain.
https://en.wikipedia.org/wiki/Fire_escape#/media/File:Houghton%27s_Fire_Escape_1877.jpg
Things that are fire escapes
Mary
McArthur,
1904
This is a kind of rope ladder that attaches to a window sill.
Patent: http://www.google.com.pg/patents/US800934
Things that are fire escapes
William
Bedinger,
1915
This is a parachute that rolls up very small. The idea was that you'd carry it with you
everywhere in case you were in any tall building fire situations.
Patent: https://www.google.com/patents/US1168465
Things that are fire escapes
Henry Vieregg
1902
"A person desiring to escape seizes one member of the cord, rope, or chain, as
shown in Fig. 1, and forthwith jumps out of the window. [...]"
Like, I am looking at this thing and do not feel like I could forthwith jump out of
anything.
Patent: https://www.google.com/patents/US708846
Things that are fire escapes
Anna
Gonnelly
1887
This is a bridge that you can sling from your roof to another building. It has side rails,
so it's only moderately terrifying.
Patent: https://www.google.com/patents/US368816
Things that are fire escapes
Pasquale
Nigro
1909
This one is just fantastically ludicrous. But good if you want to fight supervillain crime,
I guess?
All of these patents were granted, btw.
Patent: https://www.google.com/patents/US912152
Things that are fire escapes
BB
Openheimer
1879
You might think that this is just a parachute helmet. It is not. It is a parachute helmet
and a pair of very bouncy shoes.
Patent: https://www.google.com/patents/US221855
Things that are fire escapes
Nicholas
Borgfeldt
1882
Finally, I've read this patent three times and I'm fairly convinced that the guy invented
a rope. It's the most Silicon Valley invention of 1882.
Though, let's be clear, rope was a popular kind of fire escape. In fact, it was the state
of the art for hotels.
Patent: https://www.google.com/patents/US267399
The New Rope
Fire-Escape Law
for Hotels
I don't mean a ladder made of rope, I mean literally a rope. Every hotel room had to
have a rope and that was the only fire escape. Even at the time, people found that
pretty terrible.
This is a snarky cartoon from a magazine called Puck, published in 1887, of a whole
lot of people trying to use the ropes.
Image: https://books.google.com/books?id=XwAjAQAAMAAJ&pg=PA48. Pre 1923 so public
domain.
The escape plan
only works for
one of these
people --->
Puck Magazine, 1887
This lady is saying "Slide down a rope in my night-dress, with every body looking at
me? Never! I'll be cremated first!" And it's a fair objection! These escape plans are
designed for the easiest case: someone with good upper body strength and agility
who isn't wearing a skirt or carrying a child. If your disaster plan only works for the
easiest case, it's not a good plan.
I want to emphasise here that a rope is still better than nothing. In fact, probably every
one of these fire escapes, even Mister Parachute Hat, is better than nothing. Once
the fire has started, you’ll be glad of whatever you have. But these escape plans are
not where I would put my efforts if I wanted to have fewer people die in fires. But this
is what the law focused on.
Image: https://books.google.com/books?id=XwAjAQAAMAAJ&pg=PA48 Pre 1923 so public
domain.
1867 The Tenement
House Act
Tenements also must have
windows!
What does that mean?
¯_(ツ)_/¯
Even with fire escapes, tenements were still terrible. They were badly constructed,
overcrowded, there was no ventilation, and -- I find this amazing -- it was perfectly
legal to store lots of combustible materials in them.
One other thing the tenement act said was that every room now had to have a
window. And just like “what even is a fire escape” it didn’t define “what even is a
window”. So the landlords cut holes in interior walls between rooms and called them
"interior windows".
A decade later, the law said sigh, ok, exterior windows. So landlords started
constructing buildings with air shafts, little narrow gaps between buildings.
Bear in mind that there’s no indoor plumbing in these tenements and the bathroom is
maybe down six flights of stairs and now you have an air shaft, so you can imagine
how that goes. One article I read described the air shafts as “festering tubes of
disease” (very poetic), and said that they provided just enough oxygen (and, maybe,
methane) to help spread fires more quickly. So that’s something.
Anyway, many of the fire escapes just led down to these air shafts and there was no
way out from there.
Image: U.S. National Archives and Records Administration. Public domain.
https://en.wikipedia.org/wiki/Old_Law_Tenement#/media/File:Airshaft_of_a_dumbbell_teneme
nt,_New_York_City,_taken_from_the_roof,_ca._1900_-_NARA_-_535468.jpg
1871 More Tenement
House Acts!
Carla Geisser CC BY THANK YOU CARLA <3
By 1871, iron fire escapes were common and of course people were using them as
extra space. Kids played and slept out on them. People aired their mattresses there
and hung laundry. You still see that now -- they're used for bikes and gardening and
barbecue space and cat runs. All of that has been illegal since 1871.
A later law said that fire escapes had to have a cast-iron sign saying that you could be
fined for obstructing your fire escape. You still see those signs in some places in the
city. And it was fair, because usable fire escapes are better than unusable ones.
But, again, it was still perfectly legal to run your explosive business out of a tenement
basement and tons of residential fires started because of people deep frying crullers.
And anyway, the regulations were mostly not enforced, so people didn't pay much
attention.
Image: Carla Geisser. Used with permission.
1876 Brooklyn
Theater Fire
1876
In 1876 -- and this is a staggering number -- 278 people died in Cadman Plaza in the third
worst theater fire in US history. (The worst won't happen until 1903 in Chicago, so in
1876 this is the worst one ever.)
The final act of the play was about to start and the stage manager noticed a very tiny fire
on the left of the stage.
Image: Waller & Schrader, Photographers - Period Stereograph. Public domain.
https://en.wikipedia.org/wiki/Brooklyn_Theatre_fire#/media/File:BrooklynTheatre_From_Johns
on_Street_Looking_East.jpg
What happened?
● obsolete contingency plans
● clutter
● clumsy incident response
● delayed response
● locked doors
1876
It was typical to keep buckets of water next to the stage, but there weren't any. There was
a fire hose, but too much scenery was in the way to get to it. So the stage manager asked
a couple of carpenters to put the fire out by beating it with poles. This didn't work and
actually spread some sparks, setting fire to the loft.
The actors wanted to avoid a panic, so they announced that the fire was part of the show,
and that people shouldn't freak out, but once the audience realised, they stampeded. And
they had trouble getting out. There was only one stairway down from the cheap seats at
the top, and it filled with smoke. There were no fire escapes. Some exits were locked to
keep out gatecrashers, so people couldn't get out that way.
278 people.
Outcome: better buildings
● prosecutions
● new laws
● sprinklers
1876
The jury blamed the theater owners for not obeying a bunch of existing fire laws, and
new laws were written, including not storing stuff on the stage and widening exits. In
1882, the building code said that theatres had to have automatic sprinklers: they were the
first type of building in the city required to have them. The first automated response.
What I find remarkable is that this fire happened nine years after regulation said that
tenements had to have safe exits, but those laws didn't carry over to theatres. It
turns out they didn't carry over to other types of buildings either: hotels, schools,
factories, ships, offices all followed their own path to fire safety and each had horror
stories to get them there. Most of those I'm not going to talk about (trust me, it's better
this way), but we'll look at factories in a minute, after….
1890-1901
Even more
Tenement
House Acts!
...we get proper no-kidding tenement regulation at last! And it doesn't even take a
devastating fire to make it happen. Thank you Jacob Riis!
In 1890, Jacob Riis published a book about tenement life called How the Other Half
Lives and did a lecture tour on it. And up until now the upper and middle class people
of New York City had sort of known the tenements were awful, but for the first time
ever, there were photographs. It was harder to ignore. And over the next decade,
people started to care about the conditions of tenements. Well, it was probably part
empathy, part fear of smallpox coming out of there but, whatever, people suddenly
cared.
I was really reassured when I read this, because until then it had been all “there was a
horrific fire and we added a very specific law and then there was a different horrific
fire and we added a different very specific law”. And it was mostly like that! But this
Tenement House Act came from someone saying “wow, look how much this sucks” in
a compelling way. And that gives me hope!
Anyway, the next couple of Tenement House Acts included having to have actual
windows, not air shafts, and fire escapes couldn't be ladders any more: they had to
have open balconies and stairs and be properly attached to the wall. Even better:
your neighbours can no longer boil oil in the basement! And all new construction has
to have interior fire partitions. Failure domains!
We're finally looking at stopping fires from starting and spreading, not just escaping
from them. And, best of all, it’s all actually going to be enforced. Welcome to the 20th
century!
But, oh yeah, it still sucks in factories.
Image: Public domain.
https://commons.wikimedia.org/wiki/Category:How_the_Other_Half_Lives#/media/File:How_th
e_Other_Half_Lives_front_cover.png
The Newark
Factory Fire
1910
The Triangle Shirtwaist is the more famous one, but the Newark factory fire a few
months earlier is a textbook disaster waiting to happen so I wanted to talk about it.
The building was shared by a couple of paper box companies, a nightgown factory
and a lamp manufacturer. It had previously been used by machine companies and the
floors were soaked in oil.
It had two fire escapes -- look at the size of this building! One ended up on a roof, with
no way down from there, and the other was a really heavy ladder hanging from the
third floor balcony. This was another emergency plan that only worked for people with
good upper body strength, and this factory employed mostly young women. In the fire,
they weren't able to lift down the ladder, so there was effectively only one fire escape.
What happened?
● no fire alarms
● locked door
● not enough fire escapes
● delayed response, for insurance reasons
● panic
1910
A fire started in the lamp factory. There was no fire alarm, and everyone had
evacuated the bottom three floors before they realised that 116 people up on the 4th
didn't know there was a fire. The only door up to the 4th floor was kept locked,
which was against the law.
People on the ground brought a net and started catching people jumping from the
fourth floor, but it broke and they only had one net. 25 people died, mostly from
jumping. 32 more were badly injured.
The buildings department had condemned this factory three times, but the factory
owners ignored them.
This building had had ten fires in ten years, which was expensive for insurance and
they didn't want another fire on their record, so they delayed calling in the firefighters,
even though the firehouse was just across the street.
And the victims had never been in a fire drill and they had no idea what to do. They,
quite reasonably, freaked out.
"The commissioner of the
New Jersey bureau
regulating fire safety in
factories felt that the
building was sufficiently
constructed and that the
victims merely
succumbed to panic."
Operator
error?
NOPE.
From "Fire Escapes in Urban America: History and
Preservation", by Elizabeth Mary André (emphasis mine).
When officials investigated, they said the root cause was not the walls soaked in
grease, or delaying calling fire fighters, or the locked door, or the lack of fire escapes.
It was that "the girls panicked".
Human reaction to an outage or a disaster is never the root cause. Humans will act in
human ways. If your systems can't handle that, and you haven't invested a lot of time
in training the humans to act in some other way, your systems are crap.
Quote: https://www.uvm.edu/histpres/HPJ/AndreThesis.pdf
Outcome: …?
● “They died from misadventure and accident.”
● New York City Fire Chief Croker: "This city may have a
fire as deadly as the one in Newark at any time."
● He wasn't wrong.
1910
So what happened? Nothing. The jury didn't convict. New Yorkers did look a bit at
their factories and say "huh, I wonder if we should care about that"..., but nothing
changed. Is it because it happened ten whole miles away instead of right in the city?
No idea. The New York Fire Chief said "This city may have a fire as deadly as the one
in Newark at any time".
Four months later…
Quote: "They died from misadventure and accident" from
http://www.nytimes.com/2011/02/24/nyregion/24towns.html
Quote: "This city may have a fire as deadly as the one in Newark at any time." from
http://trianglefire.ilr.cornell.edu/primary/testimonials/tf_warnings.html
1911 The Triangle
Shirtwaist Factory
146 people, mostly young immigrant women, died inside 18 minutes in the Triangle
Shirtwaist Fire.
Image: Public domain.
https://en.wikipedia.org/wiki/Triangle_Shirtwaist_Factory_fire#/media/File:Image_of_Triangle_
Shirtwaist_Factory_fire_on_March_25_-_1911.jpg
What happened?
● no failure domains
● only one fire escape
● locked doors
● obsolete contingency plans
● and they already knew
1911
This building was considered fireproof, but it was packed with garments hanging so
tightly together that the building might as well have been made out of cloth.
The building should have had three fire escapes; it had one and that collapsed under
the weight of people escaping, killing 20 people who dropped from the 7th floor. One
exit was locked; the guy with the key escaped without unlocking it. The fire ladders
and the water from the hoses could only get to the 6th floor and the factory was on
the 7th to 9th.
And the employers already knew about the problems. Employees had organised a
strike the previous year to protest the working conditions, and they'd been fired. The
building had had a recent warning notice from the department of sanitary control, but
they hadn't fixed their violations.
Outcome: better incident response
● New fire-fighting equipment
1911
The fire department developed a stronger water pump and a longer ladder, so they
could reach taller buildings.
Outcome: better buildings
● 60 new laws in three years
● sprinklers
● professional organisation
● "the common outside form of iron ladder-like stairway
anchored to the side of the building is a pitiful delusion"
1911
But more importantly, building conditions took a big step forwards. A commission was
started to look into fire hazards and other conditions in factories and their
recommendations turned into 60 new laws over the next three years. Again, everyone
knew factories were bad. But, when people died, they changed the law.
Sprinklers started to be required in factories. (But only factories over seven stories
tall. Very specific again). A professional organisation, the American Society of Safety
Engineers, was founded and still exists.
And people started to look at fire escapes differently. After the disaster, a report called
them "a pitiful delusion" and "a type of exit condemned by the experience of
many fires".
“It has long been recognized that the common outside form of iron ladder-like stairway
anchored to the side of the building is a pitiful delusion. This device for a quarter of a
century has contributed the principal element of tragedy to all fires where panic
resulted. Passing successively the window openings of each floor, tongues of flames
issuing from the window of any one floor cut off the descent of all on floors above it.
Iron is quickly heated and is a good conductor of heat, and expansion of the bolts,
stays, and fastenings soon pulls the framework loose, so that the weight of a single
body may precipitate it into the street or alley. Many a human being has grasped the
hot rail of such a 'fire escape' only to release it with a scream and leap from it in
agony. Its platforms are usually pitifully small, and a rush to them from several floors
at once jams and chokes them hopelessly. It is a makeshift creation of the cupidity of
landlords, frequently rendered still more useless by the ignorance of tenants, who
clutter it up with milk bottles, ice boxes and other obstructions.”
Quote:
http://www.nfpa.org/News-and-Research/Publications/NFPA-Journal/2014/September-October
-2014/Features/Fire-Escapes/1914-Sound-the-Alarm
“ FIREWALL IN ALL
BUILDINGS SEEN AS
ONLY SAFEGUARD
"Horizontal" Escapes Created in 206 City Structures
After Triangle Fire With No Casualties Since
-- New York Times, February 25, 1923
1923
In 1923, the New York Times had an article praising fireproof interior walls: "For six
years there has been no loss of life by fire in the 200 buildings so treated. And the
cost is far less than the cost of the fire escapes and other equipment."
It blows my mind that a group of 206 buildings having no fire deaths in six years was
newsworthy.
Those fireproof walls became code. In 1929, all new buildings over 75 feet in height
had to have them, and also had to have two fully enclosed staircases! Failure
domains are part of the code at last!
Headline:
https://timesmachine.nytimes.com/timesmachine/1923/02/25/105849722.html?pageNumber=1
41
More
fires
1968+
More very
specific
laws
The code made frequent improvements over the next decades and in 1968 fire
escapes stopped being code. You can't build them now.
The 1968 code also required sprinklers for hotels and high-rise office buildings, but
not nightclubs or residential buildings. In 1975, seven people died in a nightclub, so,
sprinklers became required for nightclubs. In 1998 there were two bad residential
fires, and now you have to have sprinklers for residences with four or more units. And
I'm sure there will be more changes to react to future disasters.
There's no retrofitting of existing buildings, btw. The laws only apply to new buildings
and existing buildings only get better as they're renovated.
Building code: Fire escapes shall not be permitted on new construction, with the exception of
group homes. Fire escapes may be used as exits on buildings existing on December sixth,
nineteen hundred sixty-eight when such buildings are altered, subject to the approval of the
commissioner, or as provided in subdivision (b) hereof.
https://www.lawserver.com/law/state/new-york/ny-laws/ny_new_york_city_administrative_code_27-368
Fire deaths didn't
decrease because we
built better fire escapes.
It was because we built
better buildings.
The number of fire deaths didn't decrease because we built better fire escapes. It was
because we built and operated better buildings. In the end, the thing that made the
difference was making it harder for the fire to start and spread. But for decades
we optimised for escaping from it.
Barbara L Hanson CC BY 2.0
Dan DeLuca CC BY 2.0
Eden, Janine and Jim CC BY 2.0
don toye CC-BY-ND 2.0
Kristine Paulus CC-BY-ND 2.0
Fire escapes are last resort plans. Even at their best, they kind of suck:
● Not everyone can get out a window
● Windows get locked
● People put air conditioners in them
● And child-proof bars that are also adult proof.
● Fire escapes get covered in snow and ice
● And stuff, like bicycles and plants
● They rust
● They get detached from the wall
● The ladder gets stuck
● They make people worried about burglars
● If you leave windows open, fire blocks the path down
But most importantly
● they never, ever get tested.
Images:
Kristine Paulus CC BY 2.0. https://flic.kr/p/fszEDf (plants)
Dan DeLuca CC BY 2.0. https://flic.kr/p/5hsnTM (chairs)
Eden, Janine and Jim. CC BY 2.0. https://flic.kr/p/7G1tWZ (snow)
Barbara L. Hanson. CC BY 2.0. https://flic.kr/p/8uxpcf (rain)
Don toye, CC BY 2.0 https://flic.kr/p/9XrAs (bike)
“ vertical ladders also placed relatively
greater stress on their mounts to the
building, leading to fire escape
collapses during times of intense use
– such as during actual fires.
John W. Cramer, The Story
of a Tenement House
Fire escapes pretty much have one time of intense use: during a fire. If
they're going to collapse, that's when they're going to collapse. You don’t find
out they don’t work until there’s a fire.
Quote: http://www.boweryboogie.com/2014/10/favorite-pastime-tenement-fire-escapes
If we're using
a (terrifying)
contingency plan
a lot had to go
wrong
If we’re using a fire escape to escape a fire, a lot of things have had to go wrong. In
fact, we had three chances not to be here.
we could have
prevented the spark!
1
Ideally, the fire wouldn’t have started at all. Something made the first spark. There
was heat and fuel and oxygen. Maybe we could have not let it start.
Image: Christophe. CC0. https://pixabay.com/en/match-sticks-smoke-ignite-fire-359970/
we could have
automatically
fixed it!
2
Or we could have stopped it while it was tiny. Who hasn’t set a dish towel on fire in
the kitchen, or burned toast? If you catch a fire quickly enough, it’s not a big deal. You
just extinguish it.
For bigger fires, maybe there’s a sprinkler system which triggers automatically. Yeah,
it makes a mess, but you get to not have a complete building outage.
Image: JonathanLamb Public domain.
https://commons.wikimedia.org/wiki/File:Fire-blanket-on-display.jpg
we could have
contained it!
3
If we have proper failure domains, we can keep the fire or outage to one small part of
the building or infrastructure. We can at least make the fire spread very slowly, so we
have time to react non-urgently.
Image: Tim Gouw. CC0. https://unsplash.com/photos/MApjpqu9V7E
ok I guess
we're
reacting to
a fire :-(
4
If we miss those three chances, we end up at stage four, urgently reacting to a fire
that's out of our control. If we're here, we probably care a whole lot about fire
escapes! But it would have been much better not to get to here.
Image: skeeze. CC0. https://pixabay.com/en/firefighters-training-live-fire-696167/
75% of site
reliability
is not
firefighting
at least!
Whether it's a building or a software system, the most important reliability work is
making problems stop before they get to that fourth stage. Or at least it should be.
I got a recruiter mail a couple of years ago that said 'Our site reliability engineers are
seen as "firefighters"'. Wow, what a waste of SREs. I know there's a strong association
with SREs and on call, and the pager, but that should be a really small part of the job. If
your SREs or production folks are focused on emergency response, you're wasting 75% of
their skillset.
And, right, everyone who's writing code should have reliability in mind. Nobody sets out to
write a precarious system. But just as we have people who specialise in UI or security,
both of which we should all care about, we have people who specialise in reliability. And
they need to be involved at every stage.
So: prevention, detection, isolation, response. Let's look at those four stages again
while thinking about software.
prevent the spark!
1
Design for failure:
prevention
Something caused the first tiny breakage that caused the outage. Maybe we changed
something, or something got overloaded or a user entered an input we didn’t expect.
Or someone bulldozed a cable.
A certain number of sparks is fine! Some things are allowed to have more problems
than others. We should know what our SLAs allow. We have error budgets. But how
can we stop those things from happening more than we can handle?
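One way to make "more than we can handle" concrete is to count failures against the error budget. This is only a rough sketch: the function name, the 99.9% target and the request counts are made up for illustration and don't come from any real SLA.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    # The budget is the number of failures the target lets us get away with.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests against a 99.9% target allows about
# 1,000 failures, so 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget left")
```

A negative result means the sparks have already outrun what the target allows, which is a reasonable moment to slow down on risky changes.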
Image: Christophe. CC0. https://pixabay.com/en/match-sticks-smoke-ignite-fire-359970/
solid structures
● choose your
stack carefully
● design review
● think about
weak points
There's this quote by Ellen Ullman: “We build computer systems like we build our cities:
over time, without a plan, on top of ruins.”
We don't need to build systems like that. We can think about our stacks from the start,
with reliability in mind. We can spend time on high level design review and try to find holes
in the system before we build it.
Image: by me. My dad built this wall <3
wiring inspections
● component design
review
● code review
● testing
● fault injection
● be paranoid
State Farm CC BY 2.0
Next, catching bugs before they ship. We do this by having a second pair of eyes on
our designs and our code, but also by writing good tests, both for the stuff that we
think can't happen as well as the stuff that can. Fault injection like fuzz testing is a
good way to test input we haven't thought of.
And validate everything. Even if you write the only caller of a function, check what
input you just passed yourself. Assume that anyone who calls one of your functions,
including yourself, is a monster who's trying to take you out. Be paranoid.
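Here's a small sketch of what being paranoid can look like, plus a crude fuzz loop that throws junk at the check. The function `parse_replica_count`, its limit of 100 and the junk values are all hypothetical, picked only to illustrate the idea.

```python
import random


def parse_replica_count(raw: object, max_replicas: int = 100) -> int:
    """Validate an untrusted replica count, raising ValueError on anything odd."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python, so True would otherwise pass as 1.
        raise ValueError("replica count must not be a boolean")
    if isinstance(raw, str):
        text = raw.strip()
        if not text.isdecimal():
            raise ValueError(f"not a whole number: {raw!r}")
        raw = int(text)
    if not isinstance(raw, int):
        raise ValueError(f"unsupported type: {type(raw).__name__}")
    if not 1 <= raw <= max_replicas:
        raise ValueError(f"replica count {raw} is outside 1..{max_replicas}")
    return raw


def fuzz(rounds: int = 1000, seed: int = 0) -> None:
    """Feed the parser junk; it must return a sane value or raise ValueError."""
    rng = random.Random(seed)
    junk = [None, "", " 7 ", "7.5", "-3", "9" * 50, [], {}, True, 2.0, b"4"]
    for _ in range(rounds):
        raw = rng.choice(junk + [rng.randint(-1000, 1000)])
        try:
            result = parse_replica_count(raw)
        except ValueError:
            continue  # rejecting bad input loudly is the behaviour we want
        assert 1 <= result <= 100, (raw, result)


fuzz()
```

The point of the fuzz loop is that any other exception type, or a value outside the range, would show up as a crash here rather than in production.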
Image: State Farm. CC BY 2.0. https://flic.kr/p/duWtgw
hiding the matches
● don't give users
access they
don't need
● clean interfaces
● sudo not root
Michael Chen CC BY 2.0
A stove igniter is a better tool than a box of matches. Don't give users access to
functions or data they don't need, and when they do need it, provide clean, safe
interfaces that are hard to get wrong. And don't even give yourself more access than
you need: use sudo, not root.
Image: Michael Chen CC BY 2.0. https://flic.kr/p/LdPYz
operating with care
● plan changes
● canary everything
● feature flags
Reproduced from NFPA's website, © NFPA (2018).
The fire department recommends that you don't operate a stove while drunk or
sleepy, and the same goes for a root prompt or admin console.
Many outages are caused by changes, so make them deliberately and carefully.
Canarying helps: push the change to one instance before you push it to all the instances.
And push out new features in a way that makes it very fast to turn them off if you need
to.
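As a rough sketch of both ideas, assuming a made-up in-memory flag store and caller-supplied `deploy`, `is_healthy` and `rollback` functions (none of these names come from a real tool):

```python
from typing import Callable, Dict, Sequence

# Hypothetical flag store; a real one would live in config you can change
# without shipping a new binary.
FLAGS: Dict[str, bool] = {"new_checkout_flow": False}


def flag_enabled(name: str) -> bool:
    """Unknown flags default to off, so a typo can't switch a feature on."""
    return FLAGS.get(name, False)


def canary_rollout(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy to one instance first; continue to the rest only if it stays healthy."""
    if not instances:
        return True
    canary, rest = instances[0], instances[1:]
    deploy(canary)
    if not is_healthy(canary):
        rollback(canary)  # only one instance ever saw the bad change
        return False
    for instance in rest:
        deploy(instance)
    return True


deployed = []
ok = canary_rollout(
    ["web-1", "web-2", "web-3"],
    deploy=deployed.append,
    is_healthy=lambda instance: True,  # stand-in for a real health check
    rollback=deployed.remove,
)
print(ok, deployed, flag_enabled("new_checkout_flow"))
```

A real canary would also wait and watch metrics for a while before continuing; this sketch only shows the ordering.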
Image: Reproduced from NFPA's website, © NFPA (2018).
https://www.nfpa.org/~/media/files/public-education/resources/safety-tip-sheets/cookingsafety.pdf
http://www.nfpa.org/termsofuse
fire safety campaigns
● best practices
● conferences!
Over time, the best practices for our industry have changed. We don't log in as root
any more. We use config management. We use change control. We use repeatable
builds. We socialise the idea of reliability in books and articles and at conferences.
Image: Senior Airman Kyle Gese. Public domain.
https://commons.wikimedia.org/wiki/File:Fire_Prevention_Week_131009-F-OP138-012.jpg
automatically
fix it!
2
Design for failure:
detection
But, ok, sometimes we still break things. We have two options for immediate
response: humans staring intensely at screens, or robots. Robots are better!
Alice Goldfuss said (I think it was her Monitorama talk last year) “If you have three
minute SLAs that you expect to be satisfied by a human, you don’t have SLAs."
I agree with that. Humans aren't fast enough for 4 nines.
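The arithmetic behind that is short enough to sketch. The helper name and the 30-day "month" below are my own simplifications:

```python
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600, ignoring leap years
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_downtime_minutes(nines: int, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime allowed in a period at N nines of availability (2 -> 99%, 4 -> 99.99%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return period_minutes / 10 ** nines


for nines in (2, 3, 4, 5):
    per_year = allowed_downtime_minutes(nines)
    per_month = allowed_downtime_minutes(nines, MINUTES_PER_30_DAYS)
    print(f"{nines} nines: {per_year:8.2f} min/year, {per_month:6.2f} min per 30 days")
```

Four nines works out to about 52.6 minutes a year, or a little over four minutes in a 30-day month. One human-speed response can use up most of a month's allowance.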
Image: JonathanLamb Public domain.
https://commons.wikimedia.org/wiki/File:Fire-blanket-on-display.jpg
smoke alarms
● early warning.
But don't burn
out your
responders!
topquark22 CC BY 2.0
Smoke alarms need a fine balance, as everyone knows who’s ever burned toast and
had trouble shutting up their smoke alarm. And taken the batteries out :-/ (Don't do
that.)
Having humans react to small problems can burn them out. You're using up your
gunpowder on small fires and not having enough left for the big ones! So ideally have
a robot deal with the thing, or make it happen slowly enough that humans don't need
to burn adrenaline to get involved.
Image: topquark22. CC BY-2.0. https://flic.kr/p/6AcBru
fire extinguishers
● rollback buttons
The less time a human spends on deciding what to do, the better. In an oil fire
emergency, don't make people have to think about whether the water fire extinguisher
is safe; make sure the available tools are the safe ones.
Provide a one-click rollback for all your changes. Let your on caller put out the fire,
and then we'll figure out how it happened.
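A minimal sketch of the idea, assuming a hypothetical `ReleaseHistory` class and invented version names: if the system remembers what was live before, going back needs no decision-making at all.

```python
from typing import List


class ReleaseHistory:
    """Remembers what was deployed, so going back never needs a decision."""

    def __init__(self, initial_version: str) -> None:
        self._versions: List[str] = [initial_version]

    @property
    def current(self) -> str:
        return self._versions[-1]

    def deploy(self, version: str) -> None:
        self._versions.append(version)

    def rollback(self) -> str:
        """Return to the previous version and report what is now live."""
        if len(self._versions) < 2:
            raise RuntimeError("nothing to roll back to")
        self._versions.pop()
        return self.current


releases = ReleaseHistory("v41")
releases.deploy("v42")      # the change that started the fire
print(releases.rollback())  # back on the last known good version
```

The on caller never has to work out which version was the good one; that question was answered at deploy time.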
Image. Hans. CC0. https://pixabay.com/en/fire-extinguisher-fire-delete-99915/
sprinklers
● automatic
recovery
● automatic
response
HomeSpot HQ CC BY 2.0 www.homespothq.com
Even better than automatic response is automatic recovery. There's a ton of ways we
do this, especially at low levels of abstraction. If you drop a packet, TCP don't care.
It's built into the algorithm. Resend that thing. You're not paging a human for a
dropped packet or a failed checksum.
We need automatic recovery higher up the stack. If tasks are flapping, we should be
able to ride it out. If a backend goes missing, we should be able to coast, at least for a
while. If a machine dies, it should automatically be replaced. Health checking and
load balancing should move traffic from an unhealthy region to a healthy one. Maybe
you want to let humans know, but the message they should get is "everything is under
control but you might want to look at this when you get a chance". Not "WELCOME
TO 3AM! A MACHINE REBOOTED". You don't want humans involved in failing over.
We'll just mess it up.
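One small example of recovery higher up the stack is retrying a flaky call with exponential backoff and jitter, and only surfacing the failure once the retries run out. This is a sketch under assumptions: the function name, the default delays and the choice of `ConnectionError` as the retryable error are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a flaky operation, doubling the wait each time and adding jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: now it is worth telling a human
            delay = base_delay * 2 ** attempt
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")


state = {"calls": 0}


def flaky_backend() -> str:
    """Stand-in backend that fails twice and then recovers."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("backend went missing")
    return "response"


print(call_with_retries(flaky_backend, base_delay=0.01), "after", state["calls"], "calls")
```

Retries need a cap and a backoff, otherwise the recovery mechanism can turn into its own overload.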
Image: HomeSpot HQ. CC BY 2.0. www.homespothq.com https://flic.kr/p/fmr7a7
contain it!
3
Design for failure:
isolation
Ok, there's a fire, it's happening. Now we need to not let it get on anything it's not
already on. Or at least slow it down enough that we can catch it.
Image: Tim Gouw. CC0. https://unsplash.com/photos/MApjpqu9V7E
fire barriers
● isolation
● sharding
● failure domains
Achim Hering CC BY 3.0
Failure domains split the infrastructure up so that only one part of it should be affected
by any given outage. Maybe we've sharded our users over a bunch of different
servers. Maybe we've added redundant network connectivity. If the problem's going to
move as components get overloaded, we want that to be slow enough that we can
control it, not an immediate cascade.
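Here's a sketch of sharding users by hash, to show how a failure domain limits the blast radius. The user IDs, the shard count of 8 and the "failed" shard are all invented for the example.

```python
import hashlib
from collections import Counter


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user to a shard with a stable hash, so the answer never changes between runs."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


users = [f"user-{n}" for n in range(1000)]
per_shard = Counter(shard_for(user, 8) for user in users)

failed_shard = 3  # pretend this one is on fire
print(f"shard {failed_shard} down: {per_shard[failed_shard]} of {len(users)} users affected")
```

Plain modulo sharding moves most users around if the shard count changes, so this only illustrates the isolation idea, not a resharding strategy.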
Image: Achim Hering. CC BY 3.0.
https://commons.m.wikimedia.org/wiki/File:Durasteel_fire_barrier.jpg
State Farm CC BY 2.0
fire drills
● controlled
outages
● disaster tests
Humans will panic the first time they get paged, or the first time they hit a situation
that's completely outside their comfort zone. They'll flail. They may even make the
problem worse. Fire drills help. Just like we make it incredibly common to hear a
smoke alarm and find our way outside, make it so that a disaster is never a surprise.
At intervals, tell people you're doing a controlled outage, and take a system offline.
See what breaks. Let people get their panic out of the way while you're there staring
at the systems. This will also shake out dependencies that you hadn't considered. If
you announce you're taking some low-SLA system down for ten minutes, you may
find a bunch of high-SLA users you don't know about!
Image: Clker-Free-Vector-Images. CC0.
https://pixabay.com/en/safety-helmet-construction-hat-295057/
avoiding
encumbrances
● clean ops
● fatigue awareness
You know the phenomenon where you're paged for a thing, and in the course of fixing
it, you hit a bunch of unintuitive commands, or out of date documentation, and it ends
up taking you much longer to do something simple? Or you even end up breaking
something else? These traps are a basement full of straw, or walls covered in oil, or a
fire hose with cluttered scenery on top of it. It's making it very, very hard for you to
move around safely as you try to fix the real problem.
Fatigue is another one. You're way more likely to make a mistake if you're exhausted.
Set rules about how many incidents a person should have to deal with and how many
hours they have to spend responding to problems before their on call shift is over.
Enforce those rules. If you end up having a ton of six hour on call shifts, your pager
fires too much. Quit burning out your humans.
Image: by me.
react with
urgency
4
Design for failure:
response
Ok, sometimes this won't work. We don't have perfect software. Sometimes we will
respond to outages and we can set ourselves up for success in that.
not locking fire exits :-(
● no encumbrances
● don't comment out
the safety system
Robert Couse-Baker CC BY 2.0
First off, and the thing that broke my heart through many of the fire stories: don't lock
your fire exits. If you know there's a potential disaster and you have a way to deal with
it, don't disable it. Don't lock yourself out of your control plane by having a response
that depends on the thing that just went down. Don't comment out the automatic
recovery system. Please, don't take the batteries out of your smoke alarm, literally or
metaphorically.
Image: Robert Couse-Baker. CC BY 2.0. https://flic.kr/p/6L2qvU
communication
● status
dashboards
● known issues
● announcement
email
Communicate about outages early. Make sure everyone who needs to know is told there's
a problem. It's the worst when you're debugging some problem with your system
and after ages you get the notification that some other system has problems that
caused it.
Image: by me.
documented exits
● anticipate what
you'll do in a
disaster
● playbooks
● gotchas
And document your exits. A lot of the time, we know how we expect our systems to
fail. Maybe they've failed like that before. Maybe it's a specific thing we've set up an
alert for. Playbook entries let us tell an on caller what to do. And document the things
*not* to do, or even better, make it impossible to do them.
Image: by me.
controlled burns
● practice
responding
● wheel of
misfortune
● find unexpected
dependencies
Jereme Rauckman CC BY 2.0
Firefighters train using controlled burns and we can too. Just like we have drills to
prevent panic, we can use them to speed up our response time by practicing outages.
A wheel of misfortune is a regular exercise where you pick an arbitrary outage from a
list of things that might happen, and work through it to make sure you know what to
do. I saw a great one once where someone ended up needing to call Japan, and he
had no idea how to make an international call. This is the kind of stuff we should find
out before we need it!
Have regular fake outages, wheels of misfortune, and even large scale disaster tests.
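The "wheel" itself can be tiny. A sketch, with an invented scenario list and a hypothetical `spin_wheel` helper that avoids repeating last time's scenario:

```python
import random
from typing import Optional, Sequence

# Made-up scenarios; a real list would come from your own past incidents.
SCENARIOS = [
    "primary database is read-only",
    "TLS certificate expired on the load balancer",
    "a deploy is stuck halfway",
    "you need to phone a vendor in another country",
]


def spin_wheel(
    scenarios: Sequence[str],
    last: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a scenario at random, avoiding an immediate repeat when possible."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    rng = rng or random.Random()
    choices = [s for s in scenarios if s != last] or list(scenarios)
    return rng.choice(choices)


print(spin_wheel(SCENARIOS))
```

The valuable part is the conversation that follows the spin, not the code.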
Image: Jereme Rauckman. CC BY 2.0. https://flic.kr/p/pjPGD6
Reliability can't
be added at the
end. That's why
we do DevOps.
What I'm saying is that reliability can't be added at the end. If you're fixing a terrible
system by focusing on stage 4, with human response, you have a tenement. Foul air
is coming in through the air shafts, and it's not somewhere humans should spend
time.
Reliability needs to be built in. Failure needs to be built in.
data from
http://www.baruch.cuny.edu/nycdata
/public_safety/civilianfire.htm
graph made with gnuplot
Century
high: 1970
310 people
Century
low: 2016
48 people
71 people died of fire in NYC in 2017. 48 people in 2016. This is still a lot of people!
But 2016 was the lowest number since they started recording a hundred years ago.
That Bronx fire in December that killed 12 people was the deadliest in 25 years. How
did we get from the fire traps of the 1800s to here?
This graph shows NYC civilian fire deaths for each of a sample of years. I need a
more complete data source. Also, bear in mind that the city population is growing so
the decrease is even more impressive.
Data: http://www.baruch.cuny.edu/nycdata/public_safety/civilianfire.htm
444 --->
pages!!!
Well, this helped. This is the New York City fire code. It has 444 pages.
Book: http://shop.iccsafe.org/2014-new-york-city-fire-code.html
Fire code: https://www1.nyc.gov/site/fdny/about/resources/code-and-rules/nyc-fire-code.page
444 --->
pages!!!
Fire safety is also mentioned plenty in the city building code, the city construction
code, the state building code, the National Fire Protection Association’s electrical code
and plenty of other dense legislation. Don’t ask me what the difference is between all
of these. There’s a lot of code, that’s all I’m saying.
But we don't have a fire code for software. We have a bunch of O’Reilly books and
they're great. Our industry makes a valiant effort to document how we do things. But
nothing whatsoever makes us adhere to them, or prioritises one set of rules over the
others. Why don't we have a fire code?
Books:
http://shop.iccsafe.org/state-and-local-codes/new-york-city/2014-new-york-city-codes-complete-collection-1.html
https://catalog.nfpa.org/NFPA-70-National-Electrical-Code-NEC-Softbound-P1194.aspx
“ “Millions of computers throughout the
world are executing millions of instructions
per second for millions of seconds without a
single error [...] In spite of this, nobody trusts
a computer; and this lack of faith is amply
justified.”
Software: A Vital Key to UK Competitiveness
(C) Crown Copyright 1986
via Risks Digest
(https://catless.ncl.ac.uk/Risks)
h/t Joe Thompson @caffeinepresent
It has been proposed from time to time!
There is an amazing list called the RISKS Digest, about public safety risks caused by
computers. It's been running since the 80s. Digging around on there, I found this
report from 1986 called "Software: a vital key to UK competitiveness", which had a
whole appendix on safety critical software. It starts with “No computer software
failure has killed or injured a large number of people. It is just conceivable that
such a tragedy could occur.” and it has detailed sections on disaster prevention,
management and analysis.
It also includes this fantastic line: “in spite of this, nobody trusts a computer, and this
lack of faith is amply justified”. I love it.
Quote: https://catless.ncl.ac.uk/Risks/4/14#subj3.1
In 1986, the UK Advisory Council for Applied
Research and Development recommended…
● Before any organization can operate a life-critical
computer system it must first obtain a License
To Operate (LTO)
● Each life-critical system must be operated by a
Certified Software Engineer who is named as
being personally responsible for the system.
The Advisory Council predicted a time when it wouldn’t be possible to recover from
software failure by just switching off the computer and doing the thing manually -- this
was written in 1986, remember. We're there now. They wanted certification: you
would only be able to operate a life-critical computer system if you had a license and
a Certified Software Engineer and a bunch of other stuff, and you'd have to get
re-certified every five years.
They also proposed what’s basically on call shifts, disaster recovery practice drills,
and post-mortems, including post-mortems for near misses. A lot of this feels
prescient and we ended up doing it, but we never required certification.
slide stolen from @jkuroda's
amazing LISA keynote. Used
with permission.
If you were at LISA in November, you might have seen a fantastic talk -- it was the
closing keynote -- by Jon Kuroda about aviation safety. Like with fire, plane travel got
safer only after a lot of bad accidents.
Jon pointed out that, while we might think of computing as a new field, it's the same
age as a bunch of others. Software, aviation, power, emergency medicine all took a
big jump forward after World War 2. But our industry is less mature than any of the
others.
Slide:
https://people.eecs.berkeley.edu/~jkuroda/talks/jkuroda-systemcrash-planecrash-lisa2017.pdf
Video of Jon's talk:
https://www.usenix.org/conference/lisa17/conference-program/presentation/kuroda
The stakes are
lower?
Why are we less mature? Is it because the stakes are lower? Mostly, the stakes have
been lower. Though we have had life-threatening and sometimes fatal software
bugs.
The stakes are
lower?
● (1985-1987) Therac-25
● (1992) London Ambulance dispatch failure
● (2013) "Character substitution errors may
occur"
● (2017) "autocorrects medications to names
of different medications [...] without telling
the user"
The Therac-25 radiation therapy machine had a concurrent programming bug that
made it very rarely give its patients radiation doses that were hundreds of times
greater than they should have been. Three people died.
In college I remember studying the London Ambulance dispatch failure. A new
software system was deployed that hadn't been load tested, and that had a memory
leak. It couldn't keep track of where the ambulances were, which led to them arriving
hours late. 46 people died who might have been ok if the ambulance had arrived on
time.
I haven't heard of any actual negative outcomes from the OCR bug that went around
in 2013 and the medication autocorrection that people have been talking about
recently, but in both cases, you can see how someone might print out a bunch of
prescriptions that don't say what the doctor wrote.
The use of software for life-critical systems grows every year. And every month we
send #hugops on Twitter to the people working on the latest massive software
outage. At some point these will overlap.
Are we ready for this kind of responsibility?
“ It took an Iroquois Theater fire to
improve the safety of theaters. It took a
Titanic disaster to improve the safety of
vessels. It took a Newark fire and a
Triangle fire to bring New York State's
fire legislation to its present
inefficiency.
Inis Weed, New Outlook
volume 104, 1913
My new favourite 1910s journalist, Inis Weed, summed it up:
"It took an Iroquois Theater fire to improve the safety of theaters. It took a Titanic
disaster to improve the safety of vessels. It took a Newark fire and a Triangle fire to
bring New York State's fire legislation to its present inefficiency."
Quote: https://books.google.com/books?id=URCzNkpDZp0C
Maybe software
could improve
without a disaster?
Let's choose not to
build tenements
But some regulations didn't come from fires! A bunch came from a lot of people
deciding to care about the same thing at the same time. We SRE/DevOps/Production
folks are here, doing this thing, because we care about reliability. We care and we
can encourage everyone else to care.
Software is increasingly used for life-critical systems. I don't want us to wait for a
disaster to decide not to build tenements. We can decide now what good systems
look like. We can create professional standards and industry safety codes, and opt in
to a professional organisation to keep ourselves honest. And then, like the fire code,
we can keep revising and improving it until huge software outages are rare.
The entire industry should learn from every major outage. No secrets.
● http://noidea.dog/fires
● Escapes in Urban America: History and
Preservation, Elizabeth Mary Andre
● No exit: the rise and demise of the outside fire
escape: Sara E Wermiel
● How Fire Disaster Shaped the Evolution of the
New York City Building Code, Charles
Shelhamer
● The Creative and forgotten fire escape designs
of the 1800s, Lauren Young
● New Outlook vol 104 (May-August 1913)
● RISKS Digest
● 1910 Newark Factory Fire, Mary Alden Hopkins
● New York City (NYC) Disasters, Baruch College
● Presentation template by SlidesCarnival
Questions, objections?
Find me at @whereistanya
or fires@noidea.dog
#GetAlarmedNYC
Before I finish: The FDNY and the Red Cross have a shared campaign to give
people free smoke alarms and free batteries. They'll even come install it for you.
If you don't have a smoke alarm, please search for #GetAlarmedNYC and fill in
their form. http://fw.to/Kzv1G4f
This slide lists a few references that I found especially useful or interesting while
writing this talk. That first one contains a list of all the others, so hit up
http://noidea.dog/fires if you want a lot of links to read more about fires and fire
escapes. It includes a bunch of fires I didn't include here because I ended up with like
70 minutes of material (whoops).
If you have comments on the talk, or questions or you're a building historian who is
willing to tell me what I got wrong, please find me at @whereistanya on Twitter or
fires@noidea.dog.
And also, everyone, please insist on fireproof software. That's all I have, thank you.
Image: Sweetie candykim. CC0. https://commons.wikimedia.org/wiki/File:Smoke_alarm.JPG

JS-Experts - Cybersecurity for Generative AI
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdf
 
Webinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.pptWebinar_050417_LeClair12345666777889.ppt
Webinar_050417_LeClair12345666777889.ppt
 

En vedette

Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 
Introduction to C Programming Language
Introduction to C Programming LanguageIntroduction to C Programming Language
Introduction to C Programming LanguageSimplilearn
 
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...Palo Alto Software
 

En vedette (20)

Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 
Introduction to C Programming Language
Introduction to C Programming LanguageIntroduction to C Programming Language
Introduction to C Programming Language
 
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
 

The History of Fire Escapes (v1)

  • 1. The history of fire escapes Tanya Reilly (@whereistanya) This talk was a keynote at DevOps Days New York in 2018. If you like watching videos more than reading, the video is at https://www.youtube.com/watch?v=02KEKtc-5Dc. Abstract: When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug, we usually have a contingency plan. We reduce damage, redirect traffic, page someone, drop low-priority requests, follow documented procedures. But why do many failures still come as a surprise? In this talk, we look at some real life analogs to preventing and managing software failures. Fire partitions. Public safety campaigns. Smoke alarms. Sprinkler systems. Doors that say “This is not an exit”. And fire escapes. What can we learn from the real world about expecting failure and designing for it?
  • 2. “ When we first dropped our bags on apartment floors… Welcome To New York Taylor Swift My name is Tanya and I'm an immigrant here -- I moved here from Ireland ten years ago -- and one of the things I love about New York City is that you move here, and it’s immediately your city. The number one criterion for being a New Yorker is wanting to be a New Yorker. I love that. The greatest city in the world, Lin Manuel Miranda said so :-) I'm a Site Reliability Engineer and I’m especially interested in what happens when things fail, and the contingency plans we use to recover when something breaks. And a few months ago I was thinking about that a lot and walking around the city -- which is *beautiful* in September/October, the soft light makes the buildings look gorgeous -- and I started really noticing the fire escapes. They’re a contingency plan too. They’re for incident response. You don’t use them until all of your regular methods of getting out of the building have failed. So I started reading about fire escapes.
  • 3. Content Warning Fire and deaths caused by fire. Before I talk about fire escapes, let’s talk content. I'm looking at disaster prevention and disaster recovery in software, by looking at parallels in building fires. This will include stories of some of the worst fires in the history of the city. I've intentionally kept this talk as low on vivid details as possible, but we'll be looking at the reasons fires started, the stuff that helped them spread and how people died. There's also some pictures of buildings on fire. Nothing lurid, but there are pictures. If you have raw feelings related to recent fires like Ghost Ship or the devastating fire in the Bronx just after Christmas, this could be rough. If you'd be more comfortable skipping this one, you should do that with my blessing. On the next slide, I'll even tell you what I'm going to say, so you don't miss anything:
  • 4. “ 4 Tony Fischer CC BY-2.0 ● focus on better buildings, not better fire escapes ● focus on better software, not better incident response ● software needs a fire code tl;dr Here's my thesis: ● fire escapes are a hacky bit of afterthought tacked on to the outside of a building. If you're using fire escapes, it's worth making them as good as possible, but you’ll prevent more fires if you build better buildings. ● Similarly, incident response is often a hacky bit of afterthought tacked on long after software is released. Again, great incident response can help you recover faster than if you don’t have it but… you’ll prevent more outages if you build better software. ● Finally, buildings have an extremely detailed fire code, but we don't really have an extremely detailed systems engineering code for software, and maybe we should have. If you'd rather not read more about fires, that is ok. Stop here! Otherwise, keep going. Image: Tony Fischer. CC BY-2.0. https://flic.kr/p/72Lhz1
  • 5. Chinatown David Ohmer CC BY 2.0 Fire escapes were really only built in New York City for a hundred years. They weren't common until the 1860s, and in the 1960s they stopped being allowed on new construction. But in that time, they thoroughly changed the face of the city. And they seem to have really captured the imagination of people who live here. If you search for 'fire escapes nyc' on flickr you get 13000 pictures. Image: David Ohmer. CC BY 2.0. https://flic.kr/p/oybSWP
  • 6. Claudia Heidelberger CC BY-ND 2.0 Greenwich Village There's some debate now about whether we should start removing them in places where the building has been upgraded, or whether they should be preserved as part of the city's history. I think at least some of them should be preserved. Look how beautiful that is! Image: Claudia Heidelberger. CC BY-ND 2.0. https://flic.kr/p/oqYYv1
  • 7. East Village Dan DeLuca CC BY-2.0 Here's another lovely one. They made an effort to have it match the style of the building, not feel like a separate thing tacked on at the end. And I think that's key. Image: Dan DeLuca. CC BY 2.0. https://flic.kr/p/76Jmb2
  • 8. “ fire escapes were haphazardly attached to the most elaborately designed facades, with no consideration given to the relationship between the two. The facade was within the realm of architecture and the fire escape in the realm of the law Richard Plunz, a History of Housing in New York City But most of the time, the people adding the fire escape didn't think of it as part of the building. It was an afterthought. As this quote says, the facade of the building was architecture but the fire escape was law. It was an external contingency plan, not part of the main structure. And I think that's part of why fire escapes ended up not being successful. Quote: https://books.google.com/books?id=fcKlDAAAQBAJ&pg=PA24
  • 9. A brief history of NYC fires! (With apologies to actual historians) But I'm jumping to the end. Let's look at the evolution of New York City's fire code. By the way, I'm not an expert on buildings or fire escapes or the history of New York City. I read a lot to prepare this talk, and I'll link references along the way, but you should not consider this a reliable source of historical information. There may be errors.
  • 10. Financial District 1835 On to the history. We’re skipping the great fire of 1776, and jumping straight to 1835 and the Financial District. This was a commercial, not residential area, and as a result the number of fatalities was comparatively low -- two people -- I mean, still, two people, but this is mostly remembered as a fire that cost a LOT of money. The total cost of the damage was $20 million, which to put it into context, was three times the cost of the entire Erie Canal, which had opened ten years earlier. Almost 700 buildings were destroyed. The city had 26 fire insurance companies. This fire put 23 of them out of business. Image: Library of Congress. Public domain. https://en.wikipedia.org/wiki/Great_Fire_of_New_York#/media/File:The_Great_Fire_of_the_City_of_New_York_Dec_16_1835.jpg
  • 11. What happened? ● contingency plans failed ● no failure domains ● exhausted incident responders 1835 A gas pipe burst in a maze of warehouses. These warehouses were full of extremely expensive, extremely flammable things for sale: lace, silks, musical instruments, and so on. It was winter with gale force winds and the fire spread very quickly through the wooden buildings. Inside two hours it covered 17 city blocks (or 13 acres), most of the financial district. The city's water supplies were low and it was a freezing night in December. Before the fire fighters could pull water from the rivers they had to cut through ice. At the time it was common to use gunpowder to level buildings and stop the fire spreading. But there had been a fire two days earlier and they were out of gunpowder. That fire had been bad: it involved the entire fire department of 1500 people, and they were still exhausted. I’ve seen no literature that says they did badly -- fire fighters tend to be complete badasses -- but nobody does their best work when tired. Still, they fought the fire for 15 hours until marines from the Brooklyn Navy Yard arrived with more gunpowder and made a barrier by blowing up some buildings along Wall Street. So, two contingency plans failed. When there's a fire, we'll spray water on it and use gunpowder. But there wasn't enough of either. And having no gunpowder meant no failure domains, nothing to stop the fire spreading.
  • 12. Outcome: better incident response ● a bigger, better, non-volunteer fire department ● reliable water: Croton dam and aqueduct 1835 As a result of the fire, the number of firefighters was increased, and they got better equipment. The city stopped relying on volunteer fire fighters and used only professionals. And they built the Croton Dam and Aqueduct. It was built because of the fire, but a reliable water source is good for lots of reasons!
  • 13. ● they rebuilt in stone Outcome: better buildings 1835 As well as better incident response, they took the opportunity to make a more resilient city. The fire spread fast because the buildings were made of wood. They rebuilt with stone and brick. And this paid off, ten years later, when there was another enormous fire. The great fire of 1845 was very bad -- thirty people died -- but it didn’t spread as far or as fast, because it slowed down when it hit those new brick buildings.
  • 14. 1860 Tenement Fires The population of New York City doubled every decade between 1800 and 1880. Maybe you've seen this with teams and software systems: when you're growing rapidly, it's easy to build some culture problems and some technical debt. This was certainly true in this case: landlords made more accommodation by splitting big rooms into many smaller ones, called tenements, mostly without light or ventilation. In the 1860s, more than half the city -- nearly 500,000 people -- lived in tenements. These were horrible places to live. They were filthy, and riddled with crime and disease, and every report about them mentioned that they were fire traps. A New York Times article in 1860 said “If a skillful man, with a deadly hatred of his race in his heart, sat down to plan a human residence in which to entrap and destroy those who should dwell in it, it is extremely probable that if he had seen these houses in West Forty-fifth-street he would take them as a model.” Image: Moncrief. Public domain. https://commons.wikimedia.org/wiki/File:LowerEastSideTenements.JPG Quote: http://www.nytimes.com/1860/03/29/news/destructive-fires-four-tenement-houses-destroyed-two-mothers-eight-children.html?pagewanted=all
  • 15. What happened? ● bakery in the basement ● clutter ● no isolation ● obsolete contingency plans 1860 In 1860, two bad tenement fires happened back to back, killing at least twenty people. The first one, on Elm Street (now Lafayette), started in the bakery on the ground floor of a six storey building. The wooden stairway burned away, trapping people on the top floor. They could get to the roof, but this building was four storeys higher than its neighbours, so there was nowhere to go. The baker was storing a lot of hay and wood shavings and when they burned, they made dense smoke, which killed some of the people who lived on the top floors before the fire even got up there. A month later, on West 45th Street, four houses burned. All four of these had roof hatches called scuttles, which would have let people escape across the roofs, but they were missing their ladders so people couldn't get up there. The roofs were canvas covered in pitch, so when the fire reached them, it spread quickly across the buildings. No isolation. These escape plans -- the ladders and scuttles and escaping across the roof -- had worked fine for a previous iteration of shorter NYC buildings, but they hadn't been updated for the new shape of the city. I'm sure people had noticed, but until there was a disaster, it didn't get priority.
  • 16. Outcome: better buildings ● An Act to Provide Against Unsafe Buildings in the City of New York ● fire-proof stairs 1860 The city immediately passed a law to make the tenements more robust against fire. They even put an injunction on new tenement construction until the law was passed. Now houses for more than eight families (kind of specific) had to have fire-proof stairs either inside or outside the building. What’s frustrating about this is that four years earlier a commission had reported that, if there was a fire, tenants on the 6th and 7th floors of tenements had basically zero chance of survival. They recommended fireproof stairs. But nothing happened until a bunch of people died.
  • 17. 1867 The Tenement House Act Tenements must have fire escapes! What does that mean? ¯_(ツ)_/¯ Seven years later, the Tenement House Act was passed. This act had extremely good goals. It was extremely unsuccessful. The act said that tenements had to have fire escapes, but it didn't really spell out what that meant. Buildings had to have a fire escape, but they didn't have to make anyone safer! So landlords put up fire escapes that couldn’t hold the number of people in the house, or that weren’t well attached to the walls, or that fed into tiny spaces that couldn’t hold all the people. And what even was a fire escape? Did a rusty ladder count? Absolutely! Let's take a diversion and look at some fire escape patents. I will admit that these are not especially relevant to devops but they're delightful, so humour me. Image: Detroit Publishing Co., publisher. Public domain. https://commons.wikimedia.org/wiki/File:New_York,_N.Y.,_yard_of_tenement_LOC_det.4a18586.jpg
  • 18. Things that are fire escapes William Houghton, 1891 This is a ladder with a counterweight. Imagine climbing down from the 7th floor of your building on one of these. With your six children. In a dress that went to your ankles. Image: Scientific American. Public domain. https://en.wikipedia.org/wiki/Fire_escape#/media/File:Houghton%27s_Fire_Escape_1877.jpg
  • 19. Things that are fire escapes Mary McArthur, 1904 This is a kind of rope ladder that attaches to a window sill. Patent: http://www.google.com.pg/patents/US800934
  • 20. Things that are fire escapes William Bedinger, 1915 This is a parachute that rolls up very small. The idea was that you'd carry it with you everywhere in case you were in any tall building fire situations. Patent: https://www.google.com/patents/US1168465
  • 21. Things that are fire escapes Henry Vieregg 1902 "A person desiring to escape seizes one member of the cord, rope, or chain, as shown in Fig. 1, and forthwith jumps out of the window. [...]" Like, I am looking at this thing and do not feel like I could forthwith jump out of anything. Patent: https://www.google.com/patents/US708846
  • 22. Things that are fire escapes Anna Gonnelly 1887 This is a bridge that you can sling from your roof to another building. It has side rails, so it's only moderately terrifying. Patent: https://www.google.com/patents/US368816
  • 23. Things that are fire escapes Pasquale Nigro 1909 This one is just fantastically ludicrous. But good if you want to fight supervillain crime, I guess? All of these patents were granted, btw. Patent: https://www.google.com/patents/US912152
  • 24. Things that are fire escapes BB Openheimer 1879 You might think that this is just a parachute helmet. It is not. It is a parachute helmet and a pair of very bouncy shoes. Patent: https://www.google.com/patents/US221855
  • 25. Things that are fire escapes Nicholas Borgfeldt 1882 Finally, I've read this patent three times and I'm fairly convinced that the guy invented a rope. It's the most Silicon Valley invention of 1882. Though, let's be clear, rope was a popular kind of fire escape. In fact, it was the state of the art for hotels. Patent: https://www.google.com/patents/US267399
  • 26. The New Rope Fire-Escape Law for Hotels I don't mean a ladder made of rope, I mean literally a rope. Every hotel room had to have a rope and that was the only fire escape. Even at the time, people found that pretty terrible. This is a snarky cartoon from a magazine called Puck, published in 1887, of a whole lot of people trying to use the ropes. Image: https://books.google.com/books?id=XwAjAQAAMAAJ&pg=PA48. Pre 1923 so public domain.
  • 27. The escape plan only works for one of these people ---> Puck Magazine, 1887 This lady is saying "Slide down a rope in my night-dress, with every body looking at me? Never! I'll be cremated first!" And it's a fair objection! These escape plans are designed for the easiest case: someone with good upper body strength and agility who isn't wearing a skirt or carrying a child. If your disaster plan only works for the easiest case, it's not a good plan. I want to emphasise here that a rope is still better than nothing. In fact, probably every one of these fire escapes, even Mister Parachute Hat, is better than nothing. Once the fire has started, you’ll be glad of whatever you have. But these escape plans are not where I would put my efforts if I wanted to have fewer people die in fires. But this is what the law focused on. Image: https://books.google.com/books?id=XwAjAQAAMAAJ&pg=PA48 Pre 1923 so public domain.
  • 28. 1867 The Tenement House Act Tenements also must have windows! What does that mean? ¯_(ツ)_/¯ Even with fire escapes, tenements were still terrible. They were badly constructed, overcrowded, there was no ventilation, and -- I find this amazing -- it was perfectly legal to store lots of combustible materials in them. One other thing the tenement act said was that every room now had to have a window. And just like “what even is a fire escape” it didn’t define “what even is a window”. So the landlords cut holes in interior walls between rooms and called them "interior windows". A decade later, the law said sigh, ok, exterior windows. So landlords started constructing buildings with air shafts, little narrow gaps between buildings. Bear in mind that there’s no indoor plumbing in these tenements and the bathroom is maybe down six flights of stairs and now you have an air shaft, so you can imagine how that goes. One article I read described the air shafts as “festering tubes of disease” (very poetic), and said that they provided just enough oxygen (and, maybe, methane) to help spread fires more quickly. So that’s something. Anyway, many of the fire escapes just led down to these air shafts and there was no way out from there. Image: U.S. National Archives and Records Administration. Public domain. https://en.wikipedia.org/wiki/Old_Law_Tenement#/media/File:Airshaft_of_a_dumbbell_teneme
  • 30. 1871 More Tenement House Acts! Carla Geisser CC BY THANK YOU CARLA <3 By 1871, iron fire escapes were common and of course people were using them as extra space. Kids played and slept out on them. People aired their mattresses there and hung laundry. You still see that now -- they're used for bikes and gardening and barbecue space and cat runs. All of that has been illegal since 1871. A later law said that fire escapes had to have a cast-iron sign saying that you could be fined for obstructing your fire escape. You still see those signs in some places in the city. And it was fair, because usable fire escapes are better than unusable ones. But, again, it was still perfectly legal to run your explosive business out of a tenement basement and tons of residential fires started because of people deep frying crullers. And anyway, the regulations were mostly not enforced, so people didn't pay much attention. Image: Carla Geisser. Used with permission.
  • 31. 1876 Brooklyn Theater Fire In 1876, and this is a staggering number: 278 people died in Cadman Plaza in the third worst theater fire in US history. (The worst won't happen until 1903 in Chicago, so in 1876 this is the worst one ever.) The final act of the play was about to start and the stage manager noticed a very tiny fire on the left of the stage. Image: Waller & Schrader, Photographers - Period Stereograph. Public domain. https://en.wikipedia.org/wiki/Brooklyn_Theatre_fire#/media/File:BrooklynTheatre_From_Johnson_Street_Looking_East.jpg
  • 32. What happened? ● obsolete contingency plans ● clutter ● clumsy incident response ● delayed response ● locked doors 1876 It was typical to keep buckets of water next to the stage, but there weren't any. There was a fire hose, but too much scenery was in the way to get to it. So the stage manager asked a couple of carpenters to put the fire out by beating it with poles. This didn't work and actually spread some sparks, setting fire to the loft. The actors wanted to avoid a panic, so they announced that the fire was part of the show, and that people shouldn't freak out, but once the audience realised, they stampeded. And they had trouble getting out. There was only one stairway down from the cheap seats at the top, and it filled with smoke. There were no fire escapes. Some exits were locked to keep out gatecrashers, so people couldn't get out that way. 278 people.
  • 33. Outcome: better buildings ● prosecutions ● new laws ● sprinklers 1876 The jury blamed the theater owners for not obeying a bunch of existing fire laws, and new laws were written, including not storing stuff on the stage and widening exits. In 1882, the building code said that theatres had to have automatic sprinklers: it's the first type of building in the city to require sprinklers. The first automated response. What I find remarkable is that this fire happened nine years after regulation said that tenements had to have safe exits, but those laws didn't carry over to theatres. It turns out they didn't carry over to other types of buildings either: hotels, schools, factories, ships, offices all followed their own path to fire safety and each had horror stories to get them there. Most of those I'm not going to talk about (trust me, it's better this way), but we'll look at factories in a minute, after….
  • 34. 1890-1901 Even more Tenement House Acts! ...we get proper no-kidding tenement regulation at last! And it doesn't even take a devastating fire to make it happen. Thank you Jacob Riis! In 1890, Jacob Riis published a book about tenement life called How the Other Half Lives and did a lecture tour on it. And up until now the upper and middle class people of New York City had sort of known the tenements were awful, but for the first time ever, there were photographs. It was harder to ignore. And over the next decade, people started to care about the conditions of tenements. Well, it was probably part empathy, part fear of smallpox coming out of there but, whatever, people suddenly cared. I was really reassured when I read this, because until then it had been all “there was a horrific fire and we added a very specific law and then there was a different horrific fire and we added a different very specific law”. And it was mostly like that! But this Tenement House Act came from someone saying “wow, look how much this sucks” in a compelling way. And that gives me hope! Anyway, the next couple of Tenement House Acts included having to have actual windows, not air shafts, and fire escapes couldn't be ladders any more: they had to have open balconies and stairs and be properly attached to the wall. Even better: your neighbours can no longer boil oil in the basement! And all new construction has
  • 35. to have interior fire partitions. Failure domains! We're finally looking at stopping fires from starting and spreading, not just escaping from them. And, best of all, it’s all actually going to be enforced. Welcome to the 20th century! But, oh yeah, it still sucks in factories. Image: Public domain. https://commons.wikimedia.org/wiki/Category:How_the_Other_Half_Lives#/media/File:How_the_Other_Half_Lives_front_cover.png
  • 36. The Newark Factory Fire 1910 The Triangle Shirtwaist is the more famous one, but the Newark factory fire a few months earlier is a textbook disaster waiting to happen so I wanted to talk about it. The building was shared by a couple of paper box companies, a nightgown factory and a lamp manufacturer. It had previously been used by machine companies and the floors were soaked in oil. It had two fire escapes -- look at the size of this building! One ended up on a roof, with no way down from there, and the other was a really heavy ladder hanging from the third floor balcony. This was another emergency plan that only worked for people with good upper body strength, and this factory employed mostly young women. In the fire, they weren't able to lift down the ladder, so there was effectively only one fire escape.
  • 37. What happened? ● no fire alarms ● locked door ● not enough fire escapes ● delayed response, for insurance reasons ● panic 1910 A fire started in the lamp factory. There was no fire alarm, and everyone had evacuated the bottom three floors before they realised that 116 people up on the 4th didn't know there was a fire. The only door up to the 4th floor was kept locked, which was against the law. People on the ground brought a net and started catching people jumping from the fourth floor, but it broke and they only had one net. 25 people died, mostly from jumping. 32 more were badly injured. The buildings department had condemned this factory three times, but the factory owners ignored them. This building had had ten fires in ten years, which was expensive for insurance and they didn't want another fire on their record, so they delayed calling in the firefighters, even though the firehouse was just across the street. And the victims had never been in a fire drill and they had no idea what to do. They, quite reasonably, freaked out.
  • 38. "The commissioner of the New Jersey bureau regulating fire safety in factories felt that the building was sufficiently constructed and that the victims merely succumbed to panic." Operator error? NOPE. From "Fire Escapes in Urban America: History and Preservation", by Elizabeth Mary André (emphasis mine). When officials investigated, they said the root cause was not the walls soaked in grease, or the delay in calling the firefighters, or the locked door, or the lack of fire escapes. It was that "the girls panicked". Human reaction to an outage or a disaster is never the root cause. Humans will act in human ways. If your systems can't handle that, and you haven't invested a lot of time in training the humans to act in some other way, your systems are crap. Quote: https://www.uvm.edu/histpres/HPJ/AndreThesis.pdf
  • 39. Outcome: …? ● “They died from misadventure and accident.” ● New York City Fire Chief Croker: "This city may have a fire as deadly as the one in Newark at any time." ● He wasn't wrong. 1910 So what happened? Nothing. The jury didn't convict. New Yorkers did look a bit at their factories and say "huh, I wonder if we should care about that"..., but nothing changed. Is it because it happened ten whole miles away instead of right in the city? No idea. The New York Fire Chief said "This city may have a fire as deadly as the one in Newark at any time". Four months later… Quote: "They died from misadventure and accident" from http://www.nytimes.com/2011/02/24/nyregion/24towns.html Quote: "This city may have a fire as deadly as the one in Newark at any time." from http://trianglefire.ilr.cornell.edu/primary/testimonials/tf_warnings.html
  • 40. 1911 The Triangle Shirtwaist Factory 146 people, mostly young immigrant women, died within 18 minutes in the Triangle Shirtwaist Fire. Image: Public domain. https://en.wikipedia.org/wiki/Triangle_Shirtwaist_Factory_fire#/media/File:Image_of_Triangle_Shirtwaist_Factory_fire_on_March_25_-_1911.jpg
  • 41. What happened? ● no failure domains ● only one fire escape ● locked doors ● obsolete contingency plans ● and they already knew 1911 This building was considered fireproof, but it was packed with garments hanging so tightly together that the building might as well have been made out of cloth. The building should have had three fire escapes; it had one and that collapsed under the weight of people escaping, killing 20 people who dropped from the 7th floor. One exit was locked; the guy with the key escaped without unlocking it. The fire ladders and the water from the hoses could only get to the 6th floor and the factory was on the 7th to 9th. And the employers already knew about the problems. Employees had organised a strike the previous year to protest the working conditions, and they'd been fired. The building had had a recent warning notice from the department of sanitary control, but they hadn't fixed their violations.
  • 42. Outcome: better incident response ● New fire-fighting equipment 1911 The fire department developed a stronger water pump and a longer ladder, so they could reach taller buildings.
  • 43. Outcome: better buildings ● 60 new laws in three years ● sprinklers ● professional organisation ● "the common outside form of iron ladder-like stairway anchored to the side of the building is a pitiful delusion" 1911 But more importantly, building conditions took a big step forwards. A commission was started to look into fire hazards and other conditions in factories and their recommendations turned into 60 new laws over the next three years. Again, everyone knew factories were bad. But, when people died, they changed the law. Sprinklers started to be required in factories. (But only factories over seven stories tall. Very specific again). A professional organisation, the American Society of Safety Engineers, was founded and still exists. And people started to look at fire escapes differently. After the disaster, a report called them "a pitiful delusion" and "a type of exit condemned by the experience of many fires". “It has long been recognized that the common outside form of iron ladder-like stairway anchored to the side of the building is a pitiful delusion. This device for a quarter of a century has contributed the principal element of tragedy to all fires where panic resulted. Passing successively the window openings of each floor, tongues of flames issuing from the window of any one floor cut off the descent of all on floors above it. Iron is quickly heated and is a good conductor of heat, and expansion of the bolts, stays, and fastenings soon pulls the framework loose, so that the weight of a single body may precipitate it into the street or alley. Many a human being has grasped the hot rail of such a 'fire escape' only to release it with a scream and leap from it in agony. Its platforms are usually pitifully small, and a rush to them from several floors at once jams and chokes them hopelessly. It is a makeshift creation of the cupidity of
  • 44. landlords, frequently rendered still more useless by the ignorance of tenants, who clutter it up with milk bottles, ice boxes and other obstructions.” Quote: http://www.nfpa.org/News-and-Research/Publications/NFPA-Journal/2014/September-October-2014/Features/Fire-Escapes/1914-Sound-the-Alarm
  • 45. “FIREWALL IN ALL BUILDINGS SEEN AS ONLY SAFEGUARD "Horizontal" Escapes Created in 206 City Structures After Triangle Fire With No Casualties Since” -- New York Times, February 25, 1923 1923 In 1923, the New York Times had an article praising fireproof interior walls: "For six years there has been no loss of life by fire in the 200 buildings so treated. And the cost is far less than the cost of the fire escapes and other equipment." It blows my mind that a group of 206 buildings having no fire deaths in six years was newsworthy. Those fireproof walls became code. In 1929, all new buildings over 75 feet in height had to have them, and also had to have two fully enclosed staircases! Failure domains are part of the code at last! Headline: https://timesmachine.nytimes.com/timesmachine/1923/02/25/105849722.html?pageNumber=141
  • 46. More fires 1968+ More very specific laws The code made frequent improvements over the next decades and in 1968 fire escapes stopped being code. You can't build them now. The 1968 code also required sprinklers for hotels and high-rise office buildings, but not nightclubs or residential buildings. In 1975, seven people died in a nightclub, so, sprinklers became required for nightclubs. In 1998 there were two bad residential fires, and now you have to have sprinklers for residences with four or more units. And I'm sure there will be more changes to react to future disasters. There's no retrofitting of existing buildings, btw. The laws only apply to new buildings and existing buildings only get better as they're renovated. Building code: Fire escapes shall not be permitted on new construction, with the exception of group homes. Fire escapes may be used as exits on buildings existing on December sixth, nineteen hundred sixty-eight when such buildings are altered, subject to the approval of the commissioner, or as provided in subdivision (b) hereof. https://www.lawserver.com/law/state/new-york/ny-laws/ny_new_york_city_administrative_code_27-368
  • 47. Fire deaths didn't decrease because we built better fire escapes. It was because we built better buildings. The number of fire deaths didn't decrease because we built better fire escapes. It was because we built and operated better buildings. In the end, the thing that made the difference was making it harder for the fire to start and spread. But for decades we optimised for escaping from it.
  • 48. Barbara L Hanson CC BY 2.0 Dan DeLuca CC BY 2.0 Eden, Janine and Jim CC BY 2.0 don toye CC-BY-ND 2.0 Kristine Paulus CC-BY-ND 2.0 Fire escapes are last resort plans. Even at their best, they kind of suck: ● Not everyone can get out a window ● Windows get locked ● People put air conditioners in them ● And child-proof bars that are also adult proof. ● Fire escapes get covered in snow and ice ● And stuff, like bicycles and plants ● They rust ● They get detached from the wall ● The ladder gets stuck ● They make people worried about burglars ● If you leave windows open, fire blocks the path down But most importantly ● they never, ever get tested. Images: Kristine Paulus CC BY 2.0. https://flic.kr/p/fszEDf (plants) Dan DeLuca CC BY 2.0. https://flic.kr/p/5hsnTM (chairs) Eden, Janine and Jim. CC BY 2.0. https://flic.kr/p/7G1tWZ (snow) Barbara L. Hanson. CC BY 2.0. https://flic.kr/p/8uxpcf (rain)
  • 49. Don toye, CC BY 2.0 https://flic.kr/p/9XrAs (bike)
  • 50. “ vertical ladders also placed relatively greater stress on their mounts to the building, leading to fire escape collapses during times of intense use – such as during actual fires. John W. Cramer, The Story of a Tenement House Fire escapes pretty much have one time of intense use: during a fire. If they're going to collapse, that's when they're going to collapse. You don’t find out they don’t work until there’s a fire. Quote: http://www.boweryboogie.com/2014/10/favorite-pastime-tenement-fire-escapes
  • 51. If we're using a (terrifying) contingency plan a lot had to go wrong If we’re using a fire escape to escape a fire, a lot of things have had to go wrong. In fact, we had three chances not to be here.
  • 52. we could have prevented the spark! 1 Ideally, the fire wouldn’t have started at all. Something made the first spark. There was heat and fuel and oxygen. Maybe we could have not let it start. Image: Christophe. CC0. https://pixabay.com/en/match-sticks-smoke-ignite-fire-359970/
  • 53. we could have automatically fixed it! 2 Or we could have stopped it while it was tiny. Who hasn’t set a dish towel on fire in the kitchen, or burned toast? If you catch a fire quickly enough, it’s not a big deal. You just extinguish it. For bigger fires, maybe there’s a sprinkler system which triggers automatically. Yeah, it makes a mess, but you get to not have a complete building outage. Image: JonathanLamb Public domain. https://commons.wikimedia.org/wiki/File:Fire-blanket-on-display.jpg
  • 54. we could have contained it! 3 If we have proper failure domains, we can keep the fire or outage to one small part of the building or infrastructure. We can at least make the fire spread very slowly, so we have time to react non-urgently. Image: Tim Gouw. CC0. https://unsplash.com/photos/MApjpqu9V7E
  • 55. ok I guess we're reacting to a fire :-( 4 If we miss those three chances, we end up at stage four, urgently reacting to a fire that's out of our control. If we're here, we probably care a whole lot about fire escapes! But it would have been much better not to get to here. Image: skeeze. CC0. https://pixabay.com/en/firefighters-training-live-fire-696167/
  • 56. 75% of site reliability is not firefighting at least! Whether it's a building or a software system, the most important reliability work is making problems stop before they get to that fourth stage. Or at least it should be. I got a recruiter mail a couple of years ago that said 'Our site reliability engineers are seen as "firefighters"'. Wow, what a waste of SREs. I know there's a strong association between SREs and on call, and the pager, but that should be a really small part of the job. If your SREs or production folks are focused on emergency response, you're wasting 75% of their skillset. And, right, everyone who's writing code should have reliability in mind. Nobody sets out to write a precarious system. But just as we have people who specialise in UI or security, both of which we should all care about, we have people who specialise in reliability. And they need to be involved at every stage. So: prevention, detection, isolation, response. Let's look at those four stages again while thinking about software.
  • 57. prevent the spark! 1 Design for failure: prevention Something caused the first tiny breakage that caused the outage. Maybe we changed something, or something got overloaded or a user entered an input we didn’t expect. Or someone bulldozed a cable. A certain number of sparks is fine! Some things are allowed to have more problems than others. We should know what our SLAs allow. We have error budgets. But how can we stop those things from happening more than we can handle? Image: Christophe. CC0. https://pixabay.com/en/match-sticks-smoke-ignite-fire-359970/
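To put a number on "more than we can handle", here is a rough sketch of the error budget arithmetic. The function names and the 30-day window are assumptions made for illustration; they don't come from any particular SRE tool, and your own agreements may count downtime differently.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return how many minutes of downtime the target allows in the window."""
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be a fraction such as 0.999")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - availability_target)


def budget_remaining(availability_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent budget; a negative value means too many sparks."""
    return error_budget_minutes(availability_target, window_days) - downtime_minutes
```

A 99.9% target over 30 days allows about 43 minutes of downtime. At 99.99% it is a little over four minutes, which leaves very little room for a human to wake up and find a laptop.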
  • 58. solid structures ● choose your stack carefully ● design review ● think about weak points There's this quote by Ellen Ullman: “We build computer systems like we build our cities: over time, without a plan, on top of ruins.” We don't need to build systems like that. We can think about our stacks from the start, with reliability in mind. We can spend time on high level design review and try to find holes in the system before we build it. Image: by me. My dad built this wall <3
  • 59. wiring inspections ● component design review ● code review ● testing ● fault injection ● be paranoid State Farm CC BY 2.0 Next, catching bugs before they ship. We do this by having a second pair of eyes on our designs and our code, but also by writing good tests, both for the stuff we think can't happen and for the stuff that can. Fault injection like fuzz testing is a good way to test input we haven't thought of. And validate everything. Even if you write the only caller of a function, check what input you just passed yourself. Assume that anyone who calls one of your functions, including yourself, is a monster who's trying to take you out. Be paranoid. Image: State Farm. CC BY 2.0. https://flic.kr/p/duWtgw
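As one possible illustration of "validate everything" plus a very small fuzz test, here is a sketch in Python. The `parse_port` function and the fuzz loop are hypothetical examples written for this transcript, not code from the talk; real fuzzing tools are far more thorough than random strings.

```python
import random
import string


def parse_port(value: str) -> int:
    """Validate a port number given as text, rejecting anything suspicious."""
    if not isinstance(value, str):
        raise ValueError("port must be passed as a string")
    text = value.strip()
    # isdigit() alone accepts non-ASCII digits, so check isascii() as well.
    if not text.isascii() or not text.isdigit():
        raise ValueError(f"port is not a plain number: {value!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def fuzz_parse_port(rounds: int = 1000, seed: int = 0) -> int:
    """Throw random printable junk at parse_port; return how many inputs it accepted."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(rounds):
        length = rng.randint(0, 8)
        candidate = "".join(rng.choice(string.printable) for _ in range(length))
        try:
            port = parse_port(candidate)
        except ValueError:
            continue  # rejected cleanly, which is what we want for junk
        # Anything accepted must still satisfy the function's promise.
        assert 1 <= port <= 65535, f"accepted out-of-range input {candidate!r}"
        accepted += 1
    return accepted
```

Any exception other than `ValueError` escapes the loop and fails the run, which is the point: the only acceptable outcomes for hostile input are a valid port or a clean rejection.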
  • 60. hiding the matches ● don't give users access they don't need ● clean interfaces ● sudo not root Michael Chen CC BY 2.0 A stove igniter is a better tool than a box of matches. Don't give users access to functions or data they don't need, and when they do need it, provide clean, safe interfaces that are hard to get wrong. And don't even give yourself more access than you need: use sudo, not root. Image: Michael Chen CC BY 2.0. https://flic.kr/p/LdPYz
  • 61. operating with care ● plan changes ● canary everything ● feature flags Reproduced from NFPA's website, © NFPA (2018). The fire department recommends that you don't operate a stove while drunk or sleepy, and the same goes for a root prompt or admin console. Many outages are caused by changes, so make them deliberately and carefully. Canarying helps: push the change to one instance before you push it to all the instances. And push out new features in a way that makes it very fast to turn them off if you need to. Image: Reproduced from NFPA's website, © NFPA (2018). https://www.nfpa.org/~/media/files/public-education/resources/safety-tip-sheets/cookingsafety.pdf http://www.nfpa.org/termsofuse
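A minimal sketch of the canary idea, assuming you can hand it three callbacks of your own: `deploy`, `is_healthy` and `rollback` are hypothetical names standing in for whatever your deployment tooling actually provides.

```python
from typing import Callable, Sequence


def canary_rollout(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Push a change to one instance first; continue only if it stays healthy."""
    if not instances:
        return True  # nothing to deploy to
    canary, rest = instances[0], instances[1:]
    deploy(canary)
    if not is_healthy(canary):
        # The canary took the damage so the other instances didn't have to.
        rollback(canary)
        return False
    for instance in rest:
        deploy(instance)
    return True
```

A real rollout would probably watch the canary for a while and go out in several waves rather than one; this only shows the shape of "one instance before all the instances".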
  • 62. fire safety campaigns ● best practices ● conferences! Over time, the best practices for our industry have changed. We don't log in as root any more. We use config management. We use change control. We use repeatable builds. We socialise the idea of reliability in books and articles and at conferences. Image: Senior Airman Kyle Gese. Public domain. https://commons.wikimedia.org/wiki/File:Fire_Prevention_Week_131009-F-OP138-012.jpg
  • 63. automatically fix it! 2 Design for failure: detection But, ok, sometimes we still break things. We have two options for immediate response: humans staring intensely at screens, or robots. Robots are better! Alice Goldfuss said (I think it was her Monitorama talk last year) “If you have three minute SLAs that you expect to be satisfied by a human, you don’t have SLAs.” I agree with that. Humans aren't fast enough for 4 nines. Image: JonathanLamb Public domain. https://commons.wikimedia.org/wiki/File:Fire-blanket-on-display.jpg
  • 64. smoke alarms ● early warning. But don't burn out your responders! topquark22 CC BY 2.0 Smoke alarms need a fine balance, as everyone knows who’s ever burned toast and had trouble shutting up their smoke alarm. And taken the batteries out :-/ (Don't do that.) Having humans react to small problems can burn them out. You're using up your gunpowder on small fires and not having enough left for the big ones! So ideally have a robot deal with the thing, or make it happen slowly enough that humans don't need to burn adrenaline to get involved. Image: topquark22. CC BY-2.0. https://flic.kr/p/6AcBru
  • 65. fire extinguishers ● rollback buttons The less time a human spends on deciding what to do, the better. In an oil fire emergency, don't make people have to think about whether the water fire extinguisher is safe; make sure the available tools are the safe ones. Provide a one-click rollback for all your changes. Let your on caller put out the fire, and then we'll figure out how it happened. Image: Hans. CC0. https://pixabay.com/en/fire-extinguisher-fire-delete-99915/
  • 66. sprinklers ● automatic recovery ● automatic response HomeSpot HQ CC BY 2.0 www.homespothq.com Even better than automatic response, is automatic recovery. There's a ton of ways we do this, especially at low levels of abstraction. If you drop a packet, TCP don't care. It's built into the algorithm. Resend that thing. You're not paging a human for a dropped packet or a failed checksum. We need automatic recovery higher up the stack. If tasks are flapping, we should be able to ride it out. If a backend goes missing, we should be able to coast, at least for a while. If a machine dies, it should automatically be replaced. Health checking and load balancing should move traffic from an unhealthy region to a healthy one. Maybe you want to let humans know, but the message they should get is "everything is under control but you might want to look at this when you get a chance". Not "WELCOME TO 3AM! A MACHINE REBOOTED". You don't want humans involved in failing over. We'll just mess it up. Image: HomeSpot HQ. CC BY 2.0. www.homespothq.com https://flic.kr/p/fmr7a7
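One small version of "resend that thing" higher up the stack is retrying a failed call with exponential backoff before anyone gets paged. This is a sketch under assumptions: `call_with_retries` is a made-up helper, and the attempt count and delays are arbitrary illustrative values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying with exponential backoff if it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: now it is worth telling a human
            # Wait longer after each failure: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that tests don't have to wait. In production you would likely also add jitter and retry only errors you know to be transient, so a fleet of clients doesn't hammer a struggling backend in lockstep.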
  • 67. contain it! 3 Design for failure: isolation Ok, there's a fire, it's happening. Now we need to not let it get on anything it's not already on. Or at least slow it down enough that we can catch it. Image: Tim Gouw. CC0. https://unsplash.com/photos/MApjpqu9V7E
  • 68. fire barriers ● isolation ● sharding ● failure domains Achim Hering CC BY 3.0 Failure domains split the infrastructure up so that only one part of it should be affected by any given outage. Maybe we've sharded our users over a bunch of different servers. Maybe we've added redundant network connectivity. If the problem's going to move as components get overloaded, we want that to be slow enough that we can control it, not an immediate cascade. Image: Achim Hering. CC BY 3.0. https://commons.m.wikimedia.org/wiki/File:Durasteel_fire_barrier.jpg
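To make the sharding point concrete, here is a sketch of assigning users to shards with a stable hash, so that losing one shard only affects the users who live on it. The function names are hypothetical, and hashing modulo the shard count is just the simplest scheme to show; it reshuffles most users if the shard count changes.

```python
import hashlib
from typing import Iterable


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user to a shard index that is stable across processes and restarts."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def affected_fraction(user_ids: Iterable[str], shard_count: int, failed_shard: int) -> float:
    """Return the fraction of users who live on the failed shard."""
    users = list(user_ids)
    if not users:
        return 0.0
    hit = sum(1 for user in users if shard_for(user, shard_count) == failed_shard)
    return hit / len(users)
```

With the users spread over ten shards, an outage of one shard should touch roughly a tenth of them instead of everyone. That only holds if the shards don't share a hidden dependency, which is exactly what a failure domain is supposed to rule out.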
  • 69. fire drills ● controlled outages ● disaster tests Humans will panic the first time they get paged, or the first time they hit a situation that's completely outside their comfort zone. They'll flail. They may even make the problem worse. Fire drills help. Just like we make it incredibly common to hear a smoke alarm and find our way outside, make it so that a disaster is never a surprise. At intervals, tell people you're doing a controlled outage, and take a system offline. See what breaks. Let people get their panic out of the way while you're there staring at the systems. This will also shake out dependencies that you hadn't considered. If you announce you're taking some low-SLA system down for ten minutes, you may find a bunch of high-SLA users you don't know about! Image: Clker-Free-Vector-Images. CC0. https://pixabay.com/en/safety-helmet-construction-hat-295057/
  • 70. avoiding encumbrances ● clean ops ● fatigue awareness You know the phenomenon where you're paged for a thing, and in the course of fixing it, you hit a bunch of unintuitive commands, or out of date documentation, and it ends up taking you much longer to do something simple? Or you even end up breaking something else? These traps are a basement full of straw, or walls covered in oil, or a fire hose with cluttered scenery on top of it. It's making it very, very hard for you to move around safely as you try to fix the real problem. Fatigue is another one. You're way more likely to make a mistake if you're exhausted. Set rules about how many incidents a person should have to deal with and how many hours they have to spend responding to problems before their on call shift is over. Enforce those rules. If you end up having a ton of six hour on call shifts, your pager fires too much. Quit burning out your humans. Image: by me.
  • 71. react with urgency 4 Design for failure: response Ok, sometimes this won't work. We don't have perfect software. Sometimes we will respond to outages and we can set ourselves up for success in that.
  • 72. not locking fire exits :-( ● no encumbrances ● don't comment out the safety system Robert Couse-Baker CC BY 2.0 First off, and the thing that broke my heart through many of the fire stories: don't lock your fire exits. If you know there's a potential disaster and you have a way to deal with it, don't disable it. Don't lock yourself out of your control plane by having a response that depends on the thing that just went down. Don't comment out the automatic recovery system. Please, don't take the batteries out of your smoke alarm, literally or metaphorically. Image: Robert Couse-Baker. CC BY 2.0. https://flic.kr/p/6L2qvU
  • 73. communication ● status dashboards ● known issues ● announcement email Communicate about outages early. Make sure everyone who needs to know about a problem knows about it. It's the worst when you're debugging some problem with your system and after ages you get the notification that some other system has problems that caused it. Image: by me.
  • 74. documented exits ● anticipate what you'll do in a disaster ● playbooks ● gotchas And document your exits. A lot of the time, we know how we expect our systems to fail. Maybe they've failed like that before. Maybe it's a specific thing we've set up an alert for. Playbook entries let us tell an on caller what to do. And document the things *not* to do, or even better, make it impossible to do them. Image: by me.
  • 75. controlled burns ● practice responding ● wheel of misfortune ● find unexpected dependencies Jereme Rauckman CC BY 2.0 Firefighters train using controlled burns and we can too. Just like we have drills to prevent panic, we can use them to speed up our response time by practicing outages. A wheel of misfortune is a regular exercise where you pick an arbitrary outage from a list of things that might happen, and work through it to make sure you know what to do. I saw a great one once where someone ended up needing to call Japan, and he had no idea how to make an international call. This is the kind of stuff we should find out before we need it! Have regular fake outages, wheels of misfortune, and even large scale disaster tests. Image: Jereme Rauckman. CC BY 2.0. https://flic.kr/p/pjPGD6
  • 76. Reliability can't be added at the end. That's why we do DevOps. What I'm saying is that reliability can't be added at the end. If you're fixing a terrible system by focusing on stage 4, with human response, you have a tenement. Foul air is coming in through the air shafts, and it's not somewhere humans should spend time. Reliability needs to be built in. Failure needs to be built in.
  • 77. data from http://www.baruch.cuny.edu/nycdata/public_safety/civilianfire.htm graph made with gnuplot Century high: 1970, 310 people. Century low: 2016, 48 people. 71 people died of fire in NYC in 2017. 48 people in 2016. This is still a lot of people! But 2016 was the lowest number since they started recording a hundred years ago. That Bronx fire in December that killed 12 people was the deadliest in 25 years. How did we get from the fire traps of the 1800s to here? This graph shows NYC civilian fire deaths for each of a sample of years. I need a more complete data source. Also, bear in mind that the city population is growing so the decrease is even more impressive. Data: http://www.baruch.cuny.edu/nycdata/public_safety/civilianfire.htm
  • 78. 444 ---> pages!!! Well, this helped. This is the New York City fire code. It has 444 pages. Book: http://shop.iccsafe.org/2014-new-york-city-fire-code.html Fire code: https://www1.nyc.gov/site/fdny/about/resources/code-and-rules/nyc-fire-code.page
  • 79. 444 ---> pages!!! Fire safety is also mentioned plenty in the city building code, the city construction code, the state building code, the National Fire Protection Association’s electrical code and plenty of other dense legislation. Don’t ask me what the difference is between all of these. There’s a lot of code, that’s all I’m saying. But we don't have a fire code for software. We have a bunch of O’Reilly books and they're great. Our industry makes a valiant effort to document how we do things. But nothing whatsoever makes us adhere to them, or prioritises one set of rules over the others. Why don't we have a fire code? Books: http://shop.iccsafe.org/state-and-local-codes/new-york-city/2014-new-york-city-codes-complete-collection-1.html https://catalog.nfpa.org/NFPA-70-National-Electrical-Code-NEC-Softbound-P1194.aspx
  • 80. “Millions of computers throughout the world are executing millions of instructions per second for millions of seconds without a single error [...] In spite of this, nobody trusts a computer; and this lack of faith is amply justified.” Software: A Vital Key to UK Competitiveness (C) Crown Copyright 1986 via Risks Digest (https://catless.ncl.ac.uk/Risks) h/t Joe Thompson @caffeinepresent It has been proposed from time to time! There is an amazing list called the RISKS Digest, about public safety risks caused by computers. It's been running since the 80s. Digging around on there, I found this report from 1986 called "Software: a vital key to UK competitiveness", which had a whole appendix on safety critical software. It starts with “No computer software failure has killed or injured a large number of people. It is just conceivable that such a tragedy could occur.” and it has detailed sections on disaster prevention, management and analysis. It also includes this fantastic line: “in spite of this, nobody trusts a computer, and this lack of faith is amply justified”. I love it. Quote: https://catless.ncl.ac.uk/Risks/4/14#subj3.1
  • 81. In 1986, the UK Advisory Council for Applied Research and Development recommended… ● Before any organization can operate a life-critical computer system it must first obtain a License To Operate (LTO) ● Each life-critical system must be operated by a Certified Software Engineer who is named as being personally responsible for the system. The Advisory Council predicted a time when it wouldn’t be possible to recover from software failure by just switching off the computer and doing the thing manually -- this was written in 1986, remember. We're there now. They wanted certification: you would only be able to operate a life-critical computer system if you had a license and a Certified Software Engineer and a bunch of other stuff, and you'd have to get re-certified every five years. They also proposed what are basically on-call shifts, disaster recovery practice drills, and post-mortems, including post-mortems for near misses. A lot of this feels prescient and we ended up doing it, but we never required certification.
  • 82. Slide stolen from @jkuroda's amazing LISA keynote. Used with permission. If you were at LISA in November, you might have seen a fantastic talk -- it was the closing keynote -- by Jon Kuroda about aviation safety. Like with fire, plane travel got safer only after a lot of bad accidents. Jon pointed out that, while we might think of computing as a new field, it's the same age as a bunch of others. Software, aviation, power, emergency medicine all took a big jump forward after World War 2. But our industry is less mature than any of the others. Slide: https://people.eecs.berkeley.edu/~jkuroda/talks/jkuroda-systemcrash-planecrash-lisa2017.pdf Video of Jon's talk: https://www.usenix.org/conference/lisa17/conference-program/presentation/kuroda
  • 83. The stakes are lower? Why are we less mature? Is it because the stakes are lower? Mostly, the stakes have been lower. Though we have had life-threatening and sometimes fatal software bugs.
  • 84. The stakes are lower? ● (1985-1987) Therac-25 ● (1992) London Ambulance dispatch failure ● (2013) "Character substitution errors may occur" ● (2017) "autocorrects medications to names of different medications [...] without telling the user" The Therac-25 radiation therapy machine had a concurrent programming bug that made it very rarely give its patients radiation doses that were hundreds of times greater than they should have been. Three people died. In college I remember studying the London Ambulance dispatch failure. A new software system was deployed that hadn't been load tested, and that had a memory leak. It couldn't keep track of where the ambulances were, which led to them arriving hours late. 46 people died who might have been ok if the ambulance had arrived on time. I haven't heard of any actual negative outcomes from the OCR bug that went around in 2013 and the medication autocorrection that people have been talking about recently, but in both cases, you can see how someone might print out a bunch of prescriptions that don't say what the doctor wrote. The use of software for life-critical systems grows every year. And every month we send #hugops on Twitter to the people working on the latest massive software outage. At some point these will overlap. Are we ready for this kind of responsibility?
  • 85. “ It took an Iroquois Theater fire to improve the safety of theaters. It took a Titanic disaster to improve the safety of vessels. It took a Newark fire and a Triangle fire to bring New York State's fire legislation to its present inefficiency. Inis Weed, New Outlook volume 104, 1913 My new favourite 1910s journalist, Inis Weed, summed it up: "It took an Iroquois Theater fire to improve the safety of theaters. It took a Titanic disaster to improve the safety of vessels. It took a Newark fire and a Triangle fire to bring New York State's fire legislation to its present inefficiency." Quote: https://books.google.com/books?id=URCzNkpDZp0C
  • 86. Maybe software could improve without a disaster? Let's choose not to build tenements But some regulations didn't come from fires! A bunch came from a lot of people deciding to care about the same thing at the same time. We SRE/DevOps/Production folks are here, doing this thing, because we care about reliability. We care and we can encourage everyone else to care. Software is increasingly used for life-critical systems. I don't want us to wait for a disaster to decide not to build tenements. We can decide now what good systems look like. We can create professional standards and industry safety codes, and opt in to a professional organisation to keep ourselves honest. And then, like the fire code, we can keep revising and improving it until huge software outages are rare. The entire industry should learn from every major outage. No secrets.
  • 87. ● http://noidea.dog/fires ● Escapes in Urban America: History and Preservation, Elizabeth Mary Andre ● No exit: the rise and demise of the outside fire escape, Sara E Wermiel ● How Fire Disaster Shaped the Evolution of the New York City Building Code, Charles Shelhamer ● The Creative and forgotten fire escape designs of the 1800s, Lauren Young ● New Outlook vol 104 (May-August 1913) ● RISKS Digest ● 1910 Newark Factory Fire, Mary Alden Hopkins ● New York City (NYC) Disasters, Baruch College ● Presentation template by SlidesCarnival Questions, objections? Find me at @whereistanya or fires@noidea.dog #GetAlarmedNYC Before I finish: The FDNY and the Red Cross have a shared campaign to give people free smoke alarms and free batteries. They'll even come install it for you. If you don't have a smoke alarm, please search for #GetAlarmedNYC and fill in their form. http://fw.to/Kzv1G4f This slide lists a few references that I found especially useful or interesting while writing this talk. That first one contains a list of all the others, so hit up http://noidea.dog/fires if you want a lot of links to read more about fires and fire escapes. It includes a bunch of fires I didn't include here because I ended up with like 70 minutes of material (whoops). If you have comments on the talk, or questions, or you're a building historian who is willing to tell me what I got wrong, please find me at @whereistanya on Twitter or fires@noidea.dog. And also, everyone, please insist on fireproof software. That's all I have, thank you. Image: Sweetie candykim. CC0. https://commons.wikimedia.org/wiki/File:Smoke_alarm.JPG