sur·vey
noun,
/sərˈvā/
1. a general view, examination, or description of someone or something.
"the author provides a survey of the relevant literature"
Each topic is worthy of 40 minutes on its own
Goal is to teach the reason for things
So you make better implementation choices
What is Operations?
By which we mean technical operations
The work of building and maintaining computer systems, networks, and applications.
The definition covers everyone
This is why "devops" is obvious and the new normal
How to be good at Operations
Design to improve the safety, contentment, knowledge, and freedom of your colleagues and users.
Focus on improving availability through reducing MTTD and MTTR.
Improve the organization's efficiency through improvements in People, Process, and Technology.
Done well, Operations enhances the safety, contentment, knowledge, and freedom of both the authors and users of the system.
Design is fundamental
Each choice you make needs to make life better for the humans involved
That also leads to better business outcomes, as we'll learn later
Ultimately, the most scalable, fastest systems are also the ones that are best for the humans involved, most of the time
Safety
Human safety
Information safety
Availability of the system as a possible link to both
The ability for individuals to act without fear of unintended consequences
Safety is a slider – different systems have different thresholds
Imagine you were at Twitter in the early days
The system wasn't human safety critical, in your mind
Until it became a source of human safety and communication during countless revolutions
Contentment
Contentment is about being satisfied with what you have.
The state of our systems is often a source of deep discontent :)
It may not make you happier – but it won’t hurt
Happiness is not a goal – it’s a by-product of a life well lived
- Eleanor Roosevelt
Happiness is fleeting
If you are in trouble, contentment helps you make better decisions
Think about a brutal on call week - if the systems that support you are good, you survive
Knowledge
Access to knowledge is a leading indicator of social progress.
We should be making it easier to understand what the system is for, why we need it, and what good outcomes are.
The goal isn’t to minimize needed knowledge – it’s to provide access to the wealth of it, when we need it.
The right knowledge, at the right time
Think about PaaS - it's awesome, you just git push
Until you are Rap Genius and Heroku changes the router and everything sucks and you don't know why
Which doesn't make PaaS awful - at a different level of criticality, who cares?
Freedom
The power or right to act, speak, or think as one wants without hindrance or restraint. – The Internet
We should be empowering ourselves and others to act, speak, and think as they need to with less hindrance.
The Big Web got this right
Empower individuals to work as they see fit
Trust them to do the right things
Build systems that increase the trust needed to allow more freedom
Safety
Contentment
Knowledge
Freedom
We will come back to these throughout
Being good at Operations
Means being good at two things
Availability
Efficiency
Availability: Is the system down? Bring it back up.
Efficiency: Make the effort required to do work easier.
The work here is building and maintaining computers, networks, and applications
So efficiently doing that covers damn near everything
Focus on Availability
Efficiency Follows
Availability shows where you need to be most efficient now
It's a virtuous cycle
Availability is everybody's problem
There is no team that owns availability - other than the company itself
The problems are too big
The 9's
| Availability | Downtime per month (30 days) |
| --- | --- |
| 90% (one nine) | 72 hours |
| 99% (two nines) | 7.2 hours |
| 99.9% (three nines) | 43.2 minutes |
| 99.99% (four nines) | 4.32 minutes |
| 99.999% (five nines) | 25.9 seconds |
The difference in magnitude matters - days, hours, half hours, minutes, seconds
To achieve higher levels, everything has to get more precise
Know your target, and communicate it
It probably isn't five nines
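The downtime budgets for each level of nines are plain arithmetic, and it can help to compute them for your own target and reporting period. Here is a minimal sketch, assuming a 30-day month by default; the function name `downtime_budget` is made up for illustration. A 30-day month gives 43.2 minutes at three nines, while an average-length month (about 30.44 days) gives roughly 43.8 minutes.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float,
                    period: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much downtime fits in `period` at the given availability.

    `availability_pct` is a percentage such as 99.9. The 30-day default
    period is an assumption; pass your own reporting period if it differs.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_budget(pct)}")
```

Passing `period=timedelta(days=365)` gives the yearly budget for the same target.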
The M's
Mean Time To Failure (MTTF) ↑
The average time there is correct behavior
Mean Time To Diagnose (MTTD) ↓
The average time it takes to diagnose the problem
Mean Time To Repair (MTTR) ↓
The average time it takes to fix a problem
Mean Time Between Failures (MTBF) ↑
The average time between failures
We want to decrease MTTD and MTTR
And increase MTTF and MTBF
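To see how the M's combine into an availability number, here is a rough sketch. It assumes each failure cycle is one stretch of correct behavior (MTTF) followed by diagnosis (MTTD) and then repair (MTTR), so that MTBF = MTTF + MTTD + MTTR. That decomposition and the function name `availability` are assumptions for illustration; some texts fold diagnosis time into MTTR.

```python
def availability(mttf_hours: float, mttd_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up over one average failure cycle.

    Assumes MTBF = MTTF + MTTD + MTTR, i.e. every failure is diagnosed
    and then repaired before correct behavior resumes.
    """
    if min(mttf_hours, mttd_hours, mttr_hours) < 0:
        raise ValueError("times must be non-negative")
    mtbf = mttf_hours + mttd_hours + mttr_hours
    if mtbf == 0:
        raise ValueError("at least one time must be positive")
    return mttf_hours / mtbf


if __name__ == "__main__":
    # Hypothetical numbers: one failure every 30 days, 30 minutes to
    # diagnose and 30 minutes to repair.
    print(f"baseline:              {availability(720, 0.5, 0.5):.5f}")
    # Same failure rate, but diagnosis and repair each take half as long.
    print(f"faster diagnose/repair: {availability(720, 0.25, 0.25):.5f}")
```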
Focus your efforts
On reducing Mean Time to Diagnose and Mean Time to Repair.
Failure is inevitable - it's how you detect and react that matter most to availability.
All systems fail
Fear of failure is the greatest killer of availability
Slow and ponderous
Fast and nimble
Online banking is a huge thing for consumer banks
I met with one that has 5 9's of availability
They achieved this by changing the website once every 6 months
After a torture chamber of hate and pain
They were not better at diagnosis and repair - they were good at MTBF, and lucky
Contrast that with a more nimble org, who might have more frequent outages (say scheduled maintenance once a week)
But the system improves week over week
Raise your hand for the one you want!
It's safer, increases human contentment, is easier to reason about, and frees people up
Diagnose
Metrics Collection
Collect metrics from the operating system, network, and applications.
High resolution matters!
As few systems as possible.
You can't fix what you can't see
Metrics resolution has a direct impact on MTTD
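One way to make the link between resolution and MTTD concrete is a back-of-the-envelope estimate. This sketch assumes a failure starts at a uniformly random moment between two samples and is only noticed once a set number of consecutive bad samples has been collected. The function name `detection_delay_seconds` and the `samples_to_confirm` parameter are made up for illustration.

```python
def detection_delay_seconds(interval_s: float, samples_to_confirm: int = 1):
    """Return (average, worst_case) seconds before a failure shows up.

    Assumes the failure begins at a uniformly random point between two
    samples and that `samples_to_confirm` consecutive bad samples are
    needed before anyone is told about it.
    """
    if interval_s <= 0 or samples_to_confirm < 1:
        raise ValueError("interval must be positive and samples_to_confirm >= 1")
    extra = (samples_to_confirm - 1) * interval_s
    return interval_s / 2 + extra, interval_s + extra


if __name__ == "__main__":
    for interval in (300, 60, 10):
        avg, worst = detection_delay_seconds(interval)
        print(f"{interval:>3}s samples: avg {avg:.0f}s, worst case {worst:.0f}s")
```

Under those assumptions, five-minute samples leave a failure invisible for up to five minutes before diagnosis can even start, while ten-second samples cap that at ten seconds.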
Diagnose
Two Critical Metrics
Is it up - from a user's perspective
Is it making money
One binary metric - can your users use your stuff
Money is often a trailing indicator of deeper systemic problems that are hard to see
I helped run an ad network back in the day, and the hour-by-hour money graph was the fastest way to see if we were letting people run over cap
Money graph also helps you justify other activity!
Diagnose
Graphing, Trends and Analysis
Use graphs to understand normal behavior.
Graphs taken from Theo Schlossnagle and OmniTI
Let's say this is puppy.com - the prime source for puppy news
Nice, easy content day - 70% utilization, smooth peaks and valleys
The Doge dog beats up the Taco Bell chihuahua outside the most posh dog park in LA
Puppy.com has the exclusive video
Diagnose
Graphing, Trends and Analysis
Use graphs to understand abnormal behavior.
Graphs taken from Theo Schlossnagle and OmniTI
The New York Times picks it up, and adds long exposure traffic
Digg shows up, and it goes to 11
Happens in 60 seconds!
Auto-Scaling Will Not Save You
Either you design for this load, or you fail to meet the expectations
The right answer here is serve puppy.com from behind fast.ly :)
Capacity Planning
Identify key metrics
Put them on a graph
Set a limit
Plot a trend line
Expand your time horizon
Capacity Planning
Do this on a regular cadence - monthly, etc.
Show your R-squared - think of it as a confidence number
This could be any metric that matters for your system
This is the number one source of trivially preventable outages
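The capacity planning steps (pick a metric, set a limit, plot a trend line, show your R-squared) can be sketched in a few lines. This is a minimal example using an ordinary least-squares line; the function names and the disk-usage numbers are hypothetical, and a straight line is only one possible trend model.

```python
def fit_trend(xs, ys):
    """Fit y = slope * x + intercept by least squares; also return R-squared."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) points of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot
    return slope, intercept, r_squared


def periods_until_limit(xs, ys, limit):
    """Return (periods left after the last sample, R-squared).

    Periods left is None when the trend is flat or falling.
    """
    slope, intercept, r_squared = fit_trend(xs, ys)
    if slope <= 0:
        return None, r_squared
    return (limit - intercept) / slope - xs[-1], r_squared


if __name__ == "__main__":
    months = [0, 1, 2, 3, 4, 5]
    disk_used_pct = [50, 54, 58, 62, 66, 70]  # hypothetical monthly samples
    left, r2 = periods_until_limit(months, disk_used_pct, limit=90)
    print(f"months until 90% full: {left:.1f} (R-squared {r2:.2f})")
```

A low R-squared is the signal that the straight-line projection should not be trusted and the time horizon or the model needs another look.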
Diagnose
Alerts
Get the attention of the right humans.
As few alerts as possible
Routed to the people who can take action
Start with the "is it up" alert
Never create an alert that isn't actionable!
There is nothing more disrespectful than waking someone up for shit they can't fix
It's happening... it's happening... again
Repair
Incident Response
Observe: what's going on
Orient: put what's going on in the context of what you know about the system, people, and dynamics
Decide: what to do next
Act: take action
Originally for fighter pilots to get inside the heads of the enemy
A faster loop means success in combat
This is the same pattern for responding to operations availability issues
Repair
Orient
Orient is the step we often fail at.
Thinking is the best tool we have in incident response.
Understanding more about the system, and how each piece behaves, is what separates the good from the great.
What Rob Pike learned from Ken Thompson
In fighter jets, knowing typical behavior, jets, and culture was crucial
Rob Pike and Ken Thompson working on a visual language
Rob typed faster, so he was at the keyboard
Rob attacked bugs, Ken thought about it
Ken was orienting better
Unlike a fighter jet, he had time :)
Repair
Incident Command
The First Responder is the default Incident Commander
Decides what to do next
Coordinates resources
Can hand off command
Communicates status
Not about rank
There is only ONE Incident Commander.
This isn't always true in real Incident Command, but go with it.
When it gets bigger than one person can handle, we flip to this
Knowing we have a process and a command structure makes it easier to OODA
And faster loops mean faster resolution
Learn
Post Mortem
Incident Commander schedules a post mortem within 24 hours of incident resolution.
Purpose is to learn from the incident, and identify the work needed to:
Prevent recurrence (if necessary)
Improve Mean Time To Diagnose
Improve Mean Time To Repair
This should be the IC at the end of the incident
Progress on safety coincides with learning from failure. This makes punishment
and learning two mutually exclusive activities: Organizations can either learn
from an accident or punish the individuals involved in it, but hardly do both
at the same time. The reason is that punishment of individuals can protect
false beliefs about basically safe systems, where humans are the least reliable
components. Learning challenges and potentially changes the belief about what
creates safety. Moreover, punishment emphasizes that failures are deviant, that
they do not naturally belong in the organization ...
Sidney W.A. Dekker, Ten Questions about Human Error: A New View of Human Factors and System Safety (Human Factors in Transportation)
Learn
How to run a Post Mortem
Invoke the space: we are here to learn, not to blame
Describe the incident
Establish the timeline
Identify contributing factors
Describe customer impact
Describe remediation tasks for the root cause
Describe improvement tasks for response process
We hold post mortems to learn and improve, not to blame and punish
Puppy.com went down when Digg linked to the Doge/Chihuahua story
Story gets posted at 8am PST, NYT picks it up at 8:15am PST, Digg posts at 8:30am PST
Site goes down at 8:30am, alert at 8:31am, diagnosed at 8:50am, more capacity launched on EC2 at 8:55am, online and resolved at 9:00am PST
The traffic load overwhelmed the Apache worker MPM configuration and exhausted capacity
People could not watch the Doge dog crush the chihuahua, and click ads
Launched more capacity. Long term remediation is to move static content to a CDN
We investigated a denial of service and backend database issues before we looked at traffic graphs. Add passive alert on traffic.
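A post mortem timeline like the one above turns straight into the numbers you want to track. This is a small sketch using the times from the example incident. Measuring time to diagnose from the moment of failure, and time to repair from the moment of diagnosis, is an assumption; pick one definition and use it consistently. The names `TIMELINE`, `minutes_between`, and `summarize` are illustrative.

```python
from datetime import datetime

# Times from the example incident, all on the same morning (PST).
TIMELINE = {
    "failure": "08:30",
    "alert": "08:31",
    "diagnosed": "08:50",
    "capacity_launched": "08:55",
    "resolved": "09:00",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(timeline: dict) -> dict:
    """Derive per-incident durations that feed MTTD and MTTR."""
    return {
        "time_to_alert": minutes_between(timeline["failure"], timeline["alert"]),
        "time_to_diagnose": minutes_between(timeline["failure"], timeline["diagnosed"]),
        "time_to_repair": minutes_between(timeline["diagnosed"], timeline["resolved"]),
        "total_outage": minutes_between(timeline["failure"], timeline["resolved"]),
    }


if __name__ == "__main__":
    for name, minutes in summarize(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

Here the alert fired within a minute, yet diagnosis took twenty, which is exactly the gap the traffic alert in the improvement tasks is meant to close.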
Prioritize the outcomes
The process works because you prioritize the outcomes
Our remediation steps are the efficiency improvements you want
If you fail to act, or do other stuff, you're wasting the opportunity
Availability Roundup
Understand your Availability Targets
Track and understand your M*'s
Reduce time to detect and repair
Use capacity planning to avoid obvious incidents
Have an incident response and command process
Perform and publish post-mortems for every incident
Prioritize the outcomes
Efficiency
$$Efficiency = \frac{Output}{Effort}$$
Make the effort required to do work easier.
People
Process
Technology
3 areas for efficiency, in order of most potential for gains
Think about puppy.com - if we didn't have the right people, if we didn't have a process for incidents, if we didn't have post mortems, the technology fixes wouldn't make a dent long term
What is the mission?
How does your organization intend to fulfill it?
How do you contribute?
What are the stakes?
Knowing your purpose enables you to put decisions in context
The more context you have, the better your decision will be
Like a very long OODA loop
Know the people
Software Developers
Business Decision Makers
Systems and Network Administrators
Marketing and PR
Sales
Legal
Trust is crucial to effective operations
Knowing people is crucial to trusting them
Set up lunch dates
Talk about your lives
THIS IS WHERE DEVOPS COMES FROM
John Allspaw and Paul Hammond are friends
When they create electronic devices, they can reflect on
whether that new product will take people away from themselves,
their family and nature. Instead they can create the kind of
devices and software that can help them go back to themselves, to
take care of their feelings. By doing that, they will feel good
because they’re doing something good for society.
- Thich Nhat Hanh at Google
The way we do our work informs our lives
Having good lives improves the quality of our work in every dimension
We are blessed to be the architects of our environment
Let's back Thay up with data
People
Engaged Workers Rule
Stats in this section come from asking 25 million employees the same 12 questions in
Gallup's state of the American Workplace
with causality evidence from Causal Impact of Employee Work Perceptions on the Bottom Line of Organizations.
Gallup has been running this study since the 90s
They have proven the impact engaged workers have is causal
What other single thing could you possibly do that has a 22% impact on profitability?
21% impact on productivity!
65% less turnover!
Or a 41% impact on defects! Happy people care about their work more
It's the most critical operations efficiency task
Sources of Engagement
Clear expectations
Opportunity to shine
Praise
Having people care about you
Having your opinions count
A mission that makes you feel important
Commitment to quality
Repetition, Repetition, Repetition
Training people is like training cats - you gotta be on that
Assholes
Know your Asshole
After encountering them, people feel oppressed, humiliated, or otherwise worse about themselves
They target people less powerful than them
Chronic assholes are the problem.
Sections on Assholes taken from The No Asshole Rule.
Not talking about a bad day - these people are out to undo all the good engaged people do
Pick someone out, insult them gently, then compliment them
Point out this is what they will remember from this talk, forever
What you can do
Don't be an Asshole, and fire or shun those who are
Set clear expectations for others
Praise people
Make friends with, and care about your co-workers
Listen to each other
Take pride in your work
Kaizen
改善
Change for the better
Continuous Improvement
A few lean/improvement resources: Lean Thinking, The Goal - there are so many more.
Kaizen
Small improvements
Evaluate a process, make it better.
Try using the scientific method:
Ask a question
Do research
Construct a hypothesis
Test your hypothesis
Analyze data and draw a conclusion
Communicate your results
Kaikaku
Radical Change
Recognize when desired results are beyond incremental improvement.
Start fresh, incorporate a new process, then do Kaizen
Continuous Delivery is a good example
If you are a big, waterfall org with manual testing
Incrementally moving to CD is going to fail
You need to blow up the way you work, learn how that feels, and kaizen your way to happiness
A house built on sand and all that
Technology
Systems Design
Understand the requirements
Do not mistake existing implementations for hard requirements
A big retailer's web division wanted to automate; I wanted to sell software
I asked how they felt about CD; they said they weren't CD people
I was like: Me neither! ;)
They told me their design, then said "then we come together and make it work"
We rebuilt it in that room, much better - those weren't real requirements
Scalable Systems Design
Identify autonomous actors, and have them keep their promises
Rolling Upgrade
Traditional web servers behind a load balancer
Upgrade servers one at a time
Naive way
Take App1 from Load Balancer Pool
Update Software on App1
Verify update worked
Put App1 back into Load Balancer Pool
What happens if a server is down? What happens to traffic in transit? What if we die in the middle?
This is what you would do if you wrote the steps down!
And it's what's going to happen in any case
But linearly implementing these as a script - whoa doggies
600 configuration changes to the load balancer!
Autonomous Actors
Each component responsible for itself
Promises
Each Autonomous Actor promises to behave a certain way.
Other Actors can verify those promises.
Identify Autonomous Actors
Load Balancers
Promises to route traffic to working app servers
Application Servers
Promises to serve application traffic and publish status
Better way
Update software on App1
Add a service that is smart about the app's status to each server
Monitor that service with the load balancer
Upgrade process manages that service's response
Load balancer just blindly routes traffic
All the questions from the naive implementation can be answered by improvements to the status endpoint
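The "better way" hinges on each app server publishing its own status, so here is a rough sketch of what such a status endpoint might look like, using only the Python standard library. The `/status` path, the class and function names, and the choice of 200 versus 503 are all assumptions for illustration; a real load balancer health check would be configured to match whatever your endpoint actually returns.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ServiceStatus:
    """Thread-safe flag that the upgrade process flips around a deploy."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_service = True

    def set_in_service(self, value: bool) -> None:
        with self._lock:
            self._in_service = value

    def in_service(self) -> bool:
        with self._lock:
            return self._in_service


def make_status_handler(status: ServiceStatus):
    """Build a request handler that reports the given status object."""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            ok = status.in_service()
            body = b"in service\n" if ok else b"out of service\n"
            # 200 keeps the server in rotation; 503 asks to be taken out.
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return StatusHandler


def serve_status(status: ServiceStatus, host: str = "127.0.0.1", port: int = 0):
    """Start the status endpoint on a background thread and return the server."""
    server = HTTPServer((host, port), make_status_handler(status))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With something like this in place, the upgrade process calls `set_in_service(False)`, waits for the load balancer to notice and for traffic to drain, updates the software, verifies it, and calls `set_in_service(True)`. The load balancer configuration never changes.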
The better solution has fewer interactions.
But it has more pieces.
We reduced the degree of difficulty in the process
Increased the number of moving parts
Safety: Resilient against many more failure modes
Knowledge: Far easier to reason about during Orient in the OODA loop
Freedom: Pattern adapts to different values of "available" based on service needs
Contentment: Safer, easier to reason about, and more flexible - that makes everyone content
Efficiency Roundup
Greatest gains are in improving People
Continually improve process, be willing to redesign in the face of new challenges
Use Scalable Systems Design to improve your technology and automation
How to be good at Operations
Design to improve the safety, contentment, knowledge, and freedom of your colleagues and users.
Focus on improving availability through reducing MTTD and MTTR.
Improve the organization's efficiency through improvements in People, Process, and Technology.