Thoughts on DevOps vs. Enterprise Culture Clash

Probably not unlike you, every day I work with folks caught in a clash between organizational processes and technology imperatives. “We have to get this new software up and running, but the #DevOps group won’t give me the time of day.”

Large organizations don’t have the luxury of ‘move fast, break stuff’; if they did, their infrastructure, security, financial, and software release processes would be a chaotic mess…far more than usual. But how does one ‘move fast’ without breaking enterprise processes, particularly ones that they don’t understand?

Enterprise, Know Thyself

The answer is simple: encourage engineers to always be curious to know more about their environment, constraints, and organizational culture. The more you know, the more nimble you’ll be when planning and responding to unanticipated situations.

Today I had a call with a health care company, working to get Docker installed on a RHEL server provisioned by an infra team. What the operator didn’t know was that the security team, which uses Centrify to manage permissions on that box, required tickets to be created to grant ‘dzdo su’ access for a very narrow window of time. Additionally, the usual ‘person to connect with’ was off on holiday break, so we were at the mercy of a semi-automated process for handling these tickets. Because they had already put in a similar request in the past 7 days, all new tickets would have to go through a manual verification process. This frustrated our friend.

The frustration manifested in the form of the following statement:

Why can’t they just let me have admin access to this non-production machine for more like 72 hours? Why only 2 measly hours at a time?

– Engineer at an F100 health care organization

My empathy and encouragement to them was to “expect delays at first, don’t expect everyone to know exactly how processes work until they’ve gone through them a few times, but don’t accept things like this as discouragements to your primary objective.”

If everything were easy and no problems existed, kind words might be useless. When things are not working that way, knowing how to fix or overcome them goes a long way, just like a kind word at the right time. We crafted an email to the security team together explaining exactly what was needed AND WHY, as well as an indication of the authority and best/worst case timelines that we were operating under, and a sincere thank you.

Enterprise “DevOps” Patterns that Feel Like Anti-Patterns

In my current work, I experience a lot of different enterprise dynamics at many organizations around the world. The same themes, of course, come up often. A few dynamics I’ve seen in play when enterprises try to put new technology work in a pretty box (i.e. consolidate “DevOps engineers” into a centralized team) are:

  1. Enterprise DevOps/CloudOps/infra teams adopt the pattern of “planned work”, just like development teams, using sprints and work tracking to provide manageable throughput and consistency of support to other organizational ‘consumers’. This inherits other patterns like prioritization of work items, delivery dates, estimable progress, etc.
  2. Low/no context requests into these teams get rejected because it’s slow/impossible to prioritize and plan based on ambiguous work requirements
  3. The amount of control and responsibility these teams have over security and infrastructure systems in the organization is often considered “high risk”, so they’re subject to additional scrutiny come audit time

That last point about auditing, particularly the psychological impacts on ‘move fast’ engineers, cannot be overstated. When someone asks you to break protocol ‘just this one time’, it’s you that’s on the hook for explaining why you took action to do so, rarely the product owner or director who pressured the engineer to do it.

Technical auditors that are worth anything more than spit will focus on processes instead of narrow activities because to comb through individual log entries is not scalable…but verifying that critical risk mitigative processes are in place and checking for examples of when the process is AND isn’t being followed…that’s far more doable in the few precious weeks that auditing firms are contracted to complete their work.

The More You Know, The Faster You Can Go (Safely)

An example of how understanding your enterprise organization’s culture improves the speed of your work comes from an email today between two colleagues at F100+:

Can you confirm tentative dates when you are planning to conduct this test? Also will it take time to open firewall, post freeze incident tickets can be fast tracked?

– Performance Engineering at Major Retailer

This is a simple example of proper planning. Notice that the first ask is for concrete dates, an inference that others also need to have their shit together (in this particular case because they’re conducting a 100k synthetic user test against some system, not a trivial thing in the slightest). The knowledge that firewall rules have to be requested ahead of time, and that incident response should be notified that potential issues reported may be due to the simulation, not real production traffic, comes from having experienced these things before. Understanding takes time.

Another software engineer friend of mine in the open-source space and I were discussing the Centrify thing today, and he asked: “why can’t they just set up and configure this server with temporary admin rights off to the side, then route appropriate ports and stuff to it once it’s working?” Many practitioners in the bowels of enterprises will recognize a few wild assumptions there. This is in no way a slight on my friend, but rather an example of how different the thinking is between two very different engineering cultures. More specifically, those who are used to being constrained and those who aren’t often have a hard time collaborating with each other because their reasoning is predicated on very different past experiences. I see this one a lot.

DevOps Is an Approach to Engineering Culture, not a Team

This is my perspective after only 5yrs of working out what “DevOps” means. I encourage everyone to find their own by having their own journey of curiosity, keyboard work, and many conversations.

There is not, and never should be, a DevOps ‘manifesto’. As Andrew Clay Shafer (@littleidea) once said, DevOps is about ‘optimizing for people’, not process or policy or one type of team only. Instead of manifesto bullet points, there are some clear and common principles that have stood the test of time since 2008:

  • A flow of work, as one way as possible
  • Observability and Transparency
  • Effective communication and collaboration
  • A high degree of automation
  • Feedback and experimentation for learning and mastery

Some of the principles above come from early work like The Phoenix Project, The Goal, and Continuous Delivery; others come from more formalized research such as ISO and IEEE working groups on DevOps that I’ve been a part of over the past 3 years.

I don’t tend to bring the “DevOps is not a team” bit up when talking with F100s primarily because:

  • it’s not terribly relevant to our immediate work and deliverables
  • enterprises that think in terms of cost centers always make up departments, because “we have to know whose budget to pay them from and who manages them”
  • Now that DevOps is in vogue with various IT leaders and just like the manifestation of Agile everywhere now, DevOps is perceived as ‘yet another demand to do things differently from management’, so after being restructured, engineers often have enough open wounds that I don’t need to throw salt on
  • if this is how people grok DevOps in their organization, there’s little I as an ‘outside’ actor can do to change it…except maybe a little side-conversation over beers here and there, which I try to do as much as appropriately possible with receptive folks

However, as an approach to engineering culture, DevOps expects people to work together, to “row in the same direction”, and to learn at every opportunity. As I stated at the beginning of this post, learning more about the people and processes around you, the constraints and interactions behind the behaviors we see, being curious, and having empathy…these things all still work in an enterprise context.

As the Buddha taught, the Middle Path gives vision, gives knowledge, and leads to calm, to insight, to enlightenment. There is always a ‘middle way’, and IMO it is often the easiest path between extremes to get to the place where you want to be.

Put That in Your Pipeline and Smoke Test It!

I rarely bother to open my mouth as a speaker and step into a spotlight anymore. I’ve been mostly focused on observing, listening, and organizing tech communities in my local Boston area for the past few years. I just find that others’

A friend of mine asked if I would present at the local Ministry of Testing meetup, and since she did me a huge last-minute favor last month, I was more than happy to oblige.

“Testing Is Always Interesting Enough to Blog About”

Permissioned quote from the Boston DevOps community, Dec 12th 2019. James Goin, DevOps Engineer

The state and craft of quality (not to mention performance) engineering has changed dramatically in the past 5 years since I purposely committed to it. After wasting most of my early tech career as a developer not writing testable software, the latter part of my career as of late has been what some might consider penance to that effect.

I now work in the reliability engineering space. More specifically, I’m a Director of Customer Engineering at a company focusing on the F500. As a performance nerd, everything inherits a statistical perspective, not excluding how I view people, process, and technology. In this demographic, “maturity” models are a complex curve across dozens of teams and a history of IT decisions, not something you can pull out of an Agilista’s sardine can or teach like the CMMI once thought it could.

A Presentation as Aperitif to Hive Minding

This presentation distills those experiences to date as research. It was mostly inspired by wanting to learn what other practitioners like me think when faced with the challenge of translating the importance of holistic thinking around software quality to business leaders.

Slides: bit.ly/put-that-in-your-pipeline-2019

Like I say at the beginning of this presentation, the goal is to incite collaboration about concepts, sharing the puzzle pieces I am actively working to clarify so that the whole group can get involved with each other in a constructive manner.

Hive Minding on What Can/Must/Shouldn’t Be Tested

The phrase ‘Hive Minding’ is (to my knowledge and Google results) a turn-of-phrase invention of my own. It’s one incremental iteration past my work and research in open spaces, emphasizing the notions of:

  • Collective, aggregated collaboration
  • Striking a balance between personal and real-time thinking
  • Mindful, structured interactions to optimize outcomes

At this meetup, I beta launched the 1-2-4-All method from Liberating Structures that seemed to work so well when I was in France at a product strategy session last month. It so well balanced the opposite divergent and convergent modes of thinking, as discussed in The Creative Thinker’s Toolkit, that I was compelled again to continue my active research into improving group facilitation.

Even after a few people had to leave the meetup early, there were still six groups of four. In France there were eight contributors, so I felt that this time I had a manageable but still scaled (4x) experiment of how hive minding works with larger groups.

My personal key learnings

Before I share some of the community feedbacks (below), I should mention what I as the organizer saw during and as outcomes after the meetup:

  • I need to use a bell or chime sound on my phone rather than having to interrupt people once the timers elapse for each of the 1-2-4 sessions; I hate stopping good conversation just because there’s a pre-agreed-to meeting structure.
  • We were able to expose non-quality-engineer people (like SysOps and managers) to concepts new to them, such as negative testing and service virtualization; hopefully next time they’re hiring a QA manager, they’ll have new things to chat about
  • Many people confirmed some of the hypotheses in the presentation with real-world examples; you can’t test all the things, sometimes you can’t even test the thing because of non-technical limitations such as unavailability of systems, budget, or failure of management to understand the impact on organizational risk
  • I was able to give shout-outs to great work I’ve run across in my journeys, such as Resilient Coders of Boston and technical projects like Mockiato and OpenTelemetry
  • Quite a few people hung out afterward to express appreciation and interest in the sushi menu of ideas in the presentation. They are why I work so hard on my research areas.
  • I have to stop saying “you guys”. It slipped out twice and I was internally embarrassed that this is still a latent habit. At least one-third of the attendees were women in technology, and as important as it is to be an accomplice to improving underrepresented communities (including non-binary individuals), my words need work.

A Few Community Feedbacks, Anonymized

Consolidated outcomes of “Hive Minding” on the topics “What must be tested?” and “What can’t we test?”
  • What must we test?
    • Regressions, integrations, negative testing
    • Deliver what you promised
    • Requirements & customer use cases
    • Underlying dependency changes
    • Access to our systems
    • Monitoring mechanisms
    • Pipelines
    • Things that lots of devs use (security libraries)
    • Things with lots of dependencies
  • What can’t we test?
    • Processes that never finish (non-deterministic, infinite streams)
    • Brute-force enterprise cracking
    • Production systems
    • Production data (privacy concerns)
    • “All” versions of something, some equipment, types of data
    • Exhaustive testing
    • Randomness
    • High-fidelity combinations where dimensions exponentially multiply cases
    • Full system tests (takes too long for CI/CD)
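
To put a rough number on the “dimensions exponentially multiply cases” item above, here is a minimal sketch. The dimension names and values are made up for illustration and did not come from the meetup; the point is only how quickly a modest test matrix grows.

```python
import itertools
import math

# Hypothetical test dimensions; the names and values are illustrative only.
dimensions = {
    "browser": ["chrome", "firefox", "safari", "edge"],
    "os": ["windows", "macos", "linux"],
    "locale": ["en", "fr", "de", "ja", "es"],
    "plan": ["free", "pro", "enterprise"],
    "network": ["wifi", "4g", "offline"],
}

# The size of the full matrix is the product of the sizes of every dimension.
total_cases = math.prod(len(values) for values in dimensions.values())

# itertools.product enumerates the same matrix lazily, one combination at a time.
all_cases = itertools.product(*dimensions.values())
first_case = dict(zip(dimensions, next(all_cases)))

print(total_cases)
print(first_case)
```

Five small dimensions already yield 540 combinations (4 × 3 × 5 × 3 × 3), and every additional dimension multiplies that again, which is why groups filed exhaustive and high-fidelity combination testing under “can’t”.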

A few thoughts from folks in Slack (scrubbed for privacy)…

Anonymized community member:

Writing up my personal answers to @paulsbruce’s hivemind questions yesterday evening: What can/should you test?

  • well specified properties of your system, of the form if A then B. Only test those when your gut tells you they are complex enough to warrant a test, or as a preliminary step to fixing a bug, and making sure it won’t get hit again (see my answer to the next question).
  • your monitoring and alerting pipeline. You can never test up front for everything, things will break. The least you can do is test for end failure, and polish your observability to make debugging/fixing easier.

What can’t/shouldn’t you test?

  • my answer here is a bit controversial, and a bit tongue in cheek (I’m the person writing more than 80% of the tests at my current job). You should test the least amount possible. In software, writing tests is very expensive. Tests add code, sometimes very complex code that is hard to read and hard to test in itself. This means it will quickly rot, or worse, it will prevent/keep people from modifying the software architecture or making bold moves because tests will break/become obsolete. For example, assume you tested every single detail of your current DB schema and DB behaviour. If changing the DB schema or moving to a new storage backend is “the right move” from a product standpoint, all your tests become obsolete.
  • tests will often add a lot of complexity to your codebase, only for the purpose of testing. You will have to add mocking at every level. You will have to set up CICD jobs. The cost of this depends on what kind of software you write, the problem is well solved for webby/microservicy/cloudy things, much less so for custom software / desktop software / web frontends / software with complex concurrency. For example, in my current job (highly concurrent embedded firmware), everything is mocked: every state machine, every hardware component, every communication bus is mocked so that individual state machines can be tested against. This means that if you add a new hardware sensor, you end up writing over 200 lines of boilerplate just to satisfy the mocking requirements. This can be alleviated with scaffolding tools, some clever programming language features, but there is no denying the added complexity

To add to this, I think this is especially a problem for junior developers / developers who don’t have enough experience with large scale codebases. They are starry-eyed about TDD and “best practices” and “functional programming will save the world”, and so don’t exercise the right judgment on where to test and where not to test. So you end up with huge test suites that basically test that calling database.get_customer('john smith') == customer('john smith') which is pretty useless. much more useful would be logging that result.name != requested_name in the function get_customer

the first is going to be run in a mocked environment either on the dev machine, on the builder, or in a staging environment, and might not catch a race condition between writers and readers that happens under load every blue moon. the logging will, and you can alert on it. furthermore, if the bug is caught as a user bug “i tried to update the customer’s name, but i got the wrong result”, a developer can get the trace, and immediately figure out which function failed
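
For readers who want to see the contrast, here is a minimal sketch of the two approaches the contributor describes. It is my illustration, not their code: `FakeDatabase`, `Customer`, and the logger name are hypothetical stand-ins, and only `get_customer`, `result.name`, and `requested_name` come from the message above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("customers")  # hypothetical logger name


@dataclass
class Customer:
    name: str


class FakeDatabase:
    """Hypothetical in-memory stand-in for the real storage backend."""

    def __init__(self, rows):
        self._rows = rows

    def get(self, key):
        return self._rows[key]


def get_customer(database, requested_name):
    """Look up a customer and log an error if the result does not match the request."""
    result = database.get(requested_name)
    if result.name != requested_name:
        # This check runs on every call, including under production load,
        # so it can surface a mismatch that a mocked test run never sees.
        logger.error(
            "get_customer mismatch: requested=%r got=%r", requested_name, result.name
        )
    return result


# The kind of test the contributor calls "pretty useless": against a fake,
# it mostly restates what the fake was told to return.
db = FakeDatabase({"john smith": Customer("john smith")})
assert get_customer(db, "john smith") == Customer("john smith")
```

The assertion at the bottom only exercises the fake, while the logged check travels with the function into production, where an alert can be attached to it. Whether that trade is worth making is exactly what the rest of the thread argues about.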

Then someone else chimed in:

It sounds like you’re pitting your anecdotal experience against the entire history of the industry and all the data showing that bugs are cheaper and faster to fix when found “to the left” i.e. before production. The idea that a developer can get a trace and immediately figure out which function failed is a starry-eyed fantasy when it comes to most software and systems in production in the world today.

The original contributor then continues with:

yeah, this is personal experience, and we don’t just yeet stuff into production. as far as data-driven software engineering, I find mostly scientific studies to be of dubious value, meaning we’re all back to personal experience. as for trace driven debugging, it’s working quite well at my workplace, I can go much more into details about how these things work (I had a webinar with qt up online but I think they took it down)

as said, it’s a bit tongue in cheek, but if there’s no strong incentive to test something, I would say, don’t. the one thing i do is keep tabs on which bugs we did fix later on, which parts of the sourcecode were affected, who fixed them, and draw conclusions from that

Sailboat Retrospective

Using the concept of a sailboat retrospective, a few things that I’d like to improve are below, namely:

Things that propel us:

  • Many people said they really liked the collaborative nature of hive minding and would love to do this again because it got people to share learnings and ideas
  • Reading the crowd in real-time, I could see that people were connecting with the ideas and message; there were no “bad actors” or trolls in the crowd
  • Space, food, invites and social media logistics were handled well (not on me)

Things that slowed us:

  • My presentation was 50+ mins, way too long for a meetup IMO.

    To improve this, I need to:
    • Break my content and narratives up to smaller chunks, ones that I can actually stick to a 20min timeframe on. If people want to hear more, I can chain on topics.
    • Recruit a timekeeper from the audience, someone who provides accountability
    • Don’t get into minutiae and examples that bulk out my message, unless asked
  • Audio/video recording and last-minute mic difficulties kind of throw speakers off

    To fix this? Maybe bring my own recording and A/V gear next time.
  • Having to verbally interrupt people at the agreed-upon time-breaks in 1-2-4-All seems counter to collaborative spirit.

    To improve this, possibly use a Pavlovian sound on my phone (ding, chime, etc.)

Things to watch out for:

  • I used the all-too-common gender-binary phrase “you guys” twice. Imagine rooms where it would somehow be fine to say that, but saying “hey ladies” to a mixed crowd would be considered pejorative to many cisgender men. Everything can be improved and this is certainly one thing I plan to be very conscious of.
  • Though it’s important to have people write things down themselves, not everyone’s handwriting can be read back by others after, and certainly not without high-fidelity photos of the post-its afterward.

    To improve this, maybe stand with the final group representatives and if needed re-write the key concepts they verbalize to the “all” group on the whiteboard next to their post-it.


Engineering Is About More Than Code

Curiosity is what drives engineers, and is equal parts curse and companion. An engineer isn’t limited to development or operations. An engineer would be a problem-solver in both areas, probably more. Curiosity is a surprisingly rare quality in people, even in technology.

If you want to know how something works, take it apart and observe. My first digital systems disassembly was a Sony Discman in 1989. Whatever I did, I fixed it. The feeling was powerful. It just took me 25 years to realize that there are many broken things in the world and to prioritize which ones I involve myself with. Understanding the problem is crucial.

This is how I approach many conversations, navigating purposely and politely until there’s a useful reframing. People aren’t things, so be kind, be sensitive, and be patient. When you engage, learn about their biggest challenges, how they approach things, and what drives them. Just start with that.

[If my 80’s discman was still around, it would be like “yup, those were the days”. My 8086 XT clone next to it would splutter out some op codes. My Mega Man game watch would be waterlogged and stuck in a loop. Which all lead me to the next action…]

Put things back together again so that they work, hopefully, better than before. It’s just courtesy. In commit-worthy code that’s called hygiene. In conversation, that’s called maintaining a shared view or vision. Do these things enough and you’ll find that the way to a common goal is easier than with clutter obscuring your journey, and that of others. When you can row in the same direction, you get to your destination a whole lot faster.

Put it with other things to see where it doesn’t work. That’s integration and it’s not always easy, especially if it’s your own code or new auto-scaling configuration that causes unforeseen things to blow up. Get to know what your thing does before and after you put it out in the wild. Be honest with yourself and others about the time this takes.

There are some things you can take apart, and some things you can’t. If you can’t or if it’s too much effort for not enough value, move on to another learning tool, but remain committed to your goal. In the light of fundamental flaws in how we think about security, privacy, and basic human welfare right now, seek something you’re proud and grateful to do. [There are some very worthy things happening in Boston right now.]

CTO vs. CIO: How many tech “corners” do you really need?

Have you ever thought about what “departments” really means? The word “department” starts with another word: “depart”. Stop, think, continue reading.

Technical Chief Officer’s Dilemma: Departments and “Agency”

Are you in a situation where you honestly need people who purposely segregate themselves into groups that start with a departure from each other, rather than a congregation of ideas, people, and purpose?

If you are responsible for a technology “department”, you are responsible for a “failure”. #explain

Consider a geometric line, the most efficient way to connect one point to another. If only people were that easy. Get enough of them together and you start having to group them into manageable departments. IT, Development, Operations, Finance, Sales, Marketing, Management. Business lines to make things easy, right?

Departments are “Depart”-ments

Wrong. Departments screw things up. Drawing lines isn’t a good thing unless it’s to connect people with each other. They distract people from the simple truth that businesses who succeed are filled with people who instinctually understand that they are all on the same path, together.

Consider a geometric shape, the triangle. A line plus one point, an important point, an entire dimension. What good does it do to add another point beyond that? A square? Another department? Finance? HR? Marketing? Why?

I’m minimizing. I know wonderful, necessary people in finance and human resources. Apologies to them, it’s just to make a point.

Only the Right Lines Need to Be Drawn

People who work with very large organizations know this inside out. Enterprises, government agencies, financial institutions. Corporations. The more lines there are, the more overhead and lack of progress there is. Sure, there’s stability, structure, fortitude; but the further we get away from connecting point A to point B in a straight line, the less efficient we are.

Truly effective business starts with figuring out how to define things with the least number of lines. Communication, organization, collaboration all benefit from simplifying how many lines are drawn. #karma


[Talk] API Strategy: The Next Generation

I took the mic at APIStrat Austin 2015 last week.

A few weeks back, Kin Lane (sup) emailed and asked if I could fill in a spot, talk about something that was not all corporate slides. After being declined two weeks before that and practically interrogating Mark Boyd when he graciously called me to tell me that my talk wasn’t accepted, I was like “haal no!” (in my head) as I wrote back “haal yes” because duh.

I don’t really know if it was apparent during, but I didn’t practice. Last year at APIStrat Chicago, I practiced my 15 minute talk for about three weeks before. At APIdays Mediterranea in May I used a fallback notebook and someone tweeted that using notes is bullshit. Touché, though some of us keep our instincts in check with self-deprecation and self-doubt. Point taken: don’t open your mouth unless you know something deeply enough that you absolutely must share it.

I don’t use notes anymore. I live what I talk about. I talk about what I live. APIs.

I live with two crazy people and a superhuman. It’s kind of weird. My children are young and creative, my wife and I do whatever we can to feed them. So when some asshole single developer tries to tell me that they know more about how to build something amazing with their bare hands, I’m like “psh, please, do you have kids?” (again, in my head).

Children are literally the only way our race carries on. You want to tell me how to carry on about APIs, let me see how much brain-power for API design nuance you have left after a toddler carries on in your left ear for over an hour.

My life is basically APIs + Kids + Philanthropy + Sleep.

That’s where my talk at APIstrat came from. Me. For those who don’t follow, imagine that you’ve committed to a long-term project for how to make everyone’s life a little easier by contributing good people to the world, people with hearts and minds at least slightly better than your own. Hi.

It was a testing and monitoring track, so for people coming to see bullet lists of the latest ways to ignore important characteristics and system behaviors that only come from working closely with a distributed system, it may have been disappointing. But based on the number of conversations afterwards, I don’t think that’s what happened for most of the audience. My message was:

Metrics <= implementation <= design <= team <= people

If you don’t get people right, you’re doomed to deal with overly complicated metrics from dysfunctional systems born of hasty design by scattered teams of ineffective people.

My one piece of advice: consider that each person you work with when designing things was also once a child, and like you, has developed their own form of learning. Learn from them, and they will learn from you.

 

Quality Means Not Accepting Crap

Software. Hardware. Things. Opinions. Places. Excuses. Ideas.

Anyone can produce a cheap “affordable” solution. But details matter. How many cheap plastic things have broken in your hands unexpectedly, and were entirely disappointing in that moment?

My AirBnB is not that. I knew “quality” when I saw it. You can tell someone lived in this thing and made it convenient for them, then handed it off to you. That’s quality, making something that meets your own standards, then giving it to someone else.


I travel a lot, enough to know what matters on a trip. Leg room on the plane. Working wifi. Power plugs, everywhere. Politeness. Clean bathrooms. Details matter.

Conversely, book a $300/night hotel room only to find plugs too far away from the bed, lamp toggle buttons that take so much effort to push that you push the lamp over, and light switches that are harder to find than Carmen Sandiego, and the annoyances all add up too. The lights in this camper that I’m staying in are easy to use and don’t cause me to cuss.


Same with software, details matter.

Quality software comes from people using their own product, living in it, fixing its flaws, and asking others how their experience with it is. In the tech industry, we call it “dogfooding” your own product. Believe me, it works.

People intrinsically know “quality” when they experience it. They pick up a phone, it’s heavy and solid, they think “that’s quality”. Conversely, they close a car door and it rattles or sounds hollow, they think “that’s cheap”. Even the sounds shipped with your mobile phone help to engineer your perception of the quality of the device.

Quality is in the details.

Oh, and BTW, I’d also rather put a constraint on myself not to over-drink and stumble into a $450/night on-premise room way too late at night to wake up on time the next morning. I have business to attend to. Knowing when to quit starts with looking at a ridiculous estimate and just saying no:

[Image: airbnb-1-fail]

So, even at only 3 nights, this would have cost $1,350 just for a room I would be spending around 4-6 hours a night in, and not getting all the charms of an outside shower and condensation on the windows each morning. The AirBnB alternative for all three nights cost just 60% of what JUST ONE NIGHT AT THE SHERATON would have!

[Image: airbnb-2-win]

I should have remembered to check the crime overlay though, but Uber is a cheap solution to that problem:

[Image: airbnb-3-crime]