Thoughts on DevOps vs. Enterprise Culture Clash

Probably not unlike you, every day I work with folks caught in a clash between organizational processes and technology imperatives. “We have to get this new software up and running, but the #DevOps group won’t give me the time of day.”

Large organizations don’t have the luxury of ‘move fast, break stuff’; if they did, their infrastructure, security, financial, and software release processes would be a chaotic mess…far more than usual. But how do you ‘move fast’ without breaking enterprise processes, particularly ones you don’t yet understand?

Enterprise, Know Thyself

The answer is simple: encourage engineers to stay curious about their environment, constraints, and organizational culture. The more you know, the more nimble you’ll be when planning for and responding to unanticipated situations.

Today I had a call with a health care company, working to get Docker installed on a RHEL server provisioned by an infra team. The missing piece: the operator didn’t know that the security team, which uses Centrify to manage permissions on that box, requires a ticket for each grant of ‘dzdo su’ access, and only for a very narrow window of time. On top of that, the usual ‘person to connect with’ was away on holiday break, so we were at the mercy of a semi-automated process for handling these tickets, and because a similar request had already been filed in the past 7 days, all new tickets had to go through manual verification. This frustrated our friend.

The frustration manifested in the form of the following statement:

Why can’t they just let me have admin access to this non-production machine for more like 72 hours? Why only 2 measly hours at a time?

– Engineer at an F100 health care organization

My empathy for and encouragement to them: “expect delays at first, don’t expect everyone to know exactly how processes work until they’ve gone through them a few times, but don’t accept things like this as discouragement from your primary objective.”

If everything were easy and nothing ever went wrong, kind words might not matter much. When things aren’t going well, knowing how to fix or work around them goes a long way, and so does a kind word at the right time. We crafted an email to the security team together explaining exactly what was needed AND WHY, along with an indication of the authority and best/worst-case timelines we were operating under, and a sincere thank you.

Enterprise “DevOps” Patterns that Feel Like Anti-Patterns

In my current work, I experience a lot of different enterprise dynamics at many organizations around the world. The same themes, of course, come up often. A few dynamics I’ve seen in play when enterprises try to put new technology work in a pretty box (i.e. consolidate “DevOps engineers” into a centralized team) are:

  1. Enterprise DevOps/CloudOps/infra teams adopt the pattern of “planned work”, just like development teams, using sprints and work tracking to provide manageable throughput and consistency of support to other organizational ‘consumers’. This inherits other patterns like prioritization of work items, delivery dates, estimable progress, etc.
  2. Low- or no-context requests into these teams get rejected, because it’s slow or impossible to prioritize and plan around ambiguous work requirements.
  3. The amount of control and responsibility these teams have over the organization’s security and infrastructure systems is often considered “high risk”, so they’re subject to additional scrutiny come audit time.

That last point about auditing, particularly the psychological impact on ‘move fast’ engineers, cannot be overstated. When someone asks you to break protocol ‘just this one time’, it’s you who’s on the hook for explaining why you did it, rarely the product owner or director who applied the pressure.

Technical auditors worth anything more than spit will focus on processes instead of narrow activities, because combing through individual log entries doesn’t scale…but verifying that critical risk-mitigating processes are in place, and checking for examples of when the process is AND isn’t being followed…that’s far more doable in the few precious weeks that auditing firms are contracted to complete their work.

The More You Know, The Faster You Can Go (Safely)

An example of how understanding your enterprise organization’s culture improves the speed of your work comes from an email today between two colleagues at an F100+ retailer:

Can you confirm tentative dates when you are planning to conduct this test? Also will it take time to open firewall, post freeze incident tickets can be fast tracked?

– Performance Engineering at Major Retailer

This is a simple example of proper planning. Notice that the first ask is for concrete dates, an implicit expectation that others also need to have their shit together (in this particular case because they’re conducting a 100k synthetic user test against some system, not a trivial thing in the slightest). The knowledge that firewall rules have to be requested ahead of time, and that incident response should be warned that issues reported during the window may be caused by the simulation rather than real production traffic, comes from having experienced these things before. Understanding takes time.
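
One small, concrete practice behind that last point is tagging every synthetic request so that monitoring and incident response can tell it apart from real traffic at a glance. Here’s a rough Python sketch of the idea; the header name, tag value, and URL are hypothetical, and this is obviously not the 100k-user harness itself:

    import requests

    # Hypothetical marker agreed with the ops/incident-response team before the test window.
    SYNTHETIC_HEADERS = {
        "User-Agent": "perf-test-harness/1.0",
        "X-Synthetic-Test": "retail-load-test-2019-12",
    }

    def synthetic_request(url: str) -> int:
        """Issue one tagged request; dashboards and alert rules can filter on the header."""
        response = requests.get(url, headers=SYNTHETIC_HEADERS, timeout=10)
        return response.status_code

    if __name__ == "__main__":
        # Smoke-check a single endpoint before the real (much larger) test begins.
        print(synthetic_request("https://staging.example.com/health"))

The same tag can be echoed into access logs and APM traces, which is what makes the “this might be us, not real production traffic” conversation with incident response so much easier.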

Another software engineer friend of mine in the open-source space and I were discussing the Centrify thing today, and he asked: “why can’t they just set up and configure this server with temporary admin rights off to the side, then route appropriate ports and stuff to it once it’s working?” Many practitioners in the bowels of enterprises will recognize a few wild assumptions there. This is in no way a slight against my friend, but rather an example of how differently two very different engineering cultures think. More specifically, those who are used to being constrained and those who aren’t often have a harder time collaborating with each other, because their reasoning is predicated on very different past experiences. I see this one a lot.

DevOps Is an Approach to Engineering Culture, not a Team

This is my perspective after only five years of working out what “DevOps” means. I encourage everyone to find their own through a journey of curiosity, keyboard work, and many conversations.

There is no DevOps ‘manifesto’, and there never should be one. As Andrew Clay Shafer (@littleidea) once said, DevOps is about ‘optimizing for people’, not process or policy or one type of team. Instead of manifesto bullet points, there are some clear and common principles that have stood the test of time since 2008:

  • A flow of work, as one-way as possible
  • Observability and Transparency
  • Effective communication and collaboration
  • A high degree of automation
  • Feedback and experimentation for learning and mastery

Some of the principles above come from early work like The Phoenix Project, The Goal, and Continuous Delivery; others come from more formalized research such as ISO and IEEE working groups on DevOps that I’ve been a part of over the past 3 years.

I don’t tend to bring the “DevOps is not a team” bit up when talking with F100s primarily because:

  • it’s not terribly relevant to our immediate work and deliverables
  • enterprises that think in terms of cost centers always make up departments, because “we have to know whose budget to pay them from and who manages them”
  • now that DevOps is in vogue with various IT leaders, much like the manifestation of Agile everywhere, it’s often perceived as ‘yet another demand from management to do things differently’; after a restructuring, engineers often have enough open wounds that I don’t need to throw salt on them
  • if this is how people grok DevOps in their organization, there’s little I as an ‘outside’ actor can do to change it…except maybe a little side-conversation over beers here and there, which I try to do as much as appropriately possible with receptive folks

However, as an approach to engineering culture, DevOps expects people to work together, to “row in the same direction”, and to learn at every opportunity. As I stated at the beginning of this post, learning more about the people and processes around you, the constraints and interactions behind the behaviors we see, being curious, and having empathy…these things all still work in an enterprise context.

As the Buddha taught, the Middle Path gives vision, gives knowledge, and leads to calm, to insight, to enlightenment. There is always a ‘middle way’, and IMO it’s often the easiest path between extremes to get to the place where you want to be.

Put That in Your Pipeline and Smoke Test It!

I rarely bother to open my mouth as a speaker and step into a spotlight anymore. I’ve been mostly focused on observing, listening, and organizing tech communities in my local Boston area for the past few years. I just find that others’ voices are usually more worth the spotlight than my own.

A friend of mine asked if I would present at the local Ministry of Testing meetup, and since she did me a huge last-minute favor last month, I was more than happy to oblige.

“Testing Is Always Interesting Enough to Blog About”

– James Goin, DevOps Engineer; quote shared with permission from the Boston DevOps community, Dec 12th 2019

The state and craft of quality (not to mention performance) engineering has changed dramatically in the 5 years since I purposely committed to it. After spending most of my early tech career as a developer not writing testable software, the latter part of my career has been what some might consider penance for it.

I now work in the reliability engineering space. More specifically, I’m a Director of Customer Engineering at a company focused on the F500. As a performance nerd, I bring a statistical perspective to everything, including how I view people, process, and technology. In this demographic, “maturity” models are a complex curve across dozens of teams and a history of IT decisions, not something you can pull out of an Agilista’s sardine can or teach the way CMMI once thought it could.

A Presentation as Aperitif to Hive Minding

This presentation is a distillation of those experiences to date, framed as research and mostly inspired by wanting to learn what other practitioners like me think when faced with the challenge of translating the importance of holistic thinking about software quality to business leaders.

Slides: bit.ly/put-that-in-your-pipeline-2019

As I say at the beginning of this presentation, the goal is to spark collaboration around these concepts, sharing the puzzle pieces I am actively working to clarify so that the whole group can engage with each other in a constructive manner.

Hive Minding on What Can/Must/Shouldn’t Be Tested

The phrase ‘Hive Minding’ is (to my knowledge and Google results) a turn of phrase of my own invention. It’s one incremental iteration past my work and research in open spaces, emphasizing the notions of:

  • Collective, aggregated collaboration
  • Striking a balance between personal and real-time thinking
  • Mindful, structured interactions to optimize outcomes

At this meetup, I beta-launched the 1-2-4-All method from Liberating Structures that had seemed to work so well at a product strategy session in France last month. It balanced the opposing divergent and convergent modes of thinking, as discussed in The Creative Thinker’s Toolkit, so well that I was compelled to continue my active research into improving group facilitation.

Even after a few people had to leave the meetup early, there were still six groups of four. In France there were eight contributors, so I felt that this time I had a manageable but still scaled (roughly 3x) experiment in how hive minding works with larger groups.

My Personal Key Learnings

Before I share some of the community feedback (below), I should mention what I, as the organizer, observed during the meetup and in its outcomes afterward:

  • I need to use a bell or chime sound on my phone rather than having to interrupt people once the timers elapse for each of the 1-2-4 sessions; I hate stopping good conversation just because there’s a pre-agreed-to meeting structure. (A rough timer sketch follows this list.)
  • We were able to expose non-quality-engineer people (like SysOps and managers) to concepts new to them, such as negative testing and service virtualization; hopefully next time they’re hiring a QA manager, they’ll have new things to chat about
  • Many people confirmed some of the hypotheses in the presentation with real-world examples; you can’t test all the things, sometimes you can’t even test the thing because of non-technical limitations such as unavailability of systems, budget, or failure of management to understand the impact on organizational risk
  • I was able to give shout-outs to great work I’ve run across in my journeys, such as Resilient Coders of Boston and technical projects like Mockiato and OpenTelemetry
  • Quite a few people hung out afterward to express appreciation and interest in the sushi menu of ideas in the presentation. They are why I work so hard on my research areas.
  • I have to stop saying “you guys”. It slipped out twice and I was internally embarrassed that this is still a latent habit. At least one-third of the attendees were women in technology, and as much as I want to be an accomplice in improving things for underrepresented communities (including non-binary individuals), my words need work.
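
For the chime idea above, something as simple as a script that sleeps through each round and then rings the terminal bell would do. A minimal Python sketch, assuming the nominal 1-2-4-All round lengths (1, 2, 4, then 5 minutes for the whole group):

    import sys
    import time

    # Nominal 1-2-4-All rounds: solo reflection, pairs, foursomes, whole group.
    ROUNDS = [("Self-reflection", 1), ("Pairs", 2), ("Foursomes", 4), ("All together", 5)]

    def run_rounds() -> None:
        for name, minutes in ROUNDS:
            print(f"Starting round: {name} ({minutes} min)")
            time.sleep(minutes * 60)
            sys.stdout.write("\a")  # terminal bell; swap in a real chime if the room is loud
            sys.stdout.flush()
            print(f"Time's up: {name}")

    if __name__ == "__main__":
        run_rounds()

Nothing fancy, but it means the facilitator never has to talk over a good conversation just to keep the structure honest.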

A Few Pieces of Community Feedback, Anonymized

Consolidated outcomes of “Hive Minding” on the topics “What must be tested?” and “What can’t we test?”
  • What must we test?
    • Regressions, integrations, negative testing
    • Deliver what you promised
    • Requirements & customer use cases
    • Underlying dependency changes
    • Access to our systems
    • Monitoring mechanisms
    • Pipelines
    • Things that lots of devs use (security libraries)
    • Things with lots of dependencies
  • What can’t we test?
    • Processes that never finish (non-deterministic, infinite streams)
    • Brute-force enterprise cracking
    • Production systems
    • Production data (privacy concerns)
    • “All” versions of something, some equipment, types of data
    • Exhaustive testing
    • Randomness
    • High-fidelity combinations where dimensions multiply cases exponentially (see the sketch after this list)
    • Full system tests (takes too long for CI/CD)
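
To make that combinatorial point concrete, here’s a toy Python sketch (the dimensions and values are invented):

    from itertools import product

    # A hypothetical test matrix: six small configuration dimensions.
    browsers     = ["chrome", "firefox", "safari", "edge"]
    locales      = ["en_US", "en_GB", "de_DE", "fr_FR", "ja_JP"]
    plans        = ["free", "pro", "enterprise"]
    regions      = ["us-east", "us-west", "eu", "apac"]
    db_flavors   = ["postgres", "mysql"]
    feature_flag = [True, False]

    combos = list(product(browsers, locales, plans, regions, db_flavors, feature_flag))
    print(len(combos))  # 4 * 5 * 3 * 4 * 2 * 2 = 960 cases, from just six tiny dimensions

Add one more dimension with ten values and you’re near 10,000 cases; exhaustive coverage stops being a plan and starts being a wish.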

A few thoughts from folks in Slack (scrubbed for privacy)…

Anonymized community member:

Writing up my personal answers to @paulsbruce’s hivemind questions yesterday evening: What can/should you test?

  • well specified properties of your system, of the form if A then B. Only test those when your gut tells you they are complex enough to warrant a test, or as a preliminary step to fixing a bug, and making sure it won’t get hit again (see my answer to the next question).
  • your monitoring and alerting pipeline. You can never test up front for everything, things will break. The least you can do is test for end failure, and polish your observability to make debugging/fixing easier.

What can’t/shouldn’t you test?

  • my answer here is a bit controversial, and a bit tongue in cheek (I’m the person writing more than 80% of the tests at my current job). You should test the least amount possible. In software, writing tests is very expensive. Tests add code, sometimes very complex code that is hard to read and hard to test in itself. This means it will quickly rot, or worse, it will prevent/keep people from modifying the software architecture or make bold moves because tests will break/become obsolete. For example, assume you tested every single detail of your current DB schema and DB behaviour. If changing the DB schema or moving to a new storage backend is “the right move” from a product standpoint, all your tests become obsolete.
  • tests will often add a lot of complexity to your codebase, only for the purpose of testing. You will have to add mocking at every level. You will have to set up CICD jobs. The cost of this depends on what kind of software you write, the problem is well solved for webby/microservicy/cloudy things, much less so for custom software / desktop software / web frontends / software with complex concurrency. For example, in my current job (highly concurrent embedded firmware, everything is mocked: every state machine, every hardware component, every communication bus is mocked so that individual state machines can be tested against. This means that if you add a new hardware sensor, you end up writing over 200 lines of boilerplate just to satisfy the mocking requirements. This can be alleviated with scaffolding tools, some clever programming language features, but there is no denying the added complexity)

To add to this, I think this is especially a problem for junior developers / developers who don’t have enough experience with large scale codebases. They are either starry-eyed about TDD and “best practices” and “functional programming will save the world”, and so don’t exercise the right judgment on where to test and where not to test. So you end up with huge test suites that basically test that calling database.get_customer('john smith') == customer('john smith') which is pretty useless. much more useful would be logging that result.name != requested_name in the function get_customer

the first is going to be run in a mocked environment either on the dev machine, on the builder, or in a staging environment, and might not catch a race condition between writers and readers that happens under load every blue moon. the logging will, and you can alert on it. furthermore, if the bug is caught as a user bug “i tried to update the customer’s name, but i got the wrong result”, a developer can get the trace, and immediately figure out which function failed
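
To make the contrast they’re describing concrete, here’s a rough Python sketch of the two approaches; the names echo their example, and the logging call is illustrative rather than anyone’s actual codebase:

    import logging
    from unittest import mock

    logger = logging.getLogger(__name__)

    # Approach 1: a mocked unit test that mostly proves the mock behaves as configured.
    def test_get_customer_returns_requested_customer():
        db = mock.Mock()
        db.get_customer.return_value = {"name": "john smith"}
        assert db.get_customer("john smith")["name"] == "john smith"

    # Approach 2: a runtime check inside the real function that fires in production,
    # under real load and real concurrency, and that monitoring can alert on.
    def get_customer(db, requested_name):
        result = db.get_customer(requested_name)
        if result["name"] != requested_name:
            logger.error("get_customer mismatch: requested=%r got=%r",
                         requested_name, result["name"])
        return result

The first only ever runs in a mocked environment; the second produces a trace a developer can follow when the “I got the wrong customer” bug report comes in.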

Then someone else chimed in:

It sounds like you’re pitting your anecdotal experience against the entire history of the industry and all the data showing that bugs are cheaper and faster to fix when found “to the left” i.e. before production. The idea that a developer can get a trace and immediately figure out which function failed is a starry-eyed fantasy when it comes to most software and systems in production in the world today.

The original contributor then continues with:

yeah, this is personal experience, and we don’t just yeet stuff into production. as far data-driven software engineering, I find mostly scientific studies to be of dubious value, meaning we’re all back to personal experience. as for trace driven debugging, it’s working quite well at my workplace, I can go much more into details about how these things work (I had a webinar with qt up online but I think they took it down)

as said, it’s a bit tongue in cheek, but if there’s no strong incentive to test something, I would say, don’t. the one thing i do is keep tabs on which bugs we did fix later on, which parts of the sourcecode were affected, who fixed them, and draw conclusions from that

Sailboat Retrospective

Using the concept of a sailboat retrospective, here’s what propelled us, what slowed us, and what to watch out for:

Things that propel us:

  • Many people said they really liked the collaborative nature of hive minding and would love to do this again because it got people to share learnings and ideas
  • Reading the crowd in real-time, I could see that people were connecting with the ideas and message; there were no “bad actors” or trolls in the crowd
  • Space, food, invites and social media logistics were handled well (not on me)

Things that slowed us:

  • My presentation was 50+ mins, way too long for a meetup IMO.

    To improve this, I need to:
    • Break my content and narratives up into smaller chunks, ones that I can actually stick to a 20-minute timeframe on. If people want to hear more, I can chain on topics.
    • Recruit a timekeeper from the audience, someone who provides accountability
    • Don’t get into minutiae and examples that bulk out my message, unless asked
  • Audio/video recording and last-minute mic difficulties kind of throw speakers off

    To fix this? Maybe bring my own recording and A/V gear next time.
  • Having to verbally interrupt people at the agreed-upon time-breaks in 1-2-4-All seems counter to the collaborative spirit.

    To improve this, possibly use a Pavlovian sound on my phone (ding, chime, etc.)

Things to watch out for:

  • I used the all-too-common gender-binary phrase “you guys” twice. Imagine the rooms where that’s somehow fine to say, yet saying “hey ladies” to a mixed crowd would be considered pejorative by many cisgender men. Everything can be improved, and this is certainly one thing I plan to be very conscious of.
  • Though it’s important to have people write things down themselves, not everyone’s handwriting can be read back by others later, certainly not without high-fidelity photos of the post-its.

    To improve this, maybe stand with the final group representatives and if needed re-write the key concepts they verbalize to the “all” group on the whiteboard next to their post-it.

More reading: