Beyond DevOps: The ‘Next’ Management Theory

In a conversation today with Ken Mugrage (organizer of DevOps Days Seattle), the scope of the term ‘DevOps’ came up often enough that we purposely double-clicked into it.

‘DevOps’ Is (and Should Be) Limited In Scope

Ken’s view is that the primary context for DevOps is in terms of culture, as opposed to processes, practices, or tools. To me, that’s fine, but there’s so much not accounted for that I feel I have to generalize a bit to get to where I’m comfortable parsing the hydra of topics in the space.

Like M-theory, which attempts to draw relationships between how fundamental particles interact with each other, I think that DevOps is just a single view of a particular facet of the technology-management gem.

DevOps is an implementation of a more general theory, a ‘next’ mindset over managing the hydra. DevOps addresses how developers and operations can more cohesively function together. Injecting all-the-things is counter to the scope of DevOps.

Zen-in: A New Management Theory for Everyone

Zen-in (ぜんいん[全員]) is a Japanese term that means ‘everyone in the group’. It implies a boundary, but challenges you to think about who is inside that boundary. Is it you? Is it not them? Why not? Who decides? Why?

By ‘management’ theory, I don’t mean another ‘management silo’. I literally mean the need to manage complexity: personal, technological, and organizational. Abstracting up a bit, the general principles of this theory are:

  • Convergence (groups come together to accomplish a shared goal)
  • Inclusion (all parties have a voice, acceptance of constraints)
  • Focus (alignment to shared goal, strategies, and tactics)
  • Improvement (learning loops, resultant actions, measurement, skills, acceleration, workforce refactoring, effective recruiting)
  • Actualization (self-management, cultural equilibrium, personal fulfillment)

I’ll be writing more on this moving forward as I explore each of these concepts, but for now I think I’ve found a basic framework that covers a lot of territory.

I Need Your Help to Evolve This Conversation

True to Zen-in, if you’re reading this, you’re already in the ‘group’. Your opinions, questions, and perspectives are necessary to iterate over how these concepts fit together.

Share thoughts in the comments section below! Or ping me on Twitter (@paulsbruce) or LinkedIn.

 

How to Be a Good DevOps Vendor

This article is intended for everyone involved in buying or selling tech, not just tooling vendors. The goal is to paint a picture of what an efficient supply and acquisition process in DevOps looks like. Most of this article is phrased from an ‘us’ (acquirer) to ‘you’ (supplier) perspective, but it’s written out of admiration for all involved.

Developers, Site-Reliability Engineers, Testers, Managers…please comment and add to this conversation because we all win when we all win.

MP3: https://soundcloud.com/paulsbruce/how-to-be-a-good-devops-vendor

I’ll frame my suggestions across a simplified four-stage customer journey:

  • Try: make it easy for me to try your thing out
  • Buy: help me buy what I need without lock-in
  • Integrate: integrate easily into my landscape
  • Improve: improve what you do to help me with what I do

Make It Easy for Me to Try Your Thing Out

(Product / Sales)
Make the trial process as frictionless as possible. This doesn’t mean hands off, but rather a progressive approach that gives each of us the value we need iteratively to get to the next step.

(Sales / Marketing)
If you want to know what we’re doing, do your own research and come prepared to listen to us about our immediate challenge. Know how that challenge maps to your tool, or find someone who does, fast. If you don’t feel like you know enough to do this, roll up your sleeves and engage your colleagues. Lunch-n-learns with product/sales/marketing really help to make you more effective.

(Sales)
I know you want to qualify us as an opportunity for your sales pipeline, but we have a few checkboxes in our heads before we’re interested in helping you with your sales goals. Don’t ask me to ‘go steady’ (i.e. regular emails or phone calls) before we’ve had our first date (i.e. I’ve validated that your solution meets basic requirements).

(Product / Marketing)
Your “download” process should really happen from a command line, not from a 6-step website download process (that’s so 90s), and don’t bother us with license keys. Handle the activation process for us. Just let us get into code (or whatever) and fumble around a little first…because we’re likely engineers and we like to take things apart to understand them. So long as your process isn’t kludgy, we’ll get to a point where we have some really relevant questions.

(Marketing / Sales)
And we’ll have plenty of questions. Make it absurdly easy to reach out to you. Don’t be afraid if you can’t answer them, and don’t try to preach value if we’re simply looking for a technical answer. Build relationships internally so you can get a technical question answered quickly. Social and community aren’t just marketing outbound channels; they’re inbound too. We’ll use them if we see them and when we need them.

(Marketing / Community / Relations)
Usage of social channels varies per person and role, so have your ears open on many of them: GitHub, Stack Overflow, Twitter, (not Facebook pls), LinkedIn, your own community site…and make sure your marketing+sales funnel is optimized to accept me in the ‘right’ way (i.e. don’t put me in a marketing list).

Don’t use bots. Just don’t. Be people, like me.

(Sales / BizDev)
As I reach out, ask me about me. If I’m a dev, ask what I’m building. If I’m a release engineer, ask how you can help support my team. If I’m a manager, ask how you can help my team deliver what they need to deliver, faster. Have a 10-second pitch, but start the conversation right in order to earn trust so you can ask your questions.

 

Help Me Buy What I Need Without Lock-in

(Sales / Customer Success)
Even after we’re prepared to sign a check, we’re still dating. Tools that provide real value will spread and grow in usage over time. Let us buy what we need, do a PoC (which we will likely need some initial help with), then check in with us occasionally (customer success) to keep the account on the right train tracks.

(Sales / Marketing)
Help us make the case for your tool. Have informational materials, case studies, competitive sheets, and cost/value breakdowns that we may need to justify an expenditure that exceeds our discretionary budget constraints. Help us align our case depending on whether it will be coming out of a CapEx or OpEx line. Help us make its value visible and promote what an awesome job we did picking the right solution for everyone it benefits. Don’t wait for someone to hand you what you need; try things and share your results.

(Product)
Pick a pricing model that meets both your goals and mine. Yes, that’s complicated. That’s why it’s a job for the Product Team. As professional facilitators and business drivers, seek input from everyone: sales, marketing, customers (!!!), partners, and friends of the family (i.e. trusted advisors, brand advocates, professional services). Don’t be greedy; be realistic. Have backup plans at the ready, and communicate pricing changes proactively.

(Sales)
Depending on your pricing model, really help us pick the right one for us, not the best one for you. Though this sounds counter-intuitive to your bottom line, doing this well will increase our trust in you. When I trust you, not only will I likely come back to you for more in the future, we’ll also excitedly share this with colleagues and our networks. Some of the best public champions for a technology are those that use it and trust the team behind it.

Integrate Easily Into My Landscape

(Product)
Let us see you as code. If your solution is so proprietary that we can’t see underlying code (like layouts, test structure, project file format), re-think your approach because if it’s not code, it probably won’t fit easily into our delivery pipeline. Everything is code now…the product, the infrastructure configuration, the test artifacts, the deployment semantics, the monitoring and alerting…if you’re not in there, forget it.

(Product)
Integrate with others. If you don’t integrate into our ecosystem (i.e. plugins to other related parts of our lifecycle), you’re considered a silo and we hate silos. Workflows have to cross platform boundaries in our world. We already bought other solutions. Don’t be an island, be a launchpad. Be an information radiator.

(Product / Sales / Marketing)
Actually show how your product works in our context…which means you need to understand how people should and do use your product. Don’t just rely on screenshots and product-focused demos. Demonstrate how your JIRA integration works, or how your tool is used in a continuous integration flow in Jenkins or Circle CI, or how your metrics are fed into Google Analytics or Datadog or whatever dashboarding or analytics engine I use. The point is (as my new friend Lauri puts it)…”show me, don’t tell me”.

(Sales / Marketing)
This goes for your decks, your videos, your articles, your product pages, your demos, your booth conversations, and even your pitch. One of the best technical pitches I ever saw wasn’t a pitch at all…it was a technical demo from the creator of Swagger, Tony Tam at APIstrat Austin 2015. He just showed how SwaggerHub worked, and everyone was like ‘oh, okay, sign me up’.

Truth be told, I only attended to see what trouble I could cause.  Turns out he showed a tool called Swagger-Inflector and I was captivated.
– Darrel Miller on Bizcoder

(Sales / Product)
If you can’t understand something that the product team is saying, challenge them on it and ask them for help understanding how and when to sell the thing. Product: sales enablement is part of your portfolio, and though someone else might execute it, it’s your job to make sure your idea translates into an effective sales context (overlap/collaborate with product marketing a lot).

(Product / Customer Support)
As part of on-boarding, have the best documentation on the planet. This includes technical documentation (typically delivered as part of the development lifecycle) that you regularly test to make sure it’s accurate. Also provide how-to articles that are down to earth. Show me the ‘happy path’ so I can use it as a reference to know where I’ve gone wrong in my integration.

(Product / Developers / Customer Support)
Also provide validation artifacts, like tools or tests that make sure I’ve integrated your product into my landscape correctly. Don’t solely rely on professional services to do this unless most other customers have told you this is necessary, which indicates you need to make it easier anyway.

(Customer Support / Customer Success / Community / Relations)
If I get stuck, ask me why and how I’m integrating your thing into my stuff to get some broader context on my ultimate goal. Then we can row in that direction together. Since I know you can’t commit unlimited resources to helping customers, build a community that helps each other and reward contributors when they help each other. A customer gift basket or Amazon gift card to the top external community facilitators goes a long way toward turning the community into a second-level support system that handles occasional support overflows.

Improve What You Do to Help Me With What I Do

(Product / Development / Customer Support)
Fix things that are flat-out broken. If you can’t right now, be transparent and diplomatic about how your process works and what we can do as a work-around in the meantime, and receive our frustration well. If we want to contribute our own solution or patch, show gratitude, not just acknowledgement; otherwise we won’t go the extra mile again. And when we contribute, we are your champions.

(Product)
Talk to us regularly about what would work better for us, how we’re evolving our process, and how your thing would need to change to be more valuable in our ever-evolving landscape. Don’t promise anything, but also don’t hide ideas. Selectively share items from your roadmap and ask for our candid opinion. Maybe even hold regional user groups or ask us to come speak to your internal teams as outside feedback from my point of view as a customer.

(Product)
Get out to the conferences; be in front of people and listen to their reactions. Do something relevant yourself and don’t be just another product-headed megalomaniac. Be part of the community; don’t just expect to use it when you want to say something. Host things (it may cost money), be a volunteer occasionally, and definitely make people feel heard.

(Everyone)
Be careful that your people-to-people engagements don’t suffer from technical impedance mismatch. Sales and marketing can be at booths, but should have a direct line to someone who can answer really technical questions as they arise. We engineers can smell marketing/sales from a mile away (usually because they smell showered and professional), but it’s important for us to have our questions answered and for the conversation to feel friendly. This is what’s great about having your Developer Relations people there…we can nerd out and hit it off great, and I come away with next steps that you (marketing / sales) can follow up on. Make sure you have a trial I can start in on immediately, and use every conversation (and conference) as a learning opportunity.

(Product)
Build the shit out of your partner ecosystem so it’s easier for me to get up and running with integrations. Think hard before you put your new shiny innovative feature in front of a practical thing like a technical integration I and many others have been asking for.

(Development / Community / Marketing / Relations)
If there is documentation with code in it and you need API keys or something, inject them into the source code for me when I’m logged in to your site (like the SauceLabs Appium tutorials). I will probably copy and paste, so be very careful about the code you put out there, because I will judge you for it when it doesn’t work.

(Marketing / Product)
When you do push new features, make sure that you communicate to me about things I am sure to care about. This means you’ll have to keep track of what I indicate I care about (via tracking my views on articles, white paper downloads, sales conversations, support issues, and OPT-IN newsletter topics). I’m okay with this if it really benefits me, but if I get blasted one too many times, I’ll disengage/unsubscribe entirely.

Summary: Help Me Get to the Right Place Faster…Always

None of us have enough time for all the things. If you want to become a new thing on my plate, help me see how you can take some things off of my plate first (i.e. gain time back). Be quick to the point, courteous, and invested in my success. Minimize transaction (time) cost in every engagement.

(Sales, et al: “Let’s Get Real or Let’s Not Play” is a great read on how to do this.)

As often as appropriate, ask me what’s on my horizon and how best we can get there together. Even if I’m heads-down in code, I’ll remember that you were well-intentioned and won’t write you off for good.

NEXT STEPS: share your opinions, thoughts, and suggestions in the comments section below! Or ping me on Twitter (@paulsbruce) or LinkedIn.


Stop Using the ‘Staging’ Server – DevOps Days Boston

Chloe Condon presented on how containers and IaC (infrastructure as code) can help us skip over the ‘staging server’ part of traditional deployment strategies. This article is a loose transcript of talking points from her talk at DevOps Days Boston 2017.

What’s Wrong with a Staging Environment?

Feedback from a traditional staging environment is too slow. The only thing the reviewer knows is whether unit tests passed; the rest of the tests run after that. “Staging” is usually reserved for integration, functional, UI, and performance testing (i.e. complete feedback). Too little, too late.

We’re all too familiar with the question “who broke staging?”. The fragility and centrality of this staging model creates bottlenecks. Also, the very first time something is brought into the pipeline usually happens in staging, and that’s when ‘broken’ occurs.

There’s lots of “friction” between environments. Dev/test/staging are often not equivalent and are configured differently, causing deployment between environments to be a hassle. Flows across these environments are time-consuming (environment variables and files go missing).

Code changes are being tested more extensively in staging, which means there’s little room for timely feedback.

Ephemeral Environments

The great thing is that now we have containers. We can run every build, package it in a container, then run tests on it in the same pipeline. Microservices are well-suited to this model, but distributed stacks (like a web app, database, and supporting APIs) benefit from it too.

Additionally, most stages of testing can be containerized. Leaving performance and scalability off for a moment, that enables us to run integration, functional, and security testing as part of a complete containerized package.

The problem still remains: we have the rule that staging has to be as close to prod as possible. This might serve some of those tests (like performance and security), but it is largely suboptimal for unit, integration, and functional tests. Performance tests could also be run earlier to provide us a better heads-up about degradations that creep in over time. In practice, late-stage environments don’t match reality, and this causes friction.

So let’s reconsider the premise that all of our non-unit testing has to be run in a shared environment that bottlenecks us. This helps us shift feedback to the left. (Chloe says to insert Beyonce clip here.)

Containers = Consistency & Composition & Completeness

So now the container we’re handing off is much more complete: it includes a more complete set of self-testing capabilities that we can ask our pipeline to run for us.

You can hand off containers to your customers (usually internal but maybe even external) and with composition, you have confidence that the bits they’re running are the same as what you tested and what you want them to have.

Infrastructure as Code

Teams should define what code is part of the process. When people are able to spin things up automatically on their own, this streamlines an important part of their process. Visualizations help a lot, which is why Codefresh and other platforms have visual controls over the package and deploy process.

Infrastructure-as-Code (IaC) includes Dockerfiles, but also deployment scripts. If it’s code, treat it like it’s important because otherwise it’s outside the flow of delivery.

Paul’s take: IaC also includes a whole bunch of other stuff too. For example:

  • Composition scripts (like Docker compose, Kubernetes scripts)
  • Secrets management configuration
  • Network configuration
  • Database configuration (might include data)
  • Tests and test data
  • Feature flag configuration
  • Monitoring configuration & scripts

Implementing IaC requires a few things:

  1. Your team agrees on and has an in-depth knowledge of how to push healthy code artifacts into the pipeline. No one is an island; others’ contributions need to be readable and easily debuggable.
  2. A resilient process (i.e. pipeline), including dynamic build/package/test semantics, enables contributors to focus on the ‘push’ and the feedback rather than the semantics.
  3. Information radiators along the process must deliver feedback as granularly as possible: individual contributor first, then channel, then team. ChatOps bots give you immediate feedback about breakage as soon as it occurs.

A complete IaC artifact list will require collaboration between multiple contributors, which facilitates communication. Just make sure that empathy and positive reinforcement is part of your management strategy.

Questions from the Audience:

Q: “How do you describe the state of the code in PRs?”

Chloe: “Badges in the repo, some conventions, success flags on Codefresh.”

Q: “How often do people actually use this for pre-stage vs. just going to prod?”

Chloe: “For lots of people, they maintain separate branches for multiple environments. Then you can introduce new versions dynamically.”

Q: “In more complex systems, is there a composition management layer?”

Chloe: “This is the beauty of the compose files. When you treat them like code, this makes management a lot easier.”


Iterative Security – DevOps Days Boston 2017

Tom McLaughlin presented on iterative security: incorporating security into DevOps cycles through early detection and prevention of vulnerabilities. His slides are here. This is a loose transcript of talking points from his talk at DevOps Days Boston 2017.

Breaches in Practice vs. Theory

Tom made the point that breaches often occur in areas that aren’t covered by development or security teams because vulnerabilities escape due to a lack of objective and continuous risk assessment.

Code still has passwords and tokens in it. There’s lots of assumed knowledge going from dev to prod. Account access, password policies, and patching are usually handled by someone else. This leads to “good luck, it’s up to you” syndrome.

There’s also security paralysis. When we don’t think we know how to do something, we just won’t. And we’re rewarded for accomplishing things. So long as disaster doesn’t strike, we get by.

Why Do We Suffer from Security Breaches?

Mostly, we get distracted: 0-day exploits, crypto weaknesses, hash collisions. We get distracted by logos and discussion threads, but not by patching the system. We get caught up in all of this stuff instead of actually doing what improves security.

Think about all the publicly exposed MongoDB and Elasticsearch instances you’ve seen…being proactive isn’t always hard, but it is rarely incentivized well.

We don’t do a good job explaining how to get from where you are to where you should be. We also don’t always practice critical thinking. What is your goal? What is your posture about security: proactive or reactive?

We also don’t always have a wealth of layered instructional content. There’s a lot of information at the extremes (101 and advanced tutorials), but most of us are in the middle.

Solve the Problem Like You’re At Work

So then let’s develop a threat model together, as an example. Let’s start by being realistic: what kind of org are you, and what about your product matters? Align with your company on risk management policies and processes.

Prioritize. Use DREAD (or STRIDE) for rating threats and modeling risk.
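As a quick sketch of how a DREAD pass works (hypothetical threat; each category rated 1–10 and averaged):

    Threat: API token hard-coded in a public repo
    Damage: 8, Reproducibility: 10, Exploitability: 7,
    Affected users: 6, Discoverability: 9
    Risk = (8 + 10 + 7 + 6 + 9) / 5 = 8.0  →  high priority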

Also take care of the easy stuff first: USB sticks over ‘man in the ceiling’ attacks.

Do you still use a service after it’s been breached? I leave that up to you.

Decompose the system. Map out your architecture and understand the systems. Look at the perimeters: how are credentials proliferated? Understand your data pipeline: where is your really valuable data stored?

Take time to consider things like exposed network ports, unpatched containers, and weak secrets…there are tools for this. These tools can be found in the later slides here.

Putting a Response to Security Threats into Action

Two words: impose constraints. To find which constraints work for you, start with a simple discovery process that includes:

  • Time: how long will it take to solve? Timebox solutions; make defensible use of existing time.
  • Complexity: how hard is it? Ask deep questions, and iterate over which ones help.
  • Risk: how risky are the problem and the solution?

Secrets management is a good first step. Tom pretty much pwns this space, and I encourage you to seriously check out his extensive work on the topic here.

In terms of tactical actions you can take today, Tom mentioned these few, but of course there are more:

  • At the code level, start with at least something like git-crypt.
    Ask yourself: what should be thrown out before it goes anywhere else?
  • In configuration management scripts:
    Developing a master re-key strategy is a great exercise to flesh this out.
  • For storage, consider a tool like sneaker for S3.
    It really makes you ask questions about who manages buckets and how.

Summary

We need to be better at security, continuous or otherwise. We need to act. There are simple things you can do, but they need to be aligned to your team/organization risk strategy. And make it easy for others to do the right thing, so that it’s far more likely to happen without imposing huge effort cost.

Tom’s a great speaker, engaging and fun to listen to. He is also a huge community contributor and even runs a distributed DevRel (developer relations) slack group. Tom is currently working on the CloudZero team.


Enterprise Wild West – DevOps Days Boston 2017

Rob Cummings’ keynote at DevOps Days Boston 2017 explored how Simon Wardley’s Pioneers, Settlers, and Town Planners model applies in enterprise engineering and large organizations. The general idea is:

  • Pioneers: explorers of new ideas, create prototypes, prove the need
  • Settlers: stealers of new ideas, move prototypes to MVP, prove feasibility
  • Town Planners: manufacturers, take MVPs to industrialized products, prove scalability

(See Simon Wardley’s “Bits or pieces?” post: On Pioneers, Settlers, Town Planners and Theft.)

The Problem: Overly Simplistic Approaches

Bi-modal IT splits the org into Mode 1 (systems of record) and Mode 2 (systems of innovation). Mode 1 has less line of sight to customers and is governed by enterprise architecture and governance. Mode 2 often runs into Mode 1 when …. The problem is that often, there’s no flow between Mode 1 and Mode 2. Bi-modal is overly simplistic.

The book “Thinking in Systems” is a great place to start your journey beyond these modes. Transition states and feedback loops exist already in your org, but realizing where they are and how they could be improved takes practice and group engagement.

Paul’s advice: Systems thinking is a much broader topic; if you haven’t actually studied it, it would serve you well to listen to The Fifth Discipline by Peter Senge. Preparing for my presentation in April on IoT testing made me realize that systems thinking is a necessary mental tool moving forward.

Everyone Innovates…Sometimes.

Pioneers live outside standards, fail often, and don’t necessarily make decisions based on metrics. They find the new horizon. That’s how they innovate: they bring ideas from the outside in.

Settlers make prototypes real, build trust in the org, and kick off ecosystems around the adoption of ideas, but sometimes suffer from adoption problems. They bring ideas further into the org.

Town Planners focus on ops efficiency, build services and platforms that Pioneers rely on for future innovations. They’re metrics heavy and bring reality to the operation of ideas.

Fostering Friendly Theft

The Wild West is a “theft-based pull model”. There are no mandates. Theft occurs from right to left (pioneers are on the left); re-use flows from left to right. This is a good thing. Everyone is excellent, and everyone should participate with empathy. Foster feedback loops and maintain a pull culture.

The Wild West model exists within a team, not as separate departments. Again, for DevOps we’re not talking about traditional cost centers and departments; we’ve got mixed teams that are aligned on a shared goal, with their own perspectives on how to do things best, together.

Paul’s Take: DevOps Requires Buy-in from Everyone

For DevOps to work, a team needs to understand and adapt to their organizational ecosystem. So while the micro-mechanics of the Wild West help us pull new ideas in on a continual basis, there has to be an understanding that extends across the whole org.

Many conversations on day 1 of DevOps Days Boston 2017 expressed the need for “buy-in from the top”, but effective DevOps also requires buy-in from everyone. Teams need to align the virtues of DevOps to how they can positively impact the organization. It does no good for an SMB VP of Engineering to apply DevOps if the purpose of doing so hasn’t been clearly articulated in terms that other dependencies (like developers, operations, sales, marketing, finance, and support) understand. But when you do so, it’s much easier to carry people with you in planning and execution.

DevOps is Organizational, Operational, and Orthogonal. Applying it in isolation only decreases the value it brings to us.

Scaling to the Enterprise

Rob shared an anonymized anecdote from a large company where the Wild West model was adopted:

A small group of pioneers realized “we need to fix this, can’t meet customer needs”. They knew how to do it and got CIO sponsorship. The team got to MVP status with code. Unfortunately, the Wild West model was not immediately adopted beyond that initial release.

“We were trying to push the model onto the team.” Even though everything done up to that point focused on ease-for-enterprise (weekly demos, code was open sourced, process transparency), adoption took time.

Eventually, another team took the ideas and model, shipped their thing to production, then other teams followed. “Now we have a ‘proliferation problem’…people started customizing tools and artifacts.” Teams often stuck with some favorite tools, and in DevOps culture, tailoring is huge.

But not everyone wants to build their own house. For example, code pipelines…yuck. So Planners came in and built a commodity pipeline platform. This requires talent: people who have skill, can scale, and understand operational efficiency.

Summary

Here are a few anti-patterns to avoid if you want to reduce friction and increase your flow:

  1. Using enterprise architecture to prevent waste and force adoption.
    Don’t use it as a gate to get to production!
  2. Relying on innovation labs or a CoE for your pioneers.
    Teams outside your org toss things in that often don’t work inside the org. Be super-public so settlers are likely to steal. Change “Center of Excellence” to “Center of Practice” and be inclusive; then everyone can be excellent.
  3. Forgetting that your org requires a systems-thinking approach.
    Create flows, not barriers. Each role is filled with excellent people.

 


No Root Cause in Emergent Behavior – DevOps Days Boston 2017

At DevOpsDays Boston 2017, Matthew Boeckman presented on how emergent behavior in complex systems requires us to re-think our root cause analysis paradigms. His slides are here. I also had a great time talking meta with Matthew after hours, but that’s for a later post.

Traditional RCA in Complex Systems

Unfortunately, traditional RCA focuses on what and who. Despite its roots stemming from NASA, in the software world RCA is misaligned, finding only one channel of causality, as the classic fishbone diagram illustrates.

This might be okay for simple systems (i.e. 3-tier web/app/data servers), but there’s much more to a modern system: networking, hosting, and operating environments. Beyond that, users access systems in both benign and malevolent ways.

Waterfall encouraged us to minimize complexity by locking down state (i.e. promoting a “don’t change” mentality). Waterfall (think 12-month cycles) encourages us to think that change is the developer’s fault. And there were a lot of constraints in the ’80s and ’90s; most of them are no longer true.

Root cause is fine for static models, but it breaks down when it comes to “lots of boxes”: cloud-based, dynamic, and distributed systems. It’s very hard to trace the source of problems in this new world. Change vectors (a/b testing, reconfigurations, migrations, feature flags) abound; in fact, they’re encouraged.

Our systems are far more complex than they were 20 years ago. They involve the whole stack, the whole team, and the whole organization.

Paul’s Take: Occam’s Anti-Razor

A heuristic we often employ is Occam’s razor: in general, the simplest answer is often the right one. Coupled with confirmation bias, we (humans) often look for a single causal root to the problems we see. Then we build processes that inherit our bias. But what if operational failures occur because of multiple causes, chain reactions that exceed the typical ‘5 whys’ RCA model?

As quickly as the concept of the razor was introduced, Chatton, a contemporary, countered the idea with: “If three things are not enough to verify an affirmative proposition about things, a fourth must be added, and so on.” Similarly, many ascribe a balance of simplicity and complexity in problem solving to the quote “Make things as simple as possible, but no simpler,” attributed to Einstein.

The idea is right fit…right fit of simplicity/complexity to the problem at hand. With complex systems, we can’t always assume that the simple answer is the most useful one in future scenarios.

Our Systems Aren’t Trees, They’re Forests

Emergence is about collective behaviors, systems we connect and integrate over time, and not simply the aggregate of behaviors emitted by individual subcomponents and nodes.

We need to develop, test, deploy, monitor, and resolve issues in them like the complex, semi-organic systems they are: part of an ecosystem of services and fallible subsystems. We can no longer afford to ignore better paradigms for dealing with them.

Enter Systems Thinking. Understanding why things emerge takes more than an ops dashboard and intuition. Sometimes analysis on complex problems requires a multi-variate perspective.

Paul’s advice: Systems thinking is a much broader topic; if you haven’t actually studied it, it would serve you well to listen to The Fifth Discipline by Peter Senge. Preparing for my presentation in April on IoT testing made me realize that systems thinking is a necessary mental tool moving forward.

Systems thinking helps us to identify activities, interactions, and ultimately the change vectors contributing to emergent behaviors. Understanding which dials and levers are involved in the problem enables later actions to resolve the issue. This feeling of being at home in the problem space is also similar to “cynefin”, a Welsh term that means:

“a place to live and belong. where the nature of what’s around you feels right and welcoming”

Not at all coincidentally, the Cynefin framework as applied to emergent behavior helps us make quick decisions during and about incident management situations.

Staying Ahead of Emergent Behavior

The fact is that most workforces, small or large, are a revolving door. So is your current system state after multiple releases and infrastructure migrations. There be monsters. Software is dynamic, and so should be your product discovery process, your learning loops, your incident management model, and so on.

The Cynefin framework gives us a quadrant visual (obvious, complicated, complex, chaotic) to show that various kinds of issues need to be addressed differently.

The fact is, each of these quadrants assumes two things:

  1. The issue occurred already, so you need to fix it and learn from it
  2. Information needs to be radiated (sensed) to make “sense” of it

In my after-hours chat with Matthew, we dove into the issue of metrics. Measuring issue resolution goes beyond mean-time-to-resolution (MTTR). Issues that are flagged with *how* they were resolved, using Cynefin categories, now present an opportunity for improvement.

Paul’s Take: could this be a JIRA custom field? Just thinking out loud.

Tracking the delta on a specific issue (what approach someone thought should be used at first vs. what would have been better after the fact) is a way to measure success and improvement on a spot basis.

Then, over time, aggregates can be used to show team and organizational reflexiveness to dynamic, emergent behavior. Though neither of us has customer anecdotes or proof-of-concept clients, I challenge you, the reader, to try it out for a few sprints or whatever intervals you use.

Summary

We need to embrace emergent behavior and learn how to approach incidents better using systems thinking and frameworks like Cynefin. Unlike traditional RCA, we’ll need to step out of our comfort zones, see what works, and learn from our mistakes.

Matthew is a Denver, Colorado native, and has spoken at other conferences like Gluecon (wicked!). If you have questions, ping him (and me) up on Twitter and let’s get a dialog going.

DevOps is Organizational, Operational, and Orthogonal

Some people seem to think that DevOps is a buzzword. It is not. At all.

As part of my research for integrating concepts of risk, quality assurance, and continuous testing into the IEEE 2675 working group (DevOps standard), I am realizing that there is no single articulation of DevOps that fits all contexts. However, in the spirit of DevOps, I’ll continue to iterate past this issue and explore aspects of the paradigm that provide value to the people I meet and the conversations we have. Here are three I’ve been pondering:

DevOps Is an Organizational Paradigm

DevOps is about breaking apart established paradigms: structures that worked for the prior generation of problems, management methodologies that are no longer the optimal solution for tomorrow’s problems today. DevOps is challenging institutional values that don’t actually lead to value. DevOps is helping us to take ownership over our success.

We are learning for the first time, every time, and deliberately discovering what we should know as we build the future.

Collaboration, contribution, sharing, learning, improvement, alignment, focus, value. These are words that describe our homegrown methodology, one whose aim is to meet the pace of innovation better than agility alone. Self-management, self-organization, self-improvement. Shared understanding, shared goals, shared vision. Many experiments, many failures, and many wins.

It is fine if a single team wants to “try out DevOps”, but unless the organization is prepared to support and change to value the positive outcomes of that team, initiatives won’t go very far. In this way, it is a relational paradigm that applies between individuals, between teams, and between organizations too.

If you want to go fast, go alone. If you want to go far, go together.

There’s juice to that statement. Fast and far are relative to what needs to get accomplished (see this link for quote etymology).

DevOps Is an Operational Paradigm

Software tools are a huge part of DevOps conversations now. Why? Because automation and efficiency, sure, but also because it’s easier to feel confident and efficient in our own ignorance than to face the fact that most of software is about finding the right people to build the right software for other people.

Tools are only a part of our conversation. And more often, it’s tools (I mean outspoken assholes here) that dictate how [little] we understand about DevOps. Just because he buys all the ski equipment and reads a lot about which slopes are best doesn’t make him an expert. Practice matters, and practice means knowing the software landscape.

But tools and automation are only an enabler of the work, an outcome of good decisions; it is the team together which holds the capability to make better decisions tomorrow. And every day has new challenges which yesterday’s solutions won’t overcome, not to mention known challenges that demand experience and perseverance.

If you are automating the shit out of your pipeline, good for you. Is this truly helping you learn how to provide people value, or do you more often find employees arguing about which tool and approach is better? This is an example of how hyper-focusing on tools is counter to the goal of DevOps: to iteratively improve our ability to provide value to (and with) people.

DevOps Is an Orthogonal Paradigm

In this way, DevOps is a mindset that also encompasses those who may not necessarily think it applies to them. It must include everyone, each with our own skills and perspectives. It is not simply about developers and operations. It is about connecting contributors to consumption just as much as it connects consumers to contributions. It is about the whole supply chain, the whole delivery pipeline, and the whole collective of people impacting each other.
For DevOps to be really successful, its execution must be inclusive across boundaries. That includes more than just engineering teams, it involves recruitment, marketing, sales, customer support, HR, PR, and finance. In the IEEE 2675 working group, we are finding that these other groups are a necessary part of the supply chain that DevOps teams depend on. A few examples of the need for an orthogonal approach to DevOps are:
  • How can you go “faster” if your acquisition process takes many months?
  • How can you go “faster” if a supplier doesn’t provide a way to validate that you integrated their product or service correctly?
  • How can testing be continuous if it isn’t automated and therein scalable?
  • How can you expect marketing to crush numbers if you don’t integrate them into your sprints (their work often needs weeks/months of lead time)?
  • If your onboarding process doesn’t train new engineering recruits (dev/test/ops/PM/IT) on your lifecycle, how can you expect them to “go fast”?
  • If it takes days/weeks for customer feedback to reach development cycles, how can you expect to be building the “right thing” tomorrow?
Every one of these questions demands an answer that includes collaboration, which you can’t expect unless you foster a positive work culture and encourage people to improve professionally and personally.

DevOps Is Just a Word

DevOps is the word we have now for the next set of ideas about how to sustainably move fast in the right direction. Unlike a manifesto, its goal isn’t to constrain, but to evolve. Hopefully we can come up with a better name in the future, which is highly likely because we iteratively learn. But DevOps is what we have now, and so far it’s doing us a lot of good.

Streaming Tweets to InfluxDB in Node.js

This week, I’ve been exploring the InfluxData tech stack. As a muse, I decided to turn some of my social media sharing patterns into formal algorithms. I also want to use my blog as a source for keywords, filter out profanity, and apply sentiment analysis to clarify relevant topics in the tweets.

Github repo for this article: github.com/paulsbruce/InfluxTwitterExample

What Is InfluxData?

Simply put, it’s a modern engine for metrics and events. You stream data in, then you have a whole host of options for real-time processing and analysis. Their platform diagram speaks volumes:

From the Telegraf overview on InfluxData.com

Based on all open source components, the InfluxData platform has huge advantages over other competitors in terms of extensibility, language support, and its community. They have cloud and enterprise options when you need to scale your processing up too.

For now, I want to run stuff locally, so I went with the free sandbox environment. Again, the stack is completely open source, which is very cool of them, as lots of their own work ends up as OSS contributions to these bits.

Why Process Twitter Events in InfluxDB?

Well, frankly, it’s an easy source for real-time data. I don’t have a 24/7 Jenkins instance or pay-for stream of enterprise data flowing in right now, but if I did, I would have started there because they already have a Jenkins data plugin. 🙂

But Twitter, just like every social media platform, is a firehose of semi-curated data. I want to share truly relevant information, not the rest of the garbage. To do this, I can pre-filter based on keywords from my blog and ‘friendlies’ that I’ve trusted enough to re-share in the past.

The point is not to automatically re-share content (which would be botty), but to queue up things in a buffer that would likely be something I would re-tweet. Then I can approve or reject these suggestions, which in turn can be a data stream to improve the machine learning algorithms that I will build as Kapacitor user-defined functions later on.

Streaming Data Into InfluxDB

There’s a huge list of existing, ready-to-go plugins for Telegraf, the collection agent. They’ve pretty much thought of everything, but I’m a hard-knocks kind of guy. I want to play with the InfluxDB APIs, so for my exploration I decided to write a standalone process in Node.js to insert data directly into InfluxDB.

To start, let’s declare some basic structures in Node to work with InfluxDB (a minimal sketch follows the list):

  • dbname: the InfluxDB database to insert into
  • measure: the measurement (correlates to relational table) to store data with
  • fields: the specific instance data points to collect on every relevant Tweet
  • tags: an extensible list of topic-based keywords to associate with the data
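Here’s a minimal sketch of those structures (assuming the ‘influx’ npm client; the specific field and tag names are mine, for illustration):

    const Influx = require('influx');

    // dbname: the InfluxDB database to insert into
    const dbname = 'tweets';
    // measure: the measurement (like a relational table) to store data with
    const measure = 'relevant_tweets';

    const influx = new Influx.InfluxDB({
      host: 'localhost',
      database: dbname,
      schema: [{
        measurement: measure,
        // fields: the specific data points to collect on every relevant Tweet
        fields: {
          author: Influx.FieldType.STRING,
          text: Influx.FieldType.STRING,
          relevance: Influx.FieldType.FLOAT,
          raw: Influx.FieldType.STRING // optional raw payload for later analysis
        },
        // tags: topic-based keywords to associate with the data
        tags: ['keywords']
      }]
    });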

Making Sure That the Database Is Created

Of course, we need to ensure that there’s a place and schema for our Twitter data points to land as they come in. That’s simple:
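A sketch using the client configured above:

    // on startup, create the database only if it doesn't already exist
    influx.getDatabaseNames()
      .then(names => {
        if (!names.includes(dbname)) {
          return influx.createDatabase(dbname);
        }
      })
      .catch(err => console.error(`Error creating database ${dbname}:`, err));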

Saving Pre-screened Tweets as InfluxDB Data Points

Minus the plumbing of the Twitter API, inserting Tweets as data points into InfluxDB is also very easy. We simply need to match our internal data structure to that of the schema above:
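Something along these lines (a sketch; the ‘saveTweet’ helper and relevance score are my own names):

    // save one pre-screened tweet as an InfluxDB data point
    function saveTweet(tweet, keywords, relevance) {
      return influx.writePoints([{
        measurement: measure,
        // keywords arrive as a simple array of strings, joined into one tag value
        tags: { keywords: keywords.join(',') },
        fields: {
          author: tweet.user.screen_name,
          text: tweet.text,
          relevance: relevance,
          raw: JSON.stringify(tweet) // optional raw data for later analysis
        },
        timestamp: new Date(tweet.created_at)
      }]).catch(err => console.error('Error writing data point:', err));
    }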

Notice that the keywords (tags) can be a simple Javascript array of strings. I’m also optionally inserting the raw data for later analysis, but aggregating some of the most useful information for InfluxQL queries as fields.

The InfluxDB Node.js client respects ES6 Promises, as we can see with the ‘.catch’ handler. Huge help. This allows us to create robust promise chains with easy-to-read syntax. For more on Promises, read this article.

Verifying the Basic Data Stream

To see that the data is properly flowing into the InfluxData platform, we can use Chronograf in a local sandbox environment and build some simple graphs.

To do this, we use the Graph Editor to write a basic InfluxQL statement:
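A query along these lines (using the hypothetical names from the sketches above) backs the graph:

    // count relevant tweets per keyword in 5-minute buckets over the last hour
    influx.query(
      `SELECT count("relevance") FROM "relevant_tweets"
       WHERE time > now() - 1h
       GROUP BY time(5m), "keywords"`
    ).then(rows => console.log(rows));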

The simple graph shows a flow of relevant tweets grouped by keyword so we can easily visualize as real-time data comes in.

A Few Ideas and Next Steps

Of the many benefits of processing data on the InfluxData platform, processing in Kapacitor seems to be one of the most interesting areas.

Moving forward I’d like to:

  1. Move Sentiment Analysis with Rosette from Node into Kapacitor
  2. Add Machine Learning into Kapacitor for
    A) clarifying relevance of keywords based on sentiment entity extraction
    B) extracting information about the positivity/negativity of the tweet
  3. Catch high-relevance notifications and send to Buffer ‘For Review’ queue
    A) accepts and rejects factor back into machine learning algorithm
    B) follow-up statistics about re-shares further inform ML algorithm
  4. Have Kapacitor alert when:
    A) specific high-priority keywords are used (use ML based on my tweets)
    B) aggregate relevance for a given keyword spikes (hot topic)
    C) a non-tracked keyword/phrase is used in multiple relevant tweets
    (could be a related topic I should track, event hashtag, or something else)

As You Build Your Own, Reach Out!

I’m sure that as I continue to implement some of these ideas, I’ll need help. Fortunately, Influx has a pretty active and helpful Community site. Everything from large exports to plugin development to IoT gateways is discussed there. Jack Zampolin, David Simmons, and even CTO Paul Dix are just a few of the regular contributors to the conversation over there.

And as always, I like to help. As you work through your own exploration of InfluxData, feel free to reach out via Twitter or LinkedIn if you have comments, questions, or ideas.

Wrangling Promises in Node.js: 3 Misconceptions Resolved

ES6 (i.e. proper Javascript) isn’t where Promises were first introduced, but it formally supports them now. Believe me, you want to start using Promises in your Node.js scripts, but if you’re new to the pattern, it can be tricky to get your head around. That’s what I hope to help you do in the next 5 minutes.

What Are Javascript/ES6 Promises?

My way of explaining it is that Promises are a chaining pattern, a convention that helps decouple your code blocks from your execution pattern. Promises can dramatically improve your approach to asynchronous programming (which Node.js 8+ prefers) and simplify your callbacks by helping you express them in a linear fashion.

from Promise (object) on MDN web docs

The really easy thing about them is that a Promise ends in one of two states:

  • Become fulfilled by a value
  • Become rejected with an exception

Consider the following code example (a sketch that wraps the callback-based ‘request’ package in a Promise):
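    const request = require('request');

    // fetchJSON returns a Promise, not the result of executing the request
    function fetchJSON(url) {
      return new Promise((resolve, reject) => {
        request({ url: url, json: true }, (err, response, body) => {
          if (err) return reject(err);
          resolve(body);
        });
      });
    }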

In the above example, the ‘fetchJSON’ function returns a Promise, not the result of executing the request. Expressing things this way allows us to execute the code immediately, or as part of an asynchronous chain, such as this sketch (the endpoint is hypothetical):
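    fetchJSON('https://example.com/api/topics')
      .then(topics => topics.filter(t => t.relevant))
      .then(relevant => console.log(`${relevant.length} relevant topics`))
      .catch(err => console.error('Request failed:', err));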

What’s the alternative? Well…I hesitate to show you (the interweb loves to copy and paste) because we would have to:

  • express every asynchronous action as a callback function (which is bulky)
  • indent/embed blocks in a recursive step pattern
  • chain commands by calling the next function from our executing function

So far, I’ve made a career of learning how to stand up and say ‘I will not build that’. We should do that more often #facebook and you should read this.

The 3 Misconceptions You Want to Immediately Overcome

Amongst the many misconceptions I had while learning to use Promises, these are the top three that I and most others often struggle to overcome:

  1. You can’t mix synchronous/callback-oriented code with Promise-based code
  2. It’s okay to ignore catching errors because it’s optional to the Promise chain
  3. There’s no way to join parallel executing paths of asynchronous Promises

I focus on these 3 misconceptions because when they’re not in your head, you can focus on the simplicity of Promise code. When writing, ask yourself: “is the code I just wrote elegant?” If the answer is no, chances are you’re getting hung up on a misconception.

Mixing Synchronous/Callback-Oriented Code with Promises

You CAN inject legacy synchronous code (code that doesn’t emit Promises), but you have to handle Promise-related tie-ins manually. The code example in the last section does exactly that with the ‘request’ function. However, you DO have to wrap it in a function/lambda that eventually calls the ‘resolve’ or ‘reject’ handlers.

For instance, in a recent integration to Twitter, their ‘stream’ function does not return a Promise and only provides a callback mechanism, so I wrapped it to resolve or reject:
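Roughly like this (a sketch assuming the ‘twitter’ npm package; keys trimmed):

    const Twitter = require('twitter');
    const client = new Twitter({ /* consumer and access token keys */ });

    // wrap the callback-only stream API so callers get a Promise
    function openStream(keywords) {
      return new Promise((resolve, reject) => {
        client.stream('statuses/filter', { track: keywords.join(',') }, stream => {
          // rejection only matters before resolve: failed negotiations bubble up
          stream.on('error', err => reject(err));
          resolve(stream);
        });
      });
    }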

I decided to ‘Promisify’ this functionality because I wanted to wrap this logic in a Promise-based retry mechanism, so that the script would only fail out entirely when multiple attempts at the initial stream negotiation failed. I opted to pull in the ‘promise-retry’ package from npmjs, which simplified the calling code dramatically:
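A sketch of the retry wrapper around the hypothetical ‘openStream’ function above:

    const promiseRetry = require('promise-retry');
    const keywords = ['devops', 'influxdb']; // illustrative

    promiseRetry((retry, attempt) => {
      console.log(`Stream connection attempt #${attempt}`);
      // retry only when the initial stream negotiation fails
      return openStream(keywords).catch(retry);
    }, { retries: 5 })
      .then(stream => stream.on('data', tweet => console.log(tweet.text)))
      .catch(err => console.error('Giving up on the Twitter stream:', err));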

Can you see how powerful Promises are now? Imagine how coupled the retry code would be with the stream initialization logic. Again, not going to show you what it looked like before for fear of the copy-n-paste police.

Don’t Ignore Error Catching Simply Because the Code Validates!

At first, as I was re-writing blocks of code to Promise-savvy statements, I was getting a lot of these errors:
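They looked something like Node 8’s unhandled-rejection warning (exact details vary):

    (node:1234) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: connect ECONNREFUSED
    (node:1234) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.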

The problem was that I didn’t have ‘.catch’ statements anywhere in the Promise chain. Node.js was interpreting the code as valid until runtime when the error occurred. Bad. Really bad of me. Glad that Node 8 was warning me.

You don’t have to write ‘.catch’ after every Promise, particularly if you’re returning Promises through functions, so long as the error is handled in at least one place up the Promise chain hierarchy. The Promise model gives you granularity over which errors you want to bubble up.

For instance, in the above code, I DON’T bubble up individual event/tweet errors, but I DO allow stream initialization errors to bubble up to the calling retry code. I can also selectively extend the individual stream event errors to become a bigger problem if the message back from twitter is something like ‘420 Enhance Your Calm’ which essentially means “back the fuck off, asshole”.

You CAN Join/Wait for Parallel Executing Promises

The Promise chain lets us string together as many sequential steps as we want via the ‘.then’ handler. But what about waiting for parallel threads of code?

Using the ‘Promise.all’ function, we can execute separate Promises asynchronously in parallel to each other, but wait in a parent async function by prefixing with the ‘await’ statement. For example:
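(A sketch reusing the ‘fetchJSON’ helper from earlier; the endpoints are hypothetical.)

    async function loadEverything() {
      // both requests start in parallel; await joins them
      const [keywords, friendlies] = await Promise.all([
        fetchJSON('https://example.com/api/keywords'),
        fetchJSON('https://example.com/api/friendlies')
      ]);
      console.log(`Loaded ${keywords.length} keywords and ${friendlies.length} friendlies.`);
    }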

Within an async function, the above code will wait for both Promises to complete before printing the final statement to the console.

I can tell that now you’re really getting the power of decoupling code expression from code execution. I told you that you’d want to start using Promises. As such, I suggest reading up on the links at the end of the article.

Hidden Lesson: Don’t Bury Your Head in the Sand!

My takeaway from all this learning is that I should have been applying lessons learned in my Java 7 work to other areas like Node.js. Promises aren’t a new idea (think Java Futures or async in C#). If a pattern emerges in one language or framework, it’s very likely to already exist in others. If not, find people and contribute to the solution yourself.

If you run into issues, ping me up on Twitter or LinkedIn, and I’ll do my best to help in a timely manner.


Performance Is (Still) a Feature, Not a Test!

Since I presented the following perspective at APIStrat Chicago 2014, I’ve had many opportunities to clarify and deepen it within the context of Agile and DevOps development:

It’s more productive to view system performance as a feature than to view it as a set of tests you run occasionally.

The more teams I work with, the more I see performance as a critical aspect of their products. But why is performance so important?

‘Fast’ Is a Subconscious User Expectation

Whether you’re building an API, an app, or whatever, its consumers (people, processes) don’t want to wait around. If your software is slow, it becomes a bottleneck to whatever real-world process it facilitates.

Your Facebook feed is a perfect example. If it is even marginally slower to scroll through it today than it was yesterday, if it is glitchy, halting, or janky in any way, your experience turns from dopamine-inducing self-gratification to epinephrine-fueled thoughts of tossing your phone into the nearest body of water. Facebook engineers know this, which is why they build data centers to test and monitor mobile performance on a per-commit basis. For them, this isn’t a luxury; it’s a hard requirement, as it is for all of us, whether we choose to address it or not. Performance is everyone’s problem.

Performance is as critical to delighting people as delivering them features they like. This is why session abandonment rates are a key metric on Cyber Monday.

‘Slow’ Compounds Quickly

Performance is a measurement of availability over time, and time always marches forward. Performance is an aggregate of many dependent systems, and even just one slow link can cause an otherwise blazingly fast process to grind to a halt long enough for people to turn around and walk the other way.

Consider a mobile app; performance is everything. The development team slaves over which list component scrolls faster and more smoothly, spends hours getting asynchronous calls and spinners to provide the user critical feedback so that they don’t think the app has crashed. Then a single misbehaving REST call to some external web API suddenly slows by 50% and the whole user experience is untenable.

The performance of a system is only as strong as its weakest link. In technical terms, this is about risk. You at least need to know the risk introduced by each component of a system; only then can you choose how to mitigate the risk accordingly. ‘Risk’ is a huge theme in ISO 29119 and the upcoming IEEE 2675 draft I’m working on, and any seasoned architect knows why it matters.

Fitting Performance into Feature Work

Working on ‘performance’ and working on a feature shouldn’t be two separate things. Automotive designers don’t separate the two when they build car engines; performance is paramount even throughout the assembly process. Neither should it be separate in software development.

In practice, however, if you’ve never run a load test, tracked the power consumption of a subroutine, or analyzed aggregate results, it will feel different from building features, for sure. Comfort and efficiency come with experience. A lack of experience or familiarity doesn’t remove the need for something critical to occur; it accelerates the need to ask how to get it done.

A reliable code pipeline and testing schedule make all the difference here. Many performance issues take time or dramatic conditions to expose, such as battery degradation, load balancing, and memory leaks. In these cases, it isn’t feasible to execute long-running performance tests for every code check-in.

What does this mean for code contributors? Since they are still responsible for meeting performance criteria, it means that they can’t always press the ‘done’ button today. It means we need reliable delivery pipelines that pragmatically check the performance of the code pushed through them. As pressure to deliver value incrementally mounts, developers are taking responsibility for the build and deployment process through technologies like Docker, Jenkins Pipeline, and Puppet.

It also means that we need to adopt a testing schedule that meets the desired development cadence and the real-world constraints on time and infrastructure:

  • Run small performance checks on all new work (new screens, endpoints, etc.)
  • Run local baselines and compare before individual contributors check in code
  • Schedule long-running (anything slower than 2 mins) performance tests into a pipeline stage that runs in parallel after build verification
  • Schedule nightly performance regression checks on all critical risk workflows (i.e. login, checkout, submit claim, etc.)

How Do You Bake Performance Into Development?

While it’s perfectly fine to adopt patterns like ‘spike and stabilize’ on feature development, stabilization is a required payback of the technical debt you incur when your development spikes. To ‘stabilize’ isn’t just to make the code work; it’s to make it work well. This includes meeting performance (not just acceptance) criteria before work is considered complete.

A great place to start making measurable performance improvements is to measure performance objectively. Every user story should contain solid performance criteria, just as it should acceptance criteria. In recent joint research, I found that higher-performing development teams include performance criteria on 50% more of their user stories.

In other words, embedding tangible performance expectations in your user stories bakes performance in to the resulting system.
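For example, a story might carry criteria like these (hypothetical numbers):

    Story: As a policyholder, I can submit a claim from the mobile app.
    Acceptance criteria: the submitted claim appears in my claim history.
    Performance criteria: 95th percentile submit-to-confirmation time stays
    under 800ms at 500 concurrent users, with an error rate below 0.1% over
    a 30-minute sustained load.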

There are a lot of sub-topics under the umbrella term “performance”. When we get down to brass tacks, measuring performance characteristics often boils down to three aspects: throughput, reliability, and scalability. I’m a huge fan of load testing because it helps to verify all three measurable aspects of performance.

Throughput: from a good load test, you can objectively track throughput metrics like transactions/sec, time-to-first-byte (and last byte), and distribution of resource usage (i.e. are all CPUs being used efficiently). These give you a raw and necessarily granular level of detail that can be monitored and visualized in stand-ups and deep-dives equally.

Reliability: load tests also exercise your code far more than you can independently. It takes exercise to expose whether a process is unreliable; concurrency in a load test is like exercise on steroids. Load tests can act as your robot army, especially when infrastructure or configuration changes push you into unknown risk territory.

Scalability: often, scalability mechanisms like load balancing, dynamic provisioning, and network shaping throw unexpected curveballs into your user’s experience. Unless you are practicing a near-religious level of control over deployment of code, infrastructure, and configuration changes into production, you run the risk of affecting real users (i.e. your paycheck). Load tests are a great way to see what happens ahead of time.

 

Short, Iterative Load Testing Fits Development Cycles

I am currently working with a client to load test their APIs, simulating mobile-client bursts of traffic that represent real-world scenarios. After a few rounds of testing, we’ve resolved many obvious issues, such as:

  • Overly verbose logs that write to SQL and/or disk
  • Parameter formats that cause server-side parsing errors
  • Throughput restrictions against other 3rd-party APIs (Google, Apple)
  • Static data that doesn’t exercise the system sufficiently
  • Large images stored as SQL blobs with no caching

We’ve been able to work through most of these issues quickly in test/fail/fix/re-test cycles, where we conduct short all-hands sessions with a developer, a test engineer, and myself. After a quick review of significant changes since the last session (i.e. code, test, infrastructure, configuration), we use BlazeMeter to kick off a new API load test written in JMeter and monitor the server in real time. We’ve been able to rapidly resolve a few anticipated, backlogged issues as well as learn about new problems that are likely to arise at future usage tiers.

The key here is to ‘anticipate iterative re-testing‘. Again I say: “performance is a feature, not a test”. It WILL require re-design and re-shaping as the code changes and system behaviors are better understood. It’s not a one-time thing to verify how a dynamic system behaves given a particular usage pattern.

The outcome from a business perspective of this load testing is that this new system is perceived to be far less of a risky venture, and more the innovation investment needed to improve sales and the future of their digital strategy.

Performance really does matter to everyone. That’s why I’m available to chat with you about it any time. Ping me on Twitter and we’ll take it from there.