Put That in Your Pipeline and Smoke Test It!

I rarely bother to open my mouth as a speaker and step into a spotlight anymore. I’ve been mostly focused on observing, listening, and organizing tech communities in my local Boston area for the past few years. I just find that others’ voices are often more worth amplifying than my own.

A friend of mine asked if I would present at the local Ministry of Testing meetup, and since she did me a huge last-minute favor last month, I was more than happy to oblige.

“Testing Is Always Interesting Enough to Blog About”

— James Goin, DevOps Engineer. Quoted with permission from the Boston DevOps community, Dec 12th, 2019.

The state and craft of quality (not to mention performance) engineering has changed dramatically in the 5 years since I purposely committed to it. After wasting most of my early tech career as a developer not writing testable software, the latter part of my career has been what some might consider penance to that effect.

I now work in the reliability engineering space. More specifically, I’m a Director of Customer Engineering at a company focusing on the F500. As a performance nerd, everything inherits a statistical perspective, not excluding how I view people, process, and technology. In this demographic, “maturity” models are a complex curve across dozens of teams and a history of IT decisions, not something you can pull out of an Agilista’s sardine can or teach like the CMMI once thought it could.

A Presentation as Aperitif to Hive Minding

This presentation distills those experiences to date. It was mostly inspired by a desire to learn what other practitioners like me think when faced with the challenge of translating the importance of holistic thinking about software quality to business leaders.

Slides: bit.ly/put-that-in-your-pipeline-2019

Like I say at the beginning of this presentation, the goal is to incite collaboration about concepts, sharing the puzzle pieces I am actively working to clarify so that the whole group can engage with each other constructively.

Hive Minding on What Can/Must/Shouldn’t Be Tested

The phrase ‘Hive Minding’ is (to my knowledge and Google results) a turn of phrase of my own invention. It’s one incremental iteration past my work and research in open spaces, emphasizing:

  • Collective, aggregated collaboration
  • Striking a balance between personal and real-time thinking
  • Mindful, structured interactions to optimize outcomes

At this meetup, I beta-launched the 1-2-4-All method from Liberating Structures, which had worked so well for me at a product strategy session in France last month. It balanced the opposing divergent and convergent modes of thinking, as discussed in The Creative Thinker’s Toolkit, so well that I was compelled to continue my active research into improving group facilitation.

Even after a few people had to leave the meetup early, there were still six groups of four. In France there were eight contributors, so this time I had a manageable but still scaled (3x) experiment in how hive minding works with larger groups.

My personal key learnings

Before I share some of the community feedback (below), I should mention what I, as the organizer, saw during the meetup and in its outcomes:

  • I need to use a bell or chime sound on my phone rather than having to interrupt people once the timers elapse for each of the 1-2-4 sessions; I hate stopping good conversation just because there’s a pre-agreed-to meeting structure.
  • We were able to expose non-quality-engineer people (like SysOps and managers) to concepts new to them, such as negative testing and service virtualization; hopefully next time they’re hiring a QA manager, they’ll have new things to chat about.
  • Many people confirmed some of the hypotheses in the presentation with real-world examples: you can’t test all the things, and sometimes you can’t even test the thing because of non-technical limitations such as unavailability of systems, budget, or failure of management to understand the impact on organizational risk.
  • I was able to give shout-outs to great work I’ve run across in my journeys, such as Resilient Coders of Boston and technical projects like Mockiato and OpenTelemetry.
  • Quite a few people hung out afterward to express appreciation and interest in the sushi menu of ideas in the presentation. They are why I work so hard on my research areas.
  • I have to stop saying “you guys”. It slipped out twice, and I was internally embarrassed that this is still a latent habit. At least one-third of the attendees were women in technology, and as important as it is to be an accomplice to underrepresented communities (including non-binary individuals), my words need work.

Some Community Feedback, Anonymized

Consolidated outcomes of “Hive Minding” on the topics “What must be tested?” and “What can’t we test?”
  • What must we test?
    • Regressions, integrations, negative testing
    • Deliver what you promised
    • Requirements & customer use cases
    • Underlying dependency changes
    • Access to our systems
    • Monitoring mechanisms
    • Pipelines
    • Things that lots of devs use (security libraries)
    • Things with lots of dependencies
  • What can’t we test?
    • Processes that never finish (non-deterministic, infinite streams)
    • Brute-force enterprise cracking
    • Production systems
    • Production data (privacy concerns)
    • “All” versions of something, some equipment, types of data
    • Exhaustive testing
    • Randomness
    • High-fidelity combinations where dimensions exponentially multiply cases
    • Full system tests (takes too long for CI/CD)

A few thoughts from folks in Slack (scrubbed for privacy)…

Anonymized community member:

Writing up my personal answers to @paulsbruce’s hivemind questions yesterday evening: What can/should you test?

  • well-specified properties of your system, of the form “if A then B”. Only test those when your gut tells you they are complex enough to warrant a test, or as a preliminary step to fixing a bug and making sure it won’t get hit again (see my answer to the next question).
  • your monitoring and alerting pipeline. You can never test up front for everything; things will break. The least you can do is test for end failure, and polish your observability to make debugging/fixing easier.

What can’t/shouldn’t you test?

  • my answer here is a bit controversial, and a bit tongue in cheek (I’m the person writing more than 80% of the tests at my current job). You should test the least amount possible. In software, writing tests is very expensive. Tests add code, sometimes very complex code that is hard to read and hard to test in itself. This means it will quickly rot, or worse, it will keep people from modifying the software architecture or making bold moves, because tests will break/become obsolete. For example, assume you tested every single detail of your current DB schema and DB behaviour. If changing the DB schema or moving to a new storage backend is “the right move” from a product standpoint, all your tests become obsolete.
  • tests will often add a lot of complexity to your codebase, only for the purpose of testing. You will have to add mocking at every level. You will have to set up CI/CD jobs. The cost of this depends on what kind of software you write; the problem is well solved for webby/microservicy/cloudy things, much less so for custom software / desktop software / web frontends / software with complex concurrency. For example, in my current job (highly concurrent embedded firmware), everything is mocked: every state machine, every hardware component, every communication bus is mocked so that individual state machines can be tested against. This means that if you add a new hardware sensor, you end up writing over 200 lines of boilerplate just to satisfy the mocking requirements. This can be alleviated with scaffolding tools and some clever programming language features, but there is no denying the added complexity.

To add to this, I think this is especially a problem for junior developers / developers who don’t have enough experience with large-scale codebases. They are starry-eyed about TDD and “best practices” and “functional programming will save the world”, and so don’t exercise the right judgment on where to test and where not to test. So you end up with huge test suites that basically test that calling database.get_customer('john smith') == customer('john smith'), which is pretty useless. Much more useful would be logging that result.name != requested_name in the function get_customer.

the first is going to be run in a mocked environment either on the dev machine, on the builder, or in a staging environment, and might not catch a race condition between writers and readers that happens under load every blue moon. the logging will, and you can alert on it. furthermore, if the bug is caught as a user bug “i tried to update the customer’s name, but i got the wrong result”, a developer can get the trace, and immediately figure out which function failed
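To ground that idea, a minimal sketch of the logging-over-testing approach might look like this (the db interface and names are hypothetical, my own illustration rather than the contributor’s code):

    import logging

    logger = logging.getLogger(__name__)

    def get_customer(db, requested_name):
        """Fetch a customer, logging mismatches rather than only unit-testing them."""
        result = db.get_customer(requested_name)  # hypothetical data-access call
        # Unlike a mocked unit test, this check runs in production under real
        # load, so it can catch the rare reader/writer race described above.
        if result is not None and result.name != requested_name:
            logger.error("customer lookup mismatch: requested=%r got=%r",
                         requested_name, result.name)
        return result

Alert on that log line, and the race surfaces itself in production with a trace attached.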

Then someone else chimed in:

It sounds like you’re pitting your anecdotal experience against the entire history of the industry and all the data showing that bugs are cheaper and faster to fix when found “to the left” i.e. before production. The idea that a developer can get a trace and immediately figure out which function failed is a starry-eyed fantasy when it comes to most software and systems in production in the world today.

The original contributor then continues with:

yeah, this is personal experience, and we don’t just yeet stuff into production. as far as data-driven software engineering goes, I find most scientific studies to be of dubious value, meaning we’re all back to personal experience. as for trace-driven debugging, it’s working quite well at my workplace; I can go into much more detail about how these things work (I had a webinar with Qt up online but I think they took it down)

as said, it’s a bit tongue in cheek, but if there’s no strong incentive to test something, I would say, don’t. the one thing i do is keep tabs on which bugs we did fix later on, which parts of the source code were affected, who fixed them, and draw conclusions from that

Sailboat Retrospective

Using the concept of a sailboat retrospective, here is what propelled us, what slowed us, and what to watch out for:

Things that propel us:

  • Many people said they really liked the collaborative nature of hive minding and would love to do this again because it got people to share learnings and ideas
  • Reading the crowd in real-time, I could see that people were connecting with the ideas and message; there were no “bad actors” or trolls in the crowd
  • Space, food, invites and social media logistics were handled well (not on me)

Things that slowed us:

  • My presentation was 50+ mins, way too long for a meetup IMO.

    To improve this, I need to:
    • Break my content and narratives into smaller chunks, ones I can actually deliver in a 20min timeframe. If people want to hear more, I can chain on topics.
    • Recruit a timekeeper from the audience, someone who provides accountability
    • Don’t get into minutiae and examples that bulk out my message, unless asked
  • Audio/video recording and last-minute mic difficulties tend to throw speakers off

    To fix this? Maybe bring my own recording and A/V gear next time.
  • Having to verbally interrupt people at the agreed-upon time-breaks in 1-2-4-All seems counter to the collaborative spirit.

    To improve this, possibly use a Pavlovian sound on my phone (ding, chime, etc.)

Things to watch out for:

  • I used the all-too-common gender-binary phrase “you guys” twice. Imagine rooms where that somehow passes as fine, yet saying “hey ladies” to a mixed crowd would be considered pejorative by many cisgender men. Everything can be improved, and this is certainly one thing I plan to be very conscious of.
  • Though it’s important to have people write things down themselves, not everyone’s handwriting can be read back by others afterward, certainly not without high-fidelity photos of the post-its.

    To improve this, maybe stand with the final group representatives and, if needed, re-write on the whiteboard next to their post-it the key concepts they verbalize to the “all” group.


Afterthoughts on Hive Minding

It’s a powerful thing to understand how your brain works, what motivates you, and what you don’t care about. There are so many things that can distract, but at the end of the day, very few things are immediately and measurably worth having done. Shipping myself to Europe until next week, for example, has already had measurable personal and professional impact.

One thing I experienced this week, after injecting a little disruption to conformity yesterday, was what I now call “hive minding”: assisting independent contributors in rowing in the same direction. The classic stereotype of “herding cats” implies that actors only care about themselves, but unlike cats, a bee colony shares an intuitive survival imperative to build and improve the structure that ensures its survival. Each bee might not consciously think about “lasting value”, but it’s built into their nature.

Be Kind, Rewind

I’m always restless, every success followed by a new challenge, and I wouldn’t have it any other way, but it does lead to a growing consideration about plateauing. Plateauing is a million times worse than burning out. Plenty of people and companies have already burned out but are still doing something “functional” in a dysfunctional industry, and if the decision is to flip that investment, it’s an easy one to make: fire them, trade, or cut funding. But what do you do with a resource when they plateau?

I think you’ll know you’ve plateaued when you find yourself without restlessness. If necessity is the mother of invention, restlessness is the chambermaid of a clean mind. At least for me, like a hungry tiger in a cave, I must feed my restlessness with purposeful and aligned professional work. The only problematic moment with me: I like to get ahead of the problem of someone telling me what to do by figuring out what we (everyone, me and them) should be doing before someone dictates it with less context.

The sweet spot of this motion is to do it together, not in isolation and not dictatorially, but coalescing around the importance of arriving at the “right” goals, in alignment, at the same time. The only surprise when you’re riding the wave together is what comes next, and when you engineer this into the process, surprises are mostly good.

It took a while to arrive at this position. I had to roll up sleeves, work with many different teams in multiple organizations, listen to those whose shoes I don’t have the time or aptitude to fill, figure out how to synthesize their inputs into cogent and agreeable outcomes, and do so with a level of continuity that distinguishes this approach from traditional forms of management and group facilitation.

Don’t Try This On Your Own

The cost of adaptability is very high. If I didn’t have an equally dedicated partner to run the homefront, none of this would work. She’s sought the same kind of commitment and focus in raising the kids as I put into what pays the bills. There are very few character traits and creature comforts we share, but in our obsession over the things that make the absolute best come out of what we have, she more than completes the situation.

In this lifestyle, I have to determine, day by day and week by week, which net-new motions/motivations I need to pick up and which I need to put down, either temporarily or permanently. This can feel like thrash to some, but for me, every day is a chance to re-assess based on all the days before now; I can take that opportunity or not, but it is there whether or not I do. If my decisions are only made in big batches, similar to code/product releases, I inherit the complexities and inefficiencies of “big measurement”, namely a loss of granularity in iterative improvement.

Feedback Loops, Everywhere

As I explore the dynamics of continuous feedback loops beyond software and into human systems, a model emerges of frequency in feedback and software delivery not as separate mechanisms, but as symbiotic. The more frequently you release, the more chances there are for feedback. The more feedback you can synthesize into value, the more frequently you want to release. One does not ‘predict’ the other; their rates bound each other, like a non-binary statistical model.

What I mean is that a slow release cycle predicts slow feedback, and slow feedback predicts low value from releasing frequently; a fast feedback mechanism addicts people to faster release cycles. They share the relationship, and depending on how extreme the dynamics feeding into one side are, the other side suffers. Maybe at some point, it’s a lost cause.

An example from the performance and reliability wheelhouse is low/slow performance observability. When you can’t see what’s causing a severe production incident, live investigation and post-mortem activity are slow and take time away from engineering a more reliable solution. Firefighting takes dev, SRE, ops, and product management time…it’s just a fact. Teams that understand this underlying relationship and synthesize it back into their work tend to use SEV1 incidents as teachable moments to improve visibility into underlying systems AND into behavioral predictors (critical system queue lengths, the levels of capacity use that constitute “before critical”, architectural bottlenecks that inform priorities on reducing “tech debt”, etc.).
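As a minimal sketch of what one such behavioral predictor might look like in code (the thresholds and names here are hypothetical, purely for illustration):

    # Alert on queue depth well before the level at which past SEV1
    # incidents actually began; both thresholds below are hypothetical.
    QUEUE_DEPTH_CRITICAL = 1000  # depth where incidents historically started
    QUEUE_DEPTH_WARNING = int(QUEUE_DEPTH_CRITICAL * 0.6)  # "before critical"

    def classify_queue_depth(depth: int) -> str:
        """Turn a raw queue-length sample into an actionable signal."""
        if depth >= QUEUE_DEPTH_CRITICAL:
            return "critical"  # firefighting territory
        if depth >= QUEUE_DEPTH_WARNING:
            return "warning"   # teachable moment: investigate before it's a SEV1
        return "ok"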

The point is that feedback loops take time and iterative learning to properly inject in a way that has a positive, measurable impact on product delivery and team dynamics.

Going from Feedback Loops to Iterations…Together

All effective feedback loops have one thing in common: they measure achievement levels framed by a shared goal. So you really have to work to uncover the shared goals in a team. If they suit you, and/or if you can accept the awesome responsibility of challenging and changing them over time, it’s a wild ride of learning and transforming. If not, find another team, company, or tribe. Everyone needs a mountain they can traverse and shouldn’t put themselves up a trail that will destroy them. This is why occasionally stepping back, collaborating, and reporting out what works and what doesn’t is so important. Re-enter the concept of “team meetings”.

Increasingly, most engineers I talk to abhor the notion of more meetings, usually because they’ve experienced their fair share of meetings that don’t respect their time or where their inputs have not been respectfully synthesized in a way they can see. So what, are meetings a bad thing?

Well, no, not if your meetings are very well run. This is not one person’s job, though scrumbags and mid-level managers with confirmation bias abound, especially since meetings have no built-in NPS (net promoter score). A solution I’ve seen to the anti-pattern of ineffective meetings is to establish common knowledge of what an “effective” meeting looks like, how it runs, and why, and to expect those behaviors from everyone on the team and in the org.

How to Encourage Effective Collaboration in Meetings

Learn to listen, synthesize, and articulate back in real time. Let too much time go by, and detail and context evaporate like winter breath. Capture as much of this context as you can while respecting the flow of the conversation. This will help you and others remember and respect the “why”, and will allow people to see afterward what was missing (perspectives, thinking, constructs). Examples of capture include meeting minutes, pictures of post-its, non-private notes from everyone, and even recordings.

But in just about every team and organization there’s a rampant misconception that ALL meetings must produce outcomes that look like decisions or action items. These are very beneficial, but I’ve seen people become anti-productive when treating themselves and others as slaves to these outcomes. Making decisions too early drives convergent attitudes that are often uninformed, under-aligned, and destructive.

Some of the most effective meetings I’ve had share the following patterns:

  • know why you’re meeting, provide context before, and set realistic expectations
  • have the “right” people in the room
    • who benefit from the anticipated outcomes and are therefore invested in them
    • who bring absolutely critical perspective, without which outcomes are invalidated or require significant toil to refactor back in afterward; not too few
    • who contribute to functional outcomes (as opposed to those known to bring dysfunction, disrespect others’ time, or argue rather than align); not too many
  • agree on what positive and negative outcomes look like before starting in
  • use communication constructs to keep people on track with producing outcomes
  • have someone ensure capture (though not necessarily do all of it); a note- and picture-taker
  • outcomes are categorized as:
    • clear, aligned decisions (what will happen, what worked, what didn’t, what next)
    • concrete concerns and missing inputs that represent blockers to the above
    • themes and sense of directional changes (i.e. we think we need to change X)
    • all info captured and provided as additional context for others

Trust AND Verify

One thing I keep finding useful is to challenge the “but” in “trust, but verify”. In English, the word “but” carries a negating connotation; it invalidates what was said before it. “Your input was super important, BUT it’s hard to understand how it’s useful”…basically means “Your input was not important because it was not usable.”

My alternative is to “trust and verify”, but with a twist. If you’re doing it right, trust is easy because you preemptively provided an easy means to verify it. If you provide evidence along with your opinion, reasonable people are likely to trust your judgment. For me, rolling up my sleeves is a very important tool in my toolbelt for producing evidence for or against a particular position. I know there are other methods, both legitimate and nefarious, but I find practical experience far more defensible than constructing decisions on shaky foundations.

All this said, even if you’re delivering self-evident verification with your work, relationships take time, and certainly more than one or two demonstrations of trustworthiness, to attain a momentum of their own. Trust takes time, is all.

Takeaways and Action Items from This Week

Democratic decision processes are “thrashy”. Laws and sausages: no one wants to know how they’re made. In small teams going fast, we don’t have the luxury of being ignorant of outcomes and the context behind them. For some people, “democracy” feels better than dictatorial decisions being handed down without context; but for those who still find a way to complain about the outcomes, they need to ask themselves, “did I really care enough to engage in a functional and useful way, and did I even bother to educate myself on the context behind the decision I don’t like?”

Just like missing a critical perspective in a software team, in a global organization, when one region or office dominates an area of business (U.S. on sales, EU on security, for instance), this will inevitably bias outcomes and decisions affecting everyone. As the individual that I report to puts it, “scalability matters to every idea, not just when we’re ready to deploy that idea”. Make sure you have the right “everyone” in the room, depending on the context of your work and organizational culture.

Someone I met and deeply respect once told me, “it’s not enough to be an ally, you need to be an accomplice“. In context, she was referring to improving the epic dysfunction of modern technology culture by purposefully including underrepresented persons. Even if we make a 10% improvement to women’s salaries, hire more African-American engineers, and create a safer place for LGBTQ people, I still agree with the premise that doing these things isn’t good enough. Put another way, receiving critical medical treatment for a gushing head wound isn’t an “over-compensation”; it’s a measured response to the situation. The technology gushing head wound, in this case, is an almost complete denial from WGLM (white guys like me) that there is a problem, that doing nothing continuously enables the causes of the problem, that leadership on this doesn’t necessarily look or think like us, and that action is needed now.

Bringing it back to the wheelhouse of this article, a true improvement culture takes more than saying “sure, let me wave at you as an ally while you go improve the team”. It takes being an accomplice (think: a getaway driver); we should ALL be complicit in decisions and improvement. Put some skin in the game, figure out how something truly worth improving maps to your current WiP (work in progress) limits, and you may find that you need to put something less worthy of your time down before you can effectively contribute to improvement work. Surrounding yourself with folks who get this will also increase the chances that you’ll all succeed. This is not over-compensation; it is what everyone needs to do now to thrive, not just survive.

On Lack of Transparency in SaaS Providers

As many organizations transition their technical systems to SaaS offerings they don’t own or operate, I find it surprising that when a company acquires a 3rd-party product deployed on such an offering, they are often told to “just trust us” about security, performance, and scalability. I’m a performance nerd; that and the DevOps mindset are my most active areas of work and research, so this perspective is scoped to those topics.

In my experience with large organizations and DevOps teams, the “hope is not a strategy” principle seems to go missing in the transition from internal team-speak to external service agreement. Inside a 3rd-party vendor, say Salesforce Commerce Cloud, I’m sure they’re very skilled at what they do (I’m not guessing here; I know folks who work on technical teams in Burlington, MA). But even if you espouse a trust-but-verify culture internally, telling customers who are concerned about the performance of your offering at scale to “just trust us” seems misaligned.

TL;DR: SaaS Providers, Improve Your Transparency

If you provide a shared-tenancy service based on cloud and I can’t acquire service-level performance, security audits, and error logs isolated to my account, that is a transparent view into how little your internal processes (if they even exist around these concerns) actually improve service for me, your customer.

If you do provide these metrics to internal [product] teams, ask “why do we do that in the first place?” Consider that the answers you come up with almost always apply equally to the external consumers who pay for your services; they too are technologists, have revenue on the line, and care about delivering value successfully with minimal issues across a continuous delivery model.

If you don’t do a good job internally of continuously measuring and synthesizing the importance of performance, security, and error/issue data, please for the love of whatever get on that right now. It helps you, the teams you serve, and ultimately customers to have products and services that are accurate, verifiable, and reliable.

How Do You Move from “Trust Us” to Tangible Outcomes?

Like any good engineer, when a problem is big or ambiguous, start breaking that monolith up. If someone says “trust us”, be specific about what you’re looking to achieve and what you need to do it, which puts the onus on them to map what they have to your terms. Sometimes this is easy, other times it’s not. Both outcomes yield useful information: what you do know and what you don’t. Then you can double-click into how to unpack the unknowns (and unknowables) in the new landscape.

For SaaS performance, at a high level we look for:

  • Uptime and availability reports (general) and the frequency of publication
  • Data on latency, the more granular to service or resource the better
  • Throughput (typically in Mbps etc.) for the domains hosted or serviced
  • Error # and/or rate, and if error detail is also provided in the form of logs
  • Queueing or other signs of service-ingress congestion
  • Some gauge or measure of usage vs. [account] limits and capacity
  • Failover and balancing events (such as circuit breaks or load balancing changes)

You may be hard-pressed to get some of these pieces of telemetry in real time from your SaaS provider, but they serve as concrete talking points about what typical performance engineering practices need to verify in systems under load.
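Even when a provider only exposes a basic status endpoint, you can start collecting a tenant-side view of some of the telemetry above. A minimal sketch, assuming a hypothetical per-tenant metrics URL:

    import time
    import requests

    STATUS_URL = "https://status.example-saas.com/api/v1/tenant-metrics"  # hypothetical

    def sample_service_telemetry(url=STATUS_URL, samples=5):
        """Record the latency and error rate a tenant can observe from outside."""
        latencies, errors = [], 0
        for _ in range(samples):
            start = time.monotonic()
            try:
                requests.get(url, timeout=10).raise_for_status()
            except requests.RequestException:
                errors += 1
            latencies.append(time.monotonic() - start)
            time.sleep(1)  # space out the samples
        return {
            "avg_latency_s": sum(latencies) / len(latencies),
            "error_rate": errors / samples,
        }

It won’t replace provider-side metrics, but it turns “just trust us” into a conversation anchored in numbers you both can see.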

Real-world Example: Coaching a National Retailer

A message I sent today to a customer, names omitted:

[Dir of Performance Operations],

As I’m on a call with the IEEE on supplier/acquirer semantics in the context of DevOps, it occurs to me that a key element is missing in [Retailer’s] transition from last year’s legacy web solution to what is now deployed via Commerce Cloud: the lack of transparency (or simply not asking, on our part) over service underpinnings is a significant risk, both in terms of system readiness and unanticipated costs. My work with the standard brought up two ideas about what [Retailer] should expect from Salesforce:

A) what their process is for verifying the readiness of the services and service-level rendered to [Retailer], and

B) demonstrated evidence of what occurs (service levels and failover mechanisms) under significant pressure to their services

In the past, [Retailer’s] performance engineering practice had the agency both to put pressure on your site/services AND, importantly, to measure the impact on your infrastructure. The latter is missing in their service offering, which means that if you run tests and the results don’t meet your satisfaction, the dialog to resolve them with Salesforce lacks minimum-viable technical discussion points on what specifically is going wrong and how to fix it. This will mean sluggish MTTR and potentially synthesizing the expectation of longer feedback cycles into project/test planning.

Because of shared tenancy, you can’t expect them to hand over server logs, service-level measurements, or real-time entry points to their own internal monitoring solutions. Similarly, no engineering-competent service provider can reasonably expect consumers to “just trust” that an aggregate product-plus-configuration-plus-customizations solution will perform at large scale, particularly when mission-critical verification was in place before fork-lifting your digital front door to Salesforce. We [vendor] see this need for independent verification of COTS all the time across many industries, despite a lack of proof of failure in the past.

My recommendation is that, building on what you started by creating a ticket with them on this topic, we progressively seek thorough information on points A and B above from a product-level authority (i.e. the product team). If that comes via a support or account rep, that’s fine, but it should be adequate for you to ask more informed questions about architectural service limits, balancing, and failover.

//Paul

What Do You Think?

I’m always seeking perspectives other than my own. If you have a story to tell, a question, or some other augmentation to this post, please do leave a comment. You can also reach out to me on Twitter, LinkedIn, or email [“me” -at– “paulsbruce” –dot- “io”]. My typical SLA for latency is less than 48hrs unless requests are malformed or malicious.

Performance Engineer vs. Tester

A performance engineer’s job is to get things to work really, really well.

Some might say that the difference between being a performance tester and a performance engineer boils down to scope. The scope of a tester is testing: to construct, execute, and verify test results. An engineer seeks to understand, validate, and improve the operational context of a system.

Sure, let’s go with that for now, but really the difference is an appetite for curiosity. Some people treat monoliths as something to fear or control. Others explore them, learn how to move beyond them, and how to bring others along in the journey.

Testing Is Just a Necessary Tactic of an Engineer

Imagine being an advisor to a professional musician, their performance engineer. What would that involve? You wouldn’t just administer tests, you would carefully coach, craft instruction, listen and observe, seek counsel from other musicians and advisors, ultimately to provide the best possible path forward to your client. You would need to know their domain, their processes, their talents and weaknesses, their struggle.

With software teams and complex distributed systems, a lot can go wrong very quickly. Everyone tends to assume their best intentions manifest into their code, that what they build is today’s best. Then time goes by, and everything more than 6 months old is already brownfield. What if the design of a thing is already so riddled with false assumptions and unknowns that everything is brownfield before it even begins?

Pretend with me for a moment that if you were to embody the software you write, become your code, and look at your operational lifecycle as if it were your binary career, your future would be a bleak landscape of retirement options. Your code has a half-life.

Everything Is Flawed from the Moment of Inception

Most software is like this…not complete shit but more like well-intentioned gift baskets full of fruits, candies, pretty things, easter eggs, and bunny droppings. Spoils the whole fucking lot when you find them in there. A session management microservice that only starts to lose sessions once a few hundred people are active. An obese 3MB CSS file accidentally included in the final deployment. A reindexing process that tanks your order fulfillment process to 45 seconds, giving customers just enough time to rethink.

Performance engineers don’t simply polish turds. We help people not to build broken systems to begin with. In planning meetings, we coach people to ask critical performance questions by asking those questions in a way that appeals to their ego and curiosity, at a time when it’s cost-effective to do so. We write in BIG BOLD RED SHARPIE in a corner of the sprint board the percentage slow-down to the login process that the nightly build has now caused. We develop an easy way to assess the performance of changes and new code, so that task templates in JIRA can include a “performance checkbox” in a meaningful way, with simple steps on a wiki page.

Engineers Ask Questions Because Curiosity Is Their Skill

We ask how a young SRE’s good intentions of wrapping up statistical R models from a data science product team in Docker containers to speed deployment to production will affect resources, and how they intend to measure the change impact so that the CFO isn’t knocking down their door the next day.

We ask why the architects didn’t impose requirements on their GraphQL queries to deliver only the necessary fields in JSON responses to mobile app clients, so that developers aren’t even allowed to reinvent the ‘SELECT * FROM’ mistake so rampant in legacy relational and OLAP systems.
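For illustration only (the endpoint, type, and field names below are hypothetical), a field-limited GraphQL request names exactly what the mobile screen renders and nothing more:

    import requests

    GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

    # Ask only for the fields the mobile screen actually renders, rather
    # than recreating the 'SELECT * FROM' anti-pattern over HTTP.
    PRODUCT_CARD_QUERY = """
    query ProductCard($id: ID!) {
      product(id: $id) {
        name
        price
        thumbnailUrl
      }
    }
    """

    def fetch_product_card(product_id):
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": PRODUCT_CARD_QUERY, "variables": {"id": product_id}},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["data"]["product"]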

We ask what the appropriate limits should be on auto-scaling and load-balancing strategies, and when we’d like to be alerted that our instance limits and contractual bandwidth limits are approaching cutoff levels. We provide cross-domain expertise from Ops, Dev, and Test to continuously integrate the evidence of false assumptions back into the earliest cycle possible. There should be processes in place to expose and capture things that can’t always be known at the time of planning.

Testers ask questions (or should) before they start testing: entry/exit criteria, requirements gathering, test data, branch coverage expectations, results format, sure. Testing is important, but it is only a tactic.

Engineers Improve Process, Systems, and Teams

In contrast, engineering has the curiosity and the expertise to get ahead of testing, so that when the time comes, the only surprises are the ones that are actually surprising, the problems no one could have anticipated, and to advise on how to solve them based on evidence and team feedback collected throughout planning, implementation, and operation cycles.

An engineer’s greatest hope is to make things work really, really well. That hope extends beyond the software, the hardware, and the environment. It includes the teams, the processes, the business risks, and the end-user expectations.

Holiday IoT and the Performance Imperative

A few words to manufacturers and vendors of tech toys: to really be ready for the holiday, if your product requires software updates in order to work or is in any way internet connected, make sure your site stays up. Otherwise, you just shipped coal.

  • Provide more than one distribution point for downloadable updates/binaries
  • Rely on CDNs for static assets (like installers and documentation)
  • If your update process must rely on live services, make sure they’re scalable
    • Load test subcomponents/microservices AND the end-to-end process
  • Be prepared for damage control by:
    • Monitoring site uptime and availability to know when things are broken
    • Proactively establishing a communication channel with customers during issues
    • Properly staffing IT and support for issues during AND after the season
  • Make sure the cost of downtime is factored into your next sales cycle

Santa Brought Us a Brick?

Let me start by admitting how 1st-world this example is. Robots as playthings are still not exactly ‘so easy, a child could do it’, and Roombas have been around for almost two decades, but we have yet to see a really down-to-earth home robotics project that works for children under 10. I don’t just mean toys that require assembly; even the right-out-of-the-box kind often require firmware updates or online services to really work as expected.

Case in point: the Meccano MAX. It only took about 3hrs total to put together, but this morning, when we finally turned it on and went to connect for a firmware update, the vendor’s website was down…hard. The instructions said, before anything else, to update the ‘MeccaMind’, and voice commands weren’t working without it, so: blocker.

As an ops nerd, I slapped an uptime monitor on it to know when (if ever) it was back up.
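I used an off-the-shelf monitor, but a DIY equivalent is only a few lines. A rough sketch (the URL and polling interval are placeholders):

    import time
    import requests

    def watch_until_up(url, interval_s=300):
        """Poll a site until it returns HTTP 200, printing each check."""
        while True:
            try:
                status = requests.get(url, timeout=15).status_code
            except requests.RequestException:
                status = None  # DNS failure, timeout, connection refused, etc.
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {url} -> {status}")
            if status == 200:
                return
            time.sleep(interval_s)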

That didn’t stop the whole multi-day experience from deflating into a dud. We all worked on this thing together, and then, before it could do anything, we were stuck guessing about when we could actually enjoy it. Don’t blame Santa; blame the geniuses at Meccano.

Performance, of your product, of your service, of your site, is imperative to delivering what you sold people. Availability, uptime, scalability, and reliability matter by default now. Everyone has downtime, but a 4hr recovery time on your corporate domain isn’t just irresponsible and costly, it’s plain embarrassing…and transparent.

Why Is This Even Important?

My 7yr old is crazy into coding right now. Granted, we use a visual Code Block Editor mostly for lights and tones, but it’s a great way to introduce concepts like flow control and formal logic. As soon as she saw an example I built that used function blocks to encapsulate and reuse logic, she instantly understood and started refactoring her programs.

But finding the right project for varying stages of aptitude, appetite, and enjoyment is a real challenge. No thanks to marketing, unanticipated road-blocks like service and subscription dependencies are hard for consumers to factor in when purchasing. Even when you do find a right-fit project, bone-headed problems like website downtime can turn the whole thing into a negative experience for the child (or student).

It’s important for STEM product manufacturers and software vendors to really think about the impact of what they’re selling, how they’re delivering it, and how to support people who paid them money for something to accomplish a goal. If you don’t have the optimal consumer’s experience in mind, it will eventually cost you.

Need the Robot Software Updater?

I can’t archive everything on their site, and it’s their job to provide reliable content distribution, but in case you find yourself stuck like I was, here are links to at least the firmware updater tool:

Also, never ask a consumer if they want to choose (null).

And if the updater gives you flashbacks to DirectX drivers from 1997, don’t worry. It only looked like it bricked my robot for about 4mins before providing UI feedback.


Once you do get back to the modern era, a 98.6MB mobile app to control it shouldn’t be too hard on your data plan. They also need to know your GPS location, phone contacts, and file storage for some reason.

AllDayDevOps 2018: Progressive Testing to Meet the Performance Imperative

This is mostly an appendix of references and readings; whenever I can, I like to have a self-hosted post to link everything back to about a particular presentation.

My slides for the presy: https://docs.google.com/presentation/d/1OpniWRDgdbXSTqSs8g4ofXwRpE78RPjLmTkJX03o0Gg/edit?usp=sharing

Video stream: http://play.vidyard.com/hjBQebJBQCnnWjWKnSqr6C

A few thoughts from my journal today (spelling and grammar checks off):

Will update as the conversation unfolds.

Performance Is (Still) a Feature, Not a Test!

Since I presented the following perspective at APIStrat Chicago 2014, I’ve had many opportunities to clarify and deepen it within the context of Agile and DevOps development:

It’s more productive to view system performance as a feature than to view it as a set of tests you run occasionally.

The more teams I work with, the more I see that performance is a critical aspect of their products. But why is performance so important?

‘Fast’ Is a Subconscious User Expectation

Whether you’re building an API, an app, or whatever, its consumers (people, processes) don’t want to wait around. If your software is slow, it becomes a bottleneck to whatever real-world process it facilitates.

Your Facebook feed is a perfect example. If it is even marginally slower to scroll through today than it was yesterday, if it is glitchy, halty, or janky in any way, your experience turns from dopamine-inducing self-gratification to epinephrine-fueled thoughts of tossing your phone into the nearest body of water. Facebook engineers know this, which is why they build data centers to test and monitor mobile performance on a per-commit basis. For them, this isn’t a luxury; it’s a hard requirement, as it is for all of us, whether we choose to address it or not. Performance is everyone’s problem.

Performance is as critical to delighting people as delivering them features they like. This is why session abandonment rates are a key metric on Cyber Monday.

‘Slow’ Compounds Quickly

Performance is a measurement of availability over time, and time always marches forward. Performance is an aggregate of many dependent systems, and even just one slow link can cause an otherwise blazingly fast process to grind to a halt long enough for people to turn around and walk the other way.

Consider a mobile app, where performance is everything. The development team slaves over which list component scrolls faster and more smoothly, and spends hours getting asynchronous calls and spinners to provide the user critical feedback so that they don’t think the app has crashed. Then a single misbehaving REST call to some external web API suddenly slows by 50% and the whole user experience becomes untenable.

The performance of a system is only as strong as its weakest link. In technical terms, this is about risk. You at least need to know the risk introduced by each component of a system; only then can you choose how to mitigate that risk accordingly. ‘Risk’ is a huge theme in ISO 29119 and the upcoming IEEE 2675 draft I’m working on, and any seasoned architect knows why it matters.

Fitting Performance into Feature Work

Working on ‘performance’ and working on a feature shouldn’t be two separate things. Automotive designers don’t separate the two when they build car engines; performance is paramount throughout, even during assembly. Neither should it be separate in software development.

However, in practice, if you’ve never run a load test, tracked the power consumption of a subroutine, or analyzed aggregate results, it will be different from building stuff, for sure. Comfort and efficiency come with experience. A lack of experience or familiarity doesn’t remove the need for something critical to occur; it accelerates the need to ask how to get it done.

A reliable code pipeline and testing schedule make all the difference here. Many performance issues take time or dramatic conditions to expose, such as battery degradation, load balancing, and memory leaks. In these cases, it isn’t feasible to execute long-running performance tests for every code check-in.

What does this mean for code contributors? Since they are still responsible for meeting performance criteria, it means they can’t always press the ‘done’ button today. It means we need reliable delivery pipelines that pragmatically check the performance of the code pushed through them. As pressure to deliver value incrementally mounts, developers are taking responsibility for the build and deployment process through technologies like Docker, Jenkins Pipeline, and Puppet.

It also means that we need to adopt a testing schedule that meets the desired development cadence and real-world constraints on time and infrastructure:

  • Run small performance checks on all new work (new screens, endpoints, etc.)
  • Run local baselines and compare before individual contributors check in code (see the sketch after this list)
  • Schedule long-running (anything slower than 2mins) performance tests into a pipeline stage that runs in parallel after build verification
  • Schedule nightly performance regression checks on all critical-risk workflows (i.e. login, checkout, submit claim, etc.)
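As an example of the local-baseline idea above, here’s a minimal sketch that times a critical endpoint and fails when the median drifts past an agreed threshold (the endpoint, baseline file, and threshold are all hypothetical):

    import json
    import statistics
    import time
    import requests

    ENDPOINT = "https://staging.example.com/api/login"  # hypothetical critical workflow
    BASELINE_FILE = "perf_baseline.json"                # e.g. {"median_latency_s": 0.42}
    MAX_REGRESSION = 1.20                               # fail if median grows >20%

    def measure_median_latency(url, samples=10):
        """Take a handful of timings; the median resists outliers."""
        timings = []
        for _ in range(samples):
            start = time.monotonic()
            requests.get(url, timeout=10)
            timings.append(time.monotonic() - start)
        return statistics.median(timings)

    def check_against_baseline():
        median = measure_median_latency(ENDPOINT)
        with open(BASELINE_FILE) as f:
            baseline = json.load(f)["median_latency_s"]
        assert median <= baseline * MAX_REGRESSION, (
            f"performance regression: {median:.3f}s vs baseline {baseline:.3f}s")

A check like this runs in seconds, so a contributor can execute it before every check-in without breaking flow.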

How Do You Bake Performance Into Development?

While it’s perfectly fine to adopt patterns like ‘spike and stabilize’ in feature development, stabilization is a required payback of the technical debt you incur when your development spikes. To ‘stabilize’ isn’t just to make the code work; it’s to make it work well. That includes meeting performance (not just acceptance) criteria before the work is considered complete.

A great place to start making measurable performance improvements is to measure performance objectively. Every user story should contain solid performance criteria, just as it should contain acceptance criteria. In recent joint research, I found that higher-performing development teams include performance criteria on 50% more of their user stories.

In other words, embedding tangible performance expectations in your user stories bakes performance in to the resulting system.

There are a lot of sub-topics under the umbrella term “performance”. When we get down to brass tacks, measuring performance characteristics often boils down to three aspects: throughput, reliability, and scalability. I’m a huge fan of load testing because it helps to verify all three measurable aspects of performance.

Throughput: from a good load test, you can objectively track throughput metrics like transactions/sec, time-to-first-byte (and last byte), and distribution of resource usage (i.e. are all CPUs being used efficiently). These give you a raw and necessarily granular level of detail that can be monitored and visualized in stand-ups and deep-dives equally.

Reliability: load tests also exercise your code far more than you can independently. It takes exercise to expose if a process is unreliable; concurrency in a load test is like exercise on steroids. Load tests can act as your robot army, especially when infrastructure or configuration changes push you into unknown risk territory.

Scalability: often, scalability mechanisms like load balancing, dynamic provisioning, and network shaping throw unexpected curveballs into your user’s experience. Unless you are practicing a near-religious level of control over deployment of code, infrastructure, and configuration changes into production, you run the risk of affecting real users (i.e. your paycheck). Load tests are a great way to see what happens ahead of time.
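If you’ve never seen one, a load test needn’t be heavyweight. Here’s a minimal sketch using Locust, a Python load-testing tool (the endpoints and task weights are hypothetical), that exercises throughput, reliability, and scalability questions in one script:

    from locust import HttpUser, task, between

    class MobileApiUser(HttpUser):
        """Simulates one mobile client hitting the API under test."""
        wait_time = between(1, 3)  # think time between requests, in seconds

        @task(3)
        def browse_products(self):
            # Weighted 3x: browsing dominates real-world traffic.
            self.client.get("/api/products")

        @task(1)
        def login(self):
            self.client.post("/api/login", json={"user": "demo", "pass": "demo"})

    # Run with: locust -f loadtest.py --host https://staging.example.com

Ramp the simulated user count up and the same script reports throughput and error rates at each tier, which is exactly the evidence the three aspects above call for.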


Short, Iterative Load Testing Fits Development Cycles

I am currently working with a client to load test their APIs, simulating mobile-client bursts of traffic that represent real-world scenarios. After a few rounds of testing, we’ve resolved many obvious issues, such as:

  • Overly verbose logs that write to SQL and/or disk
  • Parameter formats that cause server-side parsing errors
  • Throughput restrictions against other 3rd-party APIs (Google, Apple)
  • Static data that doesn’t exercise the system sufficiently
  • Large images stored as SQL blobs with no caching

We’ve been able to work through most of these issues quickly in test/fail/fix/re-test cycles, where we conduct short all-hands sessions with a developer, a test engineer, and myself. After a quick review of significant changes since the last session (i.e. code, test, infrastructure, configuration), we use BlazeMeter to kick off a new API load test written in jMeter and monitor the server in real time. We’ve been able to rapidly resolve a few anticipated, backlogged issues, as well as learn about new problems that are likely to arise at future usage tiers.

The key here is to ‘anticipate iterative re-testing‘. Again I say: “performance is a feature, not a test”. It WILL require re-design and re-shaping as the code changes and system behaviors are better understood. It’s not a one-time thing to verify how a dynamic system behaves given a particular usage pattern.

From a business perspective, the outcome of this load testing is that the new system is perceived as far less of a risky venture, and more as the innovation investment needed to improve sales and the future of their digital strategy.

Performance really does matter to everyone. That’s why I’m available to chat with you about it any time. Ping me on Twitter and we’ll take it from there.