Time-Tested: 7 Ways to Improve Velocity When A/B Testing a New UX

(Originally posted on Indeed’s Medium blog here on August 23, 2019 and re-posted here with permission)

A/B testing holistic redesigns can be tough. At Indeed, we learned this firsthand when we tried to take our UX from this to this:

The Indeed mobile Search Engine Results Page (SERP) circa mid-2017 (left) and circa mid-2018 (right)

Things didn’t go so hot. We’d spent months and months coming up with a beautiful design vision, and then months and months trying to test all of these changes all at once. We had a bunch of metrics move (mostly down) and it was super confusing because we couldn’t figure out what UI changes caused what effects.

So, we took a new approach. In the middle of 2018, we founded the Job Search UI Lab, a cross-functional team with one goal: to scientifically test as many individual UI elements as we could to understand the levers on our job search experience. In just the last 12 months, our team ran over 52 tests with over 502 groups. We’ve since used our learnings to successfully overhaul the Job Search UX on both desktop and mobile browsers.

In this blog post, we share some of the A/B test accelerating approaches we incorporated in the JSUI Lab — approaches that garnered us the 2018 Indeed Engineering Innovation Award. Whether you’re interested in doing a UX overhaul or just trying out a new feature, we think you can incorporate each of these tips into your A/B testing, too!

#1: Have a healthy backlog

No one should ever be waiting around to start development on a new test. Having a healthy backlog of prioritized A/B tests for developers helps you roll out A/B tests one after the other.

One way that the JSUI Lab creates a backlog is by gathering all of our teammates — regardless of their role — at the beginning of each quarter to brainstorm tests. We pull up the current UX on mobile and desktop and ask questions about how each element or feature works. Each question or idea ends up on its own sticky note. We end up with over 40 test ideas, which we then prioritize based off of how each test might address a job seeker pain point or improve the design system while minimizing effort. And while we may not get to every test in our backlog, we never have to worry about not having tests lined up.

#2: Write down hypotheses ahead of time

In the hustle and bustle of product development, sometimes experimenters don’t take the time to specify their hypotheses for a given A/B test ahead of time. Sure, not writing hypotheses may save you 10–30 minutes up front. But this can come back to bite you once the test is completed, when your team is looking at dozens of metrics and trying to make a decision about what to do next.

Not only is it confusing to see some metrics go up while others go down, chances are you’re probably also seeing some false positives (also known as Type I error) the more metrics you look at. You may even catch yourself looking at metrics that wouldn’t feasibly be affected by your test (e.g., “How does changing this UI element from orange to yellow affect whether or not a job seeker gets a call back from an employer?!”).

So do yourself a solid. Pick 3–4 metrics that your test could reasonably be expected to move, conduct a power analysis for each one, and write down your hypotheses for the test ahead of time.

#3: Test UI elements one at a time

Indeed’s mobile SERP circa mid-2019

This one’s a little counterintuitive. It might seem like it would increase the amount of time it would take to do a UX holistic redesign by testing each and every UI element separately. But by testing elements one at a time, the conclusions about our tests were more sound. Why? Because we could more clearly establish causality. Consequently, we were able to take all of the learnings from our tests and roll them into one big test that we were fairly confident would perform well. Rather than see metrics tank like the first time we did a holistic design test, we actually saw some of Indeed’s biggest user engagement wins for 2018, in less than half the time of the first attempt.

By running tests on UI elements one at a time, we were able to iterate on our design vision in a data-driven way and set up our holistic test for success. So, what do these tests look like in practice? Below are a few examples of some of the groups we ran. You’ll notice that the only real difference between the treatments is a minor change, like font size or spacing.

#4: Consider multivariate tests

Multivariate tests (sometimes referred to as “factorial tests”) test all possible combinations of each of the factors of interest in your A/B test. So, in a way, they’re more like an A/B/C/D/E/… test! What’s cool about multivariate tests is that you end up with winning combinations that you would have missed had you tested each factor one at a time.

An example from the JSUI Lab illustrates this benefit. We knew from UX research that our job seekers really cared about salary when making the decision to learn more about a job. In 2018, this was how we displayed salary on each result:

We wanted to see if increasing the visual prominence using color, font size, and bolding would increase job seeker engagement with search results. So, we developed four font size variants, four color variants, and two variants that were bolded or unbolded. We ended up with 4x4x2 groups for 32 total groups including control.

While multivariate tests can speed up how you draw conclusions about different UI elements, they’re not without their drawbacks. First and foremost, you’ll need to weigh the tradeoffs to statistical power, or the likelihood that you’ll detect a given effect if one actually exists (also known as Type II error). Without sufficient statistical power, you risk not detecting an effect of your test if there is one.

The closed form calculation is a power test between two proportions

Power calculations are a closed-form equation that require your product team to make tradeoffs between your α-level and β-level of choice, your sample size (n), and the effect size you care about your treatment having (p1). On Indeed, we have the benefit of having over 220M+ unique users each month. That level of traffic may not be available to you and your team. So, to have sufficient statistical power, you’ll potentially need to run your experiment for longer, run groups at higher allocations, cut some groups, or be willing to introduce more Type I error, depending on how small of an effect you’d like to confidently detect.

With a typical A/B test, it’s usually relatively straightforward to analyze with a t-test. Multivariate tests, however, will benefit from multivariate regression models, which will allow you to suss out the effects of particular variables and their interaction effects. Here’s a simplified regression equation:

And an example of a regression equation for one of the tests we ran that modified both font size and the spacing on the job card:

Another caveat of multivariate tests is that they can quickly become infeasible. If we had 10 factors with 2 levels each, we’d have a 2^10 multivariate test, with a whopping 1,024 test groups. In cases like these, running what’s called a fractional factorial experiment might make more sense.

Finally, multivariate tests may sometimes yield some zany combinations. In our salary example above, our UX Design team was mildly mortified when we introduced the salary variant with 16pt, green, and bolded font. We lovingly referred to this variant as “the Hulk.” In some cases, it may not be feasible to run a variant due to accessibility concerns. In the JSUI Lab, we determine on a case-by-case basis whether the tradeoff of statistical rigor is worth a temporarily poor user experience.

#5: Deploy CSS and JavaScript changes differently

Sometimes a typical deploy cycle can get in the way of testing new features quickly. At Indeed, we developed a tool called CrashTest, which allows us to sidestep the deploy cycle. CrashTest relies on a separate code base of CSS and JavaScript files that are injected into “hooks” in our main code base. While installing CrashTest hooks follows the standard deploy, once hooks are set up, we can inject new CSS and JavaScript treatments and see the changes reflected in our product in just a few minutes.

In the JSUI Lab, we rely on our design technologist Christina to quickly develop CSS and JavaScript treatments for dozens of groups at a time. With CrashTest, Christina can develop her features and get them QAed by Cory. We can push them into production that same day using our open source experimentation management platform Proctor. Had we relied on the typical deploy cycle, it would have taken Christina’s work several more days to be seen by job seekers, and that much more time until we had results from our A/B tests.

#6: Have a democratized experimentation platform

Combing through logs and tables to figure out how your tests performed is not the best use of time. Instead, consider building or buying an experimentation platform for your team. As a data-driven company, Indeed has an internal tool for this called TestStats. The tool displays how each test group performed on key metrics and whether the test has enough statistical power to draw meaningful conclusions at the predetermined effect size. This makes it easy to share and discuss results with others.

#7: Level up everyone’s skills through cross-training

On the JSUI team, we firmly believe that allowing everyone to contribute to team decisions equally helps our team function better. Our teammates include product managers, UX designers, QA engineers, data scientists, program managers, and design technologists. Each of us brings a unique background to the team. Teaching each other the skills we use in our day-to-day jobs helps increase velocity for our A/B tests because we’re able to talk one another’s language more readily.

For instance, I’m a product scientist, and led a training on A/B testing. This allowed all of the other members of JSUI Lab to feel more empowered to make test design decisions without my direct guidance every time. Our UX designer Katie shadowed our product managers CJ and Kevin as they turned on tests. Katie now turns on tests herself. Not only does this kind of cross-training reduce the “bus factor” on your team, it can also be a great way of helping your teammates master their subject and improve their confidence in their own expertise.

Now it’s time to test!

Whether you take only one or two tips or all seven, they can be a great way of improving your velocity when running A/B tests. The Job Search UI Lab has already started sharing these simple steps with other teams at Indeed. We think they’re more broadly applicable to other companies and hope you’ll give them a try, too.

Special thanks to my fellow Job Search UI Lab teammates Cory Chambers, Rachel Gilbert, Katie Hicks, CJ Hong, Christina Jeeves, Kevin Jiang, and Jonathan Peterson. Y’all rock!