Line coverage is not branch coverage.

Tags: Python, Software Engineering, Testing

I’ve spent the past couple of weeks using line coverage to seek out weakly tested sections of code in a large codebase I’m working on. While it has been quick at identifying areas with room for improvement, I think it’s worth highlighting that line coverage isn’t always the best indicator of the quality of your tests.

Take this highly contrived example…

Hypothetical big bank requirement

We’re a new developer and we’ve been tasked with turning a new product requirement on a mortgage service into reality: checking whether a mortgage is deemed affordable by some regulatory checks.

Depending upon the size of the mortgage and the user there is some variation on these checks, so the calculation needs to be configurable. The requirements are:

  1. Any mortgage under £20k is not subject to regulatory constraints, so all are approved.
  2. Existing customers will already have passed an earlier ‘affordability check’ so are approved.
  3. Customers where the mortgage is lower than a given multiple of their salary are approved. This multiple tends to be 4 to 5.

Anyone not meeting these requirements is rejected.

You quickly write some code that handles this in a nice concise OR statement covering all the bases.

# src/demo.py
def validate_affordability(
    mortgage: int,
    salary: int,
    affordability: float,
    existing_customer: bool,
) -> bool:
    """Check whether a proposed mortgage is afforadble.
    
    Args:
        mortgage: total size of the mortagage to be taken out
        salary: total salary of the individual
        affordability: expected minimum ratio of mortagage to salary.
        existing_customer: are they an existing customer?

    Returns:
        bool: True for a successful affordability test or False for failure.
    """
    if existing_customer or mortgage < 20000 or (mortgage / salary) <= affordability:
        return True

    return False

Now you want to be a good developer, so you write some tests. Tests are required for all code to be checked in, and 100% coverage is enforced as we have strict code quality requirements.
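
As an aside, that sort of gate is usually wired up through coverage.py itself. A minimal sketch of what it could look like, assuming a pyproject.toml based configuration rather than whatever the real project uses (the equivalent pytest-cov flag is --cov-fail-under=100):

# pyproject.toml (hypothetical, not part of the repo in this post)
[tool.coverage.report]
fail_under = 100

Anyway, on to the tests.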

# tests/test_demo.py
from src.demo import validate_affordability


def test_afforadbility_success_standard() -> None:
    validate_affordability(100000, 30000, 4.5, existing_customer=False)


def test_afforadbility_failure_low_income() -> None:
    validate_affordability(100000, 18000, 4.5, existing_customer=False)

Now to run these…

% pytest tests --cov=src --cov-report=term-missing
==================================== test session starts ====================================
platform darwin -- Python 3.11.4, pytest-8.0.0, pluggy-1.4.0
rootdir: /Users/spegler/Repos/lines
plugins: cov-4.1.0, hypothesis-6.97.4
collected 2 items

tests/test_demo.py ..                                                                                                                                                                                [100%]

---------- coverage: platform darwin, python 3.11.4-final-0 ----------
Name              Stmts   Miss  Cover   Missing
-----------------------------------------------
src/__init__.py       0      0   100%
src/demo.py           4      0   100%
-----------------------------------------------
TOTAL                 4      0   100%


==================================== 2 passed in 0.21s ====================================

Now look at that, your feature has 100% coverage and is ready to be merged.

So back in the real world…

I think we can all tell that these tests are hot garbage that don’t actually test anything at all. I’d hope that in the real world tests like these would never make it into a codebase. That doesn’t mean, however, that we won’t commit good tests covering other behaviour that just so happen to exercise these lines and mark them as covered. This is so often the case: you add a regression test that covers some other behaviour and satisfies the line coverage, but doesn’t actually test the function.

Given this, maybe we can do better. At a minimum we should be checking all of the branches of our code, though with or conditions and switch statements the number of branches can quickly get out of hand.
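
Worth noting that coverage.py can measure branch coverage as well as line coverage; with pytest-cov that’s the --cov-branch flag, so the run would look something like this (output omitted). Bear in mind it only checks that each if went both ways, not every combination of the or conditions.

% pytest tests --cov=src --cov-branch --cov-report=term-missing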

Going back to our test code, in my world I’d probably have written this using TDD, writing a failing test for each outcome and layering on the logic. In this case, where we’ve already got some code, it’s much harder to know whether you’ve got all the branches covered.

Eyeballing our cases I come out with these tests:

# tests/test_demo.py
import pytest
from src.demo import validate_affordability


@pytest.mark.parametrize(
    "mortgage,income,affordability,outcome",
    [
        (10000, 0, 4.5, True),        # covers low mortgage size
        (100000, 30000, 4.5, True),   # covers standard expected outcome
        (100000, 30000, 4.0, True),   # covers lower multiplier
        (100000, 20000, 4.0, False),  # covers rejection
    ],
)
def test_afforadbility_success_standard(
    mortgage: int, income: int, affordability: float, outcome: bool
) -> None:
    check = validate_affordability(
        mortgage, income, affordability, existing_customer=False
    )
    assert check is outcome

All of these tests run with existing_customer=False. We should probably add another test that covers the existing customer path:

# tests/test_demo.py
def test_afforadbility_success_existing_customer() -> None:
    check = validate_affordability(
        10000, 1000, 4.5, existing_customer=True
    )
    assert check is True

So we add that as well, and we’re now covering a test case for every branch of our logic. We’re in a pretty good state, and for many projects this might be enough.

Running our tests, we’re still at 100% coverage, and now we can say that we’ve covered all the code branches.

% pytest tests --cov=src --cov-report=term-missing
==================================== test session starts ====================================
platform darwin -- Python 3.11.4, pytest-8.0.0, pluggy-1.4.0
rootdir: /Users/spegler/Repos/lines
plugins: cov-4.1.0, hypothesis-6.97.4
collected 5 items

tests/test_demo.py .....                                                                                                                                                                             [100%]

---------- coverage: platform darwin, python 3.11.4-final-0 ----------
Name              Stmts   Miss  Cover   Missing
-----------------------------------------------
src/__init__.py       0      0   100%
src/demo.py           4      0   100%
-----------------------------------------------
TOTAL                 4      0   100%


==================================== 5 passed in 0.15s ====================================

But have we?

Mutation testing

If you’re looking for a tool to test your tests, mutation testing is what you’re after. Mutation testing goes through your codebase, makes small changes to the code, and reruns your tests. If your coverage is good then, theoretically, each of these mutants should cause your test suite to fail. A lower number of surviving mutants normally equates to better tests.

The mutmut package provides a super quick way to get started with this in Python. You set up the target files to mutate and give it a test command to run. The tool then works through your code making small individual changes, re-running the tests after each change to check whether the mutation is caught. An example of a mutation might be changing a return True in a function to return False.
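
For reference, the target files and test command normally live in setup.cfg. A rough sketch of what that configuration might look like for this layout (key names per mutmut 2.x; check the docs for your version):

# setup.cfg (illustrative)
[mutmut]
paths_to_mutate=src/
tests_dir=tests/
runner=python -m pytest -x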

Mutmut’s full list of mutations is available here.

# We don't care about string or decorator mutations.
% mutmut run --disable-mutation-types=string,decorator

- Mutation testing starting -

These are the steps:
1. A full test suite run will be made to make sure we
   can run the tests successfully and we know how long
   it takes (to detect infinite loops for example)
2. Mutants will be generated and checked

Results are stored in .mutmut-cache.
Print found mutants with `mutmut results`.

Legend for output:
🎉 Killed mutants.   The goal is for everything to end up in this bucket.
⏰ Timeout.          Test suite took 10 times as long as the baseline so were killed.
🤔 Suspicious.       Tests took a long time, but not long enough to be fatal.
🙁 Survived.         This means your tests need to be expanded.
🔇 Skipped.          Skipped.

mutmut cache is out of date, clearing it...
1. Running tests without mutations
⠏ Running...Done

2. Checking mutants
⠏ 7/7  🎉 3  ⏰ 0  🤔 0  🙁 4  🔇 0

To see the resulting mutations you can use mutmut show.

# Show available mutations
% mutmut show
To apply a mutant on disk:
    mutmut apply <id>

To show a mutant:
    mutmut show <id>


Survived 🙁 (5)

---- src/demo.py (5) ----

2-4, 6, 9

# And show a specific mutation
% mutmut show 2
--- src/demo.py
+++ src/demo.py
@@ -15,7 +15,7 @@
     Returns:
         bool: True for a successful affordability test or False for failure.
     """
-    if existing_customer or mortgage < 20000 or (mortgage / salary) <= affordability:
+    if existing_customer or mortgage < 20001 or (mortgage / salary) <= affordability:
         return True

     return False

So our tests have run and we’ve got 9 total paths that can be mutated, and we’re catching 4 of them. Not great, but not terrible. We can improve…

Property tests

Property testing is a software testing methodology where you define the inputs, their types, and the properties that must hold true for a function. The test runner then generates a wide range of inputs and, when it finds a failure, shrinks it down to a narrow example. We can use this to write high quality tests that check our expected behaviour holds across a large number of inputs.

Property based testing is very good at surfacing the edge cases and boundary values you would never think to write by hand.

For Python, by far the most popular property testing framework is Hypothesis, which is what we will be using in the examples here.

So, a simple implementation of a property test would be:

# tests/test_demo.py
from hypothesis import given, strategies as st


@given(st.integers(), st.integers(), st.floats())
def test_afforadbility_success_property_test(
    mortgage: int, salary: int, affordability: float
):
    outcome = validate_affordability(
        mortgage, salary, affordability, existing_customer=False
    )
    assert outcome is (mortgage / salary < affordability)

In the above we declare the inputs using strategies, with a specific strategy per input type. We implement the expected output behaviour in the assert, which is then exercised through a neat hypothesis pytest plugin. It’s worth noting that the more cardinality you have in your inputs, the longer your tests are going to take. With that in mind we’re only targeting new customers; existing customers are always approved, so we can skip testing those inputs.

So, if we run our new test we get…

% pytest tests
==================================== test session starts ====================================
platform darwin -- Python 3.11.4, pytest-8.0.0, pluggy-1.4.0
rootdir: /Users/spegler/Repos/lines
plugins: hypothesis-6.98.6, cov-4.1.0
collected 6 items

tests/test_demo.py .....F                                                                                  [100%]

==================================== FAILURES ====================================
____________________________________ test_afforadbility_success_property_test ____________________________________
  + Exception Group Traceback (most recent call last):
    ...
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 3 distinct failures. (3 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/Users/spegler/Repos/lines/tests/test_demo.py", line 35, in test_afforadbility_success_property_test
    |     outcome = validate_affordability(
    |               ^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/Users/spegler/Repos/lines/src/demo.py", line 8, in validate_affordability
    |     if existing_customer or mortgage < 20000 or (mortgage / salary) <= affordability:
    |                                                   ~~~~~~~~~~^~~~~~~~
    | ZeroDivisionError: division by zero
    | Falsifying example: test_afforadbility_success_property_test(
    |     # The test sometimes passed when commented parts were varied together.
    |     mortgage=20000,
    |     salary=0,  # or any other generated value
    |     affordability=0.0,  # or any other generated value
    | )
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/Users/spegler/Repos/lines/tests/test_demo.py", line 38, in test_afforadbility_success_property_test
    |     assert outcome is True if mortgage / salary > affordability else False
    | AssertionError: assert False
    | Falsifying example: test_afforadbility_success_property_test(
    |     mortgage=0,
    |     salary=1,  # or any other generated value
    |     affordability=0.0,
    | )
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/Users/spegler/Repos/lines/tests/test_demo.py", line 38, in test_afforadbility_success_property_test
    |     assert outcome is (mortgage / salary < affordability)
    | ZeroDivisionError: division by zero
    | Falsifying example: test_afforadbility_success_property_test(
    |     # The test sometimes passed when commented parts were varied together.
    |     mortgage=0,  # or any other generated value
    |     salary=0,  # or any other generated value
    |     affordability=0.0,  # or any other generated value
    | )
    +------------------------------------
============================================ short test summary info =============================================
FAILED tests/test_demo.py::test_afforadbility_success_property_test - ExceptionGroup: Hypothesis found 3 distinct failures. (3 sub-exceptions)

So hypothesis has come up with some variables and tested our code, and it’s not looking good for us: we’ve missed a couple of things. First things first, we’ve got two ZeroDivisionErrors. One of these is in the test and one of them has found a bug in our code: if a customer has no salary we can’t check their affordability. In this case we should probably do two things. Firstly, check with product and see if they want to reject these customers; let’s say they do.
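
Rejecting them means an early return in validate_affordability before we ever touch the division. The function ends up looking something like this (the new guard is exactly what the mutation diff further down shows):

# src/demo.py
def validate_affordability(
    mortgage: int,
    salary: int,
    affordability: float,
    existing_customer: bool,
) -> bool:
    """Check whether a proposed mortgage is affordable."""
    # A customer with no salary can never pass the affordability check.
    if not salary:
        return False

    if existing_customer or mortgage < 20000 or (mortgage / salary) <= affordability:
        return True

    return False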

Secondly, we should tweak our test to catch those zero salary cases and always assert False for them. While we’re here we should also set some min and max values for our affordability bounds; we don’t want them to be less than 1 or higher than 10. This makes it a bit easier for hypothesis to generate data that is reflective of our real cases.

We tweak our tests as follows:

# tests/test_demo.py
@given(
    st.integers(min_value=5000),
    st.integers(min_value=0),
    st.floats(min_value=1, max_value=10),
)
def test_afforadbility_success_property_test(
    mortgage: int, salary: int, affordability: float
):
    outcome = validate_affordability(
        mortgage, salary, affordability, existing_customer=False
    )
    if not salary:
        assert outcome is False
    else:
        assert outcome is (mortgage / salary < affordability)

While this works, quite a few mutants now come back as suspicious when running the mutation tests. Our tests are taking longer, most likely because hypothesis passes 0 as the salary in a decent proportion of runs, so many of the generated cases add little value. What we can do here is pin a single example with a salary of 0 and then raise the minimum salary above zero.

# tests/test_demo.py
from hypothesis import example

...
@example(mortgage=25000, salary=0, affordability=1)
@given(
    st.integers(min_value=5000),
    st.integers(min_value=1),
    st.floats(min_value=1, max_value=10),
)
...

This works better and cuts out the time hypothesis takes to close in on failing values. We still have one thing left to fix, however: our assertion doesn’t cover low value loans where the mortgage is under 20000. Let’s fix that and migrate the parametrised examples from our earlier pytest test into this one.

# tests/test_demo.py
@example(mortgage=25000, salary=0, affordability=1)
@example(mortgage=19999, salary=1, affordability=2)
@example(mortgage=20000, salary=1, affordability=2)
@given(
    st.integers(min_value=5000),
    st.integers(min_value=1),
    st.floats(min_value=1, max_value=10),
)
def test_afforadbility_success_property_test(
    mortgage: int, salary: int, affordability: float
):
    outcome = validate_affordability(
        mortgage, salary, affordability, existing_customer=False
    )
    if not salary:
        assert outcome is False
    elif mortgage < 20000:
        assert outcome is True
    else:
        assert outcome is (mortgage / salary < affordability)

Running our mutation tests again gives us almost 100% coverage; just one last case isn’t covered:

% mutmut show 7
--- src/demo.py
+++ src/demo.py
@@ -8,7 +8,7 @@
     if not salary:
         return False

-    if existing_customer or mortgage < 20000 or (mortgage / salary) <= affordability:
+    if existing_customer or mortgage < 20000 or (mortgage / salary) < affordability:
         return True

     return False

One more predefined example and we should be there…

# tests/test_demo.py
@example(mortgage=25000, salary=0, affordability=1)
@example(mortgage=20000, salary=1, affordability=2)
@example(mortgage=100000, salary=20000, affordability=5)
@example(mortgage=19999, salary=1, affordability=2)
@given(
    st.integers(min_value=5000),
    st.integers(min_value=1),
    st.floats(min_value=1, max_value=10),
)
def test_afforadbility_success_property_test(
    mortgage: int, salary: int, affordability: float
):
    outcome = validate_affordability(
        mortgage, salary, affordability, existing_customer=False
    )
    if not salary:
        assert outcome is False
    elif mortgage < 20000:
        assert outcome is True
    else:
        assert outcome is (mortgage / salary <= affordability)

Boom, rerunning the mutation tests kills every mutant:

% mutmut run --disable-mutation-types=string,decorator

- Mutation testing starting -

These are the steps:
1. A full test suite run will be made to make sure we
   can run the tests successfully and we know how long
   it takes (to detect infinite loops for example)
2. Mutants will be generated and checked

Results are stored in .mutmut-cache.
Print found mutants with `mutmut results`.

Legend for output:
🎉 Killed mutants.   The goal is for everything to end up in this bucket.
⏰ Timeout.          Test suite took 10 times as long as the baseline so were killed.
🤔 Suspicious.       Tests took a long time, but not long enough to be fatal.
🙁 Survived.         This means your tests need to be expanded.
🔇 Skipped.          Skipped.

1. Running tests without mutations
⠏ Running...Done

2. Checking mutants
⠋ 9/9  🎉 9  ⏰ 0  🤔 0  🙁 0  🔇 0

So what have we learned?

While this entire exercise is incredibly contrived and forced, I do think there are some good takeaways for the real world:

  1. 100% line coverage only tells you that your code ran under a test, not that its behaviour was actually asserted on.
  2. Mutation testing (here via mutmut) is a cheap way to test your tests and spot assertions that aren’t pulling their weight.
  3. Property based testing (here via Hypothesis) will dig out edge cases, like a zero salary, that you’d probably never write as hand-picked examples.

The full repo is available here.