Testing Counterfactual Generation Algorithms

Jul 13, 2021

This article is based on the work from the repository: https://github.com/ADMAntwerp/CounterfactualBenchmark and from the following preprint article: https://arxiv.org/abs/2107.04680

eXplainable Artificial Intelligence (XAI) is highly important today because of the widespread use of complex models, such as deep neural networks, that cannot be easily interpreted.
One of the explainability methodologies that has gained popularity over time is counterfactual (CF) explanations. They give an individual and personalized explanation for each point, indicating an alternate feature state in which the prediction class changes.

Several methodologies have since been created to generate counterfactual explanations. However, one of the biggest challenges is testing such algorithms under different conditions and datasets.

To solve that, I present the Universal Counterfactual Benchmark Framework, which is a tool that can be used to easily test and debug your counterfactual generation algorithm.

What’s the tool’s purpose?

Test counterfactual generation algorithms (for neural networks or model agnostic) under different models and datasets.

What are the requirements?

Ubuntu 18.04
Anaconda 2020 or later

Step 0: Creating a simple counterfactual generator

If you have your own CF generation algorithm, you might want to skip this part.

Let’s first create a simple counterfactual generation algorithm.

Simple CF generation algorithm

The algorithm above follows a very simple logic, shown in the schema below:

Logic used by the algorithm above to generate a counterfactual explanation. On the left is the original factual input [1,0,1]. In the first loop, the algorithm generates the possible CFs and verifies whether any of them changed the output classification (≥0.5). If not, it keeps the candidate with the best improvement (shown in orange) and starts a new round of CF generation, repeating until it generates one that flips the original classification (shown in purple).
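As a rough illustration of the logic just described, here is a minimal Python sketch of a greedy bit-flipping generator. It is not the repository's code: the function name `greedy_flip_cf`, the `predict` callable (assumed to return the probability of class 1 for a list of binary features), the `max_rounds` parameter, and the toy model are all assumptions made for this example.

```python
def greedy_flip_cf(factual, predict, max_rounds=100):
    """Greedy search over single-feature flips of a binary input (illustrative sketch).

    factual: list of 0/1 values.
    predict: hypothetical callable returning the probability of class 1.
    Returns a list that flips the class, or None if the search gets stuck.
    """
    current = list(factual)
    factual_is_positive = predict(current) >= 0.5

    def progress(score):
        # Higher is better: movement towards the opposite class.
        return -score if factual_is_positive else score

    current_progress = progress(predict(current))
    for _ in range(max_rounds):
        best_candidate, best_progress = None, None
        for i in range(len(current)):
            candidate = current.copy()
            candidate[i] = 1 - candidate[i]  # flip one binary feature
            score = predict(candidate)
            if (score >= 0.5) != factual_is_positive:
                return candidate  # classification flipped: counterfactual found
            if best_progress is None or progress(score) > best_progress:
                best_candidate, best_progress = candidate, progress(score)
        if best_candidate is None or best_progress <= current_progress:
            return None  # no flip improves the score, so stop
        current, current_progress = best_candidate, best_progress
    return None


def toy_predict(x):
    # Toy model (an assumption for this example), not a trained network.
    return 0.1 + 0.3 * (1 - x[0]) + 0.2 * x[1]


print(greedy_flip_cf([1, 0, 1], toy_predict))  # [0, 1, 1]
```

With this toy model, no single flip crosses 0.5 in the first round, so the search keeps the best candidate ([0, 0, 1]) and finds the counterfactual in the second round.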

Although this counterfactual (CF) generator looks fine, it has a major flaw: it fails when dealing with continuous numerical data. To test that, however, one would need to get a dataset, preprocess it, train a model, and configure it to run with the CF generator. With the Universal Counterfactual Benchmark Framework, this effort is greatly reduced.

Step 1: Clone Universal Counterfactual Benchmark Framework repository

This is a simple step: you only need to clone the repo:

git clone https://github.com/ADMAntwerp/CounterfactualBenchmark.git

Then, move to the repository folder:

cd CounterfactualBenchmark/

Step 2: Create a virtual environment

If you use an IDE like PyCharm, you may proceed differently.

With Anaconda, create a virtual environment with Python 3.8:

conda create --name CFBenchTest python=3.8 -y

Then, activate it:

conda activate CFBenchTest

Step 3: Install required packages

Now install the packages required to run the counterfactual benchmark framework. This can be done with:

pip install -r requirements.txt

Step 4: Test mocked CF generator

The file simple_test.py already has one mocked CF generator that simply takes the factual instance and returns the same factual instance as counterfactual.

This is important because you can run it before adding your algorithm to check for potential errors.

To do that, simply run the simple_test.py file in a terminal or your preferred IDE:

python3 simple_test.py

If successful, it will test several factual cases and return several results like:

CF GENERATION: DATA - BalanceScale / C - 0 /ROW - 0
Failed counterfactual!
Factual class:3.676223957410976e-14
CF class:3.676223957410976e-14
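The "Failed counterfactual!" message is expected here, since the mocked generator returns the factual instance unchanged and the two reported scores are identical. The check below is only a sketch of that idea, assuming a 0.5 decision threshold on a single output; the function name `flipped_class` is hypothetical and the framework's own validation may differ.

```python
def flipped_class(factual_score, cf_score, threshold=0.5):
    """Return True when the two scores fall on opposite sides of the threshold."""
    return (factual_score >= threshold) != (cf_score >= threshold)


# Scores taken from the sample output above
factual_score = 3.676223957410976e-14
cf_score = 3.676223957410976e-14
print(flipped_class(factual_score, cf_score))  # False
```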

You don’t need to wait for the entire benchmark to finish; press Ctrl+C to stop it.

Step 5: Include CF Generation algorithm

Now it’s time to include our (or your) counterfactual generation algorithm to be tested.

In the same file, simple_test.py, there are 6 fields (identified by the comment !!CHANGE BELOW!!) that may be changed:

1 — Framework name

# !!CHANGE BELOW!!: Add the name of your CF generation framework
framework_name = 'NAME_OF_FRAMEWORK'

There, simply replace NAME_OF_FRAMEWORK with your framework name. This will help you find your generated data in the final results.

2 — Dataset types to be tested

# !!CHANGE BELOW!!:(optional) Indicate the types of dataset you want to test, below, all dataset types are tested
test_ds_types = ['categorical', 'numerical', 'mixed']

In this part, you define which kinds of data you want to test. In this case, we want to test everything, so we will keep it that way.

3 — Number of output classes

# !!CHANGE BELOW!!:(optional) Add the output number of classes for the neuronal network (it can be 1 or 2)
# If 1 it will return only one binary probability result (from 0.0 to 1.0)
# If 2 it will return two probabilities, where the first is the factual class and second the counterfactual
output_number = 1

There, you define the number of outputs required by your generator. If your generator works with one output (one number from 0.0 to 1.0 representing the probability of being classified as 1), you should indicate 1 as above. If you need two outputs (the first number representing the probability of class 0, the second the probability of class 1), you should change it to 2.
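The relationship between the two output formats described above can be sketched as follows. This is only an illustration for a binary classifier: the helper names `to_two_outputs` and `to_one_output` are hypothetical and are not part of the framework.

```python
def to_two_outputs(p_class_1):
    """Expand a single class-1 probability into [P(class 0), P(class 1)]."""
    return [1.0 - p_class_1, p_class_1]


def to_one_output(probabilities):
    """Collapse [P(class 0), P(class 1)] into the single class-1 probability."""
    return probabilities[1]


print(to_two_outputs(0.25))        # [0.75, 0.25]
print(to_one_output([0.75, 0.25]))  # 0.25
```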

4 — Setup CF generator

# !!CHANGE BELOW!!: Below there's a dummy function to represent a CF generator, it just takes the factual
# instance and returns the same factual instance without any modification
my_cf_generator = lambda factual_instance: factual_instance

In this part, you must include everything needed to set up your counterfactual generator. If you are testing the algorithm suggested in Step 0, replace my_cf_generator with the suggested function.

Note that the function in which the code above is located (framework_tester) has several features, along with their descriptions, that your CF generator might use. Please refer to the function docstring for more information about each available feature.

5 — Add script to run the CF generator

# Here you generate the counterfactual
# !!CHANGE BELOW!!: Here you use your function to generate the counterfactual
cf_generation_output = my_cf_generator(factual_oh)

Now you must include the code to generate the CF using your (or the suggested) algorithm.

If you are using the suggested algorithm, replace that line with:

cf_generation_output = simple_cf_generator(factual_oh, adapted_nn, tolerance=100)

6 — Extract the CF result as a list

# After that, you can adapt the counterfactual generation result to be in the required format (a simple list)
# !!CHANGE BELOW!!: Here you make additional preprocessing on your result to return a simple list
# (one hot encoded if categorical features are present) with the proposed counterfactual
cf = cf_generation_output

Next, you must extract your CF result. There are two important rules for this extraction:

The resulting CF must include the one-hot encoded features if categorical features are present. For example, if the factual has 2 categorical columns that generate, in total, 10 one-hot encoded features, your result must include all 10 of these features.
The result must be a Python list (type(cf)==list).

If you are using the suggested algorithm, replace that line with:

cf = list(cf_generation_output[0])
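The two extraction rules above can be checked with a small helper before handing the result back to the framework. This is a sketch under assumptions: `as_cf_list` is a hypothetical name, and it assumes the generator returns either a flat sequence of feature values or a batch whose first row is the proposed counterfactual.

```python
def as_cf_list(output, n_features):
    """Return the proposed CF as a plain Python list with n_features entries."""
    first = output[0]
    # If the first element is itself a sequence, treat the output as a batch
    # and keep its first row; otherwise the output is already a flat row.
    row = first if hasattr(first, "__len__") else output
    cf = list(row)
    if len(cf) != n_features:
        raise ValueError(
            f"expected {n_features} features (one-hot columns included), got {len(cf)}"
        )
    return cf


factual_oh = [1, 0, 1]           # stand-in for a one-hot encoded factual
generator_output = [(0, 1, 1)]   # stand-in for a batch-shaped generator output
cf = as_cf_list(generator_output, n_features=len(factual_oh))
print(type(cf) == list, cf)      # True [0, 1, 1]
```

Comparing the length against the factual catches the common mistake of returning the categorical columns in their original, non-encoded form.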

Step 6: Run CF tester

Now, run the simple_test.py script in your terminal or preferred IDE. If all dataset types are included, it will test all 22 datasets with 20 factual points each, generating a total of 440 CF solutions.

python3 simple_test.py

If you used the suggested algorithm, it will fail on the first numerical dataset it tests (the BCW dataset). This is intentional: it shows how this framework can help developers test their algorithms and find possible failures.

Step 7: Evaluate results

The results are stored in the folder benchmark_results/results where they are organized as:

OriginalClass_FactualIndex_DatasetName_FrameworkName.csv => CF generated by the algorithm
TIME_OriginalClass_FactualIndex_DatasetName_FrameworkName.csv => Time taken by the algorithm to generate the CF
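To collect these files programmatically, the naming pattern above can be split back into its parts. The parser below is only a sketch: `parse_result_name` is a hypothetical helper, and it assumes the framework name contains no underscore (otherwise the split between dataset and framework name would be ambiguous).

```python
from pathlib import Path


def parse_result_name(filename):
    """Split a result file name into its fields (illustrative sketch)."""
    stem = Path(filename).stem  # drop the .csv extension
    is_time_file = stem.startswith("TIME_")
    if is_time_file:
        stem = stem[len("TIME_"):]
    # The first two fields are the original class and the factual index
    original_class, factual_index, rest = stem.split("_", 2)
    # Assumption: the last underscore separates dataset and framework names
    dataset_name, framework_name = rest.rsplit("_", 1)
    return {
        "is_time_file": is_time_file,
        "original_class": original_class,
        "factual_index": factual_index,
        "dataset_name": dataset_name,
        "framework_name": framework_name,
    }


print(parse_result_name("TIME_0_3_BalanceScale_MyCF.csv"))
```

The file name in the usage line is made up for the example; real names depend on the framework_name you set in simple_test.py.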

In the next post, I will create a tutorial on how to use the benchmark functionality to compare one CF generation algorithm with the others already tested by our benchmark.