Dataset transformation

The RuleXAI library can also be used to transform a dataset. Datasets often contain missing values and nominal attributes, which many learning algorithms cannot handle directly; many algorithms additionally require the data to be rescaled beforehand. RuleXAI can convert a dataset with nominal and missing values into a binary dataset whose attributes are conditions describing the data, with the value "1" when a condition is satisfied for an example and "0" when it is not.
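
A minimal illustration of the idea in plain pandas (this is not the RuleXAI API, just a toy example with made-up conditions):

import pandas as pd

# Toy data: one numeric and one nominal attribute, with missing values
df = pd.DataFrame({"A2": [2958.0, None, 2017.0], "A4": ["1", None, "2"]})

# Each condition becomes a binary column; in this toy example an example with
# a missing value does not satisfy any condition on that attribute
binary = pd.DataFrame({
    "A2 = <1900, 3000)": ((df["A2"] >= 1900) & (df["A2"] < 3000)).astype(int),
    "A4 = {1}": (df["A4"] == "1").astype(int),
})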

The data used in this notebook comes from https://sci2s.ugr.es/keel/missing.php?order=mis#sub2. It is the Australian dataset, with 690 examples and 14 attributes: 8 numeric and 6 nominal. Missing values occur in 70% of its examples. The attributes are described below.

@relation australian+MV
@attribute A1 {0, 1}
@attribute A2 real[16.0,8025.0]
@attribute A3 real[0.0,26335.0]
@attribute A4 {1, 2, 3}
@attribute A5 integer[1,14]
@attribute A6 integer[1,9]
@attribute A7 real[0.0,14415.0]
@attribute A8 {0, 1}
@attribute A9 {0, 1}
@attribute A10 integer[0,67]
@attribute A11 {0, 1}
@attribute A12 {1, 2, 3}
@attribute A13 integer[0,2000]
@attribute A14 integer[1,100001]
@attribute Class {0,1}
@inputs A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14
@output Class
@data

Data load

[1]:
import pandas as pd
import numpy as np

train_df = pd.read_csv("./data/australian_train.csv")
test_df = pd.read_csv("./data/australian_test.csv")

train_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]] = train_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]].astype(str)
test_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]] = test_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]].astype(str)

for column in train_df.select_dtypes('object').columns.tolist():
    train_df[column] = train_df[column].apply(lambda x: x.split(".")[0]).replace({"nan": None})
    test_df[column] = test_df[column].apply(lambda x: x.split(".")[0]).replace({"nan": None})
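
As a quick sanity check of the missing-value rate mentioned above, one can compute the fraction of training rows that contain at least one missing value (a small sketch; the exact figure depends on the train/test split):

# Fraction of training examples with at least one missing value
print(train_df.isnull().any(axis=1).mean())
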
[2]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 621 entries, 0 to 620
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      559 non-null    object
 1   A2      569 non-null    float64
 2   A3      554 non-null    float64
 3   A4      541 non-null    object
 4   A5      568 non-null    float64
 5   A6      556 non-null    float64
 6   A7      559 non-null    float64
 7   A8      560 non-null    object
 8   A9      567 non-null    object
 9   A10     563 non-null    float64
 10  A11     561 non-null    object
 11  A12     549 non-null    object
 12  A13     558 non-null    float64
 13  A14     561 non-null    float64
 14  Class   621 non-null    object
dtypes: float64(8), object(7)
memory usage: 72.9+ KB
[3]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      69 non-null     object
 1   A2      69 non-null     float64
 2   A3      69 non-null     float64
 3   A4      69 non-null     object
 4   A5      69 non-null     float64
 5   A6      69 non-null     float64
 6   A7      69 non-null     float64
 7   A8      69 non-null     object
 8   A9      69 non-null     object
 9   A10     69 non-null     float64
 10  A11     69 non-null     object
 11  A12     69 non-null     object
 12  A13     69 non-null     float64
 13  A14     69 non-null     float64
 14  Class   69 non-null     object
dtypes: float64(8), object(7)
memory usage: 8.2+ KB
[4]:
train_org = train_df.copy()
test_org = test_df.copy()

Data preprocessing

  • original data

[5]:
train_df.head(5)
[5]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 Class
0 0 2958.0 175.0 1 4.0 4.0 125.0 0 None 0.0 1 2 280.0 1.0 0
1 0 NaN 115.0 1 5.0 3.0 0.0 1 1 11.0 1 None 0.0 1.0 1
2 1 2017.0 817.0 2 6.0 4.0 196.0 1 1 NaN 0 2 60.0 159.0 1
3 1 1742.0 65.0 2 3.0 4.0 125.0 0 None 0.0 0 2 NaN 101.0 0
4 None 5867.0 446.0 2 11.0 8.0 304.0 1 1 6.0 0 2 43.0 561.0 1
  • imputation of missing values

[6]:
category_columns = train_df.select_dtypes('object').columns.tolist()
number_columns = train_df.select_dtypes('number').columns.tolist()

# Impute nominal attributes with the mode and numeric attributes with the mean
for column in train_df:
    if train_df[column].isnull().any():
        if column in category_columns:
            train_df[column] = train_df[column].fillna(train_df[column].mode()[0])
        else:
            train_df[column] = train_df[column].fillna(train_df[column].mean())
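
The same imputation can be written so that the per-column fill values are kept and can be reused on new data (a hedged sketch; the test set here happens to have no missing values, as test_df.info() above shows):

# Collect per-column fill values: mode for nominal columns, mean for numeric
fill_values = {column: (train_df[column].mode()[0] if column in category_columns
                        else train_df[column].mean())
               for column in train_df.columns}
test_df = test_df.fillna(fill_values)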
[7]:
train_df.head(5)
[7]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 Class
0 0 2958.000000 175.0 1 4.0 4.0 125.0 0 0 0.00000 1 2 280.000000 1.0 0
1 0 2693.896309 115.0 1 5.0 3.0 0.0 1 1 11.00000 1 2 0.000000 1.0 1
2 1 2017.000000 817.0 2 6.0 4.0 196.0 1 1 2.49556 0 2 60.000000 159.0 1
3 1 1742.000000 65.0 2 3.0 4.0 125.0 0 0 0.00000 0 2 185.802867 101.0 0
4 1 5867.000000 446.0 2 11.0 8.0 304.0 1 1 6.00000 0 2 43.000000 561.0 1
  • one-hot encoding

[8]:
# Concatenate train and test before one-hot encoding so that both parts
# end up with the same set of dummy columns
data = pd.concat([train_df, test_df], axis=0)
data.reset_index(drop=True, inplace=True)
data_with_dummies = pd.get_dummies(data.drop(["Class"], axis=1))

train_df_encoded = data_with_dummies[:train_df.shape[0]].copy()
train_df_encoded["Class"] = data[:train_df.shape[0]]["Class"]

test_df_encoded = data_with_dummies[train_df.shape[0]:].copy()
test_df_encoded["Class"] = data[train_df.shape[0]:]["Class"]
[9]:
train_df_encoded.head(5)
[9]:
A2 A3 A5 A6 A7 A10 A13 A14 A1_0 A1_1 ... A8_0 A8_1 A9_0 A9_1 A11_0 A11_1 A12_1 A12_2 A12_3 Class
0 2958.000000 175.0 4.0 4.0 125.0 0.00000 280.000000 1.0 1 0 ... 1 0 1 0 0 1 0 1 0 0
1 2693.896309 115.0 5.0 3.0 0.0 11.00000 0.000000 1.0 1 0 ... 0 1 0 1 0 1 0 1 0 1
2 2017.000000 817.0 6.0 4.0 196.0 2.49556 60.000000 159.0 0 1 ... 0 1 0 1 1 0 0 1 0 1
3 1742.000000 65.0 3.0 4.0 125.0 0.00000 185.802867 101.0 0 1 ... 1 0 1 0 1 0 0 1 0 0
4 5867.000000 446.0 11.0 8.0 304.0 6.00000 43.000000 561.0 0 1 ... 0 1 0 1 1 0 0 1 0 1

5 rows × 23 columns

  • normalization (z-score standardization with StandardScaler)

[10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

train_df_encoded_and_scaled = train_df_encoded.copy()
train_df_encoded_and_scaled[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']] = scaler.fit_transform(train_df_encoded[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']])

test_df_encoded_and_scaled = test_df_encoded.copy()
test_df_encoded_and_scaled[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']] = scaler.transform(test_df_encoded[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']])
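
Note that the scaler is fitted on the training set only (fit_transform) and then merely applied to the test set (transform), so no test-set statistics leak into the preprocessing.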
[11]:
train_df_encoded_and_scaled.head(5)
[11]:
A2 A3 A5 A6 A7 A10 A13 A14 A1_0 A1_1 ... A8_0 A8_1 A9_0 A9_1 A11_0 A11_1 A12_1 A12_2 A12_3 Class
0 0.182967 -0.348773 -0.952571 -0.356525 -0.240176 -0.518373 5.508121e-01 -0.196556 1 0 ... 1 0 1 0 0 1 0 1 0 0
1 0.000000 -0.370201 -0.667503 -0.887967 -0.331488 1.766525 -1.086471e+00 -0.196556 1 0 ... 0 1 0 1 0 1 0 1 0 1
2 -0.468944 -0.119484 -0.382434 -0.356525 -0.188311 0.000000 -7.356248e-01 -0.166737 0 1 ... 0 1 0 1 1 0 0 1 0 1
3 -0.659460 -0.388059 -1.237640 -0.356525 -0.240176 -0.518373 1.661943e-16 -0.177684 0 1 ... 1 0 1 0 1 0 0 1 0 0
4 2.198282 -0.251986 1.042910 1.769244 -0.109417 0.727935 -8.350313e-01 -0.090868 0 1 ... 0 1 0 1 1 0 0 1 0 1

5 rows × 23 columns

[12]:
X_train = train_df_encoded_and_scaled.drop(columns="Class")
y_train = train_df_encoded_and_scaled["Class"]

X_test = test_df_encoded_and_scaled.drop(columns="Class")
y_test = test_df_encoded_and_scaled["Class"]

Building a Random Forest model on a preprocessed dataset

[13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

clf = RandomForestClassifier(random_state=42)

clf.fit(X_train, y_train)
[13]:
RandomForestClassifier(random_state=42)

Balanced accuracy on training set

[14]:
balanced_accuracy_score(y_train, clf.predict(X_train))
[14]:
1.0

Balanced accuracy on test set

[15]:
balanced_accuracy_score(y_test, clf.predict(X_test))
[15]:
0.8153846153846154

Using RuleXAI to transform the original set

[16]:
X_train_org = train_org.drop(columns = "Class")
y_train_org = train_org["Class"]

X_test_org = test_org.drop(columns = "Class")
y_test_org = test_org["Class"]
[17]:
X_train_org.head(5)
[17]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14
0 0 2958.0 175.0 1 4.0 4.0 125.0 0 None 0.0 1 2 280.0 1.0
1 0 NaN 115.0 1 5.0 3.0 0.0 1 1 11.0 1 None 0.0 1.0
2 1 2017.0 817.0 2 6.0 4.0 196.0 1 1 NaN 0 2 60.0 159.0
3 1 1742.0 65.0 2 3.0 4.0 125.0 0 None 0.0 0 2 NaN 101.0
4 None 5867.0 446.0 2 11.0 8.0 304.0 1 1 6.0 0 2 43.0 561.0
[18]:
from rulexai.explainer import Explainer

explainer = Explainer(X=X_train_org, model_predictions=y_train_org, type="classification").explain()
[19]:
X_train_transformed = explainer.fit_transform(X_train_org, selector=None)
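
explain() derives a set of rules describing the data (RuleXAI builds on the RuleKit rule induction library), and fit_transform then maps each example onto the elementary conditions of those rules, yielding one binary column per condition, as shown below.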
[20]:
X_train_transformed.head(5)
[20]:
A2 = <19.0, 7037.5) A8 = {0} A10 = (-inf, 10.5) A13 = (-inf, 216.0) A5 = (-inf, 1.5) A2 = <2445.5, 4429.0) A5 = (-inf, 3.5) A9 = {0} A2 = <1816.5, 3779.0) A13 = <110.0, inf) ... A6 = <2.0, inf) A7 = <168.0, inf) A2 = <29.5, inf) A3 = (-inf, 12.5) A5 = <7.5, inf) A14 = (-inf, 1069.5) A3 = (-inf, 1080.0) A5 = <6.5, inf) A13 = (-inf, 591.5) A6 = <3.5, inf)
0 1 1 1 0 0 1 0 0 1 1 ... 1 0 1 0 0 1 1 0 1 1
1 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 1 1 0 1 0
2 1 0 0 1 0 0 0 0 1 0 ... 1 1 1 0 0 1 1 0 1 1
3 1 1 1 0 0 0 1 0 0 0 ... 1 0 1 0 0 1 1 0 0 1
4 1 0 1 1 0 0 0 0 0 0 ... 1 1 1 0 1 1 1 1 1 1

5 rows × 99 columns

Building a Random Forest model on a dataset prepared by RuleXAI

[21]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)

clf.fit(X_train_transformed, y_train_org)
[21]:
RandomForestClassifier(random_state=42)
[22]:
X_test_transformed = explainer.transform(X_test_org)
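
As with the scaler earlier, transform applies the conditions learned on the training set to the test set, so both sets end up with the same 99 binary columns.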

Balanced accuracy on training set

[23]:
balanced_accuracy_score(y_train_org, clf.predict(X_train_transformed))
[23]:
1.0

Balanced accuracy on test set

[24]:
balanced_accuracy_score(y_test_org, clf.predict(X_test_transformed))
[24]:
0.844871794871795

Comparing the results obtained with Random Forest on the preprocessed original set (imputation, one-hot encoding, normalization) and on the original set transformed with RuleXAI, the results are similar: a balanced accuracy of 0.815 on the test set for the former versus 0.845 for the latter. Notably, the RuleXAI transformation required no separate imputation, encoding, or scaling steps.