Dataset transformation

The RuleXAI library can also be used to transform a dataset. Datasets often contain missing values and nominal attributes, which many learning algorithms cannot handle directly; many algorithms additionally require the data to be rescaled beforehand. RuleXAI can convert a dataset with nominal and missing values into a binary dataset whose attributes are conditions describing the data, with the value "1" when a condition is satisfied for an example and "0" when it is not.
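
A minimal illustration of the idea in plain pandas (this is not the RuleXAI API, just a toy example with made-up conditions):

import pandas as pd

# Toy data: one numeric and one nominal attribute, with missing values
df = pd.DataFrame({"A2": [2958.0, None, 2017.0], "A4": ["1", None, "2"]})

# Each condition becomes a binary column; in this toy example an example with
# a missing value does not satisfy any condition on that attribute
binary = pd.DataFrame({
    "A2 = <1900, 3000)": ((df["A2"] >= 1900) & (df["A2"] < 3000)).astype(int),
    "A4 = {1}": (df["A4"] == "1").astype(int),
})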

The data used in this notebook comes from https://sci2s.ugr.es/keel/missing.php?order=mis#sub2. It is the Australian dataset, with 690 examples and 14 attributes: 8 numeric and 6 nominal. Missing values occur in 70% of its examples. The attributes are described below.

@relation australian+MV
@attribute A1 {0, 1}
@attribute A2 real[16.0,8025.0]
@attribute A3 real[0.0,26335.0]
@attribute A4 {1, 2, 3}
@attribute A5 integer[1,14]
@attribute A6 integer[1,9]
@attribute A7 real[0.0,14415.0]
@attribute A8 {0, 1}
@attribute A9 {0, 1}
@attribute A10 integer[0,67]
@attribute A11 {0, 1}
@attribute A12 {1, 2, 3}
@attribute A13 integer[0,2000]
@attribute A14 integer[1,100001]
@attribute Class {0,1}
@inputs A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14
@output Class
@data

Data load

[1]:
import pandas as pd
import numpy as np

train_df = pd.read_csv("./data/australian_train.csv")
test_df = pd.read_csv("./data/australian_test.csv")

train_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]] = train_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]].astype(str)
test_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]] = test_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]].astype(str)

for column in train_df.select_dtypes('object').columns.tolist():
    train_df[column] = train_df[column].apply(lambda x: x.split(".")[0]).replace({"nan": None})
    test_df[column] = test_df[column].apply(lambda x: x.split(".")[0]).replace({"nan": None})
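
As a quick sanity check of the missing-value rate mentioned above, one can compute the fraction of training rows that contain at least one missing value (a small sketch; the exact figure depends on the train/test split):

# Fraction of training examples with at least one missing value
print(train_df.isnull().any(axis=1).mean())
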
[2]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 621 entries, 0 to 620
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      559 non-null    object
 1   A2      569 non-null    float64
 2   A3      554 non-null    float64
 3   A4      541 non-null    object
 4   A5      568 non-null    float64
 5   A6      556 non-null    float64
 6   A7      559 non-null    float64
 7   A8      560 non-null    object
 8   A9      567 non-null    object
 9   A10     563 non-null    float64
 10  A11     561 non-null    object
 11  A12     549 non-null    object
 12  A13     558 non-null    float64
 13  A14     561 non-null    float64
 14  Class   621 non-null    object
dtypes: float64(8), object(7)
memory usage: 72.9+ KB
[3]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      69 non-null     object
 1   A2      69 non-null     float64
 2   A3      69 non-null     float64
 3   A4      69 non-null     object
 4   A5      69 non-null     float64
 5   A6      69 non-null     float64
 6   A7      69 non-null     float64
 7   A8      69 non-null     object
 8   A9      69 non-null     object
 9   A10     69 non-null     float64
 10  A11     69 non-null     object
 11  A12     69 non-null     object
 12  A13     69 non-null     float64
 13  A14     69 non-null     float64
 14  Class   69 non-null     object
dtypes: float64(8), object(7)
memory usage: 8.2+ KB
[4]:
train_org = train_df.copy()
test_org = test_df.copy()

Data preprocessing

  • original data

[5]:
train_df.head(5)
[5]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 Class
0 0 2958.0 175.0 1 4.0 4.0 125.0 0 None 0.0 1 2 280.0 1.0 0
1 0 NaN 115.0 1 5.0 3.0 0.0 1 1 11.0 1 None 0.0 1.0 1
2 1 2017.0 817.0 2 6.0 4.0 196.0 1 1 NaN 0 2 60.0 159.0 1
3 1 1742.0 65.0 2 3.0 4.0 125.0 0 None 0.0 0 2 NaN 101.0 0
4 None 5867.0 446.0 2 11.0 8.0 304.0 1 1 6.0 0 2 43.0 561.0 1
  • imputation of missing values

[6]:
category_columns = train_df.select_dtypes('object').columns.tolist()
number_columns = train_df.select_dtypes('number').columns.tolist()

# Impute nominal attributes with the mode and numeric attributes with the mean
for column in train_df:
    if train_df[column].isnull().any():
        if column in category_columns:
            train_df[column] = train_df[column].fillna(train_df[column].mode()[0])
        else:
            train_df[column] = train_df[column].fillna(train_df[column].mean())
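
The same imputation can be written so that the per-column fill values are kept and can be reused on new data (a hedged sketch; the test set here happens to have no missing values, as test_df.info() above shows):

# Collect per-column fill values: mode for nominal columns, mean for numeric
fill_values = {column: (train_df[column].mode()[0] if column in category_columns
                        else train_df[column].mean())
               for column in train_df.columns}
test_df = test_df.fillna(fill_values)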
[7]:
train_df.head(5)
[7]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 Class
0 0 2958.000000 175.0 1 4.0 4.0 125.0 0 0 0.00000 1 2 280.000000 1.0 0
1 0 2693.896309 115.0 1 5.0 3.0 0.0 1 1 11.00000 1 2 0.000000 1.0 1
2 1 2017.000000 817.0 2 6.0 4.0 196.0 1 1 2.49556 0 2 60.000000 159.0 1
3 1 1742.000000 65.0 2 3.0 4.0 125.0 0 0 0.00000 0 2 185.802867 101.0 0
4 1 5867.000000 446.0 2 11.0 8.0 304.0 1 1 6.00000 0 2 43.000000 561.0 1
  • one-hot encoding

[8]:
# Concatenate train and test before one-hot encoding so that both parts
# end up with the same set of dummy columns
data = pd.concat([train_df, test_df], axis=0)
data.reset_index(drop=True, inplace=True)
data_with_dummies = pd.get_dummies(data.drop(["Class"], axis=1))

train_df_encoded = data_with_dummies[:train_df.shape[0]].copy()
train_df_encoded["Class"] = data[:train_df.shape[0]]["Class"]

test_df_encoded = data_with_dummies[train_df.shape[0]:].copy()
test_df_encoded["Class"] = data[train_df.shape[0]:]["Class"]
[9]:
train_df_encoded.head(5)
[9]:
A2 A3 A5 A6 A7 A10 A13 A14 A1_0 A1_1 ... A8_0 A8_1 A9_0 A9_1 A11_0 A11_1 A12_1 A12_2 A12_3 Class
0 2958.000000 175.0 4.0 4.0 125.0 0.00000 280.000000 1.0 1 0 ... 1 0 1 0 0 1 0 1 0 0
1 2693.896309 115.0 5.0 3.0 0.0 11.00000 0.000000 1.0 1 0 ... 0 1 0 1 0 1 0 1 0 1
2 2017.000000 817.0 6.0 4.0 196.0 2.49556 60.000000 159.0 0 1 ... 0 1 0 1 1 0 0 1 0 1
3 1742.000000 65.0 3.0 4.0 125.0 0.00000 185.802867 101.0 0 1 ... 1 0 1 0 1 0 0 1 0 0
4 5867.000000 446.0 11.0 8.0 304.0 6.00000 43.000000 561.0 0 1 ... 0 1 0 1 1 0 0 1 0 1

5 rows × 23 columns

  • normalization (z-score standardization with StandardScaler)

[10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

train_df_encoded_and_scaled = train_df_encoded.copy()
train_df_encoded_and_scaled[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']] = scaler.fit_transform(train_df_encoded[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']])

test_df_encoded_and_scaled = test_df_encoded.copy()
test_df_encoded_and_scaled[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']] = scaler.transform(test_df_encoded[['A2','A3','A5','A6', 'A7', 'A10', 'A13', 'A14']])
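
Note that the scaler is fitted on the training set only (fit_transform) and then merely applied to the test set (transform), so no test-set statistics leak into the preprocessing.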
[11]:
train_df_encoded_and_scaled.head(5)
[11]:
A2 A3 A5 A6 A7 A10 A13 A14 A1_0 A1_1 ... A8_0 A8_1 A9_0 A9_1 A11_0 A11_1 A12_1 A12_2 A12_3 Class
0 0.182967 -0.348773 -0.952571 -0.356525 -0.240176 -0.518373 5.508121e-01 -0.196556 1 0 ... 1 0 1 0 0 1 0 1 0 0
1 0.000000 -0.370201 -0.667503 -0.887967 -0.331488 1.766525 -1.086471e+00 -0.196556 1 0 ... 0 1 0 1 0 1 0 1 0 1
2 -0.468944 -0.119484 -0.382434 -0.356525 -0.188311 0.000000 -7.356248e-01 -0.166737 0 1 ... 0 1 0 1 1 0 0 1 0 1
3 -0.659460 -0.388059 -1.237640 -0.356525 -0.240176 -0.518373 1.661943e-16 -0.177684 0 1 ... 1 0 1 0 1 0 0 1 0 0
4 2.198282 -0.251986 1.042910 1.769244 -0.109417 0.727935 -8.350313e-01 -0.090868 0 1 ... 0 1 0 1 1 0 0 1 0 1

5 rows × 23 columns

[12]:
X_train = train_df_encoded_and_scaled.drop(columns="Class")
y_train = train_df_encoded_and_scaled["Class"]

X_test = test_df_encoded_and_scaled.drop(columns="Class")
y_test = test_df_encoded_and_scaled["Class"]

Building a Random Forest model on a preprocessed dataset

[13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

clf = RandomForestClassifier(random_state=42)

clf.fit(X_train, y_train)
[13]:
RandomForestClassifier(random_state=42)

Balanced accuracy on training set

[14]:
balanced_accuracy_score(y_train, clf.predict(X_train))
[14]:
1.0

Balanced accuracy on test set

[15]:
balanced_accuracy_score(y_test, clf.predict(X_test))
[15]:
0.8153846153846154

Using RuleXAI to transform the original set

[16]:
X_train_org = train_org.drop(columns = "Class")
y_train_org = train_org["Class"]

X_test_org = test_org.drop(columns = "Class")
y_test_org = test_org["Class"]
[17]:
X_train_org.head(5)
[17]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14
0 0 2958.0 175.0 1 4.0 4.0 125.0 0 None 0.0 1 2 280.0 1.0
1 0 NaN 115.0 1 5.0 3.0 0.0 1 1 11.0 1 None 0.0 1.0
2 1 2017.0 817.0 2 6.0 4.0 196.0 1 1 NaN 0 2 60.0 159.0
3 1 1742.0 65.0 2 3.0 4.0 125.0 0 None 0.0 0 2 NaN 101.0
4 None 5867.0 446.0 2 11.0 8.0 304.0 1 1 6.0 0 2 43.0 561.0
[18]:
from rulexai.explainer import Explainer

explainer = Explainer(X=X_train_org, model_predictions=y_train_org, type="classification").explain()
[19]:
X_train_transformed = explainer.fit_transform(X_train_org, selector=None)
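
explain() derives a set of rules describing the data (RuleXAI builds on the RuleKit rule induction library), and fit_transform then maps each example onto the elementary conditions of those rules, yielding one binary column per condition, as shown below.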
[20]:
X_train_transformed.head(5)
[20]:
A2 = <19.0, 7037.5) A8 = {0} A10 = (-inf, 10.5) A13 = (-inf, 216.0) A5 = (-inf, 1.5) A2 = <2445.5, 4429.0) A5 = (-inf, 3.5) A9 = {0} A2 = <1816.5, 3779.0) A13 = <110.0, inf) ... A6 = <2.0, inf) A7 = <168.0, inf) A2 = <29.5, inf) A3 = (-inf, 12.5) A5 = <7.5, inf) A14 = (-inf, 1069.5) A3 = (-inf, 1080.0) A5 = <6.5, inf) A13 = (-inf, 591.5) A6 = <3.5, inf)
0 1 1 1 0 0 1 0 0 1 1 ... 1 0 1 0 0 1 1 0 1 1
1 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 1 1 0 1 0
2 1 0 0 1 0 0 0 0 1 0 ... 1 1 1 0 0 1 1 0 1 1
3 1 1 1 0 0 0 1 0 0 0 ... 1 0 1 0 0 1 1 0 0 1
4 1 0 1 1 0 0 0 0 0 0 ... 1 1 1 0 1 1 1 1 1 1

5 rows × 99 columns

Building a Random Forest model on a dataset prepared by RuleXAI

[21]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)

clf.fit(X_train_transformed, y_train_org)
[21]:
RandomForestClassifier(random_state=42)
[22]:
X_test_transformed = explainer.transform(X_test_org)
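
As with the scaler earlier, transform applies the conditions learned on the training set to the test set, so both sets end up with the same 99 binary columns.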

Balanced accuracy on training set

[23]:
balanced_accuracy_score(y_train_org, clf.predict(X_train_transformed))
[23]:
1.0

Balanced accuracy on test set

[24]:
balanced_accuracy_score(y_test_org, clf.predict(X_test_transformed))
[24]:
0.844871794871795

Comparing the results obtained with Random Forest on the preprocessed original set (imputation, one-hot encoding, normalization) and on the original set transformed with RuleXAI, the results are similar: a balanced accuracy of 0.815 on the test set for the former versus 0.845 for the latter. Notably, the RuleXAI transformation required no separate imputation, encoding, or scaling steps.