Dataset transformation¶
The RuleXAI library can also be used to transform a dataset. Datasets often contain missing values and nominal attributes, which many algorithms do not support directly, and many algorithms also require the data to be rescaled beforehand. RuleXAI can convert a dataset with nominal and missing values into a binary dataset whose attributes are the conditions describing the data: a value is “1” when the example satisfies the condition and “0” when it does not.
The data used in this notebook comes from https://sci2s.ugr.es/keel/missing.php?order=mis#sub2. It is the Australian dataset, which has 690 examples and 14 attributes: 8 numeric and 6 nominal. About 70% of its examples contain missing values. The attributes of this dataset are summarized below.
Data load¶
[1]:
import pandas as pd
import numpy as np
train_df = pd.read_csv("./data/australian_train.csv")
test_df = pd.read_csv("./data/australian_test.csv")
train_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]] = train_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]].astype(str)
test_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]] = test_df[["A1","A4", "A8", "A9", "A11", "A12", "Class"]].astype(str)
for column in train_df.select_dtypes('object').columns.tolist():
    # Drop the ".0" suffix left by float-coded nominal values and restore missing values
    train_df[column] = train_df[column].apply(lambda x: x.split(".")[0]).replace({"nan": None})
    test_df[column] = test_df[column].apply(lambda x: x.split(".")[0]).replace({"nan": None})
[2]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 621 entries, 0 to 620
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A1 559 non-null object
1 A2 569 non-null float64
2 A3 554 non-null float64
3 A4 541 non-null object
4 A5 568 non-null float64
5 A6 556 non-null float64
6 A7 559 non-null float64
7 A8 560 non-null object
8 A9 567 non-null object
9 A10 563 non-null float64
10 A11 561 non-null object
11 A12 549 non-null object
12 A13 558 non-null float64
13 A14 561 non-null float64
14 Class 621 non-null object
dtypes: float64(8), object(7)
memory usage: 72.9+ KB
[3]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A1 69 non-null object
1 A2 69 non-null float64
2 A3 69 non-null float64
3 A4 69 non-null object
4 A5 69 non-null float64
5 A6 69 non-null float64
6 A7 69 non-null float64
7 A8 69 non-null object
8 A9 69 non-null object
9 A10 69 non-null float64
10 A11 69 non-null object
11 A12 69 non-null object
12 A13 69 non-null float64
13 A14 69 non-null float64
14 Class 69 non-null object
dtypes: float64(8), object(7)
memory usage: 8.2+ KB
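The two `info()` summaries show that only the training set contains missing values. The sketch below shows one way to quantify this with pandas. It runs on a small made-up frame (`toy` is a hypothetical stand-in, not the notebook's data); the same two expressions could be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy = pd.DataFrame({
    "A1": ["0", None, "1", "1"],
    "A2": [2958.0, np.nan, 2017.0, 1742.0],
    "A3": [175.0, 115.0, 817.0, 65.0],
})

# Fraction of missing values in each column
missing_per_column = toy.isna().mean()

# Fraction of rows (examples) with at least one missing value
rows_with_missing = toy.isna().any(axis=1).mean()

print(missing_per_column)
print(rows_with_missing)
```

The first quantity describes how incomplete each attribute is, while the second describes how many examples are affected, which is typically a much larger share.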
[4]:
train_org = train_df.copy()
test_org = test_df.copy()
Data preprocessing¶
original data
[5]:
train_df.head(5)
[5]:
| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2958.0 | 175.0 | 1 | 4.0 | 4.0 | 125.0 | 0 | None | 0.0 | 1 | 2 | 280.0 | 1.0 | 0 |
1 | 0 | NaN | 115.0 | 1 | 5.0 | 3.0 | 0.0 | 1 | 1 | 11.0 | 1 | None | 0.0 | 1.0 | 1 |
2 | 1 | 2017.0 | 817.0 | 2 | 6.0 | 4.0 | 196.0 | 1 | 1 | NaN | 0 | 2 | 60.0 | 159.0 | 1 |
3 | 1 | 1742.0 | 65.0 | 2 | 3.0 | 4.0 | 125.0 | 0 | None | 0.0 | 0 | 2 | NaN | 101.0 | 0 |
4 | None | 5867.0 | 446.0 | 2 | 11.0 | 8.0 | 304.0 | 1 | 1 | 6.0 | 0 | 2 | 43.0 | 561.0 | 1 |
imputation of missing values
[6]:
category_columns = train_df.select_dtypes('object').columns.tolist()
number_columns = train_df.select_dtypes('number').columns.tolist()
for column in train_df:
    if train_df[column].isnull().any():
        if column in category_columns:
            # Nominal attribute: fill with the most frequent value
            train_df[column] = train_df[column].fillna(train_df[column].mode()[0])
        else:
            # Numeric attribute: fill with the column mean
            train_df[column] = train_df[column].fillna(train_df[column].mean())
[7]:
train_df.head(5)
[7]:
| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2958.000000 | 175.0 | 1 | 4.0 | 4.0 | 125.0 | 0 | 0 | 0.00000 | 1 | 2 | 280.000000 | 1.0 | 0 |
1 | 0 | 2693.896309 | 115.0 | 1 | 5.0 | 3.0 | 0.0 | 1 | 1 | 11.00000 | 1 | 2 | 0.000000 | 1.0 | 1 |
2 | 1 | 2017.000000 | 817.0 | 2 | 6.0 | 4.0 | 196.0 | 1 | 1 | 2.49556 | 0 | 2 | 60.000000 | 159.0 | 1 |
3 | 1 | 1742.000000 | 65.0 | 2 | 3.0 | 4.0 | 125.0 | 0 | 0 | 0.00000 | 0 | 2 | 185.802867 | 101.0 | 0 |
4 | 1 | 5867.000000 | 446.0 | 2 | 11.0 | 8.0 | 304.0 | 1 | 1 | 6.00000 | 0 | 2 | 43.000000 | 561.0 | 1 |
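The imputation step fills nominal attributes with the most frequent value and numeric attributes with the mean, both computed on the training set. The test set in this notebook has no missing values, but if it did, it would usually be filled with the same training-set statistics. The following is a minimal sketch of that idea on made-up data; the frames `train` and `test` and the dictionary `fill_values` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one nominal and one numeric attribute, both with gaps
train = pd.DataFrame({
    "A1": ["0", "1", None, "1"],
    "A2": [10.0, np.nan, 30.0, 20.0],
})
test = pd.DataFrame({
    "A1": [None, "0"],
    "A2": [np.nan, 5.0],
})

# Learn one fill value per column from the training set only
fill_values = {}
for column in train.columns:
    if pd.api.types.is_numeric_dtype(train[column]):
        fill_values[column] = train[column].mean()     # numeric: mean
    else:
        fill_values[column] = train[column].mode()[0]  # nominal: mode

# Apply the same values to both sets
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

print(train_imputed)
print(test_imputed)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.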
one-hot encoding
[ ]:
data = pd.concat([train_df, test_df], axis=0)
data.reset_index(drop=True, inplace=True)
data_with_dummies = pd.get_dummies(data.drop(["Class"], axis=1))
n_train = train_df.shape[0]
# Copy the slices so that adding "Class" does not trigger SettingWithCopyWarning
train_df_encoded = data_with_dummies[:n_train].copy()
train_df_encoded["Class"] = data[:n_train]["Class"]
test_df_encoded = data_with_dummies[n_train:].copy()
test_df_encoded["Class"] = data[n_train:]["Class"]
[9]:
train_df_encoded.head(5)
[9]:
| | A2 | A3 | A5 | A6 | A7 | A10 | A13 | A14 | A1_0 | A1_1 | ... | A8_0 | A8_1 | A9_0 | A9_1 | A11_0 | A11_1 | A12_1 | A12_2 | A12_3 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2958.000000 | 175.0 | 4.0 | 4.0 | 125.0 | 0.00000 | 280.000000 | 1.0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 2693.896309 | 115.0 | 5.0 | 3.0 | 0.0 | 11.00000 | 0.000000 | 1.0 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
2 | 2017.000000 | 817.0 | 6.0 | 4.0 | 196.0 | 2.49556 | 60.000000 | 159.0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
3 | 1742.000000 | 65.0 | 3.0 | 4.0 | 125.0 | 0.00000 | 185.802867 | 101.0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
4 | 5867.000000 | 446.0 | 11.0 | 8.0 | 304.0 | 6.00000 | 43.000000 | 561.0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
5 rows × 23 columns
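The encoding cell concatenates the training and test sets before calling `pd.get_dummies`, so that both end up with the same dummy columns even if some category is absent from one of them. An alternative, sketched below on made-up data, is to encode the two sets separately and align the test columns to the training columns with `reindex`. The frames and values here are hypothetical.

```python
import pandas as pd

# Made-up data: the test set lacks categories "1" and "3" of A12
train = pd.DataFrame({"A12": ["1", "2", "3"], "A2": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"A12": ["2", "2"], "A2": [4.0, 5.0]})

train_encoded = pd.get_dummies(train, dtype=int)

# Encoding the test set on its own yields fewer columns ...
test_alone = pd.get_dummies(test, dtype=int)

# ... so align it to the training columns, filling absent dummies with 0
test_encoded = test_alone.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_alone.columns))
print(list(test_encoded.columns))
```

Either approach gives the model the same feature layout at training and prediction time.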
normalization
[10]:
from sklearn.preprocessing import StandardScaler
numeric_columns = ['A2', 'A3', 'A5', 'A6', 'A7', 'A10', 'A13', 'A14']
scaler = StandardScaler()
# Fit the scaler on the training set only, then apply it to both sets
train_df_encoded_and_scaled = train_df_encoded.copy()
train_df_encoded_and_scaled[numeric_columns] = scaler.fit_transform(train_df_encoded[numeric_columns])
test_df_encoded_and_scaled = test_df_encoded.copy()
test_df_encoded_and_scaled[numeric_columns] = scaler.transform(test_df_encoded[numeric_columns])
[11]:
train_df_encoded_and_scaled.head(5)
[11]:
| | A2 | A3 | A5 | A6 | A7 | A10 | A13 | A14 | A1_0 | A1_1 | ... | A8_0 | A8_1 | A9_0 | A9_1 | A11_0 | A11_1 | A12_1 | A12_2 | A12_3 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.182967 | -0.348773 | -0.952571 | -0.356525 | -0.240176 | -0.518373 | 5.508121e-01 | -0.196556 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 0.000000 | -0.370201 | -0.667503 | -0.887967 | -0.331488 | 1.766525 | -1.086471e+00 | -0.196556 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
2 | -0.468944 | -0.119484 | -0.382434 | -0.356525 | -0.188311 | 0.000000 | -7.356248e-01 | -0.166737 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
3 | -0.659460 | -0.388059 | -1.237640 | -0.356525 | -0.240176 | -0.518373 | 1.661943e-16 | -0.177684 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
4 | 2.198282 | -0.251986 | 1.042910 | 1.769244 | -0.109417 | 0.727935 | -8.350313e-01 | -0.090868 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
5 rows × 23 columns
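`StandardScaler` subtracts the training mean from each value and divides by the training standard deviation (computed with `ddof=0`). The sketch below reproduces this by hand on made-up numbers and compares the result with scikit-learn; the arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data
train_values = np.array([[1.0], [2.0], [3.0], [6.0]])
test_values = np.array([[3.0], [10.0]])

# Fit on the training data only
scaler = StandardScaler().fit(train_values)

# Manual equivalent: z = (x - mean_train) / std_train, with population std
mean = train_values.mean(axis=0)
std = train_values.std(axis=0)  # ddof=0 by default
manual_test = (test_values - mean) / std

print(scaler.transform(test_values))
print(manual_test)
```

A value equal to the training mean maps to 0. This is why cells that were imputed with the column mean, such as A2 in row 1, show up as 0 (or a value within floating-point error of 0) in the scaled table.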
[12]:
X_train = train_df_encoded_and_scaled.drop(columns="Class")
y_train = train_df_encoded_and_scaled["Class"]
X_test = test_df_encoded_and_scaled.drop(columns="Class")
y_test = test_df_encoded_and_scaled["Class"]
Building a Random Forest model on a preprocessed dataset¶
[13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
[13]:
RandomForestClassifier(random_state=42)
Balanced accuracy on training set
[14]:
balanced_accuracy_score(y_train,clf.predict(X_train))
[14]:
1.0
Balanced accuracy on test set
[15]:
balanced_accuracy_score(y_test,clf.predict(X_test))
[15]:
0.8153846153846154
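Balanced accuracy is the average of the recall obtained on each class, so it is not dominated by the majority class. A small sketch on made-up labels shows that `balanced_accuracy_score` agrees with macro-averaged recall; the label lists are hypothetical.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Made-up labels: three examples of class "0" and one of class "1"
y_true = ["0", "0", "0", "1"]
y_pred = ["0", "0", "1", "1"]

# Recall of class "0" is 2/3, recall of class "1" is 1/1
score = balanced_accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(score)
print(macro_recall)
```

Here plain accuracy would be 3/4, while balanced accuracy is (2/3 + 1) / 2, because each class contributes equally regardless of its size.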
Using RuleXAI to transform the original set¶
[16]:
X_train_org = train_org.drop(columns = "Class")
y_train_org = train_org["Class"]
X_test_org = test_org.drop(columns = "Class")
y_test_org = test_org["Class"]
[17]:
X_train_org.head(5)
[17]:
| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2958.0 | 175.0 | 1 | 4.0 | 4.0 | 125.0 | 0 | None | 0.0 | 1 | 2 | 280.0 | 1.0 |
1 | 0 | NaN | 115.0 | 1 | 5.0 | 3.0 | 0.0 | 1 | 1 | 11.0 | 1 | None | 0.0 | 1.0 |
2 | 1 | 2017.0 | 817.0 | 2 | 6.0 | 4.0 | 196.0 | 1 | 1 | NaN | 0 | 2 | 60.0 | 159.0 |
3 | 1 | 1742.0 | 65.0 | 2 | 3.0 | 4.0 | 125.0 | 0 | None | 0.0 | 0 | 2 | NaN | 101.0 |
4 | None | 5867.0 | 446.0 | 2 | 11.0 | 8.0 | 304.0 | 1 | 1 | 6.0 | 0 | 2 | 43.0 | 561.0 |
[18]:
from rulexai.explainer import Explainer
explainer = Explainer(X = X_train_org,model_predictions = y_train_org, type = "classification").explain()
[19]:
X_train_tranformed = explainer.fit_transform(X_train_org, selector=None)
[20]:
X_train_tranformed.head(5)
[20]:
| | A2 = <19.0, 7037.5) | A8 = {0} | A10 = (-inf, 10.5) | A13 = (-inf, 216.0) | A5 = (-inf, 1.5) | A2 = <2445.5, 4429.0) | A5 = (-inf, 3.5) | A9 = {0} | A2 = <1816.5, 3779.0) | A13 = <110.0, inf) | ... | A6 = <2.0, inf) | A7 = <168.0, inf) | A2 = <29.5, inf) | A3 = (-inf, 12.5) | A5 = <7.5, inf) | A14 = (-inf, 1069.5) | A3 = (-inf, 1080.0) | A5 = <6.5, inf) | A13 = (-inf, 591.5) | A6 = <3.5, inf) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
3 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
4 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
5 rows × 99 columns
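Each column of the transformed dataset is named after an elementary condition, and its value states whether the example satisfies that condition. The sketch below evaluates three such conditions by hand with pandas on the first three training rows. It is not RuleXAI's implementation: the helpers `in_interval` and `in_set` are hypothetical, and the reading of `<a, b)` as a left-closed, right-open interval is an assumption based on the column names.

```python
import numpy as np
import pandas as pd

# First three training rows, restricted to three attributes
X = pd.DataFrame({
    "A2": [2958.0, np.nan, 2017.0],
    "A8": ["0", "1", "1"],
    "A10": [0.0, 11.0, np.nan],
})

def in_interval(series, low, high):
    """Return 1 where low <= value < high, else 0 (missing values give 0)."""
    return ((series >= low) & (series < high)).astype(int)

def in_set(series, values):
    """Return 1 where the value belongs to the given set, else 0."""
    return series.isin(values).astype(int)

binary = pd.DataFrame({
    "A2 = <19.0, 7037.5)": in_interval(X["A2"], 19.0, 7037.5),
    "A8 = {0}": in_set(X["A8"], ["0"]),
    "A10 = (-inf, 10.5)": in_interval(X["A10"], -np.inf, 10.5),
})

print(binary)
```

Comparisons with a missing value evaluate to false, so a missing attribute simply yields 0 for every condition on it. This is consistent with the table above, where row 1 (missing A2) has 0 in all displayed A2 columns, and it is why no separate imputation step is needed.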
Building a Random Forest model on a prepared dataset by RuleXAI¶
[21]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_tranformed, y_train_org)
[21]:
RandomForestClassifier(random_state=42)
[22]:
X_test_transformed = explainer.transform(X_test_org)
Balanced accuracy on training set
[23]:
balanced_accuracy_score(y_train_org,clf.predict(X_train_tranformed))
[23]:
1.0
Balanced accuracy on test set
[24]:
balanced_accuracy_score(y_test_org,clf.predict(X_test_transformed))
[24]:
0.844871794871795
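Comparing the Random Forest results on the manually preprocessed dataset (imputation, one-hot encoding, normalization) with those on the original dataset transformed by RuleXAI shows that they are similar. Both models reach a balanced accuracy of 1.0 on the training set, and on the test set they score about 0.815 and 0.845, respectively. The RuleXAI transformation achieves this without separate imputation, encoding, or scaling steps.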