Estimating Feature Importance
This demonstrates how to estimate feature importance based on how often features are used as splitting criteria.
webkb is available in the relational-datasets package.
A brief webkb overview is available with the relational-datasets documentation.
Calling load will return training and test folds:
from relational_datasets import load
train, test = load("webkb", fold=1)
We'll set up the learning problem and fit the classifier:
from srlearn.rdn import BoostedRDNClassifier
from srlearn import Background
bkg = Background(
modes=[
"courseprof(-course,+person).",
"courseprof(+course,-person).",
"courseta(+course,-person).",
"courseta(-course,+person).",
"project(-proj,+person).",
"project(+proj,-person).",
"sameperson(-person,+person).",
"faculty(+person).",
"student(+person).",
],
number_of_clauses=8,
)
clf = BoostedRDNClassifier(
background=bkg,
target="faculty",
max_tree_depth=3,
node_size=3,
n_estimators=10,
)
clf.fit(train)
Out:
/home/docs/checkouts/readthedocs.org/user_builds/srlearn/checkouts/stable/srlearn/base.py:70: FutureWarning: solver='BoostSRL' will default to solver='SRLBoost' in 0.6.0, pass one or the other as an argument to suppress this warning.
", pass one or the other as an argument to suppress this warning.", FutureWarning)
BoostedRDNClassifier(background=setParam: numOfClauses=8.
setParam: numOfCycles=100.
usePrologVariables: true.
setParam: nodeSize=3.
setParam: maxTreeDepth=3.
mode: courseprof(-course,+person).
mode: courseprof(+course,-person).
mode: courseta(+course,-person).
mode: courseta(-course,+person).
mode: project(-proj,+person).
mode: project(+proj,-person).
mode: sameperson(-person,+person).
mode: faculty(+person).
mode: student(+person).
, n_estimators=10, neg_pos_ratio=2, solver='BoostSRL', target='faculty')
The built-in feature_importances_ attribute of a fitted classifier is a Counter of how many times each feature appears across the trees:
clf.feature_importances_
Out:
Counter({'student': 10, 'sameperson': 10})
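To make the counting idea concrete, here is a minimal sketch that tallies predicate names over a hand-written list of split literals. It does not call srlearn: the trees list, predicate_name, and count_split_features are hypothetical names invented for illustration, and counting every occurrence of a predicate (rather than once per tree) is an assumption about how such a tally could be made.

```python
from collections import Counter

# Hypothetical split literals, one inner list per boosted tree.
# These are made up for illustration and are not real srlearn output.
trees = [
    ["student(A)", "sameperson(A, B)"],
    ["student(A)", "sameperson(B, A)"],
    ["student(A)"],
]


def predicate_name(literal):
    """Return the predicate name of a literal, e.g. 'student(A)' -> 'student'."""
    return literal.split("(", 1)[0].strip()


def count_split_features(trees):
    """Count every occurrence of each predicate across all trees."""
    counts = Counter()
    for splits in trees:
        counts.update(predicate_name(literal) for literal in splits)
    return counts


importances = count_split_features(trees)

# most_common() orders predicates from most to least frequently used.
ranked = importances.most_common()
print(ranked)
```

Because the result is an ordinary Counter, the same most_common() call should also work for ranking the features in clf.feature_importances_.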
Feature importances should generally be read alongside the trees themselves, so we'll plot the first tree here as well.
It appears that the only features needed to determine if someone is a faculty member can roughly be stated as: "Is this person a student?" and "Do these two names refer to the same person?"
This might be surprising, but it shows that we can induce concepts like "a faculty member is NOT a student."
from srlearn.plotting import export_digraph, plot_digraph
plot_digraph(export_digraph(clf, 0), format='html')
Out:
<srlearn.plotting._GVPlotter object at 0x7f040d9053d0>
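The induced concept discussed above, "a faculty member is NOT a student", can be written out as a plain rule. This is only a hand-written sketch of that one idea, not the learned model: the names (students, predict_faculty, and the people in the example) are hypothetical, and the real trees also test sameperson and output regression values rather than a hard yes/no.

```python
# Hypothetical background knowledge: the set of people known to be students.
students = {"alice", "bob"}


def predict_faculty(person, students):
    """Hand-written stand-in for the induced rule: faculty(X) if not student(X)."""
    return person not in students


# Apply the rule to a few hypothetical people.
predictions = {name: predict_faculty(name, students) for name in ["alice", "bob", "carol"]}
print(predictions)
```

Writing the rule out this way shows why student ranks so highly in the importance counts: a single negated test on student already separates most of the examples.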
Total running time of the script: ( 0 minutes 12.877 seconds)