Estimating Feature Importance

This example demonstrates how to estimate feature importance from how often each feature is used as a splitting criterion in the learned trees.
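To make the counting idea concrete, here is a minimal sketch that tallies split features over a handful of toy trees. The nested-dict tree representation, the made-up leaf values, and the `count_split_features` function are all assumptions for illustration; they are not how srlearn stores or parses its trees, and srlearn's exact counting rule may differ.

```python
from collections import Counter


def count_split_features(tree, counts=None):
    """Count the feature tested at each internal node of a toy tree.

    Assumed representation (illustrative only): an internal node is a dict
    with "feature", "true", and "false" keys; a leaf is a plain float.
    """
    if counts is None:
        counts = Counter()
    if isinstance(tree, dict):
        counts[tree["feature"]] += 1
        count_split_features(tree["true"], counts)
        count_split_features(tree["false"], counts)
    return counts


# Two hand-written toy trees with made-up leaf values.
toy_trees = [
    {
        "feature": "student",
        "true": -0.2,
        "false": {"feature": "sameperson", "true": 0.8, "false": 0.1},
    },
    {"feature": "student", "true": -0.1, "false": 0.6},
]

# Sum the per-tree counts into one Counter.
totals = Counter()
for tree in toy_trees:
    totals.update(count_split_features(tree))

print(totals)
```

A feature that is tested in many trees ends up with a high count, which is the sense of "importance" used in this example.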

The webkb dataset is available in the relational-datasets package, and the relational-datasets documentation includes a brief overview of it.

Calling load will return training and test folds:

from relational_datasets import load

train, test = load("webkb", fold=1)

We’ll set up the learning problem and fit the classifier:

from srlearn.rdn import BoostedRDNClassifier
from srlearn import Background

bkg = Background(
    modes=[
        "courseprof(-course,+person).",
        "courseprof(+course,-person).",
        "courseta(+course,-person).",
        "courseta(-course,+person).",
        "project(-proj,+person).",
        "project(+proj,-person).",
        "sameperson(-person,+person).",
        "faculty(+person).",
        "student(+person).",
    ],
    number_of_clauses=8,
)

clf = BoostedRDNClassifier(
    background=bkg,
    target="faculty",
    max_tree_depth=3,
    node_size=3,
    n_estimators=10,
)

clf.fit(train)
/home/docs/checkouts/readthedocs.org/user_builds/srlearn/checkouts/latest/srlearn/base.py:70: FutureWarning: solver='BoostSRL' will default to solver='SRLBoost' in 0.6.0, pass one or the other as an argument to suppress this warning.
  ", pass one or the other as an argument to suppress this warning.", FutureWarning)

BoostedRDNClassifier(background=setParam: numOfClauses=8.
setParam: numOfCycles=100.
usePrologVariables: true.
setParam: nodeSize=3.
setParam: maxTreeDepth=3.
mode: courseprof(-course,+person).
mode: courseprof(+course,-person).
mode: courseta(+course,-person).
mode: courseta(-course,+person).
mode: project(-proj,+person).
mode: project(+proj,-person).
mode: sameperson(-person,+person).
mode: faculty(+person).
mode: student(+person).
, n_estimators=10, neg_pos_ratio=2, solver='BoostSRL', target='faculty')

The built-in feature_importances_ attribute of a fitted classifier is a Counter of how many times each feature appears across the trees:

clf.feature_importances_
Counter({'student': 10, 'sameperson': 10})
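Because the result is a standard collections.Counter, it can be ranked or normalized with ordinary Python. The sketch below uses the counts shown above as a literal instead of a fitted classifier, and the `normalize` helper is hypothetical, not part of srlearn.

```python
from collections import Counter

# Counts copied from the output above; with a fitted classifier this
# would be `clf.feature_importances_`.
importances = Counter({"student": 10, "sameperson": 10})


def normalize(counts):
    """Hypothetical helper: convert raw counts to fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in counts.items()}


# most_common() returns (feature, count) pairs sorted by descending count.
ranked = importances.most_common()
shares = normalize(importances)

print(ranked)
print(shares)
```

Normalized shares are easier to compare across classifiers trained with different values of n_estimators, since raw counts grow with the number of trees.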

Importances should generally be interpreted alongside the trees themselves, so we’ll plot the first tree here as well.

It appears that the only features needed to determine if someone is a faculty member can roughly be stated as: “Is this person a student?” and “Do these two names refer to the same person?”

This might be surprising, but it shows that we can induce concepts like “a faculty member is NOT a student.”

from srlearn.plotting import export_digraph, plot_digraph

plot_digraph(export_digraph(clf, 0), format='html')
<srlearn.plotting._GVPlotter object at 0x7eff97a2db10>

