srlearn¶

Introduction¶
srlearn is a project and set of packages for statistical relational artificial intelligence.
Standard machine learning tends to focus on learning and inference over feature vectors (fit a model such that \(\boldsymbol{X}\) predicts \(y\)). Statistical Relational Learning attempts to generalize this to arbitrary graph and hypergraph data, where the prediction problem may include a set of objects with attributes and relations on those objects.
from srlearn.rdn import BoostedRDNClassifier
from srlearn import Background
from srlearn.datasets import load_toy_cancer
train, test = load_toy_cancer()
bk = Background(modes=train.modes)
clf = BoostedRDNClassifier(
background=bk,
target='cancer',
)
clf.fit(train)
clf.predict_proba(test)
# array([0.88079619, 0.88079619, 0.88079619, 0.3075821 , 0.3075821 ])
print(clf.classes_)
# array([1., 1., 1., 0., 0.])
Getting Started¶
1. Prerequisites¶
- Java (1.8)
- Python (3.6, 3.7)
If you do not have Java, you might install it with your operating system’s package manager.
For example, on Ubuntu:
sudo apt-get install openjdk-8-jdk
macOS:
brew install openjdk
Windows (with Chocolatey):
choco install openjdk
Jenv might be a helpful way to manage Java versions as well. If you’re on macOS, it’s also fairly easy to set up with Homebrew.
2. Installation¶
The package can be installed from the Python Package Index (PyPI) with pip.
pip install srlearn
3. Test Installation¶
A simple test should be whether srlearn can be imported:
>>> import srlearn
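A slightly broader check can confirm both prerequisites at once. The sketch below uses only the Python standard library; the helper name check_environment is hypothetical and not part of srlearn. It reports whether a java executable is on the PATH and whether the srlearn package can be located, without importing it.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether a `java` executable and the `srlearn` package can be found.

    Hypothetical helper for illustration; it is not part of srlearn.
    """
    return {
        # shutil.which returns None when no executable is found on the PATH.
        "java": shutil.which("java") is not None,
        # find_spec returns None when a top-level package is not installed.
        "srlearn": importlib.util.find_spec("srlearn") is not None,
    }


status = check_environment()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, revisit the matching step above before continuing.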
If you’ve reached this point, you should be ready for the User Guide.
User Guide¶
This guide walks through how to initialize, parametrize, and invoke the core methods. It may be helpful to consult the API documentation for the following modules as you progress:
Parametrize the core classes¶
1. Looking at our Data¶
This example uses the built-in example data set: “smokes-friends-cancer”. We want to model whether a person will develop cancer based on their smoking habits and their social network.
The conventional way to do machine learning might be to list the people and list their attributes. However, for social network problems it may be difficult or impossible to represent arbitrary social networks in a vector representation.
In order to get around this, we adopt Prolog clauses to represent our data:
>>> from srlearn.datasets import load_toy_cancer
>>> train, _ = load_toy_cancer()
>>> for predicate in train.pos:
... print(predicate)
...
cancer(alice).
cancer(bob).
cancer(chuck).
cancer(fred).
>>> from srlearn.datasets import load_toy_cancer
>>> train, _ = load_toy_cancer()
>>> for predicate in train.facts:
... print(predicate)
...
friends(alice,bob).
friends(alice,fred).
friends(chuck,bob).
friends(chuck,fred).
friends(dan,bob).
friends(earl,bob).
friends(bob,alice).
friends(fred,alice).
friends(bob,chuck).
friends(fred,chuck).
friends(bob,dan).
friends(bob,earl).
smokes(alice).
smokes(chuck).
smokes(bob).
Since this differs from the vector representation, this uses a srlearn.Database object to represent positive examples, negative examples, and facts.
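To make the clause format concrete, here is a minimal sketch that splits facts like the ones printed above into a predicate name and its arguments. It is plain Python for illustration only: parse_fact and FACT_PATTERN are hypothetical names, and this is not how srlearn.Database stores its data internally.

```python
import re
from collections import defaultdict

# Matches ground facts such as "friends(alice,bob)." (illustrative pattern).
FACT_PATTERN = re.compile(r"^(\w+)\(([^()]*)\)\.$")


def parse_fact(fact):
    """Split 'friends(alice,bob).' into ('friends', ('alice', 'bob'))."""
    match = FACT_PATTERN.match(fact.strip())
    if match is None:
        raise ValueError(f"Not a ground fact: {fact!r}")
    name, args = match.groups()
    return name, tuple(arg.strip() for arg in args.split(","))


facts = [
    "friends(alice,bob).",
    "friends(alice,fred).",
    "smokes(alice).",
    "smokes(bob).",
]

# Group argument tuples by predicate name.
by_predicate = defaultdict(list)
for fact in facts:
    name, args = parse_fact(fact)
    by_predicate[name].append(args)

print(dict(by_predicate))
```

Note that friends takes two arguments while smokes takes one, and the number of friends differs per person. That variability is what makes a fixed-width feature vector awkward for this kind of data.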
2. Declaring our Background Knowledge¶
The srlearn.Background object helps declare background knowledge for a domain, as well as some parameters for model learning (this last point may seem strange, but it is designed in order to remain compatible with how BoostSRL accepts background as input).
>>> from srlearn import Background
>>> bk = Background()
>>> print(bk)
setParam: numOfClauses=100.
setParam: numOfCycles=100.
usePrologVariables: true.
setParam: nodeSize=2.
setParam: maxTreeDepth=3.
<BLANKLINE>
This gives us a view into some of the default parameters. However, it is missing mode declarations [1].
We can declare modes as a list of strings:
>>> from srlearn import Background
>>> bk = Background(
... modes=[
... "friends(+person,-person).",
... "friends(-person,+person).",
... "cancer(+person).",
... "smokes(+person).",
... ],
... )
A full description of modes and how they constrain the search space is beyond the scope of the discussion here, but further reading may be warranted [1].
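As a rough aid for reading mode strings, the sketch below splits each mode into a predicate name and a list of (marker, type) pairs. The interpretation in the comments is an assumption based on the mode documentation cited in [1]: "+" is read as an argument that must already be bound to a variable in the clause, and "-" as an argument that may introduce a new variable. The helper parse_mode is hypothetical and not part of srlearn.

```python
import re

# Matches mode strings such as "friends(+person,-person)." (illustrative pattern).
MODE_PATTERN = re.compile(r"^(\w+)\(([^()]*)\)\.$")


def parse_mode(mode):
    """Split a mode string into (predicate, [(marker, type), ...])."""
    match = MODE_PATTERN.match(mode.strip())
    if match is None:
        raise ValueError(f"Not a mode declaration: {mode!r}")
    name, args = match.groups()
    parsed = []
    for arg in args.split(","):
        arg = arg.strip()
        # First character is the marker ("+" or "-"), the rest is the type name.
        parsed.append((arg[0], arg[1:]))
    return name, parsed


modes = [
    "friends(+person,-person).",
    "friends(-person,+person).",
    "cancer(+person).",
    "smokes(+person).",
]

for mode in modes:
    print(parse_mode(mode))
```

Under that reading, the two friends modes allow the search to move across the friendship relation in either direction, starting from a person already mentioned in the clause.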
3. Initializing a Classifier¶
Here we will learn Relational Dependency Networks (RDNs) [2] [3] as classifiers for predicting if a person in this fictional data set will develop cancer.
>>> from srlearn.rdn import BoostedRDNClassifier
>>> from srlearn import Background
>>> bk = Background(
... modes=[
... "friends(+person,-person).",
... "friends(-person,+person).",
... "cancer(+person).",
... "smokes(+person).",
... ],
... )
>>> clf = BoostedRDNClassifier()
>>> print(clf)
BoostedRDNClassifier(background=None, n_estimators=10, neg_pos_ratio=2, solver='BoostSRL', target='None')
This pattern should begin to look familiar if you’ve worked with scikit-learn before.
This classifier is built on top of sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin, but there are still a few things we need to declare before invoking srlearn.rdn.BoostedRDNClassifier.fit().
Specifically, we need to include a “target” and “background” as parameters. The “background” is what we described above, and the “target” is what we aim to learn about: the cancer predicate.
>>> clf = BoostedRDNClassifier(background=bk, target="cancer")
Putting the pieces together¶
Now that we have seen each of the examples, we can put them together to learn a series of trees.
>>> from srlearn.rdn import BoostedRDNClassifier
>>> from srlearn import Background
>>> from srlearn.datasets import load_toy_cancer
>>> train, test = load_toy_cancer()
>>> bk = Background(
... modes=[
... "friends(+person,-person).",
... "friends(-person,+person).",
... "cancer(+person).",
... "smokes(+person).",
... ],
... )
>>> clf = BoostedRDNClassifier(background=bk, target="cancer")
>>> clf.fit(train)
BoostedRDNClassifier(background=setParam: numOfClauses=100.
setParam: numOfCycles=100.
usePrologVariables: true.
setParam: nodeSize=2.
setParam: maxTreeDepth=3.
mode: friends(+person,-person).
mode: friends(-person,+person).
mode: cancer(+person).
mode: smokes(+person).
, n_estimators=10, neg_pos_ratio=2, solver='BoostSRL', target='cancer')
>>> clf.predict(test)
array([ True, True, True, False, False])
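The boolean predictions can be understood as thresholded probabilities. The sketch below is plain Python and uses the probabilities shown in the introduction; the cutoff of 0.5 is an assumption chosen for illustration, since the fitted classifier exposes its own value as clf.threshold_.

```python
# Probabilities copied from the predict_proba output in the introduction.
probabilities = [0.88079619, 0.88079619, 0.88079619, 0.3075821, 0.3075821]

# Assumed cutoff for illustration; a fitted classifier stores its own threshold.
threshold = 0.5

predictions = [p > threshold for p in probabilities]
print(predictions)  # [True, True, True, False, False]
```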
Conclusion¶
For further reading, see the example gallery.
References¶
[1] https://starling.utdallas.edu/software/boostsrl/wiki/basic-modes/
[2] Sriraam Natarajan, Tushar Khot, Kristian Kersting, and Jude Shavlik, “Boosted Statistical Relational Learners: From Benchmarks to Data-Driven Medicine”. SpringerBriefs in Computer Science, ISBN: 978-3-319-13643-1, 2015.
[3] Sriraam Natarajan, Tushar Khot, Kristian Kersting, Bernd Gutmann, and Jude Shavlik, “Gradient-based boosting for statistical relational learning: The relational dependency network case”. Machine Learning Journal (MLJ), 2011.
srlearn API¶
Core Classes¶
These classes form the set of core pieces for describing the data, providing background knowledge, and learning.
- Database(): Database of examples and facts.
- Background(*[, modes, ok_if_unknown, …]): Background Knowledge for a database.
- rdn.BoostedRDNClassifier([background, …]): Relational Dependency Networks Estimator
- rdn.BoostedRDNRegressor([background, …]): Relational Dependency Networks Regressor
Data Sets¶
There are some toy datasets built into the srlearn package. For more datasets, see the relational-datasets package.
- datasets.load_toy_cancer(): Load and return the Toy Cancer dataset.
- datasets.load_toy_father(): Load and return the Toy Father dataset.
Plotting and Visualization¶
These may be helpful for visualizing trees.
- plotting.export_digraph(booster[, …]): Create a digraph representation of a tree.
- plotting.plot_digraph(dot_string[, format]): Plot a digraph as an image.
Utilities¶
Some of these are for behind-the-scenes operations, but tend to be useful for further development (contributions are welcome!).
- base.BaseBoostedRelationalModel(*[, …]): Base class for deriving boosted relational models
- system_manager.FileSystem(): BoostSRL File System
- system_manager.reset([soft]): Reset the FileSystem
Deprecated boostsrl objects¶
This is the old API style that has been deprecated. It is no longer tested or actively developed and is pending removal in 0.6.0.
- srlearn.boostsrl: (Deprecated) boostsrl class for training and testing.
Basic Examples¶
These are some simple examples demonstrating srlearn.
Estimating Feature Importance¶
This demonstrates how to estimate feature importance based on how often features are used as splitting criteria.
webkb is available in the relational-datasets package. A brief webkb overview is available with the relational-datasets documentation.
Calling load will return training and test folds:
from relational_datasets import load
train, test = load("webkb", fold=1)
We’ll set up the learning problem and fit the classifier:
from srlearn.rdn import BoostedRDNClassifier
from srlearn import Background
bkg = Background(
modes=[
"courseprof(-course,+person).",
"courseprof(+course,-person).",
"courseta(+course,-person).",
"courseta(-course,+person).",
"project(-proj,+person).",
"project(+proj,-person).",
"sameperson(-person,+person).",
"faculty(+person).",
"student(+person).",
],
number_of_clauses=8,
)
clf = BoostedRDNClassifier(
background=bkg,
target="faculty",
max_tree_depth=3,
node_size=3,
n_estimators=10,
)
clf.fit(train)
Out:
/home/docs/checkouts/readthedocs.org/user_builds/srlearn/checkouts/stable/srlearn/base.py:70: FutureWarning: solver='BoostSRL' will default to solver='SRLBoost' in 0.6.0, pass one or the other as an argument to suppress this warning.
", pass one or the other as an argument to suppress this warning.", FutureWarning)
BoostedRDNClassifier(background=setParam: numOfClauses=8.
setParam: numOfCycles=100.
usePrologVariables: true.
setParam: nodeSize=3.
setParam: maxTreeDepth=3.
mode: courseprof(-course,+person).
mode: courseprof(+course,-person).
mode: courseta(+course,-person).
mode: courseta(-course,+person).
mode: project(-proj,+person).
mode: project(+proj,-person).
mode: sameperson(-person,+person).
mode: faculty(+person).
mode: student(+person).
, n_estimators=10, neg_pos_ratio=2, solver='BoostSRL', target='faculty')
The built-in feature_importances_ attribute of a fit classifier is a Counter of how many times a feature appears across the trees:
clf.feature_importances_
Out:
Counter({'student': 10, 'sameperson': 10})
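The counting idea itself is simple enough to sketch in plain Python. In the example below, each tree is represented by a hypothetical list of the predicate names used in its splits; the tree contents are made up for illustration and do not come from the fitted model above.

```python
from collections import Counter

# Hypothetical trees: each inner list names the predicates used as splits.
trees = [
    ["student", "sameperson"],
    ["student"],
    ["student", "sameperson"],
]

# Count how often each predicate is used as a split across all trees.
importances = Counter(feature for tree in trees for feature in tree)

print(importances.most_common())  # [('student', 3), ('sameperson', 2)]
```

A count like this says how often a predicate was chosen, not how much each split improved the model, which is one reason to read it alongside the trees.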
Feature importances are best interpreted alongside the trees themselves, so we’ll plot the first tree here as well.
It appears that the only features needed to determine if someone is a faculty member can roughly be stated as: “Is this person a student?” and “Do these two names refer to the same person?”
This might be surprising, but shows that we can induce concepts like “a faculty member is NOT a student.”
from srlearn.plotting import export_digraph, plot_digraph
plot_digraph(export_digraph(clf, 0), format='html')
Out:
<srlearn.plotting._GVPlotter object at 0x7f040d9053d0>
Smokes-Friends-Cancer¶
The smokes-friends-cancer example is a common first example in probabilistic relational models. Here we use this data set to learn a Relational Dependency Network (srlearn.rdn.BoostedRDNClassifier).
This shows how the margin between positive and negative examples is maximized as the number of iterations of boosting increases.

Out:
/home/docs/checkouts/readthedocs.org/user_builds/srlearn/checkouts/stable/srlearn/base.py:70: FutureWarning: solver='BoostSRL' will default to solver='SRLBoost' in 0.6.0, pass one or the other as an argument to suppress this warning.
", pass one or the other as an argument to suppress this warning.", FutureWarning)
<matplotlib.legend.Legend object at 0x7f040d75e750>
from srlearn.rdn import BoostedRDNClassifier
from srlearn import Background
from srlearn.datasets import load_toy_cancer
import numpy as np
import matplotlib.pyplot as plt
train, test = load_toy_cancer()
bk = Background(modes=train.modes)
clf = BoostedRDNClassifier(
background=bk,
target="cancer",
max_tree_depth=2,
node_size=2,
n_estimators=20,
)
clf.fit(train)
x = np.arange(1, 21)
y_pos = []
y_neg = []
thresholds = []
for n_trees in x:
clf.n_estimators = n_trees
probs = clf.predict_proba(test)
thresholds.append(clf.threshold_)
y_pos.append(np.mean(probs[np.nonzero(clf.classes_)]))
y_neg.append(np.mean(probs[clf.classes_ == 0]))
thresholds = np.array(thresholds)
y_pos = np.array(y_pos)
y_neg = np.array(y_neg)
plt.plot(x, y_pos, "b-", label="Mean Probability of positive examples")
plt.plot(x, y_neg, "r-", label="Mean Probability of negative examples")
plt.plot(x, thresholds, "k--", label="Margin")
plt.title("Class Probability vs. Number Trees")
plt.xlabel("Number of Trees")
plt.ylabel("Probability of belonging to Positive Class")
plt.legend(loc="best")
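The widening gap in the plot can be illustrated with a small standalone sketch. It assumes the functional-gradient boosting view described in [3], where an example's probability is the sigmoid of the summed regression values from the trees used so far. The per-tree values below are invented for illustration and are not outputs of the classifier above.

```python
import math


def sigmoid(x):
    """Map a summed regression value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def probabilities_by_stage(tree_values):
    """Return the probability after 1, 2, ..., n trees have been summed."""
    total = 0.0
    stages = []
    for value in tree_values:
        total += value
        stages.append(sigmoid(total))
    return stages


# Hypothetical regression values each tree assigns to one example.
positive_example = [0.8, 0.6, 0.4, 0.2]
negative_example = [-0.7, -0.5, -0.3, -0.1]

pos = probabilities_by_stage(positive_example)
neg = probabilities_by_stage(negative_example)

for n_trees, (p, q) in enumerate(zip(pos, neg), start=1):
    print(f"{n_trees} trees: positive={p:.3f} negative={q:.3f} gap={p - q:.3f}")
```

Because each additional tree pushes the positive example's sum up and the negative example's sum down, the gap between the two probabilities grows with the number of trees in this toy setting.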
Family Relationships Domain¶
Overview: This example motivates learning about family relationships from examples of Harry Potter characters, then applies those rules to characters from Pride and Prejudice.
from srlearn.datasets import load_toy_father
train, test = load_toy_father()
The training examples in the “Toy Father” dataset describe relationships and facts about Harry Potter characters.
The first positive example: father(harrypotter,jamespotter). means “James Potter is the father of Harry Potter.”
The first negative example: father(harrypotter,mrgranger). can be interpreted as “Mr. Granger is not the father of Harry Potter.”
print(train.pos[0], "→ James Potter is the father of Harry Potter.")
print(train.neg[0], " → Mr. Granger is not the father of Harry Potter.")
Out:
father(harrypotter,jamespotter). → James Potter is the father of Harry Potter.
father(harrypotter,mrgranger). → Mr. Granger is not the father of Harry Potter.
The facts contain three additional predicates: childof, male, and siblingof.
train.facts
Out:
['male(mrgranger).', 'male(jamespotter).', 'male(harrypotter).', 'male(luciusmalfoy).', 'male(dracomalfoy).', 'male(arthurweasley).', 'male(ronweasley).', 'male(fredweasley).', 'male(georgeweasley).', 'male(hagrid).', 'male(dumbledore).', 'male(xenophiliuslovegood).', 'male(cygnusblack).', 'siblingof(ronweasley,fredweasley).', 'siblingof(ronweasley,georgeweasley).', 'siblingof(ronweasley,ginnyweasley).', 'siblingof(fredweasley,ronweasley).', 'siblingof(fredweasley,georgeweasley).', 'siblingof(fredweasley,ginnyweasley).', 'siblingof(georgeweasley,ronweasley).', 'siblingof(georgeweasley,fredweasley).', 'siblingof(georgeweasley,ginnyweasley).', 'siblingof(ginnyweasley,ronweasley).', 'siblingof(ginnyweasley,fredweasley).', 'siblingof(ginnyweasley,georgeweasley).', 'childof(mrgranger,hermione).', 'childof(mrsgranger,hermione).', 'childof(jamespotter,harrypotter).', 'childof(lilypotter,harrypotter).', 'childof(luciusmalfoy,dracomalfoy).', 'childof(narcissamalfoy,dracomalfoy).', 'childof(arthurweasley,ronweasley).', 'childof(mollyweasley,ronweasley).', 'childof(arthurweasley,fredweasley).', 'childof(mollyweasley,fredweasley).', 'childof(arthurweasley,georgeweasley).', 'childof(mollyweasley,georgeweasley).', 'childof(arthurweasley,ginnyweasley).', 'childof(mollyweasley,ginnyweasley).', 'childof(xenophiliuslovegood,lunalovegood).', 'childof(cygnusblack,narcissamalfoy).']
Our aim is to learn about what a “father” is in terms of the facts we have available. This process is usually called induction, and is often portrayed as “learning a definition of an object.”
from srlearn.rdn import BoostedRDNClassifier
from srlearn import Background
bk = Background(
modes=[
"male(+name).",
"father(+name,+name).",
"childof(+name,+name).",
"siblingof(+name,+name)."
],
number_of_clauses=8,
)
clf = BoostedRDNClassifier(
background=bk,
target="father",
node_size=1,
n_estimators=5,
)
clf.fit(train)
Out:
/home/docs/checkouts/readthedocs.org/user_builds/srlearn/checkouts/stable/srlearn/base.py:70: FutureWarning: solver='BoostSRL' will default to solver='SRLBoost' in 0.6.0, pass one or the other as an argument to suppress this warning.
", pass one or the other as an argument to suppress this warning.", FutureWarning)
BoostedRDNClassifier(background=setParam: numOfClauses=8.
setParam: numOfCycles=100.
usePrologVariables: true.
setParam: nodeSize=1.
setParam: maxTreeDepth=3.
mode: male(+name).
mode: father(+name,+name).
mode: childof(+name,+name).
mode: siblingof(+name,+name).
, n_estimators=5, neg_pos_ratio=2, solver='BoostSRL', target='father')
It’s important to check whether we actually learn something useful. We’ll visually inspect the relational regression trees to see what they learned.
from srlearn.plotting import plot_digraph
from srlearn.plotting import export_digraph
plot_digraph(export_digraph(clf, 0), format="html")
Out:
<srlearn.plotting._GVPlotter object at 0x7f040d25e190>
There is some variance between runs, but the concept that the trees pick up on is roughly that “A father has a child and is male.”
plot_digraph(export_digraph(clf, 1), format="html")
Out:
<srlearn.plotting._GVPlotter object at 0x7f040d25e290>
Here the data is fairly complete, and the concept that “A father has a child and is male” seems sufficient for the purposes of this data. Let’s apply our learned model to the test data, which includes facts about characters from Jane Austen’s Pride and Prejudice.
predictions = clf.predict_proba(test)
print("{:<35} {}".format("Predicate", "Probability of being True"), "\n", "-" * 60)
for predicate, prob in zip(test.pos + test.neg, predictions):
print("{:<35} {:.2f}".format(predicate, prob))
Out:
Predicate Probability of being True
------------------------------------------------------------
father(elizabeth,mrbennet). 0.66
father(jane,mrbennet). 0.66
father(charlotte,mrlucas). 0.66
father(charlotte,mrsbennet). 0.08
father(jane,mrlucas). 0.09
father(mrsbennet,mrbennet). 0.09
father(jane,elizabeth). 0.08
The confidence might be a little low, which is a good excuse to mention one of the hyperparameters. “Node Size,” or node_size, corresponds to the maximum number of predicates that can be used as a split in the dependency network. We set node_size=1 above for demonstration, but the concept that seems to be learned: father(A, B) = [childof(B, A), male(B)] is of size 2.
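To see why this concept needs two predicates, the rule can be checked by hand against a few of the training facts. The sketch below is plain Python, not srlearn code: it stores a small subset of the facts listed earlier as tuples and tests the conjunction childof(B, A) and male(B) directly.

```python
# A small subset of the training facts, stored as tuples for lookup.
facts = {
    ("male", "jamespotter"),
    ("male", "mrgranger"),
    ("male", "harrypotter"),
    ("childof", "jamespotter", "harrypotter"),
    ("childof", "lilypotter", "harrypotter"),
    ("childof", "mrgranger", "hermione"),
}


def father(a, b):
    """father(A, B) holds when childof(B, A) and male(B) are both facts."""
    return ("childof", b, a) in facts and ("male", b) in facts


print(father("harrypotter", "jamespotter"))  # True
print(father("harrypotter", "mrgranger"))    # False: no childof fact
print(father("harrypotter", "lilypotter"))   # False: no male fact
```

The last case shows why a single predicate is not enough here: childof alone would also accept Lily Potter, so both conditions have to appear in the same split.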
We might be able to learn a better model by taking this new information into account:
bk = Background(
modes=[
"male(+name).",
"father(+name,+name).",
"childof(+name,+name).",
"siblingof(+name,+name)."
],
number_of_clauses=8,
)
clf = BoostedRDNClassifier(
background=bk,
target="father",
node_size=2, # <--- Changed from 1 to 2
n_estimators=5,
)
clf.fit(train)
plot_digraph(export_digraph(clf, 0), format="html")
Out:
/home/docs/checkouts/readthedocs.org/user_builds/srlearn/checkouts/stable/srlearn/base.py:70: FutureWarning: solver='BoostSRL' will default to solver='SRLBoost' in 0.6.0, pass one or the other as an argument to suppress this warning.
", pass one or the other as an argument to suppress this warning.", FutureWarning)
<srlearn.plotting._GVPlotter object at 0x7f040d23f450>
This seems to be much more stable, which should also be reflected in the probabilities assigned on test examples.
predictions = clf.predict_proba(test)
print("{:<35} {}".format("Predicate", "Probability of being True"), "\n", "-" * 60)
for predicate, prob in zip(test.pos + test.neg, predictions):
print("{:<35} {:.2f}".format(predicate, prob))
Out:
Predicate Probability of being True
------------------------------------------------------------
father(elizabeth,mrbennet). 0.74
father(jane,mrbennet). 0.74
father(charlotte,mrlucas). 0.74
father(charlotte,mrsbennet). 0.07
father(jane,mrlucas). 0.08
father(mrsbennet,mrbennet). 0.08
father(jane,elizabeth). 0.07
Questions? Contact Alexander L. Hayes (hayesall@iu.edu)
Getting started¶
Prerequisites and installation instructions for getting started with this package.
User guide¶
Guide to instantiate, parametrize, and invoke the core methods using a built-in data set.
API documentation¶
Full documentation for the modules.
Example gallery¶
A gallery of examples with figures and expected outputs. It complements and extends past the basic example from the User Guide.
Citing¶
If you find this helpful in your work, please consider citing:
@misc{hayes2019srlearn,
title={srlearn: A Python Library for Gradient-Boosted Statistical Relational Models},
author={Alexander L. Hayes},
year={2019},
eprint={1912.08198},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Contributing¶
Many thanks to those who have already made contributions:
- Alexander L. Hayes, Indiana University, Bloomington
- Harsha Kokel, The University of Texas at Dallas
- Siwen Yan, The University of Texas at Dallas
Many thanks to the known and unknown contributors to WILL/BoostSRL/SRLBoost, including: Navdeep Kaur, Nandini Ramanan, Srijita Das, Mayukh Das, Kaushik Roy, Devendra Singh Dhami, Shuo Yang, Phillip Odom, Tushar Khot, Gautam Kunapuli, Sriraam Natarajan, Trevor Walker, and Jude W. Shavlik.
We have adopted the Contributor Covenant Code of Conduct version 1.4. Please read, follow, and report any incidents which violate this.
Questions, Issues, and Pull Requests are welcome. Please refer to CONTRIBUTING.md for information on submitting issues and pull requests.