User Guide

This guide walks through how to initialize, parametrize, and invoke the core methods. It may be helpful to consult the API documentation as you progress.

Parametrize the core classes

1. Looking at our Data

This example uses the built-in example data set: “smokes-friends-cancer”. We want to model whether a person will develop cancer based on their smoking habits and their social network.

The conventional approach in machine learning is to list the people and describe each one with a fixed-length vector of attributes. For social network problems, however, it can be difficult or impossible to represent an arbitrary network this way, because relationships between people do not fit into a fixed set of columns.
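To make the difficulty concrete, here is a minimal sketch in plain Python (it does not use srlearn). It encodes a few people from the example data set as vectors with one "smokes" column plus one column per possible friend. The helper to_vectors and the extra person "Gina" are hypothetical and exist only for this illustration.

```python
# Friendship pairs and smokers taken from the example data set in this guide.
friends = [
    ("Alice", "Bob"), ("Alice", "Fred"), ("Chuck", "Bob"),
    ("Chuck", "Fred"), ("Dan", "Bob"), ("Earl", "Bob"),
]
smokers = {"Alice", "Chuck", "Bob"}


def to_vectors(friend_pairs, smokers):
    """Hypothetical helper: encode each person as [smokes, friend_of_person_0, ...]."""
    people = sorted({name for pair in friend_pairs for name in pair} | smokers)
    index = {name: i for i, name in enumerate(people)}
    vectors = {name: [int(name in smokers)] + [0] * len(people) for name in people}
    for a, b in friend_pairs:
        # Friendship is treated as symmetric, as in the example facts.
        vectors[a][1 + index[b]] = 1
        vectors[b][1 + index[a]] = 1
    return people, vectors


people, vectors = to_vectors(friends, smokers)

# Adding one (hypothetical) person widens every existing vector by one column.
wider_people, wider_vectors = to_vectors(friends + [("Gina", "Alice")], smokers)
print(len(vectors["Alice"]), len(wider_vectors["Alice"]))  # 7 8
```

The width of every vector depends on how many people exist, so a model trained on one network cannot be applied directly to a network of a different size.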

To get around this, we represent our data as Prolog clauses:

>>> from srlearn import example_data
>>> for predicate in example_data.train.pos:
...     print(predicate)
...
cancer(Alice).
cancer(Bob).
cancer(Chuck).
cancer(Fred).
>>> from srlearn import example_data
>>> for predicate in example_data.train.facts:
...     print(predicate)
...
friends(Alice, Bob).
friends(Alice, Fred).
friends(Chuck, Bob).
friends(Chuck, Fred).
friends(Dan, Bob).
friends(Earl, Bob).
friends(Bob, Alice).
friends(Fred, Alice).
friends(Bob, Chuck).
friends(Fred, Chuck).
friends(Bob, Dan).
friends(Bob, Earl).
smokes(Alice).
smokes(Chuck).
smokes(Bob).

Because this differs from the vector representation, srlearn uses a srlearn.Database object to hold the positive examples, negative examples, and facts.
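As a rough illustration of what the clause strings encode, the sketch below parses ground facts into predicate names and argument tuples, then answers a simple relational query. It is plain Python and not part of srlearn; FACT_PATTERN and parse_fact are hypothetical names, and the regular expression only handles simple facts like the ones printed above.

```python
import re
from collections import defaultdict

# Assumed shape of a ground fact: name(arg1, arg2, ...).
FACT_PATTERN = re.compile(r"^\s*(\w+)\(([^()]*)\)\.\s*$")


def parse_fact(clause):
    """Hypothetical helper: split 'friends(Alice, Bob).' into ('friends', ('Alice', 'Bob'))."""
    match = FACT_PATTERN.match(clause)
    if match is None:
        raise ValueError(f"not a simple ground fact: {clause!r}")
    name, raw_args = match.groups()
    return name, tuple(arg.strip() for arg in raw_args.split(","))


# A subset of the facts shown above.
facts = [
    "friends(Alice, Bob).",
    "friends(Alice, Fred).",
    "smokes(Alice).",
    "smokes(Bob).",
]

# Group argument tuples by predicate name.
relations = defaultdict(set)
for clause in facts:
    name, args = parse_fact(clause)
    relations[name].add(args)

# Which of Alice's friends smoke?
smoking_friends = sorted(
    b for (a, b) in relations["friends"]
    if a == "Alice" and (b,) in relations["smokes"]
)
print(smoking_friends)  # ['Bob']
```

Each predicate becomes its own relation of any arity, so adding people or friendships only adds rows and never changes the shape of the existing data.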

2. Declaring our Background Knowledge

The srlearn.Background object declares background knowledge for a domain, along with some parameters for model learning. Bundling learning parameters with background knowledge may seem strange, but it keeps the object compatible with how BoostSRL accepts background as input.

>>> from srlearn import Background
>>> bk = Background()
>>> print(bk)
setParam: nodeSize=2.
setParam: maxTreeDepth=3.
setParam: numberOfClauses=100.
setParam: numberOfCycles=100.
<BLANKLINE>

This gives us a view into some of the default parameters. However, it is missing mode declarations [1].

We can declare modes as a list of strings:

>>> from srlearn import Background
>>> bk = Background(
...     modes=[
...         "friends(+person,-person).",
...         "friends(-person,+person).",
...         "cancer(+person).",
...         "smokes(+person).",
...     ],
... )

A full description of modes and how they constrain the search space is beyond the scope of this guide; see [1] for further reading.
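As a quick orientation, the sketch below splits a mode string into its predicate name and typed arguments. It is plain Python, not srlearn's own parser; MODE_PATTERN, SYMBOLS, and parse_mode are hypothetical names. The reading of the prefix symbols ("+" for a variable that must already be bound, "-" for a variable that may be newly introduced, "#" for a constant) is the usual inductive logic programming convention and is assumed here; [1] remains the authoritative description.

```python
import re

# Assumed shape of a mode string: name(+type, -type, ...).
MODE_PATTERN = re.compile(r"^\s*(\w+)\(([^()]*)\)\.\s*$")

# Assumed meaning of each prefix symbol (common ILP convention).
SYMBOLS = {"+": "input", "-": "output", "#": "constant"}


def parse_mode(mode):
    """Hypothetical helper: return (predicate, [(role, type), ...]) for one mode string."""
    match = MODE_PATTERN.match(mode)
    if match is None:
        raise ValueError(f"not a mode declaration: {mode!r}")
    name, raw_args = match.groups()
    arguments = []
    for arg in raw_args.split(","):
        arg = arg.strip()
        if not arg or arg[0] not in SYMBOLS:
            raise ValueError(f"argument without a mode symbol: {arg!r}")
        arguments.append((SYMBOLS[arg[0]], arg[1:]))
    return name, arguments


modes = [
    "friends(+person,-person).",
    "friends(-person,+person).",
    "cancer(+person).",
    "smokes(+person).",
]
for mode in modes:
    print(parse_mode(mode))
```

Under this reading, "friends(+person,-person)." lets the search start from a person it already knows about and introduce that person's friend as a new variable, which is how a learned clause can reach from the target person to their neighbors.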

3. Initializing a Classifier

Here we will learn Relational Dependency Networks (RDNs) [2] [3] as classifiers for predicting if a person in this fictional data set will develop cancer.

>>> from srlearn.rdn import BoostedRDN
>>> from srlearn import Background
>>> bk = Background(
...     modes=[
...         "friends(+person,-person).",
...         "friends(-person,+person).",
...         "cancer(+person).",
...         "smokes(+person).",
...     ],
... )
>>> clf = BoostedRDN()
>>> print(clf)
BoostedRDN(background=None, max_tree_depth=3, n_estimators=10, node_size=2,
           target='None')

This pattern should begin to look familiar if you’ve worked with scikit-learn before. This classifier is built on top of sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin, but there are still a few things we need to declare before invoking srlearn.rdn.BoostedRDN.fit().

Specifically, we need to include a “target” and “background” as parameters. The “background” is what we described above, and the “target” is what we aim to learn about: the cancer predicate.

>>> clf = BoostedRDN(background=bk, target="cancer")

Putting the pieces together

Now that we have seen each of the pieces, we can put them together to learn a series of trees.

>>> from srlearn.rdn import BoostedRDN
>>> from srlearn import Background
>>> from srlearn import example_data
>>> bk = Background(
...     modes=[
...         "friends(+person,-person).",
...         "friends(-person,+person).",
...         "cancer(+person).",
...         "smokes(+person).",
...     ],
... )
>>> clf = BoostedRDN(background=bk, target="cancer")
>>> clf.fit(example_data.train)
BoostedRDN(background=setParam: nodeSize=2.
setParam: maxTreeDepth=3.
setParam: numberOfClauses=100.
setParam: numberOfCycles=100.
useStdLogicVariables: true.
mode: friends(+person,-person).
mode: friends(-person,+person).
mode: cancer(+person).
mode: smokes(+person).
,
           max_tree_depth=3, n_estimators=10, node_size=2, target='cancer')
>>> clf.predict(example_data.test)
array([ True,  True,  True, False, False])

Conclusion

For further reading, see the example gallery.

References

[1] https://starling.utdallas.edu/software/boostsrl/wiki/basic-modes/
[2] Sriraam Natarajan, Tushar Khot, Kristian Kersting, and Jude Shavlik, “Boosted Statistical Relational Learners: From Benchmarks to Data-Driven Medicine”. SpringerBriefs in Computer Science, ISBN: 978-3-319-13643-1, 2015.
[3] Sriraam Natarajan, Tushar Khot, Kristian Kersting, Bernd Gutmann, and Jude Shavlik, “Gradient-based boosting for statistical relational learning: The relational dependency network case”. Machine Learning Journal (MLJ), 2011.