Overview

Distributions implements low-level primitives for Bayesian MCMC inference in Python and C++, including:

  • special numerical functions distributions.<flavor>.special,
  • samplers and density functions for a variety of distributions, distributions.<flavor>.random,
  • conjugate component models (e.g., gamma-Poisson, normal-inverse-chi-squared) distributions.<flavor>.models, and
  • clustering models (e.g., CRP, Pitman-Yor) distributions.<flavor>.clustering.

Python implementations are provided in up to three flavors:

  • Debug distributions.dbg are pure-python implementations for correctness auditing and error checking, and allow debugging via pdb.
  • High-Precision distributions.hp are cython implementations for fast inference in python, and serve as the numerical reference.
  • Low-Precision distributions.lp are inefficient python wrappers around blazingly fast C++ implementations, intended mostly for checking that the C++ implementations are correct.

Our typical workflow is to first prototype models in pure python, then prototype faster inference applications using the cython models, and finally implement optimized, scalable inference products in C++, testing all implementations for correctness along the way.
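
The same API is exposed under each flavor, so implementations can be swapped by changing a single import. A minimal sketch using the feature model API documented below (the gamma-Poisson module name gp is an assumption and may differ across versions):

    # compare the pure-python and C++-backed flavors of one model;
    # the module name gp is an assumption
    from distributions.dbg.models import gp as dbg_gp
    from distributions.lp.models import gp as lp_gp

    for Model in (dbg_gp, lp_gp):
        example = Model.EXAMPLES[0]
        shared = Model.Shared.from_dict(example['shared'])
        group = Model.Group.from_values(shared, example['values'])
        print(Model.__name__, group.score_value(shared, example['values'][0]))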

Feature Model API

Feature models are implemented as modules in python and structs in C++. Below, we write Model.thing to denote module.thing in python and Model::thing in C++.

Most functions consume entropy: C++ functions take an explicit rng argument, while python functions implicitly use a hidden global source (see Source of Entropy below).

Below, json denotes a python dict/list/number/string suitable for serialization with the json package.

Each feature model API consists of the following (a worked example in python appears after this list):

  • Datatypes.

    • Shared - shared global model state including fixed parameters, hyperparameters, and, for datatypes with dynamic support, shared sufficient statistics.
    • Value - observation state, i.e., a datum.
    • Group - local component state, including sufficient statistics and possibly group parameters.
    • Sampler - partially evaluated per-group sampling function (optional in python).
    • Scorer - cached per-group scoring function (optional in python).
    • Mixture - vectorized scoring functions for mixture models (optional in python).
  • Shared operations. These should be simple and fast:

    shared = Model.Shared()
    shared.protobuf_load(message)
    shared.protobuf_dump(message)
    shared.load(json)                               # python only
    shared.dump() -> json                           # python only
    
    Shared.from_dict(json) -> shared                # python only
    Shared.from_protobuf(message) -> shared         # python only
    Shared.to_protobuf(message)                     # python only
    
    shared.add_value(value)
    shared.add_repeated_value(value)
    shared.remove_value(value)
    shared.realize()
    shared.plus_group(group) -> shared              # optional
    
  • Group operations. These should be simple and fast, and may consume entropy:

    group = Model.Group()
    group.protobuf_load(message)
    group.protobuf_dump(message)
    group.load(json)                                # python only
    group.dump() -> json                            # python only
    
    Group.from_values(shared, values) -> group      # python only
    Group.from_dict(json) -> group                  # python only
    Group.from_protobuf(message) -> group           # python only
    Group.to_protobuf(message)                      # python only
    
    group.init(shared)
    group.add_value(shared, value)
    group.add_repeated_value(shared, value, count)
    group.remove_value(shared, value)
    group.merge(shared, other_group)
    group.sample_value(shared)
    group.score_value(shared, value)
    group.validate()                                # C++ only
    
  • Sampling. These may consume entropy:

    sampler = Model.Sampler()
    sampler.init(shared, group)
    sampler.eval(shared) -> value
    group.sample_value(shared) -> value
    Model.sample_group(shared, group_size) -> group   # python only
    
  • Scoring. These may also consume entropy, e.g., when implemented using Monte Carlo integration:

    scorer = Model.Scorer()
    scorer.init(shared, group)
    scorer.eval(shared, value) -> float
    group.score_value(shared, value) -> float
    
  • Mixture Slaves (optional in python). These provide batch operations on a collection of groups:

    mixture = Model.Mixture()
    mixture.groups().push_back(group)                 # C++ only
    mixture.append(group)                             # python only
    mixture.init(shared)
    mixture.add_group(shared)
    mixture.remove_group(shared, groupid)
    mixture.add_value(shared, groupid, value)
    mixture.remove_value(shared, groupid, value)
    mixture.score_value(shared, value, scores_accum)
    mixture.score_data(shared) -> float
    mixture.score_data_grid(shareds, scores_out)      # C++ only
    
  • Testing metadata. Example model parameters and datasets are automatically discovered by the unit test infrastructure, reducing the cost of per-model test writing:

    # in python
    for example in Model.EXAMPLES:
        shared = Model.Shared.from_dict(example['shared'])
        values = example['values']
        ...
    
    // in C++
    Model::Shared shared = Model::Shared::EXAMPLE();
    ...
    
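A worked example in python combining the calls above (a hedged sketch: the gamma-Poisson module name gp and the float32 accumulator dtype are assumptions; all other calls follow the signatures listed in this section):

    import numpy
    from distributions.dbg.models import gp as Model  # module name assumed

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    values = example['values']

    # group operations: accumulate sufficient statistics
    group = Model.Group()
    group.init(shared)
    for value in values:
        group.add_value(shared, value)

    # sampling and scoring through the group
    value = group.sample_value(shared)         # consumes entropy
    score = group.score_value(shared, value)   # log density

    # cached sampler and scorer for repeated per-group work
    sampler = Model.Sampler()
    sampler.init(shared, group)
    value = sampler.eval(shared)

    scorer = Model.Scorer()
    scorer.init(shared, group)
    score = scorer.eval(shared, value)

    # vectorized scoring over a collection of groups
    mixture = Model.Mixture()
    mixture.append(group)
    mixture.init(shared)
    scores = numpy.zeros(1, dtype=numpy.float32)  # one slot per group; dtype assumed
    mixture.score_value(shared, value, scores)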

Clustering Model API

  • Sampling and scoring (see the sketch after this list):

    model = Model()
    model.sample_assignments(sample_size)
    model.score_counts(counts)
    model.score_add_value(...)
    model.score_remove_value(...)
    
  • Mixture driver (optional in python). These provide batch operations on a collection of groups, referencing a clustering model:

    mixture = Model.Mixture()
    mixture.counts().push_back(count)                       # C++ only
    mixture.init(model)                                     # C++ only
    mixture.init(model, counts)                             # python only
    mixture.remove_group(model, groupid)
    mixture.add_value(model, groupid, value) -> bool
    mixture.remove_value(model, groupid, value) -> bool
    mixture.score_value(model, value, scores_out)
    mixture.score_data(model) -> float
    

    Mixture drivers and slaves coordinate using the following pattern:

    # driver pairs a clustering model (driver.shared) with its mixture driver
    # each slave pairs a feature model (slave.shared) with its mixture slave
    
    def add_value(driver, slaves, groupid, value):
        # added is True iff the value started a fresh group, in which
        # case every slave must append a new empty group
        added = driver.mixture.add_value(driver.shared, groupid, value)
        for slave in slaves:
            slave.mixture.add_value(slave.shared, groupid, value)
            if added:
                slave.mixture.add_group(slave.shared)
    
    def remove_value(driver, slaves, groupid, value):
        # removed is True iff the group was left empty, in which case
        # every slave must drop it
        removed = driver.mixture.remove_value(driver.shared, groupid, value)
        for slave in slaves:
            slave.mixture.remove_value(slave.shared, groupid, value)
            if removed:
                slave.mixture.remove_group(slave.shared, groupid)
    

    See examples/mixture/main.py for a working example.

  • Testing metadata (python only). Example model parameters and datasets are automatically discovered by the unit test infrastructure, reducing the cost of per-model test writing:

    ExampleModel.EXAMPLES = [ ...model specific... ]
    
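A sketch of sampling and scoring with a clustering model directly (hedged assumptions: the PitmanYor class name, its load method and hyperparameter names, and that sample_assignments returns one group id per item):

    from distributions.dbg.clustering import PitmanYor  # class name assumed

    model = PitmanYor()
    model.load({'alpha': 1.0, 'd': 0.1})  # hypothetical hyperparameters

    # sample a partition of 100 items into groups
    assignments = model.sample_assignments(100)

    # score the partition via its per-group counts
    counts = [0] * (max(assignments) + 1)
    for groupid in assignments:
        counts[groupid] += 1
    score = model.score_counts(counts)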

Source of Entropy

The C++ methods explicitly require a random number generator rng everywhere entropy may be consumed. The python models maintain compatibility with numpy.random by hiding this source, either as the global numpy.random generator or as a single global_rng in the wrapped C++.
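
For example, seeding the global numpy source makes python-flavor sampling reproducible (a hedged sketch: the gp module name is an assumption, and the global_rng of the wrapped-C++ lp flavor may require separate seeding, not shown here):

    import numpy
    from distributions.dbg.models import gp as Model  # module name assumed

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    group = Model.Group.from_values(shared, example['values'])

    numpy.random.seed(0)
    first = group.sample_value(shared)
    numpy.random.seed(0)
    assert group.sample_value(shared) == first  # same seed, same draw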