Overview

Distributions implements low-level primitives for Bayesian MCMC inference in Python and C++, including:

  • special numerical functions distributions.<flavor>.special,
  • samplers and density functions for a variety of distributions, distributions.<flavor>.random,
  • conjugate component models (e.g., gamma-Poisson, normal-inverse-chi-squared) distributions.<flavor>.models, and
  • clustering models (e.g., CRP, Pitman-Yor) distributions.<flavor>.clustering.

Python implementations are provided in up to three flavors:

  • Debug distributions.dbg are pure-python implementations for correctness auditing and error checking, and allow debugging via pdb.
  • High-Precision distributions.hp are cython implementations for fast inference in python, and serve as the numerical reference.
  • Low-Precision distributions.lp are inefficient python wrappers around blazingly fast C++ implementations, intended mostly for checking that the C++ implementations are correct.

Our typical workflow is to first prototype models in pure python, then prototype faster inference applications using the cython models, and finally implement optimized, scalable inference products in C++, testing all implementations for correctness along the way.
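
The same API is exposed under each flavor, so implementations can be swapped by changing a single import. A minimal sketch using the feature model API documented below (the gamma-Poisson module name gp is an assumption and may differ across versions):

    # compare the pure-python and C++-backed flavors of one model;
    # the module name gp is an assumption
    from distributions.dbg.models import gp as dbg_gp
    from distributions.lp.models import gp as lp_gp

    for Model in (dbg_gp, lp_gp):
        example = Model.EXAMPLES[0]
        shared = Model.Shared.from_dict(example['shared'])
        group = Model.Group.from_values(shared, example['values'])
        print(Model.__name__, group.score_value(shared, example['values'][0]))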

Feature Model API

Feature models are implemented as modules in python and structs in C++. Below, we write Model.thing to denote module.thing in python and Model::thing in C++.

Most functions consume entropy: C++ functions take an explicit rng argument, while python functions implicitly use a hidden global source (see Source of Entropy below).

Below, json denotes a python dict/list/number/string suitable for serialization with the json package.

Each feature model API consists of the following (a worked example in python appears after this list):

  • Datatypes.

    • Shared - shared global model state including fixed parameters, hyperparameters, and, for datatypes with dynamic support, shared sufficient statistics.
    • Value - observation state, i.e., a datum.
    • Group - local component state, including sufficient statistics and possibly group parameters.
    • Sampler - partially evaluated per-group sampling function (optional in python).
    • Scorer - cached per-group scoring function (optional in python).
    • Mixture - vectorized scoring functions for mixture models (optional in python).
  • Shared operations. These should be simple and fast:

    shared = Model.Shared()
    shared.protobuf_load(message)
    shared.protobuf_dump(message)
    shared.load(json)                               # python only
    shared.dump() -> json                           # python only
    
    Shared.from_dict(json) -> shared                # python only
    Shared.from_protobuf(message) -> shared         # python only
    Shared.to_protobuf(message)                     # python only
    
    shared.add_value(value)
    shared.add_repeated_value(value)
    shared.remove_value(value)
    shared.realize()
    shared.plus_group(group) -> shared              # optional
    
  • Group operations. These should be simple and fast, and may consume entropy:

    group = Model.Group()
    group.protobuf_load(message)
    group.protobuf_dump(message)
    group.load(json)                                # python only
    group.dump() -> json                            # python only
    
    Group.from_values(shared, values) -> group      # python only
    Group.from_dict(json) -> group                  # python only
    Group.from_protobuf(message) -> group           # python only
    Group.to_protobuf(message)                      # python only
    
    group.init(shared)
    group.add_value(shared, value)
    group.add_repeated_value(shared, value, count)
    group.remove_value(shared, value)
    group.merge(shared, other_group)
    group.sample_value(shared)
    group.score_value(shared, value)
    group.validate()                                # C++ only
    
  • Sampling. These may consume entropy:

    sampler = Model.Sampler()
    sampler.init(shared, group)
    sampler.eval(shared) -> value
    group.sample_value(shared) -> value
    Model.sample_group(shared, group_size) -> group   # python only
    
  • Scoring. These may also consume entropy, e.g., when implemented using Monte Carlo integration:

    scorer = Model.Scorer()
    scorer.init(shared, group)
    scorer.eval(shared, value) -> float
    group.score_value(shared, value) -> float
    
  • Mixture Slaves (optional in python). These provide batch operations on a collection of groups:

    mixture = Model.Mixture()
    mixture.groups().push_back(group)                 # C++ only
    mixture.append(group)                             # python only
    mixture.init(shared)
    mixture.add_group(shared)
    mixture.remove_group(shared, groupid)
    mixture.add_value(shared, groupid, value)
    mixture.remove_value(shared, groupid, value)
    mixture.score_value(shared, value, scores_accum)
    mixture.score_data(shared) -> float
    mixture.score_data_grid(shareds, scores_out)      # C++ only
    
  • Testing metadata. Example model parameters and datasets are automatically discovered by the unit test infrastructure, reducing the cost of per-model test writing:

    # in python
    for example in Model.EXAMPLES:
        shared = Model.Shared.from_dict(example['shared'])
        values = example['values']
        ...
    
    // in C++
    Model::Shared shared = Model::Shared::EXAMPLE();
    ...
    
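A worked example in python combining the calls above (a hedged sketch: the gamma-Poisson module name gp and the float32 accumulator dtype are assumptions; all other calls follow the signatures listed in this section):

    import numpy
    from distributions.dbg.models import gp as Model  # module name assumed

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    values = example['values']

    # group operations: accumulate sufficient statistics
    group = Model.Group()
    group.init(shared)
    for value in values:
        group.add_value(shared, value)

    # sampling and scoring through the group
    value = group.sample_value(shared)         # consumes entropy
    score = group.score_value(shared, value)   # log density

    # cached sampler and scorer for repeated per-group work
    sampler = Model.Sampler()
    sampler.init(shared, group)
    value = sampler.eval(shared)

    scorer = Model.Scorer()
    scorer.init(shared, group)
    score = scorer.eval(shared, value)

    # vectorized scoring over a collection of groups
    mixture = Model.Mixture()
    mixture.append(group)
    mixture.init(shared)
    scores = numpy.zeros(1, dtype=numpy.float32)  # one slot per group; dtype assumed
    mixture.score_value(shared, value, scores)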

Clustering Model API

  • Sampling and scoring (see the sketch after this list):

    model = Model()
    model.sample_assignments(sample_size)
    model.score_counts(counts)
    model.score_add_value(...)
    model.score_remove_value(...)
    
  • Mixture driver (optional in python). These provide batch operations on a collection of groups, referencing a clustering model:

    mixture = Model.Mixture()
    mixture.counts().push_back(count)                       # C++ only
    mixture.init(model)                                     # C++ only
    mixture.init(model, counts)                             # python only
    mixture.remove_group(model, groupid)
    mixture.add_value(model, groupid, value) -> bool
    mixture.remove_value(model, groupid, value) -> bool
    mixture.score_value(model, value, scores_out)
    mixture.score_data(model) -> float
    

    Mixture drivers and slaves coordinate using the following pattern:

    # driver pairs a clustering model (driver.shared) with its mixture driver
    # each slave pairs a feature model (slave.shared) with its mixture slave
    
    def add_value(driver, slaves, groupid, value):
        # added is True iff the value started a fresh group, in which
        # case every slave must append a new empty group
        added = driver.mixture.add_value(driver.shared, groupid, value)
        for slave in slaves:
            slave.mixture.add_value(slave.shared, groupid, value)
            if added:
                slave.mixture.add_group(slave.shared)
    
    def remove_value(driver, slaves, groupid, value):
        # removed is True iff the group was left empty, in which case
        # every slave must drop it
        removed = driver.mixture.remove_value(driver.shared, groupid, value)
        for slave in slaves:
            slave.mixture.remove_value(slave.shared, groupid, value)
            if removed:
                slave.mixture.remove_group(slave.shared, groupid)
    

    See examples/mixture/main.py for a working example.

  • Testing metadata (python only). Example model parameters and datasets are automatically discovered by the unit test infrastructure, reducing the cost of per-model test writing:

    ExampleModel.EXAMPLES = [ ...model specific... ]
    
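A sketch of sampling and scoring with a clustering model directly (hedged assumptions: the PitmanYor class name, its load method and hyperparameter names, and that sample_assignments returns one group id per item):

    from distributions.dbg.clustering import PitmanYor  # class name assumed

    model = PitmanYor()
    model.load({'alpha': 1.0, 'd': 0.1})  # hypothetical hyperparameters

    # sample a partition of 100 items into groups
    assignments = model.sample_assignments(100)

    # score the partition via its per-group counts
    counts = [0] * (max(assignments) + 1)
    for groupid in assignments:
        counts[groupid] += 1
    score = model.score_counts(counts)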

Source of Entropy

The C++ methods explicitly require a random number generator rng everywhere entropy may be consumed. The python models maintain compatibility with numpy.random by hiding this source, either as the global numpy.random generator or as a single global_rng in the wrapped C++.
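
For example, seeding the global numpy source makes python-flavor sampling reproducible (a hedged sketch: the gp module name is an assumption, and the global_rng of the wrapped-C++ lp flavor may require separate seeding, not shown here):

    import numpy
    from distributions.dbg.models import gp as Model  # module name assumed

    example = Model.EXAMPLES[0]
    shared = Model.Shared.from_dict(example['shared'])
    group = Model.Group.from_values(shared, example['values'])

    numpy.random.seed(0)
    first = group.sample_value(shared)
    numpy.random.seed(0)
    assert group.sample_value(shared) == first  # same seed, same draw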