BOCD-GMM: Gaussian Mixture Model

Overview

The BOCD-GMM model is a particle-based, non-parametric approach for detecting changepoints in complex data distributions. It uses Gaussian Mixture Models (GMM) to handle multimodal data and provides robustness against outliers.

When to Use BOCD-GMM

Best suited for:

  • Multimodal data
  • Heavy-tailed distributions
  • Outlier-prone data streams
  • Non-Gaussian distributions
  • Complex, heterogeneous data

Advantages:

  • Handles multimodal and non-Gaussian data
  • Robust to outliers through mixture components
  • Flexible distribution modeling
  • Better fit than a single-Gaussian model when real-world data violate Gaussian assumptions
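
To make the outlier-robustness point concrete, the sketch below compares the log-density of an extreme observation under a single Gaussian and under a two-component mixture. It is independent of pybocd: the helper names (normal_logpdf, mixture_logpdf) and the mixture weights and scales are hypothetical values chosen only for illustration.

```python
import math


def normal_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std**2)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def mixture_logpdf(x, weights, means, stds):
    """Log-density of a Gaussian mixture, using log-sum-exp for stability."""
    terms = [
        math.log(w) + normal_logpdf(x, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


outlier = 6.0

# Single Gaussian fitted to the bulk of the data (illustrative values).
single = normal_logpdf(outlier, 0.0, 1.0)

# Hypothetical mixture: a narrow bulk component plus a wide, low-weight one.
mixed = mixture_logpdf(outlier, [0.95, 0.05], [0.0, 0.0], [1.0, 5.0])

print(f"single Gaussian: {single:.2f}, mixture: {mixed:.2f}")
```

Because the wide component assigns the outlier a far less extreme log-density, a single outlier is less likely to be mistaken for the start of a new regime.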

Limitations:

  • Computationally more expensive, and therefore slower, than BOCD-NIG
  • Many hyperparameters to tune
  • Requires more data for stable estimates

Parameters

Initialization

from pybocd import BOCDGMM

model = BOCDGMM(
    # Component parameters
    alpha_0=2.0,
    beta_0=1.0,
    # Mean parameters
    m_0=0.0,
    kappa_0=1.0,
    # Precision parameters
    alpha_p_0=2.0,
    beta_p_0=1.0,
    # Mixture weight parameters
    mu_p_0=-2,
    sigma_p_sq_0=0.01,
    # Jitter (smoothing) parameters
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    # Inference parameters
    l=100.0,  # Expected run length between changepoints
    m=400,  # Expected number of particles after SOR
    n=500,  # Threshold number of particles before SOR
    init_particle_n=5,  # Initial number of particles
)

Parameter Descriptions

Parameter               Description
alpha_0, beta_0         Prior parameters for component weighting
m_0, kappa_0            Prior mean and precision for mixture component means
alpha_p_0, beta_p_0     Prior shape/rate for component precisions
mu_p_0, sigma_p_sq_0    Parameters for precision prior distribution
jitter_*                Smoothing parameters for particle updates
l                       Expected run length between changepoints
m                       Expected number of particles after SOR
n                       Threshold number of particles before SOR
init_particle_n         Initial particle count before resampling
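
As a rough guide to choosing l, the sketch below assumes the expected run length enters the model as a constant hazard of 1/l per time step, as in the standard BOCD formulation with a geometric run-length prior. Whether pybocd uses exactly this parameterization is an assumption, and the function names are hypothetical.

```python
def constant_hazard(expected_run_length):
    """Per-step changepoint probability under a geometric run-length prior."""
    if expected_run_length < 1:
        raise ValueError("expected_run_length must be at least 1")
    return 1.0 / expected_run_length


def survival_probability(expected_run_length, steps):
    """Prior probability that a segment continues for `steps` more steps."""
    return (1.0 - constant_hazard(expected_run_length)) ** steps


# Compare the two values of l used in this document (40.0 and 100.0).
for run_len in (40.0, 100.0):
    hazard = constant_hazard(run_len)
    survive_50 = survival_probability(run_len, 50)
    print(f"l={run_len}: hazard={hazard:.4f}, P(no change in 50 steps)={survive_50:.3f}")
```

Under this assumption a smaller l makes the prior expect changepoints more often, so the detector reacts faster but is also more prone to false alarms.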

Usage Example

import numpy as np
from pybocd import BOCDGMM

# Generate synthetic data with three level shifts
np.random.seed(42)
data = np.concatenate(
    [
        np.random.normal(-2, 0.5, 100),  # Segment 1: mean=-2
        np.random.normal(2, 0.5, 100),  # Segment 2: mean=2
        np.random.normal(0, 0.5, 100),  # Segment 3: mean=0
        np.random.normal(-2, 0.5, 100),  # Segment 4: mean returns to -2
    ]
)

# Add some outliers
outlier_indices = np.random.choice(len(data), 10, replace=False)
data[outlier_indices] += np.random.normal(0, 5, 10)

# Initialize GMM-based model
model = BOCDGMM(
    alpha_0=2.0,
    beta_0=0.5,
    m_0=0.0,
    kappa_0=0.1,
    alpha_p_0=2.0,
    beta_p_0=5.0,
    mu_p_0=-2.2,
    sigma_p_sq_0=0.01,
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    l=40.0,
    m=400,
    n=500,
    init_particle_n=5,
)

# Process data
for t, x in enumerate(data):
    model.add_data(x)

    # Compute the MAP estimate of the run length
    weights = model.weights
    run_length = model.run_length
    combined_weights = np.bincount(run_length, weights=weights)
    r = np.argmax(combined_weights)
    print(f"Time step {t}: MAP = {r}")
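
The MAP run length printed in the loop above can be turned into changepoint estimates by watching for sudden drops. The sketch below is a dependency-free illustration: map_run_length mirrors the bincount/argmax step on plain lists, and changepoints_from_map is a hypothetical post-processing helper (the min_drop threshold and the "t minus run length" convention are assumptions, not part of pybocd).

```python
def map_run_length(run_lengths, weights):
    """Return the run length with the largest total particle weight."""
    totals = {}
    for r, w in zip(run_lengths, weights):
        totals[r] = totals.get(r, 0.0) + w
    return max(totals, key=totals.get)


def changepoints_from_map(map_history, min_drop=5):
    """Flag steps where the MAP run length falls by at least `min_drop`.

    A drop observed at step t suggests a changepoint near
    t - map_history[t] (assumed convention).
    """
    flagged = []
    for t in range(1, len(map_history)):
        if map_history[t - 1] - map_history[t] >= min_drop:
            flagged.append(t - map_history[t])
    return flagged


# Toy particle set: two particles share run length 3.
print(map_run_length([0, 3, 3, 7], [0.1, 0.3, 0.3, 0.3]))

# Toy MAP history: the run length grows, then resets.
history = list(range(10)) + [1, 2, 3]
print(changepoints_from_map(history))
```

In practice the history would be collected by appending r at each iteration of the processing loop.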

Tuning the GMM Model

Number of Particles (n)

More particles give a more accurate posterior approximation at the cost of speed:

model = BOCDGMM(..., n=50)    # Fast, less accurate
model = BOCDGMM(..., n=500)   # Balanced
model = BOCDGMM(..., n=5000)  # Slow, more accurate
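
The trade-off above follows the usual Monte Carlo pattern, where error shrinks roughly with the square root of the sample count. The sketch below illustrates that pattern on a toy problem (estimating the mean of a standard normal); it does not call pybocd, and the function name is hypothetical.

```python
import math
import random


def rmse_of_particle_mean(n_particles, repeats=50, seed=0):
    """RMSE of an n-sample Monte Carlo estimate of the mean of N(0, 1)."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(repeats):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
        estimate = sum(draws) / n_particles
        squared_errors.append(estimate ** 2)  # the true mean is 0
    return math.sqrt(sum(squared_errors) / repeats)


# Same particle counts as in the tuning examples above.
errors = {n: rmse_of_particle_mean(n) for n in (50, 500, 5000)}
for n, err in errors.items():
    print(f"n={n}: RMSE ~ {err:.4f}")
```

A hundredfold increase in particles buys roughly a tenfold reduction in error, so returns diminish quickly while cost keeps growing linearly.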

Jitter Parameters

The jitter parameters add smoothing noise to particle updates, which helps maintain particle diversity:

# Less smoothing (sharper updates)
model = BOCDGMM(..., jitter_mu=0.001, jitter_sigma_sq=0.001)

# More smoothing (smoother updates)
model = BOCDGMM(..., jitter_mu=0.1, jitter_sigma_sq=0.1)
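
One common way jitter preserves diversity is by perturbing the duplicate particles that resampling creates. The sketch below shows that general idea with plain multinomial resampling and Gaussian noise; how pybocd applies its jitter_* parameters internally is not documented here, so resample_with_jitter and jitter_scale are hypothetical.

```python
import random


def resample_with_jitter(particles, weights, k, jitter_scale, rng):
    """Multinomial resampling of k particles, then Gaussian jitter on each copy."""
    picked = rng.choices(particles, weights=weights, k=k)
    return [p + rng.gauss(0.0, jitter_scale) for p in picked]


rng = random.Random(0)
particles = [-2.0, 0.0, 2.0, 4.0]
weights = [0.05, 0.05, 0.85, 0.05]  # one particle dominates

# Without jitter, the 20 resampled particles collapse onto at most 4 values.
no_jitter = resample_with_jitter(particles, weights, 20, 0.0, rng)

# With jitter, every copy becomes a slightly different particle.
jittered = resample_with_jitter(particles, weights, 20, 0.1, rng)

print(len(set(no_jitter)), len(set(jittered)))
```

Too little jitter lets the particle set degenerate into copies of a few values, while too much blurs the parameter estimates each particle carries.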

Prior Settings

Weak priors (the data dominate quickly):

model = BOCDGMM(..., alpha_0=1.0, beta_0=1.0, kappa_0=0.1)

Strong priors (estimates stay closer to the prior values):

model = BOCDGMM(..., alpha_0=10.0, beta_0=10.0, kappa_0=10.0)
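
The effect of prior strength can be seen in the textbook conjugate update for a Gaussian mean, where kappa_0 behaves like a count of pseudo-observations placed at m_0. The sketch below uses that standard formula; it is an illustration of the general principle and assumes nothing about how pybocd combines its priors internally.

```python
def posterior_mean(m_0, kappa_0, data):
    """Posterior mean of a Gaussian mean under a conjugate Normal prior.

    kappa_0 acts like a number of pseudo-observations located at m_0.
    """
    n = len(data)
    sample_mean = sum(data) / n
    return (kappa_0 * m_0 + n * sample_mean) / (kappa_0 + n)


data = [2.0] * 10  # ten observations at 2.0, prior mean at 0.0

weak = posterior_mean(0.0, 0.1, data)     # kappa_0 as in the weak-prior example
strong = posterior_mean(0.0, 10.0, data)  # kappa_0 as in the strong-prior example

print(f"weak prior: {weak:.3f}, strong prior: {strong:.3f}")
```

With kappa_0=10.0 the prior counts as much as the ten observations, so the estimate lands halfway between the prior mean and the data.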

Performance Considerations

  • Memory: Larger particle counts increase memory usage.
  • Speed: Roughly 5-50x slower than BOCD-NIG, depending on parameters.
  • Accuracy: Good at detecting level shifts and anomalies simultaneously.
  • Stability: More stable with larger particle counts.
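
A standard particle-filter diagnostic for the stability point above is the effective sample size (ESS), computed from the particle weights. The sketch below is a generic helper, not a pybocd function; it could be applied to the weights exposed as model.weights in the usage example.

```python
def effective_sample_size(weights):
    """Effective sample size: 1 / sum of squared normalized weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


# Evenly weighted particles: ESS equals the particle count.
print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))

# One dominant particle: ESS collapses toward 1.
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))
```

An ESS that stays far below the particle count suggests the approximation rests on only a few particles, which is a hint to raise n or adjust the jitter settings.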

Comparison with BOCD-NIG

Aspect                BOCD-NIG                  BOCD-GMM
Data Type             Level shifts dominated    Level shifts + anomalies
Speed                 Very fast                 Slower
Robustness            Low to outliers           High
Hyperparameters       Few (4-5)                 Many (10+)
Computational Cost    O(1) in each time step    O(n) in each time step

References

For theoretical details, see the original BOCD paper (Adams and MacKay, "Bayesian Online Changepoint Detection", 2007) and the particle filtering literature.