BOCD-GMM: Gaussian Mixture Model

Overview

The BOCD-GMM model is a particle-based, non-parametric approach for detecting changepoints in complex data distributions. It uses Gaussian Mixture Models (GMM) to handle multimodal data and provides robustness against outliers.

When to Use BOCD-GMM

Best suited for:

Multimodal data
Heavy-tailed distributions
Outlier-prone data streams
Non-Gaussian distributions
Complex, heterogeneous data

Advantages:

Handles multimodal and non-Gaussian data
Robust to outliers through mixture components
Flexible distribution modeling
Better fit to real-world data that violates a single-Gaussian assumption

Limitations:

Computationally more expensive and slower than BOCD-NIG
Many hyperparameters to tune
Requires more data for stable estimates

Parameters

Initialization

from pybocd import BOCDGMM

model = BOCDGMM(
    # Component parameters
    alpha_0=2.0,
    beta_0=1.0,
    # Mean parameters
    m_0=0.0,
    kappa_0=1.0,
    # Precision parameters
    alpha_p_0=2.0,
    beta_p_0=1.0,
    # Mixture weight parameters
    mu_p_0=-2,
    sigma_p_sq_0=0.01,
    # Jitter (smoothing) parameters
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    # Inference parameters
    l=100.0,  # Expected run length
    m=400,  # Expected number of particles after SOR
    n=500,  # Threshold number of particles before SOR
    init_particle_n=5,  # Initial number of particles
)

Parameter Descriptions

Parameter	Description
`alpha_0`, `beta_0`	Prior parameters for component weighting
`m_0`, `kappa_0`	Prior mean and precision for mixture component means
`alpha_p_0`, `beta_p_0`	Prior shape/rate for component precisions
`mu_p_0`, `sigma_p_sq_0`	Prior mean and variance (grouped under "Mixture weight parameters" in the initialization example)
`jitter_*`	Smoothing parameters for particle updates
`l`	Expected run length between changepoints
`m`	Expected number of particles after stratified optimal resampling (SOR)
`n`	Particle-count threshold that triggers SOR
`init_particle_n`	Initial particle count before resampling
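The role of `l` can be made concrete with a small calculation. The sketch below assumes a constant hazard, i.e. a geometric prior on the time between changepoints, which is the usual choice in BOCD; whether `BOCDGMM` uses exactly this prior internally is an assumption. The helper names `constant_hazard` and `prob_no_change` are hypothetical and not part of pybocd.

```python
def constant_hazard(l):
    """Per-step changepoint probability for a geometric prior with mean l.

    Assumption: the expected run length l maps to a constant hazard of 1 / l.
    """
    if l <= 0:
        raise ValueError("l must be positive")
    return 1.0 / l


def prob_no_change(l, k):
    """Prior probability that no changepoint occurs in k consecutive steps."""
    return (1.0 - constant_hazard(l)) ** k


if __name__ == "__main__":
    for l in (40.0, 100.0):
        h = constant_hazard(l)
        p = prob_no_change(l, 50)
        print(f"l={l}: hazard={h:.4f}, P(no change in 50 steps)={p:.3f}")
```

Under this assumption, a smaller `l` makes the prior expect changepoints more often, so the detector reacts faster but is more prone to false alarms.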

Usage Example

import numpy as np
from pybocd import BOCDGMM

# Generate synthetic multimodal data
np.random.seed(42)
data = np.concatenate(
    [
        np.random.normal(-2, 0.5, 100),  # Segment 1: mean=-2
        np.random.normal(2, 0.5, 100),  # Segment 2: mean=2
        np.random.normal(0, 0.5, 100),  # Segment 3: mean=0
        np.random.normal(-2, 0.5, 100),  # Segment 4: mean returns to -2
    ]
)

# Add some outliers
outlier_indices = np.random.choice(len(data), 10, replace=False)
data[outlier_indices] += np.random.normal(0, 5, 10)

# Initialize GMM-based model
model = BOCDGMM(
    alpha_0=2.0,
    beta_0=0.5,
    m_0=0.0,
    kappa_0=0.1,
    alpha_p_0=2.0,
    beta_p_0=5.0,
    mu_p_0=-2.2,
    sigma_p_sq_0=0.01,
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    l=40.0,
    m=400,
    n=500,
    init_particle_n=5,
)

# Process data
for t, x in enumerate(data):
    model.add_data(x)

    # Compute the MAP estimate of the run length
    weights = model.weights
    run_length = model.run_length
    combined_weights = np.bincount(run_length, weights=weights)
    r = np.argmax(combined_weights)
    print(f"Time step {t}: MAP = {r}")
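The MAP computation in the loop above can be factored into a reusable helper and combined with a simple rule for flagging changepoints. This is a minimal sketch: it works on plain arrays shaped like `model.weights` and `model.run_length`, and the function names and the `min_drop` threshold are hypothetical choices, not part of pybocd.

```python
import numpy as np


def map_run_length(weights, run_length):
    """Return the run length with the largest total particle weight."""
    weights = np.asarray(weights, dtype=float)
    run_length = np.asarray(run_length, dtype=np.int64)
    # Sum the weights of all particles that share the same run length.
    combined = np.bincount(run_length, weights=weights)
    return int(np.argmax(combined))


def flag_changepoints(map_estimates, min_drop=10):
    """Return time indices where the MAP run length drops by at least min_drop.

    A sharp drop suggests the model has restarted its run, i.e. a changepoint.
    """
    r = np.asarray(map_estimates, dtype=np.int64)
    drops = r[:-1] - r[1:]
    return np.flatnonzero(drops >= min_drop) + 1


if __name__ == "__main__":
    # Two particles share run length 5, so their weights are pooled.
    print(map_run_length([0.1, 0.3, 0.2, 0.4], [0, 5, 5, 2]))
    # A run that grows for 21 steps and then resets.
    history = list(range(21)) + [0, 1, 2]
    print(flag_changepoints(history))
```

In the processing loop, one would append `map_run_length(model.weights, model.run_length)` to a list at each step and call `flag_changepoints` on that list afterwards. The threshold is a heuristic and should be tuned to the data.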

Tuning the GMM Model

Number of Particles (`n`)

More particles give a more accurate approximation of the run-length posterior at the cost of speed and memory:

model = BOCDGMM(..., n=50)   # Fast, less accurate
model = BOCDGMM(..., n=500)   # Balanced
model = BOCDGMM(..., n=5000)  # Slow, more accurate
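The interaction between `n` and `m` is easier to see with a resampling sketch. The code below illustrates one common formulation of stratified optimal resampling: once the particle count exceeds a threshold, heavy particles are kept unchanged and light particles are thinned by stratified sampling, leaving about `m` particles. It is an illustration under stated assumptions and not pybocd's implementation; the function name `sor_resample` is hypothetical.

```python
import numpy as np


def sor_resample(weights, m, rng=None):
    """Reduce a weighted particle set to about m particles.

    Returns (kept_indices, new_weights). Particles heavier than a threshold
    alpha keep their weight; lighter ones survive with probability
    weight / alpha and, if they survive, receive weight alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    if w.size <= m:
        return np.arange(w.size), w

    # Find alpha such that sum(min(1, w_i / alpha)) equals m.
    desc = np.sort(w)[::-1]
    tail = np.cumsum(desc[::-1])[::-1]  # tail[k] = sum of desc[k:]
    for k in range(m):
        alpha = tail[k] / (m - k)
        if desc[k] <= alpha:
            break

    high = np.flatnonzero(w > alpha)   # kept unchanged
    low = np.flatnonzero(w <= alpha)   # thinned by stratified sampling
    n_low = m - high.size

    # Stratified sampling: one uniform offset, then evenly spaced positions.
    cum = np.cumsum(w[low])
    positions = rng.uniform(0.0, alpha) + alpha * np.arange(n_low)
    picks = np.searchsorted(cum, positions)
    picks = np.unique(np.minimum(picks, low.size - 1))  # guard rounding

    kept = np.concatenate([high, low[picks]])
    new_w = np.concatenate([w[high], np.full(picks.size, alpha)])
    return kept, new_w / new_w.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kept, new_w = sor_resample(rng.random(500), m=400, rng=rng)
    print(len(kept), new_w.sum())
```

With `n=500` and `m=400` as in the examples above, a step like this would run whenever the particle count reaches 500 and cut it back to roughly 400, which bounds the per-step cost.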

Jitter Parameters

The jitter parameters control how much smoothing is applied during particle updates, which helps maintain particle diversity:

# Less smoothing (sharper updates)
model = BOCDGMM(..., jitter_mu=0.001, jitter_sigma_sq=0.001)

# More smoothing (smoother updates)
model = BOCDGMM(..., jitter_mu=0.1, jitter_sigma_sq=0.1)

Prior Settings

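Weak priors let the data dominate the estimates quickly:

model = BOCDGMM(..., alpha_0=1.0, beta_0=1.0, kappa_0=0.1)

Strong priors pull the estimates toward the prior values and need more data to overcome:

model = BOCDGMM(..., alpha_0=10.0, beta_0=10.0, kappa_0=10.0)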

Performance Considerations

Memory: Larger particle counts increase memory usage.
Speed: 5 to 50 times slower than BOCD-NIG, depending on parameters.
Accuracy: Good at detecting level shifts and anomalies simultaneously.
Stability: More stable with larger particle counts.
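One way to judge whether the particle count is large enough is the effective sample size (ESS) of the particle weights, a standard particle-filter diagnostic. The sketch below is not part of pybocd; it only assumes that an array of weights, such as `model.weights`, is available.

```python
import numpy as np


def effective_sample_size(weights):
    """Effective sample size 1 / sum(w_i^2) of normalized weights.

    Ranges from 1 (one particle carries all the weight) to the number of
    particles (all weights equal).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(1.0 / np.sum(w ** 2))


if __name__ == "__main__":
    print(effective_sample_size(np.ones(100)))          # equal weights
    print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))  # degenerate weights
```

An ESS that stays very low relative to the particle count suggests that a few particles dominate the approximation, in which case a larger `n` or more jitter may help.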

Comparison with BOCD-NIG

Aspect	BOCD-NIG	BOCD-GMM
Data Type	Dominated by level shifts	Level shifts plus anomalies
Speed	Very fast	Slower
Robustness	Low (sensitive to outliers)	High (robust to outliers)
Hyperparameters	Few (4-5)	Many (10+)
Computational Cost	O(1) per time step	O(n) per time step (n = number of particles)

References

For theoretical details, see the original BOCD paper (Adams and MacKay, 2007, "Bayesian Online Changepoint Detection") and the particle filtering literature.