BOCD-GMM: Gaussian Mixture Model
Overview
BOCD-GMM is a particle-based, non-parametric model for Bayesian online changepoint detection (BOCD) in data with complex distributions. It uses Gaussian mixture models (GMMs) to handle multimodal data and to stay robust against outliers.
When to Use BOCD-GMM
Best suited for:
- Multimodal data
- Heavy-tailed distributions
- Outlier-prone data streams
- Non-Gaussian distributions
- Complex, heterogeneous data
Advantages:
- Handles multimodal and non-Gaussian data
- Robust to outliers through mixture components
- Flexible distribution modeling
- Better fit than a single-Gaussian model when real-world data is multimodal or contains outliers
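To make the robustness claim concrete, the sketch below compares the log-density that a single Gaussian and a two-component mixture assign to an outlying point. It does not use pybocd; the weights and scales are illustrative values chosen for this example, and the helper names are hypothetical.

```python
import numpy as np

def normal_logpdf(x, mean, std):
    """Log-density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def mixture_logpdf(x, weights, means, stds):
    """Log-density of a univariate Gaussian mixture (log-sum-exp over components)."""
    weights, means, stds = (np.asarray(a, dtype=float) for a in (weights, means, stds))
    per_component = np.log(weights) + normal_logpdf(x, means, stds)
    return np.logaddexp.reduce(per_component)

outlier = 6.0
# Single narrow Gaussian: the outlier is extremely unlikely.
single = normal_logpdf(outlier, 0.0, 1.0)
# Mixture with a low-weight, wide component that absorbs outliers.
mixture = mixture_logpdf(outlier, [0.9, 0.1], [0.0, 0.0], [1.0, 5.0])
print(f"single Gaussian: {single:.2f}, mixture: {mixture:.2f}")
```

Because the wide component keeps the outlier's likelihood from collapsing, a single extreme value is less likely to be mistaken for a changepoint.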
Limitations:
- Computationally more expensive, and therefore slower, than BOCD-NIG (the Normal-Inverse-Gamma model)
- Many hyperparameters to tune
- Requires more data for stable estimates
Parameters
Initialization
from pybocd import BOCDGMM

model = BOCDGMM(
    # Component parameters
    alpha_0=2.0,
    beta_0=1.0,
    # Mean parameters
    m_0=0.0,
    kappa_0=1.0,
    # Precision parameters
    alpha_p_0=2.0,
    beta_p_0=1.0,
    # Mixture weight parameters
    mu_p_0=-2,
    sigma_p_sq_0=0.01,
    # Jitter (smoothing) parameters
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    # Inference parameters
    l=100.0,  # Expected run length between changepoints
    m=400,  # Expected number of particles after SOR
    n=500,  # Threshold number of particles before SOR
    init_particle_n=5,  # Initial particle count before resampling
)
Parameter Descriptions
| Parameter | Description |
|---|---|
| alpha_0, beta_0 | Prior parameters for component weighting |
| m_0, kappa_0 | Prior mean and precision for mixture component means |
| alpha_p_0, beta_p_0 | Prior shape/rate for component precisions |
| mu_p_0, sigma_p_sq_0 | Parameters for precision prior distribution |
| jitter_* | Smoothing parameters for particle updates |
| l | Expected run length between changepoints |
| m | Expected number of particles after SOR |
| n | Threshold number of particles before SOR |
| init_particle_n | Initial particle count before resampling |
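In standard BOCD, an expected run length is commonly converted into a constant hazard rate of 1/l, which implies a geometric prior on segment lengths. Whether BOCDGMM uses `l` in exactly this way is an assumption here, and the helper names below are hypothetical; the sketch only shows how the choice of `l` shapes prior expectations.

```python
def constant_hazard(expected_run_length):
    """Per-step changepoint probability under a constant-hazard (geometric) prior."""
    return 1.0 / expected_run_length

def survival_probability(expected_run_length, k):
    """Prior probability that a segment runs for k steps without a changepoint."""
    hazard = constant_hazard(expected_run_length)
    return (1.0 - hazard) ** k

for run_len in (40.0, 100.0):
    hazard = constant_hazard(run_len)
    survive_100 = survival_probability(run_len, 100)
    print(f"l={run_len}: hazard={hazard:.4f}, P(no change in 100 steps)={survive_100:.3f}")
```

A smaller `l` makes the model expect changepoints more often, so it reacts faster but is also more prone to false alarms.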
Usage Example
import numpy as np
from pybocd import BOCDGMM
import matplotlib.pyplot as plt
# Generate synthetic multimodal data
np.random.seed(42)
data = np.concatenate(
    [
        np.random.normal(-2, 0.5, 100),  # Mode 1: mean=-2
        np.random.normal(2, 0.5, 100),  # Mode 2: mean=2
        np.random.normal(0, 0.5, 100),  # Mode 3: mean=0
        np.random.normal(-2, 0.5, 100),  # Mode 1 returns: mean=-2
    ]
)
# Add some outliers
outlier_indices = np.random.choice(len(data), 10, replace=False)
data[outlier_indices] += np.random.normal(0, 5, 10)
# Initialize GMM-based model
model = BOCDGMM(
    alpha_0=2.0,
    beta_0=0.5,
    m_0=0.0,
    kappa_0=0.1,
    alpha_p_0=2.0,
    beta_p_0=5.0,
    mu_p_0=-2.2,
    sigma_p_sq_0=0.01,
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    l=40.0,
    m=400,
    n=500,
    init_particle_n=5,
)
# Process data
for t, x in enumerate(data):
    model.add_data(x)
    # Compute the MAP estimate of the run length:
    # sum the particle weights per run length, then take the most probable one
    weights = model.weights
    run_length = model.run_length
    combined_weights = np.bincount(run_length, weights=weights)
    r = np.argmax(combined_weights)
    print(f"Time step {t}: MAP = {r}")
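The MAP computation in the loop above can be factored into a helper, and a sharp drop in the MAP run length can serve as a simple changepoint flag. This is a sketch under assumptions: it takes particle weights and non-negative integer run lengths as plain arrays, as the example implies, and `map_run_length`, `flag_changepoints`, and the `min_drop` threshold are hypothetical names that are not part of pybocd.

```python
import numpy as np

def map_run_length(weights, run_lengths):
    """Most probable run length after summing particle weights per run length."""
    weights = np.asarray(weights, dtype=float)
    run_lengths = np.asarray(run_lengths, dtype=np.int64)
    combined = np.bincount(run_lengths, weights=weights)
    return int(np.argmax(combined))

def flag_changepoints(map_sequence, min_drop=5):
    """Time steps at which the MAP run length falls by at least min_drop."""
    map_sequence = np.asarray(map_sequence)
    drops = map_sequence[:-1] - map_sequence[1:]
    return (np.flatnonzero(drops >= min_drop) + 1).tolist()

# Toy particle set: two particles at run length 0, two at run length 7.
toy_weights = [0.1, 0.3, 0.2, 0.4]
toy_run_lengths = [0, 7, 0, 7]
print(map_run_length(toy_weights, toy_run_lengths))

# Toy MAP trajectory: the run length grows, then resets at time step 8.
toy_map_sequence = [0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
print(flag_changepoints(toy_map_sequence))
```

In the processing loop, one would append `map_run_length(model.weights, model.run_length)` to a list at each step and pass that list to `flag_changepoints` once the stream has been consumed.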
Tuning the GMM Model
Number of Particles (n)
More particles give a more accurate approximation but make each update slower:
model = BOCDGMM(..., n=50) # Fast, less accurate
model = BOCDGMM(..., n=500) # Balanced
model = BOCDGMM(..., n=5000) # Slow, more accurate
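One common way to judge whether a particle count is adequate is the effective sample size (ESS) of the particle weights, a standard particle-filter diagnostic. The sketch below assumes weights like those exposed by `model.weights` in the usage example; `effective_sample_size` is a hypothetical helper, not a pybocd function.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to len(weights)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize defensively
    return 1.0 / np.sum(w**2)

# Evenly spread weights: every particle contributes.
uniform_weights = np.full(500, 1.0 / 500)
# Degenerate weights: one particle carries almost all of the mass.
degenerate_weights = np.r_[0.99, np.full(499, 0.01 / 499)]

print(effective_sample_size(uniform_weights))
print(effective_sample_size(degenerate_weights))
```

If the ESS is routinely a small fraction of the particle count, the approximation rests on very few particles, and raising `n` is one option to consider.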
Jitter Parameters
The jitter parameters control how much smoothing is applied to keep the particles diverse:
# Less smoothing (sharper updates)
model = BOCDGMM(..., jitter_mu=0.001, jitter_sigma_sq=0.001)
# More smoothing (smoother updates)
model = BOCDGMM(..., jitter_mu=0.1, jitter_sigma_sq=0.1)
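The role of jitter can be illustrated with a generic particle-filter sketch: resampling duplicates high-weight particles, and a small random perturbation makes the copies distinct again. This is not pybocd's internal code. It uses plain multinomial resampling, treats the jitter value as a variance (an assumption), and `resample_with_jitter` is a hypothetical helper.

```python
import numpy as np

def resample_with_jitter(particles, weights, jitter_var, rng):
    """Multinomial resampling followed by Gaussian jitter on each particle."""
    particles = np.asarray(particles, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Draw particle indices in proportion to their weights (creates duplicates).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    # Perturb each copy; jitter_var=0 leaves the duplicates identical.
    noise = rng.normal(0.0, np.sqrt(jitter_var), size=len(particles))
    return particles[idx] + noise

rng = np.random.default_rng(0)
particles = np.array([-2.0, 0.0, 2.0, 4.0])
weights = np.array([0.05, 0.9, 0.03, 0.02])

no_jitter = resample_with_jitter(particles, weights, 0.0, rng)
jittered = resample_with_jitter(particles, weights, 0.01, rng)
print("distinct values without jitter:", len(np.unique(no_jitter)))
print("distinct values with jitter:", len(np.unique(jittered)))
```

Larger jitter values spread the copies further apart, which preserves diversity but blurs what the particles have learned from the data.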
Prior Settings
Weak priors (the data quickly dominates the estimates):
BOCDGMM(alpha_0=1.0, beta_0=1.0, kappa_0=0.1, ...)
Strong priors (more data is needed to move the estimates away from the prior):
BOCDGMM(alpha_0=10.0, beta_0=10.0, kappa_0=10.0, ...)
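The effect of prior strength on a component mean can be seen from the standard conjugate update for a normal mean, in which `kappa_0` acts like a count of pseudo-observations located at `m_0`. That BOCDGMM applies exactly this update internally is an assumption; `posterior_mean` is a hypothetical helper written for this illustration.

```python
import numpy as np

def posterior_mean(m_0, kappa_0, data):
    """Posterior mean of a normal mean: prior mean and sample sum, weighted by counts."""
    data = np.asarray(data, dtype=float)
    return (kappa_0 * m_0 + data.sum()) / (kappa_0 + data.size)

# Ten observations, all equal to 2.0, with a prior mean of 0.0.
observations = np.full(10, 2.0)
weak = posterior_mean(0.0, 0.1, observations)
strong = posterior_mean(0.0, 10.0, observations)
print(f"weak prior (kappa_0=0.1): {weak:.3f}")
print(f"strong prior (kappa_0=10): {strong:.3f}")
```

With `kappa_0=10.0`, ten observations move the estimate only halfway from the prior mean to the sample mean, whereas with `kappa_0=0.1` the estimate is almost entirely determined by the data.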
Performance Considerations
- Memory: Larger particle counts increase memory usage.
- Speed: Slower than BOCD-NIG by 5-50x depending on parameters.
- Accuracy: Good at detecting level shifts and anomalies simultaneously.
- Stability: More stable with larger particle counts.
Comparison with BOCD-NIG
| Aspect | BOCD-NIG | BOCD-GMM |
|---|---|---|
| Data Type | Dominated by level shifts | Level shifts plus anomalies |
| Speed | Very fast | Slower |
| Robustness to outliers | Low | High |
| Hyperparameters | Few (4-5) | Many (10+) |
| Computational Cost | O(1) per time step | O(n) per time step, where n is the particle count |
References
For theoretical details, see the original BOCD paper (Adams and MacKay, 2007, "Bayesian Online Changepoint Detection") and the literature on particle filtering.