ABSTRACT

Systems that learn to detect anomalous email behavior, such as that caused by worms and viruses, tend to build either per-user models or a single global model. Global models leverage a larger training corpus but often model individual users poorly. Per-user models capture fine-grained behaviors but can take a long time to accumulate sufficient training data. Approaches that combine global and per-user information have the potential to address these limitations. We use the Latent Dirichlet Allocation model to transition smoothly from the global prior to a particular user’s empirical model as the amount of user data grows. Preliminary results demonstrate long-term accuracy comparable to per-user models, while also showing near-ideal performance almost immediately on new users.

vulnerability, Bayes classification, Latent Dirichlet Allocation, per-user mixture model, global mixture model, SMTP engines