Dos and Don'ts of Machine Learning in Computer Security
Pitfalls of machine learning in computer security are common errors and methodological deficiencies that arise when machine learning (ML) techniques are applied to computer security problems. If overlooked, these pitfalls can lead to invalid conclusions, over-optimistic performance estimates, and systems that are ineffective or insecure in practice.[1]
The topic has been the subject of significant academic study, as the complex and adversarial nature of computer security creates unique challenges for standard ML workflows.[1][2] Researchers have categorized these pitfalls across the typical stages of an ML pipeline, from data collection to real-world deployment.[1]
Categorization of Pitfalls
A 2022 study by Daniel Arp et al., published at the USENIX Security Symposium, identified ten distinct pitfalls and analyzed their prevalence by reviewing 30 papers from top-tier security conferences. The issues were widespread; the most common were sampling bias, data snooping, and lab-only evaluation.[1] This categorization provides a framework for discussing common methodological issues in the field.[2]
| ML Workflow Stage | Pitfall (P) | Description | Prevalence in Study[1] |
|---|---|---|---|
| Data Collection and Labeling | P1: Sampling Bias | The collected data does not sufficiently represent the true data distribution. | 60% |
| Data Collection and Labeling | P2: Label Inaccuracy | Ground-truth labels are inaccurate, unstable, or erroneous. | 10% |
| System Design and Learning | P3: Data Snooping | The learning model is trained with information typically unavailable in practice. | 57% |
| System Design and Learning | P4: Spurious Correlations | Artifacts unrelated to the security problem create shortcut patterns for separating classes. | 20% |
| System Design and Learning | P5: Biased Parameter Selection | Final parameters indirectly depend on the test set, as they were not entirely fixed at training time. | 10% |
| Performance Evaluation | P6: Inappropriate Baseline | Evaluation is conducted without, or with limited, baseline methods. | 20% |
| Performance Evaluation | P7: Inappropriate Performance Measures | Chosen measures do not account for application constraints, such as imbalanced data. | 33% |
| Performance Evaluation | P8: Base Rate Fallacy | Large class imbalance is ignored when interpreting performance measures. | 10% |
| Deployment and Operation | P9: Lab-Only Evaluation | System is solely evaluated in a laboratory setting, without discussing practical limitations. | 47% |
| Deployment and Operation | P10: Inappropriate Threat Model | The security of machine learning itself is not considered, exposing the system to attacks. | 17% |
Data Collection and Labeling
This stage involves acquiring and preparing data, which is often a source of subtle bias in security applications.[2]
Sampling bias (P1) occurs when the collected data does not reflect the real-world distribution of data. In security, this can happen when relying on limited public malware sources or mixing data from incompatible sources.[1]
Label inaccuracy (P2) arises when ground-truth labels are incorrect or unstable. For example, malware labels from sources like VirusTotal can be inconsistent, and adversary behavior can shift over time, causing "label shift."[1]
System Design and Learning
This stage includes feature engineering and model training, where information can be accidentally "leaked" to the model.
Data snooping (P3) is a common pitfall where a model is trained using information that would not be available in a real-world scenario.[1] This can happen by ignoring time dependencies (temporal snooping) or by cleansing the test set based on global knowledge (selective snooping).[2] The impact of data snooping has been a subject of further study, such as its effect on the performance of deep learning models for vulnerability detection in compiled code.[3]
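One way to avoid temporal snooping is to split the data by time rather than at random, so that the model never trains on samples observed after the ones it is tested on. The following is a minimal sketch, assuming each sample carries a first-seen date; the `Sample` type, its field names, and the cutoff date are hypothetical and not taken from the cited studies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: identifier, date first observed, and label."""
    sample_id: str
    first_seen: date
    label: int  # 1 = malicious, 0 = benign


def temporal_split(samples, cutoff):
    """Train only on samples seen strictly before `cutoff`; test on the rest."""
    train = [s for s in samples if s.first_seen < cutoff]
    test = [s for s in samples if s.first_seen >= cutoff]
    return train, test


# Illustrative data only.
samples = [
    Sample("a", date(2021, 1, 5), 0),
    Sample("b", date(2021, 3, 9), 1),
    Sample("c", date(2021, 7, 1), 1),
    Sample("d", date(2021, 9, 20), 0),
]
train, test = temporal_split(samples, cutoff=date(2021, 6, 1))
```

With this split, every training sample predates every test sample, which mirrors the information available at deployment time.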
Spurious correlations (P4) result when a model learns to associate artifacts with a label, rather than the underlying security-relevant pattern. For example, a malware classifier might learn to identify a specific compiler artifact instead of malicious behavior itself.[1][2]
Biased parameter selection (P5) is a form of data snooping where model hyperparameters (e.g., decision thresholds) are tuned using the test set, leading to over-optimistic results.[1]
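A common safeguard is to tune parameters on a separate validation split and to touch the test set only once, after all parameters are fixed. The sketch below illustrates this for a decision threshold; the scores, labels, and candidate thresholds are made-up values, and accuracy is used only to keep the example short.

```python
def accuracy(scores, labels, threshold):
    """Fraction classified correctly when flagging scores >= threshold."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def pick_threshold(scores, labels, candidates):
    """Return the candidate threshold with the best accuracy on the given split."""
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


# Hypothetical classifier scores and labels (1 = malicious) for two disjoint splits.
val_scores, val_labels = [0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]
test_scores, test_labels = [0.2, 0.55, 0.45, 0.8], [0, 0, 1, 1]

candidates = [0.3, 0.5, 0.7]
threshold = pick_threshold(val_scores, val_labels, candidates)  # tuned on validation only
test_accuracy = accuracy(test_scores, test_labels, threshold)   # computed once, after tuning
```

In this toy data, selecting the threshold directly on the test split would report a higher accuracy than the validation-tuned threshold achieves, which is the over-optimism the pitfall describes.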
Performance Evaluation
This stage measures a model's performance; inappropriate metrics, missing baselines, or misread results can make that measurement misleading.
Inappropriate baseline (P6) involves failing to compare a complex new model against simpler, well-established baselines. A complex deep learning model may not justify its overhead if it does not significantly outperform a simple logistic regression or non-ML heuristic.[1]
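In practice, a baseline comparison means running the simple method and the proposed model through the same evaluation code on the same test set. The sketch below assumes a hypothetical `entropy` feature and uses two stand-in rules; `stand_in_model` is a placeholder for a trained model's prediction function, not a real classifier.

```python
def rates(predict, samples):
    """Return (true-positive rate, false-positive rate) of a predictor on labeled samples."""
    tp = fp = pos = neg = 0
    for features, label in samples:
        flagged = predict(features)
        if label == 1:
            pos += 1
            tp += flagged
        else:
            neg += 1
            fp += flagged
    return tp / pos, fp / neg


# Hypothetical test set of (features, label) pairs; 1 = malicious.
test_set = [
    ({"entropy": 7.8}, 1),
    ({"entropy": 7.2}, 1),
    ({"entropy": 4.1}, 0),
    ({"entropy": 7.5}, 0),
]


def heuristic_baseline(features):
    """Non-ML rule of thumb: flag anything with high entropy."""
    return int(features["entropy"] > 7.0)


def stand_in_model(features):
    """Placeholder for the proposed model's prediction function."""
    return int(features["entropy"] > 7.6)


baseline_tpr, baseline_fpr = rates(heuristic_baseline, test_set)
model_tpr, model_fpr = rates(stand_in_model, test_set)
```

Reporting both pairs of rates side by side makes it visible whether the more complex method actually improves on the simple one.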
Inappropriate performance measures (P7) means using metrics that do not align with the practical goals of the system. For instance, reporting only "accuracy" is insufficient for an intrusion detection system, where false-positive rates are critically important.[1]
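The weakness of accuracy on imbalanced data can be seen with a detector that never raises an alert. The sketch below uses made-up data with 10 attacks among 1,000 events and reports accuracy alongside the true-positive rate, false-positive rate, and precision.

```python
def summarize(y_true, y_pred):
    """Compute accuracy, true/false-positive rate, and precision from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Illustrative data: 10 attacks, 990 benign events, and a detector that flags nothing.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
report = summarize(y_true, y_pred)
```

Here the accuracy is 99% even though the true-positive rate is zero, so accuracy alone would hide the fact that no attack is detected.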
Base rate fallacy (P8) is a failure to correctly interpret performance in the context of large class imbalances. In tasks like intrusion detection, a 0.1% false-positive rate may seem low but could result in an unmanageably high number of false alerts in practice.[1]
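The effect can be made concrete with a short calculation. The sketch below keeps the 0.1% false-positive rate from the example above; the event volume (1,000,000 events, of which 100 are attacks) and the 99% true-positive rate are assumed values chosen for illustration.

```python
def expected_alerts(attacks, benign, tpr, fpr):
    """Expected true alerts, false alerts, and the resulting precision."""
    true_alerts = attacks * tpr
    false_alerts = benign * fpr
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision


# Assumed workload: 100 attacks hidden among 999,900 benign events.
true_alerts, false_alerts, precision = expected_alerts(
    attacks=100, benign=999_900, tpr=0.99, fpr=0.001
)
# Roughly 1,000 false alerts against 99 true ones: about 9% of alerts are real.
```

Under these assumptions an analyst would have to review about ten false alerts for every genuine one, despite the seemingly low false-positive rate.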
Deployment and Operation
This final stage concerns the model's performance and security in a live environment.
Lab-only evaluation (P9) is the practice of evaluating a system only in a controlled, static laboratory setting, which neglects real-world challenges like concept drift (where data distributions change over time) and performance overhead.[1]
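A simple way to surface concept drift is to report performance per time period instead of as a single aggregate number. The sketch below computes the detection rate for each period from hypothetical `(period, true label, predicted label)` records; the period strings and values are illustrative.

```python
from collections import defaultdict


def detection_rate_by_period(records):
    """Map each period to the fraction of its malicious samples that were detected."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for period, y_true, y_pred in records:
        if y_true == 1:
            positives[period] += 1
            hits[period] += int(y_pred == 1)
    return {p: hits[p] / positives[p] for p in sorted(positives)}


# Illustrative records: (period, true label, predicted label); 1 = malicious.
records = [
    ("2021-01", 1, 1), ("2021-01", 1, 1), ("2021-01", 0, 0),
    ("2021-02", 1, 1), ("2021-02", 1, 0),
    ("2021-03", 1, 0), ("2021-03", 1, 0), ("2021-03", 0, 0),
]
rates_by_period = detection_rate_by_period(records)
```

A detection rate that falls from one period to the next, as in this toy data, is the kind of decay that a single static evaluation would not reveal.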
Inappropriate threat model (P10) refers to failing to consider the ML system itself as an attack surface. This includes vulnerability to adversarial attacks (e.g., evasion attacks) that are specifically designed to fool the model.[1]
References
- [1] Arp, Daniel; Quiring, Erwin; Pendlebury, Feargus; Warnecke, Alexander; Pierazzi, Fabio; Wressnegger, Christian; Cavallaro, Lorenzo; Rieck, Konrad (2022). "Dos and Don'ts of Machine Learning in Computer Security" (PDF). 31st USENIX Security Symposium (USENIX Security 22). USENIX Association. pp. 207–224. ISBN 978-1-939133-31-1. Retrieved 10 November 2025.
- [2] Arp, Daniel; Quiring, Erwin; Pendlebury, Feargus; Warnecke, Alexander; Pierazzi, Fabio; Wressnegger, Christian; Cavallaro, Lorenzo; Rieck, Konrad (2023). "Taking the Red Pill: Lessons Learned on Machine Learning for Computer Security" (PDF). IEEE Security & Privacy. IEEE. 21 (5): 72–77. doi:10.1109/MSEC.2023.3287207. Retrieved 10 November 2025.
- [3] Beadle, Lucas C.; McCully, Mark E.; Al-Fileh, M. A. T. (2024). "Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code". Beadle Scholar. Dakota State University. 3 (1). Retrieved 10 November 2025.
