Multi-dataset evaluation using SHAP, LIME, and a novel XAI Confidence Score across CIC-IDS-2017, UNSW-NB15, and CSE-CIC-IDS-2018.
Three intrusion detection datasets with varying complexity and attack types.
14 classes including BENIGN, DoS, DDoS, Bot, Web attacks, and Infiltration. Generated by CICFlowMeter from pcap captures.
10 classes with diverse attack types including Analysis, Backdoor, Exploits, Fuzzers, Reconnaissance, and Worms.
Binary classification of normal traffic vs. DDoS attacks generated with the LOIC-HTTP tool. An extension of the 2017 dataset.
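A minimal sketch of loading CICFlowMeter-style flow output with pandas. The column names and the inline sample are illustrative stand-ins for the real dataset files; the cleaning steps (stripping header whitespace, replacing infinities) reflect common quirks of CICFlowMeter CSVs.

```python
# Sketch: loading and cleaning CICFlowMeter-style output.
# The inline sample and column names are illustrative, not the real files.
import io

import numpy as np
import pandas as pd

raw = io.StringIO(
    " Flow Duration, Total Fwd Packets, Label\n"
    "120,3,BENIGN\n"
    "45,1,DDoS\n"
    "inf,2,DDoS\n"
)

df = pd.read_csv(raw)
df.columns = df.columns.str.strip()          # CICFlowMeter headers often carry stray spaces
df = df.replace([np.inf, -np.inf], np.nan)   # rate-style columns can contain inf
df = df.dropna()                             # drop unusable rows

print(df["Label"].value_counts().to_dict())
```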
Black-box models aren't enough for security operations.
Security analysts need to understand WHY traffic is flagged before taking action. XAI provides transparent reasoning.
Understanding feature contributions helps analysts quickly dismiss false alarms and focus on real threats.
Many regulations require explainable AI decisions. XCS provides quantifiable explanation reliability scores.
Understanding model decisions helps analysts identify potential adversarial attacks targeting the IDS itself.
End-to-end pipeline from raw traffic to explainable predictions.
Raw PCAP files processed by CICFlowMeter to extract 78 network flow features.
Top 20 features selected by XGBoost importance. RobustScaler fit on the training split only (no data leakage).
XGBoost, Random Forest, LightGBM trained with SMOTE. VotingEnsemble combines all three.
SHAP for global importance, LIME for local explanations. XCS measures reliability.
From raw network traffic to explainable predictions.
Network flows captured from CIC-IDS-2017, UNSW-NB15, and CSE-CIC-IDS-2018 datasets with 78+ features per flow.
XGBoost, Random Forest, LightGBM, and VotingEnsemble trained on 20 selected features with SMOTE for class balance.
SHAP provides global feature importance. LIME explains individual predictions. Jaccard similarity measures agreement.
XCS = 0.4×Confidence + 0.3×SHAP Stability + 0.3×Jaccard. Scores > 0.3 indicate reliable explanations.
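The XCS formula above reduces to a weighted sum; a minimal sketch, assuming the three components are pre-computed and normalized to [0, 1] (how stability and Jaccard agreement are measured lives in the pipeline itself):

```python
# Minimal sketch of the XCS formula as stated above. Inputs are assumed
# pre-computed and normalized to [0, 1].
def xai_confidence_score(confidence: float, shap_stability: float,
                         jaccard: float) -> float:
    """XCS = 0.4*confidence + 0.3*SHAP stability + 0.3*Jaccard agreement."""
    return 0.4 * confidence + 0.3 * shap_stability + 0.3 * jaccard

score = xai_confidence_score(confidence=0.95, shap_stability=0.6, jaccard=0.4)
print(round(score, 2))  # 0.68
print(score > 0.3)      # True -> explanation considered reliable
```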
XGBoost, Random Forest, LightGBM, and VotingEnsemble across all datasets.
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| XGBoost Best | 0.9966 | 0.9964 | 0.9966 | 0.9964 |
| VotingEnsemble | 0.9886 | 0.9945 | 0.9886 | 0.9911 |
| RandomForest | 0.9857 | 0.9940 | 0.9857 | 0.9893 |
| LightGBM | 0.9744 | 0.9930 | 0.9744 | 0.9828 |
CIC-IDS-2017 classification results
Feature impact on predictions
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| XGBoost Best | 0.8004 | 0.8120 | 0.8004 | 0.7982 |
| VotingEnsemble | 0.7867 | 0.8240 | 0.7867 | 0.8002 |
| RandomForest | 0.7635 | 0.8202 | 0.7635 | 0.7848 |
| LightGBM | 0.7630 | 0.8291 | 0.7630 | 0.7863 |
UNSW-NB15 classification results
Feature impact on predictions
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| RandomForest Best | 0.9993 | 0.9993 | 0.9993 | 0.9992 |
| VotingEnsemble | 0.9993 | 0.9993 | 0.9993 | 0.9992 |
| XGBoost | 0.9990 | 0.9990 | 0.9990 | 0.9990 |
| LightGBM | 0.9990 | 0.9990 | 0.9990 | 0.9990 |
CSE-CIC-IDS-2018 classification results
Feature impact on predictions
Performance across all 3 datasets
Feature importance overlap analysis
Understanding model decisions with SHAP and LIME.
Top features driving predictions
Jaccard similarity of top-k features
Local explanation for normal traffic
Local explanation for DDoS detection
XAI Confidence Score across predictions
Correlation analysis
Measuring the reliability of individual explanations.
Cross-dataset Jaccard similarity: 0.216, indicating largely dataset-specific feature patterns. XCS > 0.3 indicates acceptable explanation reliability.
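The top-k Jaccard agreement used above can be sketched as follows; the feature lists are illustrative examples, not the actual per-dataset rankings.

```python
# Sketch of top-k Jaccard agreement between two feature rankings.
# The feature lists below are illustrative, not the real rankings.
def jaccard_top_k(features_a, features_b, k=10):
    """Jaccard similarity |A & B| / |A | B| of the two top-k feature sets."""
    a, b = set(features_a[:k]), set(features_b[:k])
    return len(a & b) / len(a | b)

cic  = ["Flow Duration", "Fwd Packet Length Max",
        "Bwd Packet Length Mean", "Flow IAT Mean"]
unsw = ["sbytes", "dbytes", "Flow Duration", "rate"]

print(round(jaccard_top_k(cic, unsw, k=4), 3))  # 1 shared of 7 total -> 0.143
```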
Get the same results on your machine.
Clone the repository and run `pip install -r requirements.txt`.
Run `python run_pipeline.py` to train on synthetic data, or add `--download` to fetch the real datasets.
Open `xai_ids_multidataset.ipynb` on Kaggle with a GPU (Tesla T4) for the full multi-dataset evaluation.
See `RESULTS.md` for full tables and `model_metadata.json` for raw metrics.
Select a traffic scenario to see XGBoost predictions with SHAP explanations and XCS scoring.
This runs the actual trained model in your browser
Evaluated on 3 real IDS datasets. XGBoost achieves 99.66% accuracy (CIC-IDS-2017), 80.0% (UNSW-NB15), and 99.9% (CSE-CIC-IDS-2018).
UNSW-NB15 minority classes (Analysis, Backdoor, Shellcode, Worms) have lower detection rates (~40-50% recall).
Cross-dataset Jaccard = 0.216. Models trained on one dataset may not generalize well to others.
Novel XCS metric measures explanation reliability. XCS > 0.3 indicates acceptable confidence.