Abstract
This paper presents a three-stage approach to analyzing Common Vulnerabilities and Exposures (CVE) datasets using machine learning techniques. In the first stage, K-Means clustering and Latent Dirichlet Allocation (LDA) topic modeling are applied to identify distinct clusters and topics within the dataset. The Elbow method is used to determine the optimal number of clusters for K-Means, while Grid Search is used to find the best topic model for LDA. In the second stage, 100 random samples from each cluster are labeled. In the third stage, the labeled data is split into training and testing sets and used with various classification algorithms. The paper contributes a novel approach to analyzing CVE datasets that combines clustering and classification techniques: the clusters and topics identified in the first stage can be used to improve the accuracy of the classification algorithms. The study also highlights the importance of using pre-trained word embeddings and discusses the limitations of the proposed approach. Overall, the paper offers a framework for future research on the analysis of CVE datasets.