Using Python and Machine Learning to Predict Cyber Attacks: A Summer Intern’s Story

August 25, 2021

Last Updated on January 15, 2024

For summer 2021 I was excited to once again intern with Pivot Point Security. I enjoyed working there last summer and learned a lot about information security by engaging with technical subject matter experts to translate complex service offerings (e.g., CMMC, ISO 27001, Pen Testing) into marketing materials suitable for both technical and management audiences. I also conducted some initial research to gauge the feasibility of a Python-based document scraping tool that could be used when conducting vendor due diligence reviews. This year, I had the opportunity to build on that work and expand my technical skills more deeply, specifically in Python and machine learning.

In almost every job posting with some variation of “Engineer” in the title, one of the applicant requirements is a desired skill or proficiency in one or more coding languages. Trying to learn coding on your own is a daunting task, one that requires both an immense amount of time and also willingness to be continuously FRUSTRATED. Luckily, during those moments of intense frustration I had both my family and wonderful boyfriend to push me forward. It is thanks to them that I had success with this summer’s proof of concept (PoC) effort.

Unfortunately, my general engineering undergraduate program at James Madison University (JMU) does not have a strong focus on software engineering. With so little experience under my belt, I knew this was going to be a long summer.

The best way to learn is just to get your hands dirty and start

I have always had the mindset that an undergrad engineering program is above all else teaching you how to learn. To me, engineering is a mindset or a way of looking at problems. There were many classes during my undergrad where I may never use what I was taught. And while my transcript lists engineering science classes one after another, what all those classes have in common is that they focused on applied learning, and the only way to succeed and master such subjects is practice. Practice makes perfect… Cliché, right? Well, it’s true. Eventually if you have seen and solved most variations of a problem you will know where to start on related new problems. The same goes for coding—the best way to learn is just to get your hands dirty and start.

My project was a PoC to determine whether machine learning could be used to predict cyber-attacks. The driver for the initiative is that Pivot Point Security acts as the virtual CISO (vCISO) for an organization with affiliates in all 50 states. Intuitively, attacks correlate with economic, political, and technical factors. So, why not leverage available data to predict cyber security incidents and move security from a reactive to a proactive exercise? The idea was to source data from Open Threat Exchanges and Information Sharing and Analysis Centers to identify situations where an attack is more likely to occur, so that security resources could be directed with a better understanding of what type of attack was more likely, and when.

As JMU’s engineering program is project-based, knowing where to start should be easy, right? Wrong! The project got off to a slow start: I got lost in a YouTube wormhole as I struggled to learn Python well enough to dare to move into the even more daunting machine learning realm. The trick was knowing how much coding knowledge was enough. I decided to give myself two weeks for a refresher course in Python, working through countless YouTube channels. (While he most likely will never see this, I would like to shout out Keith Galli, who has recorded some of the best videos for learning Python.)

Python mastered :>), I moved on to getting the data. Machine learning should come with a big yellow warning label that reads, “I am not too difficult myself, but finding and cleaning data is.”

Let the Selenium Shine in

The first challenge was that the Open Threat Exchanges and Information Sharing and Analysis Centers did not have APIs, which forced me to use web scraping. Fortunately, Python is highly extensible and has additional packages that provide extensive, purpose-specific functionality. Key packages I used include pandas, NumPy, scikit-learn, Beautiful Soup and Selenium. Those unfamiliar with Python, or with the tools used in combination with it, may be asking themselves… She imported what? A panda? And what does soup have to do with this? Read on and it will all crystallize. Meanwhile, a second shout-out goes to Stack Overflow and the wonderful strangers who have asked and answered questions there. It is a remarkable resource to learn from.

Deep into research, I came across the Beautiful Soup and Selenium packages. With these I was able to scrape all of my data from Open Threat Exchanges and Information Sharing and Analysis Centers into pandas data frames. Beautiful Soup allowed me to extract data from HTML, which was particularly useful for the online databases. Beautiful Soup only got me so far, however, as gathering the data required programmatic interactions with the web page (e.g., entering a username/password, clicking a button, typing a string), which is where Selenium shines. Selenium allowed me to gather data from various tabs through a web driver and minimal lines of code. In the end, I scraped about 90 days of data that I unfortunately learned was not directly usable (yet) for machine learning.
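To make the Beautiful Soup step concrete, here is a minimal sketch of turning an HTML table into a pandas data frame. It is not the code from the PoC: the HTML snippet, the table id "incidents", the column names, and the helper table_to_frame are all hypothetical stand-ins for whatever a real threat-exchange page contains.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a threat-exchange results page.
HTML = """
<table id="incidents">
  <tr><th>date</th><th>malware_class</th><th>region</th></tr>
  <tr><td>2021-06-01</td><td>ransomware</td><td>Northeast</td></tr>
  <tr><td>2021-06-02</td><td>trojan</td><td>Midwest</td></tr>
</table>
"""


def table_to_frame(html: str, table_id: str) -> pd.DataFrame:
    """Parse one HTML table (header row + data rows) into a data frame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    rows = table.find_all("tr")
    # The first row holds the column headers (<th> cells).
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    # Every later row holds one record (<td> cells).
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=headers)


frame = table_to_frame(HTML, "incidents")
print(frame)
```

When a page sits behind a login or needs button clicks, one option is to let a Selenium web driver perform those steps first and then pass the driver's page_source string to a helper like this one in place of the static HTML.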

To my initial surprise, one of the biggest challenges I encountered was cleaning the data I had scraped so that it was usable for machine learning. This step (remarkably) never comes up in the machine learning video tutorials online, as they use readily available clean data sets.

Cleaning data involves removing null values, removing extraneous characters from data sets, normalizing data, and converting data types. It was a painful process that I was happy to complete.
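As a rough illustration of those four cleaning steps, here is a small pandas sketch. The column names (region, recon_events) and the messy values are made up for the example, and min-max scaling is just one of several ways to normalize.

```python
import pandas as pd

# Hypothetical scraped values: stray whitespace and symbols, a
# thousands separator, a missing row, and numbers stored as text.
raw = pd.DataFrame({
    "region": [" Northeast\n", "Midwest*", None, "South "],
    "recon_events": ["1,200", "350", "n/a", "80"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Remove extraneous characters, then trim surrounding whitespace.
    out["region"] = (
        out["region"].str.replace(r"[^A-Za-z ]", "", regex=True).str.strip()
    )
    # Convert data types: drop thousands separators and parse as numbers.
    # Anything unparseable (such as "n/a") becomes NaN.
    out["recon_events"] = pd.to_numeric(
        out["recon_events"].str.replace(",", "", regex=False),
        errors="coerce",
    )
    # Remove rows that contain null values.
    out = out.dropna().reset_index(drop=True)
    # Normalize the numeric column to the 0..1 range (min-max scaling).
    low, high = out["recon_events"].min(), out["recon_events"].max()
    out["recon_scaled"] = (out["recon_events"] - low) / (high - low)
    return out


cleaned = clean(raw)
print(cleaned)
```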

Started from the bottom and we’re here!

Finally, machine learning, here we are! I used supervised learning (also known as supervised machine learning), which is a subset of machine learning and artificial intelligence. Supervised learning uses labeled datasets to train algorithms that can predict outcomes with precision. Because the data in the data frames I was going to use was continuous, this project required regression modeling rather than classification. Regression analysis, in simplest terms, is a predictive modeling technique that analyzes the relationship between the target feature (the dependent variable, or the RESULT) and the descriptive features (the independent variables, or the PREDICTORS) in a dataset. Ultimately, running multivariable linear regression calculates a coefficient for each descriptive feature, and together those coefficients map a combination of our statistics to the predicted results. I also ran ridge regression as another method of estimating the coefficients and predicted values, as I felt this was a scenario where the independent variables were highly correlated; ridge regression adds a penalty on large coefficients, which tends to stabilize the estimates in that situation. Comparing ridge regression to multivariable linear regression, the predicted values from ridge regression were slightly closer to the actual values.
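For readers who like to see the math, the difference between the two methods can be written as two minimization problems. The notation is mine, not from the PoC: y_i is the observed result for sample i, x_ij is the value of predictor j for that sample, beta_j is the coefficient for predictor j, and lambda is the ridge penalty strength.

```latex
\begin{aligned}
\hat{\beta}^{\text{OLS}} &= \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 \\
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad \lambda \ge 0
\end{aligned}
```

With lambda set to 0 the two problems are identical; as lambda grows, the coefficients are pulled toward zero, which is what keeps them from swinging wildly when predictors are correlated.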

To train the regression models, I split my data into descriptive features and a target feature. Some of the descriptive features included attack pattern, malware class, reconnaissance frequency, attack region, and attack industry. The target feature was the number of cybersecurity incidents.
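Here is a minimal sketch of that split using pandas and scikit-learn. The rows, the column names, and the 75/25 train/test ratio are assumptions for illustration only. One-hot encoding via get_dummies is also my assumption about how text categories such as malware class could be turned into numbers a regression can use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned data: two categorical predictors, one numeric
# predictor, and the incident count we want to predict.
data = pd.DataFrame({
    "malware_class": ["ransomware", "trojan", "worm", "trojan",
                      "ransomware", "worm", "trojan", "worm"],
    "attack_region": ["Northeast", "Midwest", "South", "West",
                      "South", "Northeast", "West", "Midwest"],
    "recon_frequency": [12, 3, 7, 5, 15, 2, 9, 4],
    "incident_count": [30, 8, 17, 12, 36, 5, 22, 10],
})

TARGET = "incident_count"

# Descriptive features: everything except the target, with text
# categories expanded into 0/1 indicator columns.
X = pd.get_dummies(
    data.drop(columns=[TARGET]),
    columns=["malware_class", "attack_region"],
)
# Target feature: the value the model should learn to predict.
y = data[TARGET]

# Hold back a quarter of the rows to test predictions later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```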

I took a deep breath and let it rip, and was relieved and excited to see that the results were very promising.

The calculated coefficients for the predictors closely matched our initial expectations; e.g., as reconnaissance frequency got higher, the predicted number of attacks got higher. A positive coefficient indicated that as the value of the descriptive feature increases, the mean of the target feature also tends to increase, representing a positive relationship. Alternatively, a negative coefficient suggested that as the descriptive feature increases, the target feature tends to decrease, which is a negative (inverse) relationship. There were minor differences between the coefficients produced by multivariable linear regression and ridge regression, but the sign of each coefficient was the same in both models.
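The sketch below shows how coefficient signs can be compared between the two models with scikit-learn. It runs on synthetic data that I generate on the spot, not on the PoC data: the predictor names (recon, scans, patching), the true coefficients, and the ridge alpha of 1.0 are all invented for the example. Two of the predictors are deliberately made highly correlated, since that is the situation where ridge regression is usually suggested.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors; "scans" is built to track "recon" closely.
recon = rng.normal(size=n)
scans = recon + rng.normal(scale=0.1, size=n)
patching = rng.normal(size=n)
X = np.column_stack([recon, scans, patching])

# Synthetic target: rises with recon and scans, falls with patching.
y = 3.0 * recon + 2.0 * scans - 1.5 * patching + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Compare the direction (sign) of each coefficient across the models.
same_signs = np.sign(ols.coef_) == np.sign(ridge.coef_)

for name, a, b in zip(["recon", "scans", "patching"], ols.coef_, ridge.coef_):
    print(f"{name:>9}: linear={a:+.3f}  ridge={b:+.3f}")
print("signs agree:", bool(same_signs.all()))
```

The magnitudes differ between the two fits because ridge shrinks the coefficients, but the direction of each relationship is what tells you whether a predictor pushes the expected incident count up or down.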

It was then time to make predictions using the test data. While I could visually compare the actual versus predicted values, I also used R squared and RMSE (root mean squared error) to validate my model. The rule of thumb I followed for regression modeling states that an RMSE value between 0.2 and 0.5 and an R squared greater than 0.75 are pretty good indicators of a model’s accuracy. Keep in mind that RMSE is expressed in the same units as the target, so that range is only meaningful when the data has been normalized.
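Both metrics are simple enough to compute by hand, which helps demystify them. The sketch below uses only the Python standard library; the actual and predicted numbers are made up for illustration. It also scales the same numbers by 10 to show that RMSE depends on the units of the target while R squared does not.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction miss."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of the target's variance that the predictions explain."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical test-set values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

# The same values expressed in units ten times smaller.
actual_x10 = [a * 10 for a in actual]
predicted_x10 = [p * 10 for p in predicted]

print("RMSE:", rmse(actual, predicted), "->", rmse(actual_x10, predicted_x10))
print("R^2: ", r_squared(actual, predicted), "->",
      r_squared(actual_x10, predicted_x10))
```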

One of the interesting things about machine learning is gauging whether your result is actually “valuable.” Does my model have to be 99.99% accurate? Or does 70% accurate suffice? How do you gauge “success”?

It really depends. If there is no baseline for your model, then any improvement in accuracy is likely valuable. If the current machine learning baseline is 70%, then any number lower than 70% is likely not valuable. Further, truly gauging the value depends on whether the predictive capacity provides business value. For example, in our scenario, could we predict attacks with enough accuracy that the cost and time spent reacting to predicted events would be more than offset by the reduced likelihood that the organization is breached by those attacks and incurs meaningful losses?

The PoC showed enough promise that PPS believes that continued investment makes sense. Moving this from a PoC to production will involve gathering and cleaning a much larger amount of data and training the model on a continuous basis. Unfortunately, some elements of making this successful are not directly within PPS’s control. Getting cleaner data (requiring better and more informed data input by end users reporting security events), normalizing data intra-source and/or inter-source, and making the data available via API will all require cooperation from the different sources.

What’s Next?

Considering the emphasis on security event information sharing in critical infrastructure in the 2021 Presidential Executive Order on Cybersecurity, it’s not unreasonable to expect that these changes will occur over the next year or two. I hope the model is ultimately used to predict attacks and reduce real-world cybersecurity risk, with or without my continued participation!