Researchers uncover fake base stations in cellular networks using machine learning
07-24-2023
Security researchers: Professor Elisa Bertino, Dr. Imtiaz Karim and graduate student Kazi Mubasshir
Recent work: researchers at Purdue CS showed high-quality datasets could be used to detect fake base stations in cellular networks using machine learning algorithms.
We rely on cellular networks for a range of applications, from making phone calls to accessing the internet. However, the proliferation of fake base stations in cellular networks, sometimes called stingrays, cell-site simulators, or IMSI catchers, poses a significant security threat, which can lead to serious consequences.
Security researchers at Purdue University’s Department of Computer Science led a recent project that showed how high-quality datasets could be used to detect fake base stations in cellular networks using machine learning algorithms.
Importance of cellular network security
Phones periodically and automatically broadcast their presence to the cell tower that is nearest to them, so that the phone carrier’s network can provide them with service in that location.
A fake base station masquerades as a cell tower to coerce phones into pinging it rather than legitimate cell towers, thereby revealing the phones' identities. In the past, this was achieved by emitting a signal stronger than that generated by legitimate cell towers in its vicinity.
Authentication was introduced to tackle this problem, but security researchers found a fake base station could still collect user data before the phone software could there was a problem and switch to a cellular tower that was authenticated. With this crucial mistake, the phone or other device reveals information about itself and its user to the operator of the fake base station.
"Prior efforts have used network scanning devices to scan the signals and detect malicious signals in a network to identify fake base stations. Unfortunately, these efforts are highly impractical due to the massive infrastructure cost required to cover all the possible areas.”
Dr. Imtiaz Karim, post doc at Purdue University
Attackers can use fake base stations as stepping stones to launch various multi-step attacks, such as signal counterfeiting, numb attacks, detach/downgrade attacks, energy depletion attacks, and panic attacks. These attacks can cause severe damage to individuals, organizations, and even governments.
“Because of these huge implications, it’s crucial to detect fake base stations hiding in a cellular network,” said Dr. Imtiaz Karim, postdoc at Purdue University. He added, “Prior efforts have used network scanning devices to scan the signals and detect malicious signals in a network to identify fake base stations. Unfortunately, these efforts are highly impractical due to the massive infrastructure cost required to cover all the possible areas.”
One practical solution can be found in device solutions that can detect fake base stations from the network traces and machine learning algorithms which have proven to be very successful in this task. This approach requires high-quality datasets and creating a high-quality dataset to train machine learning algorithms for fake base station detection is a challenging task.
FBS-Detector
Because it is illegal to create fake base stations in public areas and there are no publicly available datasets capable of training the algorithms, the researchers developed FBS-Detector using POWDER to create the dataset. POWDER provides a controlled environment to simulate different cellular network scenarios and create a high-quality dataset for training machine learning models.
The researchers created various topologies in POWDER and ran experiments to capture the legitimate and fake packets transmitted between mobile devices and base stations.
Datasets such as the one collected with POWDER can aid researchers in developing effective solutions for detecting fake base stations in cellular networks and securing them against attacks.
Each packet has multiple fields and each field has multiple attributes. Relevant features were extracted from the packets to represent the characteristics of the network traffic. Feature extraction includes parameters such as signal strength, packet timing, protocol usage, sequence patterns, or statistical properties.
Careful consideration was given to select features that effectively capture the differences between legitimate and fake base stations. The extracted features need to be represented in a suitable format for machine learning algorithms to process. This typically involves organizing the features into a tabular format, where each row corresponds to a sample (packet trace) and each column represents a specific feature.
The additional preprocessing steps, such as normalization or dimensionality reduction techniques like principal component analysis (PCA), were applied to enhance the dataset representation. The unprocessed dataset has a size of 2.5 gigabytes. After the preprocessing steps, there was a large tabular dataset with 200 features and 2500 rows that is sufficient for classical machine learning algorithms to effectively learn and generalize the detection of fake base stations from a network trace.
“Using the dataset, we trained machine learning models in unsupervised learning, which allowed us to detect the presence of fake base stations in cellular networks with high accuracy,” said Karim.
Creating the Dataset
To assess the performance of the machine learning model, the dataset was divided into training and testing subsets. The training set is used to train the model, while the testing set evaluates the model’s performance on unseen data. It is important to ensure an appropriate split ratio to avoid overfitting or underfitting the model, as the pipeline has multiple components.
The segmentation component aims to divide the packet traces into distinct segments or sessions. This is important because a single capture may contain multiple interactions or activities, and analyzing them separately can improve the accuracy of the detection process.
Segmentation can be based on various factors such as time intervals, and protocol types. The recognition component is the core of the pipeline and involves training a machine learning algorithm to recognize patterns or signatures indicative of a fake base station. This typically involves unsupervised learning, where the algorithm is trained using instances of fake base stations and legitimate base stations. Various machine learning techniques such as decision tree, random forest, support vector machine, and k-nearest-neighbor were employed for recognition and classification. The trained model can then be used to predict the presence of a fake base station in unseen packet traces.
The work on POWDER highlights the significance of a high-quality dataset in detecting fake base stations in cellular networks using machine learning algorithms. By carefully preparing the dataset, including appropriate feature extraction, representation, and splitting, researchers can ensure the effectiveness and reliability of the machine learning pipeline for detecting fake base stations. It is crucial to consider the quality, diversity, and representativeness of the dataset to enhance the pipeline’s performance and generalizability in real-world scenarios.
The POWDER platform proved to be a valuable tool in creating a high-quality dataset, which can aid researchers and practitioners in developing effective solutions for detecting fake base stations in cellular networks and securing them against attacks.
The work is led by Dr. Imtiaz Karim, postdoc at Purdue University, Department of Computer Science, and involves Professor Elisa Bertino and graduate student Kazi Mubasshir.
The work has been partially supported by the NSF grant 2112471 – AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE) and by a supplement by the PAWR program.
About the Department of Computer Science at Purdue University
Founded in 1962, the Department of Computer Science was created to be an innovative base of knowledge in the emerging field of computing as the first degree-awarding program in the United States. The department continues to advance the computer science industry through research. US News & Reports ranks Purdue CS #20 and #16 overall in graduate and undergraduate programs respectively, seventh in cybersecurity, 10th in software engineering, 13th in programming languages, data analytics, and computer systems, and 19th in artificial intelligence. Graduates of the program are able to solve complex and challenging problems in many fields. Our consistent success in an ever-changing landscape is reflected in the record undergraduate enrollment, increased faculty hiring, innovative research projects, and the creation of new academic programs. The increasing centrality of computer science in academic disciplines and society, and new research activities - centered around data science, artificial intelligence, programming languages, theoretical computer science, machine learning, and cybersecurity - are the future focus of the department. cs.purdue.edu
Media contact: Emily Kinsell, emily@purdue.edu
Sources: Imtiaz Karim, karim7@purdue.edu
Elisa Bertino, bertino@purdue.edu
Kazi Mubasshir, kmubassh@purdue.edu