GoBoiler Projects
The faculty at the Department of Computer Science conduct research in 14 broad research areas. With more than 80 faculty members and more than 500 PhD students, the opportunities for computational research are endless. Several faculty members have proposed the following potential projects for GoBoiler interns.
Each project link below indicates the Research Area - Faculty mentor name - Project title.
Cities are ecosystems of socio-economic entities that provide concentrated living, working, education, and entertainment options to their inhabitants. Hundreds of years ago, the significantly smaller population and the abundance of natural resources made city design, and even the functioning of cities in relation to their hinterland, quite straightforward. Unfortunately, that is not the case today, with over 3.5 billion people living in cities. Rather, cities, and urban spaces of all sizes, are extremely complex, and their modeling is far from being solved. In this project, we aim to pull together CS, engineering, agricultural economics, and social science to collectively exploit our unique opportunity to address this emerging problem. Research activities will span many fields, including machine learning, data science, and computer graphics/vision, and will involve cross-disciplinary research focused on designing and simulating the functioning of existing and future cities. Our desire is also to pool this knowledge, identify our unique strengths, and pursue large and ambitious computing projects.
Description: Our goal is to study the impact of Generative Artificial Intelligence (GenAI) tools on learning processes in core undergraduate computer science courses. This research aims to understand how GenAI tools can be effectively incorporated into the classroom, promoting constructive learning while mitigating potential risks of academic dishonesty. Furthermore, we will investigate the role and improvement of content generation frameworks for instructors using GenAI tools in education.
Motivation: GenAI models, particularly OpenAI's GPT series, have shown disruptive promise in numerous domains, including education. As these tools become more commonplace in learning environments, there is a pressing need to understand their impact on the educational journey. While they can serve as valuable educational assistants, there is also potential for misuse in the form of academic dishonesty. This project addresses this dichotomy and guides educators on practical and ethical GenAI tool integration.
Expected Contributions from the Student Intern:
- Literature Review:
- Conduct a comprehensive review of current research on using GenAI in educational settings, especially in computer science courses.
- Identify benefits, challenges, and common pitfalls of using such tools.
- Synthesize findings to inform the development and revision of in-class frameworks.
- Framework Development for In-class GenAI Tool Usage:
- Collaborate with the research team to develop frameworks and guidelines for incorporating GenAI tools in classroom settings.
- Propose methods to teach students about the ethical use of these tools.
- Suggest assessment modifications to ensure that while students benefit from AI tools, they are discouraged from misuse.
- Review and Enhance Content Generation Frameworks:
- Analyze current content generation frameworks that use GenAI for educational purposes.
- Propose improvements or modifications to make content more relevant, engaging, and beneficial for computer science students.
- Collaborate in developing prototypes or proof-of-concept implementations for proposed framework modifications.
- Feedback and Iteration:
- Conduct small pilot tests of the proposed frameworks within controlled classroom environments.
- Collect feedback from students and educators regarding their experiences.
- Iterate on the framework based on the feedback, ensuring its practicality and effectiveness.
Final Deliverables:
- A comprehensive literature review document.
- Frameworks for in-class usage of GenAI tools emphasizing constructive learning and academic integrity.
- Revised and enhanced content generation frameworks with prototype implementations.
- A final report detailing the project's findings, implications, and recommendations for educators and institutions.
Reconstruction and geometric modeling of developing 3D biological structures are among the most interesting and visually compelling problems in Computer Graphics. This project is part of the Crops in Silico initiative and grant, which attempts to understand how plants grow and how they can be genetically optimized to feed more people. The objective of this task is to generate biologically plausible 3D geometries of growing plants (maize, sorghum, and wheat) by reconstructing them from a series of images captured over time. Each plant grows for several weeks in a controlled environment and is regularly photographed from multiple directions using RGB, multispectral, and infrared cameras. The data needs to be converted into 3D geometry using deep learning, and the individual plants then need to be combined into a functional-structural plant model that also captures the temporal dimension (growth).
Information Security & Assurance
Program synthesis aims to automatically generate a program that satisfies the user intent expressed through some high-level specification. For instance, one of the most popular styles of inductive synthesis, Counterexample-Guided Inductive Synthesis (CEGIS), starts with a specification in which the user defines what the desired program should do; a synthesizer then produces a candidate program that might satisfy the specification. A verifier decides whether that candidate program meets the desired specification. If the specification is satisfied, we are done; if not, the verifier provides feedback that the synthesizer uses to guide its search for new candidate programs. While program synthesis has been successfully used in areas including computer-aided education, end-user programming, and data cleaning, its application and scope for security and safety remain largely unexplored by the technical community. In this project, we will explore algorithms and techniques to automatically generate programs from formal or informal specifications to improve the security and safety of users and their environments. We will focus on programs used to automate heterogeneous and connected sensors/actuators.
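The CEGIS loop can be made concrete with a small, self-contained sketch (our own toy example, not an existing synthesizer): the candidate "programs" are linear functions over a bounded integer search space, `spec` encodes the user intent, and the verifier returns counterexamples that steer the next round of synthesis.

```python
# A minimal, self-contained sketch of the CEGIS loop described above.
# The "programs" are toy linear functions a*x + b, the specification is a
# reference behavior, and the verifier checks a bounded input domain.
# All names here (spec, synthesize, verify) are illustrative, not a real tool.

from itertools import product

DOMAIN = range(-10, 11)          # bounded input domain the verifier checks
spec = lambda x: 2 * x + 3       # user intent: the desired input/output behavior

def synthesize(examples):
    """Return a candidate (a, b) consistent with all counterexamples seen so far."""
    for a, b in product(range(-5, 6), repeat=2):
        if all(a * x + b == y for x, y in examples):
            return (a, b)
    return None

def verify(candidate):
    """Return None if the candidate matches the spec on DOMAIN, else a counterexample."""
    a, b = candidate
    for x in DOMAIN:
        if a * x + b != spec(x):
            return (x, spec(x))
    return None

examples = []                     # counterexamples accumulated across iterations
while True:
    candidate = synthesize(examples)
    if candidate is None:
        print("no program in the search space satisfies the spec")
        break
    cex = verify(candidate)
    if cex is None:
        print("synthesized program: f(x) = %d*x + %d" % candidate)
        break
    examples.append(cex)          # feedback from the verifier guides the next search
```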
Information Security & Assurance
Fuzzing is the process of testing a software application (the fuzz target) with inputs generated according to an input generation policy and detecting whether any of the test inputs trigger a bug. Additionally, a fuzzer can use feedback from the fuzz target to guide its input generation. A fuzzer typically includes two key components: an input generation module and a bug oracle. The input generation module implements the input generation process and may optionally take feedback from the fuzz target to guide the generation. The bug oracle signals to the fuzzer whether a generated input has triggered a bug. Fuzzing has conventionally been used to discover memory-access violation bugs, such as buffer overflows and use-after-free errors. The application of fuzzing to detect safety and security policy violations in the Internet of Things (IoT) and Cyber-Physical Systems (CPS), however, has not been well explored. This project aims at identifying and addressing the challenges of extending fuzzing to detect policy violations in IoT/CPS environments.
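As a rough illustration of the two components named above, the following sketch (entirely hypothetical: the `target`, the mutation policy, and the safety policy are stand-ins for a real IoT/CPS system) wires an input generation module and a bug oracle into a basic fuzzing loop that flags policy violations rather than crashes.

```python
# A minimal sketch of the two fuzzer components described above: an input
# generation module (here, random byte mutation) and a bug oracle (here, a
# policy check on the target's observable state). The target and the policy
# are hypothetical stand-ins, not a real IoT/CPS system.

import random

def target(data: bytes) -> dict:
    """Hypothetical fuzz target: returns the state it reaches for an input."""
    return {"valve_open": data[:2] == b"OP", "pressure": sum(data) % 300}

def generate(seed: bytes) -> bytes:
    """Input generation module: randomly mutate one byte of a seed input."""
    buf = bytearray(seed)
    buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

def bug_oracle(state: dict) -> bool:
    """Bug oracle: flag a (made-up) safety policy violation instead of a crash."""
    return state["valve_open"] and state["pressure"] > 250

seed = b"OPEN-VALVE-CMD"
for _ in range(10000):
    test_input = generate(seed)
    state = target(test_input)
    if bug_oracle(state):
        print("policy violation triggered by input:", test_input)
        break
```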
Information Security & Assurance
Today, machine learning (ML) touches almost all aspects of our lives. From health care to finance, and from computer networks to traffic signals, ML-based solutions offer improved automation. However, current centralized approaches introduce a significant privacy risk when the training data comes from mutually distrusting sources. As decentralized training data and inference environments are a harsh reality in almost all of the above application domains, it has become important to develop privacy-preserving techniques for ML training, inference, and disclosure. In particular, we plan to work on two key privacy-preserving ML challenges: Is it possible to train models on confidential data without ever exposing the data? Can a model classify a sample without ever seeing it? In this project, we will design and evaluate a novel, specialized secure multi-party computation (MPC) design to answer these two questions. Although some theoretical solutions are already available in the literature, our focus will be on developing MPC solutions that significantly improve efficiency over current approaches to privacy-preserving ML.
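One textbook building block behind the first question is additive secret sharing, sketched below with illustrative numbers; this is only meant to convey the flavor of MPC, not the specialized protocol the project will design.

```python
# A minimal sketch of the MPC building block behind "training without exposing
# the data": additive secret sharing over a prime field. Each data owner splits
# its private value into random shares; parties aggregate shares locally, and
# only the final sum is reconstructed. This is a textbook illustration, not the
# protocol this project will develop.

import random

P = 2**61 - 1  # prime modulus defining the field for the shares

def share(secret: int, n_parties: int):
    """Split a secret into n additive shares that sum to the secret mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two hospitals privately hold patient counts; three compute parties learn only shares.
private_values = [1234, 5678]
per_party = [share(v, 3) for v in private_values]

# Each compute party adds up the shares it received (purely local computation).
aggregated = [sum(col) % P for col in zip(*per_party)]

# Reconstructing the aggregate reveals only the sum, not the individual inputs.
assert reconstruct(aggregated) == sum(private_values)
print("secure sum:", reconstruct(aggregated))
```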
Pictures and videos are taken everywhere and processed to retrieve critical information by running computer vision (CV) tasks such as image classification, semantic segmentation, object detection and tracking, and many others. However, such CV tasks require heavy computation and are often performed on the cloud/edge side (for example, human detection over videos captured by RING surveillance cameras). This cloud/edge-based paradigm exposes a real risk of leaking user/home privacy to the outsiders who run the cloud/edge service. In this project, we plan to explore our recent technique called OPA (One-Predict-All) to protect videos/pictures sent to the cloud/edge without hurting the accuracy of the CV tasks.
Information Security & Assurance
Thanks to rapid advances in AI, particularly natural language processing (NLP) technologies, many customer voice calls are now answered by machines. Due to unique features of these answering machines, we find that they can be exploited to leak confidential voice call information. In this project, we aim to explore advanced machine learning techniques to infer confidential voice call information from encrypted 5G/4G voice calls over the air. We plan to develop proof-of-concept attacks to expose real threats to 5G/4G call users while they are talking to popular voice call answering systems.
Machine Learning & Artificial Intelligence
This project explores the intersection of Large Language Models (LLMs) and knowledge graphs to address the inherent limitation of LLMs, which are often ungrounded in facts, people, and places. While LLMs excel in natural language understanding, their responses may lack context and factual accuracy. By integrating LLMs with knowledge graphs, which provide structured representations of information about facts, entities, and relationships, this research aims to bridge the gap between language models and grounded knowledge, enhancing the accuracy and relevance of natural language processing tasks. The key challenge of this project is the transfer of information between the positional word embeddings of LLMs and the permutation-equivariant representations of facts, people, and places obtained from knowledge graphs.
Related papers:
- J. Gao, Y. Zhou, J. Zhou, B. Ribeiro. Double Equivariance for Inductive Link Prediction for Both New Nodes and New Relation Types. arXiv:2302.01313.
- L. Cotta, B. Bevilacqua, N. Ahmed, B. Ribeiro. Causal Lifting and Link Prediction. Proceedings of the Royal Society A, 2023.
- S. C. Mouli, B. Ribeiro. Asymmetry Learning for Counterfactually-Invariant Classification in OOD Tasks. ICLR 2022.
- B. Srinivasan, B. Ribeiro. On the Equivalence between Node Embeddings and Structural Graph Representations. ICLR 2020.
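To make the representation-mismatch challenge concrete, here is a small, hedged numpy illustration (our own toy example, unrelated to the methods in the papers above): an ID-keyed embedding lookup changes when entities are arbitrarily renamed, whereas a structural summary built only from relation types is invariant to the renaming.

```python
# A small illustration of the mismatch noted above: a lookup-table embedding
# keyed on entity IDs changes when entities are arbitrarily renamed, while a
# simple structural (permutation-invariant) summary built only from relation
# types does not. This toy code is our own example, not the papers' methods.

import numpy as np

rng = np.random.default_rng(0)
relations = ["born_in", "works_at", "located_in"]
rel_vec = {r: rng.normal(size=4) for r in relations}   # shared relation embeddings

def structural_repr(entity, triples):
    """Sum of relation embeddings for edges incident to the entity (ID-agnostic)."""
    vecs = [rel_vec[r] for h, r, t in triples if entity in (h, t)]
    return np.sum(vecs, axis=0)

def lookup_repr(entity, entity_table):
    """ID-based embedding lookup, analogous to a fixed vocabulary embedding."""
    return entity_table[entity]

kg = [("alice", "born_in", "paris"), ("alice", "works_at", "acme")]
renamed = [("e1", "born_in", "e2"), ("e1", "works_at", "e3")]   # same graph, new IDs

table = {e: rng.normal(size=4) for e in ["alice", "paris", "acme", "e1", "e2", "e3"]}

# The structural summary is unchanged under renaming; the lookup embedding is not.
print(np.allclose(structural_repr("alice", kg), structural_repr("e1", renamed)))  # True
print(np.allclose(lookup_repr("alice", table), lookup_repr("e1", table)))          # False
```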
A fundamental challenge of detecting or preventing software bugs and vulnerabilities is to know programmers' intentions, formally called specifications. If we know the specification of a program (e.g., where a lock is needed, what input a deep learning model expects, etc.), a bug detection tool can check if the code matches the specification.
Building on our expertise from being the first to extract specifications from code comments to automatically detect software bugs and bad comments, in this project we will analyze various new sources of software textual information (such as API documents and StackOverflow posts) to extract specifications for bug detection. For example, the API documents of deep learning libraries such as TensorFlow and PyTorch contain a lot of input constraint information about tensors.
Our recent prior work and background can be found here: [Software Text Analytics]
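As a hedged sketch of how such mined constraints could become executable checks, the snippet below encodes a documentation-style requirement (a 4-D float32 input tensor) as a small specification record and validates a call argument against it; the extraction step and the `extracted_spec`/`check_against_spec` names are our own illustrative assumptions.

```python
# A hedged sketch of turning a constraint extracted from API documentation into
# an executable check. The "specification" below paraphrases the kind of
# sentence found in deep-learning API docs (e.g., that a convolution expects a
# 4-D input tensor); the extraction step itself is assumed, and the checker is
# our own toy illustration.

import numpy as np

# Constraint as it might be mined from documentation text:
# "input must be a 4-D tensor with dtype float32"
extracted_spec = {"api": "conv2d", "rank": 4, "dtype": np.float32}

def check_against_spec(arg: np.ndarray, spec: dict):
    """Report mismatches between an actual argument and the mined specification."""
    violations = []
    if arg.ndim != spec["rank"]:
        violations.append(f"expected rank {spec['rank']}, got {arg.ndim}")
    if arg.dtype != spec["dtype"]:
        violations.append(f"expected dtype {spec['dtype']}, got {arg.dtype}")
    return violations

bad_input = np.zeros((32, 28, 28), dtype=np.float64)   # 3-D, wrong dtype
for v in check_against_spec(bad_input, extracted_spec):
    print(f"potential bug in call to {extracted_spec['api']}: {v}")
```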
Programming Languages and Compilers
We will build cool and novel techniques to make deep learning code, such as code written with TensorFlow and PyTorch, reliable and secure. This work builds on our award-winning paper (ACM SIGSOFT Distinguished Paper Award)!
Machine learning systems including deep learning (DL) systems demand reliability and security. DL systems consist of two key components: (1) models and algorithms that perform complex mathematical calculations, and (2) software that implements the algorithms and models. Here software includes DL infrastructure code (e.g., code that performs core neural network computations) and the application code (e.g., code that loads model weights). Thus, for the entire DL system to be reliable and secure, both the software implementation and models/algorithms must be reliable and secure. If software fails to faithfully implement a model (e.g., due to a bug in the software), the output from the software can be wrong even if the model is correct, and vice versa.
This project aims to use novel approaches, such as differential testing, to detect and localize bugs in DL software (both code and data) and to address the testing oracle challenge.
Our recent prior work and background can be found here: [EAGLE-ICSE22] [Fairness-NeurIPS21] [Variance-ASE20]
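For intuition, the snippet below is a minimal differential-testing sketch (illustrative only, not the project's tooling): two independent softmax implementations serve as each other's oracle, and any disagreement beyond numerical tolerance on randomly generated inputs is reported for localization.

```python
# A minimal sketch of differential testing as an oracle: two independent
# implementations of the same computation (here, softmax) are run on the same
# randomly generated inputs, and any disagreement beyond numerical tolerance is
# flagged for bug localization. The implementations and tolerance below are
# illustrative choices, not the project's actual tooling.

import math
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    """Numerically stable vectorized implementation."""
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / e.sum()

def softmax_alternative(xs) -> list:
    """Independent scalar implementation used for cross-checking."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [v / total for v in exps]

rng = np.random.default_rng(42)
for i in range(1000):
    x = rng.normal(scale=10.0, size=8)
    out_a = softmax_reference(x)
    out_b = np.array(softmax_alternative(x.tolist()))
    if not np.allclose(out_a, out_b, rtol=1e-6, atol=1e-9):
        print(f"divergence on input {i}: max diff {np.abs(out_a - out_b).max()}")
```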
Information Security & Assurance
In this project, we will develop machine learning approaches, including code language models, to automatically learn bug and vulnerability patterns and fix patterns from historical data in order to detect and fix software bugs and security vulnerabilities. We will also study and compare general code language models and domain-specific language models.
Our recent prior work and background can be found here: [VulFix-ISSTA23] [CLM-ICSE23] [KNOD-ICSE23]
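A simplified sketch of the "learn from historical data" step is shown below: it diffs a before/after version of a function to mine (buggy, fixed) line pairs that could feed a learned fix model. The snippets and the pairing heuristic are hypothetical simplifications of real commit mining.

```python
# A small sketch of mining (buggy, fixed) pairs from historical code changes to
# serve as training data for a learned fix model. It diffs a "before" and
# "after" version of a function and records replaced or inserted lines; the
# snippets and the pairing heuristic are hypothetical simplifications.

import difflib

before = """def divide(a, b):
    return a / b
""".splitlines()

after = """def divide(a, b):
    if b == 0:
        return None
    return a / b
""".splitlines()

fix_pairs = []  # (buggy context lines, fixed replacement lines) examples
matcher = difflib.SequenceMatcher(a=before, b=after)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag in ("replace", "insert"):
        fix_pairs.append((before[i1:i2], after[j1:j2]))

for buggy, fixed in fix_pairs:
    print("buggy :", buggy)
    print("fixed :", fixed)
```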
Artificial Intelligence, Machine Learning, and Natural Language Processing
Many deployed machine learning models, such as ChatGPT and Codex, are accessible via a pay-per-query interface. It can be profitable for an adversary to extract these models, either to steal the model itself or to perform reconnaissance for further attacks. Recent model-extraction attacks on Machine Learning as a Service (MLaaS) systems have moved towards data-free approaches, showing the feasibility of stealing models trained with difficult-to-access data. However, these attacks are ineffective or limited due to the low accuracy of the extracted models and the high number of queries to the models under attack. The high query cost makes such techniques infeasible for online MLaaS systems that charge per query.
In this project, we will design novel approaches that achieve higher accuracy and query efficiency than prior data-free model extraction techniques.
Our recent prior work and background can be found here: [DisGUIDE-AAAI23]
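For a sense of the setting, the toy sketch below lets an "adversary" query a black-box victim classifier under a fixed budget and train a surrogate on the returned labels; the victim, budget, and surrogate are invented for illustration, and a data-free attack would synthesize queries with a generator rather than sampling them randomly.

```python
# A hedged, toy illustration of the extraction setting described above: the
# adversary can only query a black-box "victim" classifier, collects the
# returned labels, and trains a surrogate model on the query/response pairs.
# The victim, the query budget, and the surrogate are all made up for this
# sketch.

import numpy as np

rng = np.random.default_rng(1)
w_victim = np.array([2.0, -1.0])          # hidden parameters of the victim model

def victim_query(x: np.ndarray) -> int:
    """Black-box API: returns only a class label (each call costs money)."""
    return int(x @ w_victim > 0)

# 1. Spend a fixed query budget collecting (input, label) pairs.
budget = 500
queries = rng.normal(size=(budget, 2))
labels = np.array([victim_query(x) for x in queries])

# 2. Fit a surrogate (logistic regression trained by gradient descent).
w_sur = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(queries @ w_sur)))
    w_sur -= 0.1 * (queries.T @ (p - labels)) / budget

# 3. Measure agreement between surrogate and victim on held-out inputs.
test = rng.normal(size=(1000, 2))
agreement = np.mean([(t @ w_sur > 0) == victim_query(t) for t in test])
print(f"surrogate/victim agreement: {agreement:.2%}")
```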