Project Details
Description
This project will improve the security of software. The project will focus on cybersecurity issues in regular expressions. Regular expressions are an important tool used by computer programmers to manipulate data. Regular expressions are applied in many ways, including to validate input in a web form and to check internet traffic for malicious activity. Unfortunately, computer programmers often use regular expressions incorrectly, leading to insecure program behavior. These behaviors result in errors with serious cybersecurity consequences, including allowing malicious actors to steal personal information, seize control of a computer, or cause many websites to crash. This project will address these limitations by improving regular expression engineering practices, and by and making more trustworthy the infrastructure on which regular expressions rely. The team will incorporate undergraduate researchers, develop educational material, and engage with K-12 students. The successful completion of the project will be a significant step towards eliminating cybersecurity incidents related to regular expressions.
This project will design, develop, and evaluate (Part 1) New techniques to make it easier for programmers to re-use high-quality regular expressions; and (Part 2) Novel regex engines that are safe from regular expression denial of service (ReDoS). In Part One, the team proposes processes and tools to help engineers develop correct regexes. The approach is grounded in the re-use paradigm, helping engineers learn from others' expertise. However, to enable re-use, open problems must be addressed in regex indexing, querying, matching, ranking, and comparison. Building on a dataset of 853,818 regexes, the team will develop regex clustering techniques, and integrate novel tool development with user studies to understand modalities and metrics for querying, ranking, and comparison. Synthesizing these techniques, machine learning and new algorithms to enable the reuse-based composition, synthesis, and repair of security sensitive regexes will be applied. Project findings will be embodied in a novel publicly-accessible regex search engine and accompanying tools. In Part Two, the team will improve the trustworthiness of regex engines by eliminating the problematic worst-case characteristics. The team has begun exploring algorithmic advances that address its worst-case super-linear behavior. The team will design a ReDoS-safe algorithm with a provably constant space bound and develop novel worst-case analyses for extended features (e.g., backreferences). For practicality, the team's regex engine changes must be transparent. However, backwards compatibility checking for regex engines is an open problem. The team will develop the first regex engine semantic testing techniques, based on metamorphic and differential testing; and enable regex engine performance regression testing through the first systematic regex performance benchmark.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
| Status | Finished |
|---|---|
| Effective start/end date | 06/15/22 → 05/31/26 |
Funding
- National Science Foundation: $274,000.00
Fingerprint
Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.