Distinguished paper The Impact of Regular Expression Denial

Distinguished paper The Impact of Regular Expression Denial

Distinguished paper The Impact of Regular Expression Denial of Service (ReDoS) in Practice James Davis Francisco Servant Christy Coghlan Dongyoon Lee -1- Contributions 1. ReDoS regexes are prevalent in the wild Not just one or two Thousands!

2. Heuristics to detect them are inaccurate 3. Developers (try to) fix them with 3 -2- Background -3- Regexes What are they? Examples Describe a language Used for Input validation Text manipulation

Some a /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ IPv4 /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ Username (Poorly tested) [W&S18]: Thursday! /^a+$/ ReDoS regex -4- Regex Engines What are they? match(regex, string) this string?

Does this regex match How do they work? 1. /^a+$/ 2. Simulate on input aaa -5- ReDoS Regexes Simple ReDoS regex /^(a+)+$/ NFA Malicious input aaaaaaaaaaaa! Mismatch

Recurrence relation T(n) = 2*T(n-1) Exponential paths = 2*(2*T(n-2)) = O(2n) -6- Regular Expression Denial of Service (ReDoS) [C&W 03] /^[a-zA-Z0-9]+([._]?[a-zA-Z0-9]+)*$/ aaaaaaa! T h ro u g h p u t Susie

Malicious input injected [DWL 18] ReDoS attack on our Node.js server 8000 6000 [S&P 18] 4000 2000 0 0 1 2 3

4 5 6 Time (seconds) 7 8 9 10 -7- Research -8-

Theme 1 ReDoS Regexes in the Wild -9- Collecting Regexes Module selection Module regex extraction 45% 565K 35% 125K Giant list of regexes

Can clone filter 375K (66%) 70K (58%) 350K 60K - 10 - Analyzing Regexes 60K 350K 2. Degree 1. ReDoS regexes

1000 [R&T 14] [WMBW 16] [WOHD 17] 900 800 700 600 500 400 3.6K (1%) 704 (1%) 13K (3%)

705 (1%) 300 200 100 0 0 5 10 15 20 25 30

35 - 11 - % o f R e D o S re g e x ReDoS Regexes are Usually Quadratic Degree of vulnerability 80 60 40 20 0 npm Exponential n^2

pypi n^3 n^4 - 12 - ReDoS Regexes occur in Prominent Places 1 regex 2 regexes 1 regex 3 regexes 3 regexes - 13 -

Theme 2 Do Developers Heuristics Work? - 14 - Heuristics Used in Practice Finding Heuristics Reference books Regex websites What they said Star height saferegex Watch out when different parts of the regex can match the same text

Our Heuristics Exponentia Star height ( ) Q.O.D. (a|a) Q.O.A. a a Polynomial - 15 - Few false negatives

Many false positives Developed detectors for these heuristics Imprecise modeled on safe-regex (star height) % False Positives % ReDoS Regexes 80 80 40 40 0 0 SH

QOD QOA SH QOD QOA - 16 - Bringing Research to Practice I am now the maintainer of safe-regex Fixed false negatives in star height heuristic Incorporating (improved) QOD and QOA heuristics - 17 - Theme 3

The Repair of ReDoS Regexes - 18 - (ReDoS) Regexes Are Hard to Understand /^(\+|-)?(\d+|(\d*\.\d*))? (E|e)?([-+])?(\d+)?$/ /([^\=\*\s]+)(\*)?\s*\=\s* (?:([^;'"\s]+\'[\w]*\ [^;\s]+)|(?:\"([^"]*)\")| ([^;\s]*))(?:\s*(?:;\s*)|$)/ 1. /^([\\s\\S]*?)((?:\\.{1,2}|[^\\\\\\/]+?|)(\\.[^.\\/\\\\]*|))(?:[\\\\\\/]*)$/ 2. /^(\\/?|)([\\s\\S]*?)((?:\\.{1,2}|[^\\/]+?|(\\.[^.\\/]*|))(?:[\\/]*)$/ 1.

2. 3. [CWS 17] /^\[email protected]\S+\.\S+$/ /^(.*?)([.,:;!]+)$/ /<(\/)?([^ ]+?)(?:(\s*\/)| .*?)?>/ 1. /\+OK.*(<[^>]+>)/ 2. /\s*#?\s*$/ 3. /^\s*/\*\s*(.+?)\s*\*/\s*$/ - 19 - Methodology Historic Disclosures & Fixes Study all ReDoS reports

in CVE and Snyk.io DBs What do developers prefer when they know all the fix strategies? Email 284 module maintainers Vulnerability disclosure Fix strategies - 20 - Fix Strategies For RedoS Regexes Original/^\[email protected]\S+\.\S+$/ Fix strategies Trim TRUNCATE(input, 1000) Revise /^[^@][email protected]([^\[email protected]]+\.)+$/ Replace* (Custom parser) Exactly one @, somewhere in the middle of the string

A . to the right of the @ But not immediately to the right - 21 - Fix strategies and correctness Tr im 1 incorrect 2 incorrect Tr im 40 30 20 10

0 Re vi se Re pl ac e 20 15 10 5 0 New Fixes Re vi se

Re pl ac e Historic All correct - 22 - Closing Remarks - 23 - (Some) Limitations and Future Work Reachability Module vs. application? Why do developers. use regexes? [C&S 16]

ReDoS regex == ReDoS? Scaling up [S&P 18]? Study how modules are used? - 24 - ReDoS regexes are a real problem in practice Regexes are widely used in JavaScript and Python modules 1% of unique regexes are ReDoS regexes ReDoS regexes occur in 1-3% of modules Heuristics are inaccurate Thank ReDoS regexes are hard to fix

you for your attention - 25 - Bonus material - 26 - Subset of all possible SL regexes We study the sub-class of regexes with super-linear structure Structure permits redundant state exploration via backtracking Graph reachability problem We ignore super-linear regex features Backreferences, lookaround assertions - 27 -

False negatives in the SL regex detectors The SL regex detectors are using fullMatch semantics Regex engines sometimes implement partialMatchunwisely Thus, and somewhat horrifyingly: /(a+)b/ may be super-linear Equivalent to the fullMatch /^.*?(a+)b/ QOA - 28 - Regex usage in practice - 29 - Heuristics

Heuristic Example Complexity Star height > 1 (a+)+ Exponential Q.O. Disjunction* (a|a)+ Exponential ''

.+.+ Polynomial (earlier example) 2 choices; 1 path explores Q.O. Adjacency* Why? 2 choices; each path explores the same states

- 30 - Domains User-agent strings Emails UR L HTML Numeric - 31 - Regex Domains Semantic

meaning Error message File name HTML URL CamelCase etc. Source code User-agent strings Whitespace Number Email # in npm # in pypi 22 K 10 K 8K 7K

4K 4K 3K 881 497 2.5 K 2K 1K 105 124 2K 762 444 441 238 97 - 32 -

ReDoS Regexes occur in Different Domains 8 % % ReDoS Regexes In Each Domain 6 4 2 0 npm pypi - 33 - Fix strategies

xing ReDoS Regexes is hard Histori c New Revising is popular Total Trim Revise Replac e Fix approach

37 8 18 11 # Unsafe 3 1 2 0 Fix approach

48 3 35 15 # Unsafe 0 0 0 0 - 34 - About That Microsoft Regex

We disclosed ~50 ReDoS regexes to Microsoft After several months Listed me in their July "Security Researcher Acknowledgments" Would not tell me what changes resulted from my report - 35 - More Limitations and Future Work ReDoS regex dects. More feature support [SJXYML

18] Reachability Why do devs. use regexes? ReDoS regex == ReDoS? Generalizability Scaling up [C&S 16] [S&P 18]? Trends across langs. / apps.? - 36 - Why not applications? Open-source applications may not be

representative of code in industry Module ecosystems are shared by open-source and industry Modules are sometimes authored by industry as a way to give back to the open-source community - 37 - Why git projects instead of artifacts? Bundling / Obfuscation complicates analysis on registry artifacts - 38 - ReDoS Regexes by Size and Popularity Downloads (log)

>1000 downloads/month Especially concerni Lines of code (log) - 39 -

Recently Viewed Presentations

  • Application of the &quot;UPCS Guidance and Protocol Clarifications ...

    Application of the "UPCS Guidance and Protocol Clarifications ...

    The media, politicians, and our own HUD staff, have raised questions concerning substandard repairs of inspectable items. To minimize the possibility of the H&S hazards that may result from substandard repairs. i.e. misaligned or improperly vented water heaters.
  • Romeo and Juliet

    Romeo and Juliet

    Peter, Sampson, Gregory - Juliet's servants. Introduction to characters. Neutral. Prince Escalus - ruler of Verona . Friar John - priest, friends with Friar Lawrence. Apothecary - sells potions/drugs. Introduction to characters. Read the prologue together.
  • Smith&#x27;s Hill

    Smith's Hill

    Assessment Task Schedule Summary Booklet Assessment Task Schedule in each HSC Course. All of these documents will be made available on Sentral this week. Students must comply with the rules for assessment outlined in the Assessment Policy. Students must complete...
  • Atelier de travail Poitiers, les 26 et 27

    Atelier de travail Poitiers, les 26 et 27

    Atelier de travail Poitiers, les 26 et 27 mars 2013 Mônica Macedo-Rouet Karine Aillerie Plan Programme de l'atelier Présentation d'iTEC Résultats préliminaires Témoignages d'enseignants iTEC - Designing the future classroom * Programme iTEC - Designing the future classroom * Questions...
  • Technical English: Fewer is better!

    Technical English: Fewer is better!

    TECHNICAL WRITING WORKSHOP. John Morris. KRIS, KMITL. previously. Engineering, Mahasarakham University. Electrical and Computer Engineering, The University of Auckland
  • An Update from the PCC URIs in MARC

    An Update from the PCC URIs in MARC

    [A Chew ChiatNaun quote from a Back Stage Metadata Matters event.] A Chew ChiatNaun (from Cornell) quote from a Back Stage Metadata Matters event. So if barcodes don't excite you, maybe dogs and chairs.
  • Emagram - UMD

    Emagram - UMD

    An area in the Tephigram denotes total HEAT or ENERGY added to a cyclic process Start 9/16/14 Tephigram*** ∮ đq = ∮ T d φ = cp∮ Tdθ /θ = cp∮ Td(lnθ) T lnP Dry adiabats The tephigram Allows a...
  • What is #LODLAM?! Understanding Linked Open Data in Libraries ...

    What is #LODLAM?! Understanding Linked Open Data in Libraries ...

    What is #LODLAM?! Understanding Linked Open Data in Libraries, Archives [and Museums] Alison Hitchens OLA Superconference January 29, 2015 Toronto, ON