Secrets Patterns DB: Building Open-Source Regex Database for Secret Detection
Detecting secrets is possible and can be automated. Open-source tools do a good job of analyzing the Git tree for potential secrets, using two approaches:
Regular expression (regex) rules: a dataset of patterns that match the known formats of passwords, API keys, API tokens, and cloud API keys.
- If written correctly, the rules produce high-confidence findings and few false-positive alerts.
- They see only a small part of the picture: if the rule set covers 40-80 patterns and none of them matches a particular API token, that token will not be discovered.
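To make the trade-off concrete, here is a minimal sketch of rule-based scanning. The two rules, the `scan_text` helper, and the sample input are illustrative assumptions, not taken from any particular tool. The second line of the sample contains a token that no rule covers, so it goes unreported.

```python
import re

# Hypothetical rule set for illustration only; real tools ship their own rules.
RULES = {
    "AWS Access Key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub Personal Access Token": re.compile(r"\bghp_[0-9A-Za-z]{36}\b"),
}

def scan_text(text):
    """Return (rule_name, line_number, matched_text) for every rule hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES.items():
            for match in pattern.finditer(line):
                findings.append((name, lineno, match.group(0)))
    return findings

# Line 1 uses the example key ID from AWS documentation; line 2 is a made-up
# vendor token that none of the rules above describes.
sample = (
    'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    'vendor_token = "vnd-9f8e7d6c5b4a39281706"\n'
)

for finding in scan_text(sample):
    print(finding)
```

Only the first line is reported. The second token is missed purely because no pattern exists for it, which is the coverage gap described above.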
Shannon’s Entropy Checks
Shannon’s entropy estimates the average amount of information carried by each symbol of a value. Put differently, it measures how unpredictable a message is: a randomly generated API key scores high, while ordinary words and identifiers score low. It has a variety of use cases in computer science, including data compression, validating cryptography, and here, finding passwords and secrets.
- Context-agnostic: The algorithm can be applied against any language, framework, or codebase, and is trivial to compute.
- Does not require the configuration of pre-defined Regular Expressions.
- Can find secrets that would have never been found with pre-defined Regular Expressions.
- The false-positive rate is high: It requires a manual validation of alerts before opening tickets about leaked secrets. Analysts would need to verify what type of secret it is. This will generate overhead in triaging findings.
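The entropy check can be sketched in a few lines. This is a minimal illustration, not the implementation of any specific tool: the `TOKEN` regex, the minimum token length of 20, and the default threshold of 4.5 bits per character are assumptions chosen for the example and would need tuning in practice.

```python
import math
import re
from collections import Counter

def shannon_entropy(value):
    """Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Assumed tokenizer: runs of 20+ characters from a base64-like alphabet.
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")

def find_high_entropy_tokens(text, threshold=4.5):
    """Return tokens whose entropy exceeds the (assumed) threshold."""
    return [t for t in TOKEN.findall(text) if shannon_entropy(t) > threshold]

text = 'key = "abcdefghijklmnopqrstuvwxyz012345"\npad = "aaaaaaaaaaaaaaaaaaaaaaaa"'
print(find_high_entropy_tokens(text))
```

The first token uses 32 distinct characters and is flagged. The second is one repeated character, so its entropy is zero and it is ignored. Nothing in the check says what kind of secret the flagged token is, or whether it is a secret at all, which is where the triage overhead comes from.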
We’re doing Regex scanning wrong. Let’s fix this, together
While several open-source tools utilize regular expressions to detect secrets in codebases, the number of built-in rules for these tools is limited. TruffleHog v2 offers approximately 40 rules, TruffleHog v3 offers around 790 patterns, and GitLeaks offers approximately 60 rules. While it’s a good start, it’s not enough.
This project started before TruffleHog v3 was released. At that time, the largest rules database was GitLeaks, with about 60 rules. TruffleHog v3 helped a lot by collecting a large dataset, but its rules cannot be ingested by other tools, because each detection rule is implemented as its own Golang module. This means that we would have to use TruffleHog v3 itself to make use of its detection rules.
I have compiled and curated a database of regular expression patterns for secrets, API tokens, keys, and passwords to improve the detection of secrets in codebases. This project I built, Secrets-Patterns-DB, contains over 1600 patterns and is being open-sourced in the hope that security teams will contribute to and improve it.
To ensure the quality and effectiveness of these patterns, I have written scripts to validate them against ReDoS attacks and created CI jobs to load and validate the patterns. I have also manually cleaned up any invalid patterns.
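As a rough idea of what such a validation pass might look like, the sketch below compiles each pattern and flags one well-known ReDoS smell, a quantified group that itself contains a quantifier, such as `(a+)+`. It is not the project's actual script: the `name` and `regex` keys, the `validate_patterns` helper, and the heuristic are assumptions made for illustration, and the heuristic is deliberately crude (it ignores `{n,}` quantifiers and nested groups, for example).

```python
import re

# Crude heuristic: a parenthesized group containing + or *, itself followed by + or *.
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*]")

def validate_patterns(patterns):
    """Sort pattern names into (valid, invalid, suspicious) lists.

    Each entry is assumed to be a dict with hypothetical "name" and "regex" keys.
    """
    valid, invalid, suspicious = [], [], []
    for entry in patterns:
        name, regex = entry["name"], entry["regex"]
        try:
            re.compile(regex)
        except re.error:
            invalid.append(name)      # does not compile at all
            continue
        if NESTED_QUANTIFIER.search(regex):
            suspicious.append(name)   # compiles, but may backtrack badly
        else:
            valid.append(name)
    return valid, invalid, suspicious

patterns = [
    {"name": "AWS Access Key ID", "regex": r"AKIA[0-9A-Z]{16}"},
    {"name": "Broken rule", "regex": r"[0-9"},
    {"name": "Backtracking rule", "regex": r"(a+)+$"},
]
print(validate_patterns(patterns))
```

A check like this could run in CI, failing the build whenever the invalid or suspicious list is non-empty.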
I encourage security teams to use and contribute to Secrets-Patterns-DB to enhance the security of their codebases.
The project is in Beta. There’s a lot of room for improvement on the project. I’m looking forward to your Pull Requests and Issues on GitHub to enhance Secrets-Patterns-DB for everyone.
Unified Pattern Format for all tools
Secrets-Patterns-DB has a unified pattern format that can be converted for the tool of your choice. If you use TruffleHog, GitLeaks, or other tools in your organization, Secrets-Patterns-DB can be exported to the format that your tool supports.
For TruffleHog v2
$> ./convert-rules.py ./db/rules.yml trufflehog
For GitLeaks
$> ./convert-rules.py ./db/rules.yml gitleaks
And then, you can use the output rules with your tool.
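To give a sense of what a conversion involves, here is a small sketch that turns a list of patterns into a JSON object mapping rule names to regexes, which is the shape I understand TruffleHog v2 to accept for custom rules. This is not the project's `convert-rules.py`: the `to_trufflehog_v2` function and the `name` and `regex` keys are hypothetical, and the target format is stated as an assumption.

```python
import json

def to_trufflehog_v2(patterns):
    """Render patterns as a JSON object of {rule name: regex}.

    Each entry is assumed to be a dict with hypothetical "name" and "regex" keys.
    """
    rules = {entry["name"]: entry["regex"] for entry in patterns}
    return json.dumps(rules, indent=2, sort_keys=True)

patterns = [
    {"name": "AWS Access Key ID", "regex": r"AKIA[0-9A-Z]{16}"},
    {"name": "GitHub Personal Access Token", "regex": r"ghp_[0-9A-Za-z]{36}"},
]
print(to_trufflehog_v2(patterns))
```

Keeping the database in one neutral format means each new tool only needs one more small exporter like this, rather than a separate hand-maintained rule set.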
This project is licensed under a Creative Commons license. If you’re building a tool or a product that uses Secrets-Patterns-DB, you should explicitly reference Secrets-Patterns-DB.