35 more Semgrep rules: infrastructure, supply chain, and Ruby

By Matt Schwager and Travis Peters

We are publishing another set of custom Semgrep rules, bringing our total number of public rules to 115. This blog post will briefly cover the new rules, then explore two Semgrep features in depth: regex mode (especially how it compares against generic mode), and HCL language support for technologies such as Terraform and Nomad. With these features, we can search for security vulnerabilities in more than just application code. This new release joins our existing collection of Semgrep rules, our public CodeQL queries, and our Testing Handbook as part of our long-term effort to share our technical expertise with the security community.

Semgrep is a large and capable tool with many nooks and crannies, and knowing them helps you get the most value out of static analysis. Like our previous Semgrep rules release post, this post will highlight some interesting Semgrep functionality. Publicly releasing rules is a great start, but we feel that we can do even better by explaining why the rules are written the way they are.

For this release, we focused on supply chain issues related to a lack of short-lived OIDC tokens in GitHub Actions; infrastructure concerns in Terraform code, Nomad jobs, and insecure database connections; and general application security concerns in Ruby code. Many of these Ruby rules were written during our recent Ruby Central (rubygems.org) audit. We will be publishing more information about this audit shortly.

Without further ado, here are our new rules:

Mode | Rule ID | Rule description
Ruby | action-dispatch-insecure-ssl | Found Rails application with insecure SSL setting.
Ruby | action-mailer-insecure-tls | Found ActionMailer SMTP configuration with insecure TLS setting. These settings do not require that a successful, encrypted, verified TLS connection is made. Set enable_starttls: true and openssl_verify_mode to verify peer.
Ruby | active-record-encrypts-misorder | Found an ActiveRecord value with encryption before serialization. The declaration of the serialized attribute should go before the encryption declaration.
Ruby | active-record-hardcoded-encryption-key | Found hard-coded ActiveRecord encryption key.
Ruby | global-timeout | Found Timeout::timeout (or timeout) use. Setting a global timeout can cause an exception to be raised anywhere in the passed block of code. This precludes any possible clean up action typically associated with rescuing from exceptions. This can lead to denial-of-service, data integrity failure, and general availability concerns. Instead prefer to use the library’s built in timeout functionality, if it has any, to ensure processing happens as expected. If it does not have built in timeout functionality, then consider implementing it.
Ruby | faraday-disable-verification | Found Faraday HTTP request disabling SSL/TLS verification.
Ruby | ruby-saml-skip-validation | SAML response validation disabled for $KEY.
Ruby | yaml-unsafe-load | Found YAML call to unsafe_load. This can lead to deserialization bugs and RCE.
Ruby | rails-cookie-attributes | Found Rails cookie set with insecure attribute.
Ruby | rails-cache-store-marshal | Found Rails cache store configured to allow Marshaling. As of Rails 7.1 the default serializer is :marshal_7_1. If an attacker can inject data into the cache store (SSRF, etc.), then they can achieve code execution when the object is later deserialized. Consider using the :message_pack serializer or a custom serializer.
Ruby | json-create-deserialization | Found json_create class method. This implies custom JSON deserialization is occurring. This can lead to RCE and other deserialization-type bugs. Usage should be audited and, at least, fuzzed.
Ruby | insecure-rails-cookie-session-store | Found Rails session cookie missing SameSite=Secure. As of Rails 7.2, session cookies default to SameSite=Lax.
Ruby | rest-client-disable-verification | Found RestClient HTTP request disabling SSL/TLS verification.
Regex | postgres-insecure-sslmode | Found PostgreSQL connection string disabling SSL verification.
Regex | mongodb-insecure-transport | Found insecure MongoDB connection, prefer TLS encrypted transport by setting the tls=true connection option and ensuring proper verification.
Regex | mysql-insecure-sslmode | Found MySQL connection string disabling SSL verification.
Generic | amqp-unencrypted-transport | Found unencrypted AMQP connection, prefer TLS encrypted amqps:// transport.
Generic | redis-unencrypted-transport | Found unencrypted Redis connection, prefer TLS encrypted rediss:// transport.
Generic | node-disable-certificate-validation | Setting this environment variable disables TLS certificate validation. This makes TLS, and HTTPS by extension, insecure. The use of this environment variable is strongly discouraged.
HCL | aws-oidc-role-policy-duplicate-condition | Found AWS role policy for GitHub Actions with duplicate condition. This overrides previous conditions, and the last condition with the duplicated key “wins.” This likely breaks access controls and allows unauthorized access.
HCL | aws-oidc-role-policy-missing-sub | Found AWS role policy for GitHub Actions missing OIDC subject. This means any GitHub repository can assume this role in CI.
HCL | vault-hardcoded-token | Found Terraform Vault instance with hard-coded token.
HCL | vault-skip-tls-verify | Found Terraform Vault instance with TLS verification disabled.
HCL | root-user | Found Nomad task using root user.
HCL | docker-hardcoded-password | Found Nomad task using Docker auth with hard-coded password.
HCL | docker-privileged-mode | Found Nomad task using Docker containers in privileged mode.
HCL | tls-hostname-verification-disabled | Found Nomad tls block with server hostname verification disabled.
HCL | podman-tls-verify-disabled | Found Nomad task using Podman with registry TLS verification disabled.
YAML | jfrog-hardcoded-credential | Found long-term access key. Instead prefer JFrog temporary OIDC security credentials.
YAML | aws-secret-key | Found long-term access key. Instead prefer AWS role assumption and temporary OIDC security credentials.
YAML | gcp-credentials-json | Found long-term access key. Instead prefer GCP workload identity federation and temporary OIDC security credentials.
YAML | rubygems-publish-key | Found long-term access key. Instead prefer RubyGems trusted publishing and temporary OIDC security credentials.
YAML | vault-token | Found long-term access key. Instead prefer Vault role assumption and temporary OIDC security credentials.
YAML | pypi-publish-password | Found long-term access key. Instead prefer PyPI trusted publishing and temporary OIDC security credentials.
YAML | azure-principal-secret | Found long-term access key. Instead prefer Azure subscription ID and temporary OIDC security credentials.

Semgrep isn’t just for programming languages

The first post in this series included perspectives on two lesser-known Semgrep features: generic mode and YAML support. This post introduces two additional considerations: regular expressions vs. generic mode and HashiCorp Configuration Language (HCL) support for infrastructure-as-code (IaC) security. We will continue the trend of bringing Semgrep to all forms of textual data.

Heuristics: Regular expressions vs. generic mode

Regular expression patterns are another lesser-known feature of Semgrep, available through the pattern-regex operator and the regex language. But why would you want to use regular expressions in Semgrep rules? Doesn’t that defeat the purpose of static analysis tools like Semgrep? Why not simply use ripgrep or classic grep? Doesn’t generic mode obviate the need for regex mode?

The following heuristics will help you understand when to use regex mode. The more of the questions below you answer “yes” to, the more likely it is that regex mode is the right choice.

Heuristic #1: Does the text you are looking for generally span a single line of code?

Dealing with multi-line whitespace in a regular expression is a pain. If you find yourself searching for multi-line patterns, and language-specific rules aren’t possible, then you will probably be best served by generic mode. So remember: when using regex mode, the text you’re searching for will almost always span a single line.
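To make the multi-line pain concrete, here is a minimal sketch using Python's re module rather than Semgrep itself. The connect(...) call and its verify=False argument are hypothetical and exist only to show how a pattern that works on one line stops matching once the same call is wrapped across lines.

```python
import re

# Hypothetical samples: the same call written on one line and wrapped across several.
ONE_LINE = "client = connect(host, verify=False)"
MULTI_LINE = "client = connect(\n    host,\n    verify=False,\n)"

# Single-line regex: "." does not match a newline by default,
# so this only works when the whole call sits on one line.
SINGLE_LINE_RE = re.compile(r"connect\(.*verify=False")

# Multi-line variant: the negated class [^)] does match newlines,
# and stopping at ")" keeps the match from running into a later call.
MULTI_LINE_RE = re.compile(r"connect\([^)]*verify=False")

print(bool(SINGLE_LINE_RE.search(ONE_LINE)))    # matches
print(bool(SINGLE_LINE_RE.search(MULTI_LINE)))  # misses the wrapped call
print(bool(MULTI_LINE_RE.search(MULTI_LINE)))   # matches, at the cost of a fussier regex
```

Even this small case needs a deliberate choice about where the match may stop, which is the kind of bookkeeping that generic mode's ellipsis operator is meant to take off your hands.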

Heuristic #2: Does this pattern exist in many languages or types of text files?

The beauty of Semgrep is that it’s a one-stop shop for all things textual analysis. If the text you are searching for may exist in many languages, then it may be a good fit for regex mode. For example, consider URL parameters. If you’re searching for insecure values of, say, the sslmode parameter, then the following regular expression would be a good start: [?&]sslmode=(disable|allow|prefer). This is great because it will find this insecure URL parameter in any connection URI passed to any PostgreSQL library in any language. We don’t have to write separate rules for separate libraries and languages. It will also find this pattern in shell scripts, documentation, CI jobs, and more.
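As a rough illustration of that language independence, the sketch below applies the same regular expression with Python's re module to a handful of snippets. The file names and connection strings are made up for the example, and Python's regex engine stands in for Semgrep's regex mode here.

```python
import re

# The regex from the text: an insecure sslmode value preceded by "?" or "&".
INSECURE_SSLMODE = re.compile(r"[?&]sslmode=(disable|allow|prefer)")

# Hypothetical snippets from different kinds of files; none come from a real project.
SAMPLES = {
    "app.py": 'engine = create_engine("postgresql://app@db:5432/prod?sslmode=disable")',
    "main.go": 'db, err := sql.Open("postgres", "postgres://app@db/prod?connect_timeout=5&sslmode=prefer")',
    "deploy.sh": "psql 'postgresql://app@db/prod?sslmode=allow' -c 'select 1'",
    "README.md": "Connect with postgresql://app@db/prod?sslmode=verify-full",
}


def find_insecure(samples):
    """Return {file name: matched sslmode value} for every sample that matches."""
    hits = {}
    for name, text in samples.items():
        match = INSECURE_SSLMODE.search(text)
        if match:
            hits[name] = match.group(1)
    return hits


HITS = find_insecure(SAMPLES)
for name, value in HITS.items():
    print(f"{name}: sslmode={value}")
```

The Python, Go, and shell snippets are all flagged by one pattern, while the README line using sslmode=verify-full is left alone.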

Heuristic #3: Do you want to share your regular expressions with others?

Again, the beauty of Semgrep is that it consolidates the functionality of tools like ripgrep or classic grep under a single tool. ripgrep can be useful when you’re quickly iterating on regular expressions and searching through your code for patterns, but Semgrep rules really shine once it comes time to codify, test, and publish a regex. Your regex findings will exist next to your Python and Kubernetes findings, and you can track all of your findings and manage rules from a single location.

Heuristic #4: Do you need to match specific characters or character classes?

Regex mode and generic mode often serve similar needs. Our previous post discussed the advantages of generic mode, so when should you use regex mode? Regex mode is preferred over generic mode when you would like to match specific characters or character classes, or use other regular expression functionality such as alternation. For example, in the sslmode regular expression above, we search for sslmode prefixed by a character class containing ? and &. These two prefixes give us additional confidence that what we find will in fact be a URL parameter. As far as we know, there is not an easy way to express this in generic mode. We could always use pattern-either, but this can get quite verbose for more complex expressions. On the other hand, generic mode’s primary advantage is that it supports the ellipsis operator (i.e., ...), which allows easily skipping non-matching elements and whitespace used in multi-line patterns.
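The effect of that character class can be sketched with Python's re module (again standing in for Semgrep's regex mode). The three sample lines are invented; the point is only that the [?&] prefix separates real URL parameters from an incidental mention of the same text.

```python
import re

# Without the prefix, any occurrence of the text matches.
BARE = re.compile(r"sslmode=disable")
# With the prefix, the text must follow "?" or "&", as a URL parameter would.
PREFIXED = re.compile(r"[?&]sslmode=disable")

# Hypothetical lines: two URL parameters and one prose mention in a comment.
LINES = [
    'url = "postgres://db/prod?sslmode=disable"',
    'url = "postgres://db/prod?connect_timeout=5&sslmode=disable"',
    "# TODO: never ship with sslmode=disable",
]

BARE_HITS = [bool(BARE.search(line)) for line in LINES]
PREFIXED_HITS = [bool(PREFIXED.search(line)) for line in LINES]

print(BARE_HITS)      # the bare pattern flags all three lines
print(PREFIXED_HITS)  # the prefixed pattern skips the comment
```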

As you can see, there are often multiple ways to approach searching for specific code patterns in Semgrep. The heuristics above provide a good baseline for when you may want to use regex mode. The more important consideration is that regex mode exists, and it’s a valuable tool in your toolbelt when searching through textual data.

HCL support and IaC security

Infrastructure as Code (IaC) has transformed cloud management. It brings faster deployments, improved consistency and repeatability, and better security by putting environments that previously relied on manual configuration under version control. By codifying infrastructure, organizations can seamlessly integrate these definitions with CI/CD pipelines, thus enabling automated testing, deployment, and static analysis.

HashiCorp Configuration Language (HCL) is foundational to many IaC tools, including Terraform, Nomad, and Consul. Recognizing the increasing importance of IaC, Semgrep introduced HCL support back in 2021. With dedicated HCL support, Semgrep now allows for a unified approach, bringing the same level of scrutiny to both application code and infrastructure configurations, ensuring they work together harmoniously within CI/CD pipelines.

We’ve learned that even the most straightforward Semgrep rules can uncover significant issues that continue to pose risks in 2024. Take, for example, the common practice of disabling TLS verification during development. If this configuration is inadvertently deployed, it could expose sensitive data. Here’s how easy it is to detect such vulnerabilities in Vault infrastructure with Semgrep:

rules:
  - id: vault-skip-tls-verify
    message: |
      Found Terraform Vault instance with TLS verification disabled
    languages: [hcl]
    severity: WARNING
    patterns:
      - pattern-inside: provider "vault" { ... }
      - pattern: skip_tls_verify = true

Figure 1: Semgrep rule searching for disabled TLS verification (hcl/terraform/vault-skip-tls-verify.yaml)

Another frequent misstep is hard-coding credentials—a security risk that Semgrep can easily catch:

rules:
  - id: vault-hardcoded-token
    message: |
      Found Terraform Vault instance with hardcoded token
    languages: [hcl]
    severity: WARNING
    patterns:
      - pattern-inside: provider "vault" { ... }
      - pattern: token = "..."

Figure 2: Semgrep rule searching for hard-coded Vault tokens (hcl/terraform/vault-hardcoded-token.yaml)

By coupling these rules with CI/CD pipelines configured to block PRs with unresolved Semgrep findings (one of our recommended practices), you can easily keep these issues out of production infrastructure.
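The gating decision itself is simple. The following is a minimal sketch, assuming a report shaped like Semgrep's JSON output (a top-level results list whose entries carry check_id, path, start.line, and extra.severity). The choice of blocking severities, the function names, and the sample report are all assumptions for illustration, not part of our rules or of Semgrep.

```python
import json

# Severities that should block a merge; this policy is an assumption.
BLOCKING = ("ERROR", "WARNING")


def blocking_findings(report_text, severities=BLOCKING):
    """Return (rule id, path, line) for each finding with a blocking severity."""
    report = json.loads(report_text)
    return [
        (result["check_id"], result["path"], result["start"]["line"])
        for result in report.get("results", [])
        if result.get("extra", {}).get("severity") in severities
    ]


def exit_code(report_text):
    """Return 1 when any blocking finding remains, so the CI job fails."""
    return 1 if blocking_findings(report_text) else 0


# Hypothetical report with a single finding, plus an empty one for comparison.
SAMPLE_REPORT = json.dumps({
    "results": [{
        "check_id": "vault-skip-tls-verify",
        "path": "infra/main.tf",
        "start": {"line": 12},
        "extra": {"severity": "WARNING"},
    }],
    "errors": [],
})
CLEAN_REPORT = json.dumps({"results": [], "errors": []})

print(blocking_findings(SAMPLE_REPORT))
print(exit_code(SAMPLE_REPORT), exit_code(CLEAN_REPORT))
```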

HCL’s structured nature also makes it particularly effective for detecting more complex patterns and ensuring that we keep false positives as low as possible. For instance, consider the following rule that identifies AWS role policies for GitHub Actions that are missing the OIDC subject—a critical misconfiguration that could allow any GitHub repository to assume the role in CI:

rules:
  - id: aws-oidc-role-policy-missing-sub
    message: |
      Found AWS role policy for GitHub Actions missing OIDC subject. This
      means any GitHub repository can assume this role in CI.
    languages: [hcl]
    severity: WARNING
    patterns:
      - pattern-inside: |
          {
            ...
            Statement = [...]
            ...
          }
      - pattern-inside: |
          {
            ...,
            "Action": "sts:AssumeRoleWithWebIdentity",
            ...
          }
      - pattern: |
          {
            ...
            "Condition": {
                ...
                "StringEquals": {
                    ...
                    "token.actions.githubusercontent.com:aud": ...,
                    ...
                }
                ...
            }
            ...
          }
      - pattern-not: |
          {
            ...
            "Condition": {
                ...
                "StringEquals": {
                    ...
                    "token.actions.githubusercontent.com:sub": ...,
                    ...
                    "token.actions.githubusercontent.com:aud": ...,
                    ...
                }
                ...
            }
            ...
          }
      # Remaining pattern-nots truncated to save space

Figure 3: Semgrep rule searching for missing OIDC subjects (hcl/terraform/aws-oidc-role-policy-missing-sub.yaml)

Role policies for GitHub Actions can be configured in many ways, and we can use pattern-inside and pattern-not to properly contextualize the pattern we are looking for (i.e., instances where the subject is not defined). This rule is a powerful example of how Semgrep can help enforce security policies and prevent configuration errors that could lead to serious vulnerabilities.
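The logic the rule encodes can be restated in a few lines of Python. This is only a sketch of the check over a policy statement represented as a plain dictionary, not how Semgrep matches HCL. The function name and sample statements are hypothetical, and because the remaining pattern-nots are truncated in Figure 3, treating a subject under any condition operator as acceptable is an assumption.

```python
AUD = "token.actions.githubusercontent.com:aud"
SUB = "token.actions.githubusercontent.com:sub"


def is_missing_sub(statement):
    """True when a web-identity statement pins the audience but never the subject."""
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    if "sts:AssumeRoleWithWebIdentity" not in actions:
        return False  # not the kind of statement the rule is scoped to
    conditions = statement.get("Condition", {})
    if AUD not in conditions.get("StringEquals", {}):
        return False  # mirrors the positive pattern: aud under StringEquals
    # Assumption: a subject under any operator (StringEquals, StringLike, ...) counts.
    return not any(SUB in operands for operands in conditions.values())


# Hypothetical statements; the organization and repository names are placeholders.
AUD_ONLY = {
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {"StringEquals": {AUD: "sts.amazonaws.com"}},
}
AUD_AND_SUB = {
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
        "StringEquals": {
            AUD: "sts.amazonaws.com",
            SUB: "repo:example-org/example-repo:ref:refs/heads/main",
        }
    },
}
SUB_VIA_LIKE = {
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
        "StringEquals": {AUD: "sts.amazonaws.com"},
        "StringLike": {SUB: "repo:example-org/*"},
    },
}

print(is_missing_sub(AUD_ONLY), is_missing_sub(AUD_AND_SUB), is_missing_sub(SUB_VIA_LIKE))
```

Only the first statement is flagged: it restricts the audience but lets a token from any repository through.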

Text is the universal interface

If text is the universal interface, then Semgrep can help secure arbitrary interfaces, from bytes and strings to IaC, YAML, and more. Combining the power of Semgrep with regular expressions, generic mode, YAML, and IaC support allows us to go beyond just code in programming languages. As the industry moves everything toward “as-code” solutions, we need to be able to apply scalable tooling to domains like supply chain, CI/CD, and IaC.

With IaC, you can apply the same rigor of static analysis to your infrastructure as you do to your application code, catching issues early and avoiding costly mistakes in production (“shifting left,” as it were). Manual audits and dynamic scans against production environments are slow and do not scale well. We encourage you to try out our newly released Terraform and Nomad rules, explore Semgrep’s Terraform rules, and consider incorporating them into your projects. To our knowledge, these are the first open-source Semgrep rules targeting Nomad, a fact we’re excited to share with the community in the hope of inspiring others to build upon them.

If you’d like to read more about our work on Semgrep, we have used its capabilities in several ways, such as securing machine learning pipelines, discovering goroutine leaks, and securing Apollo GraphQL servers.

Contact us if you’re interested in custom Semgrep rules for your project!

Article Link: 35 more Semgrep rules: infrastructure, supply chain, and Ruby | Trail of Bits Blog