Character Normalization Attacks on the Web

Introduction

Character normalization attacks are among the subtler threats in web security. They exploit differences in how systems encode, represent, and compare text, and they can open vulnerabilities in otherwise careful web applications. This blog post explains what character normalization attacks are and how they operate, and gives examples of the damage they can cause.

What is Character Normalization?

Character normalization refers to the process of converting text to a canonical (standard) form. This process is crucial because different code point sequences can look similar or even identical on screen while having different underlying representations. For instance, the character “é” can be represented as a single Unicode code point (U+00E9) or as a sequence of two: “e” (U+0065) followed by the combining acute accent (U+0301).
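The two representations of “é” can be compared directly in Python. This is a minimal sketch using the standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)  # True True
```

Any comparison that skips the normalize step treats the two spellings as different strings.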

Normalization aims to standardize these representations, but malicious actors can exploit inconsistencies in how different systems handle normalization.

Types of Character Normalization Attacks

  1. Homoglyph Attacks: These attacks exploit characters that look alike but have different Unicode representations. For example, the Latin letter “a” (U+0061) and the Cyrillic letter “а” (U+0430) are visually similar but are different characters. Unicode normalization does not map one onto the other, so normalizing input does not remove this risk by itself.

  2. Combining Character Attacks: Combining characters attach to the character before them, as with vowel signs in Devanagari or diacritics in Latin scripts. Attackers can insert combining characters that are invisible or easy to overlook to bypass filters or to make strings that look alike compare as different.

  3. Normalization Form Exploitation: There are different Unicode normalization forms (NFC, NFD, NFKC, NFKD). Inconsistent handling of these forms across systems can lead to security bypasses. For example, a system that checks input in one form but processes it in another might inadvertently allow harmful input.
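To see how the four forms differ, the sketch below normalizes one sample string each way and prints the resulting code points. The sample string is an assumption chosen for illustration: a ligature (which only the compatibility forms NFKC and NFKD rewrite) followed by a precomposed “é”.

```python
import unicodedata

# LATIN SMALL LIGATURE FI (U+FB01) followed by precomposed é (U+00E9)
sample = "\ufb01\u00e9"

forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    # Print code points so invisible differences become visible.
    print(form, " ".join(f"U+{ord(c):04X}" for c in value))
```

NFC and NFD keep the ligature, while NFKC and NFKD replace it with the plain letters “f” and “i”. A check that runs before a compatibility normalization therefore sees different text from the code that runs after it.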

Examples of Character Normalization Attacks

Homoglyph Attack Example

An attacker could register a domain name using homoglyphs to deceive users. For example:

  • Legitimate domain: example.com
  • Malicious domain: еxample.com (the first letter is the Cyrillic “е”, U+0435, not the Latin “e”, U+0065)

Users might not notice the difference and could fall victim to phishing or other malicious activities.
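One rough way to flag such names is to look for labels that mix scripts. The sketch below is a heuristic only: it guesses the script from the first word of each letter's Unicode name, which is an assumption made for brevity and not the real Unicode Script property. The function name scripts_in is hypothetical.

```python
import unicodedata

def scripts_in(label: str) -> set[str]:
    """Rough script guess: first word of each letter's Unicode name."""
    found = set()
    for ch in label:
        if ch.isalpha():
            found.add(unicodedata.name(ch).split()[0])
    return found

legit = "example"
spoof = "\u0435xample"   # Cyrillic е (U+0435) followed by Latin "xample"

print(legit == spoof)            # False, although they look alike
print(scripts_in(legit))         # {'LATIN'}
print(sorted(scripts_in(spoof))) # ['CYRILLIC', 'LATIN']

# Normalization leaves the homoglyph untouched.
print(unicodedata.normalize("NFKC", spoof) == spoof)  # True
```

A label that reports more than one script is not always malicious, but it is a reasonable trigger for closer inspection.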

Combining Characters Attack Example

Consider a web application that filters out certain keywords to prevent SQL injection. An attacker could use combining characters to bypass this filter:

  • Keyword to filter: SELECT
  • Malicious input: S\u0065\u0301LECT

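In this case, the input renders as “SéLECT”: it is composed of “S”, “e” (U+0065), a combining acute accent (U+0301), and “LECT”. A simple keyword filter looking for “SELECT” does not match it. The bypass becomes dangerous if a later stage strips the accent, for example by folding text to ASCII, because the result, “SeLECT”, is a valid keyword in case-insensitive SQL.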
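The sketch below reproduces this bypass under two assumptions that are not from any real application: a naive substring filter (naive_filter_blocks) and a later processing step that strips combining marks (strip_marks). Both functions are hypothetical.

```python
import unicodedata

BLOCKED = "SELECT"

def naive_filter_blocks(text: str) -> bool:
    """Hypothetical filter: case-insensitive substring match on the raw text."""
    return BLOCKED in text.upper()

def strip_marks(text: str) -> str:
    """Hypothetical later stage: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

payload = "S\u0065\u0301LECT"   # S, e, combining acute accent, LECT

print(naive_filter_blocks(payload))               # False: the filter misses it
print(strip_marks(payload).upper())               # SELECT
print(naive_filter_blocks(strip_marks(payload)))  # True: caught after folding
```

Running the filter on the same folded text that the rest of the pipeline uses closes this particular gap. Parameterized queries remain the primary defense against SQL injection.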

Normalization Form Exploitation Example

A system might normalize user input before storing it but fail to normalize it consistently when comparing or processing it later. This discrepancy can be exploited:

  • Input normalization at registration: Normalizes to NFC
  • Input comparison during login: Uses raw input without normalization

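An attacker could register a username that looks like an existing one but uses a different code point sequence, for example the decomposed form of an accented name. If any check, such as the uniqueness test at registration, runs on the raw input while the stored value is normalized, the two names collapse to the same string, and the attacker may end up sharing or controlling the existing account.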
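A minimal sketch of how such a collision can arise, assuming a hypothetical register function that checks uniqueness on the raw string but stores the NFC form. The in-memory accounts dictionary stands in for a real user store, and plaintext passwords are used only to keep the example short.

```python
import unicodedata

accounts = {}  # stored username -> password (illustration only)

def register(raw_name: str, password: str) -> bool:
    # Flaw: the uniqueness check looks at the raw string...
    if raw_name in accounts:
        return False
    # ...but the stored key is the NFC-normalized string.
    accounts[unicodedata.normalize("NFC", raw_name)] = password
    return True

victim_ok = register("jos\u00e9", "victim-pw")        # precomposed é
attacker_ok = register("jose\u0301", "attacker-pw")   # e + combining accent

print(victim_ok, attacker_ok)    # True True
print(len(accounts))             # 1: both names collapsed to one key
print(accounts["jos\u00e9"])     # attacker-pw
```

Normalizing before the uniqueness check, so that the check and the storage see the same string, would make the second registration fail.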

Potential Damages

Character normalization attacks can lead to a variety of severe consequences, including:

  • Phishing: Users can be tricked into visiting malicious websites that appear legitimate.
  • Security Bypass: Filters and validation mechanisms can be bypassed, leading to injection attacks (e.g., SQL injection, XSS).
  • Data Integrity Issues: Inconsistent handling of normalized text can create duplicate or colliding records and lookups that fail for legitimate users, which in turn can lead to data breaches or loss.

Prevention and Mitigation

To protect against character normalization attacks, consider the following measures:

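  1. Consistent Normalization: Normalize text to a single form (for example NFC) with a trusted library as soon as it enters the system, before validation, and use the same form wherever strings are stored or compared.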
  2. Robust Filtering: Implement multi-layered filtering and validation mechanisms that account for different character representations.
  3. User Awareness: Educate users about the risks of homoglyph domains and encourage them to verify URLs carefully.
  4. Security Audits: Regularly audit code and systems for vulnerabilities related to text processing and normalization.
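The first two measures can be combined into one canonicalization step, sketched below. The choice of NFKC plus case folding, the allowlist pattern, and the name canonical_username are all assumptions for illustration. An application that must accept non-ASCII usernames would need a different policy.

```python
import re
import unicodedata

# Assumed policy: lowercase ASCII letters, digits, underscore, 3 to 32 chars.
USERNAME_RE = re.compile(r"[a-z0-9_]{3,32}")

def canonical_username(raw: str) -> str:
    """Normalize first, then validate the normalized value."""
    canon = unicodedata.normalize("NFKC", raw).casefold()
    if not USERNAME_RE.fullmatch(canon):
        raise ValueError("username contains unsupported characters")
    return canon

print(canonical_username("Admin"))                           # admin
print(canonical_username("\uff21\uff24\uff2d\uff29\uff2e"))  # admin (fullwidth input)

try:
    canonical_username("\u0430dmin")   # Cyrillic а, then "dmin"
except ValueError as exc:
    print("rejected:", exc)
```

Because validation runs on the normalized value, the stored name and the checked name are always the same string, and homoglyphs that survive normalization are rejected by the allowlist.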

Capture The Flag (CTF) Challenge: Exploiting Character Normalization

For those interested in practical applications and learning through challenges, here’s a CTF-style challenge to explore character normalization vulnerabilities.

Challenge Description

You have been provided with access to a web application that contains a user registration and login system. The challenge is to gain unauthorized access by exploiting character normalization issues.

Steps to Solve

  1. Analyze the Login System: Register an account and inspect the login mechanism. Try different character representations for your username and password.
  2. Test for Homoglyphs: Register a user with a name containing homoglyphs and attempt to log in using the visually similar characters.
  3. Combining Characters: Attempt to register and log in using combining characters that might bypass input validation.
  4. Normalization Forms: Experiment with different Unicode normalization forms to see if the system processes them inconsistently.

Example Payloads

  • Homoglyph Test: Register as usernаme (using Cyrillic ‘а’) and attempt to log in as username.
  • Combining Characters Test: Register as us\u0065\u0301rname (which renders as “usérname”) and try to log in using username, to see whether the system strips combining marks.
  • Normalization Form Test: Register with a username in NFD form and attempt to log in with the NFC form of the username.
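When preparing payloads like these, it helps to generate every normalization form of a candidate name and inspect the code points. The helper names below (variants, code_points) and the sample name are hypothetical.

```python
import unicodedata

def variants(name: str) -> dict[str, str]:
    """Return the name in each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, name)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

def code_points(text: str) -> str:
    """Render a string as U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

for form, value in variants("us\u00e9r").items():
    print(form, code_points(value))
```

Sending each variant at registration and at login shows which code paths normalize and which compare raw input.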

Submission

Submit the username and password used to exploit the vulnerability along with an explanation of the steps taken to discover and exploit the normalization issue.

Vulnerable Libraries

The normalization APIs below are widely used. The vulnerabilities described in this post usually come from applications applying them inconsistently, not from defects in the libraries themselves:

1. Python’s unicodedata Module

The unicodedata module in Python provides access to the Unicode Character Database and to normalization functions such as unicodedata.normalize. Improper use of this module can lead to vulnerabilities if normalization isn’t consistently applied.
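As a small illustration, unicodedata.is_normalized (available from Python 3.8) can be used to reject input that is not already in the expected form instead of silently rewriting it. The require_nfc wrapper is a hypothetical helper, and rejecting instead of normalizing is a design choice, not a requirement.

```python
import unicodedata

def require_nfc(text: str) -> str:
    """Hypothetical guard: accept only text that is already in NFC."""
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("input is not NFC-normalized")
    return text

print(unicodedata.is_normalized("NFC", "\u00e9"))    # True
print(unicodedata.is_normalized("NFC", "e\u0301"))   # False

try:
    require_nfc("e\u0301")
except ValueError as exc:
    print("rejected:", exc)
```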

2. Java’s Normalizer Class

Java’s java.text.Normalizer class converts Unicode text into normalized forms. Misuse in applications relying on this class, such as normalizing after validation instead of before, can introduce security risks.

3. JavaScript Libraries

JavaScript has String.prototype.normalize built in, and older code may rely on libraries like unorm for the same purpose. Either can be applied inconsistently, leading to similar vulnerabilities.

Example Code Vulnerability

Here’s a simple example in Python demonstrating how inconsistent normalization can lead to vulnerabilities:

import unicodedata

# Normalizing user input
def normalize_input(input_str):
    return unicodedata.normalize('NFC', input_str)

# Registration (normalized input)
username = normalize_input(input("Enter username: "))
# Simulating storing the username
stored_username = username

# Login (non-normalized input)
login_username = input("Enter username to login: ")

# Authentication check
if stored_username == login_username:
    print("Login successful!")
else:
    print("Login failed.")

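In this example, registration stores the NFC form, but login compares against the raw input. If a user types a username with a decomposed character (for example “e” followed by U+0301), the stored value is the composed form, so the same keystrokes at login no longer match and the login fails. The same gap, where one code path normalizes and another does not, is what lets attackers slip input past checks elsewhere.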
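A corrected sketch normalizes on both paths, so registration and login always compare the same form. The register and login functions and the in-memory set are hypothetical stand-ins for a real user store, and password handling is omitted.

```python
import unicodedata

def normalize_input(input_str: str) -> str:
    return unicodedata.normalize("NFC", input_str)

def register(users: set, raw_username: str) -> None:
    # Store the normalized form.
    users.add(normalize_input(raw_username))

def login(users: set, raw_username: str) -> bool:
    # Normalize again before comparing, using the same form as register().
    return normalize_input(raw_username) in users

users = set()
register(users, "jose\u0301")          # decomposed input

print(login(users, "jose\u0301"))      # True: same keystrokes now match
print(login(users, "jos\u00e9"))       # True: composed form matches too
print(login(users, "jose"))            # False: a different name
```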

Conclusion

Character normalization attacks are a sophisticated threat in the web security landscape. By understanding the mechanics of these attacks and implementing robust defenses, developers and security professionals can mitigate the risks and protect users from potential harm. Always stay vigilant and keep your systems up-to-date to guard against these and other emerging threats.