charset-normalizer
The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
This package has a good security score with no known vulnerabilities.
Community Reviews
Straightforward charset detection with minimal learning curve
The error messages are clear when you pass invalid input, and debugging is rarely needed since the library does what it says on the tin. The documentation is concise but sufficient - you can get up and running in minutes. The result object provides helpful attributes like `encoding`, a `best()` method for retrieving the top match (whose string form is the decoded content), and confidence scores if you need them.
One minor frustration is that the API could be more Pythonic in some areas (the result object's interface feels a bit non-standard), and GitHub issues sometimes take a while to get responses. That said, the library is stable enough that you rarely need support - it just handles the common case of "what encoding is this file?" reliably.
Best for: Projects that need reliable automatic encoding detection for user-uploaded files or scraped content with unknown encodings.
Avoid if: You need guaranteed real-time support or have very specific edge case encoding requirements that need extensive customization.
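The workflow this review describes can be sketched in a few lines. This is a minimal example, assuming `charset-normalizer` is installed (`pip install charset-normalizer`); the sample text is arbitrary:

```python
from charset_normalizer import from_bytes

# Sample bytes whose encoding we pretend not to know.
payload = "Ceci est un exemple de texte en français : é, è, à, ç.".encode("utf-8")

results = from_bytes(payload)  # returns a container of candidate matches
best = results.best()          # the most likely match, or None if nothing fit

if best is not None:
    print(best.encoding)  # normalized codec name, e.g. "utf_8"
    print(str(best))      # the payload decoded with that codec
```

Note that `best()` can return `None` when no candidate clears the threshold, so guarding against it as above is the safe pattern.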
Drop-in chardet replacement with better accuracy and minimal learning curve
Error messages are straightforward when you pass invalid inputs, and the library handles edge cases gracefully - it won't crash on malformed data like some alternatives. The documentation is lean but sufficient; the GitHub README has clear examples that cover 90% of use cases. I rarely need to look beyond the basic usage pattern.
One gotcha: the `from_fp()` function expects file pointers to be seekable, which bit me when working with streams. The library is also heavier than chardet due to its more sophisticated detection algorithms, but the accuracy gains are worth it. When debugging detection issues, the library provides confidence scores that help you understand why it made certain choices.
Best for: Projects needing reliable character encoding detection for user-uploaded files, web scraping, or text processing with unknown encodings.
Avoid if: You need the absolute smallest dependency footprint and chardet's accuracy is sufficient for your use case.
Solid charset detection with minimal security surface area
From a security standpoint, the library has minimal attack surface. It's a pure detection tool with no network calls, file system access, or complex dependencies. Input validation is implicit: it accepts bytes and fails softly with low confidence scores rather than crashing. I've thrown maliciously crafted payloads at it during fuzzing and it handles them without resource exhaustion or crashes. The error handling is clean, never exposing stack traces or internal state in production scenarios.
The main limitation is performance on very large inputs; it can be CPU-intensive on multi-megabyte files. For production systems processing untrusted input, I recommend setting size limits upstream. The library follows secure-by-default principles well: no auto-decoding that could introduce injection vulnerabilities, and it doesn't make assumptions about file origins.
Best for: Applications processing untrusted file uploads or external data where charset detection is needed with minimal security risk.
Avoid if: You need real-time detection on massive data streams without implementing your own chunking and timeout logic.
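The upstream size limit recommended above can be enforced before the bytes ever reach the detector. A sketch, assuming `charset-normalizer` is installed; `MAX_SAMPLE_BYTES` and `detect_capped` are arbitrary names chosen here, not library settings:

```python
from charset_normalizer import from_bytes

# Cap chosen for illustration: detect on at most the first 1 MiB.
MAX_SAMPLE_BYTES = 1 * 1024 * 1024

def detect_capped(path):
    """Run detection on a bounded prefix of the file, so an untrusted
    multi-gigabyte upload cannot pin the CPU or exhaust memory."""
    with open(path, "rb") as fh:
        sample = fh.read(MAX_SAMPLE_BYTES)  # read() never returns more than the cap
    return from_bytes(sample).best()
```

A prefix is usually representative of the whole file's encoding, but mixed-encoding files are the edge case where this shortcut can mislead.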