Charset detection with Python
I was looking for an easy way to determine a file's charset when I stumbled upon the Universal Encoding Detector (chardet). Just wanted to share it.
Installation:
$ wget http://chardet.feedparser.org/download/chardet-1.0.1.tgz -O - | tar xz
$ cd chardet-1.0.1
$ python ./setup.py build
$ sudo python ./setup.py install
Usage:
From a Python console (open the file in binary mode, since the detector works on raw bytes):
>>> import chardet
>>> chardet.detect(open('/path/to/your/file', 'rb').read())
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
Nice!
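Going one step further, the detected encoding can be used to decode the file. The following is only a minimal sketch, not part of chardet itself: the helper name `read_text` and its `detect`, `min_confidence`, and `fallback` parameters are my own assumptions for illustration. It relies only on `chardet.detect()` returning a dict with `encoding` and `confidence` keys, as shown in the console output above.

```python
def read_text(path, detect=None, min_confidence=0.5, fallback="utf-8"):
    """Read a file as bytes, guess its encoding, and return (text, encoding).

    `detect` is any callable with the same shape as chardet.detect();
    the threshold and fallback values are arbitrary illustrative defaults.
    """
    if detect is None:
        # Imported lazily so a stand-in detector can be passed in instead.
        import chardet
        detect = chardet.detect

    # Binary mode: the detector needs the raw, undecoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    # The detector may return no encoding (e.g. for an empty file)
    # or a weak guess; in both cases use the fallback encoding.
    if not encoding or confidence < min_confidence:
        encoding = fallback

    # errors="replace" avoids an exception if the guess turns out wrong.
    return raw.decode(encoding, errors="replace"), encoding
```

Calling `read_text('/path/to/your/file')` returns the decoded text together with the encoding that was actually used, so you can tell whether the fallback kicked in.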
Comments
Nice, thanks.
For now, I use "file -i", which does quite a good job too.
Any chance to see this in other languages?
php.net has a few contributions on the subject of charset detection (see http://us2.php.net/mb_detect_encodi...), but I don't know of anything that clever.