mirror of https://github.com/django/django.git synced 2025-07-04 01:39:20 +00:00

unicode: Added a new document describing how wonderful our unicode support is

and documenting some of the unicode-specific features.


git-svn-id: http://code.djangoproject.com/svn/django/branches/unicode@5330 bcc190cf-cafb-0310-a4f2-bffc1f526a37
Malcolm Tredinnick 2007-05-24 09:15:31 +00:00
parent dceac0a384
commit 9ce95c6775

docs/unicode.txt Normal file

@ -0,0 +1,328 @@
======================
Unicode data in Django
======================
**New in Django development version**
Django natively supports Unicode data everywhere. Provided your database can
store the data, you can safely pass Unicode strings around to templates,
models and the database.

This file describes some things to be aware of if you are writing applications
that use data beyond plain ASCII.
Creating the database
=====================
Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- there will be
some characters that you cannot store in the database and information will be
lost.
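The loss is easy to reproduce outside any database. The following is a minimal
sketch in present-day Python 3 (where ``str`` is the Unicode string type and
``bytes`` is the bytestring type); the sample text is made up for illustration
and no database is involved:

```python
# Illustrative sample text: "é" exists in latin1, the two Chinese characters do not.
text = "Orléans and 北京"

# UTF-8 can represent every Unicode character, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes.decode("utf-8") == text

# latin1 has no byte values for the Chinese characters, so strict encoding fails.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Forcing the conversion substitutes "?" for whatever cannot be stored.
lossy = text.encode("latin-1", errors="replace").decode("latin-1")
print(strict_failed, lossy)
```

A database with a restrictive encoding faces the same choice: reject the data
or silently replace part of it.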

* For MySQL users, refer to the `MySQL manual`_ (section 10.3.2 for MySQL 5.1)
  for details on how to set or alter the database character set encoding.

* For PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in
  PostgreSQL 8) for details on creating databases with the correct encoding.

* For SQLite users, there is nothing you need to do. SQLite always uses UTF-8
  for internal encoding.

.. _MySQL manual: http://www.mysql.org/doc/refman/5.1/en/charset-database.html
.. _PostgreSQL manual: http://www.postgresql.org/docs/8.2/static/multibyte.html#AEN24104
All of Django's database backends automatically convert Unicode strings into
the appropriate encoding for talking to the database. They also automatically
convert strings retrieved from the database into Python Unicode strings. You
don't even need to tell Django what encoding your database uses: that is
handled transparently.
General string handling
=======================
Whenever you use strings with Django, you have two choices. You can use Unicode
strings or you can use normal strings (sometimes called bytestrings) that are
encoded using UTF-8.
.. warning::
    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that
    all bytestrings are in UTF-8.

    If you pass a string to Django that has been encoded in some other format,
    things will go wrong in interesting ways. Usually, Django will raise a
    ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, you are quite safe to simply use your normal
strings (since ASCII is a subset of UTF-8) and pass them around at will.
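A short sketch of both cases, written for present-day Python 3 rather than the
Python 2 of this document (``bytes`` plays the role of the bytestring). It
only imitates the UTF-8 assumption with a plain ``decode()`` call and does not
use Django:

```python
# ASCII-only bytes are also valid UTF-8, so decoding them always succeeds.
ascii_bytes = b"plain ASCII"
ascii_text = ascii_bytes.decode("utf-8")

# The same word encoded as latin1 is NOT valid UTF-8: the byte 0xE9 ("é" in
# latin1) starts a multi-byte sequence that the following byte does not complete.
latin1_bytes = "Orléans".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(ascii_text, decode_failed)
```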
Do not be fooled into thinking that if your ``DEFAULT_CHARSET`` setting is set
to something other than ``utf-8`` you can use that encoding in your
bytestrings! The ``DEFAULT_CHARSET`` only applies to the strings generated as
the result of template rendering (and email). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
``DEFAULT_CHARSET`` setting is not actually under your control (if you are the
application developer). It is under the control of the person installing and
using your application and if they choose a different setting, your code must
still continue to work. Ergo, it cannot rely on that setting.
In most cases when Django is dealing with strings, it will convert them to
Unicode strings before doing anything else. So if you pass in a bytestring, be
prepared to receive a Unicode string back in the result.
.. _lazy translation:
Translated strings
------------------
There is actually a third type of string-like object you may encounter when
using Django. If you are using the internationalization features of Django,
there is the concept of a "lazy translation". This is a string that has been
marked as translated, but the actual result is not determined until the object
is used in a string. This is useful because the locale that should be used for
the translation will not be known until the string is used, even though the
string might have originally been created when the code was first imported.
Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``unicode()`` with the translation as the argument will generate a
string in the current locale.
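The idea of deferring the lookup can be sketched in a few lines. The class
below is a hypothetical illustration in present-day Python 3, not Django's
``__proxy__`` implementation; ``LazyText``, ``CATALOGS`` and ``state`` are
names invented for this example:

```python
# Hypothetical translation catalogs and "current language" holder.
CATALOGS = {"de": {"Hello": "Hallo"}}
state = {"language": "en"}


class LazyText:
    """Remembers the message and looks up the translation only when rendered."""

    def __init__(self, message):
        self.message = message

    def __str__(self):
        # The active language is read here, at conversion time, not in __init__.
        catalog = CATALOGS.get(state["language"], {})
        return catalog.get(self.message, self.message)


greeting = LazyText("Hello")      # created once, e.g. at import time
english = str(greeting)           # rendered while the language is "en"

state["language"] = "de"          # the locale becomes known later
german = str(greeting)            # the same object now renders in German
print(english, german)
```

The object is created once, yet each conversion to a string reflects whatever
language is active at that moment.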
.. _utility functions:
Useful utility functions
------------------------
Since some string operations come up again and again, Django ships with a few
useful functions that should make working with unicode and bytestring objects
a bit easier.
Conversion functions
~~~~~~~~~~~~~~~~~~~~
The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between unicode and bytestrings.

    * ``smart_unicode(s, encoding='utf-8', errors='strict')`` converts its
      input to a unicode string. The ``encoding`` parameter specifies the
      input encoding of any bytestring -- Django uses this internally when
      processing form input data, for example, which might not be UTF-8
      encoded. The ``errors`` parameter takes any of the values that are
      accepted by Python's ``unicode()`` function for its error handling.

      If you pass ``smart_unicode()`` an object that has a ``__unicode__``
      method, it will use that method to do the conversion.

    * ``force_unicode(s, encoding='utf-8', errors='strict')`` is identical to
      ``smart_unicode()`` in almost all cases. The difference is when the
      first argument is a `lazy translation`_ instance. Whilst
      ``smart_unicode()`` preserves lazy translations, ``force_unicode()``
      forces those objects to a unicode string (causing the translation to
      occur). Normally, you will want to use ``smart_unicode()``. However,
      ``force_unicode()`` is useful in filters and template tags when you
      absolutely must have a string to work with, not just something that can
      be converted to a string.

    * ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
      is essentially the opposite of ``smart_unicode()``. It forces the first
      argument to a bytestring. The ``strings_only`` parameter, if set to
      ``True``, will result in Python integers, booleans and ``None`` not
      being converted to a string (they keep their original types). This is
      slightly different semantics from Python's builtin ``str()`` function,
      but the difference is needed in a few places internally.

Normally, you will only need to use ``smart_unicode()``. Call it as early as
possible on any input data that might be either a unicode or bytestring and
from then on you can treat the result as always being unicode.
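To make the semantics concrete, here is a simplified sketch of the two
directions of conversion in present-day Python 3. ``to_text()`` and
``to_bytes()`` are hypothetical stand-ins written for this example; they are
not the Django functions and leave out details such as lazy translations and
``__unicode__`` support:

```python
def to_text(s, encoding="utf-8", errors="strict"):
    """Rough analogue of smart_unicode(): always return a text string."""
    if isinstance(s, str):
        return s                              # already text, nothing to do
    if isinstance(s, bytes):
        return s.decode(encoding, errors)     # decode using the stated encoding
    return str(s)                             # any other object: use its text form


def to_bytes(s, encoding="utf-8", strings_only=False, errors="strict"):
    """Rough analogue of smart_str(): return a bytestring."""
    if strings_only and (s is None or isinstance(s, int)):
        return s                              # ints, bools and None pass through
    if isinstance(s, bytes):
        return s                              # simplification: bytes are returned as-is
    return str(s).encode(encoding, errors)


print(to_text(b"Orl\xc3\xa9ans"), to_text(42), to_bytes("é"))
```

Calling the text conversion once at the boundary, as recommended above, means
the rest of the code never has to check which kind of string it received.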
.. _uri_and_iri:
URI and IRI handling
~~~~~~~~~~~~~~~~~~~~
Web frameworks have to deal with URLs (which are a type of URI_). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you will often need to construct a
URL from an IRI_ (very loosely speaking, a URI that can contain unicode
characters). Getting the quoting and conversion from IRI to URI correct can be
a little tricky, so Django provides some assistance.

    * The function ``django.utils.encoding.iri_to_uri()`` implements the
      conversion from IRI to URI as required by `the specification`_.

    * The functions ``django.utils.html.urlquote()`` and
      ``django.utils.html.urlquote_plus()`` are versions of Python's standard
      ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
      characters (the data is converted to UTF-8 prior to encoding).

These two groups of functions have slightly different purposes and it is
important to keep them straight. Normally, you would use ``urlquote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.
.. note::
    It isn't completely correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It does not perform the
    international domain name encoding portion of the algorithm (at the
    moment).

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.
An example might clarify things here::

    >>> urlquote(u'Paris & Orléans')
    u'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri(u'/favorites/François/%s' % urlquote(u'Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'

If you look carefully, you can see that the portion that was generated by
``urlquote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.
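The same two-step pattern can be tried with nothing but the standard library
of present-day Python 3. In this sketch ``iri_to_uri_sketch()`` and the
``IRI_SAFE`` character set are assumptions made for illustration; they imitate
the behaviour described above and are not Django's code:

```python
from urllib.parse import quote

# Characters left untouched during the IRI-to-URI step. "%" is in the set so
# that sequences which are already percent-encoded are not encoded again.
# The exact set is an assumption for this sketch.
IRI_SAFE = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode non-ASCII characters (as UTF-8), keep URL syntax intact."""
    return quote(iri, safe=IRI_SAFE)


# Step 1: quote an individual path segment, including reserved characters.
segment = quote("Paris & Orléans")

# Step 2: convert the assembled IRI; the "%" from step 1 is left alone.
uri = iri_to_uri_sketch("/favorites/François/" + segment)
print(segment)
print(uri)
```

Because ``%`` is treated as safe in the second step, running the conversion
again on its own output leaves the URI unchanged.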
.. _URI: http://www.ietf.org/rfc/rfc2396.txt
.. _IRI: http://www.ietf.org/rfc/rfc3987.txt
.. _the specification: IRI_
Models
======
Because all strings are returned from the database as unicode strings, model
fields that are character based (``CharField``, ``TextField``, ``URLField``, etc.) will
contain unicode values when Django retrieves the model from the database. This
is always the case, even if the data could fit into an ASCII string.
As always, you can pass in bytestrings when creating a model or populating a
field and Django will convert it to unicode when it needs to.
Choosing between ``__str__()`` and ``__unicode__()``
-----------------------------------------------------
One consequence of using unicode by default is that you have to take some care
when printing data from the model. In particular, rather than writing a
``__str__()`` method, it is recommended to write a ``__unicode__()`` method for
your model. In the ``__unicode__()`` method, you can quite safely return the
values of all your fields without having to worry about whether they fit into a
bytestring or not (the result of ``__str__()`` is *always* a bytestring, even
if you accidentally try to return a unicode object).
You can still create a ``__str__()`` method on your models if you wish, of
course. However, Django's ``Model`` base class automatically provides a
``__str__()`` method that calls your ``__unicode__()`` method and then encodes
the result correctly into UTF-8. So you would normally only create a
``__unicode__()`` method and let Django handle the coercion to a bytestring
when required.
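The delegation can be sketched as follows. This is an illustration in
present-day Python 3, where the bytestring conversion hook is ``__bytes__()``
rather than ``__str__()``; ``TextBase`` and ``Person`` are hypothetical classes
and not Django's ``Model``:

```python
class TextBase:
    """Hypothetical base class: subclasses only write __unicode__()."""

    def __unicode__(self):
        raise NotImplementedError("subclasses provide the text representation")

    def __str__(self):
        # Text representation: simply delegate.
        return self.__unicode__()

    def __bytes__(self):
        # Bytestring representation: delegate, then encode as UTF-8.
        return self.__unicode__().encode("utf-8")


class Person(TextBase):
    def __init__(self, name):
        self.name = name

    def __unicode__(self):
        return self.name


person = Person("François")
encoded = bytes(person)   # the base class takes care of the UTF-8 encoding
print(str(person), encoded)
```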
Taking care in ``get_absolute_url()``
-------------------------------------
URLs can only contain ASCII characters. If you are constructing a URL from
pieces of data that might be non-ASCII, you must be careful to encode the
results in a way that is suitable for a URL. If you are using the
``django.db.models.permalink()`` decorator, this is handled automatically by
the decorator.
If you are constructing the URL manually, you need to take care of the
encoding yourself. Normally, this would involve a combination of the
``iri_to_uri()`` and ``urlquote()`` functions that were documented above_. For
example::

    from django.utils.encoding import iri_to_uri
    from django.utils.html import urlquote

    def get_absolute_url(self):
        url = u'/person/%s/?x=0&y=0' % urlquote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because all the
non-ASCII characters would have been removed in quoting in the first line.)
.. _RFC 3987: IRI_
.. _above: uri_and_iri_
The database API
================
You can happily pass unicode strings or bytestrings as arguments to
``filter()`` methods and the like in the database API. The following two
querysets are identical::

    qs = People.objects.filter(name__contains=u'Å')
    qs = People.objects.filter(name__contains='\xc3\x85') # UTF-8 encoding of Å

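The equivalence rests on the bytestring being the UTF-8 encoding of the same
character. A quick check in present-day Python 3, independent of Django and
of any database:

```python
name_text = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
name_bytes = b"\xc3\x85"    # the same character encoded as UTF-8 (two bytes)

# Decoding the bytes as UTF-8 yields the text, and encoding the text yields the bytes.
same_after_decode = name_bytes.decode("utf-8") == name_text
same_after_encode = name_text.encode("utf-8") == name_bytes
print(same_after_decode, same_after_encode)
```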
Templates
=========
As usual, templates can be created from unicode or bytestrings. However, they
can also be created by reading a file from disk and this creates a slight
complication: not all filesystems store their data encoded as UTF-8. If your
template files are not stored with a UTF-8 encoding, set the ``FILE_CHARSET``
setting to the encoding of the on-disk files. When Django reads in a template
file it will convert the data from this encoding to unicode.
When a template is rendered for sending out as an HTML document or an e-mail,
it may be convenient to use an encoding other than UTF-8. You should set the
``DEFAULT_CHARSET`` parameter to control the rendered template encoding (the
default setting is utf-8).
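The two settings act at opposite ends of the process: one decodes the file on
the way in, the other encodes the result on the way out. The sketch below, in
present-day Python 3, imitates that flow with plain variables named after the
settings; it does not load Django or its template engine, and the file name
and contents are made up:

```python
import os
import tempfile

FILE_CHARSET = "iso-8859-1"   # assumed encoding of the template file on disk
DEFAULT_CHARSET = "utf-8"     # assumed encoding of the rendered output

template_source = "<p>Bienvenue à Orléans</p>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "welcome.html")
    # Simulate a template saved by an editor that uses latin1.
    with open(path, "wb") as fh:
        fh.write(template_source.encode(FILE_CHARSET))
    # Reading: raw bytes from disk are decoded to text using FILE_CHARSET.
    with open(path, "rb") as fh:
        loaded = fh.read().decode(FILE_CHARSET)

# Rendering: the text is encoded for the client using DEFAULT_CHARSET.
response_body = loaded.encode(DEFAULT_CHARSET)
print(loaded, response_body)
```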
E-mail
======
Django's email framework (in ``django.core.mail``) supports unicode
transparently. You can use unicode data in the message bodies and any headers.
However, you must still respect the requirements of the email specifications,
so, for example, email addresses should use ASCII characters. The following
code is certainly possible (demonstrating that everything except e-mail
addresses can be non-ASCII)::

    from django.core.mail import EmailMessage

    subject = u'My visit to Sør-Trøndelag'
    sender = u'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
    recipients = ['Fred <fred@example.com>']
    body = u'...'
    EmailMessage(subject, body, sender, recipients).send()

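What "respecting the specifications" means on the wire can be seen with
Python's standard-library ``email`` package. This sketch uses present-day
Python 3 and ``email.message.EmailMessage``, which is a different class from
Django's ``EmailMessage``; it builds the message but does not send it:

```python
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "My visit to Sør-Trøndelag"
# Display names may be non-ASCII; the addresses themselves stay ASCII.
msg["From"] = Address("Arnbjörg Ráðormsdóttir", "arnbjorg", "example.com")
msg["To"] = Address("Fred", "fred", "example.com")
msg.set_content("...")

# Serialising the message encodes the non-ASCII header text as ASCII-only
# "encoded words", which is what the e-mail specifications require.
wire = msg.as_bytes()
print(wire.decode("ascii"))
```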
Form submission
===============
HTML form submission is a tricky area. There is no guarantee that the
submission will include encoding information.
Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as unicode data. All other members will
be returned exactly as they were submitted by the client.
By default, the ``DEFAULT_CHARSET`` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on the ``GET`` and ``POST`` data structures. For
example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.GET.encoding = 'koi8-r'
        request.POST.encoding = 'koi8-r'
        ...

You will rarely need to worry about changing the form encoding. However, if
you are talking to a legacy system or a system beyond your control with
particular ideas about encoding, you do have a way to control the decoding of
the data.
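Why the assumed encoding matters can be shown with the standard library alone.
The sketch below uses ``urllib.parse`` from present-day Python 3 to stand in
for a client that submits KOI8-R data; the field name ``q`` and its value are
invented, and no ``HttpRequest`` object is involved:

```python
from urllib.parse import parse_qs, quote

# Simulate the raw query string sent by a client that encodes form data as KOI8-R.
raw_query = "q=" + quote("привет", encoding="koi8-r")

# Decoding with the encoding the client really used recovers the text.
decoded_koi8 = parse_qs(raw_query, encoding="koi8-r")

# Decoding with the wrong assumption (UTF-8 here) yields garbage instead.
decoded_utf8 = parse_qs(raw_query, encoding="utf-8", errors="replace")

print(decoded_koi8["q"][0], decoded_utf8["q"][0])
```

The submitted bytes are identical in both calls; only the assumed encoding
differs, which is exactly what the ``encoding`` attribute lets you control.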
For request features such as file uploads, no automatic decoding takes place,
because those attributes are normally treated as collections of bytes, rather
than strings. Any decoding would alter the meaning of the stream of bytes.