mirror of https://github.com/django/django.git synced 2025-10-31 09:41:08 +00:00

Refs #3254 -- Added full text search to contrib.postgres.

Adds a reasonably feature complete implementation of full text search
using the built in PostgreSQL engine. It uses public APIs from
Expression and Lookup.

With thanks to Tim Graham, Simon Charettes, Josh Smeaton, Mikey Ariel
and many others for their advice and review. Particular thanks also go
to the supporters of the contrib.postgres kickstarter.
This commit is contained in:
Marc Tamlyn
2015-05-31 22:45:03 +01:00
parent f4c2b8e04a
commit 2d877da855
16 changed files with 880 additions and 4 deletions

View File

@@ -105,6 +105,7 @@ manipulating the data of your Web application. Learn more about it below:
:doc:`Raw SQL <topics/db/sql>` |
:doc:`Transactions <topics/db/transactions>` |
:doc:`Aggregation <topics/db/aggregation>` |
:doc:`Search <topics/db/search>` |
:doc:`Custom fields <howto/custom-model-fields>` |
:doc:`Multiple databases <topics/db/multi-db>` |
:doc:`Custom lookups <howto/custom-lookups>` |

View File

@@ -37,4 +37,5 @@ release. Some fields require higher versions.
functions
lookups
operations
search
validators

View File

@@ -0,0 +1,191 @@
================
Full text search
================
.. versionadded:: 1.10
The database functions in the ``django.contrib.postgres.search`` module ease
the use of PostgreSQL's `full text search engine
<http://www.postgresql.org/docs/current/static/textsearch.html>`_.
For the examples in this document, we'll use the models defined in
:doc:`/topics/db/queries`.
.. seealso::
For a high-level overview of searching, see the :doc:`topic documentation
</topics/db/search>`.
.. currentmodule:: django.contrib.postgres.search
The ``search`` lookup
=====================
.. fieldlookup:: search
The simplest way to use full text search is to search a single term against a
single column in the database. For example::
>>> Entry.objects.filter(body_text__search='Cheese')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]
This creates a ``to_tsvector`` in the database from the ``body_text`` field
and a ``plainto_tsquery`` from the search term ``'Cheese'``, both using the
default database search configuration. The results are obtained by matching the
query and the vector.
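As a rough mental model of the matching described above, the following pure-Python sketch treats the vector as a set of lowercased words and the query as a list of words that must all be present. The helper names (``to_vector``, ``plain_query``, ``matches``) are hypothetical, and the sketch deliberately ignores stemming and stop words, which PostgreSQL applies according to the search configuration.

```python
import re


def to_vector(text):
    """Toy stand-in for a search vector: the set of lowercased words."""
    return set(re.findall(r"\w+", text.lower()))


def plain_query(text):
    """Toy stand-in for a plain query: the lowercased words, all required."""
    return re.findall(r"\w+", text.lower())


def matches(vector, query):
    """A document matches only if every query word is in its vector."""
    return all(term in vector for term in query)


vector = to_vector("Cheese on Toast recipes")
print(matches(vector, plain_query("Cheese")))
print(matches(vector, plain_query("cheese pizza")))
```

The point of the sketch is that matching happens on normalized words rather than on raw substrings, which is why the capitalization of the search term does not matter.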
To use the ``search`` lookup, ``'django.contrib.postgres'`` must be in your
:setting:`INSTALLED_APPS`.
``SearchVector``
================
.. class:: SearchVector(\*expressions, config=None, weight=None)
Searching against a single field is great but rather limiting. The ``Entry``
instances we're searching belong to a ``Blog``, which has a ``tagline`` field.
To query against both fields, use a ``SearchVector``::
>>> from django.contrib.postgres.search import SearchVector
>>> Entry.objects.annotate(
... search=SearchVector('body_text', 'blog__tagline'),
... ).filter(search='Cheese')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]
The arguments to ``SearchVector`` can be any
:class:`~django.db.models.Expression` or the name of a field. Multiple
arguments will be concatenated together using a space so that the search
document includes them all.
``SearchVector`` objects can be combined together, allowing you to reuse them.
For example::
>>> Entry.objects.annotate(
... search=SearchVector('body_text') + SearchVector('blog__tagline'),
... ).filter(search='Cheese')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]
See :ref:`postgresql-fts-search-configuration` and
:ref:`postgresql-fts-weighting-queries` for an explanation of the ``config``
and ``weight`` parameters.
``SearchQuery``
===============
.. class:: SearchQuery(value, config=None)
``SearchQuery`` translates the terms the user provides into a search query
object that the database compares to a search vector. By default, all the words
the user provides are passed through the stemming algorithms, and the database
then looks for matches for all of the resulting terms.
``SearchQuery`` terms can be combined logically to provide more flexibility::
>>> from django.contrib.postgres.search import SearchQuery
>>> SearchQuery('potato') & SearchQuery('ireland') # potato AND ireland
>>> SearchQuery('potato') | SearchQuery('penguin') # potato OR penguin
>>> ~SearchQuery('sausage') # NOT sausage
See :ref:`postgresql-fts-search-configuration` for an explanation of the
``config`` parameter.
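To make the logical combinations above concrete, here is a small pure-Python sketch of how ``&``, ``|`` and ``~`` can build a query tree that is then evaluated against a set of document terms. The classes (``Query``, ``Term``, ``And``, ``Or``, ``Not``) are hypothetical and are not how ``SearchQuery`` is implemented; in Django the combination is translated to SQL and evaluated by the database.

```python
class Query:
    """Base class providing the &, | and ~ operators."""

    def __and__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)

    def __invert__(self):
        return Not(self)


class Term(Query):
    def __init__(self, word):
        self.word = word

    def matches(self, terms):
        return self.word in terms


class And(Query):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def matches(self, terms):
        return self.left.matches(terms) and self.right.matches(terms)


class Or(Query):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def matches(self, terms):
        return self.left.matches(terms) or self.right.matches(terms)


class Not(Query):
    def __init__(self, inner):
        self.inner = inner

    def matches(self, terms):
        return not self.inner.matches(terms)


document = {"potato", "ireland", "famine"}
print((Term("potato") & Term("ireland")).matches(document))
print((Term("potato") | Term("penguin")).matches(document))
print((~Term("sausage")).matches(document))
```

Because ``~`` binds more tightly than ``&`` and ``|`` in Python, an expression such as ``Term("potato") & ~Term("ireland")`` negates only the second term.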
``SearchRank``
==============
.. class:: SearchRank(vector, query, weights=None)
So far, we've just returned the results for which any match between the vector
and the query is possible. It's likely you may wish to order the results by
some sort of relevancy. PostgreSQL provides a ranking function which takes into
account how often the query terms appear in the document, how close together
the terms are in the document, and how important the part of the document is
where they occur. The better the match, the higher the value of the rank. To
order by relevancy::
>>> from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
>>> vector = SearchVector('body_text')
>>> query = SearchQuery('cheese')
>>> Entry.objects.annotate(rank=SearchRank(vector, query)).order_by('-rank')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]
See :ref:`postgresql-fts-weighting-queries` for an explanation of the
``weights`` parameter.
.. _postgresql-fts-search-configuration:
Changing the search configuration
=================================
You can pass the ``config`` argument to a :class:`SearchVector` and
:class:`SearchQuery` to use a different search configuration. This allows using
different language parsers and dictionaries as defined by the database::
>>> from django.contrib.postgres.search import SearchQuery, SearchVector
>>> Entry.objects.annotate(
... search=SearchVector('body_text', config='french'),
... ).filter(search=SearchQuery('œuf', config='french'))
[<Entry: Pain perdu>]
The value of ``config`` could also be stored in another column::
>>> from django.db.models import F
>>> Entry.objects.annotate(
... search=SearchVector('body_text', config=F('blog__language')),
... ).filter(search=SearchQuery('œuf', config=F('blog__language')))
[<Entry: Pain perdu>]
.. _postgresql-fts-weighting-queries:
Weighting queries
=================
Not every field has the same relevance in a query, so you can set the weights
of individual vectors before you combine them::
>>> from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
>>> vector = SearchVector('body_text', weight='A') + SearchVector('blog__tagline', weight='B')
>>> query = SearchQuery('cheese')
>>> Entry.objects.annotate(rank=SearchRank(vector, query)).filter(rank__gte=0.3).order_by('-rank')
The weight should be one of the following letters: D, C, B, A. By default,
these weights refer to the numbers ``0.1``, ``0.2``, ``0.4``, and ``1.0``,
respectively. If you wish to weight them differently, pass a list of four
floats to :class:`SearchRank` as ``weights`` in the same order above::
>>> rank = SearchRank(vector, query, weights=[0.2, 0.4, 0.6, 0.8])
>>> Entry.objects.annotate(rank=rank).filter(rank__gte=0.3).order_by('-rank')
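The order of the ``weights`` list is easy to get wrong, so here is a small pure-Python sketch of how the letters map onto it. ``weight_for`` and ``toy_score`` are hypothetical helpers. The score is a plain weighted sum chosen for illustration and is not the formula that PostgreSQL's ranking function uses.

```python
# The list is ordered D, C, B, A, so the *last* entry belongs to 'A'.
DEFAULT_WEIGHTS = [0.1, 0.2, 0.4, 1.0]


def weight_for(label, weights=DEFAULT_WEIGHTS):
    """Return the numeric weight for one of the letters D, C, B, A."""
    return weights["DCBA".index(label)]


def toy_score(hits, weights=DEFAULT_WEIGHTS):
    """Weighted sum of match counts per label (illustrative only)."""
    return sum(weight_for(label, weights) * count for label, count in hits.items())


# One match in an 'A' field and two matches in a 'B' field.
hits = {"A": 1, "B": 2}
print(toy_score(hits))
print(toy_score(hits, weights=[0.2, 0.4, 0.6, 0.8]))
```

With the custom list from the example above, ``'A'`` maps to ``0.8`` and ``'B'`` to ``0.6``, so body text matches count for less and tagline matches for more than they do with the defaults.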
Performance
===========
Special database configuration isn't necessary to use any of these functions.
However, if you're searching more than a few hundred records, you're likely to
run into performance problems. Full text search is a more intensive process
than comparing the size of an integer, for example.
In the event that all the fields you're querying on are contained within one
particular model, you can create a functional index which matches the search
vector you wish to use. For example:
.. code-block:: sql
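CREATE INDEX body_text_search ON blog_entry USING gin (to_tsvector('english', body_text));
PostgreSQL only accepts the two-argument form of ``to_tsvector`` in an index
expression, so the configuration (``'english'`` here) must be named explicitly.
Queries whose search vector matches this expression, including the
configuration, can then use the index. In many cases this will be sufficient.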
``SearchVectorField``
---------------------
.. class:: SearchVectorField
If this approach becomes too slow, you can add a ``SearchVectorField`` to your
model. You'll need to keep it populated with triggers, for example, as
described in the `PostgreSQL documentation`_. You can then query the field as
if it were an annotated ``SearchVector``::
>>> Entry.objects.update(search_vector=SearchVector('body_text'))
>>> Entry.objects.filter(search_vector='cheese')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]
.. _PostgreSQL documentation: http://www.postgresql.org/docs/current/static/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS

View File

@@ -24,7 +24,14 @@ recommend** and only officially support the latest release of each series.
What's new in Django 1.10
=========================
...
Full text search for PostgreSQL
-------------------------------
``django.contrib.postgres`` now includes a :doc:`collection of database
functions </ref/contrib/postgres/search>` to allow the use of the full text
search engine. You can search across multiple fields in your relational
database, combine the searches with other lookups, use different language
configurations and weightings, and rank the results by relevance.
Minor features
--------------

View File

@@ -14,6 +14,7 @@ model maps to a single database table.
models
queries
aggregation
search
managers
sql
transactions

View File

@@ -27,7 +27,7 @@ models, which comprise a Weblog application:
return self.name
class Author(models.Model):
name = models.CharField(max_length=50)
name = models.CharField(max_length=200)
email = models.EmailField()
def __str__(self): # __unicode__ on Python 2

129
docs/topics/db/search.txt Normal file
View File

@@ -0,0 +1,129 @@
======
Search
======
A common task for web applications is to search some data in the database with
user input. In a simple case, this could be filtering a list of objects by a
category. A more complex use case might require searching with weighting,
categorization, highlighting, multiple languages, and so on. This document
explains some of the possible use cases and the tools you can use.
We'll refer to the same models used in :doc:`/topics/db/queries`.
Use Cases
=========
Standard textual queries
------------------------
Text-based fields have a selection of simple matching operations. For example,
you may wish to allow looking up an author like so::
>>> Author.objects.filter(name__contains='Terry')
[<Author: Terry Gilliam>, <Author: Terry Jones>]
This is a very fragile solution as it requires the user to know an exact
substring of the author's name. A better approach could be a case-insensitive
match (:lookup:`icontains`), but this is only marginally better.
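The difference between the two lookups, and why the second is only a small improvement, can be sketched in plain Python. The functions ``contains`` and ``icontains`` below are hypothetical stand-ins that filter a list in memory; the real lookups are evaluated by the database.

```python
AUTHORS = ["Terry Gilliam", "Terry Jones"]


def contains(names, needle):
    """Case-sensitive substring match."""
    return [name for name in names if needle in name]


def icontains(names, needle):
    """Case-insensitive substring match."""
    return [name for name in names if needle.lower() in name.lower()]


print(contains(AUTHORS, "terry"))   # lowercase input misses both authors
print(icontains(AUTHORS, "terry"))  # ignoring case finds them
print(icontains(AUTHORS, "Terri"))  # a small misspelling still finds nothing
```

Ignoring case helps with capitalization, but the user must still type an exact substring of the stored name.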
A database's more advanced comparison functions
-----------------------------------------------
If you're using PostgreSQL, Django provides :doc:`a selection of
database-specific tools </ref/contrib/postgres/search>` to allow you to leverage more
complex querying options. Other databases have different selections of tools,
possibly via plugins or user-defined functions. Django doesn't include any
support for them at this time. We'll use some examples from PostgreSQL to
demonstrate the kind of functionality databases may have.
.. admonition:: Searching in other databases
All of the searching tools provided by :mod:`django.contrib.postgres` are
constructed entirely on public APIs such as :doc:`custom lookups
</ref/models/lookups>` and :doc:`database functions
</ref/models/database-functions>`. Depending on your database, you should
be able to construct queries to allow similar APIs. If there are specific
things which cannot be achieved this way, please open a ticket.
In the above example, we determined that a case-insensitive lookup would be
more useful. When dealing with non-English names, a further improvement is to
use :lookup:`unaccented comparison <unaccent>`::
>>> Author.objects.filter(name__unaccent__icontains='Helen')
[<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>]
This shows another issue, where we are matching against a different spelling of
the name. In this case we have an asymmetry though - a search for ``Helen``
will pick up ``Helena`` or ``Hélène``, but not the reverse. Another option
would be to use a trigram comparison, which compares sequences of letters.
For example::
>>> Author.objects.filter(name__unaccent__lower__trigram='Hélène')
[<Author: Helen Mirren>, <Author: Hélène Joy>]
Now we have a different problem - "Helena Bonham Carter" doesn't show up
because the name is much longer. Trigram searches consider all combinations of
three letters and compare how many appear in both the search and source
strings. The longer name contains many more combinations that don't appear in
the search string, so it is no longer considered a close match.
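The effect of name length on a trigram comparison can be reproduced with a short pure-Python sketch. It assumes a scheme modelled on PostgreSQL's ``pg_trgm`` extension: each word is padded with two leading spaces and one trailing space, and similarity is the number of shared trigrams divided by the number of distinct trigrams in either string. The function names and the ``0.3`` cut-off are assumptions made for this illustration.

```python
import unicodedata


def normalize(text):
    """Strip accents and lowercase, roughly like unaccent followed by lower."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def trigrams(text):
    """Set of three-character sequences taken from each padded word."""
    result = set()
    for word in normalize(text).split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


names = ["Helen Mirren", "Helena Bonham Carter", "Hélène Joy"]
for name in names:
    print(name, round(similarity("Hélène", name), 3))

close = [name for name in names if similarity("Hélène", name) > 0.3]
print(close)
```

Under these assumptions, "Helen Mirren" and "Helena Bonham Carter" each share five trigrams with the search term, but the longer name spreads them over 23 distinct trigrams instead of 14, which pushes its score below the cut-off.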
The correct choice of comparison functions here depends on your particular data
set, for example the language(s) used and the type of text being searched. All
of the examples we've seen are on short strings where the user is likely to
enter something close (by varying definitions) to the source data.
Document-based search
---------------------
Standard database operations stop being a useful approach when you start
considering large blocks of text. Whereas the examples above can be thought of
as operations on a string of characters, full text search looks at the actual
words. Depending on the system used, it's likely to use some of the following
ideas:
- Ignoring "stop words" such as "a", "the", "and".
- Stemming words, so that "pony" and "ponies" are considered similar.
- Weighting words based on different criteria such as how frequently they
appear in the text, or the importance of the fields, such as the title or
keywords, that they appear in.
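The three ideas in the list above can be sketched in a few lines of pure Python. The stop word list, the ``naive_stem`` function and the frequency count are all simplified assumptions for illustration. Real search engines use language-specific dictionaries and far more careful stemming algorithms.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in"}


def naive_stem(word):
    """Very crude suffix stripping, for illustration only."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def to_terms(text):
    """Lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(word) for word in words if word not in STOP_WORDS]


def term_frequencies(text):
    """Count how often each stemmed term occurs, a simple basis for weighting."""
    return Counter(to_terms(text))


print(to_terms("The ponies and a pony"))
print(term_frequencies("Cheese on toast: the best cheeses"))
```

After stemming, "ponies" and "pony" become the same term, and "cheese" is counted twice in the second example even though the text spells it two different ways.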
There are many alternative search tools; some of the most prominent are
Elastic_ and Solr_. These are full document-based search
solutions. To use them with data from Django models, you'll need a layer which
translates your data into a textual document, including back-references to the
database ids. When a search using the engine returns a certain document, you
can then look it up in the database. There are a variety of third-party
libraries which are designed to help with this process.
.. _Elastic: https://www.elastic.co/
.. _Solr: http://lucene.apache.org/solr/
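The translation layer described above can be pictured with a toy in-memory index. This is only a sketch of the idea, not of how Elastic or Solr work internally: ``build_index`` and ``search`` are hypothetical helpers, and the ids are made up for the example.

```python
from collections import defaultdict


def build_index(rows):
    """Map each lowercased word to the set of database ids containing it."""
    index = defaultdict(set)
    for pk, text in rows:
        for term in text.lower().split():
            index[term].add(pk)
    return index


def search(index, term):
    """Return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), set()))


# (id, text) pairs standing in for rows exported from the database.
rows = [
    (1, "Cheese on Toast recipes"),
    (2, "Pizza recipes"),
    (3, "Dairy farming in Argentina"),
]
index = build_index(rows)
print(search(index, "recipes"))
print(search(index, "Cheese"))
```

The search returns ids rather than objects, so the final step is a normal database lookup such as ``Entry.objects.filter(pk__in=ids)``.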
PostgreSQL support
~~~~~~~~~~~~~~~~~~
PostgreSQL has its own full text search implementation built-in. While not as
powerful as some other search engines, it has the advantage of being inside
your database and so can easily be combined with other relational queries such
as categorization.
The :mod:`django.contrib.postgres` module provides some helpers to make these
queries. For example, a simple query might be to select all the blog entries
which mention "cheese"::
>>> Entry.objects.filter(body_text__search='cheese')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]
You can also filter on a combination of fields and on related models::
>>> from django.contrib.postgres.search import SearchVector
>>> Entry.objects.annotate(
... search=SearchVector('blog__tagline', 'body_text'),
... ).filter(search='cheese')
[
<Entry: Cheese on Toast recipes>,
<Entry: Pizza Recipes>,
<Entry: Dairy farming in Argentina>,
]
See the ``contrib.postgres`` :doc:`/ref/contrib/postgres/search` document for
complete details.