================
Full text search
================

The database functions in the ``django.contrib.postgres.search`` module ease
the use of PostgreSQL's `full text search engine
<https://www.postgresql.org/docs/current/textsearch.html>`_.

For the examples in this document, we'll use the models defined in
:doc:`/topics/db/queries`.

.. seealso::

    For a high-level overview of searching, see the :doc:`topic documentation
    </topics/db/search>`.

.. currentmodule:: django.contrib.postgres.search

The ``search`` lookup
=====================

.. fieldlookup:: search

A common way to use full text search is to search a single term against a
single column in the database. For example:

.. code-block:: pycon

    >>> Entry.objects.filter(body_text__search="Cheese")
    [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]

This creates a ``to_tsvector`` in the database from the ``body_text`` field
and a ``plainto_tsquery`` from the search term ``'Cheese'``, both using the
default database search configuration. The results are obtained by matching the
query and the vector.

To use the ``search`` lookup, ``'django.contrib.postgres'`` must be in your
:setting:`INSTALLED_APPS`.

``SearchVector``
================

.. class:: SearchVector(*expressions, config=None, weight=None)

Searching against a single field is great but rather limiting. The ``Entry``
instances we're searching belong to a ``Blog``, which has a ``tagline`` field.
To query against both fields, use a ``SearchVector``:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchVector
    >>> Entry.objects.annotate(
    ...     search=SearchVector("body_text", "blog__tagline"),
    ... ).filter(search="Cheese")
    [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]

The arguments to ``SearchVector`` can be any
:class:`~django.db.models.Expression` or the name of a field. Multiple
arguments will be concatenated together using a space so that the search
document includes them all.

``SearchVector`` objects can be combined together, allowing you to reuse them.
For example:

.. code-block:: pycon

    >>> Entry.objects.annotate(
    ...     search=SearchVector("body_text") + SearchVector("blog__tagline"),
    ... ).filter(search="Cheese")
    [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]

See :ref:`postgresql-fts-search-configuration` and
:ref:`postgresql-fts-weighting-queries` for an explanation of the ``config``
and ``weight`` parameters.

``SearchQuery``
===============

.. class:: SearchQuery(value, config=None, search_type='plain')

``SearchQuery`` translates the terms the user provides into a search query
object that the database compares to a search vector. By default, all the words
the user provides are passed through the stemming algorithms, and then it
looks for matches for all of the resulting terms.

If ``search_type`` is ``'plain'``, which is the default, the terms are treated
as separate keywords. If ``search_type`` is ``'phrase'``, the terms are treated
as a single phrase. If ``search_type`` is ``'raw'``, then you can provide a
formatted search query with terms and operators. If ``search_type`` is
``'websearch'``, then you can provide a formatted search query, similar to the
one used by web search engines. ``'websearch'`` requires PostgreSQL ≥ 11. Read
PostgreSQL's `Full Text Search docs`_ to learn about differences and syntax.
Examples:

.. _Full Text Search docs: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchQuery
    >>> SearchQuery("red tomato")  # two keywords
    >>> SearchQuery("tomato red")  # same results as above
    >>> SearchQuery("red tomato", search_type="phrase")  # a phrase
    >>> SearchQuery("tomato red", search_type="phrase")  # a different phrase
    >>> SearchQuery("'tomato' & ('red' | 'green')", search_type="raw")  # boolean operators
    >>> SearchQuery(
    ...     "'tomato' ('red' OR 'green')", search_type="websearch"
    ... )  # websearch operators

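As a rough illustration of the four ``search_type`` values, the sketch below
pairs each one with the PostgreSQL function that parses that kind of query
text. The dictionary and the ``tsquery_call()`` helper are hypothetical names
used only for this example, and the returned string is a readable description,
not SQL that should be executed as is.

```python
# Assumed pairing of each ``search_type`` with the PostgreSQL function that
# parses the query text. Names in this block are illustrative only.
SEARCH_TYPE_FUNCTIONS = {
    "plain": "plainto_tsquery",
    "phrase": "phraseto_tsquery",
    "raw": "to_tsquery",
    "websearch": "websearch_to_tsquery",
}


def tsquery_call(value, search_type="plain"):
    """Return a readable description of the parsing call for ``value``."""
    try:
        function = SEARCH_TYPE_FUNCTIONS[search_type]
    except KeyError:
        raise ValueError(f"Unknown search_type: {search_type!r}") from None
    return f"{function}({value!r})"


print(tsquery_call("red tomato"))
print(tsquery_call("red tomato", search_type="phrase"))
```

The point of the sketch is that the same text is interpreted differently
depending on which parser receives it, which is why ``"red tomato"`` and
``"tomato red"`` match the same rows as plain queries but not as phrases.
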
``SearchQuery`` terms can be combined logically to provide more flexibility:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchQuery
    >>> SearchQuery("meat") & SearchQuery("cheese")  # AND
    >>> SearchQuery("meat") | SearchQuery("cheese")  # OR
    >>> ~SearchQuery("meat")  # NOT

See :ref:`postgresql-fts-search-configuration` for an explanation of the
``config`` parameter.

``SearchRank``
==============

.. class:: SearchRank(vector, query, weights=None, normalization=None, cover_density=False)

So far, we've returned the results for which any match between the vector and
the query is possible. It's likely you'll want to order the results by some
sort of relevancy. PostgreSQL provides a ranking function which takes into
account how often the query terms appear in the document, how close together
the terms are in the document, and how important the part of the document is
where they occur. The better the match, the higher the value of the rank. To
order by relevancy:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
    >>> vector = SearchVector("body_text")
    >>> query = SearchQuery("cheese")
    >>> Entry.objects.annotate(rank=SearchRank(vector, query)).order_by("-rank")
    [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]

See :ref:`postgresql-fts-weighting-queries` for an explanation of the
``weights`` parameter.

Set the ``cover_density`` parameter to ``True`` to enable the cover density
ranking, which means that the proximity of matching query terms is taken into
account.

Provide an integer to the ``normalization`` parameter to control rank
normalization. This integer is a bit mask, so you can combine multiple
behaviors:

.. code-block:: pycon

    >>> from django.db.models import Value
    >>> Entry.objects.annotate(
    ...     rank=SearchRank(
    ...         vector,
    ...         query,
    ...         normalization=Value(2).bitor(Value(4)),
    ...     )
    ... )

The PostgreSQL documentation has more details about `different rank
normalization options`_.

.. _different rank normalization options: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING

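To make the bit mask easier to read, the following sketch gives each
normalization bit a name. The ``RankNormalization`` class and its member names
are hypothetical and are not part of Django or PostgreSQL; the numeric values
and their meanings follow the PostgreSQL ranking documentation linked above.

```python
from enum import IntFlag


class RankNormalization(IntFlag):
    """Hypothetical names for PostgreSQL's rank normalization bits."""

    LOG_LENGTH = 1  # divide the rank by 1 + log(document length)
    LENGTH = 2  # divide the rank by the document length
    HARMONIC_DISTANCE = 4  # divide by the mean harmonic distance between extents
    UNIQUE_WORDS = 8  # divide by the number of unique words in the document
    LOG_UNIQUE_WORDS = 16  # divide by 1 + log(number of unique words)
    SELF_PLUS_ONE = 32  # divide the rank by itself + 1


# Combining two behaviors with a bitwise OR yields a single integer mask.
mask = RankNormalization.LENGTH | RankNormalization.HARMONIC_DISTANCE
print(int(mask))
```

Here ``int(mask)`` is ``6``, the same integer that ``2 | 4`` produces. Per the
PostgreSQL documentation, the harmonic distance option only has an effect with
cover density ranking.
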
``SearchHeadline``
==================

.. class:: SearchHeadline(expression, query, config=None, start_sel=None, stop_sel=None, max_words=None, min_words=None, short_word=None, highlight_all=None, max_fragments=None, fragment_delimiter=None)

Accepts a single text field or an expression, a query, a config, and a set of
options. Returns highlighted search results.

Set the ``start_sel`` and ``stop_sel`` parameters to the string values to be
used to wrap highlighted query terms in the document. PostgreSQL's defaults are
``<b>`` and ``</b>``.

Provide integer values to the ``max_words`` and ``min_words`` parameters to
determine the longest and shortest headlines. PostgreSQL's defaults are 35 and
15.

Provide an integer value to the ``short_word`` parameter to discard words of
this length or less in each headline. PostgreSQL's default is 3.

Set the ``highlight_all`` parameter to ``True`` to use the whole document in
place of a fragment and ignore the ``max_words``, ``min_words``, and
``short_word`` parameters. That's disabled by default in PostgreSQL.

Provide a non-zero integer value to the ``max_fragments`` parameter to set the
maximum number of fragments to display. That's disabled by default in
PostgreSQL.

Set the ``fragment_delimiter`` string parameter to configure the delimiter
between fragments. PostgreSQL's default is ``" ... "``.

The PostgreSQL documentation has more details on `highlighting search
results`_.

Usage example:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchHeadline, SearchQuery
    >>> query = SearchQuery("red tomato")
    >>> entry = Entry.objects.annotate(
    ...     headline=SearchHeadline(
    ...         "body_text",
    ...         query,
    ...         start_sel="<span>",
    ...         stop_sel="</span>",
    ...     ),
    ... ).get()
    >>> print(entry.headline)
    Sandwich with <span>tomato</span> and <span>red</span> cheese.

See :ref:`postgresql-fts-search-configuration` for an explanation of the
``config`` parameter.

.. _highlighting search results: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-HEADLINE

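PostgreSQL's ``ts_headline()`` function takes its options as a single string of
comma-separated ``option=value`` pairs. The sketch below shows how the
``SearchHeadline`` keyword arguments could line up with those option names. The
``headline_options()`` helper is hypothetical, it is not how Django builds the
query, and it ignores the quoting that real values may need.

```python
# Assumed correspondence between SearchHeadline keyword arguments and the
# option names accepted by PostgreSQL's ts_headline().
OPTION_NAMES = {
    "start_sel": "StartSel",
    "stop_sel": "StopSel",
    "max_words": "MaxWords",
    "min_words": "MinWords",
    "short_word": "ShortWord",
    "highlight_all": "HighlightAll",
    "max_fragments": "MaxFragments",
    "fragment_delimiter": "FragmentDelimiter",
}


def headline_options(**kwargs):
    """Build a ts_headline-style options string (hypothetical helper)."""
    unknown = set(kwargs) - set(OPTION_NAMES)
    if unknown:
        raise TypeError(f"Unknown options: {sorted(unknown)}")
    parts = []
    for argument, option in OPTION_NAMES.items():
        value = kwargs.get(argument)
        if value is None:
            continue  # Unset options fall back to PostgreSQL's defaults.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{option}={value}")
    return ", ".join(parts)


print(headline_options(start_sel="<span>", stop_sel="</span>"))
```

Options left as ``None`` are omitted from the string, which mirrors the way the
documented defaults apply only when a parameter isn't provided.
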
.. _postgresql-fts-search-configuration:

Changing the search configuration
=================================

You can specify the ``config`` attribute to a :class:`SearchVector` and
:class:`SearchQuery` to use a different search configuration. This allows using
different language parsers and dictionaries as defined by the database:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchQuery, SearchVector
    >>> Entry.objects.annotate(
    ...     search=SearchVector("body_text", config="french"),
    ... ).filter(search=SearchQuery("œuf", config="french"))
    [<Entry: Pain perdu>]

The value of ``config`` could also be stored in another column:

.. code-block:: pycon

    >>> from django.db.models import F
    >>> Entry.objects.annotate(
    ...     search=SearchVector("body_text", config=F("blog__language")),
    ... ).filter(search=SearchQuery("œuf", config=F("blog__language")))
    [<Entry: Pain perdu>]

.. _postgresql-fts-weighting-queries:

Weighting queries
=================

Every field may not have the same relevance in a query, so you can set weights
of various vectors before you combine them:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
    >>> vector = SearchVector("body_text", weight="A") + SearchVector(
    ...     "blog__tagline", weight="B"
    ... )
    >>> query = SearchQuery("cheese")
    >>> Entry.objects.annotate(rank=SearchRank(vector, query)).filter(rank__gte=0.3).order_by(
    ...     "rank"
    ... )

The weight should be one of the following letters: D, C, B, A. By default,
these weights refer to the numbers ``0.1``, ``0.2``, ``0.4``, and ``1.0``,
respectively. If you wish to weight them differently, pass a list of four
floats to :class:`SearchRank` as ``weights``, in the same order as above:

.. code-block:: pycon

    >>> rank = SearchRank(vector, query, weights=[0.2, 0.4, 0.6, 0.8])
    >>> Entry.objects.annotate(rank=rank).filter(rank__gte=0.3).order_by("-rank")

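Because the ``weights`` list is positional, it's easy to misread which float
belongs to which letter. The sketch below spells out the D, C, B, A ordering
described above; ``label_weights()`` is a hypothetical helper written only for
this illustration.

```python
# Labels in the order that the ``weights`` list is read: D, C, B, A.
WEIGHT_LABELS = ("D", "C", "B", "A")
DEFAULT_WEIGHTS = (0.1, 0.2, 0.4, 1.0)


def label_weights(weights=DEFAULT_WEIGHTS):
    """Pair each weight letter with its numeric value (hypothetical helper)."""
    if len(weights) != len(WEIGHT_LABELS):
        raise ValueError("Expected four weights, ordered D, C, B, A.")
    return dict(zip(WEIGHT_LABELS, weights))


print(label_weights())
print(label_weights([0.2, 0.4, 0.6, 0.8]))
```

With ``weights=[0.2, 0.4, 0.6, 0.8]``, a vector weighted ``"A"`` contributes
with a factor of ``0.8`` and one weighted ``"B"`` with ``0.6``.
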
Performance
===========

Special database configuration isn't necessary to use any of these functions.
However, if you're searching more than a few hundred records, you're likely to
run into performance problems. Full text search is a more intensive process
than comparing the size of an integer, for example.

If all the fields you're querying are contained within one particular model,
you can create a functional
:class:`GIN <django.contrib.postgres.indexes.GinIndex>` or
:class:`GiST <django.contrib.postgres.indexes.GistIndex>` index which matches
the search vector you wish to use. For example::

    GinIndex(
        SearchVector("body_text", "headline", config="english"),
        name="search_vector_idx",
    )

The PostgreSQL documentation has details on
`creating indexes for full text search
<https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX>`_.

``SearchVectorField``
---------------------

.. class:: SearchVectorField

If this approach becomes too slow, you can add a ``SearchVectorField`` to your
model. You'll need to keep it populated with triggers, for example, as
described in the `PostgreSQL documentation`_. You can then query the field as
if it were an annotated ``SearchVector``:

.. code-block:: pycon

    >>> Entry.objects.update(search_vector=SearchVector("body_text"))
    >>> Entry.objects.filter(search_vector="cheese")
    [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]

.. _PostgreSQL documentation: https://www.postgresql.org/docs/current/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS

Trigram similarity
==================

Another approach to searching is trigram similarity. A trigram is a group of
three consecutive characters. In addition to the :lookup:`trigram_similar`,
:lookup:`trigram_word_similar`, and :lookup:`trigram_strict_word_similar`
lookups, you can use a couple of other expressions.

To use them, you need to activate the `pg_trgm extension
<https://www.postgresql.org/docs/current/pgtrgm.html>`_ on PostgreSQL. You can
install it using the
:class:`~django.contrib.postgres.operations.TrigramExtension` migration
operation.

``TrigramSimilarity``
---------------------

.. class:: TrigramSimilarity(expression, string, **extra)

Accepts a field name or expression, and a string or expression. Returns the
trigram similarity between the two arguments.

Usage example:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import TrigramSimilarity
    >>> Author.objects.create(name="Katy Stevens")
    >>> Author.objects.create(name="Stephen Keats")
    >>> test = "Katie Stephens"
    >>> Author.objects.annotate(
    ...     similarity=TrigramSimilarity("name", test),
    ... ).filter(
    ...     similarity__gt=0.3
    ... ).order_by("-similarity")
    [<Author: Katy Stevens>, <Author: Stephen Keats>]

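To build some intuition for the numbers involved, here is a pure-Python sketch
of a trigram similarity score. It's a simplified model and not the pg_trgm
implementation: it assumes words are lowercased alphanumeric runs, that each
word is padded with two leading spaces and one trailing space, and that the
score is the number of shared trigrams divided by the number of distinct
trigrams across both strings. The function names are made up for this example.

```python
import re


def trigrams(text):
    """Return the set of trigrams for ``text`` under the assumptions above."""
    result = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i : i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    first, second = trigrams(a), trigrams(b)
    if not first or not second:
        return 0.0
    return len(first & second) / len(first | second)


for name in ("Katy Stevens", "Stephen Keats"):
    print(name, round(similarity(name, "Katie Stephens"), 3))
```

Under these assumptions both names score above ``0.3`` against
``"Katie Stephens"``, and ``"Katy Stevens"`` scores higher, which is consistent
with the ordering in the usage example.
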
``TrigramWordSimilarity``
-------------------------

.. class:: TrigramWordSimilarity(string, expression, **extra)

Accepts a string or expression, and a field name or expression. Returns the
trigram word similarity between the two arguments.

Usage example:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import TrigramWordSimilarity
    >>> Author.objects.create(name="Katy Stevens")
    >>> Author.objects.create(name="Stephen Keats")
    >>> test = "Kat"
    >>> Author.objects.annotate(
    ...     similarity=TrigramWordSimilarity(test, "name"),
    ... ).filter(
    ...     similarity__gt=0.3
    ... ).order_by("-similarity")
    [<Author: Katy Stevens>]

``TrigramStrictWordSimilarity``
-------------------------------

.. class:: TrigramStrictWordSimilarity(string, expression, **extra)

.. versionadded:: 4.2

Accepts a string or expression, and a field name or expression. Returns the
trigram strict word similarity between the two arguments. Similar to
:class:`TrigramWordSimilarity() <TrigramWordSimilarity>`, except that it forces
extent boundaries to match word boundaries.

``TrigramDistance``
-------------------

.. class:: TrigramDistance(expression, string, **extra)

Accepts a field name or expression, and a string or expression. Returns the
trigram distance between the two arguments.

Usage example:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import TrigramDistance
    >>> Author.objects.create(name="Katy Stevens")
    >>> Author.objects.create(name="Stephen Keats")
    >>> test = "Katie Stephens"
    >>> Author.objects.annotate(
    ...     distance=TrigramDistance("name", test),
    ... ).filter(
    ...     distance__lte=0.7
    ... ).order_by("distance")
    [<Author: Katy Stevens>, <Author: Stephen Keats>]

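In pg_trgm, the distance between two strings is one minus their similarity, so
a smaller distance means a closer match. The sketch below illustrates that
relationship with a simplified, pure-Python model. It isn't the pg_trgm
implementation: it assumes lowercased alphanumeric words, each padded with two
leading spaces and one trailing space, and ``trigram_distance()`` is a
hypothetical name.

```python
import re


def trigram_distance(a, b):
    """Return 1 - similarity, using a simplified model of pg_trgm trigrams."""

    def trigrams(text):
        # Assumption: lowercase alphanumeric words, padded with two leading
        # spaces and one trailing space before being cut into triples.
        padded_words = (f"  {word} " for word in re.findall(r"\w+", text.lower()))
        return {p[i : i + 3] for p in padded_words for i in range(len(p) - 2)}

    first, second = trigrams(a), trigrams(b)
    if not first or not second:
        return 1.0
    # Similarity is shared trigrams over all distinct trigrams.
    return 1 - len(first & second) / len(first | second)


for name in ("Katy Stevens", "Stephen Keats"):
    print(name, round(trigram_distance(name, "Katie Stephens"), 3))
```

Under these assumptions both names fall within the ``0.7`` threshold, with
``"Katy Stevens"`` the closer of the two, which matches the ascending order in
the usage example.
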
``TrigramWordDistance``
-----------------------

.. class:: TrigramWordDistance(string, expression, **extra)

Accepts a string or expression, and a field name or expression. Returns the
trigram word distance between the two arguments.

Usage example:

.. code-block:: pycon

    >>> from django.contrib.postgres.search import TrigramWordDistance
    >>> Author.objects.create(name="Katy Stevens")
    >>> Author.objects.create(name="Stephen Keats")
    >>> test = "Kat"
    >>> Author.objects.annotate(
    ...     distance=TrigramWordDistance(test, "name"),
    ... ).filter(
    ...     distance__lte=0.7
    ... ).order_by("distance")
    [<Author: Katy Stevens>]

``TrigramStrictWordDistance``
-----------------------------

.. class:: TrigramStrictWordDistance(string, expression, **extra)

.. versionadded:: 4.2

Accepts a string or expression, and a field name or expression. Returns the
trigram strict word distance between the two arguments.