== Model types ==
[[File:Information-Retrieval-Models.png|thumb|500px|Categorization of IR-models (translated from the [[:de:Informationsrückgewinnung#Klassifikation von Modellen zur Repräsentation natürlichsprachlicher Dokumente|German entry]], original source [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id= Dominik Kuropka])]]

To retrieve relevant documents effectively, IR strategies typically transform the documents into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation. The picture on the right illustrates the relationship of some common models; in it, the models are categorized according to two dimensions: the mathematical basis and the properties of the model.

=== First dimension: mathematical basis ===
* ''Set-theoretic'' models represent documents as [[Set (mathematics)|set]]s of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
** [[Standard Boolean model]]
** [[Extended Boolean model]]
** [[Fuzzy retrieval]]
* ''Algebraic models'' usually represent documents and queries as vectors, matrices, or tuples. The similarity of the query vector and the document vector is expressed as a scalar value; a minimal sketch of this idea appears after the next subsection.
** [[Vector space model]]
** [[Generalized vector space model]]
** [[Topic-based vector space model|(Enhanced) Topic-based Vector Space Model]]
** [[Extended Boolean model]]
** [[Latent semantic indexing]] a.k.a. [[latent semantic analysis]]
* ''Probabilistic models'' treat the process of document retrieval as probabilistic inference. Similarities are computed as the probability that a document is relevant to a given query. Probabilistic theorems such as [[Bayes' theorem]] are often used in these models.
** [[Binary Independence Model]]
** [[Probabilistic relevance model]], on which the [[Probabilistic relevance model (BM25)|Okapi (BM25)]] relevance function is based
** [[Uncertain inference]]
** [[Language model]]s
** [[Divergence-from-randomness model]]
** [[Latent Dirichlet allocation]]
* ''Feature-based retrieval models'' view documents as vectors of values of ''feature functions'' (or just ''features'') and seek the best way to combine these features into a single relevance score, typically by [[learning to rank]] methods. Feature functions are arbitrary functions of document and query, and as such can easily incorporate almost any other retrieval model as just another feature.

=== Second dimension: properties of the model ===
* ''Models without term interdependencies'' treat different terms/words as independent. This is usually represented in vector space models by the [[orthogonality]] assumption of term vectors, and in probabilistic models by an [[Independence (mathematical logic)|independence]] assumption for term variables.
* ''Models with immanent term interdependencies'' allow a representation of interdependencies between terms. However, the degree of interdependency between two terms is defined by the model itself; it is usually derived directly or indirectly (e.g. by [[dimension reduction|dimensional reduction]]) from the [[co-occurrence]] of those terms in the whole set of documents.
* ''Models with transcendent term interdependencies'' allow a representation of interdependencies between terms, but do not specify how the interdependency between two terms is defined. They rely on an external source (for example, a human or a sophisticated algorithm) for the degree of interdependency between two terms.
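The following is a minimal, illustrative sketch of the vector space idea in Python. The toy corpus, the raw TF-IDF weighting, and the sparse-dictionary representation are assumptions chosen for brevity, not the behaviour of any particular IR system; storing each term as an independent dictionary key mirrors the orthogonality assumption of models without term interdependencies.

<syntaxhighlight lang="python">
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    "information retrieval finds relevant documents".split(),
    "the vector space model represents documents as vectors".split(),
    "probabilistic models estimate the probability of relevance".split(),
]

n = len(docs)
df = Counter(t for d in docs for t in set(d))   # document frequency per term
idf = {t: math.log(n / df[t]) for t in df}      # inverse document frequency

def vectorize(tokens):
    """TF-IDF vector as a sparse dict; terms unseen in the corpus weigh zero."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Scalar similarity of two sparse vectors, as in the vector space model."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Rank the toy documents against a query by their scalar similarity.
query = vectorize("vector model of documents".split())
for d in docs:
    print(f"{cosine(query, vectorize(d)):.3f}", " ".join(d))
</syntaxhighlight>

Ranking documents by this scalar reduces retrieval to a nearest-neighbour search in term space; probabilistic models replace the cosine with an estimated probability of relevance.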
=== Third dimension: representational approach ===
In addition to these theoretical distinctions, modern information retrieval models are also categorized by how queries and documents are represented and compared, using a practical classification that distinguishes between sparse, dense, and hybrid models.<ref name=":02">{{cite arXiv | eprint=2107.09226 | last1=Kim | first1=Dohyun | last2=Zhao | first2=Lina | last3=Chung | first3=Eric | last4=Park | first4=Eun-Jae | title=Pressure-robust staggered DG methods for the Navier-Stokes equations on general meshes | date=2021 | class=math.NA }}</ref>
* '''''Sparse''''' models use interpretable, term-based representations and typically rely on inverted index structures. Classical methods such as TF-IDF and BM25 fall under this category, along with more recent learned sparse models that integrate neural architectures while retaining sparsity.<ref name=":6">{{cite arXiv | eprint=2104.08663 | last1=Thakur | first1=Nandan | last2=Reimers | first2=Nils | last3=Rücklé | first3=Andreas | last4=Srivastava | first4=Abhishek | last5=Gurevych | first5=Iryna | title=BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | date=2021 | class=cs.IR }}</ref>
* '''''Dense''''' models represent queries and documents as continuous vectors produced by deep learning models, typically transformer-based encoders. These models enable semantic similarity matching beyond exact term overlap and are used in tasks such as semantic search and question answering.<ref>{{Cite journal |last1=Lau |first1=Jey Han |last2=Armendariz |first2=Carlos |last3=Lappin |first3=Shalom |last4=Purver |first4=Matthew |last5=Shu |first5=Chang |date=2020 |editor-last=Johnson |editor-first=Mark |editor2-last=Roark |editor2-first=Brian |editor3-last=Nenkova |editor3-first=Ani |title=How Furiously Can Colorless Green Ideas Sleep? Sentence Acceptability in Context |url=https://aclanthology.org/2020.tacl-1.20/ |journal=Transactions of the Association for Computational Linguistics |volume=8 |pages=296–310 |doi=10.1162/tacl_a_00315}}</ref>
* '''''Hybrid''''' models aim to combine the strengths of both approaches, integrating lexical (token) and semantic signals through score fusion, late interaction, or multi-stage ranking pipelines; a toy score-fusion sketch follows this list.<ref>{{cite arXiv | eprint=2109.10739 | last1=Arabzadeh | first1=Negar | last2=Yan | first2=Xinyi | last3=Clarke | first3=Charles L. A. | title=Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection | date=2021 | class=cs.IR }}</ref>
This classification has become increasingly common in both academic research and real-world applications, and it is widely adopted in evaluation benchmarks for information retrieval models.<ref name=":12">{{cite arXiv | eprint=2010.06467 | last1=Lin | first1=Jimmy | last2=Nogueira | first2=Rodrigo | last3=Yates | first3=Andrew | title=Pretrained Transformers for Text Ranking: BERT and Beyond | date=2020 | class=cs.IR }}</ref><ref name=":6" />
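Below is a toy sketch of hybrid score fusion. The three-dimensional "embeddings" are hand-made stand-ins for a real transformer encoder, the overlap count stands in for a BM25 score, and the fusion weight <code>alpha</code> and min-max normalization are illustrative assumptions rather than a standard recipe.

<syntaxhighlight lang="python">
import math

def sparse_score(query_terms, doc_terms):
    """Lexical evidence: exact-term overlap count (a stand-in for BM25)."""
    return float(sum(1 for t in set(query_terms) if t in doc_terms))

def dense_score(q_vec, d_vec):
    """Semantic evidence: cosine similarity of dense vectors
    (a stand-in for encoder embeddings)."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    nq = math.sqrt(sum(a * a for a in q_vec))
    nd = math.sqrt(sum(b * b for b in d_vec))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_scores(query_terms, q_vec, collection, alpha=0.5):
    """Weighted score fusion: normalize each signal to [0, 1], then mix by alpha."""
    sparse = [sparse_score(query_terms, terms) for terms, _ in collection]
    dense = [dense_score(q_vec, vec) for _, vec in collection]

    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    return [alpha * s + (1 - alpha) * d for s, d in zip(norm(sparse), norm(dense))]

# Each document: (tokens for the sparse signal, toy embedding for the dense signal).
collection = [
    ({"neural", "ranking", "models"}, [0.9, 0.1, 0.3]),
    ({"bm25", "inverted", "index"},   [0.2, 0.8, 0.5]),
    ({"semantic", "search"},          [0.8, 0.2, 0.4]),
]
print(hybrid_scores({"neural", "search"}, [0.85, 0.15, 0.35], collection))
</syntaxhighlight>

In practice the two signals come from separate index structures (an inverted index for the sparse side, an approximate nearest-neighbour index for the dense side), and fusion can also operate on ranks, as in reciprocal rank fusion, rather than on raw scores.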