== History ==
{{Rquote|right|there is ... a machine called the Univac ... whereby letters and figures are coded as a pattern of magnetic spots on a long steel tape. By this means the text of a document, preceded by its subject code symbol, can be recorded ... the machine ... automatically selects and types out those references which have been coded in any desired way at a rate of 120 words a minute|J. E. Holmstrom, 1948}}

The idea of using computers to search for relevant pieces of information was popularized in the article ''[[As We May Think]]'' by [[Vannevar Bush]] in 1945.<ref name="Singhal2001">{{cite journal |last=Singhal |first=Amit |title=Modern Information Retrieval: A Brief Overview |journal=Bulletin of the IEEE Computer Society Technical Committee on Data Engineering |volume=24 |issue=4 |pages=35–43 |year=2001 |url=http://singhal.info/ieee2001.pdf }}</ref> It would appear that Bush was inspired by patents for a 'statistical machine' – filed by [[Emanuel Goldberg]] in the 1920s and 1930s – that searched for documents stored on film.<ref name="Sanderson2012">{{cite journal |author=Mark Sanderson & W. Bruce Croft |title=The History of Information Retrieval Research |journal=Proceedings of the IEEE |volume=100 |pages=1444–1451 |year=2012 |doi=10.1109/jproc.2012.2189916 |doi-access=free }}</ref> The first description of a computer searching for information was given by Holmstrom in 1948,<ref name="Holmstrom1948">{{cite journal |author=JE Holmstrom |title=Section III. Opening Plenary Session |journal=The Royal Society Scientific Information Conference, 21 June-2 July 1948: Report and Papers Submitted |pages=85 |year=1948 |url=https://books.google.com/books?id=M34lAAAAMAAJ&q=univac}}</ref> in a paper that contains an early mention of the [[Univac]] computer.

Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy ''[[Desk Set]]''. In the 1960s, the first large information retrieval research group was formed by [[Gerard Salton]] at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small [[text corpora]] such as the Cranfield collection (several thousand documents).<ref name="Singhal2001" /> Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.

In 1992, the US Department of Defense, along with the [[National Institute of Standards and Technology]] (NIST), cosponsored the [[Text Retrieval Conference]] (TREC) as part of the TIPSTER text program. The aim was to support the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection. This catalyzed research on methods that [[scalability|scale]] to huge corpora. The introduction of [[web search engine]]s boosted the need for very large scale retrieval systems even further.

By the late 1990s, the rise of the World Wide Web fundamentally transformed information retrieval. While early search engines such as [[AltaVista]] (1995) and [[Yahoo! Inc. (1995–2017)|Yahoo!]] (1994) offered keyword-based retrieval, they were limited in scale and ranking refinement.
The breakthrough came in 1998 with the founding of [[Google]], which introduced the [[PageRank]] algorithm,<ref name=":2">{{Cite web |title=The Anatomy of a Search Engine |url=http://infolab.stanford.edu/~backrub/google.html |access-date=2025-04-09 |website=infolab.stanford.edu}}</ref> using the web's hyperlink structure to assess page importance and improve relevance ranking (a minimal sketch of the underlying iteration appears below).

During the 2000s, web search systems evolved rapidly with the integration of machine learning techniques. These systems began to incorporate user behavior data (e.g., click-through logs), query reformulation, and content-based signals to improve search accuracy and personalization. In 2009, [[Microsoft]] launched [[Microsoft Bing|Bing]], introducing features that would later incorporate [[Semantic Web|semantic]] web technologies through the development of its Satori knowledge base. Academic analyses<ref name=":3">{{Cite journal |last1=Uyar |first1=Ahmet |last2=Aliyu |first2=Farouk Musa |date=2015-01-01 |title=Evaluating search features of Google Knowledge Graph and Bing Satori: Entity types, list searches and query interfaces |url=https://www.emerald.com/insight/content/doi/10.1108/oir-10-2014-0257/full/html |journal=Online Information Review |volume=39 |issue=2 |pages=197–213 |doi=10.1108/OIR-10-2014-0257 |issn=1468-4527}}</ref> have highlighted Bing's semantic capabilities, including structured data use and entity recognition, as part of a broader industry shift toward improving search relevance and understanding user intent through natural language processing.

A major leap occurred in 2018, when Google deployed [[BERT (language model)|BERT]] ('''B'''idirectional '''E'''ncoder '''R'''epresentations from '''T'''ransformers) to better understand the contextual meaning of queries and documents. This marked one of the first times deep neural language models were used at scale in real-world retrieval systems.<ref name=":4">{{cite arXiv | eprint=1810.04805 | last1=Devlin | first1=Jacob | last2=Chang | first2=Ming-Wei | last3=Lee | first3=Kenton | last4=Toutanova | first4=Kristina | title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | date=2018 | class=cs.CL }}</ref> BERT's bidirectional training enabled a more refined comprehension of word relationships in context, improving the handling of natural language queries. Because of its success, transformer-based models gained traction in academic research and commercial search applications.<ref>{{Cite journal |last1=Gardazi |first1=Nadia Mushtaq |last2=Daud |first2=Ali |last3=Malik |first3=Muhammad Kamran |last4=Bukhari |first4=Amal |last5=Alsahfi |first5=Tariq |last6=Alshemaimri |first6=Bader |date=2025-03-15 |title=BERT applications in natural language processing: a review |url=https://link.springer.com/article/10.1007/s10462-025-11162-5 |journal=Artificial Intelligence Review |language=en |volume=58 |issue=6 |pages=166 |doi=10.1007/s10462-025-11162-5 |issn=1573-7462 |doi-access=free }}</ref> Simultaneously, the research community began exploring neural ranking models that outperformed traditional lexical-based methods.
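The core of PageRank is a power iteration that repeatedly redistributes rank mass along hyperlinks. The following is a minimal sketch of that computation, not Google's production implementation; the function name, toy link matrix, and tolerance are illustrative choices, while the damping factor 0.85 is the value suggested in the original paper.

<syntaxhighlight lang="python">
import numpy as np

def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Rank pages by the stationary distribution of a 'random surfer'.

    links[i, j] == 1 means page j links to page i, so each column
    holds the out-links of one page.
    """
    n = links.shape[0]
    out_degree = links.sum(axis=0)

    # Column-stochastic transition matrix; a dangling page (one with
    # no out-links) is treated as linking to every page uniformly.
    M = np.zeros((n, n))
    for j in range(n):
        M[:, j] = 1.0 / n if out_degree[j] == 0 else links[:, j] / out_degree[j]

    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability `damping` the surfer follows a link,
        # otherwise jumps to a page chosen uniformly at random.
        new_rank = (1 - damping) / n + damping * M @ rank
        if np.allclose(new_rank, rank, atol=tol):
            return new_rank
        rank = new_rank
    return rank

# Toy web of four pages: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {2}.
links = np.array([[0, 0, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 0, 0]], dtype=float)
print(pagerank(links))  # page 2, with the most in-links, scores highest
</syntaxhighlight>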
Long-standing benchmarks such as the '''T'''ext '''RE'''trieval '''C'''onference ([[Text Retrieval Conference|TREC]]), initiated in 1992, and more recent evaluation frameworks such as MS MARCO ('''MA'''chine '''R'''eading '''CO'''mprehension, 2016)<ref name=":5">{{cite arXiv | eprint=1611.09268 | last1=Bajaj | first1=Payal | last2=Campos | first2=Daniel | last3=Craswell | first3=Nick | last4=Deng | first4=Li | last5=Gao | first5=Jianfeng | last6=Liu | first6=Xiaodong | last7=Majumder | first7=Rangan | last8=McNamara | first8=Andrew | last9=Mitra | first9=Bhaskar | last10=Nguyen | first10=Tri | last11=Rosenberg | first11=Mir | last12=Song | first12=Xia | last13=Stoica | first13=Alina | last14=Tiwary | first14=Saurabh | last15=Wang | first15=Tong | title=MS MARCO: A Human Generated MAchine Reading COmprehension Dataset | date=2016 | class=cs.CL }}</ref> became central to training and evaluating retrieval systems across multiple tasks and domains. MS MARCO has also been adopted in the TREC Deep Learning Tracks, where it serves as a core dataset for evaluating advances in neural ranking models within a standardized benchmarking environment.<ref>{{Cite journal |last1=Craswell |first1=Nick |last2=Mitra |first2=Bhaskar |last3=Yilmaz |first3=Emine |last4=Rahmani |first4=Hossein A. |last5=Campos |first5=Daniel |last6=Lin |first6=Jimmy |last7=Voorhees |first7=Ellen M. |last8=Soboroff |first8=Ian |date=2024-02-28 |title=Overview of the TREC 2023 Deep Learning Track |url=https://www.microsoft.com/en-us/research/publication/overview-of-the-trec-2023-deep-learning-track/ |language=en-US}}</ref>

As deep learning became integral to information retrieval systems, researchers began to categorize neural approaches into three broad classes: '''sparse''', '''dense''', and '''hybrid''' models. Sparse models, including traditional term-based methods and learned variants such as SPLADE, rely on interpretable representations and inverted indexes to enable efficient exact term matching with added semantic signals.<ref name=":0">{{arxiv|2107.09226}}</ref> Dense models, such as dual-encoder and late-interaction architectures like ColBERT, use continuous vector embeddings to support semantic similarity beyond keyword overlap.<ref name=":8">{{Cite book |last1=Khattab |first1=Omar |last2=Zaharia |first2=Matei |chapter=ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT |date=2020-07-25 |title=Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval |chapter-url=https://dl.acm.org/doi/10.1145/3397271.3401075 |series=SIGIR '20 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=39–48 |doi=10.1145/3397271.3401075 |isbn=978-1-4503-8016-4}}</ref> Hybrid models aim to combine the advantages of both, balancing the lexical (token) precision of sparse methods with the semantic depth of dense models; a common approach is to normalize and mix the two kinds of scores, as in the sketch below. This categorization reflects the trade-offs among scalability, relevance, and efficiency in retrieval systems.<ref name=":1">{{cite arXiv | eprint=2010.06467 | last1=Lin | first1=Jimmy | last2=Nogueira | first2=Rodrigo | last3=Yates | first3=Andrew | title=Pretrained Transformers for Text Ranking: BERT and Beyond | date=2020 | class=cs.IR }}</ref>

As IR systems increasingly rely on deep learning, concerns around bias, fairness, and explainability have also come to the fore. Research now focuses not only on relevance and efficiency, but also on transparency, accountability, and user trust in retrieval algorithms.
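The hybrid class described above can be made concrete with a deliberately simplified sketch that mixes a hand-rolled BM25 sparse score with a dense cosine-similarity score. It assumes the third-party <code>sentence-transformers</code> package; the checkpoint name <code>all-MiniLM-L6-v2</code>, the toy corpus, and the equal weighting are illustrative assumptions rather than a standard recipe.

<syntaxhighlight lang="python">
import math
from collections import Counter

# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Sparse signal: classic BM25 over whitespace-tokenized documents."""
    corpus = [d.lower().split() for d in docs]
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    doc_freq = Counter(term for d in corpus for term in set(d))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def minmax(xs):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + 1e-9) for x in xs]

def hybrid_rank(query, docs, model, alpha=0.5):
    """Mix normalized sparse (BM25) and dense (cosine) scores."""
    sparse = minmax(bm25_scores(query, docs))
    dense = minmax(util.cos_sim(model.encode(query, convert_to_tensor=True),
                                model.encode(docs, convert_to_tensor=True))[0].tolist())
    mixed = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    return sorted(zip(mixed, docs), reverse=True)

docs = ["the cat sat on the mat",
        "dogs are loyal companions",
        "felines often nap in sunny spots"]
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
print(hybrid_rank("cat nap", docs, model))
</syntaxhighlight>

Here the query "cat nap" matches document 1 lexically ("cat") but document 3 only semantically ("felines", "nap"); the sparse and dense components of the mixed score capture these two cases respectively, which is the motivation for hybrid retrieval.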