Alessia Gaspodini
·
December 27, 2023

Large language models for text representation: semantic search and topic maps

LLM-powered semantic search and topic mapping mark a significant improvement over traditional methods of information retrieval and text mining, offering more contextually relevant results. As the technology continues to evolve, it holds the potential to revolutionize how we access and interpret vast amounts of textual data.

Dr. Matthias Plaue, Data Scientist at MAPEGY, recently published a new article focused on the integration of large language models (LLMs), such as BERT and GPT, into the realms of semantic search and topic mapping. This contribution is part of MAPEGY's involvement in the KI4BoardNET research project, funded by the Federal Ministry of Education and Research in Germany. The article delves into the potential of LLMs to revolutionize the way we interact with and extract meaning from vast textual data repositories.

The blog post begins by acknowledging the shortcomings of traditional lexical search methods, which rely primarily on simple string matching and fail to capture the nuances of word meanings in different contexts. To illustrate this, Dr. Plaue presents an example involving the word "table", which a lexical search treats as the same term in two distinct contexts: as a spreadsheet in one instance and as a piece of furniture in another. This limitation motivates semantic search, which allows users to input queries like "table furniture" and retrieve relevant results even when matching documents do not contain the query terms verbatim.
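
As a rough illustration of that limitation (a toy sketch, not taken from the article), the snippet below applies naive keyword matching to two invented "table" sentences and a query like the one described above:

```python
# Toy illustration (not from the article): naive lexical matching treats
# "table" identically in both documents and cannot use the context word
# "furniture" to tell them apart.
documents = [
    "Export the table to a spreadsheet for further analysis.",  # data table
    "The oak table seats six people comfortably.",              # furniture
]

query = "table furniture"

for doc in documents:
    # A document "matches" only if it contains every query word verbatim.
    matches = all(word in doc.lower() for word in query.lower().split())
    print(matches, "-", doc)

# Neither document contains the word "furniture", so the lexical search
# returns nothing, even though the second document is clearly relevant.
```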

The article then turns to the concept of text embeddings and how LLMs like BERT are employed to numerically represent textual documents. These embeddings, consisting of hundreds of numerical entries, are derived from hidden states in neural networks trained to predict words from their context, an example of self-supervised learning. By comparing these embedding vectors using cosine similarity, it becomes possible to assess the semantic similarity between texts, allowing for more sophisticated semantic search engines.
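
The mechanics can be sketched in a few lines. The example below assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model, neither of which is named in the article; it embeds the two toy "table" sentences and the query and compares them with cosine similarity:

```python
# Sketch of embedding-based similarity, assuming the sentence-transformers
# library and the all-MiniLM-L6-v2 model (384-dimensional embeddings).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Export the table to a spreadsheet for further analysis.",
    "The oak table seats six people comfortably.",
]
query = "table furniture"

doc_embeddings = model.encode(documents)   # one vector per document
query_embedding = model.encode(query)      # one vector for the query

# Cosine similarity between the query vector and each document vector.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for score, doc in zip(scores, documents):
    print(f"{score:.3f}  {doc}")

# The furniture sentence should score noticeably higher, even though it
# never mentions the word "furniture".
```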

To implement semantic search effectively, both user queries and the entire database of documents must be embedded. The article mentions vector databases such as Milvus and the pgvector extension for PostgreSQL, which are designed for exactly this task. The benefits of semantic search are then explored by comparing its performance with traditional lexical search methods, including the classic term frequency-inverse document frequency (tf-idf) approach. The analysis shows that more recent LLMs outperform both fastText and tf-idf while offering a reasonable balance between accuracy and runtime performance.
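
In its simplest form, such a search reduces to embedding the corpus once, embedding each query at search time, and ranking by similarity. The brute-force sketch below (again assuming sentence-transformers and a hypothetical toy corpus) shows that pipeline; in production, a vector database like Milvus or pgvector would store the embeddings and replace the full scan with an approximate nearest-neighbour index:

```python
# Brute-force semantic retrieval sketch (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

corpus = [
    "The oak table seats six people comfortably.",
    "Export the table to a spreadsheet for further analysis.",
    "Our new sofa matches the dining chairs.",
    "Quarterly revenue figures are summarised in the report.",
]

# Embed the whole corpus once, offline; embed each query at search time.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode("table furniture", normalize_embeddings=True)

# With normalised vectors, cosine similarity reduces to a dot product.
scores = corpus_emb @ query_emb

# Return the top-k documents by similarity.
top_k = np.argsort(-scores)[:2]
for idx in top_k:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```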

The author also introduces the concept of hybrid search, a combination of semantic and lexical search methods that is likely to become the industry standard. An intriguing challenge in semantic search is the highlighting of relevant keywords in matched documents, given the absence of explicit keyword matching. The article discusses techniques for tokenizing documents and highlighting tokens with high similarity to the query.
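
One simple way to approximate such highlighting, sketched below under the same sentence-transformers assumption (the article's own technique may well differ), is to embed each token of a matched document separately and mark the tokens whose similarity to the query exceeds a threshold:

```python
# Naive highlighting sketch: embed each token of a matched document and
# mark tokens whose embedding is sufficiently similar to the query embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

document = "The oak table seats six people comfortably."
query = "table furniture"
threshold = 0.4  # arbitrary cut-off chosen for illustration

tokens = document.split()
token_emb = model.encode(tokens)
query_emb = model.encode(query)

similarities = util.cos_sim(query_emb, token_emb)[0]

# Wrap sufficiently similar tokens in ** markers to emulate highlighting.
highlighted = " ".join(
    f"**{tok}**" if sim >= threshold else tok
    for tok, sim in zip(tokens, similarities)
)
print(highlighted)
```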

The blog post then shifts its focus to topic maps, which are valuable tools for text clustering and visualization. By vectorizing terms together with their descriptions, the embeddings are enriched with additional contextual information. The article showcases the use of dimensionality reduction algorithms like UMAP (Uniform Manifold Approximation and Projection) to map the high-dimensional embeddings into two-dimensional space for visualization.
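
A minimal version of that workflow, assuming the umap-learn and matplotlib packages and an invented set of terms and descriptions (not the article's data or exact setup), could look like this:

```python
# Topic-map sketch: embed terms together with a short description for extra
# context, then project the high-dimensional vectors to 2-D with UMAP.
import umap
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

terms = {
    "neural network": "a machine learning model loosely inspired by the brain",
    "transformer": "a deep learning architecture built on attention",
    "gradient descent": "an optimization method for training models",
    "overfitting": "when a model memorizes training data instead of generalizing",
    "photosynthesis": "the process by which plants convert light into energy",
    "chlorophyll": "the green pigment that absorbs light in plants",
    "stomata": "small pores on leaves that regulate gas exchange",
    "xylem": "plant tissue that transports water from the roots",
}

# Embedding "term: description" gives each vector additional context.
embeddings = model.encode([f"{t}: {d}" for t, d in terms.items()])

# Reduce to two dimensions for plotting; n_neighbors would normally be tuned.
coords = umap.UMAP(n_components=2, n_neighbors=3, random_state=42).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), term in zip(coords, terms):
    plt.annotate(term, (x, y))
plt.title("Toy topic map of eight terms")
plt.show()
```

With enough terms, semantically related entries (here, machine learning versus plant biology) end up as visually separable clusters in the 2-D map.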

In conclusion, the integration of LLMs like BERT and GPT into search algorithms represents a significant leap forward from traditional lexical search methods, enabling more nuanced and contextually relevant information retrieval. However, this advancement comes with challenges, including computational resource requirements and the interpretability of results. Nevertheless, LLM-powered semantic search and topic mapping have the potential to transform how we access and interpret large volumes of textual data, offering a more sophisticated and meaningful approach to information retrieval.

To read the full article authored by Dr. Matthias Plaue, please follow this link: Full article here.
