Tokenization and Ranking Mechanics
The SearchIndex class in app.services.search_service provides an in-memory full-text search implementation designed for small datasets. It uses an inverted index structure to map text tokens to bookmark IDs, enabling efficient retrieval and relevance-based ranking without the overhead of an external search engine like Elasticsearch or Typesense.
Tokenization Strategy
The search process begins with tokenization, which converts raw text into a normalized list of searchable terms. This logic is encapsulated in the _tokenize method and is applied both when indexing a bookmark and when processing a search query.
Normalization and Regex Splitting
The input text is lowercased first and then matched against a simple regular expression, [a-z0-9]+. Together, these two steps make search case-insensitive and discard punctuation.
# app/services/search_service.py
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def _tokenize(self, text: str) -> List[str]:
    """Split text into lowercase tokens, removing stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]
Stop-Word Removal
To improve search quality and reduce index size, the system filters out common English "stop words" that carry little semantic meaning. These are defined as a module-level constant:
_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "is", "it"}
If a search query consists entirely of stop words (e.g., "the and"), the _tokenize method returns an empty list, and the search method immediately returns no results.
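The tokenization pipeline can be sketched as a standalone function; the module-level names `_TOKEN_RE` and `_STOP_WORDS` mirror the source, while the free-function form (rather than a method) is illustrative:

```python
import re
from typing import List, Set

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on",
                         "at", "to", "for", "is", "it"}
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text: str) -> List[str]:
    """Lowercase, extract alphanumeric runs, and drop stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]

print(tokenize("The Python-3.12 Tutorial!"))  # ['python', '3', '12', 'tutorial']
print(tokenize("the and"))                    # [] -> search short-circuits
```

Note that punctuation acts as a token boundary, so "Python-3.12" splits into three tokens rather than being dropped.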
Search Mechanics and Query Execution
The SearchIndex uses an AND-logic approach for multi-token queries. For a bookmark to be considered a match, it must contain all tokens present in the search query.
Inverted Index Lookup
The core of the search is a dictionary mapping tokens to sets of bookmark IDs (Dict[str, Set[str]]). When a query is executed, the system retrieves the set of IDs for each token and performs a set intersection.
# app/services/search_service.py:71-73
candidate_ids: Set[str] = self._index.get(tokens[0], set()).copy()
for token in tokens[1:]:
    candidate_ids &= self._index.get(token, set())
This design choice prioritizes precision over recall. If a user searches for "python tutorial", only bookmarks containing both "python" and "tutorial" in their title or description will be returned.
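The intersection logic can be demonstrated against a toy inverted index (the bookmark IDs here are hypothetical, and the free-standing `lookup` function is a sketch of the method shown above):

```python
from typing import Dict, List, Set

# Toy inverted index: token -> set of bookmark IDs (IDs are made up).
index: Dict[str, Set[str]] = {
    "python": {"b1", "b2", "b3"},
    "tutorial": {"b2", "b3", "b4"},
    "rust": {"b5"},
}

def lookup(tokens: List[str]) -> Set[str]:
    """Intersect posting sets; every token must match (AND logic)."""
    if not tokens:
        return set()
    candidates = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        candidates &= index.get(token, set())
    return candidates

print(sorted(lookup(["python", "tutorial"])))  # ['b2', 'b3']
print(lookup(["python", "rust"]))              # set()
```

The `.copy()` on the first posting set matters: without it, the in-place `&=` would mutate the set stored in the index itself.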
Relevance Ranking
Once the candidate bookmarks are identified, they are ranked by relevance before being returned to the user. The ranking algorithm is implemented in the _rank_results static method.
Frequency-Based Scoring
The score for a bookmark is calculated by counting the total number of occurrences of each query token within the concatenated title and description.
# app/services/search_service.py:108-113
@staticmethod
def _rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Rank results by number of token occurrences in title + description."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)
This approach provides a basic level of relevance: a bookmark that mentions a search term multiple times will rank higher than one that mentions it only once. However, it does not distinguish between a match in the title versus a match in the description, treating both fields with equal weight.
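Applied to two hypothetical bookmarks, a repeated term outranks a single mention regardless of which field it appears in (the `Bookmark` dataclass below is a minimal stand-in for the app's model):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bookmark:  # minimal stand-in for the app's Bookmark model
    id: str
    title: str
    description: str

def rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Sort by total token occurrences in title + description, descending."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)

a = Bookmark("a", "Python basics", "A short intro.")
b = Bookmark("b", "Notes", "Python tips, Python tricks, Python gotchas.")
ranked = rank_results([a, b], ["python"])
print([bm.id for bm in ranked])  # ['b', 'a'] — three mentions beat one
```

One subtlety of this scoring style: str.count matches substrings, so the token "python" would also count occurrences inside a word like "pythonic". Because sorted is stable, bookmarks with equal scores keep their original relative order.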
Design Tradeoffs and Constraints
The implementation of SearchIndex reflects a "simplicity first" design philosophy, suitable for the project's current scale but with clear constraints for future growth.
- In-Memory Lifecycle: The index is entirely in-memory and is rebuilt from the repository every time the SearchIndex is initialized (typically on application startup via the BookmarkService). While this ensures fast lookups, it increases startup time as the number of bookmarks grows.
- Incremental Updates: To maintain consistency without a full rebuild, the index is updated incrementally whenever a bookmark is added or modified through index_bookmark. This method first removes the old entries for the bookmark ID before re-indexing the new content.
- Memory Usage: Because every unique token in every bookmark is stored in the _index dictionary, memory usage scales linearly with the total volume of text in the database.
- Simple Scoring: The ranking algorithm lacks advanced features like TF-IDF (Term Frequency-Inverse Document Frequency) or field boosting, which would be necessary for more sophisticated search requirements.
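The remove-then-reindex update pattern described above can be sketched against a stripped-down index; the method names echo the source, but the surrounding class and its internals are a simplified assumption, not the actual implementation:

```python
from typing import Dict, List, Set

class TinyIndex:
    """Simplified sketch of the remove-then-reindex update pattern."""

    def __init__(self) -> None:
        self._index: Dict[str, Set[str]] = {}

    def remove_bookmark(self, bookmark_id: str) -> None:
        # Drop the ID from every posting set; prune now-empty entries.
        for token in list(self._index):
            self._index[token].discard(bookmark_id)
            if not self._index[token]:
                del self._index[token]

    def index_bookmark(self, bookmark_id: str, tokens: List[str]) -> None:
        # Remove stale entries first so edits never leave orphaned tokens.
        self.remove_bookmark(bookmark_id)
        for token in tokens:
            self._index.setdefault(token, set()).add(bookmark_id)

idx = TinyIndex()
idx.index_bookmark("b1", ["python", "tutorial"])
idx.index_bookmark("b1", ["rust", "tutorial"])  # edit: "python" is dropped
print(sorted(idx._index))  # ['rust', 'tutorial']
```

Removing before re-indexing is what keeps the index consistent on edits: without that step, a bookmark whose title no longer contains "python" would still match queries for it.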