Search & Indexing

The search and indexing system provides full-text discovery across bookmark titles and descriptions. It is implemented as an in-memory inverted index that supports token-based matching and relevance ranking, integrated directly into the application's service layer.

Inverted Index Architecture

The core of the search functionality resides in the SearchIndex class within app/services/search_service.py. This class maintains an inverted index—a mapping where each unique word (token) points to a set of bookmark IDs containing that word.

class SearchIndex:
    def __init__(self, repository: "BookmarkRepository") -> None:
        self._repo = repository
        self._index: Dict[str, Set[str]] = defaultdict(set)
        self._rebuild()

This design allows for efficient retrieval of matching documents without scanning the entire dataset. When a search is performed, the system looks up the sets of IDs for each query token and performs a set intersection to find bookmarks that contain all requested terms.
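The lookup-and-intersect step can be sketched as follows. This is a minimal illustration, not the application's actual code: the class name `SearchIndexSketch` and the method name `search_ids` are assumptions; only the `_index` shape mirrors the snippet above.

```python
from collections import defaultdict
from typing import Dict, List, Set

class SearchIndexSketch:
    """Illustrative inverted index: token -> set of bookmark IDs."""

    def __init__(self) -> None:
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def search_ids(self, tokens: List[str]) -> Set[str]:
        """Return IDs of bookmarks containing ALL query tokens."""
        if not tokens:
            return set()
        # Start from the first token's posting set, then narrow it
        # down with a set intersection for every remaining token.
        result = set(self._index.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._index.get(token, set())
        return result
```

Because each intersection can only shrink the result, a query short-circuits naturally once any token has an empty posting set.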

Tokenization and Relevance

Text processing is handled by the _tokenize method, which prepares both the indexed content and the user's search queries.

Tokenization Process

The system uses a regular expression ([a-z0-9]+) to extract alphanumeric tokens and converts them to lowercase. To improve search quality, it filters out common "stop words" defined in _STOP_WORDS (e.g., "the", "and", "is").

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "is", "it"}
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def _tokenize(self, text: str) -> List[str]:
    """Split text into lowercase tokens, removing stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]
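Pulled out as a standalone function, the same tokenization logic behaves like this (the free function `tokenize` is introduced here only for illustration):

```python
import re
from typing import List, Set

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on",
                         "at", "to", "for", "is", "it"}
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text: str) -> List[str]:
    """Lowercase, extract alphanumeric runs, drop stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]

# tokenize("The Python 3 Tutorial") -> ["python", "3", "tutorial"]
```

Note that punctuation acts purely as a separator: any character outside `[a-z0-9]` ends the current token, so "C++" tokenizes to just "c".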

Relevance Ranking

Search results are not returned in arbitrary order. The _rank_results method calculates a simple relevance score based on the frequency of query tokens within the bookmark's title and description.

@staticmethod
def _rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Rank results by number of token occurrences in title + description."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)

    return sorted(bookmarks, key=score, reverse=True)
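A small worked example shows the effect of the scorer. The `Bookmark` dataclass here is a hypothetical stand-in for the app's model, carrying only the two fields the ranker reads:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bookmark:
    title: str
    description: str

def rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Same counting-based ranking as the method above, standalone."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)

docs = [
    Bookmark("Python basics", "An intro"),
    Bookmark("Python tips", "Python tricks and Python idioms"),
]
# The second bookmark mentions "python" three times versus once,
# so it sorts to the front for the query token "python".
```

One subtlety: `str.count` counts substrings, not whole words, so the token "run" would also score occurrences inside "running". Ties keep their original order because `sorted` is stable.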

Lifecycle and Synchronization

The BookmarkService in app/services/bookmark_service.py orchestrates the search index's lifecycle, ensuring it remains synchronized with the underlying repository.

Initialization

When the BookmarkService is first initialized (typically at application startup), it bootstraps the SearchIndex. The index performs a full rebuild by fetching up to 10,000 bookmarks from the BookmarkRepository.
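The rebuild step can be sketched as a single pass over the fetched bookmarks. This is a hedged illustration: the real `SearchIndex._rebuild` is a method on the class, and the exact repository call is not shown here, so the sketch takes the fetched list as a plain argument (stop-word filtering is omitted for brevity):

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set

_TOKEN_RE = re.compile(r"[a-z0-9]+")

@dataclass
class Bookmark:  # hypothetical minimal model for illustration
    id: str
    title: str
    description: str

def rebuild(bookmarks: List[Bookmark]) -> Dict[str, Set[str]]:
    """Build the token -> bookmark-ID index from scratch."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for b in bookmarks:
        for token in _TOKEN_RE.findall(f"{b.title} {b.description}".lower()):
            index[token].add(b.id)
    return index
```

Because the rebuild is a full scan, its cost grows linearly with the number of bookmarks, which is why the 10,000-item fetch cap matters at startup.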

Incremental Updates

To maintain accuracy without full rebuilds, the index is updated incrementally during write operations. Both create_bookmark and update_bookmark trigger an indexing event:

# app/services/bookmark_service.py

def create_bookmark(self, data: Dict[str, Any]) -> Tuple[Optional[Bookmark], Optional[str]]:
    # ... validation and save ...
    self._repo.save_bookmark(bookmark)
    self._search.index_bookmark(bookmark)
    return bookmark, None

The index_bookmark method handles updates by first removing any existing entries for the bookmark ID before re-adding the new tokens, preventing duplicate or stale entries.
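The remove-then-re-add behavior might look like the sketch below. The removal strategy shown (scanning every posting set for the ID) is an assumption about the implementation, as is modeling the index as a bare dict passed in by the caller:

```python
from typing import Dict, Set

def index_bookmark(index: Dict[str, Set[str]], bookmark_id: str,
                   tokens: Set[str]) -> None:
    """Re-index one bookmark: drop stale postings, then add fresh ones."""
    # 1. Remove the ID from every token's posting set it appears in.
    for ids in index.values():
        ids.discard(bookmark_id)
    # 2. Drop tokens whose posting sets are now empty (tidy-up).
    for token in [t for t, ids in index.items() if not ids]:
        del index[token]
    # 3. Add the bookmark's current tokens.
    for token in tokens:
        index.setdefault(token, set()).add(bookmark_id)
```

Removing before re-adding is what guarantees that tokens deleted from a bookmark's title or description stop matching it.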

Design Tradeoffs and Constraints

The implementation makes several specific design choices that impact its behavior and scalability:

  • In-Memory Storage: The index is entirely volatile. While this provides extremely fast lookups, it requires a full rebuild from the repository on every application restart and is limited by the available system memory.
  • Strict AND Logic: The search implementation uses set intersection (&=) for multi-token queries. This means a bookmark must contain all tokens in the query to be returned. If a user searches for "python tutorial", only bookmarks containing both words will match.
  • Soft-Delete Persistence: In the current implementation of BookmarkService.delete_bookmark, bookmarks are "trashed" (a status change) but not explicitly removed from the SearchIndex. Because the index rebuild process in SearchIndex._rebuild fetches all bookmarks regardless of status, trashed and archived bookmarks will appear in search results if they match the query.
  • Tokenization Limits: The regex-based tokenizer is optimized for English alphanumeric text. It does not support stemming (e.g., matching "running" when searching for "run") or fuzzy matching (handling typos).
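The stemming limitation in the last point is easy to demonstrate: an exact-token index built from the phrase "Running in production" simply never contains the token "run", so that query cannot match it.

```python
import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")

# Exact tokens only: no stemming, no fuzzy matching.
doc_tokens = set(_TOKEN_RE.findall("Running in production".lower()))
# doc_tokens == {"running", "in", "production"}; the query token
# "run" is absent, so a search for "run" finds nothing here.
```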