Search & Indexing
The search and indexing system provides full-text discovery across bookmark titles and descriptions. It is implemented as an in-memory inverted index that supports token-based matching and relevance ranking, integrated directly into the application's service layer.
Inverted Index Architecture
The core of the search functionality resides in the SearchIndex class within app/services/search_service.py. This class maintains an inverted index—a mapping where each unique word (token) points to a set of bookmark IDs containing that word.
```python
from collections import defaultdict
from typing import Dict, Set

class SearchIndex:
    def __init__(self, repository: "BookmarkRepository") -> None:
        self._repo = repository
        self._index: Dict[str, Set[str]] = defaultdict(set)
        self._rebuild()
```
This design allows for efficient retrieval of matching documents without scanning the entire dataset. When a search is performed, the system looks up the sets of IDs for each query token and performs a set intersection to find bookmarks that contain all requested terms.
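The intersection behavior can be seen in a minimal standalone sketch (this is an illustration, not the application's actual code; the `add`/`search` helper names are assumed):

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical miniature inverted index: token -> set of bookmark IDs.
index: Dict[str, Set[str]] = defaultdict(set)

def add(doc_id: str, tokens: List[str]) -> None:
    for token in tokens:
        index[token].add(doc_id)

def search(tokens: List[str]) -> Set[str]:
    """Intersect the posting sets: a match must contain every query token."""
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

add("b1", ["python", "tutorial"])
add("b2", ["python", "reference"])
print(search(["python", "tutorial"]))  # only b1 contains both tokens
```

Because only posting sets for the query tokens are touched, lookup cost scales with the number of matching IDs rather than the total number of bookmarks.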
Tokenization and Relevance
Text processing is handled by the _tokenize method, which prepares both the indexed content and the user's search queries.
Tokenization Process
The system uses a regular expression ([a-z0-9]+) to extract alphanumeric tokens and converts them to lowercase. To improve search quality, it filters out common "stop words" defined in _STOP_WORDS (e.g., "the", "and", "is").
```python
import re
from typing import List, Set

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on",
                         "at", "to", "for", "is", "it"}
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def _tokenize(self, text: str) -> List[str]:
    """Split text into lowercase tokens, removing stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]
```
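In practice this means punctuation is discarded along with stop words. A self-contained sketch of the same logic as a plain function (rather than a method) shows the behavior:

```python
import re
from typing import List, Set

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on",
                         "at", "to", "for", "is", "it"}
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text: str) -> List[str]:
    # Lowercase, extract alphanumeric runs, then drop stop words.
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]

print(tokenize("The Flask Tutorial, Part 2"))  # ['flask', 'tutorial', 'part', '2']
```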
Relevance Ranking
Search results are not returned in arbitrary order. The _rank_results method calculates a simple relevance score based on the frequency of query tokens within the bookmark's title and description.
```python
@staticmethod
def _rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Rank results by number of token occurrences in title + description."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)
```
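A worked example of this scoring, using a plain NamedTuple as a stand-in for the real Bookmark model:

```python
from typing import List, NamedTuple

class Bookmark(NamedTuple):  # stand-in for the application's Bookmark model
    title: str
    description: str

def rank(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)

docs = [
    Bookmark("Python intro", "General overview"),
    Bookmark("Python testing", "Testing Python code with pytest"),
]
ranked = rank(docs, ["python", "testing"])
print(ranked[0].title)  # 'Python testing': score 4 (2 + 2) vs score 1 (1 + 0)
```

Note that the score is a raw occurrence count with no normalization for document length, so longer descriptions can outrank shorter but equally relevant ones.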
Lifecycle and Synchronization
The BookmarkService in app/services/bookmark_service.py orchestrates the search index's lifecycle, ensuring it remains synchronized with the underlying repository.
Initialization
When the BookmarkService is first initialized (typically at application startup), it bootstraps the SearchIndex. The index performs a full rebuild by fetching up to 10,000 bookmarks from the BookmarkRepository.
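A sketch of that bootstrap flow, with a stub repository standing in for BookmarkRepository (the repository method name and signature are assumptions, not confirmed by the source):

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set

class Bookmark(NamedTuple):  # stand-in for the application's Bookmark model
    id: str
    title: str
    description: str

class FakeRepository:
    """Stub for BookmarkRepository; the list_bookmarks name is hypothetical."""
    def __init__(self, bookmarks: List[Bookmark]) -> None:
        self._bookmarks = bookmarks

    def list_bookmarks(self, limit: int = 10_000) -> List[Bookmark]:
        return self._bookmarks[:limit]

class SearchIndex:
    def __init__(self, repository: FakeRepository) -> None:
        self._repo = repository
        self._index: Dict[str, Set[str]] = defaultdict(set)
        self._rebuild()

    def _rebuild(self) -> None:
        # Full rebuild: clear, then re-index every bookmark the repo returns.
        self._index.clear()
        for bookmark in self._repo.list_bookmarks(limit=10_000):
            for token in f"{bookmark.title} {bookmark.description}".lower().split():
                self._index[token].add(bookmark.id)

repo = FakeRepository([Bookmark("b1", "Flask docs", "web framework")])
index = SearchIndex(repo)
print(sorted(index._index["flask"]))  # ['b1']
```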
Incremental Updates
To maintain accuracy without full rebuilds, the index is updated incrementally during write operations. Both create_bookmark and update_bookmark trigger an indexing event:
```python
# app/services/bookmark_service.py
def create_bookmark(self, data: Dict[str, Any]) -> Tuple[Optional[Bookmark], Optional[str]]:
    # ... validation and save ...
    self._repo.save_bookmark(bookmark)
    self._search.index_bookmark(bookmark)
    return bookmark, None
```
The index_bookmark method handles updates by first removing any existing entries for the bookmark ID before re-adding the new tokens, preventing duplicate or stale entries.
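The remove-then-re-add behavior can be sketched in isolation (simplified; the real method operates on a Bookmark object rather than an ID/text pair):

```python
from collections import defaultdict
from typing import Dict, Set

index: Dict[str, Set[str]] = defaultdict(set)

def index_bookmark(bookmark_id: str, text: str) -> None:
    # Drop any stale entries for this ID first...
    for ids in index.values():
        ids.discard(bookmark_id)
    # ...then re-add the bookmark's current tokens.
    for token in text.lower().split():
        index[token].add(bookmark_id)

index_bookmark("b1", "python tutorial")
index_bookmark("b1", "rust tutorial")  # update: the 'python' entry is removed
print("b1" in index["python"], "b1" in index["rust"])  # False True
```

Scanning every posting set to discard the old ID is O(number of tokens in the index) per update, which is acceptable at this scale but would become a bottleneck for very large vocabularies.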
Design Tradeoffs and Constraints
The implementation makes several specific design choices that impact its behavior and scalability:
- In-Memory Storage: The index is entirely volatile. While this provides extremely fast lookups, it requires a full rebuild from the repository on every application restart and is limited by the available system memory.
- Strict AND Logic: The search implementation uses set intersection (&=) for multi-token queries, so a bookmark must contain all tokens in the query to be returned. If a user searches for "python tutorial", only bookmarks containing both words will match.
- Soft-Delete Persistence: In the current implementation of BookmarkService.delete_bookmark, bookmarks are "trashed" (a status change) but not explicitly removed from the SearchIndex. Because the index rebuild process in SearchIndex._rebuild fetches all bookmarks regardless of status, trashed and archived bookmarks will appear in search results if they match the query.
- Tokenization Limits: The regex-based tokenizer is optimized for English alphanumeric text. It does not support stemming (e.g., matching "running" when searching for "run") or fuzzy matching (handling typos).