Data Analysis | Internal Only | Verified

Pinterest uses a fine-tuned multilingual language model (XLM-RoBERTa-large) to automatically score how relevant each search result (called a Pin) is to a given query, a task previously performed by human annotators. The model classifies each query-Pin pair on a five-level relevance scale and can process 150,000 pairs in 30 minutes on a single GPU. Pinterest uses this internal tool to evaluate search algorithm changes in A/B experiments without waiting for manual labeling cycles.

Details

Pinterest fine-tunes XLM-RoBERTa-large on approximately 2.6 million human-annotated query-Pin pairs, using a cross-encoder architecture whose input is the query text concatenated with Pin text features (Pin titles, image captions generated by BLIP, board titles, and high-engagement query tokens). The model outputs one score for each of the five relevance levels, and the level with the highest score is used as the relevance judgment. In validation, the model's label matches the human rating exactly for 73.7% of pairs and falls within one point on the five-level scale for 91.7% of pairs. The tool runs on a single A10G GPU and labels 150,000 query-Pin pairs in 30 minutes, which lets Pinterest evaluate A/B experiment results for search ranking changes without waiting for human annotation, a process the source blog post describes as high-cost and slow.
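The labeling and validation steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Pinterest's implementation: it does not load the model, and the function names, the `" </s> "` field separator, and the 1-to-5 numbering of relevance levels are all assumptions made for the example. It shows how Pin text features might be concatenated, how the highest of five scores becomes a label, how exact and within-one agreement are computed, and what throughput the reported figures imply.

```python
from typing import List, Sequence, Tuple

# Hypothetical separator; the source does not say how the fields are joined.
SEP = " </s> "


def build_pin_text(title: str, caption: str,
                   board_titles: List[str],
                   engagement_tokens: List[str]) -> str:
    """Concatenate the Pin text features named in the text, skipping empty ones."""
    parts = [title, caption, " ".join(board_titles), " ".join(engagement_tokens)]
    return SEP.join(p for p in parts if p)


def predict_label(scores: Sequence[float]) -> int:
    """Return the relevance level (assumed to be numbered 1..5) with the highest score."""
    if len(scores) != 5:
        raise ValueError("expected one score per relevance level (5 total)")
    best_index = max(range(5), key=lambda i: scores[i])
    return best_index + 1


def agreement(predicted: Sequence[int], human: Sequence[int]) -> Tuple[float, float]:
    """Return (exact agreement rate, within-one-point agreement rate)."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    exact = sum(p == h for p, h in zip(predicted, human)) / n
    within_one = sum(abs(p - h) <= 1 for p, h in zip(predicted, human)) / n
    return exact, within_one


def pairs_per_second(pairs: int, minutes: float) -> float:
    """Throughput implied by labeling `pairs` items in `minutes` minutes."""
    return pairs / (minutes * 60)


if __name__ == "__main__":
    # Made-up Pin features and scores, purely for illustration.
    print(build_pin_text("Oak desk", "a wooden desk by a window", ["Home office"], []))
    print(predict_label([0.05, 0.10, 0.60, 0.20, 0.05]))
    print(agreement([1, 2, 3, 5], [1, 3, 3, 2]))
    print(round(pairs_per_second(150_000, 30), 1))
```

With the figures reported above, 150,000 pairs in 30 minutes works out to roughly 83 pairs per second on the single GPU.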