Paper title
On the Value of Behavioral Representations for Dense Retrieval
Paper authors
Paper abstract
We consider text retrieval within a dense representational space in real-world settings such as e-commerce search, where (a) document popularity and (b) the diversity of queries associated with a document have skewed distributions. Most of the contemporary dense retrieval literature has two shortcomings in these settings. (i) It learns an almost equal number of representations per document, agnostic to the fact that a few head documents are disproportionately more critical to achieving good retrieval performance. (ii) It learns purely semantic document representations inferred from intrinsic document characteristics, which may not contain adequate information to determine the queries for which the document is relevant, especially when the document is short. We propose to overcome these limitations by augmenting the semantic document representations learned by bi-encoders with behavioral document representations learned by our proposed approach, MVG. To do so, MVG (1) determines how to divide the total budget for behavioral representations among documents by drawing a connection to the Pitman-Yor process, and (2) simply clusters the queries related to a given document (based on user behavior) within the representational space learned by a base bi-encoder, and treats the cluster centers as the document's behavioral representations. Our central contribution is the finding that such a simple, intuitive, lightweight approach leads to substantial gains in key first-stage retrieval metrics while incurring only a marginal memory overhead. We establish this via extensive experiments on three large public datasets comparing several single-vector and multi-vector bi-encoders, on a proprietary e-commerce search dataset against a production-quality bi-encoder, and in an A/B test.
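The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`allocate_budget`, `kmeans`, `score`), the `discount` parameter, and the max-over-vectors scoring rule are all assumptions made for illustration. The budget rule relies only on the known property that, under a Pitman-Yor process with discount d, the expected number of clusters among n items grows roughly like n^d; the exact allocation used by MVG may differ. Plain k-means stands in for whatever clustering the paper uses.

```python
import random


def allocate_budget(query_counts, total_budget, discount=0.5):
    """Split a total budget of behavioral vectors across documents.

    Assumption: each document is weighted by n**discount, where n is the
    number of queries observed for it, mirroring the Pitman-Yor growth rate.
    """
    weights = [n ** discount for n in query_counts]
    total = sum(weights)
    if total == 0:
        return [0] * len(query_counts)
    # Floor the proportional share; a document cannot use more clusters
    # than it has queries.
    return [min(n, int(total_budget * w / total))
            for n, w in zip(query_counts, weights)]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over query embeddings; returns k centers."""
    if k <= 0:
        return []
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def score(query_vec, doc_vecs):
    """Assumed scoring rule: best dot product over all of a document's
    vectors (its semantic vector plus its behavioral cluster centers)."""
    return max(sum(q * d for q, d in zip(query_vec, v)) for v in doc_vecs)
```

In this sketch, a head document with many associated queries receives more behavioral vectors than a tail document, and each behavioral vector is simply the center of a cluster of query embeddings produced by the base bi-encoder. The memory overhead is then bounded by the total budget passed to `allocate_budget`.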
