Paper title
Self-supervised Implicit Glyph Attention for Text Recognition
Paper authors
Paper abstract
The attention mechanism has become the de facto module in scene text recognition (STR) methods because it can extract character-level representations. These methods fall into two groups, implicit attention based and supervised attention based, depending on how the attention is computed: implicit attention is learned from sequence-level text annotations, while supervised attention is learned with character-level bounding box annotations. Implicit attention may extract coarse or even incorrect spatial regions as character attention, so it is prone to alignment drift. Supervised attention can alleviate this issue, but it is character category-specific: it requires extra, laborious character-level bounding box annotations and becomes memory-intensive for languages with a large number of character categories. To address these issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images through joint self-supervised text segmentation and implicit attention alignment, which serve as supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance, on publicly available context benchmarks and on our contributed contextless benchmarks.
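As a rough illustration of what "extracting character-level representations" with implicit attention can look like, the sketch below computes generic dot-product attention over a handful of spatial feature vectors. It is not the SIGA method or any implementation from the paper; the names `softmax`, `attend`, `features`, and `query` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Return (weights, glimpse) for one decoding step.

    query:    list of floats, one vector per character to decode (assumed).
    features: list of spatial feature vectors, each the same length as query.
    weights:  attention over spatial positions (sums to 1).
    glimpse:  attention-weighted average of the features, i.e. the
              character-level representation for this step.
    """
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in features
    ]
    weights = softmax(scores)
    glimpse = [
        sum(w * feat[c] for w, feat in zip(weights, features))
        for c in range(dim)
    ]
    return weights, glimpse


# Toy input: three spatial positions with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [4.0, 0.0]
weights, glimpse = attend(query, features)
```

Nothing in this sketch constrains where the weights land, because they are trained only indirectly through the recognition loss. That is the setting in which the abstract says attention can be coarse or drift away from the intended character.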
