Handwritten Word Detector
Introduction
This article serves as documentation for the
WordDetectorNN implementation.
It is a neural network based word detector inspired by the ideas of
Zhou
and
Axler.
The model classifies each pixel as word (inner part or surrounding) or background pixel.
For each pixel of the
inner word class, an axis aligned bounding box (AABB) around the word is predicted.
Because usually multiple AABBs per word are predicted, a clustering algorithm is applied to them.
The model is trained on the IAM dataset, a sample result for the CVL dataset is shown in Fig. 1.
Fig. 1: Detected words for a sample from the CVL dataset.
Model
The AABBs are encoded by the following output maps of the model:
- 3 segmentation maps with one-hot encoding:
- Word (inner part)
- Word (surrounding)
- Background
- 4 geometry maps encode distances between the current pixel and the AABB edges:
Only for pixels of the
inner word class the bounding box geometry is learned.
A
surrounding class is added to avoid mapping both the background and the surrounding of a word to the
background class.
Fig. 2 shows an encoded AABB.
Fig. 2: An encoded AABB with segmentation maps (red: inner part of word, green: surrounding of word, blue: background) and geometry maps (distance to top, bottom, left, right of AABB edges).
ResNet18 is used as a feature extractor.
The model follows the typical U-shape architecture known from segmentation tasks.
An image size of 448×448 is used while training.
After the final layer of ResNet18, the feature maps have a size of 14×14.
The following layers upscale the maps step-by-step.
Further, they merge intermediate maps from ResNet18.
The output of the neural network has a size of 224×224, that is half the input width and height.
Loss function
The total loss is the sum of:
- Segmentation loss: segmentation is regarded as a pixelwise classification problem, therefore cross entropy loss is used
- Geometry loss: using sum-of-squared errors on the geometry would put more weight on larger bounding boxes for which larger errors can be tolerated, therefore intersection over union (IOU) is used instead
Bounding box clustering
Usually, many AABBs are predicted for the same word, each slightly different, see Fig. 3.
The Jaccard distance JD between two AABBs is JD=1-IOU.
A distance matrix containing the (Jaccard) distances between all AABB pairs is computed.
Using this distance matrix, the clustering algorithm DBSCAN computes AABB clusters.
The resulting AABB is computed from the cluster members by taking the median edge positions.
Fig. 3: Multiple AABBs for the same word before clustering.
Computing the distance matrix is a performance bottleneck in the detection pipeline.
Two different strategies to speed up this step are implemented:
- Only take a subset of AABBs
- Compute connected components of inner word segmentation map, and only take a small number of AABBs per component
Summary
The model computes AABBs enclosing the detected handwritten words.
It encodes the AABBs with segmentation and geometry maps.
A clustering step is applied to the predicted bounding boxes.
Harald Scheidl, 2021