Aligning tables

First pre-processing steps towards transcription...

The processing of the massive amounts of data stored in the State Archives relies on contributions by citizen scientists and computer vision pre-processing of the data. The latter ensures consistency throughout the valorization process, before actual transcription. One of the pre-processing steps includes extracting individual cells from the tables of meteorological data.

However, the data are not positioned consistently across the scanned images, which makes it hard to automatically extract individual numbers from the columns and rows of the observation tables. Below you see an animated image of very similar data sheets. Obviously the data itself changes, but take note of the subtle changes in the alignment of the numbers from row to row and column to column. Although scans will be made using either a flatbed scanner or a reproduction stand, pages can still warp and are not always aligned in the same way.

Fig 1. - badly aligned tables do not allow for automatic extraction of the filled-in numbers

We resolve this issue by using a spline deformation to map a reference table, whose cell properties are known, onto a table with unknown properties. Imagine one of the sheets of paper being made out of rubber: we wiggle this rubber sheet until it perfectly aligns with a reference table. This is what happens during the transformation process from one sheet layout into the next. We borrowed this technique from the medical sciences, where it is used to align several medical images (x-ray / CT / MRI scans) taken over time.
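To make the rubber sheet idea concrete, here is a minimal sketch of one kind of spline deformation, a thin-plate spline, fitted to matching landmark points. The text above does not say which spline is used or how matching points are found, so the thin-plate choice, the function names (fit_tps, apply_tps) and the landmark coordinates are all assumptions made for illustration only.

```python
import numpy as np


def _tps_kernel(r):
    # Thin-plate spline kernel U(r) = r^2 * log(r), with U(0) defined as 0.
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out


def fit_tps(src, dst):
    """Fit a 2-D thin-plate spline that sends each src point to its dst point."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=2)
    P = np.hstack([np.ones((n, 1)), src])  # affine part: 1, x, y
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = _tps_kernel(dists)
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    params = np.linalg.solve(A, b)
    # Return the control points, the bending weights and the affine weights.
    return src, params[:n], params[n:]


def apply_tps(model, points):
    """Map points from the reference frame into the deformed frame."""
    src, w, a = model
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - src[None, :, :], axis=2)
    P = np.hstack([np.ones((len(points), 1)), points])
    return _tps_kernel(dists) @ w + P @ a


# Hypothetical landmarks (x, y): table corners and centre on the reference
# sheet, and where the same features were found on a slightly warped scan.
reference = [(0, 0), (100, 0), (0, 100), (100, 100), (50, 50)]
scan = [(2, 1), (101, 3), (-1, 99), (103, 102), (52, 49)]

model = fit_tps(reference, scan)
# Where does a cell corner known on the reference sheet land on the scan?
print(apply_tps(model, [(25, 75)]))
```

The fitted mapping reproduces the landmarks exactly and bends smoothly in between, so any cell corner known on the reference sheet can be located on the scan. Resampling the scanned pixels through such a mapping is what would bring the page into the reference layout.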

The result of this exercise is shown below. Here the lines defining the table stay fixed while only the numbers switch between the two pages. Now that the frame of reference is fixed between pages, we can extract the individual numbers more easily, as they are always found in the same location. Not only that, a fixed frame of reference also allows us to (in part) remove the underlying layout of the table itself, retaining only the handwritten numbers. This procedure avoids some of the tedious work of marking the cells of various tables in the Old Weather citizen science project and allows us to move ahead to transcription faster.

Fig 2. - aligned tables, here only the content changes from one page to the next (not the location of rows and columns)
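
Once every page shares the same frame of reference, cutting out cells reduces to slicing each page at the same row and column boundaries. The sketch below illustrates this on a toy page stored as a list of pixel rows; the function name extract_cells and the edge positions are hypothetical and do not come from the project code.

```python
def extract_cells(page, row_edges, col_edges):
    """Cut an aligned page into cells using one shared set of grid edges.

    page      -- list of pixel rows (each row is a list of values)
    row_edges -- y positions of the horizontal table lines, top to bottom
    col_edges -- x positions of the vertical table lines, left to right
    """
    cells = {}
    for r, (top, bottom) in enumerate(zip(row_edges, row_edges[1:])):
        for c, (left, right) in enumerate(zip(col_edges, col_edges[1:])):
            cells[(r, c)] = [row[left:right] for row in page[top:bottom]]
    return cells


# Toy 4 x 6 "page" whose pixel value encodes its own position (10 * y + x).
page = [[10 * y + x for x in range(6)] for y in range(4)]

# Hypothetical grid for a table with 2 rows and 2 columns.
cells = extract_cells(page, row_edges=[0, 2, 4], col_edges=[0, 3, 6])
print(cells[(1, 1)])  # pixels of the bottom-right cell
```

Because the edges are defined once on the reference table, the same call works for every aligned page without marking cells by hand.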