<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/assets/xslt/atom.xslt" ?>
<?xml-stylesheet type="text/css" href="/assets/css/atom.css" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>http://cobecore.org/</id>
	<title>COBECORE</title>
	<updated>2023-10-07T18:49:02+00:00</updated>

	<subtitle>Congo basin eco-climatological data recovery and valorisation (COBECORE) landing page</subtitle>

	
		
		<author>
			
				<name>Koen Hufkens</name>
			
			
			
		</author>
	

	<link href="http://cobecore.org/atom.xml" rel="self" type="application/atom+xml" />
	<link href="http://cobecore.org/" rel="alternate" type="text/html" />

	<generator uri="http://jekyllrb.com" version="3.9.3">Jekyll</generator>

	
		<entry>
			<id>http://cobecore.org/blog/jungle-weather-first-batch/</id>
			<title>Jungle Weather CitSci first batch finished</title>
			<link href="http://cobecore.org/blog/jungle-weather-first-batch/" rel="alternate" type="text/html" title="Jungle Weather CitSci first batch finished" />
			<updated>2020-05-26T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>crowdsourcing speed record</summary>
			<content type="html" xml:base="http://cobecore.org/blog/jungle-weather-first-batch/">&lt;p&gt;Only a good month after the launch of the Jungle Weather project, the first ~380 000 transcriptions have been made. This unusual speed record was in part due to the fact that so many people were at home during COVID lockdowns. Regardless, this is an amazing accomplishment.&lt;/p&gt;

&lt;p&gt;For now I’ll go back to the drawing board to see if these transcriptions can be leveraged to further automate the task using machine learning. New batches of data will be uploaded once we can better scope the efficiency of this approach.&lt;/p&gt;

&lt;p&gt;My thanks go out to all the ~2300 volunteers who have contributed to the project!!&lt;/p&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="weather" />
			
				<category term="transcription" />
			

			<published>2020-05-26T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/jungle-weather-launch/</id>
			<title>Jungle Weather CitSci launch</title>
			<link href="http://cobecore.org/blog/jungle-weather-launch/" rel="alternate" type="text/html" title="Jungle Weather CitSci launch" />
			<updated>2020-04-26T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>crowdsourcing transcriptions</summary>
			<content type="html" xml:base="http://cobecore.org/blog/jungle-weather-launch/">&lt;p&gt;I’m happy to announce that our Jungle Weather citizen science project to transcribe old recovered weather records has launched. You can find a full description of the project on our website here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://cobecore.org/jungleweather/&quot;&gt;http://cobecore.org/jungleweather/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and you can contribute to the project here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.zooniverse.org/projects/khufkens/jungle-weather&quot;&gt;https://www.zooniverse.org/projects/khufkens/jungle-weather&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project is hosted on the Zooniverse platform, and I thank the project staff and volunteers for making this project happen. As I write, two days after the soft launch, the project has 346 active volunteers who have contributed over 60K classifications! Go team go.&lt;/p&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="weather" />
			
				<category term="transcription" />
			

			<published>2020-04-26T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/lulcc-paper-accepted/</id>
			<title>Historical Land-Use Land-Cover paper published</title>
			<link href="http://cobecore.org/blog/lulcc-paper-accepted/" rel="alternate" type="text/html" title="Historical Land-Use Land-Cover paper published" />
			<updated>2020-02-09T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>validating old aerial photographs</summary>
			<content type="html" xml:base="http://cobecore.org/blog/lulcc-paper-accepted/">&lt;p&gt;In a previous post I described how I was using &lt;a href=&quot;http://cobecore.org/blog/deep-learning-forest-cover/&quot;&gt;Deep Learning to map forest cover&lt;/a&gt; based upon old aerial photographs in the Yangambi region, from which I source most of my wood (core) material.&lt;/p&gt;

&lt;p&gt;This work was finished at the end of December, submitted to MDPI Remote Sensing, and went through review rather smoothly. The paper is now accepted and &lt;a href=&quot;https://www.mdpi.com/2072-4292/12/4/638&quot;&gt;available for free online (open access)&lt;/a&gt;. I’ll give a short visual summary below; for those interested in the details I refer to the link provided.&lt;/p&gt;

&lt;h2 id=&quot;historical-aerial-surveys-map-long-term-changes-of-forest-cover-and-structure-in-the-central-congo-basin&quot;&gt;Historical Aerial Surveys Map Long-Term Changes of Forest Cover and Structure in the Central Congo Basin&lt;/h2&gt;

&lt;h3 id=&quot;abstract&quot;&gt;Abstract&lt;/h3&gt;

&lt;p&gt;Given the impact of tropical forest disturbances on atmospheric carbon emissions, biodiversity, and ecosystem productivity, accurate long-term reporting of Land-Use and Land-Cover (LULC) change in the pre-satellite era (&amp;lt;1972) is an imperative. Here, we used a combination of historical (1958) aerial photography and contemporary remote sensing data to map long-term changes in the extent and structure of the tropical forest surrounding Yangambi (DR Congo) in the central Congo Basin. Our study leveraged structure-from-motion and a convolutional neural network-based LULC classifier, using synthetic landscape-based image augmentation to map historical forest cover across a large orthomosaic (~93,431 ha) geo-referenced to ~4.7 ± 4.3 m at submeter resolution. A comparison with contemporary LULC data showed a shift from previously highly regular industrial deforestation of large areas to discrete smallholder farming clearing, increasing landscape fragmentation and providing opportunities for substantial forest regrowth. We estimated aboveground carbon gains through reforestation to range from 811 to 1592 Gg C, partially offsetting historical deforestation (2416 Gg C), in our study area. Efforts to quantify long-term canopy texture changes and their link to aboveground carbon had limited to no success. Our analysis provides methods and insights into key spatial and temporal patterns of deforestation and reforestation at a multi-decadal scale, providing a historical context for past and ongoing forest research in the area.&lt;/p&gt;

&lt;h3 id=&quot;visual-summary&quot;&gt;Visual summary&lt;/h3&gt;

&lt;p&gt;Our paper discusses how historical aerial photography can be turned into a useful land-use and land-cover map in the central Congo Basin using structure-from-motion and Deep Learning based image segmentation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://pbs.twimg.com/media/EQapZHVWAAAiuwS?format=jpg&amp;amp;name=small&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These historical aerial photographs were made in 1958 within the context of mapping efforts. In my COBECORE project we digitized a large number of these analogue records in order to valorize these data for ecological research.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://pbs.twimg.com/media/EQapZlkWAAAad1m?format=jpg&amp;amp;name=small&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These data were first stitched together into an orthomosaic (~93K ha), correcting for view angle effects, and georeferenced to local ground control points (buildings and other fixed structures). From this mosaic, homogeneous forest / non-forest areas were selected for further processing.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://pbs.twimg.com/media/EQapaF1XkAA6kDP?format=jpg&amp;amp;name=900x900&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Homogeneous areas were then combined into synthetic landscapes, mixing forest and non-forest classes. Landscape patterns were generated using a Gaussian random field mask, with random “sharpness” in the transition of classes (in addition to other augmentation techniques).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://pbs.twimg.com/media/EQapayfW4AUlx7M?format=png&amp;amp;name=small&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The synthetic landscapes were used to train a Deep Learning U-Net image segmentation routine, resulting in a forest / non-forest land cover map for the whole historical orthomosaic. We compared these data to the current state of forest cover using Global Forest Cover data.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://pbs.twimg.com/media/EQapbe6XUAcSMwd?format=png&amp;amp;name=small&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To corroborate our results, we ran the segmentation routine on a contemporary panchromatic high resolution Geo-Eye image. The accuracy for the original map exceeded 95%; for the image 60 years later we still found an agreement of 87% (compared to Global Forest Cover data)!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://pbs.twimg.com/media/EQapcz7XsAUGrQP?format=jpg&amp;amp;name=small&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;From these data we calculated changes in the state of the forest in terms of Above Ground Carbon (AGC) and landscape metrics (i.e. how the complexity of the landscape changed over time). We conclude that for our study area a lot of previously cleared forest (hence AGC) has been reclaimed and is forest again, offsetting some of the recent losses. Formal, homogeneous colonial land clearing made way for more fragmented, ad-hoc deforestation.&lt;/p&gt;

&lt;p&gt;Our analysis provides insights into the rates at which deforestation and reforestation have taken place over a multi-decadal scale in the central Congo Basin. As such, it provides a useful historical context for interpreting past and ongoing forest research in the area.&lt;/p&gt;

&lt;h3 id=&quot;data--code-availability&quot;&gt;Data &amp;amp; Code availability&lt;/h3&gt;

&lt;p&gt;Hufkens et al. (2019): A curated dataset of aerial survey images over the central Congo Basin, 1958. Zenodo: &lt;a href=&quot;https://doi.org/10.5281/zenodo.3547767&quot;&gt;https://doi.org/10.5281/zenodo.3547767&lt;/a&gt;. All data not included in the latter repository can be found bundled with the analysis code as listed below. Proprietary datasets (i.e., Geo-Eye data) are not shared, but purchase order numbers allow for acquisition of these datasets to ensure reproducibility. All analysis code is available as R/python projects (&lt;a href=&quot;https://khufkens.github.io/orthodrc&quot;&gt;https://khufkens.github.io/orthodrc&lt;/a&gt; and &lt;a href=&quot;https://khufkens.github.io/orthodrc_cnn/&quot;&gt;https://khufkens.github.io/orthodrc_cnn/&lt;/a&gt;).&lt;/p&gt;
</content>

			
				<category term="blog" />
			
			

			<published>2020-02-09T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/deep-learning-forest-cover/</id>
			<title>Deep learning forest cover</title>
			<link href="http://cobecore.org/blog/deep-learning-forest-cover/" rel="alternate" type="text/html" title="Deep learning forest cover" />
			<updated>2019-11-05T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>annotating maps the fast and easy way</summary>
			<content type="html" xml:base="http://cobecore.org/blog/deep-learning-forest-cover/">&lt;p&gt;Within the &lt;a href=&quot;http://cobecore.org&quot;&gt;COBECORE project&lt;/a&gt; I promised to digitize and map past forest cover. This task has been finished, but a thorough analysis of this data would add a lot of value in terms of the disturbance history of the forests around Yangambi.&lt;/p&gt;

&lt;p&gt;However, the area covers roughly a 35x40 km outline at ~1 m resolution, making it hard to manually segment forest cover and disturbed areas (plantations, built-up areas, agriculture). In a previous version of the orthomosaic I did this exercise manually. However, it can be argued that such an analysis is biased, if only due to the size of the brush used to infill the classes.&lt;/p&gt;

&lt;p&gt;In order to side-step this issue, and increase reproducibility, I decided to tackle it using deep learning. I chose to go down this path in part because the problem at hand resembles standard supervised texture segmentation, a common computer vision issue, which has become trivial with deep learning.&lt;/p&gt;

&lt;p&gt;My approach went through a number of iterations, not least because I wanted to limit my workload. Generating a deep learning training set, although not as large a task as a full segmentation of the map, can still be significant. This is because deep learning is a supervised methodology which requires large amounts of training data to learn what forested or disturbed areas look like.&lt;/p&gt;

&lt;p&gt;In my first approach I outlined a number of areas which were homogeneous with respect to a particular class (forested, disturbed / non-forest). Within these regions I randomly sampled small images (513x513 pixels). Since each sampled image came from a single known class, I could generate ground truth labels for training a deep learning network with only two masks.&lt;/p&gt;

&lt;h2 id=&quot;large-scale-representation-issues&quot;&gt;Large scale representation issues&lt;/h2&gt;

&lt;p&gt;When using this initial dataset, accuracy stalled at roughly 60% Intersection over Union (IoU, a metric of segmentation accuracy). When inspecting the data it was obvious that for homogeneous tiles the deep learning network performed well. However, it performed poorly in those instances where two classes mixed in a given “scene”. The model made an implicit assumption based upon the size and overall content of the small images in training, rather than evaluating every pixel (and surrounding texture) in an image.&lt;/p&gt;

&lt;p&gt;This effect can be seen in the below image where the edges between forest and disturbed (lighter) patches are surrounded by a large area in which there is uncertainty (purple tints). This uncertainty also takes on the shape of the square (moving) window used during evaluation of the scene.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://khufkens.com/assets/img/docs/segmentation_1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Acknowledging this issue, I decided to add image augmentation on top of the binary labelled training data. I created mash-ups of the binary labelled datasets, combining forested and disturbed images using a random Gaussian field mask (see below). These artificial scenes provide the algorithm with transition states between the forest and disturbed patches without having to manually segment those, a time-intensive task.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://khufkens.com/assets/img/docs/train.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
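&lt;p&gt;This mash-up step can be sketched as follows. This is a minimal illustration rather than the project code; the function names and default parameter values below are my own, assuming NumPy and SciPy are available.&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_field_mask(shape, smoothness=20.0, sharpness=10.0, seed=None):
    """Smooth white noise into a Gaussian random field, then squash it
    through a sigmoid; higher sharpness gives crisper class transitions."""
    rng = np.random.default_rng(seed)
    field = gaussian_filter(rng.standard_normal(shape), sigma=smoothness)
    field = (field - field.mean()) / (field.std() + 1e-9)
    return 1.0 / (1.0 + np.exp(-sharpness * field))

def synthetic_scene(forest, disturbed, mask):
    """Blend a forest tile and a disturbed tile with the mask; the
    binarized mask doubles as the ground-truth segmentation label."""
    image = mask[..., None] * forest + (1.0 - mask[..., None]) * disturbed
    label = np.greater(mask, 0.5).astype(np.uint8)
    return image, label
```

&lt;p&gt;Varying the sharpness parameter controls how gradual the transition between the two classes is, so the network also sees mixed scenes during training.&lt;/p&gt;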

&lt;h2 id=&quot;synthetic-landscapes-to-sidestep-manual-segmentation&quot;&gt;Synthetic landscapes to sidestep manual segmentation&lt;/h2&gt;

&lt;p&gt;Using these artificial scenes, the algorithm’s IoU jumped to 92%, very good for a segmentation task. Evaluating the same scene as above shows this improvement, with finer detail mapped in the disturbances (pink) and fewer broad areas of uncertainty (purple). In short, transitions between forested and disturbed areas are better detected, resulting in sharper edges.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://khufkens.com/assets/img/docs/segmentation_2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When evaluating the whole map (below), performance is indeed in line with the validation results. The majority of the surface area is correctly classified. However, exceptions to this rule exist. In particular, stitch lines between different images used to create the orthomosaic are incorrectly labelled as a disturbance. Arguably, this indeed represents a sort of disturbance, but not one related to the true structure of the forest. To address this issue I added more data augmentation, combining pairs of forest tiles with different textures, since acquisition date differences cause such texture shifts along stitch lines.&lt;/p&gt;

&lt;p&gt;After this final correction, classification accuracy increases to 96% IoU, even when stopping the training process early (some gains are still possible by increasing the training duration). I expect the final classification accuracy to reach ~98% IoU, an extremely high value.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://khufkens.com/assets/img/docs/yangambi.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="computer_vision" />
			
				<category term="data science" />
			
				<category term="cs" />
			
				<category term="image_processing" />
			

			<published>2019-11-05T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/intern-report/</id>
			<title>Digitization of the Bulletin Agricole du Congo Belge</title>
			<link href="http://cobecore.org/blog/intern-report/" rel="alternate" type="text/html" title="Digitization of the Bulletin Agricole du Congo Belge" />
			<updated>2019-03-08T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>testimony from an intern at COBECORE ...</summary>
			<content type="html" xml:base="http://cobecore.org/blog/intern-report/">&lt;p&gt;From October until March, I was an intern at the AfricaMuseum on the COBECORE project. My task was to help with the digitization of the Bulletin Agricole du Congo Belge. This supports the underlying philosophy of the project, which is to increase the accessibility of scientific information for both scientists and the general public. We have now finished scanning all articles published in the Bulletin Agricole and the Bulletin d’Information de l’INEAC. A beta version of the website is accessible via this link: &lt;a href=&quot;http://ineac.africamuseum.be/&quot;&gt;http://ineac.africamuseum.be/&lt;/a&gt;. Furthermore, the digitized articles are also accessible via a server placed in Kinshasa (&lt;a href=&quot;http://ineac.rdcmirrorsmrac.org/&quot;&gt;http://ineac.rdcmirrorsmrac.org/&lt;/a&gt;), meeting a local request for documentation.&lt;/p&gt;

&lt;center&gt;
&lt;style&gt;
table, tr, td {
    border: none;
}
&lt;/style&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig 1 - Congolese farmers cultivate cotton at the paysannat of Bambesa in Uele. (Bulletin Agricole, vol. XLII, 1952, 93.).&lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/cotton.jpg&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The agricultural research described in the Bulletin Agricole was carried out between 1910 and 1960, with the last issue being published in 1961, shortly after Congolese independence. The most prolific period was undoubtedly during the existence of INEAC (Institut National pour l’Etude Agronomique du Congo Belge); this institution played an important role in rural Congo during colonial times. The research and achievements of INEAC at the agricultural and economic level are known around the world. Between 1945 and 1960, INEAC was the biggest scientific research institute in Africa. Even though their research during the colonial period was mainly aimed at maximizing profits for Belgium, INEAC’s scientific insights are still relevant today in the agronomic, forestry and ecological context of the Democratic Republic of the Congo. The Bulletin Agricole includes more than 3000 scientific articles which contain many insights that are still very relevant for contemporary tropical agronomic research.&lt;/p&gt;

&lt;center&gt;
&lt;style&gt;
table, tr, td {
    border: none;
}
&lt;/style&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig 2 - Map of the paysannat of Luberizi. (Bulletin Agricole, vol. XLII, 1952, 243.).&lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/luberizi.png&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Besides that, the Bulletin Agricole also has an important historical value. During my internship at COBECORE I learned how colonial resources still have contemporary value. Historical data are crucial for contemporary research: for example, historical climate data in models of climate change. Moreover, the archival material from the Bulletin Agricole is also important for uncovering our colonial past. This brings me to my own research. In the context of my master’s thesis in social history at the University of Leuven, I’m uncovering the duality of Belgian agricultural politics in Congo. Specifically, I focus on the paysannat system of INEAC. This system was a large-scale form of social engineering in which Congolese farmers were transported to organized agricultural allotments where they had to cultivate a combination of local and cash crops for the colonial administration, following specific methods that supported their self-sufficiency. I chose this subject in collaboration with my promoter, professor Yves Segers, director of the Interfaculty Centre for Agrarian History (ICAG). Consequently, it was very interesting for me to link my own research with my internship. My internship as a historian at COBECORE was an educative experience and it was nice to be a part of this multidisciplinary project. Both my internship and my thesis illustrate the importance of exploring and digitizing unexplored historical resources, like the articles in the Bulletin Agricole.&lt;/p&gt;

&lt;p&gt;– Febe Boulanger&lt;/p&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="data_recovery" />
			
				<category term="library" />
			

			<published>2019-03-08T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/data-coverage-stats/</id>
			<title>Automated data coverage statistics</title>
			<link href="http://cobecore.org/blog/data-coverage-stats/" rel="alternate" type="text/html" title="Automated data coverage statistics" />
			<updated>2019-03-06T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>assessing data load for transcription (of format #1) ….</summary>
			<content type="html" xml:base="http://cobecore.org/blog/data-coverage-stats/">&lt;p&gt;Last week I finished the pre-processing code for &lt;a href=&quot;http://cobecore.org/blog/template-matching/&quot;&gt;aligning&lt;/a&gt; and &lt;a href=&quot;http://cobecore.org/blog/finding-empty-cells/&quot;&gt;screening&lt;/a&gt; the COBECORE digitized records. Friday I ran the alignment and classification routine on “format 1”, one of the more common data sheet formats in the dataset, which covers the 1950s. Today I processed some of the meta-data produced during the process.&lt;/p&gt;

&lt;p&gt;During alignment of the 10K+ scans, cell values are not only screened for the presence of data; this meta-data is also retained to get an idea of the transcription data load and transcription tasks in the upcoming citizen science project. Here I provide some of these statistics for those sites which could be matched to a proper site ID as noted in the Belgian State Archive index.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/documentation/data_coverage.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After processing the 10 720 scans, some 5.3 million cells were evaluated for data coverage (%), as reported in the bar graph below. Considering data only on a monthly level (counting months with 10 days or more of values as complete), high coverage was noted for minimum and maximum temperature (and derived values), with coverage in the low 90s (%). Values which serve to calculate relative humidity (wet and dry bulb temperature), as well as rainfall data, covered roughly &amp;gt;60% of all months analyzed, while evaporation data and soil conditions were reported in fewer than 60% of months. Notes and data on thunderstorms (noise), as well as meta-data on rainfall such as duration and intensity, were reported in 20% or less of the months.&lt;/p&gt;

&lt;p&gt;This summary provides some insight into what data is available and what can be expected after data recovery in terms of data coverage. As expected, simple measurements of temperature and rainfall are well covered. In format 1 it seems that values of relative humidity will also be fairly common.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/documentation/coverage.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
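&lt;p&gt;The monthly completeness rule above (10 or more days of values counts as a complete month) can be sketched with pandas. This is an illustrative reconstruction, not the project’s processing code; &lt;code&gt;monthly_coverage&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import pandas as pd

def monthly_coverage(daily, min_days=10):
    """Per-variable fraction of months counted as complete, where a month
    is complete when at least min_days daily values are present.
    daily: DataFrame with a DatetimeIndex, one column per variable."""
    counts = daily.notna().groupby(pd.Grouper(freq="MS")).sum()
    complete = counts.ge(min_days)  # month holds min_days or more values
    return complete.mean()
```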
</content>

			
				<category term="blog" />
			
			
				<category term="data_recovery" />
			
				<category term="digitization" />
			
				<category term="citizen_science" />
			
				<category term="meta-data" />
			

			<published>2019-03-06T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/finding-empty-cells/</id>
			<title>Finding empty cells in tables</title>
			<link href="http://cobecore.org/blog/finding-empty-cells/" rel="alternate" type="text/html" title="Finding empty cells in tables" />
			<updated>2019-02-19T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>decreasing the transcription workload ….</summary>
			<content type="html" xml:base="http://cobecore.org/blog/finding-empty-cells/">&lt;h2 id=&quot;a-large-empty-workload&quot;&gt;A large (empty) workload&lt;/h2&gt;

&lt;p&gt;A &lt;a href=&quot;http://cobecore.org/blog/template-matching/&quot;&gt;previous blog post&lt;/a&gt; outlined how the +70K scans present an issue when it comes to processing and extracting data. Thanks to template matching, a large part of these issues has been automated away. Yet, even when the data can be extracted, one hurdle remains: empty cells in tables.&lt;/p&gt;

&lt;p&gt;Our digitized tables are sparse. This means that the bulk of the data in the tables consists of empty table cells, while the remaining part is truly valuable data. Since transcription will rely on a pair of human eyes evaluating every single cell of data, it is obviously a waste of time to review empty cells. A solution has to be found to quickly and accurately screen these empty cells and remove them from the final data set, limiting the workload (and not wasting volunteers’ time).&lt;/p&gt;

&lt;h2 id=&quot;tensorflow-transfer-learning&quot;&gt;Tensorflow transfer learning&lt;/h2&gt;

&lt;p&gt;During a &lt;a href=&quot;https://github.com/khufkens/TF_transfer_learning&quot;&gt;previous project&lt;/a&gt; transfer learning, based on the &lt;a href=&quot;https://www.tensorflow.org/&quot;&gt;Tensorflow framework&lt;/a&gt;, provided a fast solution for a simple classification task. Transfer learning is a way of rapidly training a complex image processing model by leveraging the efforts of previous researchers. Previous research groups have created models which are tuned using a large selection (many millions) of labelled images. This model therefore includes a fairly good representation of what you might encounter, and want to label, in the real world. In transfer learning we use this existing model and tune it further to a specific use case. This is often far faster than creating your own model, which also requires vast amounts of data.&lt;/p&gt;

&lt;p&gt;We use the transfer learning approach to classify cells of the digitized tables as either &lt;em&gt;empty&lt;/em&gt; or &lt;em&gt;complete&lt;/em&gt; (Fig 1.). To do this the cells of 3 tables with varying handwriting or typed numbers were extracted using the &lt;a href=&quot;http://cobecore.org/blog/template-matching/&quot;&gt;template matching approach&lt;/a&gt;, as previously described. This left us with training data of ~1400 cell values (split evenly among the two types). This data was used to retrain the model for our use case.&lt;/p&gt;

&lt;center&gt;
&lt;style&gt;
table, tr, td {
    border: none;
}
&lt;/style&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig. 1 - An empty and complete table cell value. &amp;copy; State Archive (COBECORE)&lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/cell_values.jpg&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;&lt;/center&gt;
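&lt;p&gt;In current Keras terms, such a retraining setup looks roughly as follows. This is a sketch rather than the original training script, and MobileNetV2 merely stands in for whichever pretrained backbone was actually used.&lt;/p&gt;

```python
import tensorflow as tf

def build_transfer_model(input_shape=(224, 224, 3), weights="imagenet"):
    """Frozen pretrained feature extractor with a fresh binary head,
    mirroring the empty vs. complete cell classification task."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights)
    base.trainable = False  # keep the pretrained features fixed
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # empty vs. complete
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

&lt;p&gt;Only the small dense head is trained on the ~1400 labelled cells; the frozen backbone supplies the generic image features learned from millions of labelled images.&lt;/p&gt;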

&lt;p&gt;After retraining, the model had an accuracy of &lt;strong&gt;~98%&lt;/strong&gt;. For the task at hand this is sufficient, as additional screening based upon column-wide statistics will be made. A visualization of the classification results of one particular table is given below (Fig 2.). The template matching visualization is used, where light blue pixels represent those of the template, red/pink pixels represent those of the matched table, blue pixels show agreement between the template and the matched table and, finally, &lt;strong&gt;white crosses indicate empty cells&lt;/strong&gt; as predicted by our Tensorflow model.&lt;/p&gt;

&lt;p&gt;In the below table we see only a few misclassified cells. In particular we find one false positive, claiming a cell is empty when it is not, and six false negatives, where empty cells are not flagged. With over 400 values in the table and an accuracy of ~98%, an error rate of seven values is roughly what you might expect.&lt;/p&gt;

&lt;style&gt;
table, tr, td {
    border: none;
}
&lt;/style&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig. 2 - Cells classified as empty marked with a white X. &amp;copy; State Archive (COBECORE)&lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/alignment_preview.jpg&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="data_recovery" />
			
				<category term="digitization" />
			
				<category term="citizen_science" />
			
				<category term="meta-data" />
			

			<published>2019-02-19T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/template-matching/</id>
			<title>Template matching data tables</title>
			<link href="http://cobecore.org/blog/template-matching/" rel="alternate" type="text/html" title="Template matching data tables" />
			<updated>2019-02-11T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>extracting data from imaged tables ….</summary>
			<content type="html" xml:base="http://cobecore.org/blog/template-matching/">&lt;h2 id=&quot;big-tables-big-problems&quot;&gt;Big tables, big problems&lt;/h2&gt;

&lt;p&gt;With all images scanned and &lt;a href=&quot;http://cobecore.org/blog/sorting-data/&quot;&gt;sorted&lt;/a&gt; the next step involves the transcription of the images into meaningful, machine readable, data. Due to the complexity of the data, such as various handwriting styles in faded or runny ink, automating this process is very difficult. We will therefore aim to crowdsource the transcription of the data. Yet, large tables are difficult to transcribe as the location within a table is of importance, and not only the values. As such, mistakes are more easily made when transcribing tables as a whole.&lt;/p&gt;

&lt;h2 id=&quot;template-matching&quot;&gt;Template matching&lt;/h2&gt;

&lt;p&gt;To resolve the issue of size we will cut the table into its individual cells using a technique called &lt;a href=&quot;https://en.wikipedia.org/wiki/Template_matching&quot;&gt;template matching&lt;/a&gt;. In this application the technique uses an empty table, with known properties such as cell locations, as a template. The tables containing data are then matched to this template to allow for the extraction of the data. Such an approach is commonly used to automatically process various forms, for example standardized tests. In general these computer science problems are broadly covered by the field of &lt;a href=&quot;https://en.wikipedia.org/wiki/Pattern_recognition&quot;&gt;pattern recognition&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In our case we use the &lt;a href=&quot;https://en.wikipedia.org/wiki/ORB_%28feature_descriptor%29&quot;&gt;Oriented FAST and Rotated BRIEF&lt;/a&gt; (ORB) feature matching algorithm to establish correspondence between the empty template and an image of a table containing data (Fig. 1). The matched features make it possible to calculate the &lt;a href=&quot;https://en.wikipedia.org/wiki/Homography&quot;&gt;homography&lt;/a&gt;, the projective transformation that maps one image onto the other.&lt;/p&gt;

&lt;style&gt;
table, tr, td {
    border: none;
}
&lt;/style&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig. 1 - ORB matches between an image containing data and the template. &amp;copy; State Archive (COBECORE)&lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/matches.jpg&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;If good correspondence is established we can transform the image containing data to align (almost perfectly) with the template. An example is given below (Fig. 2), where the transformed data sheet is shown in red/pink tints, the template in light blue, and areas where the two images agree in dark blue. Note how the light blue header texts of the template transition into an almost perfect correspondence with the image containing data.&lt;/p&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig. 2 - Colours correspond to the different images. Red tints, actual data sheet, light blue for the template. Correspondence is marked by dark blue pixels. &amp;copy; State Archive (COBECORE)&lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/template.jpg&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;This correspondence allows us to measure the locations of the individual cells once, on the template table, and transfer these measurements to the aligned table containing data. With this match it is now possible to extract the individual cells of the table for easier processing (through crowdsourcing or otherwise) with limited effort, scaling up an otherwise tedious, if not impossible, manual task.&lt;/p&gt;
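
&lt;p&gt;The cell extraction itself then reduces to simple cropping. A minimal sketch in Python, assuming a regular grid whose origin and cell size were measured once on the template; all coordinates below are made up for illustration:&lt;/p&gt;

```python
import numpy as np

def cell_boxes(x0, y0, w, h, n_rows, n_cols):
    """Return (row, col, x, y, w, h) boxes for every cell of a regular grid
    whose top-left corner sits at (x0, y0) on the template."""
    boxes = []
    for r in range(n_rows):
        for c in range(n_cols):
            boxes.append((r, c, x0 + c * w, y0 + r * h, w, h))
    return boxes

def crop_cells(aligned, boxes):
    """Cut an aligned table image into one small image per cell."""
    return {(r, c): aligned[y:y + h, x:x + w]
            for (r, c, x, y, w, h) in boxes}

# toy example: a blank 100x160 "aligned" page with a 2x4 grid of 30x40 cells
aligned = np.zeros((100, 160), dtype=np.uint8)
cells = crop_cells(aligned,
                   cell_boxes(x0=20, y0=10, w=30, h=40, n_rows=2, n_cols=4))
```

&lt;p&gt;Because the alignment brings every sheet into the template's coordinate system, the same set of boxes works for every scanned page of that form type.&lt;/p&gt;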
</content>

			
				<category term="blog" />
			
			
				<category term="data_recovery" />
			
				<category term="digitization" />
			
				<category term="citizen_science" />
			

			<published>2019-02-11T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/sorting-data/</id>
			<title>Sorting the digitized data</title>
			<link href="http://cobecore.org/blog/sorting-data/" rel="alternate" type="text/html" title="Sorting the digitized data" />
			<updated>2019-02-07T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>preparing citizen science transcription …</summary>
			<content type="html" xml:base="http://cobecore.org/blog/sorting-data/">&lt;p&gt;COBECORE aims to transcribe historical climate data. Sadly the volume as well as the state of the documents limits automation of this process. As such, we aim to enlist the help of citizen scientists to contribute a bit of time to transcribe this data. However, even these efforts require the original scans to be sorted and pre-processed. Doing so revealed some interesting statistics.&lt;/p&gt;

&lt;p&gt;Most of the data is stored in two formats, a larger and a smaller table (see Fig. 1), accounting for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;14.8K&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;24.7K&lt;/code&gt; images respectively. The next steps will include dividing these sheets into their individual cells. These cells will then be automatically screened for content, and those containing data will be shown to citizen scientists for transcription.&lt;/p&gt;
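
&lt;p&gt;The automatic screening of cells for content can be sketched as a simple ink-fraction heuristic in Python: flag a cell as non-empty when enough of its pixels are dark. The threshold values below are illustrative and would need tuning on the real scans:&lt;/p&gt;

```python
import numpy as np

def contains_ink(cell, dark_level=128, min_fraction=0.01):
    """Heuristic screen: flag a grayscale cell image as non-empty when
    more than min_fraction of its pixels are darker than dark_level.
    Both thresholds are illustrative, not tuned values."""
    dark = dark_level > cell.astype(np.int32)   # boolean mask of dark pixels
    frac = dark.mean()                          # fraction of dark pixels
    return bool(frac > min_fraction)

# toy example: a blank white cell versus one with a dark "pen stroke"
empty = np.full((30, 30), 255, dtype=np.uint8)
written = empty.copy()
written[10:20, 5:25] = 0
```

&lt;p&gt;On this toy input, contains_ink flags the written cell and passes over the blank one; on real scans, stains and bleed-through would push the thresholds up.&lt;/p&gt;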

&lt;style&gt;
table, tr, td {
    border: none;
}
&lt;/style&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;
	&lt;em&gt;Fig. 1 - A table in the most common data format. &lt;/em&gt;
&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/images/documentation/fixed_s.jpg&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="data_recovery" />
			
				<category term="digitization" />
			
				<category term="citizen_science" />
			

			<published>2019-02-07T00:00:00+00:00</published>
		</entry>
	
		<entry>
			<id>http://cobecore.org/blog/FOTO-published/</id>
			<title>The FOTO image texture R package is online</title>
			<link href="http://cobecore.org/blog/FOTO-published/" rel="alternate" type="text/html" title="The FOTO image texture R package is online" />
			<updated>2019-01-18T00:00:00+00:00</updated>

			
				
				<author>
					
						<name>Koen Hufkens</name>
					
					
					
				</author>
			
			<summary>quantifying canopy complexity using Fourier transform texture ordination...</summary>
			<content type="html" xml:base="http://cobecore.org/blog/FOTO-published/">&lt;p&gt;Within the COBECORE project a goal was to quantify the complexity of the canopy in old archival images using the FOTO (Fourier Transform Textural Ordination) method.&lt;/p&gt;

&lt;p&gt;FOTO uses a principal component analysis (PCA) on radially averaged 2D Fourier spectra to characterize (grayscale) image texture, and was first described by Couteron et al. (2005) to quantify canopy structure in relation to biomass and biodiversity.&lt;/p&gt;
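
&lt;p&gt;The radially averaged spectrum at the heart of the method can be sketched as follows. This is a minimal numpy illustration of the idea in Python, not the implementation of the foto package (which is written in R); the window size, bin count, and random test data are arbitrary:&lt;/p&gt;

```python
import numpy as np

def radial_spectrum(window, n_bins=15):
    """Radially averaged 2D Fourier power spectrum of a square window."""
    f = np.fft.fftshift(np.fft.fft2(window - window.mean()))
    power = np.abs(f) ** 2
    n = window.shape[0]
    y, x = np.indices((n, n)) - n // 2
    r = np.sqrt(x ** 2 + y ** 2).astype(int)
    # average power over rings of (roughly) constant spatial frequency
    return np.array([power[r == k].mean() for k in range(1, n_bins + 1)])

# stack the r-spectra of many windows and ordinate them with a PCA
# (here done via an SVD of the centred table of spectra)
rng = np.random.default_rng(42)
windows = [rng.random((32, 32)) for _ in range(20)]
spectra = np.array([radial_spectrum(w) for w in windows])
centred = spectra - spectra.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s  # PCA scores; FOTO maps the first few components to colours
```

&lt;p&gt;Each row of the score matrix places one image window in the texture ordination space, which is what allows windows with similar texture to be given similar colours.&lt;/p&gt;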

&lt;p&gt;To formalize this routine the approach was converted into an R package and &lt;strong&gt;peer-reviewed&lt;/strong&gt; for inclusion into the Comprehensive R Archive Network (or CRAN). As such the “foto” package can be easily installed in your R working environment using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;install.packages(&quot;foto&quot;)
library(&quot;foto&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The complete documentation and source code of the project can be found on the github page: &lt;a href=&quot;https://github.com/khufkens/foto&quot;&gt;https://github.com/khufkens/foto&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An example analysis is run below. In the resulting image, pixels with a similar colour have a similar texture. The analysis is run on a historical image of plantations near Yangambi, DR Congo, recovered within the COBECORE project. The regular pattern of planted trees is readily picked up by the algorithm.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# load the library
library(foto)

# load demo data
r &amp;lt;- raster::raster(system.file(&quot;extdata&quot;, &quot;yangambi.png&quot;,
                          package = &quot;foto&quot;,
                          mustWork = TRUE))

# classify pixels using zones (discrete steps)
output &amp;lt;- foto(r,
     plot = TRUE,
     window_size = 25,
     method = &quot;zones&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/khufkens/foto/master/docs/figure_1-1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</content>

			
				<category term="blog" />
			
			
				<category term="remote_sensing" />
			

			<published>2019-01-18T00:00:00+00:00</published>
		</entry>
	
</feed>