GSoC — Final Report

The most productive summer ever.

Kush Kothari
Aug 19, 2021

Main Project Contribution:

A complete list of all contributions is present at the end.

Phase 1: Getting used to the Retriever ecosystem

The first time I heard about Data Retriever was through the NumFOCUS GSoC archive. I immediately joined the Gitter channel and started browsing the repositories to understand the codebase. A few points are important here.

  • Script: Data Retriever uses JSON and Python files, called scripts, to store the details of how a particular dataset is supposed to be downloaded and installed.
  • Retriever-recipes: The scripts are hosted on a GitHub repo called retriever-recipes. A complete list of all datasets currently supported is found here. A fellow GSoC participant, Aakash Chaudhary, is working on adding API support to retriever, so the above statement may not be completely true at the time of reading this.
  • rdataretriever: Data retriever also has an R interface for installing the datasets.
  • retrieverdash: This is the project that I worked on for the summer. Retriever supports a large number of datasets that can be downloaded and installed. Very often the JSON/Python scripts that contain the instructions to download the files fail, for a large variety of reasons: a change in data format, a change in the URL of the dataset, issues with the dataset encoding, changes in the size of the various fields, or sometimes the server just stops responding. This is where the retrieverdash project comes in. It is meant to run on a large server with a good internet connection and plenty of space to store the datasets. retrieverdash acts as a testing pipeline for all the scripts present in the retriever project. It downloads the raw data, installs the data into databases (SQLite for tabular datasets, and Postgres for spatial datasets), and if the data has changed, it also produces diffs that tell the tester exactly what data has changed. A minimal sketch of installing a dataset through retriever's Python API follows this list.
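Here is that sketch, assuming retriever's documented install_sqlite and install_postgres helpers; the dataset names and connection details below are placeholders, not the ones used on the dashboard server.

import retriever

# Download and install a tabular dataset into a local SQLite file.
retriever.install_sqlite("iris", file="iris.sqlite")

# Install a spatial dataset into PostgreSQL (connection details are placeholders).
retriever.install_postgres(
    "usgs-elevation",
    user="postgres",
    password="secret",
    host="localhost",
    port=5432,
    database="retriever_test",
)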

Phase 2: Understanding retrieverdash in detail

Everything said in the last bullet point above applies here. At its core, retrieverdash is a Django project.

Why a Django project, you ask? Retrieverdash is an abstraction over the inner workings of retriever, so that the user can just connect to the server and see the dashboard served by the Django project.
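As a rough sketch of what that abstraction looks like (the file names and status format here are my own illustration, not the actual retrieverdash code), the Django side boils down to a view that reads the status produced by the dashboard script and renders it in a template:

# views.py -- illustrative only; the real retrieverdash view differs.
import json
from pathlib import Path

from django.shortcuts import render

# Hypothetical file written by the dashboard script after each run.
STATUS_FILE = Path("run/dataset_status.json")

def dashboard(request):
    """Render one row per dataset: name, last run, whether the install succeeded."""
    statuses = json.loads(STATUS_FILE.read_text()) if STATUS_FILE.exists() else []
    return render(request, "dashboard.html", {"datasets": statuses})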

An example of the original dashboard.

The structure of the repo looks like this:

.
├── docs
│   ├── _build
│   ├── conf.py
│   ├── developer.rst
│   ├── index.rst
│   ├── introduction.rst
│   ├── license.rst
│   ├── make.bat
│   ├── Makefile
│   └── script.rst
├── environment.yml
├── LICENSE
├── README.md
├── retrieverdash
│   ├── apps
│   ├── configs
│   ├── core
│   ├── dashboard_script
│   ├── __init__.py
│   ├── manage.py
│   ├── __pycache__
│   ├── README.rst
│   ├── requirements.txt
│   ├── retrieverdash
│   ├── run
│   ├── static
│   ├── templates
│   ├── tests
│   └── tox.ini
└── setup.py

The dashboard_script app holds the script that is responsible for running everything. This script is run every Sunday by a cronjob library called django-crontab.

Read more about cronjobs and the library here:
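For reference, registering a weekly job with django-crontab looks roughly like this in settings.py; the dotted path to the dashboard function is a placeholder, not the exact one used in retrieverdash.

# settings.py (sketch)
INSTALLED_APPS = [
    # ... the usual Django apps ...
    "django_crontab",
]

# Run the dashboard script every Sunday at midnight.
# "dashboard_script.cron.run_dashboard" is a placeholder dotted path.
CRONJOBS = [
    ("0 0 * * 0", "dashboard_script.cron.run_dashboard"),
]

The job is then written to the system crontab with python manage.py crontab add.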

Phase 3: Working with retrieverdash

In order to view the current dashboard running on the weecology server, we have to use SSH port forwarding. Use the following command:

ssh -L 8000:localhost:8000 <user>@<server_domain>

If you’re a developer and you wish to get the credentials to the server that is currently being used, please get in touch with the Data Retriever team on Gitter:

On the server, you might also want to use tmux to keep the server running in the background. Usually the tmux session is running under the name 0; just use:

retrieverdash@<servername>:~$ tmux ls
<instance name>: 1 windows (created Tue Aug 3 12:54:48 2021) [158x40]
retrieverdash@<servername>:~$ tmux attach -t <instance name>

Read more about tmux here:

Phase 4: Things I worked on over the summer

Issues:

The USGS-elevation dataset was missing a default bbox parameter that defines the bounding box of the dataset.
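For context on what the bbox controls, here is an illustrative snippet assuming the bbox keyword that newer retriever versions expose for spatial installs; the coordinate values, their order, and the connection details are all placeholders and may differ from the real API.

import retriever

# Install only the part of the dataset that intersects a bounding box.
# Coordinate order and values are illustrative, not taken from the actual script.
retriever.install_postgres(
    "usgs-elevation",
    user="postgres",
    password="secret",
    host="localhost",
    database="retriever_test",
    bbox=[-82.0, 29.0, -81.0, 30.0],
)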

For some reason, retriever was not able to find new Python scripts. While this issue was verified when testing with the GSoC mentor, it wasn’t reproducible on another computer, so the issue was closed.

The updates I made to retrieverdash surfaced the following three issues in a large number of datasets, and issues were opened to correct them.

Pull Requests:

I made the above three PRs to solve issues with scripts in retriever.

And finally, the main PR:

This PR has all the code required to add support for spatial dataset testing in retrieverdash. It can now install spatial datasets (vector or raster) to a PostgreSQL server and then create diffs for them, just like we do for tabular datasets. It has the following work done on it:

  • New logic to use the postgres engine in retriever to install postgres datasets.
  • Refined previous logic to solve some bugs and save memory on the server.
  • New logic to handle all the diffs that have to be created (a generic sketch of the idea follows this list).
  • A new view on the Django server that cleans up the UI and moves all views of a particular dataset to a separate page, thus making it easier to navigate.
  • Updated documentation for all the new changes that took place in retrieverdash.
  • Some minor cleaning up of code.
  • Addition of a new required library to requirements.txt.
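Here is that generic sketch of how such a diff can be produced (this is not the code in the PR): two line-oriented text dumps of a table are compared with Python's difflib, and the result is saved as an HTML page that a dashboard can link to.

import difflib
from pathlib import Path

def write_diff(old_dump: str, new_dump: str, out_path: str) -> bool:
    """Compare two line-oriented dumps of a dataset and write an HTML diff.

    Returns True if the dataset changed, False otherwise.
    """
    old_lines = Path(old_dump).read_text().splitlines()
    new_lines = Path(new_dump).read_text().splitlines()
    if old_lines == new_lines:
        return False
    html = difflib.HtmlDiff(wrapcolumn=80).make_file(
        old_lines, new_lines, fromdesc="previous run", todesc="current run"
    )
    Path(out_path).write_text(html)
    return True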

All these changes were regularly tested on the weecology server.

Latest images after all the above changes:

Main dataset view
View to access diffs if data has changed
View to see all diffs of a particular dataset
Diff view showing changes to the data
