GSoC — Final Report

The most productive summer ever.

Kush Kothari

7 min read · Aug 19, 2021

Main Project Contribution:

Commits · weecology/retrieverdash

github.com

A complete list of all contributions is present at the end.

Phase 1: Getting used to the Retriever ecosystem

I first heard about Data Retriever through the NumFOCUS GSoC archive. I immediately joined the Gitter channel and started browsing the repositories to understand the codebase. A few points are important here.

Script: Data Retriever uses JSON and Python files to store the details of how a particular dataset is supposed to be installed.
Retriever-recipes: The scripts are hosted in a GitHub repo called retriever-recipes. A complete list of all datasets currently supported is found here. A fellow GSoC participant, Aakash Chaudhary, is working on adding API support to retriever, so the above statement may not be completely true at the time of reading this.
rdataretriever: Data Retriever also has an R interface for installing the datasets.
retrieverdash: This is the project that I worked on for the summer. Retriever supports a large number of datasets that can be downloaded and installed. Very often, the JSON/Python scripts that contain the download instructions fail for a variety of reasons: a change in the data format, a change in the dataset's URL, issues with the dataset's encoding, changes in the size of various fields, or sometimes a server that just stops responding. This is where the retrieverdash project comes in. It is meant to run on a large server with a good internet connection and plenty of space to store the datasets. retrieverdash acts as a testing pipeline for all the scripts in the retriever project. It downloads the raw data, installs it into databases (SQLite for tabular datasets and Postgres for spatial datasets), and, if the data has changed, produces diffs that show the tester exactly what changed.
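
To make the "detect a change, then produce a diff" idea concrete, here is a minimal sketch using only the Python standard library. It is not the code retrieverdash actually runs: the function names, the sample rows, and the dataset name are all made up for illustration.

```python
import difflib
import hashlib


def content_hash(rows: list[str]) -> str:
    """Return a SHA-256 digest of the dataset's rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def dataset_diff(old_rows: list[str], new_rows: list[str], name: str) -> list[str]:
    """Return unified-diff lines, or an empty list when nothing changed."""
    # Cheap check first: identical hashes mean there is nothing to report.
    if content_hash(old_rows) == content_hash(new_rows):
        return []
    return list(
        difflib.unified_diff(
            old_rows,
            new_rows,
            fromfile=f"{name} (previous run)",
            tofile=f"{name} (current run)",
            lineterm="",
        )
    )


# Hypothetical sample data standing in for two weekly downloads.
previous = ["id,species,count", "1,oak,4", "2,pine,7"]
current = ["id,species,count", "1,oak,4", "2,pine,9"]
changes = dataset_diff(previous, current, "example-dataset")
print("\n".join(changes))
```

Comparing hashes before diffing keeps the common "nothing changed" case cheap, which matters when a run covers many large datasets.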

Phase 2: Understanding retrieverdash in detail

Everything said in the last bullet point above applies here. As for its nature, retrieverdash is mainly a Django project.

Why a Django project, you ask? retrieverdash is an abstraction over the inner workings of retriever, so that the user can just connect to the server and see the dashboard served by the Django project.

The structure of the repo looks like this:

.
├── docs
│   ├── _build
│   ├── conf.py
│   ├── developer.rst
│   ├── index.rst
│   ├── introduction.rst
│   ├── license.rst
│   ├── make.bat
│   ├── Makefile
│   └── script.rst
├── environment.yml
├── LICENSE
├── README.md
├── retrieverdash
│   ├── apps
│   ├── configs
│   ├── core
│   ├── dashboard_script
│   ├── __init__.py
│   ├── manage.py
│   ├── __pycache__
│   ├── README.rst
│   ├── requirements.txt
│   ├── retrieverdash
│   ├── run
│   ├── static
│   ├── templates
│   ├── tests
│   └── tox.ini
└── setup.py

The dashboard_script app holds the script that is responsible for running everything. It is run every Sunday as a cron job, scheduled through a library called django-crontab.
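
For readers who have not used django-crontab, a weekly job is declared in the Django settings roughly as shown below. This is a sketch, not the project's actual configuration: the dotted path to the job function is hypothetical, and the midnight run time is an assumption (the post only says "every Sunday").

```python
# settings.py (fragment)

INSTALLED_APPS = [
    "django_crontab",  # provides the `crontab` management command
    # ... the project's own apps go here ...
]

# Each entry is (cron schedule, dotted path to the callable to run).
# Cron fields: minute, hour, day of month, month, day of week (0 = Sunday).
CRONJOBS = [
    # Hypothetical path; assumed to run at 00:00 every Sunday.
    ("0 0 * * 0", "dashboard_script.cron.run_dashboard"),
]
```

After editing the setting, `python manage.py crontab add` writes the jobs into the user's crontab, `python manage.py crontab show` lists them, and `python manage.py crontab remove` clears them.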

Read more about cronjobs and the library here:

GitHub - kraiz/django-crontab: dead simple crontab powered job scheduling for django.

dead simple crontab powered job scheduling for django (1.8-2.0). install via pip: pip install django-crontab add it to…

github.com

cron - Wikipedia

The software utility also known as cron job is a time-based job scheduler in Unix-like computer operating systems…

en.wikipedia.org

A Beginners Guide To Cron Jobs - OSTechNix

Cron is one of the most useful utility that you can find in any Unix-like operating system. It is used to schedule…

ostechnix.com

Phase 3: Working with retrieverdash

To view the current dashboard running on the weecology server, we have to use port forwarding. The following command forwards port 8000 on your machine to port 8000 on the server:

ssh -L 8000:localhost:8000 <user>@<server_domain>
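
While the SSH session is open, the dashboard should be reachable at localhost:8000 on your own machine. If the page does not load, a small check like the one below can tell you whether anything is listening on the forwarded port at all. This is only a convenience sketch; the helper name is made up, and port 8000 is taken from the command above.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:8000")
    else:
        print("Nothing is listening on localhost:8000; is the SSH tunnel still open?")
```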

If you're a developer and you need credentials for the server that is currently being used, please get in touch with the data-retriever team on Gitter:

weecology/retriever

Quickly download, clean up, and install public datasets into a database management system

gitter.im

On the server, you might also want to use tmux to keep the server running in the background. Usually, a tmux session is already running under the name 0. To find it and attach to it, use:

retrieverdash@<servername>:~$ tmux ls
<instance name>: 1 windows (created Tue Aug  3 12:54:48 2021) [158x40]
retrieverdash@<servername>:~$ tmux attach -t <instance name>

Home · tmux/tmux Wiki

tmux is a terminal multiplexer. It lets you switch easily between several programs in one terminal, detach them (they…

github.com

A Quick and Easy Guide to tmux

I love working with the command line. Seriously, I think there's hardly any more productive and more versatile tool for…

www.hamvocke.com

Phase 4: Things I worked on over the summer

Issues:

Add a default bounding box for usgs-elevation · Issue #1595 · weecology/retriever

github.com

The USGS-elevation dataset was missing a default bbox parameter that defines the bounding box of the dataset.

Retriever doesn't detect new python scripts · Issue #1596 · weecology/retriever

github.com

For some reason, retriever was not able to find new Python scripts. Although the issue was confirmed while testing with the GSoC mentor, it wasn't reproducible on another computer, so it was closed.

coronavirus-belgium: Error while installing · Issue #124 · weecology/retriever-recipes

github.com

Worldclim data needs to be updated · Issue #123 · weecology/retriever-recipes

github.com

fia-* datasets are failing due to a missing file · Issue #122 · weecology/retriever-recipes

github.com

The updates I made to retrieverdash uncovered the 3 issues above, which affected a large number of datasets. Issues were opened so that they could be corrected.

Pull Requests:

Update coronavirus data by kkothari2001 · Pull Request #125 · weecology/retriever-recipes

github.com

Added csv_extend_size property to the script table of the mtbs-burn-area-boundary dataset by…

github.com

Add Covid19 surveillance data by kkothari2001 · Pull Request #120 · weecology/retriever-recipes

Adds the dataset requested in weecology/retriever#1582 New dataset is named covid-case-surveillance Tested with…

github.com

I made the above 3 PRs to fix or extend scripts in retriever.

And finally, the main PR:

Adding support for installing and creating diff of spatial dataset by kkothari2001 · Pull Request…

github.com

This PR has all the code required to add support for spatial dataset testing in retrieverdash. It can now install spatial datasets (vector or raster) to a PostgreSQL server and then create diffs for them, just like we do for tabular datasets. The PR includes the following work:

New logic to use retriever's Postgres engine to install spatial datasets.
Refined previous logic to fix some bugs and save memory on the server.
New logic to handle all the diffs that have to be created.
A new view on the Django server that cleans up the UI and moves all views of a particular dataset to a separate page, making the dashboard easier to navigate.
Updated documentation for all the new changes made to retrieverdash.
Some minor code cleanup.
Addition of a new required library to requirements.txt.
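
As a rough illustration of the first item in the list above, the sketch below shows one way to route a dataset to the right engine. It is not the code from the PR. The helper names are made up, and it assumes retriever exposes `install_postgres` and `install_sqlite` at the top level of its Python package; check the documentation of your installed version for the exact keyword arguments.

```python
def installer_name(is_spatial: bool) -> str:
    """Spatial datasets go to Postgres; tabular ones go to SQLite."""
    return "install_postgres" if is_spatial else "install_sqlite"


def install_dataset(dataset: str, is_spatial: bool, **engine_options) -> None:
    """Install one dataset with the engine that matches its type."""
    # Imported lazily so the routing logic above can be read and tested
    # without retriever (or a running Postgres server) being available.
    import retriever as rt

    # Assumption: both installers take the dataset name as the first
    # argument and accept engine settings as keyword arguments.
    installer = getattr(rt, installer_name(is_spatial))
    installer(dataset, **engine_options)
```

A call for a spatial dataset might then pass connection settings such as the host, port, and user as keyword arguments, while a tabular dataset would only need the path to the SQLite file.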

All these changes were regularly tested on the weecology server.

Latest images after all the above changes:
