sdmdl: Species Distribution Modelling with Deep Learning
=========================================================

**sdmdl** is an object-oriented Python package for species distribution modelling (SDM) using deep neural networks (DNNs). It provides a high-level interface for modelling species' environmental preferences across many abiotic and biotic variables, training binary classification DNNs with dropout regularisation, and generating global distribution predictions. The package was built to maximise ease of use while still offering in-depth parameter control.

The main entry point is a single ``sdmdl`` class with four methods that cover the complete workflow:

.. code:: python

    from sdmdl.sdmdl_main import sdmdl

    model = sdmdl('/path/to/repository_root')
    model.prep()     # data preparation
    model.train()    # model training
    model.predict()  # distribution prediction
    model.clean()    # remove temporary files

Further customisation is available through a ``config.yml`` file that controls model hyper-parameters (see `Configuration`_).

Installation
---------------------------------------------------------

Prerequisites
^^^^^^^^^^^^^

GDAL must be installed as a system dependency, including the GDAL Python bindings. On Ubuntu/Debian:

.. code:: bash

    sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
    sudo apt-get update
    sudo apt-get install gdal-bin libgdal-dev
    pip install GDAL==$(gdal-config --version)

Install sdmdl
^^^^^^^^^^^^^

Clone the repository and install:

.. code:: bash

    git clone https://github.com/naturalis/sdmdl.git
    cd sdmdl
    pip install .

The Python dependencies listed in ``requirements.txt`` will be installed automatically.

Input requirements
---------------------------------------------------------

To create an ``sdmdl`` object and subsequently train models, the following inputs are required:

1. **Environmental raster layers** (``.tif``) placed in the appropriate directories:

   - ``data/gis/layers/scaled/``: layers that need to be standardised during preprocessing.
   - ``data/gis/layers/non-scaled/``: layers that are already normalised or categorical (e.g. 0 = absent, 1 = present).

   Example environmental rasters are available on Zenodo.

   .. note::

      All environmental layers **must** share the same affine transformation and resolution. This includes the bundled ``empty_land_map.tif`` in ``data/gis/layers/``. If you supply your own rasters, ensure they match the affine transformation and resolution of ``empty_land_map.tif`` (or vice versa).

2. **Occurrence tables** (``.csv``, ``.xls``, or ``.xlsx``) placed in ``data/occurrences/``. Each table must contain two required columns:

   - ``decimalLatitude`` (or ``decimallatitude``): latitude for each occurrence.
   - ``decimalLongitude`` (or ``decimallongitude``): longitude for each occurrence.

   Coordinates must be in the WGS 84 coordinate system. Example occurrence datasets are available on Zenodo.

   .. warning::

      Occurrence coordinates are **not** validated before data preparation. Incorrect data types (non-numerical values) or coordinates outside the spatial extent of the raster files will cause errors.

Directory layout summary:

- Scaled ``.tif`` layers → ``data/gis/layers/scaled/``
- Non-scaled ``.tif`` layers → ``data/gis/layers/non-scaled/``
- Occurrence tables → ``data/occurrences/``

Configuration
---------------------------------------------------------

A ``config.yml`` file is generated automatically in the ``data/`` directory on first use. It stores detected raster files, detected occurrence files, and model parameters:

- **random_seed** (*int*): seed for reproducibility (default: 42).
- **pseudo_freq** (*int*): number of pseudo-absence samples (default: 2000).
- **batchsize** (*int*): training batch size (default: 75).
- **epoch** (*int*): number of training epochs (default: 150).
- **model_layers** (*list of int*): nodes per hidden layer; adding items deepens the network (default: ``[250, 200, 150, 100]``).
- **model_dropout** (*list of float*): dropout rate per hidden layer; 0 = no dropout, 1 = full dropout (default: ``[0.3, 0.5, 0.3, 0.5]``).
- **verbose** (*bool*): if ``True``, display progress bars (default: ``True``).

Data paths and detected files can also be customised in ``config.yml``.

.. note::

   Changes to ``config.yml`` are **not** picked up automatically. A new ``sdmdl`` object must be created for changes to take effect.

Usage example
---------------------------------------------------------

**Step 1:** Create an ``sdmdl`` object:

.. code:: python

    from sdmdl.sdmdl_main import sdmdl

    model = sdmdl('/path/to/repository_root')

**Step 2:** Prepare data (presence maps, raster stack, pseudo-absences, training and prediction datasets):

.. code:: python

    model.prep()

**Step 3:** Train deep neural network models for each species:

.. code:: python

    model.train()

**Step 4:** Predict global species distributions:

.. code:: python

    model.predict()

**Step 5:** Remove temporary intermediate files:

.. code:: python

    model.clean()

Outputs
---------------------------------------------------------

Data preparation (``prep``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Several temporary intermediate files are created and used as inputs for training and prediction.

Training (``train``)
^^^^^^^^^^^^^^^^^^^^

- **Performance metrics**: ``results/_DNN_performance/DNN_eval.txt`` contains per-species accuracy, loss, AUC, true positive rate, and 95 % confidence intervals.
- **Model files**: for each species, a ``.h5`` (weights) and ``.json`` (architecture) file is saved under ``results//``.
- **Feature importance**: a SHAP-based feature impact plot (``.png``) per species, saved under ``results//``.

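
After ``train()`` has finished, it can be useful to check which model files were actually written. The following is a minimal sketch and not part of sdmdl: ``find_model_files`` is a hypothetical helper, and it assumes that each species has its own sub-directory under ``results/`` holding a ``.h5`` and a ``.json`` file.

.. code:: python

    from pathlib import Path


    def find_model_files(results_dir):
        """Map sub-directory names to their (.h5, .json) file pair.

        Assumption: one sub-directory per species under ``results_dir``.
        Sub-directories missing either file type are skipped, and a
        missing ``results_dir`` yields an empty dictionary.
        """
        root = Path(results_dir)
        if not root.is_dir():
            return {}

        found = {}
        for sub in sorted(root.iterdir()):
            if not sub.is_dir():
                continue
            weights = sorted(sub.glob("*.h5"))         # weight files
            architecture = sorted(sub.glob("*.json"))  # architecture files
            if weights and architecture:
                # Keep the first match of each type for this sub-directory.
                found[sub.name] = (weights[0], architecture[0])
        return found

Calling ``find_model_files('/path/to/repository_root/results')`` returns a dictionary keyed by sub-directory name. Directories that hold no model files, such as one containing only performance metrics, do not appear in the result.
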

Prediction (``predict``)
^^^^^^^^^^^^^^^^^^^^^^^^

- **Prediction map**: a colour-mapped ``.png`` visualisation of the predicted distribution per species, saved under ``results//``.
- **Prediction raster**: a GeoTIFF (``.tif``) with the predicted probability of presence (0–1) per species, saved under ``results//``.

Package architecture
---------------------------------------------------------

The package is organised into the following modules:

- ``sdmdl.sdmdl_main``: the main ``sdmdl`` class orchestrating the full workflow.
- ``sdmdl.sdmdl.config``: ``Config`` class for managing ``config.yml``.
- ``sdmdl.sdmdl.occurrences``: ``Occurrences`` class for managing occurrence data.
- ``sdmdl.sdmdl.gis``: ``GIS`` class for managing raster layer paths and output locations.
- ``sdmdl.sdmdl.trainer``: ``Trainer`` class for DNN training and evaluation.
- ``sdmdl.sdmdl.predictor``: ``Predictor`` class for generating distribution predictions.
- ``sdmdl.sdmdl.data_prep``: data preparation sub-package with the following modules:

  - ``presence_map``: creates per-species presence rasters.
  - ``raster_stack``: stacks environmental layers into a single GeoTIFF.
  - ``presence_pseudo_absence``: samples pseudo-absence locations.
  - ``band_statistics``: computes band-wise mean and standard deviation.
  - ``training_data``: prepares per-species training datasets.
  - ``prediction_data``: prepares the global prediction dataset.

References
---------------------------------------------------------

This package implements the deep-learning SDM approach described in:

Rademaker, M., Hogeweg, L., & Vos, R. (2019). *Modelling the niches of wild and domesticated Ungulate species using deep learning.* bioRxiv, 744441. doi:10.1101/744441

Related repositories:

- Comparative analysis of abiotic niches in Ungulates, by E. Hendrix.
- Ecological Niche Modelling Using Deep Learning, by M. Rademaker.

License
---------------------------------------------------------

This project is licensed under the MIT License.
Copyright © 2019 Naturalis Biodiversity Center.