A standalone web service that pushes data files from a CKAN site resources into its DataStore
ckan/datapusher
DataPusher is a standalone web service that automatically downloads any tabular data files like CSV or Excel from a CKAN site's resources when they are added to the CKAN site, parses them to pull out the actual data, then uses the DataStore API to push the data into the CKAN site's DataStore.
This makes the data from the resource files available via CKAN's DataStore API. In particular, many of CKAN's data preview and visualization plugins will only work (or will work much better) with files whose contents are in the DataStore.
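
As a rough illustration of the parse-then-push flow described above, the sketch below turns CSV text into a `fields`/`records` structure of the kind the DataStore API works with. It is not DataPusher's own code, which relies on Messytables for parsing and type guessing. The helper name `build_datastore_payload`, the resource id, and the choice to type every column as `text` are assumptions made for the example.

```python
import csv
import io


def build_datastore_payload(resource_id, csv_text):
    """Parse CSV text into a dict shaped like a DataStore create request.

    Hypothetical helper for illustration only: every column is typed as
    'text', whereas DataPusher guesses column types with Messytables.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [dict(row) for row in reader]
    # fieldnames is None when the input is empty, hence the fallback.
    fields = [{"id": name, "type": "text"} for name in (reader.fieldnames or [])]
    return {"resource_id": resource_id, "fields": fields, "records": records}


payload = build_datastore_payload(
    "example-resource-id", "city,population\nOslo,700000\n"
)
print(payload["fields"])
print(payload["records"])
```

Nothing is sent over the network here; the sketch only shows the shape of the data that ends up in the DataStore.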
To get it working you have to:
- Deploy a DataPusher instance to a server (or use an existing DataPusher instance)
- Enable and configure the datastore plugin on your CKAN site.
- Enable and configure the datapusher plugin on your CKAN site.
Note that if you installed CKAN using the package install option then a DataPusher instance should be automatically installed and configured to work with your CKAN site.
DataPusher is built using CKAN Service Provider and Messytables.
The original author of DataPusher was Dominik Moritz (dominik.moritz@okfn.org). For the current list of contributors see github.com/ckan/datapusher/contributors
Install the required packages::

    sudo apt-get install python-dev python-virtualenv build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev

Get the code::

    git clone https://github.com/ckan/datapusher
    cd datapusher

Install the dependencies::

    pip install -r requirements.txt
    pip install -r requirements-dev.txt
    pip install -e .

Run the DataPusher::

    python datapusher/main.py deployment/datapusher_settings.py

By default DataPusher should be running at the following port:

    http://localhost:8800/

If you need to change the host or port, copy deployment/datapusher_settings.py to deployment/datapusher_local_settings.py and modify the file to suit your needs. Also, if running a production setup, make sure that the host and port match the http settings in the uWSGI configuration.
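
For example, the two lines you would change in the copied deployment/datapusher_local_settings.py might look like the fragment below. The host and port values are illustrative assumptions, not recommendations, and the rest of the copied file stays as it is.

```python
# deployment/datapusher_local_settings.py
# (a copy of datapusher_settings.py; only the changed lines are shown)

# Listen on the loopback interface only (illustrative value).
HOST = '127.0.0.1'

# Use a non-default port (illustrative value).
PORT = 8801
```

Following the pattern of the run command above, you would then pass deployment/datapusher_local_settings.py to datapusher/main.py instead of the default settings file.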
To run the tests:

    pytest

Note: If you installed CKAN via a package install, the DataPusher has already been installed and deployed for you. You can skip directly to the Configuring section.
These instructions assume you already have CKAN installed on this server in the default location described in the CKAN install documentation (/usr/lib/ckan/default). If this is correct you should be able to run the following commands directly; if not, you will need to adapt the previous path to your needs.
These instructions set up the DataPusher web service on uWSGI running on port 8800, but can be easily adapted to other WSGI servers like Gunicorn. You'll probably need to set up Nginx as a reverse proxy in front of it and something like Supervisor to keep the process up.

    # Install requirements for the DataPusher
    sudo apt install python3-venv python3-dev build-essential
    sudo apt-get install python-dev python-virtualenv build-essential libxslt1-dev libxml2-dev git libffi-dev

    # Create a virtualenv for datapusher
    sudo python3 -m venv /usr/lib/ckan/datapusher

    # Create a source directory and switch to it
    sudo mkdir /usr/lib/ckan/datapusher/src
    cd /usr/lib/ckan/datapusher/src

    # Clone the source (you should target the latest tagged version)
    sudo git clone -b 0.0.17 https://github.com/ckan/datapusher.git

    # Install the DataPusher and its requirements
    cd datapusher
    sudo /usr/lib/ckan/datapusher/bin/pip install -r requirements.txt
    sudo /usr/lib/ckan/datapusher/bin/python setup.py develop

    # Create a user to run the web service (if necessary)
    sudo addgroup www-data
    sudo adduser -G www-data www-data

    # Install uWSGI
    sudo /usr/lib/ckan/datapusher/bin/pip install uwsgi

At this point you can run DataPusher with the following command:

    /usr/lib/ckan/datapusher/bin/uwsgi -i /usr/lib/ckan/datapusher/src/datapusher/deployment/datapusher-uwsgi.ini

Note: If you are installing the DataPusher in a different location than the default one, you need to adapt the relevant paths in the datapusher-uwsgi.ini to the ones you are using. Also you might need to change the uid and guid settings when using a different user.
The default DataPusher configuration uses SQLite as the backend for the jobs database and a single uWSGI thread. To increase performance and concurrency you can configure DataPusher in the following way:
- Use Postgres as database backend, which will allow concurrent writes (and provide a more reliable backend anyway). To use Postgres, create a user and a database and update the SQLALCHEMY_DATABASE_URI setting accordingly:

      # This assumes DataPusher is already installed
      sudo apt-get install postgresql libpq-dev
      sudo -u postgres createuser -S -D -R -P datapusher_jobs
      sudo -u postgres createdb -O datapusher_jobs datapusher_jobs -E utf-8

      # Run this in the virtualenv where DataPusher is installed
      pip install psycopg2

      # Edit SQLALCHEMY_DATABASE_URI in datapusher_settings.py accordingly
      # eg SQLALCHEMY_DATABASE_URI=postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs

- Start more uWSGI threads. In the deployment/datapusher-uwsgi.ini file, set workers and threads to a value that suits your needs, and add the lazy-apps=true setting to avoid concurrency issues with SQLAlchemy, eg:

      # ... rest of datapusher-uwsgi.ini
      workers = 3
      threads = 3
      lazy-apps = true

Add datapusher to the plugins in your CKAN configuration file (generally located at /etc/ckan/default/production.ini or /etc/ckan/default/ckan.ini):

    ckan.plugins = <other plugins> datapusher

In order to tell CKAN where this web service is located, the following must be added to the [app:main] section of your CKAN configuration file:

    ckan.datapusher.url = http://127.0.0.1:8800/

Starting from CKAN 2.10, DataPusher requires a valid API token to operate (see the documentation on API tokens), and will fail to start if the following option is not set:

    ckan.datapusher.api_token = <api_token>

There are other CKAN configuration options that allow you to customize the CKAN - DataPusher integration. Please refer to the DataPusher Settings section in the CKAN documentation for more details.
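
If you want to double-check these settings before restarting CKAN, a small script along the following lines could help. This is a sketch only and not part of CKAN or DataPusher: the function name `check_datapusher_config` is hypothetical, it uses Python's standard `configparser` module, and it treats the API token as required, which per the text above applies from CKAN 2.10 onwards.

```python
import configparser

REQUIRED_OPTIONS = ("ckan.datapusher.url", "ckan.datapusher.api_token")


def check_datapusher_config(ini_text):
    """Return a list of problems found in a CKAN configuration, empty if none."""
    # Interpolation is disabled so values containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("app:main"):
        return ["missing [app:main] section"]
    main = parser["app:main"]
    problems = []
    plugins = main.get("ckan.plugins", fallback="").split()
    for plugin in ("datastore", "datapusher"):
        if plugin not in plugins:
            problems.append("plugin not enabled: " + plugin)
    for option in REQUIRED_OPTIONS:
        if not main.get(option, fallback="").strip():
            problems.append("option not set: " + option)
    return problems


example = """
[app:main]
ckan.plugins = stats datastore datapusher
ckan.datapusher.url = http://127.0.0.1:8800/
"""
print(check_datapusher_config(example))
```

To check a real file you would read its contents first, for instance from /etc/ckan/default/ckan.ini, and pass the text to the function.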
The DataPusher instance is configured in the deployment/datapusher_settings.py file. Here's a summary of the options available.
| Name | Default | Description |
|---|---|---|
| HOST | '0.0.0.0' | Web server host |
| PORT | 8800 | Web server port |
| SQLALCHEMY_DATABASE_URI | 'sqlite:////tmp/job_store.db' | SQLAlchemy Database URL. See note about database backend below. |
| MAX_CONTENT_LENGTH | '1024000' | Max size of files to process in bytes |
| CHUNK_SIZE | '16384' | Chunk size when processing the data file |
| CHUNK_INSERT_ROWS | '250' | Number of records sent to the DataStore in each request |
| DOWNLOAD_TIMEOUT | '30' | Download timeout for requesting the file |
| SSL_VERIFY | False | Do not validate SSL certificates when requesting the data file (Warning: Do not use this setting in production) |
| TYPES | [messytables.StringType, messytables.DecimalType, messytables.IntegerType, messytables.DateUtilType] | Messytables types used internally, can be modified to customize the type guessing |
| TYPE_MAPPING | {'String': 'text', 'Integer': 'numeric', 'Decimal': 'numeric', 'DateUtil': 'timestamp'} | Internal Messytables type mapping |
| LOG_FILE | /tmp/ckan_service.log | Where to write the logs. Use an empty string to disable |
| STDERR | True | Log to stderr? |
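
To make the role of CHUNK_INSERT_ROWS more concrete, the sketch below splits a stream of rows into batches of at most that many records, which is the general idea behind sending records in several requests rather than one. It is an illustration under assumptions, not DataPusher's implementation: the `chunked` helper is hypothetical, and the value is written as an integer here although the table lists it as a string.

```python
from itertools import islice

CHUNK_INSERT_ROWS = 250  # default from the table above, as an integer


def chunked(rows, size=CHUNK_INSERT_ROWS):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# 600 rows end up in three batches; each batch would be one request.
batches = list(chunked(range(600)))
print([len(batch) for batch in batches])
```

A smaller value means more, lighter requests; a larger value means fewer, heavier ones.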
Most of the configuration options above can also be provided as environment variables by prepending DATAPUSHER_ to the name, eg DATAPUSHER_SQLALCHEMY_DATABASE_URI, DATAPUSHER_PORT, etc. In the specific case of DATAPUSHER_STDERR the possible values are 1 and 0.
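
The naming rule can be sketched as follows. This is a minimal illustration of the DATAPUSHER_ prefix convention and the 1/0 handling for STDERR, not the code DataPusher itself uses; `setting_from_env` is a hypothetical helper, and note that values read from the environment come back as strings.

```python
import os


def setting_from_env(name, default, environ=None):
    """Look up DATAPUSHER_<name> in the environment, else return `default`."""
    environ = os.environ if environ is None else environ
    value = environ.get("DATAPUSHER_" + name)
    if value is None:
        return default
    if name == "STDERR":
        # DATAPUSHER_STDERR is documented as taking the values 1 and 0.
        return value == "1"
    return value


# A stand-in for os.environ so the example does not depend on your shell.
fake_env = {"DATAPUSHER_PORT": "8801", "DATAPUSHER_STDERR": "0"}
print(setting_from_env("PORT", 8800, fake_env))
print(setting_from_env("STDERR", True, fake_env))
print(setting_from_env("HOST", "0.0.0.0", fake_env))
```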
By default, DataPusher uses SQLite as the database backend for jobs information. This is fine for local development and sites with low activity, but for sites that need more performance, Postgres should be used as the backend for the jobs database (eg SQLALCHEMY_DATABASE_URI=postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs). See also High Availability Setup. If SQLite is used, it's probably a good idea to store the database in a location other than /tmp. This will prevent the database from being dropped, which causes out-of-sync errors on the CKAN side. A good place to store it is the CKAN storage folder (if DataPusher is installed on the same server), generally in /var/lib/ckan/.
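
The two kinds of URI can be put together as in the sketch below. The helper names and the SQLite file name under /var/lib/ckan/ are assumptions for illustration. The password is percent-encoded with `urllib.parse.quote_plus` so that characters such as `@` or `/` do not break the URI.

```python
from urllib.parse import quote_plus


def postgres_uri(user, password, host, database):
    """Build a postgresql:// URI, percent-encoding the password."""
    return "postgresql://{}:{}@{}/{}".format(
        user, quote_plus(password), host, database
    )


def sqlite_uri(absolute_path):
    """Build a sqlite:// URI for an absolute file path (four slashes in total)."""
    return "sqlite:///" + absolute_path


# Hypothetical file name inside the CKAN storage folder.
print(sqlite_uri("/var/lib/ckan/datapusher_jobs.db"))
print(postgres_uri("datapusher_jobs", "p@ss/word", "localhost", "datapusher_jobs"))
```

Either string can then be used as the value of SQLALCHEMY_DATABASE_URI.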
An attempt will be made to load any file that has one of the supported formats (defined in ckan.datapusher.formats) into the DataStore.
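
The check amounts to a membership test against the configured list. The sketch below assumes the setting is a space-separated list and that the comparison ignores case; both are assumptions for illustration, as are the helper name and the example format list.

```python
def is_supported_format(resource_format, formats_setting):
    """Return True if a resource's format appears in a space-separated list."""
    supported = {fmt.lower() for fmt in formats_setting.split()}
    return resource_format.strip().lower() in supported


# Illustrative value for ckan.datapusher.formats, not the real default.
formats = "csv xls xlsx tsv"
print(is_supported_format("CSV", formats))
print(is_supported_format("pdf", formats))
```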
You can also manually trigger resources to be resubmitted. When editing a resource in CKAN (clicking the "Manage" button on a resource page), a new tab named "DataStore" will appear. This will contain a log of the last attempted upload and a button to retry the upload.
Run the following command to submit all resources to DataPusher. It will skip resources whose data file hash has not changed:

    ckan -c /etc/ckan/default/ckan.ini datapusher resubmit

On CKAN<=2.8:

    paster --plugin=ckan datapusher resubmit -c /etc/ckan/default/ckan.ini

To resubmit a specific resource, whether or not the hash of the data file has changed::

    ckan -c /etc/ckan/default/ckan.ini datapusher submit {dataset_id}

On CKAN<=2.8:

    paster --plugin=ckan datapusher submit <pkgname> -c /etc/ckan/default/ckan.ini

This material is copyright (c) 2020 Open Knowledge Foundation and other contributors

It is open and licensed under the GNU Affero General Public License (AGPL) v3.0 whose full text may be found at:
