Manually Running Dumps
Data dumps are introduced at
Successful data dumps are transferred to
Data dumps should be created on ol-home0 within the openlibrary-cron-jobs-1 Docker container.
- That container uses
to submit the cron jobs.
- The jobs are defined in
Data dumps (e.g. ol_dump.txt.gz) may be manually regenerated on ol-home0 within the openlibrary-cron-jobs-1 Docker container:
Run an out-of-cycle Open Library Data Dump (Aug. 2022)
- Log into the host
- tmux  # The data dumps are a long-running process, and tmux enables reconnecting to a host that has been disconnected
- cd /opt/openlibrary
- docker ps  # To ensure that openlibrary-cron-jobs-1 is up and running
- docker exec -it -uroot openlibrary-cron-jobs-1 bash
- crontab -l | less  # to see the ol data dumps command
- ls /1/var/tmp/dumps  # to see if there are data files that should be deleted
  - We kept the raw database dump
  - We rm -r oldumpsort because we wanted to rebuild that
  - We replaced the date logic with a date string
  - We removed
    to skip some early steps like extracting data.txt.gz from postgres
- cd /opt/openlibrary  # just to be sure
- PSQL_PARAMS='-h ol-db1 openlibrary' TMPDIR='/1/var/tmp' OL_CONFIG='/olsystem/etc/openlibrary.yml' su openlibrary -c "/openlibrary/scripts/ 2022-07-31 --archive"
- Debug with
  and also with zcat /1/var/tmp/dumps_2022-07-31.txt.gz | head | less
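For spot-checking a dump from a Python session rather than the shell, the following is a minimal sketch of the same `zcat ... | head` idea. The function name `head_gz` is hypothetical and not part of the Open Library scripts.

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file.

    Roughly equivalent to `zcat path | head -n N`. Only the first n lines
    are decompressed, so this stays cheap even for very large dumps.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Calling `head_gz("/1/var/tmp/dumps_2022-07-31.txt.gz")` would return the first ten rows as a list of strings, assuming the file exists at that path.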
Examine the dump process logs
- Log into the host
docker logs openlibrary-cron-jobs-1 2>&1 | grep openlibrary.dump | less
- Or to follow the logs during the process:
docker logs openlibrary-cron-jobs-1 --follow
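If the log output has already been captured (for example, piped into a Python script), the `grep openlibrary.dump` step can be approximated as below. This is a sketch only. `filter_log_lines` is a hypothetical helper, and it does plain substring matching, whereas `grep` treats the `.` as a wildcard.

```python
def filter_log_lines(lines, needle="openlibrary.dump"):
    """Yield only the lines that contain `needle` as a literal substring."""
    for line in lines:
        if needle in line:
            yield line


# Example use with piped input (assumed invocation):
#   docker logs openlibrary-cron-jobs-1 2>&1 | python this_script.py
# where the script ends with:
#   import sys
#   sys.stdout.writelines(filter_log_lines(sys.stdin))
```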
Related Issues
- cron is presently broken
- fix for October 2021-10
See original by @gdamdam at:
How it Works
Dumping the DB
The first step is dumping the data table from ol-db1. This task requires around 1 hour to complete.
you@ol-home:/1/var/tmp$ psql -h ol-db1 -U openlibrary openlibrary -c "copy data to stdout" | gzip -c > data.txt.gz
Generate Metadata table dump from archive db
This task will also require ~1 hour to complete. Change the filename dates accordingly:
you@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate # Activate virtual environment
(venv)you@ol-home:/1/var/tmp$ ARCHIVE_DB_PASSWORD=`/opt/.petabox/dbserver`
(venv)you@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2012/ --host db-current --user archive --password $ARCHIVE_DB_PASSWORD --database archive | gzip -c > ia_metadata_dump_2015-03-11.txt.gz
Generate Revision Dump
This will create a dump of all revisions of all documents and takes around 8 hours to complete:
you@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate # Activate virtual environment
(venv)you@ol-home:/1/var/tmp$ /opt/openlibrary/openlibrary/scripts/ cdump data.txt.gz 2015-03-11 | gzip -c > ol_cdump.txt.gz
(venv)you@ol-home:/1/var/tmp$ rm data.txt.gz
Generate Latest Revision Dump
Generate the dump of latest revisions of all documents. This task requires around 6 hours to complete.
you@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate # Activate virtual environment
(venv)you@ol-home:/1/var/tmp$ gzip -cd ol_cdump.txt.gz | python /opt/openlibrary/openlibrary/scripts/ sort --tmpdir /1/var/tmp | python /opt/openlibrary/openlibrary/scripts/ dump | gzip -c > ol_dump_2015-03-11.txt.gz
(venv)you@ol-home:/1/var/tmp$ rm -rf /1/var/tmp/oldumpsort
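The sort-then-dump pipeline above reduces the all-revisions dump to one row per document. A minimal sketch of that idea follows. It assumes each row is tab-separated as type, key, revision, last_modified, JSON, which this page does not state. It also sorts in memory, whereas the real sort step appears to spill to disk (hence `--tmpdir` and the `oldumpsort` directory). The function names are hypothetical.

```python
from itertools import groupby


def parse_row(line):
    """Split one dump row into (type, key, revision, last_modified, json_text).

    Assumes five tab-separated columns; the JSON column is kept as raw text.
    """
    type_, key, revision, last_modified, json_text = line.rstrip("\n").split("\t", 4)
    return type_, key, int(revision), last_modified, json_text


def latest_revisions(lines):
    """Yield, for each key, the row with the highest revision number.

    Rows are first sorted by key so that all revisions of a document are
    adjacent, then the maximum revision in each group is kept.
    """
    rows = sorted((parse_row(line) for line in lines), key=lambda r: r[1])
    for _key, group in groupby(rows, key=lambda r: r[1]):
        yield max(group, key=lambda r: r[2])
```

Sorting before grouping is what makes a single streaming pass possible. With on-disk sorting the same approach scales to dumps that do not fit in memory.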
Splitting Dumps
Split the dump into authors, editions, works, and redirects:
you@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate # Activate virtual environment
(venv)you@ol-home:/1/var/tmp$ gzip -cd ol_dump_2015-03-11.txt.gz | python /opt/openlibrary/openlibrary/scripts/ split --format ol_dump_%s_2015-03-11.txt.gz
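To illustrate how the `--format ol_dump_%s_2015-03-11.txt.gz` pattern maps rows to output files, here is a rough sketch. It assumes the first tab-separated column of each row is the record type, and the type-to-name mapping below is a guess for illustration, not taken from the actual script.

```python
import gzip

# Hypothetical mapping from record type to the name substituted for %s.
TYPE_TO_NAME = {
    "/type/author": "authors",
    "/type/edition": "editions",
    "/type/work": "works",
    "/type/redirect": "redirects",
}


def split_dump(src_path, fmt):
    """Copy each row of src_path into fmt % name, chosen by the row's type.

    Returns the sorted list of output paths that were written. Rows whose
    type is not in TYPE_TO_NAME are skipped in this sketch.
    """
    outputs = {}
    try:
        with gzip.open(src_path, "rt", encoding="utf-8") as src:
            for line in src:
                name = TYPE_TO_NAME.get(line.split("\t", 1)[0])
                if name is None:
                    continue
                if name not in outputs:
                    outputs[name] = gzip.open(fmt % name, "wt", encoding="utf-8")
                outputs[name].write(line)
    finally:
        # Close every output file, even if reading the source failed.
        for f in outputs.values():
            f.close()
    return sorted(fmt % name for name in outputs)
```

For example, `split_dump("ol_dump_2015-03-11.txt.gz", "ol_dump_%s_2015-03-11.txt.gz")` would write one file per type present in the input.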
Generating Denormalized Works Dump
XXX: This script raises exceptions!
Each denormalized work dump record (row) is a JSON document with the following fields:
- work – The work document
- editions – List of editions that belong to this work
- authors – All the authors of this work
- ia – IA metadata for all the IA items referenced in the editions, as a list
- duplicates – Dictionary of duplicates (key -> its duplicates) of the work and edition docs mentioned above
you@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate # Activate virtual environment
(venv)you@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2011/09/ ol_dump_2015-03-11.txt.gz ia_metadata_dump_2015-03-11.txt.gz | gzip -c > ol_dump_deworks_2015-01-11.txt.gz
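As a rough picture of the record shape described above, the sketch below assembles one denormalized work row. All sample values are invented for illustration, the helper name is hypothetical, and representing each duplicates entry as a list of keys is an assumption.

```python
import json


def make_denormalized_work(work, editions, authors, ia, duplicates):
    """Assemble one denormalized work record with the five documented fields."""
    return {
        "work": work,
        "editions": editions,
        "authors": authors,
        "ia": ia,
        "duplicates": duplicates,
    }


# Invented sample data; real documents carry many more fields.
record = make_denormalized_work(
    work={"key": "/works/OL1W", "title": "Example Work"},
    editions=[{"key": "/books/OL1M", "ocaid": "exampleitem00"}],
    authors=[{"key": "/authors/OL1A", "name": "Example Author"}],
    ia=[{"identifier": "exampleitem00"}],
    duplicates={"/works/OL1W": ["/works/OL2W"]},
)

# Each record is serialized as a single JSON document per row.
row = json.dumps(record)
```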
Verify Dumps
you@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate # Activate virtual environment
(venv)you@ol-home:/1/var/tmp$ ls -lh
ia_metadata_dump_2015-03-11.txt.gz ol_dump_2015-03-11.txt.gz
ol_dump_redirects_2015-03-11.txt.gz ol_dump_authors_2015-03-11.txt.gz
ol_dump_deworks_2015-01-11.txt.gz ol_dump_editions_2015-03-11.txt.gz
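Beyond eyeballing the `ls -lh` output, a quick integrity check can catch empty or truncated files. The following is a minimal sketch with hypothetical function names. It only confirms that each file exists, is non-empty, and that its first line decompresses, so it does not validate the full contents.

```python
import gzip
import os
import zlib


def check_dump(path):
    """Return True if path is a non-empty gzip file whose first line decompresses."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as f:
            return bool(f.readline())
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, truncated stream, or corrupt compressed data.
        return False


def check_dumps(paths):
    """Map each path to the result of check_dump."""
    return {path: check_dump(path) for path in paths}
```

For instance, `check_dumps(["ol_dump_2015-03-11.txt.gz", "ol_dump_authors_2015-03-11.txt.gz"])` would return a dictionary flagging any file that fails the check.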