
The Import Pipeline

Watch this video about how the Open Library import pipeline works. Staff should also see these import notes.

OpenLibrary.org offers several "Public Import API Endpoints" that can be used to submit book data for import, including one for MARC records, one for raw JSON book records (/api/import), and one for importing directly against existing partner items (like archive.org items) by ID (/api/import/ia).
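
For illustration, here is a minimal sketch of submitting a JSON record to /api/import using Python's requests library; the field names and the placeholder session cookie are assumptions, not the full schema or authentication flow:

import requests

# Minimal sketch: POST one JSON book record to the public import endpoint.
# Field names are illustrative, and the request must carry valid Open Library
# credentials (shown here as a placeholder session cookie).
record = {
    "title": "Example Title",
    "authors": [{"name": "Example Author"}],
    "publishers": ["Example Publisher"],
    "publish_date": "2001",
    "source_records": ["partner:example-identifier"],
}
response = requests.post(
    "https://openlibrary.org/api/import",
    json=record,
    cookies={"session": "<placeholder-session-cookie>"},
)
print(response.status_code, response.text)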

Outside of these public API endpoints, Open Library also maintains a bulk batch import system for enqueueing JSON book data in bulk from book sources like betterworldbooks, amazon, and other trusted book providers (like librivox and standardebooks). These bulk batch imports ultimately submit records (in a systematic, rate-limited way) to the "Public Import API Endpoints", e.g. /api/import.
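
As a rough sketch of what enqueueing looks like (assuming the Batch helper in openlibrary/core/imports.py exposes find/new/add_items the way the existing batch scripts use it; names and identifiers below are illustrative):

# Hedged sketch of a bulk batch script enqueueing records for later import.
from openlibrary.core.imports import Batch

batch_name = "examplepartner-2021-07"  # hypothetical batch name
batch = Batch.find(batch_name) or Batch.new(batch_name)

# Each item pairs a source identifier with the raw JSON record to import later.
batch.add_items([
    {
        "ia_id": "partner:example-identifier",  # hypothetical identifier
        "data": {
            "title": "Example Title",
            "authors": [{"name": "Example Author"}],
            "source_records": ["partner:example-identifier"],
        },
    }
])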

Once a record passes through our bulk batch import process and/or gets submitted to one of our "Public Import API Endpoints" (e.g. /api/import, see code), the data is then parsed, augmented, and validated by the "Validator" in importapi/import_edition_builder.py.

Next, the formatted and validated book_edition goes through the "Import Processor", invoked as catalog.add_book.load(book_edition). This function has three possible paths: it tries to find an existing matching edition and its work, and either (1) no edition/work is found and a new edition is created, (2) a matching edition is found and no new data is available, or (3) a matching record is modified with the newly available data.
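
A hedged sketch of that call from inside the codebase (the return value shown in the comments is illustrative of the three paths, not a precise contract):

# Hedged sketch: invoking the Import Processor directly from within the
# Open Library codebase. The return structure may differ in detail.
from openlibrary.catalog.add_book import load

book_edition = {
    "title": "Example Title",
    "authors": [{"name": "Example Author"}],
    "source_records": ["partner:example-identifier"],
}
result = load(book_edition)
# result might look like:
# {"success": True,
#  "edition": {"key": "/books/OL12345M", "status": "created"},   # path (1)
#  "work": {"key": "/works/OL67890W", "status": "created"}}
# where the status could instead indicate a match with no changes (path 2)
# or a record modified with new data (path 3).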

In cases (1) and (3), a final step called "Perform Import / Update" is performed; it is described in load_data(). Here is a flowchart of what the internal import pipeline looks like for a record that has been submitted to a public API:

Import Pipeline Flowchart

Automated Import Pipeline

For instructions and context on testing the Cron + ImportBot pipelines, please see the notes in issue #5588 and this overview video (bharat + mek).

Open Library's production automatic import pipeline consists of two components:

  1. A Cron service with a collection of jobs that routinely pull data from partner sources and enqueue records in a database
  2. An ImportBot which polls this unified database of queued import requests and processes them into the catalog (sketched below)
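
Conceptually, the ImportBot behaves like a polling worker. The sketch below is a simplification with hypothetical helper names; the real code lives in scripts/manage-imports.py:

import time

# Simplified conceptual sketch of the ImportBot loop; helper names on the
# queue object are hypothetical stand-ins for the real ImportItem helpers.
def run_import_bot(queue, import_one):
    while True:
        pending = queue.fetch_pending(limit=100)  # hypothetical helper
        if not pending:
            time.sleep(60)  # nothing queued; wait before polling again
            continue
        for item in pending:
            try:
                import_one(item)         # e.g. POST the record to /api/import
                queue.mark_done(item)    # hypothetical helper
            except Exception:
                queue.mark_failed(item)  # hypothetical helper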

Note: In the following chart, the Infogami Container is detailed above in the main import flowchart.

Code Paths

There are multiple paths by which data can be imported into Open Library.

  1. Through the website UI and the Open Library Client which both use the endpoint: https://openlibrary.org/books/add
  2. Through the data import API: https://openlibrary.org/api/import
  3. By reference to archive.org items via the IA import endpoint: https://openlibrary.org/api/import/ia
  4. Through our privileged ImportBot script scripts/manage-imports.py, which POSTs to the IA import API via Openlibrary.import_ocaid() from openlibrary/api.py (see the sketch after this list)
  5. Through the bulk import API in openlibrary/api.py -- this should be considered deprecated
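
For path 4, a minimal sketch of the client call (the login step and keyword arguments are assumptions; check openlibrary/api.py for the exact class name and signature):

# Hedged sketch of importing an archive.org item via the client in
# openlibrary/api.py, as the ImportBot does. Credentials are placeholders.
from openlibrary.api import OpenLibrary

ol = OpenLibrary(base_url="https://openlibrary.org")
ol.login("example-bot", "********")  # privileged bot account (placeholder)
result = ol.import_ocaid("itinerariosporlo0000garc")
print(result)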

Administrative Tasks

Importing Internet Archive Books

From openlibrary_cron-jobs_1 on ol-home0, enqueue a batch:

cd /openlibrary/scripts
PYTHONPATH=/openlibrary python /openlibrary/scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml add-new-scans 2021-07-28

Directly Importing an Internet Archive Book (Internally)

Run an import on a single ID from openlibrary_importbot_1 on ol-home0:

cd /openlibrary/scripts
PYTHONPATH=/openlibrary python  # start a Python shell with Open Library on the path
import web
import infogami
from openlibrary import config
config.load_config('/olsystem/etc/openlibrary.yml')
config.setup_infobase_config('/olsystem/etc/infobase.yml')
importer = __import__("manage-imports")  # module name contains a hyphen, so use __import__
import internetarchive as ia
item = importer.ImportItem.find_by_identifier('itinerariosporlo0000garc')
x = importer.ol_import_request(item, servername='https://openlibrary.org', require_marc=False)