DataSet import

In order to import data you need to create DataSet pages where you can upload your CSV or XML files. (The module creates a DataSet template type for you during install.)
Each uploaded file is imported according to rules stored in the description field next to it.
The rules can be written in YAML or JSON format.
Note: if the config is in JSON format, it must start with the prefix 'JSON' (the prefix is removed during parsing).
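
For example, a JSON rule set (with illustrative values) might start like this:

  JSON { "name": "My import", "input": { "type": "csv", "delimiter": ";" } }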

Import rules

The import rule description contains the following main sections:

  • name - A human-readable name for the import.
  • input - Description of the input source.
  • pages - Description of the output pages and import options.
  • fieldmappings - Mappings between source data and page fields.

Name (a single value)

A human-readable name for the import.

input

The source configuration. It may contain the following options (a combined example follows the list):

  • type: the source type, e.g. XML or CSV, see the available submodules.
  • location: (optional) location override. Use this location instead of the source file (the file type must be the same).
  • delimiter: delimiter char (CSV) or XML tag name surrounding entries (XML), e.g. ';' or 'entry'.
  • header: number of header rows (CSV only).
  • enclosure: character around CSV fields when they contain the delimiter (CSV only).
  • required_fields: array of CSV column numbers or XPath selectors that are required to have a non-empty value.
  • silent_missing: if present, missing required fields will not be reported as errors and missing glue components will not be reported as warnings.
  • exclude_filter: filter out rows before running the import, using a simple PHP logical expression in which @n refers to column n.
  • limit: maximum number of records to import (good for debugging).
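
Put together, an input section for a CSV source might look like this (the values are illustrative):

  input:
    type: csv
    delimiter: ';'
    header: 1
    enclosure: '"'
    required_fields: [1, 2]
    limit: 100   # import only the first 100 records while debugging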

pages

The output configuration. It may contain the following options (a combined example follows the list):

  • template: page template for creating/updating/deleting content during import. E.g. template: pdf_page
  • required_fields: page fields that should be set to successfully import an entry
  • overwrite: array of field names where existing data will be overwritten
  • merge: merge new fields but don't alter existing data ('no' if missing, 'yes' otherwise)
  • silent_duplum: if present, duplicated entries will not be reported as warnings (meaningful if merge or overwrite is NOT set) ('no' if missing, 'yes' otherwise)
  • skip_new: if present, don't import new data (meaningful if merge or overwrite is set) ('no' if missing, 'yes' otherwise)
  • selector: a valid ProcessWire selector to select child pages matching the input. Imported fields can be referenced using @field. E.g. selector: 'pdftitle=@pdftitle'.
    Note: the maximum length of a selector value is 100 characters (a PW constraint). Longer values will be truncated.
  • search_global: if present, the selector matches all pages, not just child pages. Useful if you want to update / merge your data with datasets uploaded elsewhere.
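
Put together, a pages section might look like this (the template and field names are illustrative):

  pages:
    template: pdf_page
    required_fields: [pdftitle]
    overwrite: [pdf_file]
    selector: 'pdftitle=@pdftitle'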

fieldmappings

Describes how to match input data to template fields.
The format is: "fieldname: input_selector" where fieldname is the appropriate field of the template, and "input_selector" specifies the location and assembly of the data that should be imported to that field.
CSV imports may specify the column number (starting at 1) in the input selector. Examples:

  pdftitle: 1  
  pdf_file: 16

Note: column_id 0 always contains the serial number of the row.

During CSV imports a "glue" array may also be used as an input selector. In this case, numerical array values are replaced by their matching columns in the input, while string values are copied to the output as-is. If columns are used and they are all empty in the input, the field will be ignored and a warning is issued (unless silent_missing is specified). Examples:

  fullname: [1, ' ', 2]     # joining columns #1 and #2 and a space between them
  title: [1, ' [', 0, ']']  # add the row serial number in brackets to the 1st column value.

It is also possible to import multiple values from a single input field. In this case the selector should be an associative array (a mapping in YAML), where the "type" key specifies the input data type (e.g. array), the "separator" key gives the character used to split the field into array elements, and the "column" key holds the input selector.
WARNING: these array import fields cannot be used in page selectors during the import.
Example:

  values: { type: array, separator: "|", column: 7 }   # explode the data in column 7 to an array using | as the separator

XML imports may use XPath expressions to select the data. Examples:

  title: '@headword'  
  xml_data: . # the entire record including its outer tag  

csv_data_defaults

Default values for empty CSV cells as "column_id: value" pairs. Only works for CSV inputs.
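
For example (the column number and value are illustrative):

  csv_data_defaults:
    6: '1'   # use '1' when column 6 is empty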

field_data_defaults

Default values for fields ("field_name: value" pairs).
This is preloaded before importing a data entry. Default values will be overwritten by input data (if present).
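
For example (the field name and value are illustrative):

  field_data_defaults:
    event_type: '3'   # used unless the input provides a value for event_type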

Handling special data types during import

Page reference

In most cases references are resolved using the page title.
You can, however, specify other fields to match during import. To do this, use the "Page Autocomplete" field type and fill in the "fields to query for autocomplete" textfield.
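
The mapping itself looks like any other field mapping. Assuming a Page Autocomplete field named place_ref that is configured to query a place_code field (both names are hypothetical), a sketch might be:

  place_ref: 4   # column 4 is matched against place_code to find the referenced page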

Option types

Numeric option values and titles are also handled during the import if you specify option fields using the following scheme:

id=title

or

id=value|title

where id is the serial ID of the option (PW internal), the value specifies a numeric value present in the input file, and the title is the human-readable display form of the data. If you omit the value then the import process will look for the title in the input dataset.
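
For instance, an option field might be specified like this (the IDs, values and titles are illustrative). The first two lines map the input values 10 and 20 to the options 'Low' and 'Medium'; the last line has no value, so the import looks for the title 'High' in the input:

  1=10|Low
  2=20|Medium
  3=High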

Files

In order to import files to pages you may use locations that are supported by the PW ->add() method.
E.g. during CSV import you specify a file URL in the CSV document, and create a file field to store the downloaded file. Example:

  pdf_file: 16   # the 16th field in the CSV file contains URLs like ftp://example.org/file.zip

The import module fetches the file from the specified location and attaches it to the page's pdf_file field.

Examples

A complete configuration example

name: Dictionary import
input: # Source configuration
  type: xml
  delimiter: entry
fieldmappings:
  title: '@headword'
  xml_data: . # the entire record including its outer tag
pages:
  template: headword
  selector: 'title=@title'

More advanced examples

name: Merging and altering input data
input: # Source configuration
  type: csv
  delimiter: ','
  header: 2      # contains 2 header rows
  exclude_filter: ' @2 == "" && @5 > 10 '    # exclude rows where the 2nd column is empty and the 5th column's value is greater than 10
fieldmappings: # specified as field_name: csv_column_id (1, 2, 3, ...)
  title: [2, ' - ', 4]   # Merging several columns into a string
  place_id: ['PlaceID-#', 1]   # Adding a prefix
  place_name: 2
  county: 4
pages:
  template: Place
  merge: 1    # Extend records with new data but don't alter already existing fields
  selector: 'place_id=@place_id, title=@title'

name: Specifying default values
input: # Source configuration
  type: csv
  delimiter: ';'
  header: 1
csv_data_defaults: # default values for CSV columns
  6: '1'   # month
  7: '1'   # day
  # if column 6 or 7 is empty then use the default values
fieldmappings:  # specified as field_name: csv_column_id (1, 2, 3, ...)
  title: [2, ' - ', 3]
  event_id: 1
  event_type: [ '3' ]   # a constant text value
  event_date: [5, '-', 6, '-', 7]  # construct the date as YYYY-MM-DD
  event_date_accuracy: 8
pages:  # Config for child pages
  template: Event
  merge: 1     # Merge new field values
  selector: 'event_id=@event_id'