Skip to content

josesamos/starschemar

Repository files navigation

starschemar starschemar website

CRAN status R-CMD-check Codecov test coverage Downloads

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.

Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way.

Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). With the tools mentioned above, this transformation can be carried out, but it requires a lot of work. I am not aware of any tools with operations designed to specifically support this transformation process.

The goal of starschemar is to define transformations that allow you to easily obtain star schemas from flat tables. In addition, it includes basic data cleaning, dimension enrichment, incremental data refresh and query operations, adapted to this context.

Based on this package, the rolap package has been developed. It incorporates the main functionalities for which starschemar was initially intended and it has a new data model defined with the aim of facilitating its maintenance and evolution.

Installation

You can install the released version of starschemar from CRAN with:

install.packages("starschemar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("josesamos/starschemar")

Example

To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System data set in the form of a flat table, the first rows of which are shown below.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 1 1962-01-06 1 CT Bridgeport 3 46 43
1962 2 1962-01-13 1 CT Bridgeport 2 43 41
1962 3 1962-01-20 1 CT Bridgepor 2 40 38
1962 1 1962-01-06 9 WA Tacoma 4 50 46
NA NA 1962-01-13 9 WA Tacoma 2 45 43
1962 3 1962-01-20 9 WA Tacoma 0 39 39

The original data has been modified to have some erroneous or missing data.

The transformation to obtain a star schema from the flat table using starschemar package is as follows:

library(starschemar)

# columns to consider in the definition
dput(colnames(ft))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City", 
#> "Pneumonia and Influenza Deaths", "All Deaths", "Other Deaths"
#> )

dm <- dimensional_model() |>
  define_fact(
    name = "mrs_cause",
    measures = c(
      "Pneumonia and Influenza Deaths",
      "Other Deaths"
    ),
  ) |>
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) |>
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  )

st <- star_schema(ft, dm) |>
  snake_case() |>
  character_dimensions(
    NA_replacement_value = "Unknown",
    length_integers = list(week = 2)
  )

The tables of dimensions and facts of the obtained star schema are shown below.

when_key week_ending_date week year
1 1962-01-06 01 1962
2 1962-01-13 02 1962
3 1962-01-13 Unknown Unknown
4 1962-01-20 03 1962
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 9 WA Tacoma
when_key where_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 2 3 43 1
1 3 4 46 1
2 2 2 41 1
3 3 2 43 1
4 1 2 38 1
4 3 0 39 1

The tables show the erroneous and missing data. We are going to perform some data cleaning operations to correct them.

where <- st |>
  get_dimension("where")

when <- st |>
  get_dimension("when")

updates <- record_update_set() |>
  update_selection(
    dimension = where,
    columns = c("city"),
    old_values = c("Bridgepor"),
    new_values = c("Bridgeport")
  ) |>
  match_records(dimension = when,
                old = 3,
                new = 2)

st <- st |>
  modify_dimension_records(updates)

The new dimension and fact tables are shown below.

when_key week_ending_date week year
1 1962-01-06 01 1962
2 1962-01-13 02 1962
3 1962-01-20 03 1962
where_key region state city
1 1 CT Bridgeport
2 9 WA Tacoma
where_key when_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 1 3 43 1
1 2 2 41 1
1 3 2 38 1
2 1 4 46 1
2 2 2 43 1
2 3 0 39 1

In addition to the operations in the examples shown, starschemar offers support for defining role playing and role dimensions in a star schema, fact constellations with conformed dimensions, dimension enrichment, incremental data refresh operations, the ability to export results in various tibble-based formats, and query operations, adapted to this context.

About

Obtaining Star Schemas from Flat Tables

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Packages

No packages published