---
title: 'CS5200 Fall 2020: Practicum 3'
author: "Chandra Davis, Evan Douglass"
output:
  pdf_document: default
  word_document: default
  html_document:
    df_print: paged
---
## Overview
We've decided to work with SQLite for this practicum, so you will need SQLite installed on your machine to work with these files. The data we are using was provided with the practicum.
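If any of the R packages loaded in the setup chunk below are missing, a one-time install along these lines should be enough (a sketch; the package list simply mirrors the `library()` calls that follow):
```{r, eval=FALSE}
# One-time setup (sketch): install the packages loaded in the next chunk
install.packages(c("RSQLite", "XML", "sqldf", "dplyr", "tibble", "sjmisc", "ggplot2"))
```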
```{r, warning=FALSE, message=FALSE}
# Libraries needed for processing
library(RSQLite)
library("XML")
library(sqldf)
library(dplyr)
library(tibble)
library(sjmisc)
library(ggplot2)
# Set to FALSE if you do not want to print the top 5 rows from each table in the transactional database
printHead = TRUE
```
## Part 1
Create a normalized relational OLTP database and populate it with data from an XML document.
### Task 1
Create a normalized relational schema that contains minimally the following entities: Article, Journal, Author, History. Use the XML document to determine the appropriate attributes (fields/columns) for the entities (tables). While there may be other types of publications in the XML, you only need to deal with articles in journals. Create appropriate primary and foreign keys. Where necessary, add surrogate keys. Include an image of an ERD showing your model in your R Notebook.
Lucidchart link:
![Task1.1](imgs/CS5200 - Practicum 3 ERD.png)
### Task 2
Realize the relational schema in SQLite (place the CREATE TABLE statements into SQL chunks in your R Notebook).
```{r}
DB_NAME <- "pubMed.db"
conn <- dbConnect(RSQLite::SQLite(), DB_NAME)
```
```{r, results="hide"}
# Since the dataset is small, the database should be re-created at runtime
drop_table <- function(table_name) {
paste("DROP TABLE IF EXISTS ", table_name, ";", sep="")
}
# Since we are dropping all tables, disable the FK checks
dbExecute(conn, "PRAGMA foreign_keys = OFF;")
# Get a list of all tables currently in the database
table_list <- dbListTables(conn)
# Drop every table in the database
for(table in table_list){
if(!str_contains(table,"sqlite")){
dbExecute(conn, drop_table(table))
}
}
dbExecute(conn, "PRAGMA foreign_keys = ON;")
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS ELocType (
eType_id INTEGER PRIMARY KEY AUTOINCREMENT,
eType TEXT NOT NULL,
CONSTRAINT unique_eLocType_eType UNIQUE (eType)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS CategoryLabels (
label_id INTEGER PRIMARY KEY AUTOINCREMENT,
label TEXT NOT NULL,
CONSTRAINT unique_catLab_label UNIQUE (label)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Affiliations (
aff_id INTEGER PRIMARY KEY AUTOINCREMENT,
aff TEXT NOT NULL,
CONSTRAINT unique_aff_aff UNIQUE (aff)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Pagination (
pgn_id INTEGER PRIMARY KEY AUTOINCREMENT,
medlinePgn TEXT NOT NULL,
CONSTRAINT unique_pgn_medpgn UNIQUE (medlinePgn)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Languages (
lang_id INTEGER PRIMARY KEY AUTOINCREMENT,
language TEXT NOT NULL,
CONSTRAINT unique_lang_lang UNIQUE (language)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS IsoAbbreviation (
abbr_id INTEGER PRIMARY KEY AUTOINCREMENT,
abbr TEXT NOT NULL,
CONSTRAINT unique_isoAbbr_abbr UNIQUE (abbr)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS MediumType (
medium_id INTEGER PRIMARY KEY AUTOINCREMENT,
medium TEXT NOT NULL,
CONSTRAINT unique_med_med UNIQUE (medium)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Countries (
country_id INTEGER PRIMARY KEY AUTOINCREMENT,
country TEXT NOT NULL,
CONSTRAINT unique_country_country UNIQUE (country)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Agencies (
agency_id INTEGER PRIMARY KEY AUTOINCREMENT,
agency TEXT NOT NULL,
CONSTRAINT unique_agency_agency UNIQUE (agency)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Acronyms (
acronym_id INTEGER PRIMARY KEY AUTOINCREMENT,
acronym TEXT NOT NULL,
CONSTRAINT unique_acr_acr UNIQUE (acronym)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS PublicationType (
pubType_id INTEGER PRIMARY KEY AUTOINCREMENT,
pubType TEXT NOT NULL,
CONSTRAINT unique_pubType_pubType UNIQUE (pubType)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS PubStatus (
status_id INTEGER PRIMARY KEY AUTOINCREMENT,
status TEXT NOT NULL,
CONSTRAINT unique_pubStatus_status UNIQUE (status)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Date (
date_id INTEGER PRIMARY KEY AUTOINCREMENT,
year INTEGER NOT NULL,
month INTEGER NOT NULL,
day INTEGER NOT NULL,
CONSTRAINT unique_date_yymmdd UNIQUE (year, month, day)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Article (
article_id INTEGER PRIMARY KEY AUTOINCREMENT,
pubModel INTEGER NOT NULL,
title Text NOT NULL,
pgn_id INTEGER NOT NULL,
authorListComplete BOOLEAN NOT NULL,
lang_id INTEGER NOT NULL,
grantListComplete BOOLEAN,
copyright Text,
FOREIGN KEY (pubModel) REFERENCES MediumType(medium_id),
FOREIGN KEY (pgn_id) REFERENCES Pagination(pgn_id),
FOREIGN KEY (lang_id) REFERENCES Languages(lang_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS ArticleDate (
artDate_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
dateType INTEGER NOT NULL,
date_id INTEGER NOT NULL,
FOREIGN KEY (article_id) REFERENCES Article(article_id),
FOREIGN KEY (dateType) REFERENCES MediumType(medium_id),
FOREIGN KEY (date_id) REFERENCES Date(date_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS PubTypeList (
pubList_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
pubType_id INTEGER NOT NULL,
FOREIGN KEY (pubType_id) REFERENCES PublicationType(pubType_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_pubTypeList_type_art UNIQUE (pubType_id, article_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS ELocationID (
eLoc_id INTEGER PRIMARY KEY AUTOINCREMENT,
eType_id INTEGER NOT NULL,
valid BOOLEAN NOT NULL,
value Text NOT NULL,
FOREIGN KEY (eType_id) REFERENCES ELocType(eType_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS ELocList (
elList_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
eLoc_id INTEGER NOT NULL,
FOREIGN KEY (eLoc_id) REFERENCES ELocationID(eLoc_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_eloclist_eLoc_art UNIQUE (eLoc_id, article_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS PubDate (
pubDate_id INTEGER PRIMARY KEY AUTOINCREMENT,
medlineDate BOOLEAN NOT NULL,
dateYYMM Text NOT NULL,
CONSTRAINT unique_pubDate_med_date UNIQUE (medlineDate, dateYYMM)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS JournalIssue (
issue_id INTEGER PRIMARY KEY AUTOINCREMENT,
volume INTEGER NOT NULL,
issue INTEGER NOT NULL,
citedMedium INTEGER NOT NULL,
pubDate_id INTEGER NOT NULL,
FOREIGN KEY (citedMedium) REFERENCES MediumType(medium_id),
FOREIGN KEY (pubDate_id) REFERENCES PubDate(pubDate_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS ArticleList (
artList_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
issue_id INTEGER NOT NULL,
FOREIGN KEY (issue_id) REFERENCES JournalIssue(issue_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_artlist_issue_art UNIQUE (issue_id, article_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS ISSN (
issn_id INTEGER PRIMARY KEY AUTOINCREMENT,
value Text NOT NULL,
issnType INTEGER NOT NULL,
CONSTRAINT unique_issn_typeissn UNIQUE (issnType, value),
FOREIGN KEY (issnType) REFERENCES MediumType(medium_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Journal (
journal_id INTEGER PRIMARY KEY AUTOINCREMENT,
title Text NOT NULL,
issn_id INTEGER NOT NULL,
isoAbbr INTEGER NOT NULL,
FOREIGN KEY (issn_id) REFERENCES ISSN(issn_id),
FOREIGN KEY (isoAbbr) REFERENCES IsoAbbreviation(abbr_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS IssueList (
issueList_id INTEGER PRIMARY KEY AUTOINCREMENT,
issue_id INTEGER NOT NULL,
journal_id INTEGER NOT NULL,
FOREIGN KEY (issue_id) REFERENCES JournalIssue(issue_id),
FOREIGN KEY (journal_id) REFERENCES Journal(journal_id),
CONSTRAINT unique_issuelist_issue_journal UNIQUE (issue_id, journal_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS AbstractText (
abText_id INTEGER PRIMARY KEY AUTOINCREMENT,
abText Text NOT NULL,
label INTEGER,
nlmCategory INTEGER,
FOREIGN KEY (label) REFERENCES CategoryLabels(label_id),
FOREIGN KEY (nlmCategory) REFERENCES CategoryLabels(label_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS AbstractList (
abList_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
abText_id INTEGER NOT NULL,
FOREIGN KEY (abText_id) REFERENCES AbstractText(abText_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_ablist_abstract_art UNIQUE (abText_id, article_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Authors (
author_id INTEGER PRIMARY KEY AUTOINCREMENT,
valid BOOLEAN NOT NULL,
lName Text NOT NULL,
fName Text NOT NULL,
initials Text NOT NULL,
aff_id INTEGER,
FOREIGN KEY (aff_id) REFERENCES Affiliations(aff_id),
CONSTRAINT unique_auth_authors UNIQUE (lName, fName, initials, aff_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS AuthorList (
authList_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
author_id INTEGER NOT NULL,
FOREIGN KEY (author_id) REFERENCES Authors(author_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_authlist_author_art UNIQUE (author_id, article_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS Grants (
grant_id INTEGER PRIMARY KEY AUTOINCREMENT,
grantID Text NOT NULL,
acronym_id INTEGER,
agency_id INTEGER NOT NULL,
country_id INTEGER NOT NULL,
FOREIGN KEY (acronym_id) REFERENCES Acronyms(acronym_id),
FOREIGN KEY (agency_id) REFERENCES Agencies(agency_id),
FOREIGN KEY (country_id) REFERENCES Countries(country_id),
CONSTRAINT unique_grant_id UNIQUE (grantID)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS GrantList (
grantList_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
grant_id INTEGER NOT NULL,
FOREIGN KEY (grant_id) REFERENCES Grants(grant_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_grantlist_grant_art UNIQUE (grant_id, article_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS PubMedDate (
pubMedDate_id INTEGER PRIMARY KEY AUTOINCREMENT,
status INTEGER NOT NULL,
date INTEGER NOT NULL,
hour INTEGER,
minute INTEGER,
FOREIGN KEY (status) REFERENCES PubStatus(status_id),
FOREIGN KEY (date) REFERENCES Date(date_id)
);
```
```{sql connection=conn}
CREATE TABLE IF NOT EXISTS History (
history_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id INTEGER NOT NULL,
pubMedDate_id INTEGER NOT NULL,
FOREIGN KEY (pubMedDate_id) REFERENCES PubMedDate(pubMedDate_id),
FOREIGN KEY (article_id) REFERENCES Article(article_id),
CONSTRAINT unique_history_date_art UNIQUE (pubMedDate_id, article_id)
);
```
### Task 3
Extract and transform the data from the XML and then load into the appropriate tables in the database. You cannot use xmlToDataFrame but instead must parse the XML node by node using a combination of node-by-node tree traversal and XPath. It is not feasible to use XPath to extract all journals, then all authors, etc. as some are missing and won't match up. You will need to iterate through the top-level nodes.
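Before the full extraction below, here is a minimal sketch of the traversal pattern we rely on: iterate over the top-level `<PubmedArticle>` nodes and evaluate XPath relative to each node, using the node index as the article key (the variable names here are illustrative only):
```{r, eval=FALSE}
# Minimal sketch of node-by-node traversal; the chunks that follow use this same pattern
sampleRoot <- xmlRoot(xmlParse("pubmed_sample.xml"))
for (i in 1:xmlSize(sampleRoot)) {
  articleNode <- sampleRoot[[i]]
  # XPath evaluated relative to the current article node, so values stay tied to article i
  title <- xpathSApply(articleNode, ".//ArticleTitle", xmlValue)
  print(paste(i, title))
}
```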
```{r, warning=FALSE}
# Read the XML file into memory
file <- xmlParse(file = "pubmed_sample.xml")
# Get the first node
root <- xmlRoot(file)
```
```{r, warning=FALSE}
# Categorical tables only hold unique values, so they do not need to be built node by node
# There are 12 categorical attribute tables
# Helper functions below avoid duplicating the processing for each one
getVector <- function(path, att){
# Grab all of the unique values for this XPath expression
if(att){
values <- unique(xpathSApply(file, path))
}
else{
values <- unique(xpathSApply(file, path, xmlValue))
}
return(values)
}
addId <- function(df){
# Add unique ids
df <- tibble::rowid_to_column(df, "id")
return(df)
}
dfToDatabase <- function(df, tablename, needID){
if(needID){
df_table <- addId(df)
}
else {
df_table <- df
}
# Update column names to match table
names <- dbGetQuery(conn,
paste("pragma table_info(",tablename,")",sep=""))[["name"]]
colnames(df_table) <- names
# Put in database
# Append=TRUE is the preferred method because we want to maintain the referential integrity relationships from table creation
dbWriteTable(conn, tablename, df_table, append=TRUE)
# Return the dataframe for processing FK relationships
return(df_table)
}
catTable <- function(path, tablename, att, mult=FALSE){
# A few of the categorical tables are made of multiple XML elements/attributes
if(mult){
values <- c()
split <- unlist(strsplit(path, ","))
for(i in split){
values <- append(values,getVector(i,att))
}
values <- unique(values)
}
else {
values <- getVector(path, att)
}
df_table <- data.frame(values)
return(dfToDatabase(df_table,tablename, TRUE))
}
```
```{r}
# Call function to build all twelve tables
# Save the dataframes for look-up in other tables
df_aff <- catTable("//Affiliation", "Affiliations", FALSE)
df_pgn <- catTable("//MedlinePgn", "Pagination", FALSE)
df_lang <- catTable("//Language", "Languages", FALSE)
df_iso <- catTable("//ISOAbbreviation", "IsoAbbreviation", FALSE)
df_pubType <- catTable("//PublicationType", "PublicationType", FALSE)
df_country <- catTable("//Country", "Countries", FALSE)
df_agency <- catTable("//Agency", "Agencies", FALSE)
df_acr <- catTable("//Acronym", "Acronyms", FALSE)
df_eIdType <- catTable("//@EIdType", "ELocType", TRUE)
df_pubStatus <- catTable("//@PubStatus", "PubStatus", TRUE)
df_labels <- catTable("//@Label,//@NlmCategory", "CategoryLabels", TRUE, TRUE)
df_mediums <- catTable("//@IssnType,//@CitedMedium,//@PubModel,//@DateType",
"MediumType", TRUE, TRUE)
```
The remaining tables need to be processed node by node because some articles have extra or missing nodes, so not all of the tables/attributes have a 1:1 relationship with the article.
The next few chunks read the XML into initial dataframes for further manipulation.
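As a quick illustration of why a document-wide XPath pull would misalign (a hypothetical check, not part of the load): optional elements can return fewer hits than there are articles, so positions alone cannot tie values back to their article.
```{r, eval=FALSE}
# Sketch: counts from document-wide XPath; optional elements (e.g. GrantList) may
# return fewer nodes than there are articles, so the flat vectors would not line up
length(xpathSApply(file, "//PubmedArticle"))
length(xpathSApply(file, "//GrantList"))
```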
```{r}
# Function to check for missing values
# Appends an NA value to the vector if the data does not exist
checkSize <- function(num, df){
if(length(df) < num){
return(append(df,NA))
}
else {
return(df)
}
}
```
```{r}
# Function processes the history XML element
# Uses the article number for tying back to the original article
histDF <- function(art, node){
status <- c()
year <- c()
month <- c()
day <- c()
hour <- c()
minute <- c()
for(i in 1:xmlSize(node)){
newNode <- node[[i]]
# Status is an attribute and values are obtained differently
status <- append(status,xmlGetAttr(newNode,"PubStatus",NA))
for(j in 1:xmlSize(newNode)){
child <- newNode[[j]]
name <- xmlName(child)
# Add to the appropriate vector depending on child name
if(name == "Year"){
year <- append(year, xmlValue(child))
}
else if(name == "Month"){
month <- append(month, xmlValue(child))
}
else if(name == "Day"){
day <- append(day, as.numeric(xmlValue(child)))
}
else if(name == "Hour"){
hour <- append(hour, xmlValue(child))
}
else if(name == "Minute"){
minute <- append(minute, as.numeric(xmlValue(child)))
}
else {
print(paste("Vector not created for:", name))
}
}
# Check for missing values on all vectors
status <- checkSize(i,status)
year <- checkSize(i,year)
month <- checkSize(i,month)
day <- checkSize(i,day)
hour <- checkSize(i,hour)
minute <- checkSize(i,minute)
}
hist_df <- data.frame("article_id"=art,
status, year, month, day, hour, minute)
return(hist_df)
}
# Create an empty dataframe to match dataframe from function
# History is the linking table between article and PubMedDate
df_pmDate <- data.frame("article_id"=c(), "status"=c(),
"year"=c(), "month"=c(), "day"=c(), "hour"=c(),
"minute"=c())
# Confirmed only one instance of history for every article
# Goes through every top level article
# Creates a combined dataframe for tables relating to the history
for(i in 1:xmlSize(root)){
newRoot <- root[[i]][[2]]
for(j in 1:xmlSize(newRoot)){
name <- xmlName(newRoot[[j]])
if(name=="History"){
df_pmDate <- rbind(df_pmDate,histDF(i,newRoot[[j]]))
}
}
}
```
```{r}
# Function processes the journal XML element
# Uses the article number for tying back to the original article
journalDF <- function(num, node){
issnType <- c()
issn <- c()
medium <- c()
volume <- c()
issue <- c()
med <- c()
date <- c()
title <- c()
iso <- c()
for(i in 1:xmlSize(node)){
newNode <- node[[i]]
name <- xmlName(newNode)
if(name=="ISSN"){
issnType <- append(issnType, xmlGetAttr(newNode,"IssnType",NA))
issn <- append(issn, xmlValue(newNode))
}
else if(name=="JournalIssue"){
medium <- append(medium, xmlGetAttr(newNode,"CitedMedium",NA))
for(j in 1:xmlSize(newNode)){
child <- newNode[[j]]
name2 <- xmlName(child)
if(name2 == "Volume"){
volume <- append(volume, as.numeric(xmlValue(child)))
}
else if(name2 == "Issue"){
issue <- append(issue, as.numeric(xmlValue(child)))
}
else if(name2 == "PubDate"){
if(xmlName(child[[1]])=="MedlineDate"){
med <- append(med, "Y")
date <- append(date, xmlValue(child[[1]]))
}
else {
med <- append(med, "N")
year <- xmlValue(child[[1]])
month <- xmlValue(child[[2]])
date <- append(date, paste(year,month))
}
}
else {
print(paste("Vector not created for: JournalIssue/", name2))
}
}
}
else if(name=="Title"){
title <- append(title, xmlValue(newNode))
}
else if(name=="ISOAbbreviation"){
iso <- append(iso, xmlValue(newNode))
}
else {
print(paste("Vector not created for:", name))
}
}
journal_df <- data.frame("article_id"=num, issnType, issn, medium, volume,
issue, med, date, title, iso)
return(journal_df)
}
```
```{r}
# Function processes the eLocationID XML element
# Uses the article number for tying back to the original article
eLocDF <- function(num, node){
valid <- xmlGetAttr(node,"ValidYN",NA)
type <- xmlGetAttr(node,"EIdType",NA)
value <- xmlValue(node)
eLoc_df <- data.frame("article_id"=num, valid, type, value)
return(eLoc_df)
}
```
```{r}
# Function processes the abstract XML element
# Uses the article number for tying back to the original article
abstractDF <- function(num, node){
cat <- c()
lab <- c()
text <- c()
for(i in 1:xmlSize(node)){
newNode <- node[[i]]
name <- xmlName(newNode)
if(name=="AbstractText"){
cat <- append(cat, xmlGetAttr(newNode,"NlmCategory",NA))
lab <- append(lab, xmlGetAttr(newNode,"Label",NA))
text <- append(text, xmlValue(newNode))
# Check for missing values on all vectors
cat <- checkSize(i,cat)
lab <- checkSize(i,lab)
text <- checkSize(i,text)
}
}
abstract_df <- data.frame("article_id"=num, cat, lab, text)
return(abstract_df)
}
```
```{r}
# Function processes the authorList XML element
# Uses the article number for tying back to the original article
authorDF <- function(num, node){
valid <- c()
last <- c()
first <- c()
init <- c()
aff <- c()
for(i in 1:xmlSize(node)){
newNode <- node[[i]]
valid <- append(valid, xmlGetAttr(newNode,"ValidYN",NA))
for(j in 1:xmlSize(newNode)){
child <- newNode[[j]]
name <- xmlName(child)
if(name == "LastName"){
last <- append(last, xmlValue(child))
}
else if(name == "ForeName"){
first <- append(first, xmlValue(child))
}
else if(name == "Initials"){
init <- append(init, xmlValue(child))
}
else if(name == "Affiliation"){
aff <- append(aff, xmlValue(child))
}
else {
print(paste("Vector not created for:", name))
}
}
# Check for missing values on all vectors
valid <- checkSize(i,valid)
last <- checkSize(i,last)
first <- checkSize(i,first)
init <- checkSize(i, init)
aff <- checkSize(i,aff)
}
author_df <- data.frame("article_id"=num, valid, last, first, init, aff)
return(author_df)
}
```
```{r}
# Function processes the publicationTypeList XML element
# Uses the article number for tying back to the original article
pubTypeDF <- function(num, node){
type <- c()
for(i in 1:xmlSize(node)){
type <- append(type, xmlValue(node[[i]]))
}
return(data.frame("article_id"=num, type))
}
```
```{r}
# Function processes the grantList XML element
# Uses the article number for tying back to the original article
grantDF <- function(num, node){
grantid <- c()
ac <- c()
ag <- c()
ctry <- c()
for(i in 1:xmlSize(node)){
newNode <- node[[i]]
for(j in 1:xmlSize(newNode)){
child <- newNode[[j]]
name <- xmlName(child)
if(name == "GrantID"){
grantid <- append(grantid, xmlValue(child))
}
else if(name == "Acronym"){
ac <- append(ac, xmlValue(child))
}
else if(name == "Agency"){
ag <- append(ag, xmlValue(child))
}
else if(name == "Country"){
ctry <- append(ctry, xmlValue(child))
}
else {
print(paste("Vector not created for:", name))
}
}
# Check for missing values on all vectors
grantid <- checkSize(i,grantid)
ac <- checkSize(i,ac)
ag <- checkSize(i,ag)
ctry <- checkSize(i,ctry)
}
grant_df <- data.frame("article_id"=num, grantid, ac, ag, ctry)
return(grant_df)
}
```
```{r}
# Function processes the articleDate XML element
# Uses the article number for tying back to the original article
artDateDF <- function(num, node){
type <- xmlGetAttr(node,"DateType",NA)
year <- as.numeric(xmlValue(node[[1]]))
month <- as.numeric(xmlValue(node[[2]]))
day <- as.numeric(xmlValue(node[[3]]))
artDate_df <- data.frame("article_id"=num, type, year, month, day)
return(artDate_df)
}
```
```{r}
# Create vectors needed for the article dataframe
id <- c()
title <- c()
pagination <-c()
aListC <- c()
gListC <- c()
lang <- c()
model <- c()
copyright <-c()
# Create empty dataframes to match the above functions
df_journal <- data.frame("article_id"=c(), "issnType"=c(), "issn"=c(),
"medium"=c(), "volume"=c(), "issue"=c(), "med"=c(),
"date"=c(), "title"=c(), "iso"=c())
df_eLoc <- data.frame("article_id"=c(), "valid"=c(), "type"=c(), "value"=c())
df_abstract <- data.frame("article_id"=c(), "cat"=c(), "lab"=c(), "text"=c())
df_author <- data.frame("article_id"=c(), "valid"=c(), "last"=c(),
"first"=c(), "init"=c(), "aff"=c())
df_pubTypeList <- data.frame("article_id"=c(), "type"=c())
df_grant <- data.frame("article_id"=c(), "grantid"=c(), "ac"=c(), "ag"=c(),
"ctry"=c())
df_artDate <- data.frame("article_id"=c(), "type"=c(), "year"=c(),
"month"=c(), "day"=c())
# Confirmed only one instance of article for every PubMedArticle
for(i in 1:xmlSize(root)){
newRoot <- root[[i]][[1]]
for(j in 1:xmlSize(newRoot)){
name <- xmlName(newRoot[[j]])
if(name=="Article"){
article <- newRoot[[j]]
id <- append(id,i)
model <- append(model,xmlGetAttr(article,"PubModel",NA))
for(k in 1:xmlSize(article)){
child <- article[[k]]
name <- xmlName(child)
if(name=="Journal"){
df_journal <- rbind(df_journal,journalDF(i,child))
}
else if(name=="ArticleTitle"){
title <- append(title,xmlValue(child))
}
else if(name=="Pagination"){
pagination <- append(pagination,xmlValue(child[[1]]))
}
else if(name=="ELocationID"){
df_eLoc <- rbind(df_eLoc,eLocDF(i,child))
}
else if(name=="Abstract"){
value <- xpathSApply(child,"CopyrightInformation", xmlValue)
if(class(value)=="list"){
copyright <- append(copyright,NA)
}
else {
copyright <- append(copyright,value)
}
df_abstract <- rbind(df_abstract,abstractDF(i,child))
}
else if(name=="AuthorList"){
aListC <- append(aListC,xmlGetAttr(child,"CompleteYN",NA))
df_author <- rbind(df_author,authorDF(i,child))
}
else if(name=="Language"){
lang <- append(lang,xmlValue(child))
}
else if(name=="GrantList"){
gListC <- append(gListC,xmlGetAttr(child,"CompleteYN",NA))
df_grant <- rbind(df_grant,grantDF(i,child))
}
else if(name=="PublicationTypeList"){
df_pubTypeList <- rbind(df_pubTypeList,pubTypeDF(i,child))
}
else if(name=="ArticleDate"){
df_artDate <- rbind(df_artDate,artDateDF(i,child))
}
else {
print(paste("Missing tables for",name))
}
}
# Check for missing values on all vectors
id <- checkSize(i,id)
title <- checkSize(i,title)
pagination <- checkSize(i,pagination)
aListC <- checkSize(i,aListC)
gListC <- checkSize(i,gListC)
lang <- checkSize(i,lang)
model <- checkSize(i,model)
copyright <- checkSize(i, copyright)
}
}
}
df_article <- data.frame(id, model, title, pagination, aListC, lang,
gListC, copyright)
```
Now that all of the data has been read, the dataframes need to be manipulated to match the table structures. Since the dfToDatabase function renames the columns positionally, the column order has to match the table exactly for the data to be added properly.
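One way to check the column order a table expects before handing a dataframe to dfToDatabase is the same pragma the function uses internally (shown here for the Article table as an example):
```{r, eval=FALSE}
# Sketch: column names, in order, that dfToDatabase will assign for the Article table
dbGetQuery(conn, "pragma table_info(Article)")[["name"]]
```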
```{r}
# Replace the applicable categorical attribute with the FK id
fkReplacement <- function(df_cat, df_list, catCol, listCol, name){
# Get a vector of the category values available
types <- df_cat[, 2]
# For each category, assign the FK id
for(term in types) {
search1 <- df_cat[,catCol] == term
# Get the FK id for the category
id <- df_cat[which(search1),][[paste0(name,"_id")]]
search2 <- df_list[,listCol] == term
# Assign the FK id to the matching rows (creates the update_id column on first use)
df_list$update_id[search2] <- id
}
# Drop the original categorical attribute column
df_list[,listCol] <- NULL
# Rename the FK id column to match attribute
names(df_list)[names(df_list) == "update_id"] <- paste0(name,"_id")
return(df_list)
}
```
```{r, message=FALSE}
# Replace FK attributes
df_article <- fkReplacement(df_mediums,df_article,2,2,"medium")
df_article <- fkReplacement(df_pgn,df_article,2,3,"pgn")
df_article <- fkReplacement(df_lang,df_article,2,4,"lang")
# Organize columns into correct order (needed for adding to database)
df_article <- data.frame(cbind(id, df_article$medium_id,
df_article$title, df_article$pgn_id,
df_article$aListC, df_article$lang_id,
df_article$gListC, df_article$copyright))
# Add the values to the database
df_article <- dfToDatabase(df_article,"Article",FALSE)
```
```{r}
# Replace FK attributes
df_pubTypeList <- fkReplacement(df_pubType,df_pubTypeList,2,2,"pubType")
# Add the values to the database
df_pubTypeList <- dfToDatabase(df_pubTypeList,"PubTypeList",TRUE)
```
```{r}
# Need to add the unique id key so the bridge list can be made
df_eLoc <- addId(df_eLoc)
# Bridge list should only have the article and the unique ids
df_eLocList <- df_eLoc[,c(2,1)]
colnames(df_eLocList) <- c("art","eLoc")
# Replace FK attributes
df_eLoc <- fkReplacement(df_eIdType,df_eLoc,2,4,"eType")
# Organize columns into correct order (needed for adding to database)
df_eLoc <- data.frame(cbind(df_eLoc$id, df_eLoc$eType_id, df_eLoc$valid,
df_eLoc$value))
# Add the values to the database
df_eLoc <- dfToDatabase(df_eLoc,"ELocationID",FALSE)
df_eLocList <- dfToDatabase(df_eLocList,"ELocList",TRUE)