Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INF-427: Download country/institution data in CSV format #56

Merged
merged 1 commit into from
Apr 3, 2023

Conversation

alexmassen-hane
Copy link
Collaborator

Use fetch to grab the json from Cloudflare KV storage, parse it to CSV and allow it to be downloaded when user clicks the button.

Need to figure out a few kinks with how Cloudflare worker-api gives it's data.

@alexmassen-hane alexmassen-hane changed the title Allow users to download data from the website for each country and institution INF-427: Download country/institution data in CSV format Nov 29, 2022
@alexmassen-hane alexmassen-hane force-pushed the INF-427-download-csv branch 2 times, most recently from eb74fa4 to c4dca76 Compare January 4, 2023 02:15
@alexmassen-hane
Copy link
Collaborator Author

Modified the API worker to pull the entity data from KV storage, parse the timeseries and repository data into a CSV like arrays and then zip them up for download when the link of /api/download/[entity.type]/[entity.id] is hit.

The clickable download link for the zip on the COKI OA website for the specific country or institution appears as

image

@alexmassen-hane
Copy link
Collaborator Author

alexmassen-hane commented Jan 4, 2023

Instead including the schemas in a separate file in the zip, I have just added a link to the data page for the user to investigate. Also shortened the names of the CSVs to "{entity.id}_repositories.csv" and "{entity.id}_yearly_aggregated_stats.csv" .

image

@alexmassen-hane
Copy link
Collaborator Author

Having to force flexsearch to stay at v0.7.2 as there are issues with type definitions in v0.7.3. When the issue has been fixed we can considering upgrading.

nextapps-de/flexsearch#342
nextapps-de/flexsearch#364

Also added necessary config so it will ignore library files when doing the yarn types:check step in the tests, as there are conflicts with the current node and cloudflare type definitions.

@jdddog jdddog marked this pull request as ready for review February 13, 2023 23:17
@jdddog jdddog self-requested a review February 13, 2023 23:18
Copy link
Contributor

@jdddog jdddog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Alex, thanks for the PR, it is looking good, the main feedback is about:

  • Using the MetadataLink object for the download UI functionality.
  • Using routing features from itty-router so that you don't have to manually parse route parameters.
  • And using a JSON to CSV converter library as it will reduce the amount of code that we have to write for conversion and handle things such as quoting when commas are in strings etc.

@@ -105,3 +122,17 @@ export interface EntityRequest extends Request {
};
query: {};
}

export interface dataRequest extends Request {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As dataRequest is an interface for an object rather than a function it should use CamelCase.

Also take a look at how the params for EntityRequest is defined so that you don't have to manually parse the entityType and id fields in downloadDataZipHandler.

workers-api/src/types.ts Outdated Show resolved Hide resolved
workers-api/src/router.test.ts Outdated Show resolved Hide resolved
workers-api/src/router.ts Outdated Show resolved Hide resolved
workers-api/src/downloadDataZip.ts Outdated Show resolved Hide resolved
e2e/downloadZIP.spec.ts Outdated Show resolved Hide resolved
Comment on lines 81 to 139
export const parseEntityTimeseriesToCSV = async (entityData: Array<Year>) => {
const dataToCSV = [];

// Obtain headers from the Entity data.
// Scan through all years for columns that exist.
var numColumns = 0;
var columnHeaders: Array<string> = ["year"];
for (const dataLine of entityData) {
if (numColumns < Object.keys(dataLine.stats).length) {
numColumns = Object.keys(dataLine.stats).length;
for (const column in dataLine.stats) {
columnHeaders.push(`${column}`);
}
}
}
dataToCSV.push(columnHeaders);

// Transform data from the enitity object to csv like array
var dataLineTimeSeries: Array<string> = [""];
for (const entityTimeSeriesData of entityData) {
const statsArray = Object.keys(entityTimeSeriesData.stats).map(key => `${entityTimeSeriesData.stats[key]}`);
dataLineTimeSeries = [`${entityTimeSeriesData.year}`].concat(statsArray);
dataToCSV.push(dataLineTimeSeries);
}

// Add commas, line breaks and join into large string
let csvContent: string = dataToCSV.map(e => e.join(",")).join("\n");

return csvContent;
};

export const parseEntityRepositoriesToCSV = async (entityData: Array<Repository>) => {
const dataToCSV = [];

// Obtain headers from the Entity data.
var numColumns = 0;
var columnHeaders: Array<string> = [];
for (const dataLine of entityData) {
if (numColumns < Object.keys(dataLine).length) {
numColumns = Object.keys(dataLine).length;
for (const column in dataLine) {
columnHeaders.push(`${column}`);
}
}
}
dataToCSV.push(columnHeaders);

// Transform data from the enitity object to csv like array
// Some of the repository names have commas in them and need to be removed
for (const entityRepositories of entityData) {
const repository = Object.keys(entityRepositories).map(key => `${entityRepositories[key]}`.replaceAll(",", ""));
dataToCSV.push(repository);
}

// Add the commas and line breaks
let csvContent: string = dataToCSV.map(e => e.join(",")).join("\n");

return csvContent;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there would be less code to maintain and it would be more robust to use a JSON to CSV library, e.g. https://juanjodiaz.github.io/json2csv/#/parsers/parser. You could use the Parser object.

It would be more robust because they already handle string quoting when there are commas in the repository names etc.

workers-api/src/downloadDataZip.ts Outdated Show resolved Hide resolved
if (data != undefined) {
var a = document.createElement("a");
a.href = window.URL.createObjectURL(data);
a.download = `COKI_data_${entity.entity_type}_${entity.id}.zip`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we call the file Country name + COKI Dataset.zip, e.g "Mali COKI Dataset.zip"?

components/details/DownloadDataLink.tsx Outdated Show resolved Hide resolved
@jdddog jdddog self-requested a review March 6, 2023 21:23
Copy link
Contributor

@jdddog jdddog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Alex, the changes look great. It looks like Typescript support is coming for the json2csv package, so we can update it when that is released.

I've left a few smaller comments. I also need to do a few things, including making the icon, checking about the authors and license of the jszip library.

workers-api/jest.config.js Outdated Show resolved Hide resolved
@@ -35,8 +37,11 @@
"typescript": "^4.4.4"
},
"dependencies": {
"flexsearch": "^0.7.21",
"@json2csv/node": "^6.1.2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be @json2csv/plainjs rather than @json2csv/node.

"itty-router": "^2.6.6",
"jszip": "^3.10.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache might be compatible with GPL, but GPL is not compatible with Apache, so I need to check with Cameron to make sure the dual licensing is fine.

workers-api/src/json2csv.d.ts Outdated Show resolved Hide resolved
workers-api/src/downloadZip.ts Show resolved Hide resolved
workers-api/src/downloadZip.ts Outdated Show resolved Hide resolved
workers-api/src/downloadZip.ts Outdated Show resolved Hide resolved
workers-api/src/downloadZip.test.ts Outdated Show resolved Hide resolved
@jdddog jdddog force-pushed the INF-427-download-csv branch 3 times, most recently from 81a6984 to a6d46a0 Compare April 3, 2023 00:42
Co-authored-by: Alex Massen-Hane <[email protected]>
Co-authored-by: Jamie Diprose <[email protected]>
@jdddog jdddog merged commit 9d3ce36 into develop Apr 3, 2023
@jdddog jdddog deleted the INF-427-download-csv branch September 12, 2023 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants