feat(interactive): Support parsing csv files with special delimiters (#4336)

Support special delimiters like `\t`.
zhanglei1949 authored Nov 25, 2024
1 parent 94a02d7 commit 4d7bd16
Showing 3 changed files with 31 additions and 4 deletions.
12 changes: 12 additions & 0 deletions .github/workflows/interactive.yml
@@ -605,3 +605,15 @@ jobs:
         SCHEMA_FILE=${GITHUB_WORKSPACE}/flex/tests/rt_mutable_graph/movie_schema_test.yaml
         BULK_LOAD_FILE=${GITHUB_WORKSPACE}/flex/tests/rt_mutable_graph/movie_import_test.yaml
         GLOG_v=10 ./bin/bulk_loader -g ${SCHEMA_FILE} -l ${BULK_LOAD_FILE} -d /tmp/csr-data-dir/
+    - name: Test graph loading with different delimiter
+      env:
+        GS_TEST_DIR: ${{ github.workspace }}/gstest/
+        FLEX_DATA_DIR: ${{ github.workspace }}/gstest/flex/modern_graph_tab_delimiter/
+      run: |
+        rm -rf /tmp/csr-data-dir/
+        cd ${GITHUB_WORKSPACE}/flex/build/
+        SCHEMA_FILE=${GITHUB_WORKSPACE}/flex/interactive/examples/modern_graph/graph.yaml
+        BULK_LOAD_FILE=${GITHUB_WORKSPACE}/flex/interactive/examples/modern_graph/bulk_load.yaml
+        sed -i 's/|/\\t/g' ${BULK_LOAD_FILE}
+        GLOG_v=10 ./bin/bulk_loader -g ${SCHEMA_FILE} -l ${BULK_LOAD_FILE} -d /tmp/csr-data-dir/
2 changes: 1 addition & 1 deletion docs/flex/interactive/data_import.md
@@ -227,7 +227,7 @@ The table below offers a detailed breakdown of each configuration item. In this
 | loading_config.scheme | file | The source of input data. Currently only `file` and `odps` are supported | No |
 | loading_config.format | N/A | The format of the raw data in CSV | Yes |
 | loading_config.format.metadata | N/A | Mainly for configuring the options for reading CSV | Yes |
-| loading_config.format.metadata.delimiter | '\|' | Delimiter used to split a row of data | Yes |
+| loading_config.format.metadata.delimiter | '|' | Delimiter used to split a row of data, escaped char are also supported, i.e. '\t' | Yes |
 | loading_config.format.metadata.header_row | true | Indicate if the first row should be used as the header | No |
 | loading_config.format.metadata.quoting | false | Whether quoting is used | No |
 | loading_config.format.metadata.quote_char | '\"' | Quoting character (if `quoting` is true) | No |
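
For readers unfamiliar with what the `delimiter` option controls, the following is a minimal sketch of splitting one row on a single delimiter character. It is illustrative only: the actual loader delegates parsing to Arrow's CSV reader, `split_row` is a hypothetical helper that does not exist in the repository, and quoting is deliberately ignored.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration (not part of the commit).
// Splits `row` on every occurrence of `delimiter`; quoting and escaping
// inside fields are not handled.
std::vector<std::string> split_row(const std::string& row, char delimiter) {
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  while (true) {
    const auto pos = row.find(delimiter, start);
    if (pos == std::string::npos) {
      // No further delimiter: the remainder is the last field.
      fields.push_back(row.substr(start));
      break;
    }
    fields.push_back(row.substr(start, pos - start));
    start = pos + 1;  // continue after the delimiter
  }
  return fields;
}
```

With a made-up sample row, `split_row("1\tmarko\t29", '\t')` yields three fields, whereas splitting the same row on `'|'` yields a single field containing the whole line. That is why the delimiter in the loading config has to match the data files.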
21 changes: 18 additions & 3 deletions flex/storages/rt_mutable_graph/loader/csv_fragment_loader.cc
@@ -158,10 +158,25 @@ static std::vector<std::string> read_header(
 static void put_delimiter_option(const LoadingConfig& loading_config,
                                  arrow::csv::ParseOptions& parse_options) {
   auto delimiter_str = loading_config.GetDelimiter();
-  if (delimiter_str.size() != 1) {
-    LOG(FATAL) << "Delimiter should be a single character";
+  if (delimiter_str.size() != 1 && delimiter_str[0] != '\\') {
+    LOG(FATAL) << "Delimiter should be a single character, or a escape "
+                  "character, like '\\t'";
   }
+  if (delimiter_str[0] == '\\') {
+    if (delimiter_str.size() != 2) {
+      LOG(FATAL) << "Delimiter should be a single character";
+    }
+    // escape the special character
+    switch (delimiter_str[1]) {
+    case 't':
+      parse_options.delimiter = '\t';
+      break;
+    default:
+      LOG(FATAL) << "Unsupported escape character: " << delimiter_str[1];
+    }
+  } else {
+    parse_options.delimiter = delimiter_str[0];
+  }
-  parse_options.delimiter = delimiter_str[0];
 }
 
 static bool put_skip_rows_option(const LoadingConfig& loading_config,
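
The delimiter handling in this hunk can be restated as a small standalone function, which may make the accepted inputs easier to see. This is a sketch under stated assumptions, not the committed code: `resolve_delimiter` is a hypothetical name, it returns `std::nullopt` where the commit calls `LOG(FATAL)`, and it assumes C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not part of the commit): maps the delimiter string
// from the loading config to the single character given to the CSV parser.
// Returns std::nullopt for inputs the commit would reject with LOG(FATAL).
std::optional<char> resolve_delimiter(const std::string& s) {
  if (s.size() == 1 && s[0] != '\\') {
    return s[0];  // plain single-character delimiter such as '|' or ','
  }
  if (s.size() == 2 && s[0] == '\\') {
    switch (s[1]) {
      case 't':
        return '\t';  // the only escape sequence the commit supports
      default:
        return std::nullopt;  // unsupported escape, e.g. the two chars \n
    }
  }
  // Empty string, lone backslash, or more than one logical character.
  return std::nullopt;
}
```

Under these assumptions, `resolve_delimiter("|")` gives `'|'`, the two-character string `\t` gives a tab, and inputs such as `||`, a lone backslash, or `\n` are rejected. A caller could then assign the result to `parse_options.delimiter` and report an error only when no value is returned.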