bufio.Scanner: token too long #299

Closed
chapmanjacobd opened this issue Oct 15, 2022 · 4 comments

@chapmanjacobd

I'm running out of vespene gas or something

$ wget https://files.pushshift.io/reddit/submissions/RS_2022-08.zst
$ unzstd --memory=2048MB --stdout RS_2022-08.zst | octosql "SELECT count(*) FROM stdin.json" -o csv
...
Error: couldn't run query: couldn't run source: couldn't run source: bufio.Scanner: token too long

sad :'(

the great octopus god is able to work with this other, smaller, file in 110.6s:

$ unzstd --memory=2048MB --stdout RS_2021-08.zst | octosql "SELECT count(*) FROM stdin.json" -o csv
count
28384220

It does not use much RAM with either file, so I'm not sure what's up :? Both files are similar-ish in size: 7.8 GB vs 10 GB compressed, maybe 200 GB uncompressed.

@cube2222
Owner

Hey, it's actually the line size that's the problem (it's limited to 1MB right now), but I'm happy to add a config option for this.

@cube2222
Owner

This has now been added in 6644557 and released in 0.11.1.

You can now configure the maximum line size in your ~/.octosql/octosql.yml file (the value below is 32 MiB):

databases:
  # ...
files:
  json:
    max_line_size_bytes: 33554432

Thanks for the report!

@DeluxeOwl

DeluxeOwl commented Sep 3, 2024

edit: added PR (2 loc) here #336

Hi @cube2222, this doesn't actually work (I don't think the context is passed properly).

I've added some printing in cmd/root.go:

fmt.Printf("%+v\n", cfg)
ctx = config.ContextWithConfig(ctx, cfg)
fmt.Printf("%+v\n", ctx)

And in datasources/json/execution.go:

func (d *DatasourceExecuting) Run(ctx ExecutionContext, produce ProduceFn, metaSend MetaSendFn) error {
	fmt.Printf("from json.Run, ctx: %+v\n", ctx)

	f, err := files.OpenLocalFile(ctx, d.path, files.WithTail(d.tail))
	if err != nil {
		return fmt.Errorf("couldn't open local file: %w", err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)

	sc.Buffer(nil, config.FromContext(ctx).Files.JSON.MaxLineSizeBytes)
	fmt.Printf("from json.Run, config from context: %+v\n", config.FromContext(ctx))

And it doesn't seem like it's doing anything:

$ ./octosql/main "select * from nat_rules.json"  --describe
from root.go, config: &{Databases:[] Files:{JSON:{MaxLineSizeBytes:33554432} BufferSizeBytes:4194304}}
from root.go, context: context.Background.WithCancel.WithValue(config.contextKey, *config.Config)
Usage:
  octosql <query> [flags]
  octosql [command]

Examples:
octosql "SELECT * FROM myfile.json"
octosql "SELECT * FROM mydir/myfile.csv"
octosql "SELECT * FROM plugins.plugins"

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  plugin      

Flags:
      --describe         Describe query output schema.
      --explain int      Describe query output schema.
  -h, --help             help for octosql
      --optimize         Whether OctoSQL should optimize the query. (default true)
  -o, --output string    Output format to use. Available options are live_table, batch_table, csv, json and stream_native. (default "live_table")
      --profile string   Enable profiling of the given type: cpu, memory, trace.
  -v, --version          version for octosql

Use "octosql [command] --help" for more information about a command.

Error: typecheck error: couldn't create datasource: couldn't scan lines: bufio.Scanner: token too long

I'll take a more in-depth look later and open a PR.

@chapmanjacobd
Author

For what it's worth, I think this was working at one point, or maybe I just filtered out the long line; I don't really remember. Here is my octosql config:

$ cat ~/.octosql/octosql.yml
files:
  buffer_size_bytes: 33554432
  json:
    max_line_size_bytes: 33554432
