
how to speed up "File Tail"? #48

Open
dongbin86 opened this issue Mar 24, 2017 · 5 comments

Comments

@dongbin86

I have a log file named access.log, 3.8 GB in size.

I created a simple pipeline: filetail → trash.

It is not rate limited, but Record Throughput is only less than 20 records/s.

The pipeline uses the default config.

How can I optimize it?

@metadaddy
Contributor

What version of SDC? Can you export the pipeline and post it here? You should get thousands of records/sec!

@metadaddy
Copy link
Contributor

7 seconds to ingest 9890 records (roughly 1,400 records/sec) on my laptop:

[screenshot: pipeline monitoring metrics]

@dongbin86
Author

2.4.0.0
f056e6c0-40bd-4cdc-bb4a-8df2a53576c2.txt
GitHub doesn't support attaching .json files, so I renamed it to .txt; you can download it and rename it back.
Yes, yesterday I used a script to write lines to a file at 10,000 lines/sec, and StreamSets File Tail could keep up with that rate. So I wonder whether the cause is the file size being too big: does every batch rewrite the offset, and does the next batch then need to re-seek from the top of the file to that offset?
I need your help, @metadaddy

@dongbin86
Author

Also, I want to know when File Tail is triggered to collect the log file.
If I have a file but no new line is appended, will File Tail not be triggered?

@metadaddy
Contributor

I looked at your pipeline - I don't see anything that would slow it down.

The file tail reader will only seek at the beginning of each batch, so it shouldn't impact performance that much. You could test this by changing the batch size. Note - you will need to edit sdc.properties to increase batch size beyond 1000 - see https://streamsets.com/documentation/datacollector/latest/help/#Troubleshooting/Troubleshooting_title.html#concept_ay2_w1l_2s
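For reference, the batch-size limit mentioned above lives in `sdc.properties` under `$SDC_CONF`; if memory serves, the property is `production.maxBatchSize` (check the docs linked above for your SDC version), and Data Collector must be restarted after changing it:

```properties
# sdc.properties -- maximum records per batch for any pipeline
# (default is 1000; raise it to test larger File Tail batches)
production.maxBatchSize=5000
```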

File tail will read all of the existing data, then wait for new data, so it should work for you. A better choice, if the file will not be changing, might be the directory origin.
