
scanning very slow #170

Open
ptulpen opened this issue Jan 19, 2022 · 18 comments

@ptulpen

ptulpen commented Jan 19, 2022

Hello,
I have the issue that scanning is very slow; it has now been running for over a week for around 3,000 files.
Is there a way to restart and/or troubleshoot the scanning process?
(Restarting the container, and even the server, does not help.)
Version is 3.4.2.

@danthony06
Contributor

There should be logs in /usr/local/tomcat/logs, but I'm not sure if they will have information useful for this problem. Was it previously running fast, or is this a new installation?

@ptulpen
Author

ptulpen commented Jan 21, 2022

Hello,
it was not this slow before.
The only interesting parts in the logs are snippets like:

19-Jan-2022 20:29:41.728 INFO [MessageBroker-4] org.springframework.web.socket.config.WebSocketMessageBrokerStats.lambda$initLoggingTask$0 WebSocketSession[0 current WS(0)-HttpStream(0)-HttpPoll(0), 5 total, 0 closed abnormally (0 connect failure, 0 send limit, 0 transport error)], stompSubProtocol[processed CONNECT(0)-CONNECTED(0)-DISCONNECT(0)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 5], sockJsScheduler[pool size = 8, active threads = 1, queued tasks = 4, completed tasks = 275440]

@ptulpen
Author

ptulpen commented Aug 24, 2022

One thought that came to my mind:
could setting indexes or similar options in PostgreSQL help improve the performance?

@pjreed
Contributor

pjreed commented Aug 24, 2022

The database should already have quite a few indexes on relevant columns; you could check using a tool like DBeaver to connect to the database, although if your tables are missing indexes, I would expect that to drastically slow down searching, not scanning new files. In fact, lacking indexes would actually make inserting new records faster since it can simply append them to the table without updating the indexes.

Scanning should mostly be limited by disk read speed, since it has to read in the entire file, and to a lesser degree by CPU speed, since it has to generate a hash to identify the bag file. This could be an issue if you're reading very large bags over a slow network connection, or potentially if you're reading large bag files from slow HDDs, especially if the bag files themselves are unindexed and there's a lot of disk thrashing going on.
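The read-and-hash step described above can be sketched as follows. This is a minimal illustration, not the Bag Database's actual Java code; the function name and the choice of MD5 are assumptions made for the example.

```python
import hashlib


def hash_bag_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Read the whole file in chunks and return its MD5 hex digest.

    Scan time is dominated by this full sequential read: a multi-GB
    bag on a slow HDD or network mount takes as long as the read does,
    regardless of how fast the database is.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
```

Because every byte of every bag must pass through a loop like this, storage throughput sets a hard floor on scan duration.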

@danthony06
Contributor

danthony06 commented Aug 24, 2022 via email

@ptulpen
Author

ptulpen commented Aug 29, 2022

Hello,
@pjreed: the indexes look fine, and what you say about indexes sounds reasonable, so it probably has nothing to do with that.
@danthony06: yes, they are mostly cut at 4.1 GB.
Is there a limit? Or something to optimize?

@danthony06
Contributor

@ptulpen There's no limit to my knowledge. I mostly wanted to make sure you weren't uploading 100 GB bag files that might be causing network issues.

@pjreed
Contributor

pjreed commented Aug 29, 2022

How much free RAM does your server have? Is it possible that it's hitting swap space while trying to read the bags?

@ptulpen
Author

ptulpen commented Aug 30, 2022

I have 32 GB RAM (and 8 CPUs), and it is not fully used.
I also increased the Java memory limit with -e CATALINA_OPTS=" -Xmx10g".

@ptulpen
Author

ptulpen commented Sep 6, 2022

I also see some interesting errors in the scanning process, like this one:

2022-09-06 14:53:24.289 [pool-2-thread-1] ERROR c.g.s.b.s.f.FilesystemBagStorageImpl - Unexpected error updating bag file:
java.lang.IllegalArgumentException: Chunk [**description**] is not a valid entry
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:219)
        at com.google.common.base.Splitter$MapSplitter.split(Splitter.java:526)
        at com.github.swrirobotics.bags.BagService.lambda$getMetadata$6(BagService.java:1307)
        at com.github.swrirobotics.bags.reader.BagFile.forMessagesOnTopic(BagFile.java:395)
        at com.github.swrirobotics.bags.BagService.getMetadata(BagService.java:1303)
        at com.github.swrirobotics.bags.BagService.extractTagsFromBagFile(BagService.java:1352)
        at com.github.swrirobotics.bags.BagService.addTagsToBag(BagService.java:1564)
        at com.github.swrirobotics.bags.BagService.insertNewBag(BagService.java:1499)
        at com.github.swrirobotics.bags.BagService.updateBagInDatabase(BagService.java:1761)
        at com.github.swrirobotics.bags.BagService.updateBagFile(BagService.java:1689)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl.lambda$updateBags$3(FilesystemBagStorageImpl.java:159)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl.updateBags(FilesystemBagStorageImpl.java:146)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl$$FastClassBySpringCGLIB$$17031e11.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:783)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:753)
        at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:753)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:698)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl$$EnhancerBySpringCGLIB$$953167bd.updateBags(<generated>)
        at com.github.swrirobotics.bags.storage.BagScanner.lambda$scanStorage$0(BagScanner.java:371)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at com.github.swrirobotics.bags.storage.BagScanner.lambda$scanStorage$2(BagScanner.java:374)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

"description" is part of the folder structure, but it is not even a complete folder name.
The name follows the pattern YYYY-MM-DD_description.
So are there any "forbidden" characters or patterns?

@pjreed
Contributor

pjreed commented Sep 6, 2022

That's interesting, that definitely isn't a normal error...

That exception looks like it's being thrown from code that is trying to parse tags in the bag file. If you've configured metadata topics, then it expects every message on those topics to have a string field named data, and each one of those should be a newline-separated set of key:value pairs; for example:

name: John Doe
email: [email protected]

I suspect you've got a bag file with metadata that is formatted in a way it doesn't expect; do you have an example of anything in your files that might be formatted differently from that?
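The strict splitting that produces the "Chunk [...] is not a valid entry" error can be illustrated in Python. This is a sketch of the observed behavior only; the actual code is Java and uses Guava's Splitter.MapSplitter, as the stack trace shows.

```python
def parse_metadata(data: str) -> dict:
    """Split newline-separated 'key: value' pairs, failing on bad chunks.

    Mirrors the strict parsing seen in the stack trace: a line without
    a ':' separator (e.g. a bare 'description') raises an error instead
    of being skipped.
    """
    result = {}
    for chunk in data.splitlines():
        chunk = chunk.strip()
        if not chunk:
            continue
        if ":" not in chunk:
            raise ValueError(f"Chunk [{chunk}] is not a valid entry")
        key, _, value = chunk.partition(":")
        result[key.strip()] = value.strip()
    return result
```

So a metadata message whose data field is just a bare string like "description" (no colon) would trip this check on every scan pass.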

@pjreed
Contributor

pjreed commented Sep 6, 2022

I've submitted a PR at #196 that will make it handle invalid metadata more gracefully when scanning bag files. I don't know if that will fix the speed issue you're having, but it may fix some other issues people have seen with it failing to recognize certain bag files...

@ptulpen
Author

ptulpen commented Sep 9, 2022

Hello,
yes, you are completely right.
rostopic gave me:

%time,field.data
1234567894591808795,mydescription

and I have now rebuilt it to:

%time,field.data
1234567894591808795,description: mydescription

I tested this with a small subset, and it looks much faster.
Now I am rewriting the upload scripts and writing a "repair" script.
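The repair step could look roughly like this. It is a hypothetical sketch (the actual scripts were not shared) that prefixes bare metadata strings with a "description: " key so they become valid "key: value" pairs.

```python
def repair_metadata(data: str, default_key: str = "description") -> str:
    """Prefix any non-empty line lacking a ':' separator with a default
    key, so the strict key:value parser on the scanner side accepts it.
    Lines that already look like 'key: value' pairs are left untouched.
    """
    fixed = []
    for line in data.splitlines():
        if line.strip() and ":" not in line:
            fixed.append(f"{default_key}: {line.strip()}")
        else:
            fixed.append(line)
    return "\n".join(fixed)
```

Running existing bare descriptions through a transform like this before (re-)uploading matches the manual fix shown above.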

The more graceful metadata scanning sounds good too; issues like that could happen in other scenarios as well.
I tried to test it, but I could not build it. Can you provide the branch as a container as well? (This is how I run the current system.)

@pjreed
Contributor

pjreed commented Sep 9, 2022

Sure, I've pushed an image containing my build to ghcr.io/hatchbed/bag-database:v3.5.1-SNAPSHOT. Give that a try and see if it works for you.

@danthony06
Contributor

v3.5.1 has been released with this fix.

@ptulpen
Author

ptulpen commented Sep 22, 2022

The patch regarding the metadata is great.
But a larger set of files showed that it still takes a long time (for 875 files it took a week).
What also looks interesting: when I add new files and start a scan, a lot of files get scanned, and only when the scan is finished do they all appear at once on the website.
Is this intended behaviour?

EDIT: the records also only appear in the database once everything is done (at least according to grepping through a pg_dump).
Maybe this is also connected with #195.
My blind guess would be that there is some kind of lock and caching involved.

@ptulpen
Author

ptulpen commented Nov 9, 2022

Another thought regarding this: we saw that we have many images and quite large videos inside the bags.
Could we maybe skip the analysis/extraction of those during the scan and focus on text-based content?

@Timple

Timple commented Oct 20, 2023

For us, analysis is CPU-bound. It seems to be single-threaded, which is a waste of the 16 threads available 🙁
