
scanning very slow #170

Open
ptulpen opened this issue Jan 19, 2022 · 18 comments

@ptulpen

ptulpen commented Jan 19, 2022

Hello,
I have the issue that scanning is very slow; it has now been running for over a week for around 3,000 files.
Is there a way to restart and/or troubleshoot the scanning process?
(Restarting the container, and even the server, does not help.)
Version is 3.4.2.

@danthony06
Contributor

There should be logs in /usr/local/tomcat/logs, but I'm not sure if they will have information useful for this problem. Was it previously running fast, or is this a new installation?

@ptulpen
Author

ptulpen commented Jan 21, 2022

Hello,
it was not this slow before.
The only interesting parts in the logs are snippets like:

19-Jan-2022 20:29:41.728 INFO [MessageBroker-4] org.springframework.web.socket.config.WebSocketMessageBrokerStats.lambda$initLoggingTask$0 WebSocketSession[0 current WS(0)-HttpStream(0)-HttpPoll(0), 5 total, 0 closed abnormally (0 connect failure, 0 send limit, 0 transport error)], stompSubProtocol[processed CONNECT(0)-CONNECTED(0)-DISCONNECT(0)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 15], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 5], sockJsScheduler[pool size = 8, active threads = 1, queued tasks = 4, completed tasks = 275440]

@ptulpen
Author

ptulpen commented Aug 24, 2022

One thought that came to my mind:
could setting indexes or similar options in PostgreSQL help improve the performance?

@pjreed
Contributor

pjreed commented Aug 24, 2022

The database should already have quite a few indexes on relevant columns; you could check using a tool like DBeaver to connect to the database, although if your tables are missing indexes, I would expect that to drastically slow down searching, not scanning new files. In fact, lacking indexes would actually make inserting new records faster since it can simply append them to the table without updating the indexes.

Scanning should mostly be limited by disk read speed, since it has to read in the entire file, and to a lesser degree by CPU speed, since it has to generate a hash to identify the bag file. This could be an issue if you're reading very large bags over a slow network connection, or potentially if you're reading large bag files from slow HDDs, especially if the bag files themselves are unindexed and there's a lot of disk thrashing going on.
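The read-and-hash step described above can be sketched as follows. This is a minimal illustration, not the Bag Database's actual Java code; the function name and the choice of MD5 are assumptions made for the example.

```python
import hashlib


def hash_bag_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Read the whole file in chunks and return its MD5 hex digest.

    Scan time is dominated by this full sequential read: a multi-GB
    bag on a slow HDD or network mount takes as long as the read does,
    regardless of how fast the database is.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
```

Because every byte of every bag must pass through a loop like this, storage throughput sets a hard floor on scan duration.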

@danthony06
Contributor

danthony06 commented Aug 24, 2022 via email

@ptulpen
Author

ptulpen commented Aug 29, 2022

Hello,
@pjreed: the indexes look fine, and what you say about indexes sounds reasonable, so it probably has nothing to do with that.
@danthony06: yes, they are mostly cut at 4.1 GB.
Is there a limit? Or something to optimize?

@danthony06
Contributor

@ptulpen There's no limit to my knowledge. I mostly wanted to make sure you weren't uploading 100 GB bag files that might be causing network issues.

@pjreed
Contributor

pjreed commented Aug 29, 2022

How much free RAM does your server have? Is it possible that it's hitting swap space while trying to read the bags?

@ptulpen
Author

ptulpen commented Aug 30, 2022

I have 32 GB RAM (and 8 CPUs), and it is not fully used.
I also increased the Java memory limit with -e CATALINA_OPTS=" -Xmx10g".

@ptulpen
Author

ptulpen commented Sep 6, 2022

I also see some interesting errors in the scanning process, like this one:

2022-09-06 14:53:24.289 [pool-2-thread-1] ERROR c.g.s.b.s.f.FilesystemBagStorageImpl - Unexpected error updating bag file:
java.lang.IllegalArgumentException: Chunk [**description**] is not a valid entry
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:219)
        at com.google.common.base.Splitter$MapSplitter.split(Splitter.java:526)
        at com.github.swrirobotics.bags.BagService.lambda$getMetadata$6(BagService.java:1307)
        at com.github.swrirobotics.bags.reader.BagFile.forMessagesOnTopic(BagFile.java:395)
        at com.github.swrirobotics.bags.BagService.getMetadata(BagService.java:1303)
        at com.github.swrirobotics.bags.BagService.extractTagsFromBagFile(BagService.java:1352)
        at com.github.swrirobotics.bags.BagService.addTagsToBag(BagService.java:1564)
        at com.github.swrirobotics.bags.BagService.insertNewBag(BagService.java:1499)
        at com.github.swrirobotics.bags.BagService.updateBagInDatabase(BagService.java:1761)
        at com.github.swrirobotics.bags.BagService.updateBagFile(BagService.java:1689)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl.lambda$updateBags$3(FilesystemBagStorageImpl.java:159)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl.updateBags(FilesystemBagStorageImpl.java:146)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl$$FastClassBySpringCGLIB$$17031e11.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:783)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:753)
        at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:753)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:698)
        at com.github.swrirobotics.bags.storage.filesystem.FilesystemBagStorageImpl$$EnhancerBySpringCGLIB$$953167bd.updateBags(<generated>)
        at com.github.swrirobotics.bags.storage.BagScanner.lambda$scanStorage$0(BagScanner.java:371)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at com.github.swrirobotics.bags.storage.BagScanner.lambda$scanStorage$2(BagScanner.java:374)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

"description" is part of the folder structure, but it is not even a complete folder name.
The name follows the pattern YYYY-MM-DD_description.
So are there any "forbidden" characters or patterns?

@pjreed
Contributor

pjreed commented Sep 6, 2022

That's interesting, that definitely isn't a normal error...

That exception looks like it's being thrown from code that is trying to parse tags in the bag file. If you've configured metadata topics, then it expects every message on those topics to have a string field named data, and each one of those should be a newline-separated set of key:value pairs; for example:

name: John Doe
email: [email protected]

I suspect you've got a bag file with metadata that is formatted in a way it doesn't expect; do you have an example of anything in your files that might be formatted differently from that?
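The strict splitting that produces the "Chunk [...] is not a valid entry" error can be illustrated in Python. This is a sketch of the observed behavior only; the actual code is Java and uses Guava's Splitter.MapSplitter, as the stack trace shows.

```python
def parse_metadata(data: str) -> dict:
    """Split newline-separated 'key: value' pairs, failing on bad chunks.

    Mirrors the strict parsing seen in the stack trace: a line without
    a ':' separator (e.g. a bare 'description') raises an error instead
    of being skipped.
    """
    result = {}
    for chunk in data.splitlines():
        chunk = chunk.strip()
        if not chunk:
            continue
        if ":" not in chunk:
            raise ValueError(f"Chunk [{chunk}] is not a valid entry")
        key, _, value = chunk.partition(":")
        result[key.strip()] = value.strip()
    return result
```

So a metadata message whose data field is just a bare string like "description" (no colon) would trip this check on every scan pass.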

@pjreed
Contributor

pjreed commented Sep 6, 2022

I've submitted a PR at #196 that will make it handle invalid metadata more gracefully when scanning bag files. I don't know if that will fix the speed issue you're having, but it may fix some other issues people have seen with it failing to recognize certain bag files...

@ptulpen
Author

ptulpen commented Sep 9, 2022

Hello,
yes, you are completely right.
rostopic gave me:

%time,field.data
1234567894591808795,mydescription

and I have now rebuilt it to:

%time,field.data
1234567894591808795,description: mydescription

I tested this with a small subset, and it looks much faster.
Now I am rewriting the upload scripts and writing a "repair" script.
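The repair step could look roughly like this. It is a hypothetical sketch (the actual scripts were not shared) that prefixes bare metadata strings with a "description: " key so they become valid "key: value" pairs.

```python
def repair_metadata(data: str, default_key: str = "description") -> str:
    """Prefix any non-empty line lacking a ':' separator with a default
    key, so the strict key:value parser on the scanner side accepts it.
    Lines that already look like 'key: value' pairs are left untouched.
    """
    fixed = []
    for line in data.splitlines():
        if line.strip() and ":" not in line:
            fixed.append(f"{default_key}: {line.strip()}")
        else:
            fixed.append(line)
    return "\n".join(fixed)
```

Running existing bare descriptions through a transform like this before (re-)uploading matches the manual fix shown above.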

The more graceful metadata scanning sounds good too; issues like that could happen in other scenarios as well.
I tried to test it, but I could not build it. Can you provide the branch as a container as well? (This is how I run the current system.)

@pjreed
Contributor

pjreed commented Sep 9, 2022

Sure, I've pushed an image containing my build to ghcr.io/hatchbed/bag-database:v3.5.1-SNAPSHOT. Give that a try and see if it works for you.

@danthony06
Contributor

v3.5.1 has been released with this fix.

@ptulpen
Author

ptulpen commented Sep 22, 2022

The patch regarding the metadata is great.
But a larger set of files showed that it still takes a long time (for 875 files it took a week).
What also looks interesting: when I add new files and start a scan, a lot of files get scanned, and only when the scan is finished do they all appear at once on the website.
Is this intended behaviour?

EDIT: the records also only appear in the database once everything is done (at least according to grepping through a pg_dump).
Maybe this is also connected with #195.
My blind guess would be that there is some kind of lock and caching involved.

@ptulpen
Author

ptulpen commented Nov 9, 2022

Another thought regarding this: we saw that we have many images and quite large videos inside the bags.
Could we maybe skip the analysis/extraction of those during the scan and focus on text-based content?

@Timple

Timple commented Oct 20, 2023

For us, analysis is CPU-bound. It seems to be single-threaded, which is a waste of the 16 threads available 🙁
