
Engine fails to extract UASTs on actual Spark cluster #402

Open
bzz opened this issue Jun 18, 2018 · 5 comments
bzz commented Jun 18, 2018

When running in local mode with --packages "tech.sourced:engine:0.6.3", extracting UASTs works.

But after switching to an actual Apache Spark cluster (i.e. Standalone mode) with the same params and query, extractUASTs fails with

java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createUnstarted()Lcom/google/common/base/Stopwatch;
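A quick way to see where the conflicting class comes from is to ask the classloader directly. This is only a sketch: it assumes Guava is on the driver classpath (which it is in a stock spark-shell), and note that the executor classpath can still differ, so the driver may look fine while tasks fail.

```scala
// Sketch: paste into the same spark-shell. Prints which jar supplied
// Guava's Stopwatch to the driver's classloader.
val source = Option(classOf[com.google.common.base.Stopwatch]
  .getProtectionDomain.getCodeSource)
println(source.map(_.getLocation.toString).getOrElse("<bootstrap classloader>"))
```

If the printed jar is Spark's bundled Guava rather than a shaded copy inside the engine jar, the NoSuchMethodError above is the expected symptom, since Stopwatch.createUnstarted() only exists in newer Guava versions.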

Steps to Reproduce

  1. Start Apache Spark in cluster mode and a spark-shell with Engine
    export MASTER_HOST=127.0.0.1
    $SPARK_HOME/sbin/start-master.sh -h $MASTER_HOST -p 7077
    $SPARK_HOME/sbin/start-slave.sh $MASTER_HOST:7077
    $SPARK_HOME/bin/spark-shell --master "spark://$MASTER_HOST:7077" --packages 
    "tech.sourced:engine:0.6.3"
    
  2. Run extractUASTs
    import tech.sourced.engine._
    val path = "<path-to-siva-files>"
    val engine = Engine(spark, path, "siva")
    
    val repos = engine.getRepositories
    val files = repos.getHEAD
         .getCommits
         .getTreeEntries
         .getBlobs
    val uast = files.extractUASTs
    
    uast.count

Expected Behavior

get the number of UASTs

Current Behavior

java.lang.NoSuchMethodError

18/06/18 10:39:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.1.37, executor 0): java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createUnstarted()Lcom/google/common/base/Stopwatch;
	at io.grpc.internal.GrpcUtil$4.get(GrpcUtil.java:566)
	at io.grpc.internal.GrpcUtil$4.get(GrpcUtil.java:563)
	at io.grpc.internal.CensusStatsModule$ClientCallTracer.<init>(CensusStatsModule.java:333)
	at io.grpc.internal.CensusStatsModule.newClientCallTracer(CensusStatsModule.java:137)
	at io.grpc.internal.CensusStatsModule$StatsClientInterceptor.interceptCall(CensusStatsModule.java:672)
	at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:104)
	at io.grpc.internal.ManagedChannelImpl.newCall(ManagedChannelImpl.java:636)
	at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
	at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
	at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
	at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
	at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
	at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
	at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)

Context

This mimics the file-duplication workflow we have in Gemini's hash stage. The ability to reproduce it in spark-shell is crucial for debugging.

Possible Solution

  • Update the build so the final fat jar avoids calling the un-shaded version of Guava
  • Add a TravisCI profile that runs this query on an actual local Apache Spark standalone cluster
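The first item could look roughly like this in the Engine build. This is a sketch only, assuming the sbt-assembly plugin; the relocated package name is hypothetical and the rule would need to match the project's actual build definition:

```scala
// build.sbt sketch (hypothetical, assumes sbt-assembly): relocate Guava
// inside the fat jar so it cannot clash with the older Guava that Spark
// puts on the executor classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shadedguava.com.google.common.@1")
    .inAll
)
```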

Your Environment (for bugs)

  • Spark version: 2.2.0
  • Engine version: 0.6.3
  • Operating System and version: tested on Linux and macOS
bzz commented Jun 18, 2018

Update: this seems to be related to how --packages works in Apache Spark 😕

If spark-shell is started with:

  • --packages "tech.sourced:engine:0.6.3" -> java.lang.NoSuchMethodError
  • --jars <path-to-engine>/target/engine-0.6.3.jar -> works as expected

smola commented Jun 18, 2018

--packages used to work for me. This seems to be the usual dependency hell with conflicting Guava versions.
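One way to check whether a given build bundles an un-shaded Guava is to list the jar's entries. A sketch, with a hypothetical jar path; a properly shaded build would list the class under a relocated package instead:

```shell
#!/bin/sh
# Sketch: look for Guava's Stopwatch inside a fat jar (path is hypothetical).
# An un-shaded build contains com/google/common/base/Stopwatch directly.
JAR="engine-0.6.3.jar"
if [ -f "$JAR" ]; then
  unzip -l "$JAR" | grep 'com/google/common/base/Stopwatch' \
    || echo "no un-shaded Guava found"
else
  echo "jar not found: $JAR"
fi
```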

smola added the bug label Jun 18, 2018
ajnavarro commented Jun 18, 2018

@bzz maybe it is related to some kind of cache used by --packages?

Maybe deleting the .ivy2 folder solves the problem.
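For completeness, clearing the resolver cache would look something like this. A sketch only: the paths are the Ivy defaults and may differ per installation, and on a cluster this would have to be run on the master and every worker, not just the driver machine:

```shell
#!/bin/sh
# Sketch: remove cached engine artifacts so --packages resolves them anew.
# Paths are the Ivy defaults (hypothetical for a given installation).
IVY_HOME="${HOME}/.ivy2"
rm -rf "$IVY_HOME/cache/tech.sourced" "$IVY_HOME/jars"
echo "cleared cached engine artifacts under $IVY_HOME"
```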

bzz commented Jun 18, 2018

True, @smola: --packages used to work for me as well.

@ajnavarro Hmm, but it did not work either on my local machine or from a new pod on the staging pipeline cluster. Or do you mean some cache on the Spark master side?

A quick verification on a new pod with an empty cache and a local standalone cluster:

kubectl run -i --tty spark-new --image=srcd/spark:2.2.0_v2 --generator="run-pod/v1" --command -- /bin/bash

whoami
ls -la /root

export SPARK_HOME="/opt/spark-2.2.0-bin-hadoop2.7"
export MASTER_HOST=127.0.0.1
$SPARK_HOME/sbin/start-master.sh -h $MASTER_HOST -p 7077
$SPARK_HOME/sbin/start-slave.sh $MASTER_HOST:7077
$SPARK_HOME/bin/spark-shell --master "spark://$MASTER_HOST:7077" --packages "tech.sourced:engine:0.6.4"

and then

import tech.sourced.engine._
val path = "hdfs://hdfs-namenode/pga/siva/latest/ff/"
val engine = Engine(spark, path, "siva")

val repos = engine.getRepositories
val files = repos.getHEAD
     .getCommits
     .getTreeEntries
     .getBlobs
val uast = files.extractUASTs

uast.count

results in

18/06/18 12:07:43 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 2, 10.2.15.90, executor 0): java.lang.NoSuchMethodError: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
	at com.google.protobuf.GeneratedMessageV3$FieldAccessorTable.<init>(GeneratedMessageV3.java:1727)
	at com.google.protobuf.DurationProto.<clinit>(DurationProto.java:52)
	at com.google.protobuf.duration.DurationProto$.javaDescriptor$lzycompute(DurationProto.scala:26)
	at com.google.protobuf.duration.DurationProto$.javaDescriptor(DurationProto.scala:25)
	at gopkg.in.bblfsh.sdk.v1.protocol.generated.GeneratedProto$.javaDescriptor$lzycompute(GeneratedProto.scala:63)
	at gopkg.in.bblfsh.sdk.v1.protocol.generated.GeneratedProto$.javaDescriptor(GeneratedProto.scala:59)
	at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$.<init>(ProtocolServiceGrpc.scala:30)
	at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$.<clinit>(ProtocolServiceGrpc.scala)
	at org.bblfsh.client.BblfshClient.<init>(BblfshClient.scala:20)

smacker commented Jun 19, 2018

@bzz it's a cache on the master (or workers) side. I used to have the same problem, and removing the cache helped. Reference: https://github.com/src-d/engine/issues/389
