
Column Stats #42

Merged: 16 commits into master from add-stats on Jun 10, 2020
Conversation

@c-horn (Collaborator) commented May 6, 2020

No description provided.

@c-horn c-horn self-assigned this May 6, 2020
@colindean colindean self-requested a review May 6, 2020 21:14
@colindean colindean added the enhancement New feature or request label May 6, 2020
@c-horn c-horn marked this pull request as draft May 6, 2020 21:14
@c-horn c-horn requested review from colindean and removed request for colindean May 6, 2020 21:14
@phpisciuneri (Contributor) commented:

see #44

@c-horn (Collaborator, Author) commented May 27, 2020

TwoPassCheapCheck allows both the first- and second-pass aggregations of ColStats to run in exactly two passes over the data, instead of N+1 where N is the number of columns being analyzed.
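To make the pass-count claim concrete, here is a minimal sketch of the underlying idea (hypothetical column names, plain DataFrame API rather than this PR's aggregators): folding every column's aggregate expressions into a single agg call lets Spark compute them all in one scan, so the number of passes stays constant as the number of columns grows.

import org.apache.spark.sql.{DataFrame, Row, functions => F}

// Naive approach: one aggregation job per column, i.e. N passes over the data.
def statsPerColumn(df: DataFrame, cols: Seq[String]): Map[String, Row] =
  cols.map(c => c -> df.agg(F.min(c), F.max(c), F.avg(c)).head).toMap

// Combined approach: all columns' aggregate expressions in a single agg call,
// so Spark computes everything in one scan. Assumes cols is non-empty.
def statsOnePass(df: DataFrame, cols: Seq[String]): Row = {
  val exprs = cols.flatMap(c => Seq(F.min(c), F.max(c), F.avg(c)))
  df.agg(exprs.head, exprs.tail: _*).head
}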

@c-horn c-horn changed the title [WIP] Column Stats Column Stats May 28, 2020
@c-horn c-horn marked this pull request as ready for review May 28, 2020 23:01
@c-horn c-horn requested a review from phpisciuneri May 28, 2020 23:02
@JayGaynor (Collaborator) commented:

@c-horn and @phpisciuneri What's the status of this?

@c-horn (Collaborator, Author) commented Jun 2, 2020

Waiting on reviews, @JayGaynor.

@phpisciuneri (Contributor) commented:

@JayGaynor Yep, waiting on me. Sorry @c-horn. I've been messing around with Hive 3.

@colindean (Collaborator) left a comment:

I touch a lot on bins in these comments, but it looks like they're likely already in order. The question becomes: do we want to guarantee order contractually? ¯\_(ツ)_/¯

I'm gonna run it when I get a chance. I'll ✅ after I've done that.

@@ -0,0 +1,3 @@
package com.target.data_validator.stats

case class Histogram(bins: Seq[Bin])
@colindean (Collaborator):

Thought: Should the collection passed into this be sorted going in or upon use?

@colindean (Collaborator):

If bins are non-overlapping, should this be a sorted collection? SortedSet?

@c-horn (Collaborator, Author):

Considering we have no fine-grained control over Spark's serialization and deserialization of these types (to and from internal Row, and network formats like Kryo), I would say no.

@c-horn (Collaborator, Author):

I can get it to compile with this:

diff --git a/src/main/scala/com/target/data_validator/stats/Bin.scala b/src/main/scala/com/target/data_validator/stats/Bin.scala
index e392153..fec308c 100644
--- a/src/main/scala/com/target/data_validator/stats/Bin.scala
+++ b/src/main/scala/com/target/data_validator/stats/Bin.scala
@@ -1,3 +1,7 @@
 package com.target.data_validator.stats

 case class Bin(lowerBound: Double, upperBound: Double, count: Long)
+
+object Bin {
+  implicit val ordering = Ordering by { (_: Bin).lowerBound }
+}
diff --git a/src/main/scala/com/target/data_validator/stats/Histogram.scala b/src/main/scala/com/target/data_validator/stats/Histogram.scala
index 52a194b..4b09898 100644
--- a/src/main/scala/com/target/data_validator/stats/Histogram.scala
+++ b/src/main/scala/com/target/data_validator/stats/Histogram.scala
@@ -1,3 +1,5 @@
 package com.target.data_validator.stats

-case class Histogram(bins: Seq[Bin])
+import scala.collection.SortedSet
+
+case class Histogram(bins: SortedSet[Bin])
diff --git a/src/main/scala/com/target/data_validator/stats/SecondPassStats.scala b/src/main/scala/com/target/data_validator/stats/SecondPassStats.scala
index 33ab191..b736919 100644
--- a/src/main/scala/com/target/data_validator/stats/SecondPassStats.scala
+++ b/src/main/scala/com/target/data_validator/stats/SecondPassStats.scala
@@ -4,6 +4,8 @@ import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.ScalaReflection
 import org.apache.spark.sql.types.DataType

+import scala.collection.SortedSet
+
 case class SecondPassStats(stdDev: Double, histogram: Histogram)

 object SecondPassStats {
@@ -21,13 +23,13 @@ object SecondPassStats {
     SecondPassStats(
       stdDev = row.getDouble(0),
       histogram = Histogram(
-        row.getStruct(1).getSeq[Row](0) map {
+        row.getStruct(1).getSeq[Row](0).map {
           bin => Bin(
             lowerBound = bin.getDouble(0),
             upperBound = bin.getDouble(1),
             count = bin.getLong(2)
           )
-        }
+        } (collection.breakOut)
       )
     )
   }
diff --git a/src/main/scala/com/target/data_validator/stats/SecondPassStatsAggregator.scala b/src/main/scala/com/target/data_validator/stats/SecondPassStatsAggregator.scala
index 9c6e474..9567304 100644
--- a/src/main/scala/com/target/data_validator/stats/SecondPassStatsAggregator.scala
+++ b/src/main/scala/com/target/data_validator/stats/SecondPassStatsAggregator.scala
@@ -4,6 +4,8 @@ import org.apache.spark.sql.Row
 import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
 import org.apache.spark.sql.types._

+import scala.collection.SortedSet
+
 /**
   * Calculate the standard deviation and histogram of a numeric column
   */
@@ -100,7 +102,7 @@ class SecondPassStatsAggregator(firstPassStats: FirstPassStats) extends UserDefi
     }
     SecondPassStats(
       math.sqrt(buffer.getDouble(sumOfSquares) / (buffer.getLong(count) - 1)),
-      Histogram(bins)
+      Histogram(bins.map(x => x)(collection.breakOut))
     )
   }

diff --git a/src/test/scala/com/target/data_validator/stats/NumericData.scala b/src/test/scala/com/target/data_validator/stats/NumericData.scala
index 5209952..31559c7 100644
--- a/src/test/scala/com/target/data_validator/stats/NumericData.scala
+++ b/src/test/scala/com/target/data_validator/stats/NumericData.scala
@@ -1,5 +1,7 @@
 package com.target.data_validator.stats

+import scala.collection.SortedSet
+
 case class NumericData(value1: Double)

 object NumericData {
@@ -22,7 +24,7 @@ object NumericData {
   val secondPassStats = SecondPassStats(
     3.0276503540974917,
     Histogram(
-      Seq(
+      SortedSet(
         Bin(0.0, 0.9, 1),
         Bin(0.9, 1.8, 1),
         Bin(1.8, 2.7, 1),
diff --git a/src/test/scala/com/target/data_validator/validator/ColStatsSpec.scala b/src/test/scala/com/target/data_validator/validator/ColStatsSpec.scala
index fe58a8e..da48836 100644
--- a/src/test/scala/com/target/data_validator/validator/ColStatsSpec.scala
+++ b/src/test/scala/com/target/data_validator/validator/ColStatsSpec.scala
@@ -6,6 +6,8 @@ import com.target.data_validator.stats._
 import io.circe.Json
 import org.scalatest._

+import scala.collection.SortedSet
+
 // scalastyle:off magic.number
 class ColStatsSpec extends FunSpec with Matchers with TestingSparkSession {
   import spark.implicits._
@@ -86,7 +88,7 @@ object ColStatsSpec {
     21.0,
     5.916079783099616,
     Histogram(
-      Seq(
+      SortedSet(
         Bin(2.0, 3.9, 2),
         Bin(3.9, 5.8, 2),
         Bin(5.8, 7.699999999999999, 2),
@@ -109,7 +111,7 @@ object ColStatsSpec {
     0.958,
     0.26725316654123255,
     Histogram(
-      Seq(
+      SortedSet(
         Bin(0.0536, 0.14404, 3),
         Bin(0.14404, 0.23448, 3),
         Bin(0.23448, 0.32492, 2),

But Spark doesn't like non-Seq collections, and I assume circe might have similar issues:

[info] - should correctly calculate the standard deviation and histogram *** FAILED ***
[info]   java.lang.UnsupportedOperationException: Schema for type scala.collection.SortedSet[com.target.data_validator.stats.Bin] is not supported
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780)
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715)
[info]   at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
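The failure is easy to reproduce directly against Spark's reflection entry point (a minimal sketch, using the Bin case class from this PR):

import org.apache.spark.sql.catalyst.ScalaReflection
import scala.collection.SortedSet
import com.target.data_validator.stats.Bin

// Seq is one of the collection types Spark's reflection can encode.
ScalaReflection.schemaFor[Seq[Bin]]       // ok: array of Bin structs

// SortedSet has no supported encoding, matching the test failure above.
ScalaReflection.schemaFor[SortedSet[Bin]] // throws UnsupportedOperationException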

@colindean (Collaborator):

Ah, OK, that settles it then. Ordering is an exercise left to the consumer! It might be wise to say that in the docs (cc @phpisciuneri)

@phpisciuneri (Contributor):

I guess I wasn't too concerned about this. We can revisit if it is a problem. The JSON output really isn't intended to be read by eyeballs; instead, it should be used to plot the histogram, in which case the bin boundaries and counts should be sufficient, no matter the order.
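If a consumer does want a guaranteed order, sorting at the point of use is a one-liner (a sketch; histogram here stands for any value of this PR's Histogram type):

// Sort bins by lower bound before plotting; no upstream ordering guarantee needed.
val orderedBins: Seq[Bin] = histogram.bins.sortBy(_.lowerBound)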

  */
class SecondPassStatsAggregator(firstPassStats: FirstPassStats) extends UserDefinedAggregateFunction {

  val NUMBER_OF_BINS = 10
@colindean (Collaborator):

Does it make sense to make this configurable?

Suggested change:
-  val NUMBER_OF_BINS = 10
+  final val NUMBER_OF_BINS = 10
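One way to make it configurable (a hypothetical sketch, not part of this PR; HistogramConfig and its members are invented names) is to lift the bin count into a small config type and derive the bin width from it:

// Hypothetical config type: the caller picks the bin count, defaulting to
// the currently hard-coded value of 10.
case class HistogramConfig(numberOfBins: Int = 10) {
  require(numberOfBins > 0, "numberOfBins must be positive")
  def binWidth(min: Double, max: Double): Double = (max - min) / numberOfBins
}

// Usage: HistogramConfig(25).binWidth(0.0, 100.0) yields 4.0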

   * evaluate the standard deviation and define bins of histogram
   */
  override def evaluate(buffer: Row): Any = {
    val bins: Seq[Bin] = for (i <- binStart to binEnd) yield {
@colindean (Collaborator):

Hmm. Bins will already be in order de facto. Maybe we don't need to explicitly sort or guarantee sort order?

@c-horn (Collaborator, Author):

I don't think there is any guarantee of a specific order when Spark serializes/deserializes from row format (DataFrame -> Dataset).

@c-horn (Collaborator, Author):

Actually there probably is consistency with columns containing arrays, but we definitely can't guarantee it with a type like SortedSet. The aggregator will always output them low to high.

I think the fact that Histogram makes no guarantees that the bins are continuous and do not overlap is a larger "logical hole" than the bin sorting.

@phpisciuneri (Contributor):

@c-horn yeah, it might make sense to guard against overlap. Gaps are less concerning, as they can be interpreted as empty bins, but maybe we should check that the bins have equal width? It doesn't have to be part of this PR; we can create an issue and do it separately.
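A sketch of what such a guard could look like (a hypothetical helper, not part of this PR, written against the Bin case class above):

// Hypothetical guard: bins must be contiguous (each bin starts where the
// previous one ends) and equal-width, within a floating-point tolerance.
def validateBins(bins: Seq[Bin], eps: Double = 1e-9): Boolean = {
  val contiguous = bins.sliding(2).forall {
    case Seq(a, b) => math.abs(a.upperBound - b.lowerBound) < eps
    case _         => true // zero or one bin: nothing to compare
  }
  val widths = bins.map(b => b.upperBound - b.lowerBound)
  val equalWidth =
    widths.headOption.forall(h => widths.forall(w => math.abs(w - h) < eps))
  contiguous && equalWidth
}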

@phpisciuneri (Contributor) left a comment:

This is a nice implementation @c-horn. The only thing I would add is some doc on usage in the README. I can do that if you'd like.


@colindean (Collaborator) left a comment:

Good to go for now. Needs docs in this PR or in a quick follow-up before a release.

@c-horn c-horn requested a review from samratmitra-0812 as a code owner June 9, 2020 20:10
@phpisciuneri phpisciuneri merged commit 9a8aa9c into master Jun 10, 2020
@phpisciuneri phpisciuneri deleted the add-stats branch January 15, 2021 19:36