Releases: derrickoswald/CIMSpark
CIMReader-2.11-2.0.1-1.9.1
Fix for Issue #6 Dropped Elements.
When determining how much extra to read beyond the end of a Split, the computation was based on the FSDataInputStream.available() function, which returns an Int and not a Long. So for all files over the 2GB barrier (the maximum integer value), available() was topping out at 2147483647.
This meant the computed extra was zero, that is, no over-read at all, and hence the last element at the end of some Splits was dropped for large files. The striped files were all under 2GB and so did not exhibit this problem.
This has been fixed by using the FileSystem.getFileStatus() function instead, which returns a Long.
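The root cause can be sketched in a few lines of Scala; the names below are illustrative, not the actual CIMReader code:

```scala
// Sketch of the overflow: an Int-returning call like available() can
// report at most Int.MaxValue (2147483647) bytes, so for files past the
// 2GB barrier the remaining-bytes computation truncated, while the Long
// file length from FileSystem.getFileStatus() does not.
object OverReadSketch
{
    // what an Int-returning call like available() reports at most
    def remainingViaAvailable (length: Long, position: Long): Long =
        math.min (length - position, Int.MaxValue.toLong).toInt.toLong

    // the fix: compute from the Long file length of getFileStatus()
    def remainingViaFileStatus (length: Long, position: Long): Long =
        length - position
}
```

For an 8017082910 byte file read from position 0, the first function caps out at 2147483647 while the second reports the full remainder.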
CIMReader-2.11-2.0.1-1.9.0
Release under the new name: CIMReader
- add Asset/LifecycleDate to edges
- add "split size" option (ch.ninecode.cim.split_maxsize) to ease memory pressure for worker nodes
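As a sketch of how the new option might be supplied when reading a CIM file through Spark's standard DataFrameReader interface (the HDFS path and the 64 MB value are illustrative assumptions, not tested defaults):

```scala
import org.apache.spark.sql.SparkSession

// assume an existing session; master, app name and jar setup omitted
val session = SparkSession.builder ().appName ("CIMReader").getOrCreate ()
val elements = session.read
    .format ("ch.ninecode.cim")
    // cap each Split at 64 MB so worker tasks hold less in memory at once
    .option ("ch.ninecode.cim.split_maxsize", (64L * 1024L * 1024L).toString)
    .load ("hdfs://sandbox:8020/data/sample.rdf") // hypothetical path
```

This is a configuration sketch only; it needs a running Spark cluster with the CIMReader jar on the classpath.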
CIMScala-2.11-2.0.1-1.9.0
Maintenance release for GridLAB-D work.
- fixes for Join and Topological Processing options to update the superclass RDDs of affected RDDs
- name topological islands by trafo low voltage pin
- when using option ch.ninecode.cim.do_topo_islands=true, an attempt is made to name the islands based on the transformer secondary pin (or, failing that, the topological node name)
- add checkpointing, optimize Graphx trace
- if checkpointing is enabled (that is, the Spark context CheckpointDir has been set) final RDDs will be checkpointed
- add Abgang (feeder) number, add mRID to classes not inheriting from IdentifiedObject
- when using option ch.ninecode.cim.make_edges=true, a description column containing the Abgang (feeder) number has been added to the generated edges RDD
- fixed a problem for DataFrames (and hence also for R data.frames) where objects not inheriting from IdentifiedObject had no primary key (mRID)
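To illustrate the checkpointing trigger described above, here is a minimal Scala sketch assuming the usual Spark entry points (the directory, path and option values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder ().appName ("CIMScala").getOrCreate ()
// setting the checkpoint directory is what enables checkpointing;
// without this call the final RDDs are not checkpointed
session.sparkContext.setCheckpointDir ("hdfs://sandbox:8020/checkpoint") // hypothetical directory
val elements = session.read
    .format ("ch.ninecode.cim")
    // request island naming by transformer secondary pin where possible
    .option ("ch.ninecode.cim.do_topo_islands", "true")
    .load ("hdfs://sandbox:8020/data/sample.rdf") // hypothetical path
```

Again a configuration sketch only; running it requires a Spark cluster with the CIMScala jar available.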
CIMScala-2.11-2.0.1-1.8.1
Fix warning and error messages when creating redges.RData.
Note
Existing R scripts work, but issue warning messages like so:
Warning message:
'sparkR.init' is deprecated.
Use 'sparkR.session' instead.
See help("Deprecated")
Warning message:
'sparkRSQL.init' is deprecated.
Use 'sparkR.session' instead.
See help("Deprecated")
Warning message:
'sql(sqlContext...)' is deprecated.
Use 'sql(sqlQuery)' instead.
See help("Deprecated")
It is possible to eliminate these messages using the script below, but testing this code against large data sets indicates severe memory issues.
So, at this time, we recommend using the same R script as was used with version 1.6.0, ignoring the warning messages, and not using the code below.
R code changes for Spark 2.0 (avoids warning messages):
# record the load time
begin = proc.time ()
# set up the Spark system
Sys.setenv (YARN_CONF_DIR="/spark/spark-2.0.2-bin-hadoop2.7/conf")
Sys.setenv (SPARK_HOME="spark/spark-2.0.2-bin-hadoop2.7")
library (SparkR, lib.loc = c (file.path (Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session ("spark://sandbox:7077", "Sample", sparkJars = c ("CIMScala-2.11-2.0.1-1.8.1.jar"), sparkEnvir = list (spark.driver.memory="1g", spark.executor.memory="4g", spark.serializer="org.apache.spark.serializer.KryoSerializer"))
# record the start time
pre = proc.time ()
# read the data file and process topologically and make the edge RDD
elements = sql ("create temporary view elements using ch.ninecode.cim options (path 'hdfs://sandbox:8020/data/NIS_CIM_Export_sias_current_20161220_V9.rdf', StorageLevel 'MEMORY_AND_DISK_SER', ch.ninecode.cim.make_edges 'true', ch.ninecode.cim.do_topo 'false', ch.ninecode.cim.do_topo_islands 'false')")
head (sql ("select * from elements")) # triggers evaluation
# record the end of the read phase
post = proc.time ()
# read the edges RDD as an R data frame
edges = sql ("select * from edges")
redges = SparkR::collect (edges, stringsAsFactors=FALSE)
# save the redges data frame
save ("redges", file="./NIS_CIM_Export_sias_current_20161220_V9")
finish = proc.time ()
# show timing
print (paste ("setup", as.numeric (pre[3] - begin[3])))
print (paste ("read", as.numeric (post[3] - pre[3])))
print (paste ("redges", as.numeric (finish[3] - post[3])))
# example to read an RDD directly
terminals = sql ("select * from Terminal")
rterminals = SparkR::collect (terminals, stringsAsFactors=FALSE)
# example to read a three-way join of RDDs
switches = sql ("select s.sup.sup.sup.sup.mRID mRID, s.sup.sup.sup.sup.aliasName aliasName, s.sup.sup.sup.sup.name name, s.sup.sup.sup.sup.description description, open, normalOpen no, l.CoordinateSystem cs, p.xPosition, p.yPosition from Switch s, Location l, PositionPoint p where s.sup.sup.sup.Location = l.sup.mRID and s.sup.sup.sup.Location = p.Location and p.sequenceNumber = 0")
rswitches = SparkR::collect (switches, stringsAsFactors=FALSE)
Timings on the NIS AWS cluster for this sequence of operations on an 8017082910 byte (roughly 7.5 GiB) RDF file are:
setup 3.089 seconds
read 27.636 seconds
redges 1296.595 seconds
CIMScala-2.11-2.0.1-1.8.0
Initial Spark 2.0.1 release.
- uses UDT (User Defined Type) hack
- rework class definitions
- CIMRelation not using HadoopFsRelation
- DefaultSource not using HadoopFsRelationProvider
- update Docker environment
CIMScala-2.10-1.6.0-1.7.2
Alter Edges creation to use the top level container (Substation or DistributionBox) where possible.
CIMScala-2.10-1.6.0-1.7.1
This is just a small update to revert to the original Edge schema when topological processing is not enabled,
dropping the columns related to topological islands.
CIMScala-2.10-1.6.0-1.7.0
This update includes many enhancements and extensions. Briefly, these are:
- support for multiple input files, specifically IS-U CIM files, and joining them
- topological processor, creating TopologicalNode and TopologicalIsland elements linked from ConnectivityNode
- added an MIT license to clarify the licensing status
- model package improvements
  - completed Wires
  - added Metering
  - added InfAssets
CIMScala-2.10-1.6.0-1.6.0
Update artifact naming to include:
- version of Scala (2.10) which is necessary for some Scala repositories
- version of Spark (1.6.0) which is the required target system
- version of CIMScala (1.6.0) which will change from release to release and follows semantic versioning
Note it is just coincidence that the CIMScala version is the same as the Spark version for this release.
CIMScala-2.10-1.4.1-0.6.0
Update artifact naming to include:
- version of Scala (2.10) which is necessary for some Scala repositories
- version of Spark (1.4.1) which is the required target system
- version of CIMScala (0.6.0) which will change from release to release and follows semantic versioning