Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing Widths from Execution Instrumentation #114

Open
2 tasks
chick opened this issue Oct 25, 2017 · 52 comments
Open
2 tasks

Changing Widths from Execution Instrumentation #114

chick opened this issue Oct 25, 2017 · 52 comments
Assignees

Comments

@chick
Copy link
Contributor

chick commented Oct 25, 2017

This project aims to apply information found using the firrtl interpreter's run-time instrumentation of values passing through a node.

  • Run a firrtl circuit with instrumentation turned on
  • This will produce a CSV file of all values passed through every node
  • Parse this CSV file and find wires who's range should be changed (presumably made smaller)
  • Generate annotations that reflect these changes
  • Apply these changes using a Firrtl Transformation.

STILL TO DO:

  • changing wires that are in de-duped modules needs to un-de-dupe the module.
  • ChangeWidth should not include memories and black boxes in things changed
@chick
Copy link
Contributor Author

chick commented Oct 25, 2017

@azidar I am a little worried about ports and nodes in sub-modules. My worries specifically:

  • I am likely to find different changes for the same field when it appears in different instantiations of sub-modules.
    • The annotations Named does not have the specificity to handle fields in specific instances
    • I'll try to ignore this as long as I can but I am not sure how to handle it.
    • Or maybe the analyzer has to match up fields from different instances and only attempt to make changes that apply to all occurrences
  • Assuming submodule specific changes are allowed.
    • An edge case is that changing the same field in different instances would force splitting a submodule into become separate instances. Maybe there's code to do this sort of thing in existing passes or transforms.

@shunshou
Copy link
Member

One other thing I've thought about -- it'd be good if you could optimize a particular subcircuit and print out some kind of annotation list so that the circuit doesn't need to be reoptimized in a larger circuit. This is all food for thought and doesn't need to be implemented now. For demonstration purposes, running with a single module is good enough... But for long-term viability, it's important to consider sub-modules.

@seldridge
Copy link
Member

You are correct with target: Named in that this is going to have module collisions in the circuit. My understanding is that the correct approach is to then use the value: String attribute to store whatever else you need, i.e., the the full instance path. The tool can then emit annotations that cram this metadata into the value attribute. For the final pass, you'd then need to differentiate based on the value attribute and not just on the target.

For splitting an instance, this may actually be relatively straightforward:

  1. Generate a non-colliding name for the new module
  2. Copy the module into the circuit with the new name
  3. Replace the module referred to by that specific (or set of specific) DefInstances

I'm not aware of anything that currently does this, but Dedup is basically the inverse. That would likely be a good place to start.

@shunshou
Copy link
Member

shunshou commented Oct 25, 2017

@chick I think this is instructive. I've updated the ranges branch of Chisel3DSPDependencies to all of the latest branches I'm using. If you get the chance, you should try to pull from FFTGen and run:

sbt "test-only dsptools.intervals.tests.IAArithSpec"

As an example:

dut.t1                                     sint<13>,  227136,    -760,             584,

Simple range inference seems to expect a 13-wide SInt, but then tests (exhaustive inputs, unless I screwed up) indicate only 11 bits are needed. Now I need to see what went wrong. :( But Definitely as a starting point, it'd be good to also report which nodes are inefficiently allocated.

Also, stddev of 0 is... hmm...

key,                                          type,       n,     min,             max,            mean,          stddev,    bin0,    bin1,    bin2,    bin3
dut._T_15                                   sint<5>,  227136,       4,               6,         5.00000,         0.00000,       0,       0,  227136,       0
dut._T_16                                  sint<21>,  227136,  -35923,            3757,    -15555.00000,         0.03879,       0,  221648,    5488,       0
dut._T_18                                  sint<10>,  227136,      -6,              32,        12.50000,         0.00004,       0,   18928,  208208,       0
dut._T_20                                  sint<18>,  227135,  -36947,           36525,     -2755.03318,         0.05751,       0,  132818,   94317,       0
dut._T_22                                   uint<1>,  227149,       0,               1,         0.57145,         0.00000,   97345,  129804,       0,       0
dut._T_24                                  sint<26>,  227137,-2364608,         2337600,   -165003.07643,         2.78309,       0,  173241,   53896,       0
dut._T_26                                   uint<1>,  227149,       0,               1,         0.57145,         0.00000,   97345,  129804,       0,       0
dut._T_28                                  sint<23>,  227137, -295576,          292200,    -14349.66914,         0.35000,       0,  173241,   53896,       0
dut._T_30                                   uint<1>,  227149,       0,               1,         0.14285,         0.00000,  194701,   32448,       0,       0
dut._T_32                                  sint<23>,  227149, -295576,          292200,      3873.39003,         0.18002,       0,   18974,  208175,       0
dut._T_34                                   uint<1>,  227149,       0,               1,         0.42855,         0.00000,  129804,   97345,       0,       0
dut._T_37                                  sint<33>,  227147,-302669824,       299212800, -21656368.74038,       308.44167,       0,  186726,   40421,       0
dut._T_39                                   uint<1>,  227149,       0,               1,         0.28570,         0.00000,  162252,   64897,       0,       0
dut._T_42                                  sint<33>,  227147,-302669824,       299212800,   1041885.37558,       260.26485,       0,   37948,  189199,       0
dut._T_44                                   uint<1>,  227149,       0,               1,         0.85715,         0.00000,   32448,  194701,       0,       0
dut._T_47                                  sint<23>,  227135, -295576,          292200,    -18160.24835,         0.42804,       0,  113844,  113291,       0
dut.io_a                                    sint<8>,  227136,     -95,              73,       -11.00000,         0.00021,   41664,   86016,   86016,   13440
dut.io_b                                    sint<5>,  227136,       5,               8,         6.50000,         0.00000,       0,       0,  170352,   56784
dut.io_c                                    sint<4>,  227136,      -6,              -4,        -5.00000,         0.00000,  151424,   75712,       0,       0
dut.io_d                                    sint<3>,  227136,      -4,               3,        -0.50000,         0.00001,   56784,   56784,   56784,   56784
dut.io_m1                                  sint<15>,  227137,   -9237,            9131,      -644.85688,         0.01087,     352,  172889,   53736,     160
dut.io_m2                                  sint<15>,  227137,   -9237,            9131,      -448.57002,         0.01094,     352,  172889,   53736,     160
dut.io_m3                                  sint<15>,  227149,   -9237,            9131,       121.00773,         0.00563,      88,   18886,  208135,      40
dut.io_m4                                  sint<15>,  227147,   -9237,            9131,      -661.00706,         0.00941,     264,  186462,   40301,     120
dut.io_m5                                  sint<15>,  227147,   -9237,            9131,        31.72439,         0.00794,     176,   37772,  189119,      80
dut.io_m6                                  sint<15>,  227135,   -9237,            9131,      -567.72205,         0.01338,     528,  113316,  113051,     240
dut.io_sel                                  sint<4>,  227136,      -2,               4,         1.00000,         0.00001,       0,   64896,  129792,   32448
dut.sel                                     sint<4>,  227136,      -2,               4,         1.00000,         0.00001,       0,   64896,  129792,   32448
dut.t1                                     sint<13>,  227136,    -760,             584,       -71.50000,         0.00142,       0,  127680,   99456,       0
dut.t2                                      sint<7>,  227136,     -18,              24,         2.50000,         0.00005,       0,   85176,  141960,       0
dut.t3                                     sint<14>,  227136,    -665,             511,       -60.50000,         0.00121,       0,  127680,   99456,       0
dut.t4                                      sint<8>,  227136,     -13,              32,         9.00000,         0.00005,       0,   56784,  170352,       0
dut.t5                                     sint<15>,  227136,    -537,             703,        99.50000,         0.00121,       0,   87248,  139888,       0
dut.t6                                      sint<9>,  227136,     -10,              28,         8.50000,         0.00004,       0,   47320,  179816,       0
dut.t7                                     sint<17>,  227136,  -35923,            3757,    -15555.00000,         0.03879,    3472,  218176,    5488,       0
dut.t8                                      sint<7>,  227136,      -6,              32,        12.50000,         0.00004,       0,   18928,  205842,    2366
dut.t9                                     sint<23>,  227135, -295576,          292200,    -22040.26543,         0.46012,       0,  132818,   94317,       0
io_a                                        sint<8>,  113568,     -95,              73,       -11.00000,         0.00043,   20832,   43008,   43008,    6720
io_b                                        uint<4>,  113568,       5,               8,         6.50000,         0.00001,       0,   85176,   28392,       0
io_c                                        sint<4>,  113568,      -6,              -4,        -5.00000,         0.00001,   75712,   37856,       0,       0
io_d                                        sint<3>,  113568,      -4,               3,        -0.50000,         0.00002,   28392,   28392,   28392,   28392
io_m1                                      sint<15>,  227137,   -9237,            9131,      -644.85688,         0.01087,     352,  172889,   53736,     160
io_m2                                      sint<15>,  227137,   -9237,            9131,      -448.57002,         0.01094,     352,  172889,   53736,     160
io_m3                                      sint<15>,  227149,   -9237,            9131,       121.00773,         0.00563,      88,   18886,  208135,      40
io_m4                                      sint<15>,  227147,   -9237,            9131,      -661.00706,         0.00941,     264,  186462,   40301,     120
io_m5                                      sint<15>,  227147,   -9237,            9131,        31.72439,         0.00794,     176,   37772,  189119,      80
io_m6                                      sint<15>,  227135,   -9237,            9131,      -567.72205,         0.01338,     528,  113316,  113051,     240
io_sel                                      sint<4>,  113568,      -2,               4,         1.00000,         0.00002,       0,   32448,   64896,   16224
reset                                       uint<1>,       4,       0,               1,         0.50000,         0.12500,       2,       2,       0,       0

@shunshou
Copy link
Member

shunshou commented Oct 25, 2017

@chick @azidar -- this is actually SUPER useful to sanity check firrtl range inference stuff, esp if you can add another column that just says UNDERCONSTRAINED or something when ti's width is larger than required from simulation.
ti = a * b

-->
a -2.96875, 2.28125 (bp = 5)
b 5, 8 (bp = 0)
t1 = -23.75, 18.25
Shift up to get SInt representation i.e. t1 * (1 << 5)
-760, 584
Which matches firrtl interpreter -- I don't understand why the type is reported as SInt<13> then :(... The width propagation stuff looked right.

@shunshou
Copy link
Member

@azidar Here's my CHIRRTL. If you can make sense of why the constraint solver expects t1 to be 13 bits rather than 11, that'd be great! t1, t2 constraints should definitely match firrtl-interpreter results. t3, t4 shouldn't since there's some cancellation... I'll follow up with the range/width I expect from every intermediate node, given Interval analysis.

Circuit:
circuit TestModule : @[:@2.0]
  module IAArith : @[:@3.2]
    input clock : Clock @[:@4.4]
    input reset : UInt<1> @[:@5.4]
    output io : { flip a : Interval[-2.96875, 2.28125].5, flip b : Interval[5, 8].0, flip c : Interval[-6, -4].0, flip d : Interval[-4, 3].0, flip sel : Interval[-2, 4].0, m1 : Interval.5, m2 : Interval.5, m3 : Interval.5, m4 : Interval.5, m5 : Interval.5, m6 : Interval.5} @[:@6.4]
  
    clock is invalid @[:@8.4]
    reset is invalid @[:@9.4]
    io is invalid @[:@10.4]
    reg sel : Interval[-2, 4].0, clock with :
      reset => (UInt<1>("h0"), sel) @[IAArith.scala 41:20:@11.4]
    sel <= io.sel @[IAArith.scala 41:20:@12.4]
    node t1 = mul(io.a, io.b) @[IAArith.scala 43:17:@13.4]
    node t2 = mul(io.c, io.d) @[IAArith.scala 44:17:@14.4]
    node t3 = sub(t1, io.a) @[IAArith.scala 46:15:@15.4]
    node t4 = add(t2, io.b) @[IAArith.scala 47:15:@16.4]
    node _T_15 = sub(asInterval(UInt<1>("h0"), 0, 0, 0), io.c) @[IAArith.scala 49:18:@17.4]
    node t5 = add(t3, _T_15) @[IAArith.scala 49:15:@18.4]
    node t6 = add(t4, io.d) @[IAArith.scala 50:15:@19.4]
    node _T_16 = add(t5, asInterval(UInt<16>("hb6cd"), -18739, -18739, 10)) @[IAArith.scala 52:23:@20.4]
    reg t7 : Interval, clock with :
      reset => (UInt<1>("h0"), t7) @[IAArith.scala 52:19:@21.4]
    t7 <= _T_16 @[IAArith.scala 52:19:@22.4]
    node _T_18 = sub(t6, asInterval(UInt<3>("h4"), -4, -4, 0)) @[IAArith.scala 53:23:@23.4]
    reg t8 : Interval, clock with :
      reset => (UInt<1>("h0"), t8) @[IAArith.scala 53:19:@24.4]
    t8 <= _T_18 @[IAArith.scala 53:19:@25.4]
    node _T_20 = add(t7, t8) @[IAArith.scala 55:16:@26.4]
    node t9 = mul(_T_20, asInterval(UInt<5>("h8"), 8, 8, 0)) @[IAArith.scala 55:22:@27.4]
    node _T_22 = lt(sel, asInterval(UInt<7>("h28"), 40, 40, 5)) @[IAArith.scala 57:20:@28.4]
    node _T_24 = mux(_T_22, t9, asInterval(UInt<19>("h5b666"), -149914, -149914, 13)) @[IAArith.scala 57:15:@29.4]
    io.m1 <= _T_24 @[IAArith.scala 57:9:@30.4]
    node _T_26 = leq(sel, asInterval(UInt<2>("h1"), 1, 1, 0)) @[IAArith.scala 58:20:@31.4]
    node _T_28 = mux(_T_26, t9, asInterval(UInt<3>("h4"), -4, -4, 0)) @[IAArith.scala 58:15:@32.4]
    io.m2 <= _T_28 @[IAArith.scala 58:9:@33.4]
    node _T_30 = eq(sel, asInterval(UInt<3>("h2"), 2, 2, 0)) @[IAArith.scala 59:20:@34.4]
    node _T_32 = mux(_T_30, t9, asInterval(UInt<5>("h8"), 8, 8, 0)) @[IAArith.scala 59:15:@35.4]
    io.m3 <= _T_32 @[IAArith.scala 59:9:@36.4]
    node _T_34 = geq(sel, asInterval(UInt<3>("h2"), 2, 2, 0)) @[IAArith.scala 60:20:@37.4]
    node _T_37 = mux(_T_34, t9, asInterval(UInt<26>("h2c00000"), -20971520, -20971520, 20)) @[IAArith.scala 60:15:@38.4]
    io.m4 <= _T_37 @[IAArith.scala 60:9:@39.4]
    node _T_39 = gt(sel, asInterval(UInt<3>("h2"), 2, 2, 0)) @[IAArith.scala 61:20:@40.4]
    node _T_42 = mux(_T_39, t9, asInterval(UInt<25>("ha00000"), 10485760, 10485760, 20)) @[IAArith.scala 61:15:@41.4]
    io.m5 <= _T_42 @[IAArith.scala 61:9:@42.4]
    node _T_44 = neq(sel, asInterval(UInt<3>("h2"), 2, 2, 0)) @[IAArith.scala 62:20:@43.4]
    node _T_47 = mux(_T_44, t9, asInterval(UInt<4>("h5"), 5, 5, 0)) @[IAArith.scala 62:15:@44.4]
    io.m6 <= _T_47 @[IAArith.scala 62:9:@45.4]

  module TestModule : @[:@47.2]
    input clock : Clock @[:@48.4]
    input reset : UInt<1> @[:@49.4]
    output io : { flip a : Fixed<8><<5>>, flip b : UInt<4>, flip c : SInt<4>, flip d : SInt<3>, flip sel : SInt<4>, m1 : Fixed<69><<5>>, m2 : Fixed<69><<5>>, m3 : Fixed<69><<5>>, m4 : Fixed<69><<5>>, m5 : Fixed<69><<5>>, m6 : Fixed<69><<5>>} @[:@50.4]
  
    clock is invalid @[:@52.4]
    reset is invalid @[:@53.4]
    io is invalid @[:@54.4]
    inst dut of IAArith @[TestModule.scala 21:19:@55.4]
    dut.io is invalid @[:@56.4]
    dut.clock <= clock @[:@57.4]
    dut.reset <= reset @[:@58.4]
    dut.clock <= clock @[TestModule.scala 25:31:@59.4]
    dut.reset <= reset @[TestModule.scala 27:31:@60.4]
    node _T_24 = asFixedPoint(dut.io.m6, 5) @[Utils.scala 108:37:@61.4]
    io.m6 <= _T_24 @[Utils.scala 108:15:@62.4]
    node _T_25 = asFixedPoint(dut.io.m5, 5) @[Utils.scala 108:37:@63.4]
    io.m5 <= _T_25 @[Utils.scala 108:15:@64.4]
    node _T_26 = asFixedPoint(dut.io.m4, 5) @[Utils.scala 108:37:@65.4]
    io.m4 <= _T_26 @[Utils.scala 108:15:@66.4]
    node _T_27 = asFixedPoint(dut.io.m3, 5) @[Utils.scala 108:37:@67.4]
    io.m3 <= _T_27 @[Utils.scala 108:15:@68.4]
    node _T_28 = asFixedPoint(dut.io.m2, 5) @[Utils.scala 108:37:@69.4]
    io.m2 <= _T_28 @[Utils.scala 108:15:@70.4]
    node _T_29 = asFixedPoint(dut.io.m1, 5) @[Utils.scala 108:37:@71.4]
    io.m1 <= _T_29 @[Utils.scala 108:15:@72.4]
    node _T_30 = asInterval(io.sel, -2, 4, 0) @[Utils.scala 100:35:@73.4]
    dut.io.sel <= _T_30 @[Utils.scala 100:16:@74.4]
    node _T_31 = asInterval(io.d, -4, 3, 0) @[Utils.scala 100:35:@75.4]
    dut.io.d <= _T_31 @[Utils.scala 100:16:@76.4]
    node _T_32 = asInterval(io.c, -6, -4, 0) @[Utils.scala 100:35:@77.4]
    dut.io.c <= _T_32 @[Utils.scala 100:16:@78.4]
    node _T_34 = cat(UInt<1>("h0"), io.b) @[Cat.scala 30:58:@79.4]
    node _T_35 = asInterval(_T_34, 5, 8, 0) @[Utils.scala 96:50:@80.4]
    dut.io.b <= _T_35 @[Utils.scala 96:16:@81.4]
    node _T_36 = asInterval(io.a, -95, 73, 5) @[Utils.scala 104:35:@82.4]
    dut.io.a <= _T_36 @[Utils.scala 104:16:@83.4]

chick added a commit that referenced this issue Oct 26, 2017
Working transform that will adjust widths of
registers, ports and wires through the annotation system.
Second piece of the augmented tool chain that will ultimately
take advantage of firrt interpreters instrumentation output.
Adjusting widths according to data gathered thereby.
Part of Issue #114
chick added a commit that referenced this issue Oct 26, 2017
Working transform that will adjust widths of
registers, ports and wires through the annotation system.
Second piece of the augmented tool chain that will ultimately
take advantage of firrt interpreters instrumentation output.
Adjusting widths according to data gathered thereby.
Part of Issue #114
@chick
Copy link
Contributor Author

chick commented Oct 26, 2017

Good news is my BitReducer caught it:

...
Creating annotation "dut.t1=11"
...
Instrumentation Manager
Change annotations created           17
Total bits removed                   44
Total bits considered               566
Precentage                       7.7739 %

chick added a commit that referenced this issue Oct 26, 2017
With this we now have most of the machinery for Issue #114
@chick
Copy link
Contributor Author

chick commented Oct 26, 2017

Current intervals-oct now has new tools

  • BitReducer: parses the output of the interpreter's instrumentation and creates width changing annotations
  • ChangeWidthTransform: applies aforementioned annotations to a low firrtl file

There's more to do. Not sure how to prioritize.

  • Make better bit reduction calculations
    • based min, max on some sigma based tolerance 2σ, 3σ, etc.
  • make some hacky toolchain that
    • calls interpreter with instrumentation
    • calls BitReducer
    • calls ChangeWidth transform on lofirrtl
    • Does the following (what's highest priority)
      • runs interpreter again
      • runs verilator
      • dumps saves lofirrtl file somewhere.
  • Something else I haven't thought of

@shunshou
Copy link
Member

Sorry, was out today. :x

So currently you

  1. Annotate which widths need to be shrunk from interpreter results.
  2. Run the ChangeWidthTransform to actually shrink the widths [right now based on min/max?]

You're asking if

  1. You should prioritize some kind of switch between min/max and sigma + clipping tolerance?
  2. I don't understand what part of the "hacky" toolchain doesn't already exist? You're saying if you want to iterate several times to potentially get better widths? I assume you can just go through Firrtl like normal after ChangeWidthTransform and generate Verilog?

I think being able to switch between min/max and sigma + clip would be more interesting to me personally. If you have time tomorrow, we can Google Hangouts or something to talk more, but even with just this, I'm super excited! :D

@shunshou
Copy link
Member

@chick I got rid of import com.sun.tools.javac.resources.compiler. Intellij said it wasn't used, and I can't find a tools.jar :(

@chick
Copy link
Contributor Author

chick commented Oct 28, 2017

Sorry, I was out most of today too. I think IntelliJ auto generated that import for me at some point.
Per your previous comment.
I was concerned that to get this all to work, there would have to be some manual working or scripting to get these pieces all to play together. I'm not sure without more coding that the width adjusted firrtl file could be re-run with the DspTester.
But if a sigma clipper is more valuable I'll work on that next.

@shunshou
Copy link
Member

Yeah no worries. I made like 2 comments on your commits? From looking at them, I understand that you have all of the pieces but haven't actually done the whole width adjusted firrtl + DspTester bit? Maybe that's the highest priority. Sigma clipper is cute and useful, but that can wait... Running into some other problems with interval assumptions, so thinking about what would be best to have...

@chick
Copy link
Contributor Author

chick commented Oct 28, 2017

Yes, I will hack out something that will do following:

  1. Run interpreter with instrumentation flags
  2. Run BitReducer saving annotations
  3. Run ChangeWidthTransfrom using annotations from BitReducer
  4. Run either verilog or interpreter using modified low Firrtl file.

Sound right? This will just be some big ugly hacked version of Dsptools Driver and iotesters Driver.

@shunshou
Copy link
Member

Yeah, ok... well, hacked plumbing is better than no plumbing. I hand-calculated optimal bitwidths for my circuit and am double checking that ranges actually generated the right widths (not accounting for stuff like (b * a) - a having some cancellations -- which only interpreter can detect).

chick added a commit that referenced this issue Oct 30, 2017
See InstrumentingSpec:"run with with bits reduced" for example of use

New executeWithBitReduction acts like ordinary dsptools.Driver.execute

Adds in the following

- Run with interpreter bit instrumentation
- Analyzes report, creating annotations to reduce size
- Runs transform to reduce bits in low Firrtl
- Re-runs the tests re-instrumenting

Produces files

- <dut-name>.signal-bitsizes.csv
- <dut-name>.signal-bitsizes-2.csv
- <dut-name>.bit-reduced.fir
chick added a commit that referenced this issue Oct 30, 2017
- Fixed changes to sizes in sub-module being lost
- Added warnings if annotation to change wire not used or used more than once
- Added a executeWithBitReduction to IAArithSpec, it works
- Fix spelling in createAnnotationIfAppropritate
@chick
Copy link
Contributor Author

chick commented Oct 30, 2017

@shunshou I just pushed a working example in
dsptools/intervals/IAArith.scala
It seems to work as I would expect, I believe module de-duplication is broken.

@shunshou
Copy link
Member

I assume that's a firrtl bug? What does that mean? I'm having problems, for example, getting MatMul to work with DspReal. Everything looks OK, but the data isn't getting fed in. Probably something missing on my end (with my custom TestModule), but I have no idea what...

@chick
Copy link
Contributor Author

chick commented Oct 31, 2017

It's my bug (or missing feature). I should have been more specific, I meant that if the bit reducer finds signals who's width should change but not by the same amount and that are the same signal but in different instances of the same module, the ChangeWidth transform will not work right.
It does not sound related to what you are experiencing

@shunshou
Copy link
Member

shunshou commented Nov 1, 2017

@chick do you think you can add a quick n sigma (or 3 sigma for now) thing to the resizer? Basically, I just realized, to be exhaustive for even a 4x4 matrix multiply, you'd need something like 8 ^ 16 (assuming 8 bit input) inputs, so I think the "best bet" is to sample some 1000 inputs or so and 3sigma the bitwidths...

@chick
Copy link
Contributor Author

chick commented Nov 1, 2017

@shunshou Sure, I need to finish something up right now. Should have it in an hour or two.

@shunshou
Copy link
Member

shunshou commented Nov 1, 2017

:) Still debugging DspReal bit so whenever is fine.

@chick
Copy link
Contributor Author

chick commented Nov 1, 2017

@shunshou What is the proscribed behavior if you say reduce to 3σ and that number is
greater than the largest number seen. My inclination is to take min(mean + 3σ, maxSeen)
but I am not sure. Thoughts? (same question for SInt minima)

@shunshou
Copy link
Member

shunshou commented Nov 2, 2017

So rather than taking min(mean + 3sigma, maxSeen), you'd want to start at the inputs and go to the outputs, replacing the wider bitwidth nodes i.e. say a = UInt<30> with narrower widths i.e. say a_3sigma = a.clip(interval_range_equivalent_to_8-bit) assuming a_3sigma is UInt<8>, which is 3 sigma out.

Also fyi solved the DspReal problem. Chisel didn't seem to catch a binding error or the tester didn't catch that I was poking an output. Didn't delve deep enough to figure out.

@chick
Copy link
Contributor Author

chick commented Nov 2, 2017

hmm, can i do this all at one time or do I have to identify the earliest nodes in the dependency wrap their assignment with a clip of the determined size, ( I think by low Firrtl there can only be one (like highlander)). Seems like I might want to re-run the instrumentation/bit-reduction again, seems likely the clipping could change the σ of its dependents. This could take a while, if it's the right way

@shunshou
Copy link
Member

shunshou commented Nov 2, 2017

So if you don't start at the beginning, the width might actually propagate and get unnecessarily large in previous nodes before it's trimmed at a given node. Maybe the clipping behavior is better discussed in a meeting w/ Paul and co... Let's just call that a stretch goal...

@shunshou
Copy link
Member

shunshou commented Nov 5, 2017

Dumb question: What happens if 3sigma is further out than the calculate range inference?

@chick
Copy link
Contributor Author

chick commented Nov 5, 2017

Currently I take the min of the 3σ max and the maxSeen. and vice versa on the low.
But I think you are right we should review the sigma option on Monday

@shunshou
Copy link
Member

shunshou commented Nov 6, 2017

How many iterations are needed to converge? Also, in the end, I need an output Verilog file... After it thinks its done, it should rerun tests with Verilator + generate Verilog.

@shunshou
Copy link
Member

shunshou commented Nov 6, 2017

@chick check out sbt "test-only dsptools.toys.DCTMatMulSpec"
I get errors like

info] firrtl.passes.CheckWidths$BitsWidthException:  @[IntervalTypeClass.scala 188:35:@8528.4]: [module MatrixOp] High bit 15 in bits operator is larger than input width 14 in bits(mul(_T_11269, _T_11271), 15, 0).

When using the reduction pass.

(It's in the MatrixOps.scala file)

Other thing:

After bitwidth reduction, generate Verilog + run through Verilator.
Allow user option to just increase bitwidth by say like 2 bits over min/max seen i.e. guard bits.

@chick
Copy link
Contributor Author

chick commented Nov 7, 2017

I'm struggling with getting the verilog to run after the firrtl's been bit reduced.
I have a local version with your fudge factor bits.
I think I understand the error you got. Downstream from signals I have changed there are Bit selection operators that are not adjusted.
I hope to talk to Adam at BRWC to decide what I can do about this.

@chick
Copy link
Contributor Author

chick commented Nov 7, 2017

Based on discussion with @azidar today.
Changing width pass must be more complex. Looks like it should

  1. Infer Types
  2. Nodes -> Wires
  3. Trim widths
  4. Resolve Bits with extending MSB
  5. Optimize reconfigured bits

@shunshou
Copy link
Member

shunshou commented Nov 9, 2017

@chick any luck? :o

@chick
Copy link
Contributor Author

chick commented Nov 9, 2017

Just pushed a new version that seem to run ok with you DCT example. There seems to be some magic operating. When I added a couple of passes that Adam recommended everything seemed to start working, even though there is one more pass that I am supposed to implement.

Also, I had to disable the updating of top level IO's because i could not get it to run with verilator when i did that. It may yet be fixable but I don't know specifically the mechanism of the failure.

BTW, seems to not run if bins set to zero. I'm going to look into that right now.

@shunshou
Copy link
Member

Cool. Yeah, the TestModule IO's don't need to be updated.

@chick
Copy link
Contributor Author

chick commented Nov 14, 2017

I have not figured out the broken problem but I have a new strategy suggested by Paul
Given a wire

wire dog : SInt<20>
dog <= add(cat, emu)

rather than changing dog instead we inject additional code along the lines of

wire dog : SInt<20>
wire dog_reduced : SInt<18>

dog_reduced <= clip(add(cat, emu), dog_reduced)
dog_msb <= head(dog, 1)

dog <= cat(dog_msb, cat(do_msb, dog_reduced))
  • This doesn't actually reduce the wires but both Paul and Steve feel that these steps would lead to fewer wires at synthesis.
  • This should simplify the ChangeWidthPass, it would no longer need to try and fix up the bit-level operators
  • What is the cost of introducing the clips. These are not free and maybe there's some way to calculate the break even point of wire reduction vs the wire expansion created by clips.

@shunshou @azidar Any thoughts? I think I will go ahead with this as an experiment as I am a bit stuck at the moment figuring out what's going wrong with the SystolicMatMul

@grebe did I miss anything here.

@azidar
Copy link
Contributor

azidar commented Nov 14, 2017

Why are you clipping?

More specifically, what are the semantics of the bit-reducer transform when given an out-of-bound value? Is this treated as "user error" or should the circuit act gracefully by clipping? If its the former, then you shouldn't clip (and that's what I thought we were doing).

The solution I'll outline below only fixes where an trimmed wire is referenced, not where it is declared. I think both will work. However, both solutions need to correctly handle bits, head, and tail.

To synchronize everyone, here is my solution. For example:

wire x: SInt<4>
wire y: UInt<3>
y <= bits(x, 3, 1)

First, we trim x to 3 bits and y to 2 bits:

wire x: SInt<3>
wire y: UInt<2>
y <= bits(x, 3, 1)

However, now the bits operator references non-existent bits. Thus, we need to do something more complicated, namely return x's msb since its an SInt.

wire x: SInt<3>
wire y: UInt<2>
y <= bits(cat(head(x, 1), x), 3, 1)

Finally, we need to trim the assignment to y:

wire x: SInt<3>
wire y: UInt<2>
y <= bits(bits(cat(head(x, 1), x), 3, 1), 1, 0)

Now we have a correct FIRRTL circuit, but its unoptimized. Namely, we could simplify the last expression to:

wire x: SInt<3>
wire y: UInt<2>
y <= bits(x, 2, 1)

However, this optimization IS NOT ALWAYS VALID. Thus, if we just jump from the first step to the last step, in some cases it will not be correct.

Thus, you need to do each of these transformations in sequence, without shortcuts. The transformation steps are as follows:

  1. Trim all wire/input bits according to the interpreter results
  2. Sign-extend or zero extend all inputs to bits() which are too small
  3. Pad the assignments (this is also probably optional for now, as our VerilogCompiler does this automatically)
  4. Optimize if you can (this is probably optional for now, I imagine the CAD tools can do this for us).

@azidar
Copy link
Contributor

azidar commented Nov 14, 2017

I'd say do your solution first - if we see a positive synthesis result, then we can call it a day. Otherwise we will need to do the optimization ourselves.

@chick
Copy link
Contributor Author

chick commented Nov 14, 2017

I put clip in there because I was also considering the use of n·σ bit reductions, where we know that we will be pushing in values that are bigger than the bits we have re-allocated.
The advantage of the Paul proposal is I don't have to worry about bit operators or any other downstream ops. Do you have specific objections to it.
OOPS, just saw you comment. I will try Paul's as least amount of work and see where we go from there.

@azidar
Copy link
Contributor

azidar commented Nov 14, 2017

Clip will (at best) add a mux for every trimmed wire, which could theoretically cause a HUGE explosion in area costs. I'm also a lot less certain that a CAD tool can optimize around the mux. For now, I would trim with bits instead of clip. Then, if that works, we can try the n·σ bit reductions with a clip.

@shunshou
Copy link
Member

shunshou commented Nov 14, 2017 via email

@chick
Copy link
Contributor Author

chick commented Nov 14, 2017

Not going to do clips at this time!

@chick
Copy link
Contributor Author

chick commented Nov 15, 2017

@azidar
Here's what I am doing and code runs, but sadly, I am getting the same errors as before (looks like sign problems)
For wires and registers I am constructing the following

    def decreaseTypeWidth(originalType: Type, delta: Int): Type = {
      originalType match {
        case SIntType(IntWidth(oldWidth)) => SIntType(IntWidth(oldWidth - delta))
        case UIntType(IntWidth(oldWidth)) => UIntType(IntWidth(oldWidth - delta))
        case other                        => other
      }
    }
      def signExtend(numberToDo: Int, firstArg: Expression, lastArg: Expression, tpe: Type): Expression = {
        if(numberToDo <= 1) {
          DoPrim(Cat, Seq(firstArg, lastArg), Seq(), tpe)
        }
        else {
          DoPrim(
            Cat,
            Seq(firstArg, signExtend(numberToDo - 1, firstArg, lastArg, tpe)),
            Seq(),
            decreaseTypeWidth(tpe, delta = numberToDo - 1)
          )
        }
      }

      def constructSmallerIntermediates(
                                         wire: Statement with IsDeclaration,
                                         tpe: Type,
                                         changeRequest: ChangeRequest
                                       ): Block = {
        val reduced = DefWire(wire.info, wire.name + "__reduced", changeTpe(tpe, changeRequest))
        val msb     = DefWire(wire.info, wire.name + "__msb",  UIntType(IntWidth(1)))

        val extendedSign = signExtend(
          (typeToWidth(tpe) - changeRequest.newWidth).toInt,
          WRef(msb),
          DoPrim(AsUInt, Seq(WRef(reduced)), Seq(), UIntType(IntWidth(changeRequest.newWidth))),
          tpe
        )

        Block(Seq(
          wire,
          reduced,
          msb,
          Connect(
            wire.info, WRef(msb),
            tpe match {
              case _: UIntType => UIntLiteral(BigInt(0), IntWidth(1))
              case _: SIntType => DoPrim(Head, Seq(WRef(reduced)), Seq(BigInt(1)), UIntType(IntWidth(1)))
            }
          ),
          Connect(
            wire.info, WRef(wire.name, tpe, WireKind),
            tpe match {
              case _: UIntType =>
                extendedSign
              case _: SIntType =>
                DoPrim(AsSInt, Seq(
                  extendedSign
                ), Seq(), tpe)
            }
          )
        ))
      }

      def changeWidthsInStatement(statement: Statement): Statement = {
        val resultStatement = statement map changeWidthsInStatement map changeWidthsInExpression
        resultStatement match {
          case connect: Connect =>
            changeRequests.get(expand(connect.loc.serialize)) match {
              case Some(changeRequest) =>
                val newLoc = connect.loc match {
                  case w: WRef => w.copy(name = w.name + "__reduced")
                  case s       => s
                }
                // logger.info(s"Changing:Connect ${register.name} new width ${changeRequest.newWidth}")
                Block(Seq(
                  connect.copy(loc = newLoc)
                ))
              case _ => connect
            }
          case register: DefRegister =>
            changeRequests.get(expand(register.name)) match {
              case Some(changeRequest) =>
                constructSmallerIntermediates(register, register.tpe, changeRequest)

                // logger.info(s"Changing:DefReg ${register.name} new width ${changeRequest.newWidth}")
              case _ => register
            }
          case wire: DefWire =>
            changeRequests.get(expand(wire.name)) match {
              case Some(changeRequest) =>
                constructSmallerIntermediates(wire, wire.tpe, changeRequest)
              // logger.info(s"Changing:DefReg ${wire.name} new width ${changeRequest.newWidth}")
              case _ => wire
            }

and

          case connect: Connect =>
            changeRequests.get(expand(connect.loc.serialize)) match {
              case Some(changeRequest) =>
                val newLoc = connect.loc match {
                  case w: WRef => w.copy(name = w.name + "__reduced")
                  case s       => s
                }
...

Am I missing the sign bit (off-by-one error?) or can you see anything else that looks like I might be wrong. Is there are more idiomatic or elegant way of constructing the sign extension

@azidar
Copy link
Contributor

azidar commented Nov 15, 2017

Shouldn't signExtend do nothing if nothingToDo == 0, not sign extend it? I'd also not worry about setting the types correctly, and just run InferTypes at the end.

Also, in the following line, why are you adding 2?

typeToWidth(tpe) - changeRequest.newWidth).toInt + 2,

@chick
Copy link
Contributor Author

chick commented Nov 16, 2017

@shunshou @azidar I have not been able to construct a test yet, but I have convinced myself that the errors I get when I run Angie's tests is do to different instantiations of the same module getting different bit reductions for wires and registers within the different instances. I realized this might be a problem earlier, but I kind of forgot about it.
What the heck is the way to annotate module for no de-dup. Doesn't David Biancolin use this.

@chick
Copy link
Contributor Author

chick commented Nov 16, 2017

@shunshou @azidar Adding NoDedup per chisel3 chiselTests/AnnotationNoDedup.scala I was able to run two of Angie's tests

test-only dsptools.toys.SystolicDCTMatMulSpec -- -z UNFILTERED
test-only dsptools.toys.SystolicDCTMatMulSpec -- -z "RANDOM FILTERED"

I'll try more soon. The changes for this are beginning to sprawl. Needs some hours of cleanup to tie everything up.
Sadly, I think a lot of my earlier code was mostly working but also getting bit by the lack of NoDedup.
We will need to run this new version through synthesis to see if this current bit-reduction technique is actually saving us anything.
Besides going through more examples what's the next best thing to work on.

@shunshou
Copy link
Member

shunshou commented Nov 16, 2017 via email

@chick
Copy link
Contributor Author

chick commented Nov 16, 2017

I pushed a version that the aforementioned tests run on. The DCT tests still fail but it's because of some verilog error.

Module.v:59228: Illegal bit or array select; type does not have a bit range, or bad dimension: type is logic

That line of the verilog file is

assign _T_6493__msb = _T_6493__reduced[0:0];

That's probably all I am going to get done tonight, I'll get back on that horse in the morning

@chick
Copy link
Contributor Author

chick commented Nov 16, 2017

@shunshou
My foot got caught in the stirrup attempting to dismount. Looks like the verilog error occurred when I reduced a signal down to 1 bit. I did a quick hack to not reduce anything to below 4 bits and the following tests ran

sbt 'test-only dsptools.toys.DCTMatMulSpec -- -z "RANDOM FILTERED"' > DCT.RF.fudge0
sbt 'test-only dsptools.toys.DCTMatMulSpec -- -z "UNFILTERED"' > DCT.UNF.fudge0

Heading off-line.
If you want to add any more tests that use submodules you have to look at test/scala/toys/SystolicMatMul.scala to see how I prevented de-duplication.

@shunshou
Copy link
Member

shunshou commented Nov 16, 2017 via email

@shunshou
Copy link
Member

shunshou commented Nov 16, 2017 via email

@shunshou
Copy link
Member

shunshou commented Nov 16, 2017 via email

chick added a commit that referenced this issue Sep 4, 2018
It monitors and bins signal values

Update node widths with Firrtl Transform
Working transform that will adjust widths of
registers, ports and wires through the annotation system.
Second piece of the augmented tool chain that will ultimately
take advantage of firrt interpreters instrumentation output.
Adjusting widths according to data gathered thereby.
Part of Issue #114

Update node widths with Firrtl Transform
Working transform that will adjust widths of
registers, ports and wires through the annotation system.
Second piece of the augmented tool chain that will ultimately
take advantage of firrt interpreters instrumentation output.
Adjusting widths according to data gathered thereby.
Part of Issue #114

A little bit of cleanup

Working on bit reduction calculator
BitReducer does simple min/max to determine new bit size
Can be made more advanced.

Added writer to write bit changing annotations to a file
With this we now have most of the machinery for Issue #114

remove tools.jar dependency

fix context vs. not inconsistencies in type classes. add support for conversion to interval.

add preliminary support for interval type class

forgot to fix some context ops in intervals

Minor changes

fix syntax errors for compilation

few minor syntax errors in intervaltype. add clip op to type classes.

change 1 << n to BigInt(1) << n

fixedprecisionchangerspec checks out

migrate dsp tools/interval tests from fft project

symlink local tests

dsptools.math was conflicting with scala.math

macros need to be enabled for chiselName. also, another scala.math fix

scala strassen winograd matrix multiplication

remove commented out typeclass stuff (to be revisited)

lots of prep work to get matrix ops working. still debugging...

minor update

save minor changes for debug

save temporary progress. too many wires...

pipelining works -- hardcoded

This creates a toolchain that unites the machinery for Issue #114

See InstrumentingSpec:"run with with bits reduced" for example of use

New executeWithBitReduction acts like ordinary dsptools.Driver.execute

Adds in the following

- Run with interpreter bit instrumentation
- Analyzes report, creating annotations to reduce size
- Runs transform to reduce bits in low Firrtl
- Re-runs the tests re-instrumenting

Produces files

- <dut-name>.signal-bitsizes.csv
- <dut-name>.signal-bitsizes-2.csv
- <dut-name>.bit-reduced.fir

debugged matrix lit not working -- breeze orders things differently

Some fixes for Issue #114

- Fixed changes to sizes in sub-module being lost
- Added warnings if annotation to change wire not used or used more than once
- Added a executeWithBitReduction to IAArithSpec, it works
- Fix spelling in createAnnotationIfAppropritate

adam fixed firrtl bug -- interval 4x4 and 8x8 work now

minor changes. trying to debug dspreal

fixed dspreal bug with matrix ops. dspreal direction wasn't getting propagated properly to TestModule, and poking a non-input didn't result in compile-time failure (only observed in sims)

separate out matmul tests

changes for dct

Added bit reduction by standard deviation.
Now you can specify a multiplier of the standard deviation.
The bit reduction will apply to the min and the max
being determined by the mean ± (multiplier * σ)
if these numbers exceed the min and max seen, then the more
restrictive limit will apply.
More testing required

Undo accidental changes

added systolic array matmul. dct constraint bug...

updates for getting random inputs working; bit reduction errors out

In process refactoring of change widths.

In process refactoring of change widths.

filtering should work now. waiting on bit reduction

Added ToWorkingIr and InferWidths to ChangeWidthTransform
Changed to not change width's of IO's, could not get verilator to
run when I did that.
Seems to fix Angie's problems

ChangeWidthTransform broke because
DefInstance became WDefInstance because of ToWorkingIR

Fix Bit operation errors
Add in a few more passes to try and fix up
Bit prim ops whose args are out of syncs with bit reduced signals

A bit of style and dead code cleanup for previous commit

more benchmarks

bump numtests

change tolerance for 8-bit

changed bitwidths. interpreter stuff not saving.

start working on fir filter example. tested real/fixed conversions.

fir working; needs cleanup

working fir example

start working on clicking when limiting to n*sigma

Added some BitWidth convenience tools
Added HTML BitWitdh report to BitReducer
Changed default binning to 16
Added UNFILTERED to two test names
to make it easier to run just that one.

minor code cleanup for fir demo

prep for interpreter tests

New strategy for reducing wires, does not actually reduce them
but creates shadow reduced wire with name <old-name>__reduced
and assigns to original wire with sign or zero extension
 <old-name>__msb joined by recursive cat operators.

Added NODedupAnnotator to SystolicMatMul, this fixes blow ups
due to a given wire getting two different bit reductions for different
instances it appears in

Right now I am not reducing IO's in submodules, doing so breaks IO
naming because of the __reduced.

Added UNFILTERED to two test names so I could select them easier in
sbt

Added some incomplete tests of things as I was tracking down bugs

Don't reduce anything to less than 4 bits.
Hack fix for now for verilog complaining about something
that was reduced to 1 bit

made some minor changes; looks like bit reduction still doesn't work. see help files in top-level dir

minor changes

use different multiply alg in matrixops

Auto pad mixed radix representation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants