Wednesday, 2022-01-19

*** tpb <[email protected]> has joined #yosys00:00
*** vidbina <[email protected]> has quit IRC (Ping timeout: 268 seconds)00:41
*** citypw <citypw!~citypw@gateway/tor-sasl/citypw> has joined #yosys01:28
*** bl0x_ <bl0x_!~bastii@p200300d7a7172200fba0d562efa7c9cd.dip0.t-ipconnect.de> has quit IRC (Ping timeout: 240 seconds)02:22
*** bl0x_ <bl0x_!~bastii@p200300d7a710cc00d7429a35fc1fea6a.dip0.t-ipconnect.de> has joined #yosys02:24
*** ec <ec!~ec@gateway/tor-sasl/ec> has quit IRC (Ping timeout: 276 seconds)02:51
*** tlwoerner <[email protected]> has quit IRC (Ping timeout: 240 seconds)05:46
*** tlwoerner <[email protected]> has joined #yosys05:52
*** tlwoerner <[email protected]> has quit IRC (Ping timeout: 268 seconds)06:04
*** tlwoerner <[email protected]> has joined #yosys06:28
*** FabM <FabM!~FabM@2a03:d604:103:600:87a3:5c19:7dbe:f486> has joined #yosys06:47
ikskuhhm, i figured out one hot path without reading the whole source code *grin*07:19
ikskuhyou shall not modulo arbitrary numbers07:19
ikskuhin synthesis07:19
*** vidbina <[email protected]> has joined #yosys08:34
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has joined #yosys09:29
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has quit IRC (Quit: Leaving)10:17
loftyikskuh: oh yeah, that's really something to avoid10:51
ikskuhyeah10:51
loftyIt might be feasible to do sequentially10:51
ikskuhstill impressed that i can run 16 million divs per second10:51
loftyBut, definitely not combinationally10:52
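A minimal Python sketch of what "sequentially" could mean here, assuming unsigned operands that fit in `width` bits. It models restoring division that produces one quotient bit per simulated clock cycle, so each cycle needs only one shift and one compare/subtract instead of the whole divider in a single combinational path. The function name and the 16-bit default are illustrative assumptions, not taken from ikskuh's design.

```python
def divmod_sequential(dividend, divisor, width=16):
    """Restoring division: one quotient bit per simulated clock cycle.

    Assumes unsigned operands and 0 <= dividend < 2**width.
    Returns (quotient, remainder, cycles_used).
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = 0
    quotient = 0
    cycles = 0
    for bit in range(width - 1, -1, -1):
        # One "clock cycle": shift in the next dividend bit ...
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # ... then a single compare/subtract decides this quotient bit.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        cycles += 1
    return quotient, remainder, cycles


print(divmod_sequential(1000, 7))
```

The trade-off is latency: a modulo takes `width` cycles instead of one, but the logic depth per cycle stays small, which is what determines the achievable clock frequency.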
ikskuhwhen i removed it, the synth/pnr jumped to 112 mhz10:53
ikskuhwhich is already way better10:53
ikskuhbut now i need to compile the nextpnr gui so i can figure out where the hotpath actually is10:53
ikskuhbecause i feel like it should be faster10:53
ikskuhmy goal is stable 160 MHz CPU clock with at least 20 MHz margin 10:54
tntis that on ecp5 ?10:59
ikskuhyeah11:00
ikskuhcurrent design without CPU synthesizes to roughly 202 MHz11:00
*** Lord_Nightmare <Lord_Nightmare!Lord_Night@user/lord-nightmare/x-3657113> has quit IRC (Ping timeout: 240 seconds)11:07
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has joined #yosys11:57
ikskuhthe gui is really slow /o\12:14
ikskuhare there requirements for the visualization? i might be able to improve the rendering a lot12:14
ikskuhis there a way to show the critical paths?12:24
ikskuhor reverse net/cell names back to their verilog lines?12:24
tnttry to guess based on the name of the nets ...12:26
ikskuhfrom to is both "posedge $glbnet$clk"12:28
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has quit IRC (Quit: Leaving)12:28
*** cr1901_ <cr1901_!~cr1901@2601:8d:8600:911:9d70:8f88:7606:6eed> has joined #yosys12:40
*** trabucay1e <[email protected]> has joined #yosys12:42
*** gatecat_ <[email protected]> has joined #yosys12:42
*** dnm_ <[email protected]> has joined #yosys12:42
*** Moe_Icenowy <Moe_Icenowy!~MoeIcenow@2604:a880:2:d1::1d1:f001> has joined #yosys12:42
*** unkraut_ <[email protected]> has joined #yosys12:44
*** gruetze_ <gruetze_!~quassel@wireguard/tunneler/gruetzkopf> has joined #yosys12:44
*** rektide_ <[email protected]> has joined #yosys12:45
*** jix__ <jix__!~jix@user/jix> has joined #yosys12:45
*** Kamilion|ZNC <Kamilion|[email protected]> has joined #yosys12:45
*** _whitelogger <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** Sarayan <Sarayan!~galibert@2a01:e0a:1d7:77e0:beae:c5ff:fee3:518f> has quit IRC (Ping timeout: 240 seconds)12:46
*** unkraut <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** jix_ <jix_!~jix@user/jix> has quit IRC (Ping timeout: 240 seconds)12:46
*** rektide <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** gatecat <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** dnm <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** cr1901 <cr1901!~cr1901@2601:8d:8600:911:51a1:26ce:1709:19d8> has quit IRC (Ping timeout: 240 seconds)12:46
*** Max-P <Max-P!thelounge@thelounge/maintainer/Max-P> has quit IRC (Ping timeout: 240 seconds)12:46
*** gruetzkopf <gruetzkopf!~quassel@wireguard/tunneler/gruetzkopf> has quit IRC (Ping timeout: 240 seconds)12:46
*** MoeIcenowy <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** trabucayre <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** tux3 <tux3!~tux3@user/tux3> has quit IRC (Ping timeout: 240 seconds)12:46
*** Kamilion <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** oldtopman <[email protected]> has quit IRC (Ping timeout: 240 seconds)12:46
*** gatecat_ is now known as gatecat12:46
*** _whitelogger <[email protected]> has joined #yosys12:46
*** dnm_ is now known as dnm12:46
*** oldtopman <[email protected]> has joined #yosys12:46
*** tux3 <[email protected]> has joined #yosys12:46
*** Kamilion|ZNC is now known as Kamilion12:47
*** rektide_ <[email protected]> has quit IRC (Ping timeout: 256 seconds)12:50
*** rektide <[email protected]> has joined #yosys12:51
*** gruetze_ is now known as gruetzkopf13:00
*** trabucay1e is now known as trabucayre13:03
*** Sarayan <Sarayan!~galibert@2a01:e0a:1d7:77e0:beae:c5ff:fee3:518f> has joined #yosys13:05
ikskuhhttps://bpa.st/RW3Q13:21
tpbTitle: View paste RW3Q (at bpa.st)13:21
ikskuhcould someone take a quick peek at this? It's a human readable display of the critical paths13:21
ikskuhevery line with a "!" is where the delay is more than the budget, and the cells in the path are displayed afterwards 13:22
ikskuhto me it looks like the "cpu halted" is delaying everything?13:22
tntwell kind of hard to say without seeing the sources ...13:26
ikskuhoh, sure13:27
ikskuhhttps://git.random-projects.net/ashet/mini/src/branch/master/src13:27
tpbTitle: ashet/mini - mini - Random Projects: Code for the masses (at git.random-projects.net)13:27
ikskuhspu.v is the problematic file13:28
ikskuhif i remove the instance of the spu_core module, i'm at ~200 MHz13:28
*** cr1901_ is now known as cr190113:48
*** Lord_Nightmare <Lord_Nightmare!Lord_Night@user/lord-nightmare/x-3657113> has joined #yosys14:26
*** ec <ec!~ec@gateway/tor-sasl/ec> has joined #yosys15:29
*** citypw <citypw!~citypw@gateway/tor-sasl/citypw> has quit IRC (Ping timeout: 276 seconds)15:35
*** cr1901 <cr1901!~cr1901@2601:8d:8600:911:9d70:8f88:7606:6eed> has quit IRC (Ping timeout: 240 seconds)15:41
*** Moe_Icenowy is now known as MoeIcenowy15:41
loftyikskuh: this state machine is pretty big16:09
loftyOh, is this the entire CPU as a state machine?16:10
ikskuhyes16:33
ikskuh"smol" cpu16:33
ikskuhwell, it actually is16:34
ikskuh32 instructions and only some transitions before/after16:34
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has joined #yosys16:45
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has quit IRC (Client Quit)16:46
loftyikskuh: well, CPUs as state machines tend not to go so well16:49
loftyThere's a lot of things that Yosys has to merge together when it's probably not necessary16:49
*** cr1901 <cr1901!~cr1901@2601:8d:8600:911:c11c:2a92:dcdc:271a> has joined #yosys16:52
loftyFor example the ALU carry things16:52
*** gsmecher <[email protected]> has joined #yosys16:57
*** ec <ec!~ec@gateway/tor-sasl/ec> has quit IRC (Ping timeout: 276 seconds)17:18
*** ec <ec!~ec@gateway/tor-sasl/ec> has joined #yosys17:19
*** vidbina <[email protected]> has quit IRC (Ping timeout: 268 seconds)17:29
*** peepsalot <peepsalot!~peepsalot@openscad/peepsalot> has quit IRC (Quit: Connection reset by peep)17:38
*** peepsalot <peepsalot!~peepsalot@openscad/peepsalot> has joined #yosys17:40
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has joined #yosys18:24
*** sagar_acharya <sagar_acharya!~sagar_ach@2405:201:f:1db9:4a96:8154:92bb:7691> has quit IRC (Quit: Leaving)18:38
ikskuhlofty: how do i do the CPU then instead?18:42
ikskuhi only know the state machine way18:43
loftyikskuh: a 3-stage pipeline should be doable18:43
ikskuhhm?18:43
loftypipelined CPU.18:43
ikskuhdoesn't work18:43
ikskuhinstructions are 100% dependent on each other18:44
loftydo you think they weren't on older RISC CPUs?18:44
lambdadoes your CPU not execute instructions in order?18:45
ikskuhlofty, so how does it work then? 18:45
ikskuhlambda: i have a stack machine18:45
ikskuhthat means op 1 has to be 100% completed before op 2 can fetch data18:45
lofty*or* that op 1 can feed its data into op 218:45
ikskuhright, but the memory write still has to happen18:46
loftyAnd it will18:46
ikskuhbut then i don't understand what you mean18:46
* ikskuh is still very much an FPGA noob18:46
*** truc is now known as bjonnh18:47
loftyImagine, fetch/execute/writeback18:47
ikskuhyeah, i have done that as a state machine right now18:47
lambdaikskuh: I think for the vast majority of instructions you don't need to fully execute them in order to know that the next instruction will be at pc+118:47
lambdaso you can already fetch the next instruction while the current one is being executed18:47
ikskuhlambda: i can't, i'm already at 100% memory pressure18:48
loftyI think on a 3-stage pipeline you can resolve the next PC immediately though, right?18:48
ikskuhkinda.18:48
loftyEither a) split instruction and data buses18:48
loftyor b) caches18:48
loftyWhich I suppose is a way of achieving a18:49
ikskuhokay, so if i split them and have an instruction cache18:49
Sarayanif it's a stack machine it should already be split shouldn't it?18:49
ikskuhit isn't, it's a von neumann18:49
ikskuhbut i still don't understand what you mean with pipeline exactly18:49
ikskuhdo i chain stuff together as fifos?18:50
ikskuh*as=>with18:50
loftyNo18:50
loftyJust flops18:50
ikskuhhm, okay18:50
loftyokay, let's temporarily put aside the stack machine-ness to look from an ivory tower18:50
loftyYou need an ALU - this ALU can do things like shift, add, subtract, etc18:51
ikskuhright18:51
loftyThen you need something to program this ALU to perform an operation18:51
loftyThis is your fetch stage18:51
ikskuhis memory read/write performed by alu?18:51
Sarayanno18:51
Sarayanmemory access is another logical block18:52
ikskuhokay18:52
loftyYou're turning the 3 stage pipeline into a 4-stage pipeline, Sarayan :P18:52
Sarayanalu doesn't care where what comes in comes from, out goes18:52
loftyBut maybe 4 stages is easy to explain18:52
Sarayanlofty: then I fold it back, you'll see ;-)18:52
ikskuhokay18:53
ikskuhso i have a thing that fetches the instruction itself18:53
ikskuhwhich stage/unit fetches the instruction inputs?18:53
Sarayanthe #1 trick for a fast stack machine being not to have a real stack in the first place, but independent registers for the top and filling/spilling as needed18:53
ikskuhSarayan: that's an optimization for future days18:54
loftyunfortunately it isn't.18:54
ikskuhwhich i have planned already, but i don't wanna do now for simplicity's sake18:54
lofty"where do your instruction inputs come from" - from the register file18:54
loftynot from memory18:54
loftyYou're at 100% memory utilisation, this is why :P18:54
ikskuhwell18:55
ikskuhthat is all not a problem atm18:55
ikskuhi know that these are architectural/high level design problems18:55
ikskuhbut right now i can live with raw uncached memory access18:55
loftyYou came to the channel asking for help with this, did you not?18:55
lofty"why is my CPU slow" because at a high level, it is a state machine18:56
loftyAnd state machines do not fit FPGAs well because all the logic is happening at once18:56
ikskuhyes, exactly. but i don't see how a "register file" (which i don't know what it is) can help18:56
loftyThis is why multiplication, division and modulo are slow18:56
Sarayanwell, at higher level your cpu is slow because it does way too many memory accesses and memory accesses are slow18:56
ikskuh↑ i know this part18:56
loftyIf you put the top of the stack in a small memory of its own18:56
loftya lot of the operations will use this18:57
loftyMaybe even the top N of the stack18:57
loftyAnd then you can avoid memory operations18:57
ikskuhyes, but this will make the *implementation* itself slower18:57
loftyAnd you can speed your CPU up by *already having the data*18:57
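A rough Python model of the "top N of the stack in registers" idea, assuming a simple spill-on-push and fill-on-pop policy. The class name, the policy, and the access counter are illustrative assumptions. The model only counts how many times the backing memory is touched and says nothing about timing.

```python
class CachedStack:
    """Stack whose top entries live in registers; deeper entries spill to memory."""

    def __init__(self, num_regs=2):
        if num_regs < 1:
            raise ValueError("need at least one top-of-stack register")
        self.num_regs = num_regs
        self.regs = []         # top of stack is the last element
        self.memory = []       # spilled entries, deepest first
        self.mem_accesses = 0  # number of reads/writes to the backing memory

    def push(self, value):
        if len(self.regs) == self.num_regs:
            # Registers full: spill the deepest register to memory.
            self.memory.append(self.regs.pop(0))
            self.mem_accesses += 1
        self.regs.append(value)

    def pop(self):
        if not self.regs:
            if not self.memory:
                raise IndexError("pop from empty stack")
            # Registers empty: fill one entry back from memory.
            self.regs.append(self.memory.pop())
            self.mem_accesses += 1
        return self.regs.pop()


stack = CachedStack(num_regs=2)
stack.push(3)
stack.push(4)
stack.push(stack.pop() + stack.pop())  # an "add" instruction
print(stack.pop(), stack.mem_accesses)
```

With two registers the add above never touches memory, whereas a stack kept entirely in memory would need five accesses for the same sequence (three pushes and two pops, not counting the final pop).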
ikskuhas i need more combinatoric logic18:57
loftyNo, you need *less*18:57
ikskuhhuh?18:58
Sarayanthe amount is not what sets the speed, it's the depth18:58
loftyYou can already load data from the stack, can you not?18:58
ikskuhSarayan: yes, exactly18:58
ikskuhthat's why i was thinking it's slower as i need more decisions18:58
Sarayanso more logic is not necessarily slower if it's more parallel logic18:58
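Sarayan's point about depth versus amount can be illustrated with a small Python sketch, assuming idealised two-input operators that each cost one logic level. Both arrangements below use the same number of operators (n - 1), but the number of levels a signal has to pass through differs. The function names are made up for this illustration.

```python
def chain_depth(n_inputs):
    """Logic levels when n inputs are combined one after another."""
    return max(n_inputs - 1, 0)


def tree_depth(n_inputs):
    """Logic levels when the same inputs are combined as a balanced binary tree.

    Equals ceil(log2(n_inputs)), computed with integers only.
    """
    return (n_inputs - 1).bit_length() if n_inputs > 1 else 0


for n in (2, 8, 32):
    print(n, chain_depth(n), tree_depth(n))
```

For 32 inputs the chain is 31 levels deep while the tree is 5 levels deep, even though both contain 31 operators. The clock period is limited by the deepest path, not by the operator count.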
loftySo, if you are spending less time fetching operands you need fewer cycles18:58
ikskuhlofty: cycles aren't my problem18:59
loftyAnd thus your CPU executes the same instruction stream faster18:59
*** indy <[email protected]> has quit IRC (Quit: ZNC 1.8.2 - https://znc.in)18:59
Sarayanbut in any case if your cpi goes from 4 to 1.mumble even if your cycle gets slower it doesn't matter18:59
Sarayan(4 = read instruction, read two values, write one)19:00
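The CPI argument is plain arithmetic: instruction throughput is clock frequency divided by cycles per instruction. A small Python sketch with hypothetical numbers: the 160 MHz figure is ikskuh's stated goal and CPI 4 is Sarayan's estimate, while the 120 MHz and CPI 1.5 values are invented here purely for comparison.

```python
def instructions_per_second(clock_hz, cycles_per_instruction):
    """Throughput of a CPU that needs a fixed average number of cycles per instruction."""
    return clock_hz / cycles_per_instruction


fast_clock_high_cpi = instructions_per_second(160e6, 4)    # 4 bus cycles per instruction
slow_clock_low_cpi = instructions_per_second(120e6, 1.5)   # hypothetical design with fewer memory accesses

print(fast_clock_high_cpi, slow_clock_low_cpi)
```

Under these assumed numbers the slower-clocked design executes twice as many instructions per second (80 million versus 40 million).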
lofty"cycles aren't my problem" <-- then why come to us for help speeding up your CPU if it doesn't matter?19:00
ikskuhi don't want "less cycles", but "faster cycles"19:00
ikskuhi want to have a clk of 160 MHz 19:00
loftyThat's why you have the pipeline!19:00
ikskuhdue to other parts in the design19:00
ikskuhand a central memory bu19:00
ikskuh*bus19:01
Sarayanok19:01
ikskuhso if my cpu accesses this bus, it needs to react in 1 clk19:01
Sarayanhow are your instructions encoded?19:01
ikskuh1 u16 for "all information", then up to two immediates or stack operations as input0 and input119:01
loftyYou are talking to us about how the number of decisions causes your CPU to be slow, right?19:01
loftyYou are trying to decide everything at once per cycle19:02
loftyThe solution is therefore to not decide everything at once19:02
Sarayanso you have 4 cycles per instruction pretty much always?19:02
ikskuhSarayan: kinda, yeah. except for memory ops (ld8, ld16, st8, st16) and mul/div/mod19:03
ikskuhinstruction encoding: https://ashet.computer/specs/spu-mark-ii/#instruction-encoding19:03
tpbTitle: SPU Mark II - Ashet Home Computer (at ashet.computer)19:03
loftyYou don't need a mod instruction, but anyway19:03
ikskuhlofty: so one solution would be to have "more, but dumber steps" in the state machine?19:04
loftyHaving more steps means more decisions19:04
loftyAnd having fewer steps means more decisions done per cycle19:04
Sarayanwell, the #1 question is do you have prefetching?19:04
ikskuhno19:04
loftyfundamentally, you cannot resolve your problem within the framework of a finite state machine19:04
loftyand so you must leave it.19:05
ikskuhlofty: i got that part, but i don't understand HOW to leave19:05
ikskuhthis is something i've never done19:05
loftyCan you draw a graph of the steps of your state machine19:05
ikskuhi'll try19:05
loftythe goal is that a chain of steps within your state machine becomes a pipeline19:06
Sarayanif you don't have prefetching you have an extremely hot path where in the cycle the memory must return the instruction read result, you must decode the instruction and push the next address on the bus19:06
Sarayanthat's a lot of computation which starts with waiting for the memory to answer19:07
Sarayanyou really need to separate fetch from decode19:07
loftyHere's the teaching example I normally go for: washing batches of laundry. You have a washing machine and a tumble dryer.19:08
loftyWhat you are doing here is taking the time to get a batch of laundry, then standing around waiting for it to wash, then standing around waiting for it to dry, then hanging it up19:08
loftywhereas you could be simultaneously getting the laundry while one batch washes and another batch dries19:09
loftyThat is, fundamentally, a pipeline19:09
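The laundry analogy can be put into numbers with a short Python sketch. The stage durations are invented for illustration, and the pipelined formula assumes identical batches that may wait between stages: the first batch passes through every stage, and each later batch finishes one slowest-stage interval after the previous one.

```python
def sequential_time(batches, stage_times):
    """Total time when each batch finishes all stages before the next one starts."""
    return batches * sum(stage_times)


def pipelined_time(batches, stage_times):
    """Total time when stages work on different batches at the same time."""
    if batches == 0:
        return 0
    return sum(stage_times) + (batches - 1) * max(stage_times)


stages = [10, 40, 60]  # hypothetical minutes: collect, wash, dry
print(sequential_time(4, stages), pipelined_time(4, stages))
```

For four batches this gives 440 minutes one at a time versus 290 minutes pipelined. Each individual batch still takes 110 minutes, so the pipeline improves throughput rather than latency.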
loftyWhere you have a state machine that has the state for steps A, B, C, D and executes them in the order A -> B -> C -> D, then what Yosys will do is execute A, B, C, and D simultaneously and then decide the result19:10
loftyWhich is slower than doing A, B, C, and D simultaneously and unconditionally19:11
ikskuhhttps://mq32.de/public/e78287e99aa734c4db3c791095273fe7532dd8fc.png19:11
ikskuhstate machine19:11
loftyI'm rambling, sorry.19:11
Sarayanotoh, pipelines are rather hard with stack machines19:11
Sarayanbut prefetching isn't, and it's going to help a lot19:12
loftySarayan: this state machine looks very pipeline-y to me19:12
ikskuhlofty: but each part of the pipeline will only activate every nth cycle, right?19:12
loftyFor your case, yes19:13
Sarayansomething easy to try, instead of executing the instructions in the order "read instruction, read param1, read param2, write result" try "read param1, read param2, read next instruction, write result", with an initial read instruction at reset and on jumps of course19:13
Sarayanif you manage it your hot path should be way less hot19:13
Sarayanbecause 1/ you can compute the result to be written while you're reading the next instruction19:14
ikskuhSarayan: problem is: i left interrupt handling out /o\19:14
ikskuhbut let's ignore that for now19:14
Sarayan2/ you can decode the instruction while writing the result19:14
Sarayanthese are two hot paths fetch to decode and compute to write19:15
Sarayanthat way you split them over two cycles19:15
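A sketch in Python of the reordered schedule Sarayan describes, listing for each cycle the bus operation and the internal work that can overlap with it. The separate decode cycle after the initial fetch is an assumption added here so the first instruction has somewhere to be decoded. The labels are illustrative and do not come from spu.v.

```python
def prefetch_schedule(n_instructions):
    """Return a list of (bus operation, overlapped internal work), one entry per cycle."""
    cycles = [
        ("fetch instr 0", "none (after reset or jump)"),
        ("bus idle", "decode instr 0"),  # assumed start-up cycle
    ]
    for i in range(n_instructions):
        cycles += [
            (f"read param1 of instr {i}", "none"),
            (f"read param2 of instr {i}", "none"),
            # The ALU result is computed while the bus fetches the next instruction.
            (f"fetch instr {i + 1}", f"compute result of instr {i}"),
            # The next instruction is decoded while the bus writes the result.
            (f"write result of instr {i}", f"decode instr {i + 1}"),
        ]
    return cycles


for bus_op, overlapped in prefetch_schedule(2):
    print(f"{bus_op:28} | {overlapped}")
```

In steady state this is still four bus cycles per instruction, but compute and decode each get a full cycle of their own instead of sharing one with a memory access.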
ikskuhwhat does "decoding" mean exactly?19:15
ikskuhthe instruction is just a bunch of bit fields19:16
ikskuhdetermining what each part of the "pipeline" does19:16
Sarayandecoding for instance is computing where to read param119:16
ikskuhah, so "look at the instruction word"19:17
Sarayanyou have the whole "write result" cycle to compute that address rather than having to do it immediately19:17
ikskuhlet me think about that 19:18
ikskuhand if i don't have "write result", i just stall then?19:18
Sarayanyeah19:19
*** Max-P <Max-P!Max-P@thelounge/maintainer/Max-P> has joined #yosys19:19
ikskuhhm19:19
ikskuhbut only for one clk, right?19:19
Sarayanyep19:19
ikskuhand if i actually have to do memory access, i'll "stall" for "how long that memory access takes"19:20
Sarayanyour memory is slower than 160MHz?19:20
ikskuhyes, i have an external 16 MB SPI flash and 16 MB SDRAM19:20
Sarayanouch, you really really really want some kind of cache for instructions and stack :-)19:21
ikskuhyes, i know19:21
ikskuhbut those are secondary problems19:22
ikskuhi wanna do everything step by step19:22
Sarayantry prefetching, it should have an interesting impact19:22
ikskuhand my current design allows me to tack caches for dram, flash, ... :)19:22
ikskuhyeah, i think i got the rough idea now19:22
ikskuhthis will take a while to get right19:22
ikskuhbut one very good thing: i already have a behaviour test suite19:23
ikskuhso i can see if i break things on the way19:23
Sarayanthat's useful19:23
ikskuhor if everything still works as intended19:23
ikskuhbut i think i will try prefetching and "pipelining"19:24
ikskuhalso separate instruction and data bus for separate caches19:24
Sarayanpipelining is hard to get right with a stack architecture19:24
ikskuhyeah, but the general idea of "not doing a state machine"19:24
Sarayanhonestly it's all a state machine in the end19:25
ikskuhright19:25
ikskuhbut trying to do more in parallel19:25
ikskuhand not cramming everything into a single process19:25
Sarayanyes, that's the important part19:25
ikskuhthis will take a day to see what i can all do in parallel19:25
Sarayannot having long chains of dependencies in the same clock19:25
ikskuhbut the idea of running "add" in parallel to "sub" and "shift" is the right idea, yeah?19:26
Sarayanif you have the data to run them with in parallel19:27
ikskuhalu only has two inputs "input0" and "input1"19:27
ikskuhand will output a single value which you can inspect for flags and/or push19:27
ikskuhso i have one step "add input0, input1, store output", one "select the output from the alu", one "push the output" parallel to "set the flags", right? and while i compute the output of the alu, i can fetch and decode the next instruction19:29
ikskuhdid i get this right?19:34
Sarayanyou fetch it while you compute the output of the alu, you decode it while you write the output of the alu19:35
Sarayanyou don't want to do fetch and decode in the same cycle, it's too much19:36
ikskuhokay19:38
ikskuhso "way more registers" :D19:38
Sarayanof course19:42
*** tlwoerner <[email protected]> has quit IRC (Remote host closed the connection)20:03
*** tlwoerner <[email protected]> has joined #yosys20:04
*** FabM <FabM!~FabM@armadeus/team/FabM> has quit IRC (Ping timeout: 268 seconds)20:07
*** ec <ec!~ec@gateway/tor-sasl/ec> has quit IRC (Remote host closed the connection)20:10
*** ec <ec!~ec@gateway/tor-sasl/ec> has joined #yosys20:11
*** ec <ec!~ec@gateway/tor-sasl/ec> has quit IRC (Ping timeout: 276 seconds)22:52
