Tuesday, 2022-06-21

*** tpb <[email protected]> has joined #openrisc00:00
shornezx2c4: alright, I might have a fix. I saw the stack trace you had too, and it was always the same10:58
shorneit seems that it explained the whole thing, we were getting timer interrupts during cpu icache invalidation (which is done when user space executables are loaded)10:59
zx2c4oooo!10:59
zx2c4nice!!10:59
shorneopenrisc was not protecting against interrupts coming in during icache flushing10:59
shorneso if we got a timer during icache flushing it caused some kind of nested interrupts that would lock things up11:00
shornei.e. while loading a program, it gets interrupted and might reschedule another user program11:00
shorneanyway, posting a patch you might want to try, I will do some of my testing as well11:01
shornethere might be more I can do for the icache flushing, but this basic patch seems to work11:01
zx2c4that makes sense11:05
zx2c4ill give it a spin11:05
zx2c4lemme know when youve pushed to your branches and ill try qemu+linux11:05
shornethe kernel fix is here: https://github.com/stffrdhrn/linux/commits/or1k-virt11:06
shorneno changes on qemu since last I posted or1k-virt-2 branch11:07
zx2c4(compiling...)11:12
zx2c4so far, still seems overwhelmingly slow11:14
zx2c4ah and i just got the lockup11:14
shorneah.. must be more, is it in userspace?11:15
zx2c4shorne: https://א.cc/8ie4vC0F11:15
shorneor during boot and selftests?11:15
zx2c4userspace11:15
zx2c4I picked `    openrisc: cache: Disable preemption when flushing icache pages`11:15
zx2c4this is rc3 + `    irqchip: or1k-pic: Undefine mask_ack for level triggered hardware` + `    openrisc: mm: Add support for multiple tlb ways` + `    openrisc: cache: Disable preemption when flushing icache pages`11:16
shorneyeah, I think probably the issue is a bit higher11:16
shorneI mean I also need to disable preemption on the main cpu scheduling the IPIs to invalidate the icache11:17
shorneI need to be able to run this same test11:17
zx2c4`git clone -b jd/openrisc https://git.zx2c4.com/linux-rng`11:17
shorneI was planning to run the glibc test suite, but need to get the environment up again, but if you have something easy I can try yours11:18
zx2c4and then run11:18
zx2c4ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j1611:18
zx2c4here's whats in that branch https://git.zx2c4.com/linux-rng/log/?h=jd/openrisc11:18
tpbTitle: linux-rng - Development tree for the kernel CSPRNG (at git.zx2c4.com)11:18
zx2c4successive runs of `ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16` will skip building the kernel if nothing wireguard-related changes, so you might want to prefix that with `touch drivers/char/wireguard/device.c` or something11:19
zx2c4it uses qemu-system-or1k from PATH, so adjust your PATH to have your qemu build directory in front11:20
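[editor's sketch] The reproduction steps zx2c4 describes above can be collected into one small script. This is only a sketch under stated assumptions, not a script from either participant: `QEMU_BUILD_DIR` is a hypothetical location for a local qemu build, `run` and `repro` are illustrative helper names, and the touched file is `drivers/net/wireguard/device.c` on the assumption that the branch keeps wireguard where mainline does (the log says `drivers/char/wireguard/device.c` "or something"). By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the repro recipe from the log. Prints commands unless DRY_RUN=0.

QEMU_BUILD_DIR="${QEMU_BUILD_DIR:-$HOME/qemu/build}"  # hypothetical path, adjust
JOBS="${JOBS:-16}"                                    # -j16 as used in the log

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

repro() {
    # Fetch the branch that carries the or1k selftest changes.
    run git clone -b jd/openrisc https://git.zx2c4.com/linux-rng &&
    run cd linux-rng &&
    # Touch a wireguard source file so the selftest rebuilds the kernel
    # (path assumed to match mainline, see the note above).
    run touch drivers/net/wireguard/device.c &&
    # Put the local qemu build first in PATH so its qemu-system-or1k is used.
    run env ARCH=or1k PATH="$QEMU_BUILD_DIR:$PATH" \
        make -C tools/testing/selftests/wireguard/qemu -j"$JOBS"
}

repro
```

Running it as-is prints the four commands; after sourcing it, `DRY_RUN=0 repro` would actually execute them.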
shorneok, this is cool11:21
shornelet me give it a spin11:21
shornesorry, helping put kids to sleep now...11:21
zx2c4i wonder if the irq masking is broken?11:33
zx2c4im adding local_irq_save()/restore() inside of ipi_icache_page_inv, and i still get the same issue11:33
zx2c4i guess its tricky because this smp on_each_cpu stuff works with an interrupt11:34
zx2c4actually, i dunno if the timer interrupt is actually what's causing this. i suspect that's just what you see when RCU stalls, because that's how RCU knows it's stalled...11:36
shorneI think we need to put the protection inside of smp_icache_page_inv, but then in there we can invalidate the current CPU cache first, then invalidate other CPUs11:59
shornewe probably have to protect while waiting on smp_icache_page_inv to finish, and on each cpu11:59
shornethat's what I was thinking of doing, but then I didn't have a great test case and the patch I posted seemed to help12:00
shornezx2c4: FYI, I will be moving to london in about 2 months12:01
zx2c4oh cool12:01
zx2c4short distance away!12:01
zx2c4the title of your blog will change i guess12:01
shornehaha, yeah I will need to write some new entries for it12:31
shorneI was planning on writing about qemu stuff12:31
shornezx2c4: are you using the toolchain downloaded from selftests? I am getting linker issues with it12:59
zx2c4yes. no modifications12:59
zx2c4can i see your output?12:59
zx2c4did you do `make mrproper` midway through the build process perchance?12:59
zx2c4run13:00
zx2c4ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 clean13:00
zx2c4followed by a fresh13:00
zx2c4ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j1613:00
zx2c4shorne: ^13:00
shornehttps://gist.github.com/e8456835b6eef8055dc695cbf5a42d3c13:00
zx2c4you ran `make mrproper` :)13:01
zx2c4run `ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 clean` and then try again13:01
shorneok13:01
shornemakes sense13:01
shornethats where all those .o files went :)13:02
zx2c4yulp13:02
shorneits running now13:08
shornegot some stalls13:11
zx2c4good!13:18
zx2c4well, not good. but good that it's reproducible13:18
shorneyeah, well my fix is not working, I tried some different things, i still see the same rcu failure14:07
shorneit is good I have a test though, sometimes I see there are just lockups and no progress14:07
shorneI need to build a debug kernel to see what is going on in those cases via gdb14:07
shorneinteresting, it's two cpus running cache invalidation at the same time on different pages14:21
shorneboth telling all other cpus, go invalidate your pages and report back14:22
shorneit shouldn't be a deadlock though because the CPU requesting invalidations should be allowed to handle IPIs while it's waiting for its requests to complete14:23
shornebut its fishy14:23
shornehttps://gist.github.com/stffrdhrn/40515284d3cb7dea253441867b8336fd14:27
shorneanyway, I'll go to bed and think about it, I'll read more code in the morning14:27
shornefyi, these are the fixes I have so far, but not helping: https://gist.github.com/stffrdhrn/15e836f35dba4e6615b21a2a2550b0fc14:30
shorneI tried with a full local_irq_save()/local_irq_restore() rather than the preempt_enable/disable but that has more issues14:31
shorneay, in smp_call_function_many_cond, it explains there could be a deadlock situation here14:40
*** littlebobeep <littlebobeep!~alMalsamo@gateway/tor-sasl/almalsamo> has joined #openrisc19:37
*** littlebobeep <littlebobeep!~alMalsamo@gateway/tor-sasl/almalsamo> has quit IRC (Ping timeout: 268 seconds)19:44
shornewell, to make me feel good I ran with NR_CPUS=121:18
shorne[+] Tests successful! :-)21:18
shorne[  281.856000] reboot: Restarting system21:18
shorneFYI21:18
shornezx2c4: https://gist.github.com/stffrdhrn/32a7413bfd8c2abd676deea597547ea8 <-- fix result chardev to use virtio21:19
shorneon the virt platform I just set up 1 serial device, the rest are virtio21:20
