*** tpb <[email protected]> has joined #openrisc | 00:00 | |
shorne | zx2c4: alright, I might have a fix, I saw the stack trace you had too, it was always the same | 10:58 |
---|---|---|
shorne | it seems that it explained the whole thing, we were getting timer interrupts during cpu icache invalidation (which is done when user space executables are loaded) | 10:59 |
zx2c4 | oooo! | 10:59 |
zx2c4 | nice!! | 10:59 |
shorne | openrisc was not protecting against interrupts coming in during icache flushing | 10:59 |
shorne | so if we got a timer during icache flushing it caused some kind of nested interrupts that would lock things up | 11:00 |
shorne | i.e. while loading a program, it gets interrupted and might reschedule another user program | 11:00 |
shorne | anyway, posting a patch you might want to try, I will do some of my testing as well | 11:01 |
shorne | there might be more I can do for the icache flushing, but this basic patch seems to work | 11:01 |
zx2c4 | that makes sense | 11:05 |
zx2c4 | ill give it a spin | 11:05 |
zx2c4 | lemme know when youve pushed to your branches and ill try qemu+linux | 11:05 |
shorne | the kernel fix is here: https://github.com/stffrdhrn/linux/commits/or1k-virt | 11:06 |
shorne | no changes on qemu since last I posted or1k-virt-2 branch | 11:07 |
zx2c4 | (compiling...) | 11:12 |
zx2c4 | so far, still seems overwhelmingly slow | 11:14 |
zx2c4 | ah and i just got the lockup | 11:14 |
shorne | ah.. must be more, is it in userspace? | 11:15 |
zx2c4 | shorne: https://א.cc/8ie4vC0F | 11:15 |
shorne | or during boot and selftests? | 11:15 |
zx2c4 | userspace | 11:15 |
zx2c4 | I picked ` openrisc: cache: Disable preemption when flushing icache pages` | 11:15 |
zx2c4 | this is rc3 + ` irqchip: or1k-pic: Undefine mask_ack for level triggered hardware` + ` openrisc: mm: Add support for multiple tlb ways` + ` openrisc: cache: Disable preemption when flushing icache pages` | 11:16 |
shorne | yeah, I think probably the issue is a bit higher | 11:16 |
shorne | I mean I need to also disable preemption on the main cpu scheduling the IPIs to invalidate the icache | 11:17 |
shorne | I need to be able to run this same test | 11:17 |
zx2c4 | `git clone -b jd/openrisc https://git.zx2c4.com/linux-rng` | 11:17 |
shorne | I was planning to run the glibc test suite, but need to get the environment up again, but if you have something easy I can try yours | 11:18 |
zx2c4 | and then run | 11:18 |
zx2c4 | ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 | 11:18 |
zx2c4 | here's whats in that branch https://git.zx2c4.com/linux-rng/log/?h=jd/openrisc | 11:18 |
tpb | Title: linux-rng - Development tree for the kernel CSPRNG (at git.zx2c4.com) | 11:18 |
zx2c4 | successive runs of `ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16` will skip building the kernel if nothing wireguard-related changes, so you might want to prefix that with `touch drivers/net/wireguard/device.c` or something | 11:19 |
zx2c4 | it uses qemu-system-or1k from PATH, so adjust your PATH to have your qemu build directory in front | 11:20 |
shorne | ok, this is cool | 11:21 |
shorne | let me give it a spin | 11:21 |
shorne | sorry, helping put kids to sleep now... | 11:21 |
zx2c4 | i wonder if the irq masking is broken? | 11:33 |
zx2c4 | im adding local_irq_save()/restore() inside of ipi_icache_page_inv, and i still get the same issue | 11:33 |
zx2c4 | i guess its tricky because this smp on_each_cpu stuff works with an interrupt | 11:34 |
zx2c4 | actually, i dunno if the timer interrupt is actually what's causing this. i suspect that's just what you see when RCU stalls, because that's how RCU knows it's stalled... | 11:36 |
shorne | I think we need to put the protection inside of smp_icache_page_inv, but then in there we can invalidate the current CPU cache first, then invalidate other CPUs | 11:59 |
shorne | we probably have to protect while waiting on smp_icache_page_inv to finish, and on each cpu | 11:59 |
shorne | thats what I was thinking to do, but then I didn't have a great test case and the patch I posted seemed to help | 12:00 |
shorne | zx2c4: FYI, I will be moving to london in about 2 months | 12:01 |
zx2c4 | oh cool | 12:01 |
zx2c4 | short distance away! | 12:01 |
zx2c4 | the title of your blog will change i guess | 12:01 |
shorne | haha, yeah I will need to write some new entries for it | 12:31 |
shorne | I was planning on writing about qemu stuff | 12:31 |
shorne | zx2c4: are you using the toolchain downloaded from selftests? I am getting linker issues with it | 12:59 |
zx2c4 | yes. no modifications | 12:59 |
zx2c4 | can i see your output? | 12:59 |
zx2c4 | did you do `make mrproper` midway through the build process perchance? | 12:59 |
zx2c4 | run | 13:00 |
zx2c4 | ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 clean | 13:00 |
zx2c4 | followed by a fresh | 13:00 |
zx2c4 | ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 | 13:00 |
zx2c4 | shorne: ^ | 13:00 |
shorne | https://gist.github.com/e8456835b6eef8055dc695cbf5a42d3c | 13:00 |
zx2c4 | you ran `make mrproper` :) | 13:01 |
zx2c4 | run `ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 clean` and then try again | 13:01 |
shorne | ok | 13:01 |
shorne | makes sense | 13:01 |
shorne | thats where all those .o files went :) | 13:02 |
zx2c4 | yulp | 13:02 |
shorne | its running now | 13:08 |
shorne | got some stalls | 13:11 |
zx2c4 | good! | 13:18 |
zx2c4 | well, not good. but good that it's reproducible | 13:18 |
shorne | yeah, well my fix is not working, I tried some different things, i still see the same rcu failure | 14:07 |
shorne | it is good I have a test though, sometimes I see there are just lockups and no progress | 14:07 |
shorne | I need to build a debug kernel to see what is going on in those cases via gdb | 14:07 |
shorne | interesting, it's two cpus running cache invalidation at the same time on different pages | 14:21 |
shorne | both telling all other cpus, go invalidate your pages and report back | 14:22 |
shorne | it shouldn't be a deadlock though because the CPU requesting invalidations should be allowed to handle IPIs while its waiting for its requests to complete | 14:23 |
shorne | but its fishy | 14:23 |
shorne | https://gist.github.com/stffrdhrn/40515284d3cb7dea253441867b8336fd | 14:27 |
shorne | anyway, ill go to bed and think about it, ill read more code in the morning | 14:27 |
shorne | fyi, these are the fixes I have so far, but not helping: https://gist.github.com/stffrdhrn/15e836f35dba4e6615b21a2a2550b0fc | 14:30 |
shorne | I tried with a full local_irq_save()/local_irq_restore() rather than the preempt_enable/disable but that has more issues | 14:31 |
shorne | ay, in smp_call_function_many_cond, it explains there could be a deadlock situation here | 14:40 |
shorne | well, to make me feel good I ran with NR_CPUS=1 | 21:18 |
shorne | [+] Tests successful! :-) | 21:18 |
shorne | [ 281.856000] reboot: Restarting system | 21:18 |
shorne | FYI | 21:18 |
shorne | zx2c4: https://gist.github.com/stffrdhrn/32a7413bfd8c2abd676deea597547ea8 <-- fix result chardev to use virtio | 21:19 |
shorne | on the virt platform I just setup 1 serial device, the rest are virtio | 21:20 |