Tuesday, 2022-06-21

*** tpb <[email protected]> has joined #openrisc		00:00
shorne	zx2c4: alright, I might have a fix. I saw the same stack trace you had, and it was always the same	10:58
shorne	it seems that it explained the whole thing, we were getting timer interrupts during cpu icache invalidation (which is done when user space executables are loaded)	10:59
zx2c4	oooo!	10:59
zx2c4	nice!!	10:59
shorne	openrisc was not protecting against interrupts coming in during icache flushing	10:59
shorne	so if we got a timer during icache flushing it caused some kind of nested interrupts that would lock things up	11:00
shorne	i.e. while loading a program, it gets interrupted and might reschedule another user program	11:00
shorne	anyway, posting a patch you might want to try, I will do some of my testing as well	11:01
shorne	there might be more I can do for the icache flushing, but this basic patch seems to work	11:01
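A rough user-space model of the failure shorne describes above, not the actual OpenRISC code: a timer interrupt arrives partway through an icache flush, and masking interrupts defers it until the flush has finished. `model_icache_flush`, `struct flush_result`, and all parameters are hypothetical names made up for this sketch.

```c
#include <stdbool.h>

/* Hypothetical result record for the model; not a kernel structure. */
struct flush_result {
    int  lines_flushed;        /* cache lines invalidated so far */
    bool irq_ran_mid_flush;    /* handler ran before the flush finished */
    bool irq_ran_after_flush;  /* handler ran once interrupts were unmasked */
};

/*
 * Flush nlines cache lines. A simulated timer interrupt fires when the
 * loop reaches line timer_fires_at. If mask_irqs is set, the interrupt is
 * held pending and delivered only after the whole flush is done.
 */
struct flush_result model_icache_flush(int nlines, int timer_fires_at,
                                       bool mask_irqs)
{
    struct flush_result r = { 0, false, false };
    bool pending = false;
    int line;

    for (line = 0; line < nlines; line++) {
        if (line == timer_fires_at) {
            if (mask_irqs)
                pending = true;              /* held until unmasked */
            else
                r.irq_ran_mid_flush = true;  /* handler interrupts the flush */
        }
        r.lines_flushed++;                   /* invalidate this line */
    }

    if (pending)
        r.irq_ran_after_flush = true;        /* delivered on restore */

    return r;
}
```

Calling `model_icache_flush(16, 5, false)` reports the handler running mid-flush; with `true` as the last argument the same interrupt is only delivered after the last line has been flushed.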
zx2c4	that makes sense	11:05
zx2c4	ill give it a spin	11:05
zx2c4	lemme know when youve pushed to your branches and ill try qemu+linux	11:05
shorne	the kernel fix is here: https://github.com/stffrdhrn/linux/commits/or1k-virt	11:06
shorne	no changes on qemu since last I posted or1k-virt-2 branch	11:07
zx2c4	(compiling...)	11:12
zx2c4	so far, still seems overwhelmingly slow	11:14
zx2c4	ah and i just got the lockup	11:14
shorne	ah.. must be more, is it in userspace?	11:15
zx2c4	shorne: https://א.cc/8ie4vC0F	11:15
shorne	or during boot and selftests?	11:15
zx2c4	userspace	11:15
zx2c4	I picked ` openrisc: cache: Disable preemption when flushing icache pages`	11:15
zx2c4	this is rc3 + ` irqchip: or1k-pic: Undefine mask_ack for level triggered hardware` + ` openrisc: mm: Add support for multiple tlb ways` + ` openrisc: cache: Disable preemption when flushing icache pages`	11:16
shorne	yeah, I think probably the issue is a bit higher	11:16
shorne	I mean I also need to disable preemption on the main cpu that schedules the IPIs to invalidate the icache	11:17
shorne	I need to be able to run this same test	11:17
zx2c4	`git clone -b jd/openrisc https://git.zx2c4.com/linux-rng`	11:17
shorne	I was planning to run the glibc test suite, but need to get the environment up again, but if you have something easy I can try yours	11:18
zx2c4	and then run	11:18
zx2c4	ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16	11:18
zx2c4	here's whats in that branch https://git.zx2c4.com/linux-rng/log/?h=jd/openrisc	11:18
tpb	Title: linux-rng - Development tree for the kernel CSPRNG (at git.zx2c4.com)	11:18
zx2c4	successive runs of `ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16` will skip building the kernel if nothing wireguard-related changes, so you might want to prefix that with `touch drivers/net/wireguard/device.c` or something	11:19
zx2c4	it uses qemu-system-or1k from PATH, so adjust your PATH to have your qemu build directory in front	11:20
shorne	ok, this is cool	11:21
shorne	let me give it a spin	11:21
shorne	sorry, helping put kids to sleep now...	11:21
zx2c4	i wonder if the irq masking is broken?	11:33
zx2c4	im adding local_irq_save()/restore() inside of ipi_icache_page_inv, and i still get the same issue	11:33
zx2c4	i guess its tricky because this smp on_each_cpu stuff works with an interrupt	11:34
zx2c4	actually, i dunno if the timer interrupt is actually what's causing this. i suspect that's just what you see when RCU stalls, because that's how RCU knows it's stalled...	11:36
shorne	I think we need to put the protection inside of smp_icache_page_inv, but then in there we can invalidate the current CPU cache first, then invalidate other CPUs	11:59
shorne	we probably have to protect while waiting on smp_icache_page_inv to finish, and on each cpu	11:59
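The ordering shorne proposes could look roughly like this in a user-space model: take the protection once, invalidate the local CPU first, then the other CPUs, and only then drop the protection. Everything prefixed `model_` is a hypothetical stand-in, the preempt counter is an assumption about which protection is meant, and the remote invalidation is a plain function call rather than a real IPI.

```c
#include <stddef.h>
#include <stdbool.h>

#define MODEL_MAX_CPUS 8

/* Hypothetical log of which CPU invalidated in which order. */
struct model_inv_log {
    int    order[MODEL_MAX_CPUS];       /* CPU ids in invalidation order */
    bool   protected_[MODEL_MAX_CPUS];  /* was protection held at that time? */
    size_t count;
};

/* Stand-in for the kernel's preempt count; assumption for this sketch. */
int model_preempt_depth;

void model_preempt_disable(void) { model_preempt_depth++; }
void model_preempt_enable(void)  { model_preempt_depth--; }

/* Record one per-CPU invalidation instead of touching a real cache. */
void model_cpu_icache_page_inv(struct model_inv_log *log, int cpu)
{
    if (log->count >= MODEL_MAX_CPUS)
        return;
    log->order[log->count] = cpu;
    log->protected_[log->count] = model_preempt_depth > 0;
    log->count++;
}

/* Local CPU first, then every other CPU, all under one protected region. */
void model_smp_icache_page_inv(struct model_inv_log *log, int this_cpu,
                               int ncpus)
{
    int cpu;

    model_preempt_disable();
    model_cpu_icache_page_inv(log, this_cpu);
    for (cpu = 0; cpu < ncpus; cpu++) {
        if (cpu != this_cpu)
            model_cpu_icache_page_inv(log, cpu);
    }
    model_preempt_enable();
}
```

With `this_cpu = 2` and `ncpus = 4` the log reads 2, 0, 1, 3, and every entry is recorded while the protection is held.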
shorne	thats what I was thinking of doing, but then I didn't have a great test case and the patch I posted seemed to help	12:00
shorne	zx2c4: FYI, I will be moving to london in about 2 months	12:01
zx2c4	oh cool	12:01
zx2c4	short distance away!	12:01
zx2c4	the title of your blog will change i guess	12:01
shorne	haha, yeah I will need to write some new entries for it	12:31
shorne	I was planning on writing about qemu stuff	12:31
shorne	zx2c4: are you using the toolchain downloaded from selftests? I am getting linker issues with it	12:59
zx2c4	yes. no modifications	12:59
zx2c4	can i see your output?	12:59
zx2c4	did you do `make mrproper` midway through the build process perchance?	12:59
zx2c4	run	13:00
zx2c4	ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 clean	13:00
zx2c4	followed by a fresh	13:00
zx2c4	ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16	13:00
zx2c4	shorne: ^	13:00
shorne	https://gist.github.com/e8456835b6eef8055dc695cbf5a42d3c	13:00
zx2c4	you ran `make mrproper` :)	13:01
zx2c4	run `ARCH=or1k make -C tools/testing/selftests/wireguard/qemu -j16 clean` and then try again	13:01
shorne	ok	13:01
shorne	makes sense	13:01
shorne	thats where all those .o files went :)	13:02
zx2c4	yulp	13:02
shorne	its running now	13:08
shorne	got some stalls	13:11
zx2c4	good!	13:18
zx2c4	well, not good. but good that it's reproducible	13:18
shorne	yeah, well my fix is not working, I tried some different things, i still see the same rcu failure	14:07
shorne	it is good I have a test though, sometimes I see there are just lockups and no progress	14:07
shorne	I need to build a debug kernel to see what is going on in those cases via gdb	14:07
shorne	interesting, it's two cpus running cache invalidation at the same time on different pages	14:21
shorne	both telling all other cpus, go invalidate your pages and report back	14:22
shorne	it shouldn't be a deadlock though because the CPU requesting invalidations should be allowed to handle IPI's while it's waiting for its requests to complete	14:23
shorne	but its fishy	14:23
shorne	https://gist.github.com/stffrdhrn/40515284d3cb7dea253441867b8336fd	14:27
shorne	anyway, ill go to bed and think about it, ill read more code in the morning	14:27
shorne	fyi, these are the fixes I have so far, but not helping: https://gist.github.com/stffrdhrn/15e836f35dba4e6615b21a2a2550b0fc	14:30
shorne	I tried with a full local_irq_save()/local_irq_restore() rather than the preempt_enable/disable but that has more issues	14:31
shorne	ay, in smp_call_function_many_cond, it explains there could be a deadlock situation here	14:40
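The deadlock hazard discussed above can be illustrated with a small deterministic model, assuming two CPUs that each send the other an invalidate request and then spin until it is acknowledged. This is only a sketch: `struct cpu_model` and `run_cross_invalidate` are hypothetical names, and none of it is kernel code. If neither CPU can take interrupts while it waits, neither request is ever acknowledged.

```c
#include <stdbool.h>

#define MODEL_NCPUS 2

/* Hypothetical model of one CPU; not a kernel structure. */
struct cpu_model {
    bool irqs_enabled;  /* can this CPU take an IPI right now? */
    bool ipi_pending;   /* has the peer asked us to invalidate? */
    bool waiting_ack;   /* are we spinning until the peer answers? */
};

/*
 * Both CPUs request a cross-CPU invalidation at the same time and then
 * spin until the peer acknowledges. Returns true if both requests
 * complete within max_steps, false if the model is stuck.
 */
bool run_cross_invalidate(bool irqs_enabled_while_waiting, int max_steps)
{
    struct cpu_model cpu[MODEL_NCPUS];
    int i, step;

    for (i = 0; i < MODEL_NCPUS; i++) {
        cpu[i].irqs_enabled = irqs_enabled_while_waiting;
        cpu[i].ipi_pending = true;  /* the peer's request has arrived */
        cpu[i].waiting_ack = true;  /* our own request is outstanding */
    }

    for (step = 0; step < max_steps; step++) {
        bool all_done = true;

        for (i = 0; i < MODEL_NCPUS; i++) {
            int peer = (i + 1) % MODEL_NCPUS;

            /* An IPI is only serviced if interrupts are enabled. */
            if (cpu[i].ipi_pending && cpu[i].irqs_enabled) {
                cpu[i].ipi_pending = false;
                cpu[peer].waiting_ack = false;  /* ack the requester */
            }
        }

        for (i = 0; i < MODEL_NCPUS; i++) {
            if (cpu[i].waiting_ack)
                all_done = false;
        }
        if (all_done)
            return true;
    }

    return false;  /* both CPUs still spinning: deadlock in this model */
}
```

`run_cross_invalidate(true, 10)` completes, while `run_cross_invalidate(false, 10)` never makes progress, which matches shorne's point that the requesting CPU must still be able to handle IPIs while it waits.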
*** littlebobeep <littlebobeep!~alMalsamo@gateway/tor-sasl/almalsamo> has joined #openrisc		19:37
*** littlebobeep <littlebobeep!~alMalsamo@gateway/tor-sasl/almalsamo> has quit IRC (Ping timeout: 268 seconds)		19:44
shorne	well, to make me feel good I ran with NR_CPUS=1	21:18
shorne	[+] Tests successful! :-)	21:18
shorne	[ 281.856000] reboot: Restarting system	21:18
shorne	FYI	21:18
shorne	zx2c4: https://gist.github.com/stffrdhrn/32a7413bfd8c2abd676deea597547ea8 <-- fix result chardev to use virtio	21:19
shorne	on the virt platform I just set up 1 serial device, the rest are virtio	21:20
