Re: [PATCH v1 1/8] tools/nolibc: add support for powerpc

From: Willy Tarreau
Date: Tue Jul 25 2023 - 14:28:02 EST


Hi Zhangjin,

On Tue, Jul 25, 2023 at 07:02:55PM +0800, Zhangjin Wu wrote:
> > > With this new syscall.h, to support my_syscall<N>, the arch-<ARCH>.h
> > > will only require to add ~10 lines to define their own syscall
> > > instructions, registers and clobberlist, which looks like this (for
> > > powerpc):
> > >
> > > #define _NOLIBC_SYSCALL_CALL "sc; bns+ 1f; neg %0, %0; 1:"
> > >
> > > /* PowerPC doesn't always restore r3-r12 for us */
> > > #define _NOLIBC_SYSCALL_CLOBBERLIST \
> > > "memory", "cr0", "r12", "r11", "r10", "r9", "r8", "r7", "r6", "r5", "r4"
> > >
> > > /* PowerPC write GPRS in kernel side but not restore them */
> > > #define _NOLIBC_GPRS_AS_OUTPUT_OPERANDS
> > >
> > > #define _NOLIBC_REG_NUM "r0"
> > > #define _NOLIBC_REG_RET "r3"
> > > #define _NOLIBC_REG_arg1 "r3"
> > > #define _NOLIBC_REG_arg2 "r4"
> > > #define _NOLIBC_REG_arg3 "r5"
> > > #define _NOLIBC_REG_arg4 "r6"
> > > #define _NOLIBC_REG_arg5 "r7"
> > > #define _NOLIBC_REG_arg6 "r8"
> > >
> > > Before:
> > >
> > > $ ls tools/include/nolibc/arch-*.h | while read f; do git show dfef4fc45d5713eb23d87f0863aff9c33bd4bfaf:$f 2>/dev/null | wc -l | tr -d '\n'; echo " $f"; done
> > > 157 tools/include/nolibc/arch-aarch64.h
> > > 199 tools/include/nolibc/arch-arm.h
> > > 178 tools/include/nolibc/arch-i386.h
> > > 164 tools/include/nolibc/arch-loongarch.h
> > > 195 tools/include/nolibc/arch-mips.h
> > > 0 tools/include/nolibc/arch-powerpc.h
> > > 160 tools/include/nolibc/arch-riscv.h
> > > 186 tools/include/nolibc/arch-s390.h
> > > 176 tools/include/nolibc/arch-x86_64.h
> > >
> > > After:
> > >
> > > $ wc -l tools/include/nolibc/arch-*.h
> > > 54 tools/include/nolibc/arch-aarch64.h
> > > 84 tools/include/nolibc/arch-arm.h
> > > 90 tools/include/nolibc/arch-i386.h /* the last one use stack to pass arguments, reserve as-is */
> > > 59 tools/include/nolibc/arch-loongarch.h
> > > 120 tools/include/nolibc/arch-mips.h /* the last two use stack to pass arguments, reserve as-is */
> > > 73 tools/include/nolibc/arch-powerpc.h
> > > 58 tools/include/nolibc/arch-riscv.h
> > > 87 tools/include/nolibc/arch-s390.h
> > > 67 tools/include/nolibc/arch-x86_64.h
> > >
> > > syscall.h itself:
> > >
> > > $ wc -l tools/include/nolibc/syscall.h
> > > 112 tools/include/nolibc/syscall.h
> >
> > The important thing to consider is not the number of lines but the
> > *maintainability*.
>
> The original goal is not really the number of lines (only a
> 'side-effect'), but is exactly easier porting/maintainability with
> clearer code architecture,

I do feel the exact opposite. One is totally straightforward with
self-explanatory function names and their equivalent machine-specific
asm() statements, the other one involves countless cryptic macros for
which it is particularly difficult to figure out what depends on what.

> > You factored the syscall code so much above with all
> > these macros that I don't even understand how they're going to interact
> > with each other, especially "%0".
>
> Yeah, it is my fault, this should be cleaned up with the return register
> directly:
>
> #define _NOLIBC_SYSCALL_CALL \
> "sc; bns+ 1f; neg 3, 3; 1:"

This doesn't change my point of view on it, really.

> > Also I don't know what the macro
> > _NOLIBC_GPRS_AS_OUTPUT_OPERANDS does.
>
> This is the root cause to inspire me to add the new syscall.h, let's
> explain the background step by step.
>
> All of the other architectures (except PowerPC) restore GPRS for us when
> return from syscall, so, their clobber list simply not include the GPRS
> and only need to add the input registers in the 'INPUT Operands' list.

I still have no idea what a GPRS is.

> But PowerPC doesn't restore such GPRS for us, I'm not sure if it is a
> feature (Maybe) or a bug.

We don't really care. The *exact* purpose of an asm() statement is to
write stuff that cannot be expressed at a higher level. Sure sometimes
you can abuse macros. But this should be extremely light. Here you seem
to be using a common asm statement for everyone and going to stuff the
combination of all these macros into it. asm() statements are already
quite cryptic for a lot of people, and the minimum required is that
they are easy to read so that the few who know what these are doing can
help debug them. When Ammar spotted the alignment bug in our _start
code, it didn't take long to figure the root cause of the issue nor to
fix it, precisely because that code was straightforward for someone with
a bit of asm skills. But how do you expect anyone to figure out what's
happening in something full of abstractions? For example, in order to add
the stackprot support, Thomas just had to append a call at various
points. When you need to do that in factored code that's forcefully
arranged to try to suit all archs and toolchains at once, it may end up
being almost impossible without breaking the organization and starting
to create arch-specific definitions again.

> for PowerPC32, the following line will restore
> the GPRS for us, but may be not a good idea to do so for it may decrease
> the syscall performance although save some instructions in user-space
> and also, the other libcs also follow the current rule, so, this may be
> a design intention we must follow (welcome your suggestions).
>
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index fe27d41f9a3d..1ed535e9144c 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -155,6 +155,7 @@ syscall_exit_finish:
> bne 3f
> mtcr r5
>
> + REST_GPRS(3, 12, r1)
> 1: REST_GPR(2, r1)
> REST_GPR(1, r1)
> rfi

I don't know PPC and I have zero opinion there. For this we'll ask one
of the PPC maintainers for guidance, and since we have clean asm code,
they will easily be able to say "yes that's fine", "hmmm no that's not
the right way to do it", or "I suspect you forgot to save flags here",
or anything else, because the code will match a pattern they know well.
With all the macros hell, it will just be "hmmm good luck".

> For PowerPC, if with the previous method like the other architectures,
> the clobber list differs for every my_syscall<N> and all of the input
> registers must be put in the 'OUTPUT Operands' too to avoid compiler to
> save and reuse a variable in such GPRS across my_syscall<N> calls.

But do you realize that you're proposing to write macros to factor things
between archs that are already different within a single arch? How is
this supposed to help with anything?

> Originally in my local new version of arch-powerpc.h, we got such code
> for every my_syscall<N>, use my_syscall6() as an example:
>
> #define my_syscall6(num, arg1, arg2, arg3, arg4, arg5, arg6) \
> ({ \
> register long _ret __asm__ ("r3"); \
> register long _num __asm__ ("r0") = (num); \
> register long _arg1 __asm__ ("r3") = (long)(arg1); \
> register long _arg2 __asm__ ("r4") = (long)(arg2); \
> register long _arg3 __asm__ ("r5") = (long)(arg3); \
> register long _arg4 __asm__ ("r6") = (long)(arg4); \
> register long _arg5 __asm__ ("r7") = (long)(arg5); \
> register long _arg6 __asm__ ("r8") = (long)(arg6); \
> \
> __asm__ volatile ( \
> " sc\n" \
> " bns+ 1f\n" \
> " neg %0, %0\n" \
> "1:\n" \
> : "=r"(_ret), \
> "+r"(_arg2), "+r"(_arg3), "+r"(_arg4), \
> "+r"(_arg5), "+r"(_arg6) \
> : "0"(_arg1), "r"(_num) \
> : _NOLIBC_SYSCALL_CLOBBERLIST \
> ); \
> _ret; \
> })
>
> It almost aligns with the other architectures, but the full clobber list
> differs for every my_syscall<N>, the basic one is:
>
> /* PowerPC kernel doesn't always restore r4-r12 for us */
> #define _NOLIBC_SYSCALL_CLOBBERLIST \
> "memory", "cr0", "r12", "r11", "r10", "r9"
>
> Use my_syscall0() as a further example, we need something like this:
>
> #define my_syscall0(num) \
> ({ \
> register long _ret __asm__ ("r3"); \
> register long _num __asm__ ("r0") = (num); \
> \
> __asm__ volatile ( \
> " sc\n" \
> " bns+ 1f\n" \
> " neg %0, %0\n" \
> "1:\n" \
> : "=r"(_ret) \
> : "r"(_num) \
> : _NOLIBC_SYSCALL_CLOBBERLIST, "r8", "r7", "r6", "r5", "r4" \
> ); \
> _ret; \
> })
>
> The additional "r8"..."r4" must be appended to the clobber list here
> because they cannot be shared by every my_syscall<N>: a register may not
> appear both in the clobber list and in the "OUTPUT/INPUT operands".

Perfect, yet another example of the real purpose of asm() statements.
They're not generic and are made to finely tune what you're inserting.

> I found a solution to share the same _NOLIBC_SYSCALL_CLOBBERLIST, that
> is split the "OUTPUT/INPUT Operands" list out of the core syscall
> assembly:
>
> #define my_syscall6(num, arg1, arg2, arg3, arg4, arg5, arg6) \
> ({ \
> register long _ret __asm__ ("r3"); \
> register long _num __asm__ ("r0") = (num); \
> register long _arg1 __asm__ ("r3") = (long)(arg1); \
> register long _arg2 __asm__ ("r4") = (long)(arg2); \
> register long _arg3 __asm__ ("r5") = (long)(arg3); \
> register long _arg4 __asm__ ("r6") = (long)(arg4); \
> register long _arg5 __asm__ ("r7") = (long)(arg5); \
> register long _arg6 __asm__ ("r8") = (long)(arg6); \
> \
> __asm__ volatile ("": "+r"(_arg2), "+r"(_arg3), "+r"(_arg4), \
> "+r"(_arg5), "+r"(_arg6)::); \
> \
> __asm__ volatile ( \
> " sc\n" \
> " bns+ 1f\n" \
> " neg %0, %0\n" \
> "1:\n" \
> : "=r"(_ret) \
> : "0"(_arg1), "r"(_num) \
> : _NOLIBC_SYSCALL_CLOBBERLIST \
> ); \
> _ret; \
> })

So basically "it happens to work but we don't know why, but this is still
much more maintainable"? Please no, really no, no, no. That's ugly,
tricky and you don't even know what the compiler will do between these
two statements.

> Note, a question here is, the above split still require more discussion
> to make sure it does work for different toolchains (only test for gcc
> currently) or even if this method is right from scratch, welcome your
> suggestion.

asm() statements are used to work around toolchain limitations/differences
and bugs. I've seen Ammar's response with the link to the gcc bug, that's
a good example as well of the reasons why we MUST NOT do these hacks.

> As a result, all of the my_syscall<N> are able to share the core syscall
> calling assembly block, so, here is what we at last have:
>
> #define _my_syscall_tail() \
> __asm__ volatile ( \
> _NOLIBC_SYSCALL_CALL \
> : "=r"(_ret) \
> : "r"(_num) \
> : _NOLIBC_SYSCALL_CLOBBERLIST \
> ); \
> _ret
>
> And further, we also found it was possible to share most of them among
> these not ugly but completely duplicated lines:

But please, from the beginning, all I understand is "it is possible to",
but I still fail to understand the ultimate goal. Making the code uglier
and unmaintainable because it is possible is not a valid argument. For
doing stuff like above, there must be a serious limitation to work around
that has no other solution, and even then a huge #if/#endif could possibly
do it.

> That's why at last we have such blocks (of course, for PowerPC itself
> it is a big change and not necessary):
>
> #define _my_syscall_head(num) \
> register long _ret __asm__ (_NOLIBC_REG_RET); \
> register long _num __asm__ (_NOLIBC_REG_NUM) = (num) \
>
> #ifdef _NOLIBC_REG_ERR
> #define _NOLIBC_REG_EXTRA _NOLIBC_REG_ERR
> #endif
>
> #ifdef _NOLIBC_REG_EXTRA
> #define _my_syscall_extra() \
> register long reg_extra __asm__ (_NOLIBC_REG_EXTRA); \
> __asm__ volatile ("": "=r"(reg_extra)::)
> #else
> #define _my_syscall_extra()
> #endif
>
> /* Architectures like PowerPC write GPRS in kernel side and not restore them */
> #ifndef _NOLIBC_GPRS_AS_OUTPUT_OPERANDS
> #define _my_syscall_argn(n, arg) \
> register long _arg##n __asm__ (_NOLIBC_REG_arg##n) = (long)(arg); \
> __asm__ volatile ("":: "r"(_arg##n):)
> #else
> #define _my_syscall_argn(n, arg) \
> register long _arg##n __asm__ (_NOLIBC_REG_arg##n) = (long)(arg); \
> __asm__ volatile ("": "+r"(_arg##n)::)
> #endif

And will anyone be able to help us work around a compiler or assembler bug
in this? I can't even muster enough concentration on the whole block to
understand what it's trying to do or what interacts with what. I'm sorry,
that's not a way to deal with asm, nor with code shared among multiple
developers in general.

(...)
> And at last:
>
> #define __my_syscall_args(N, ...) _my_syscall_args##N(__VA_ARGS__)
> #define _my_syscall_args(N, ...) __my_syscall_args(N, ##__VA_ARGS__)
>
> #define __my_syscall_narg(_0, _1, _2, _3, _4, _5, _6, N, ...) N
> #define _my_syscall_narg(...) __my_syscall_narg(__VA_ARGS__, 6, 5, 4, 3, 2, 1, 0)
>
> #define __my_syscall_argsn(N, argn, ...) \
> _my_syscall_args(_my_syscall_narg(NULL, ##__VA_ARGS__), ##__VA_ARGS__); \
> _my_syscall_argn(N, argn)
>
> #define _my_syscall_argsn(...) __my_syscall_argsn(_my_syscall_narg(NULL, ##__VA_ARGS__), ##__VA_ARGS__)
>
> /* Note, my_syscall0() has no argument, can not use my_syscalln() */
> #define my_syscall0(num) \
> ({ \
> _my_syscall_head(num); \
> _my_syscall_extra(); \
> _my_syscall_tail(); \
> })
>
> #define my_syscalln(num, ...) \
> ({ \
> _my_syscall_head(num); \
> _my_syscall_extra(); \
> _my_syscall_argsn(__VA_ARGS__); \
> _my_syscall_tail(); \
> })
>
> #define my_syscall1(num, arg1) my_syscalln(num, arg1)
> #define my_syscall2(num, arg1, arg2) my_syscalln(num, arg2, arg1)
> #define my_syscall3(num, arg1, arg2, arg3) my_syscalln(num, arg3, arg2, arg1)
> #define my_syscall4(num, arg1, arg2, arg3, arg4) my_syscalln(num, arg4, arg3, arg2, arg1)
>
> #ifndef my_syscall5
> #define my_syscall5(num, arg1, arg2, arg3, arg4, arg5) my_syscalln(num, arg5, arg4, arg3, arg2, arg1)
> #endif
> #ifndef my_syscall6
> #define my_syscall6(num, arg1, arg2, arg3, arg4, arg5, arg6) my_syscalln(num, arg6, arg5, arg4, arg3, arg2, arg1)
> #endif
>
> At last, I found this worked on all of the supported architectures, so,
> the new syscall.h is proposed.

If the ultimate goal is *just* to provide my_syscalln(), there is no need
to rework all archs like this. The following does the job as well: it
lets my_syscalln(syscall_num, ...) call the respective
my_syscall0/1/2/3/4/5/6 depending on the number of arguments:

/* my_syscalln() will automatically map to my_syscall<n>() depending
* on the number of arguments after the syscall, from 0 to 6. It uses
* positional arguments after a VA_ARGS to resolve as an argument
* count that's then used to build the underlying macro's name.
*/
#define _my_syscall0(num, a, b, c, d, e, f) my_syscall0(num)
#define _my_syscall1(num, a, b, c, d, e, f) my_syscall1(num, a)
#define _my_syscall2(num, a, b, c, d, e, f) my_syscall2(num, a, b)
#define _my_syscall3(num, a, b, c, d, e, f) my_syscall3(num, a, b, c)
#define _my_syscall4(num, a, b, c, d, e, f) my_syscall4(num, a, b, c, d)
#define _my_syscall5(num, a, b, c, d, e, f) my_syscall5(num, a, b, c, d, e)
#define _my_syscall6(num, a, b, c, d, e, f) my_syscall6(num, a, b, c, d, e, f)
#define _my_syscalln(num, a, b, c, d, e, f, g, ...) _my_syscall##g(num, a, b, c, d, e, f)
#define my_syscalln(num, ...) _my_syscalln(num, ##__VA_ARGS__, 6, 5, 4, 3, 2, 1, 0)

This way there's no need to modify the arch-specific syscall definitions,
this will simply rely on them and preserve their maintainability.

> BTW, another question here is, to utilize the feature of __VA_ARGS__ to
> easier getting the last argument, the order of arguments are reversed
> during the declarations of the my_syscall<N>, any suggestion on this
> part? is it possible to not reverse the order?

There's no reverse order; it's a well-known method that consists in making
a number appear at a fixed position depending on the number of preceding
arguments:

my_syscalln(n, a, b, c, d, e, f) becomes
_my_syscalln(n, a, b, c, d, e, f, 6, 5, 4, 3, 2, 1, 0)
                                  ^
This macro extracts the number marked above to build the next macro name:
_my_syscall6(n, a, b, c, d, e, f)
my_syscall6(n, a, b, c, d, e, f)

If you use fewer arguments, say 3, you get this:

my_syscalln(n, a, b, c)
_my_syscalln(n, a, b, c, 6, 5, 4, 3, 2, 1, 0)
_my_syscall3(n, a, b, c, d, e, f)
my_syscall3(n, a, b, c) // d, e, f are lost

The last level of macro is used to silently drop the extra args. When
the target is already a macro, it's not even necessary, as the macro
definition could already end with ", ...".

I've been using a macro similar to the one above in other projects for
quite a while and I know that it worked at least with gcc-3.4, so it's
definitely safe.
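
For anyone who wants to try the argument-counting trick in isolation,
here is a minimal sketch. The my_syscall0..6 bodies are stand-ins invented
for the illustration (the real ones hold the per-arch asm() statements):
each one packs its arguments into a decimal number, one digit per
argument, so that both the count and the order can be checked. The helper
demo_three_args() is likewise made up for the example. It assumes GNU C,
since ", ##__VA_ARGS__" is a gcc/clang extension.

```c
/* Stand-ins for the arch-specific macros (hypothetical bodies: the real
 * my_syscall<N>() contain the per-arch asm() statements). Each stub packs
 * its arguments into a decimal number, one digit per argument, so that
 * the number and order of the arguments reaching it can be checked.
 */
#define my_syscall0(num)                   (0L)
#define my_syscall1(num, a)                ((long)(a))
#define my_syscall2(num, a, b)             (my_syscall1(num, a) * 10 + (b))
#define my_syscall3(num, a, b, c)          (my_syscall2(num, a, b) * 10 + (c))
#define my_syscall4(num, a, b, c, d)       (my_syscall3(num, a, b, c) * 10 + (d))
#define my_syscall5(num, a, b, c, d, e)    (my_syscall4(num, a, b, c, d) * 10 + (e))
#define my_syscall6(num, a, b, c, d, e, f) (my_syscall5(num, a, b, c, d, e) * 10 + (f))

/* Dispatcher: the trailing "6, 5, 4, 3, 2, 1, 0" list is shifted right by
 * one position per user argument, so parameter "g" always receives the
 * argument count, which is then pasted into the target macro's name.
 */
#define _my_syscall0(num, a, b, c, d, e, f) my_syscall0(num)
#define _my_syscall1(num, a, b, c, d, e, f) my_syscall1(num, a)
#define _my_syscall2(num, a, b, c, d, e, f) my_syscall2(num, a, b)
#define _my_syscall3(num, a, b, c, d, e, f) my_syscall3(num, a, b, c)
#define _my_syscall4(num, a, b, c, d, e, f) my_syscall4(num, a, b, c, d)
#define _my_syscall5(num, a, b, c, d, e, f) my_syscall5(num, a, b, c, d, e)
#define _my_syscall6(num, a, b, c, d, e, f) my_syscall6(num, a, b, c, d, e, f)
#define _my_syscalln(num, a, b, c, d, e, f, g, ...) _my_syscall##g(num, a, b, c, d, e, f)
#define my_syscalln(num, ...) _my_syscalln(num, ##__VA_ARGS__, 6, 5, 4, 3, 2, 1, 0)

/* Hypothetical helper so the result can be inspected from plain C code. */
long demo_three_args(void)
{
	/* resolves to my_syscall3(99, 1, 2, 3); the leftover 6, 5, 4 are dropped */
	return my_syscalln(99, 1, 2, 3);
}
```

With these stubs, my_syscalln(99) goes to my_syscall0(), and a call with
six arguments goes to my_syscall6() with the arguments in their original
order; the arch-specific definitions are never touched.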

> > And when someone reports a bug
> > like we had in the past with programs randomly crashing depending on
> > stack alignment and such, it becomes particularly tricky to figure what
> > is expected and how it differs from reality.
> >
>
> Macros are really hard to debug, the above code lines cost me two days ;-)

Someone once said that it requires a much stronger brain to solve a problem
than the one that created it. If it took you two days to arrange this,
imagine how long it will take someone who did not design this to debug
it! The time it took you is definitely not an argument for adopting this,
quite the opposite. Instead it should have convinced you that this was
going to become unmaintainable.

> but after the common syscall.h, the left architecture specific parts are
> very few and easier to debug and even less copy and paste.

Copy-paste is a problem when bugs need to be fixed. Here there's almost no
copy-paste: it is done initially to create a new arch, but in fact we're
reusing a skeleton to write completely different code. Because the code
for PPC and MIPS is different, there's no point imagining that once a bug
affects MIPS we need to automatically apply the same fix to PPC, because
it will be different.

> > I don't know. I still tend to think that
> > this significantly complicates the understanding of the whole thing.
> >
>
> Willy, don't worry, I do think it make things easier, the worse case is
> using the old code or only use part of our new blocks helpers ;-)

Sorry, and please don't take it badly, I don't mean to be rude,
but for me the worst case would be to use the new method, precisely
because for now probably only you know how it's supposed to work and
nobody can help us with side effects affecting it.

> For this new syscall.h, it mainly clear the inline assembly codes, as we
> know, inline assembly code is a very complicated thing,

That's why it must remain crystal clear.

> If we clear the
> logic carefully (as we target but not yet) in our common code,
> architecture developers only require to focus on the platform specific
> definitions, it should be better for portability, review and
> maintainability.

In order to work on assembly you first need to be able to locate it
and read it as a sequence of instructions. Here really, you need a
pen and paper to start to resolve it.

> It is very hard to learn the meaning of the OUTPUT operands, INPUT
> operands and even the clobber list and even the flags you mentioned
> below,

One more reason for not passing through this!

> > Also, looking at different archs' syscall code, they're not all defined
> > the same way. Some like i386 use "=a" for the return value. Others use
> > "+r" as an input/output value, others "=r",
>
> Agree, this is a very hard part of the inline assembly codes, clearing
> up the generic versions in syscall.h with additional comments may really
> help a lot.

No, precisely not. It's a hard part for people who don't deal with *that*
arch. But you will not find a developer at ease with all archs. Each one
has its own specifics. However you will find one or a few developers that
are experts on each architecture and who will instantly be able to correct
some of our mistakes, or warn us against toolchain bugs they're aware of
that we need to take care of. They will also know which constraints to
use. The constraint definitions are per-architecture, and for example "a"
on x86 is the accumulator (eax/rax). On other archs it can be something
else. There are archs which support register pairs, some must be aligned
on an even number and depending on how the calls are declared, the compiler
may improperly assign them and emit code that is impossible to assemble
(I've met this several times with the initial ARM code that we managed
to stabilize).
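
To make the point about per-architecture constraints concrete, here is a
minimal sketch of a zero-argument syscall on x86_64 Linux, where "=a" pins
the result to rax and "0" ties the syscall number to that same register.
The name raw_syscall0() is invented for the illustration, and the fallback
through the libc syscall() wrapper is only there so that the snippet still
builds elsewhere. None of it would carry over as-is to another arch, where
"a" may mean something else or not exist at all.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* raw_syscall0() is a hypothetical name used for this sketch only. */
long raw_syscall0(long num)
{
#if defined(__linux__) && defined(__x86_64__)
	long ret;

	__asm__ volatile (
		"syscall\n"
		: "=a"(ret)                /* result comes back in rax             */
		: "0"(num)                 /* number goes in via the same register */
		: "rcx", "r11", "memory"   /* rcx and r11 are overwritten by syscall */
	);
	return ret;
#else
	/* Not x86_64 Linux: let the libc wrapper apply the right convention. */
	return syscall(num);
#endif
}
```

Calling raw_syscall0(SYS_getpid) should return the same value as getpid().
Someone fluent in x86 can check the three operand lines at a glance, which
is the property the current per-arch asm() statements preserve.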

> For the input registers used as "ERR" or "RET" output, "+r" is used
> before, but now, we split them to two parts, one is "=r"(_ret) in
> _my_syscall_tail(), another is in _my_syscall_argn(n, arg), they
> together work for "+r" = "=r" + "r"

Please do not generalize. The examples I gave above are things that
were initially hard to adjust, precisely to help the compiler emit correct
code with a wide enough range of tools. The case Ammar talked about
is a perfect illustration.

> Even for s390, test shows, "r" instead of "d" works too

With your toolchain and the code you tested with. There might be a
particular reason, I don't know. Maybe the maintainer is used to
using this because it also works this way on another compiler and
he will be more fluent with this one. That's something important as
well when dealing with asm statements.

> (boot and test
> passed, not yet checked the size and codes generated), but I didn't find
> any document about "d" for s390 from the gcc online doc. This part
> (include all of the supported architectures) should be carefully checked
> if really adding our new syscall.h. add s390 nolibc committer to CC: list.

I do trust the s390 maintainers who contributed this code to know better
than either you or me, and it's certainly not up to us to ask them to
justify their choice. Actually it would be the other way around, you
would need a solid argument for changing code that works.

> But architectures like i386, If "=a", "=b", ... modifiers are necessary,
> new versions of the blocks should be added for these architectures.

You'll just end up with as many blocks in the end, but dealing only with
a union of exceptions. That's exactly the worst that can be imagined for
maintenance.

> And further, some architectures may reuse some helpers from our new syscalls.h
> or at least learn something from what we have done for all of the supported
> architectures.

The arch-specific code is already minimal. We have 7 asm statements for
7 syscall conventions. That's ridiculously low and they contain what
any such arch maintainer would need to find to extend or fix them.

> > If you
> > make it very complicated, I suspect we won't get any such contributions
> > anymore just because nobody figures how to find their way through it,
> > or it would require to change again for everyone just to add one specific
> > case. I tend to think that the current situation is already fairly
> > minimal with quite readable and debuggable calls, and that it's what
> > *really* matters.
> >
>
> This may really influence the upstream and review flow:
>
> - If people may send a new architecture support, if just fits the new
> syscall.h, we may need to review carefully about the test results,
> especially for the input/output operands, error register.

First they'd need to be able to figure out what to put where. Look, they
know what are the 3 instructions they need to put in an asm statement and
the list of registers, and suddenly they'd need to figure how to spread
them into cryptic macros, some of which are sometimes used, others always
etc. That turns a 20-minute test into half a day, without big assurance
at the end of the day that everything is right.

> As tests for powerpc shows, the above issues should be covered by our
> nolibc-test.
>
> - If people may send a new architecture support as before, If we find it
> fits our new syscalls.h or it can apply part of the blocks, we can
> give some suggestions.
>
> - If people send something not just fit our new syscall.h, we may be
> able to think about if a new 'hook' is required, but it is not
> necessary, we can delay this change requirement to after a careful
> design (just like the argument passing via 'stack' case currently) .

That's already the case with i386, s390 and so on. Except that it adds
significant burden for that person.

> > When you're trying to reorganize code, it's important to ask yourself
> > whether you would prefer to debug the old one or the new one at 3am.
> > Hint: at 3am, the more abstractions there are, the less understandable
> > it becomes.
> >
>
> Interesting and agree, but this abstraction does clear something to be more
> understandable too ;-)

It's really the first time I hear that abstractions make one-liner ASM
code clearer and more understandable. I'm sorry but no, really, it's
exactly the opposite.

> I do hate hard-debuggable macros, but as we mentioned above, the inline
> assembly code is another harder parts, the new carefully tuned blocks may really
> help us to avoid introducing bugs with manually written new code, and also it may
> help us to avoid copy and paste multiple duplicated lines of the same codes.

No, the asm blocks are trivial for those who speak this language and are
hard for other ones. The macros are significantly harder and for everyone.
I prefer to ask an s390 or PPC maintainer when I need help with their code
rather than tweak the generic code adding a "+r" for every arch then read
about reports saying that this arch breaks with that version of the
compiler on that program with that version of the assembler.

Please again, don't take any of this personally. My feeling is that you
ran into the difficulty of digging into some arch-specific code, wanted
to hide it, and now feel that the result is more maintainable,
but it's not. Maintainability in a shared project doesn't mean that you
are suddenly skilled on everything, but that you are able to find someone
skilled on your problem. It's not necessarily your task to debug an
architecture you don't know (though it's often very instructive), there
are other people for this and that's perfectly fine. We need to make the
task easy for them so that they don't have to learn all the nolibc tricks
to share their knowledge. In the current form with the asm statements
it's perfectly feasible and that's what matters.

Hoping this clarifies my position on this.

Thanks,
Willy