nsz Git - musl/log

disable MADV_FREE usage in mallocng

the entire intent of using madvise/MADV_FREE on freed slots is to
improve system performance by avoiding evicting cache of useful data,
or swapping useless data to disk, by marking any whole pages in the
freed slot as discardable by the kernel. in particular, unlike
unmapping the memory or replacing it with a PROT_NONE region, use of
MADV_FREE does not make any difference to memory accounting for commit
charge purposes, and so does not increase the memory available to
other processes in a non-overcommitted environment.

however, various measurements have shown that inordinate amounts of
time are spent performing madvise syscalls in processes which
frequently allocate and free medium sized objects in the size range
roughly between PAGESIZE and MMAP_THRESHOLD, to the point that the net
effect is almost surely significant performance degredation. so, turn
it off.

the code, which has some nontrivial logic for efficiently determining
whether there is a whole-page range to apply madvise to, is left in
place so that it can easily be re-enabled if desired, or later tuned
to only apply to certain sizes or to use additional heuristics.

remove LFS64 programming interfaces (macro-only) from _GNU_SOURCE

these badly pollute the namespace with macros whenever _GNU_SOURCE is
defined, which is always the case with g++, and especially tends to
interfere with C++ constructs.

as our implementation of these was macro-only, their removal cannot
affect any existing binaries. at the source level, portable software
should be prepared for them not to exist.

for now, they are left in place with explicit _LARGEFILE64_SOURCE.
this provides an easy temporary path for integrators/distributions to
get packages building again right away if they break while working on
a proper, upstreamable fix. the intent is that this be a very
short-term measure and that the macros be removed entirely in the next
release cycle.

remove LFS64 symbol aliases; replace with dynamic linker remapping

originally the namespace-infringing "large file support" interfaces
were included as part of glibc-ABI-compat, with the intent that they
not be used for linking, since our off_t is and always has been
unconditionally 64-bit and since we usually do not aim to support
nonstandard interfaces when there is an equivalent standard interface.

unfortunately, having the symbols present and available for linking
caused configure scripts to detect them and attempt to use them
without declarations, producing all the expected ill effects that
entails.

as a result, commit 2dd8d5e1b8ba1118ff1782e96545cb8a2318592c was made
to prevent this, using macros to redirect the LFS64 names to the
standard names, conditional on _GNU_SOURCE or _LARGEFILE64_SOURCE.
however, this has turned out to be a source of further problems,
especially since g++ defines _GNU_SOURCE by default. in particular,
the presence of these names as macros breaks a lot of valid code.

this commit removes all the LFS64 symbols and replaces them with a
mechanism in the dynamic linker symbol lookup failure path to retry
with the spurious "64" removed from the symbol name. in the future,
if/when the rest of glibc-ABI-compat is moved out of libc, this can be
removed.

dns query core: detect udp truncation at recv time

we already attempt to preclude this case by having res_send use a
sufficiently large temporary buffer even if the caller did not provide
one as large as or larger than the udp dns max of 512 bytes. however,
it's possible that the caller passed a custom-crafted query packet
using EDNS0, e.g. to get detailed DNSSEC results, with a larger udp
size allowance.

I have also seen claims that there are some broken nameservers in the
wild that do not honor the dns udp limit of 512 and send large answers
without the TC bit set, when the query was not using EDNS.

we generally don't aim to support broken nameservers, but in this case
both problems, if the latter is even real, have a common solution:
using recvmsg instead of recvfrom so we can examine the MSG_TRUNC
flag.

getaddrinfo dns lookup: use larger answer buffer to handle long CNAMEs

the size of 512 is not sufficient to get at least one address in the
worst case where the name is at or near max length and resolves to a
CNAME at or near max length. prior to tcp fallback, there was nothing
we could do about this case anyway, but now it's fixable.

the new limit 768 is chosen so as to admit roughly the number of
addresses with a worst-case CNAME as could fit for a worst-case name
that's not a CNAME in the old 512-byte limit. outside of this
worst-case, the number of addresses that might be obtained is
increased.

MAXADDRS (48) was originally chosen as an upper bound on the combined
number of A and AAAA records that could fit in 512-byte packets (31
and 17, respectively). it is not increased at this time.

so as to prevent a situation where the A records consume almost all of
these slots (at 768 bytes, a "best-case" name can fit almost 47 A
records), the order of parsing is swapped to process AAAA first. this
ensures roughly half of the slots are available to each address
family.

arpa/nameser.h: update RR types list

our RR type list in arpa/nameser.h was badly outdated, and missing
important types for DNSSEC and DANE use, among other things.

dns: implement tcp fallback in __res_msend query core

tcp fallback was originally deemed unwanted and unnecessary, since we
aim to return a bounded-size result from getaddrinfo anyway and
normally plenty of address records fit in the 512-byte udp dns limit.
however, this turned out to have several problems:

- some recursive nameservers truncate by omitting all the answers,
  rather than sending as many as can fit.

- a pathological worst-case CNAME for a worst-case name can fill the
  entire 512-byte space with just the two names, leaving no room for
  any addresses.

- the res_* family of interfaces allow querying of non-address records
  such as TLSA (DANE), TXT, etc. which can be very large. for many of
  these, it's critical that the caller see the whole RRset. also,
  res_send/res_query are specified to return the complete, untruncated
  length so that the caller can retry with an appropriately-sized
  buffer. determining this is not possible without tcp.

so, it's time to add tcp fallback.

the fallback strategy implemented here uses one tcp socket per
question (1 or 2 questions), initiated via tcp fastopen when possible.
the connection is made to the nameserver that issued the truncated
answer. right now, fallback happens unconditionally when truncation is
seen. this can, and may later be, relaxed for queries made by the
getaddrinfo system, since it will only use a bounded number of results
anyway.

retry is not attempted again after failure over tcp. the logic could
easily be adapted to do that, but it's of questionable value, since
the tcp stack automatically handles retransmission and the successs
answer with TC=1 over udp strongly suggests that the nameserver has
the full answer ready to give. further retry is likely just "take
longer to fail".

res_send: use a temp buffer if caller's buffer is under 512 bytes

for extremely small buffer sizes, the DNS query core in __res_msend
may malfunction completely, being unable to get even the headers to
determine the response code. but there is also a problem for
reasonable sizes under 512 bytes: __res_msend is unable to determine
if the udp answer was truncated at the recv layer, in which case it
may be incomplete, and res_send is then unable to honor its contract
to return the length of the full, non-truncated answer.

at present, res_send does not honor that contract anyway when the full
answer would exceed 512 bytes, since there is no tcp fallback, but
this change at least makes it consistent in a context where this is
the only "full answer" to be had.

adapt res_msend DNS query core for working with multiple sockets

this is groundwork for TCP fallback support, but does not itself
change behavior in any way.

getaddrinfo: add EAI_NODATA error code to distinguish NODATA vs NxDomain

this was apparently omitted long ago out of a lack of understanding of
its importance and the fact that POSIX doesn't specify it. despite not
being officially standardized, however, it turns out that at least
AIX, glibc, NetBSD, OpenBSD, QNX, and Solaris document and support it.

in certain usage cases, such as implementing a DNS gateway on top of
the stub resolver interfaces, it's necessary to distinguish the case
where a name does not exit (NxDomain) from one where it exists but has
no addresses (or other records) of the requested type (NODATA). in
fact, even the legacy gethostbyname API had this distinction, which we
were previously unable to support correctly because the backend lacked
it.

apart from fixing an important functionality gap, adding this
distinction helps clarify to users how search domain fallback works
(falling back in cases corresponding to EAI_NONAME, not in ones
corresponding to EAI_NODATA), a topic that has been a source of
ongoing confusion and frustration.

as a result of this change, EAI_NONAME is no longer a valid universal
error code for getaddrinfo in the case where AI_ADDRCONFIG has
suppressed use of all address families. in order to return an accurate
result in this case, getaddrinfo is modified to still perform at least
one lookup. this will almost surely fail (with a network error, since
there is no v4 or v6 network to query DNS over) unless a result comes
from the hosts file or from ip literal parsing, but in case it does
succeed, the result is replaced by EAI_NODATA.

glibc has a related error code, EAI_ADDRFAMILY, that could be used for
the AI_ADDRCONFIG case and certain NODATA cases, but distinguishing
them properly in full generality seems to require additional DNS
queries that are otherwise not useful. on glibc, it is only used for
ip literals with mismatching family, not for DNS or hosts file results
where the name has addresses only in the opposite family. since this
seems misleading and inconsistent, and since EAI_NODATA already covers
the semantic case where the "name" exists but doesn't have any
addresses in the requested family, we do not adopt EAI_ADDRFAMILY at
this time. this could be changed at some point if desired, but the
logic for getting all the corner cases with AI_ADDRCONFIG right is
slightly nontrivial.

fix error cases in gethostbyaddr_r

EAI_MEMORY is not possible (but would not provide errno if it were)
and EAI_FAIL does not provide errno. treat the latter as EBADMSG to
match how it's handled in gethostbyname2_r (it indicates erroneous or
failure response from the nameserver).

remove impossible error case from gethostbyname2_r

EAI_MEMORY is not possible because the resolver backend does not
allocate. if it did, it would be necessary for us to explicitly return
ENOMEM as the error, since errno is not guaranteed to reflect the
error cause except in the case of EAI_SYSTEM, so the existing code was
not correct anyway.

fix return value of gethostnbyname[2]_r on result not found

these functions are horribly underspecified, inconsistent between
historical systems, and should never have been included. however, the
signatures we have match the glibc ones, and the glibc behavior is to
treat NxDomain and NODATA results as a success condition, not an
ENOENT error.

dns: treat names rejected by res_mkquery as nonexistent rather than error

this distinction only affects search, but allows search to continue
when concatenating one of the search domains onto the requested name
produces a result that's not valid. this can happen when the
concatenation is too long, or one of the search list entries is
itself not valid.

as a consequence of this change, having "." in the search domains list
will now be ignored/skipped rather than making the lookup abort with
no results (due to producing a concatenation ending in ".."). this
behavior could be changed later if needed.

res_mkquery: error out on consecutive final dots in name

the main loop already errors out on zero-length labels within the
name, but terminates before having a chance to check for an erroneous
final zero-length label, instead producing a malformed query packet
with a '.' byte instead of the terminating zero.

rather than poke at the look logic, simply detect this condition early
and error out without doing anything.

this also fixes behavior of getaddrinfo when "." appears in the search
domain list, which produces a name ending in ".." after concatenation,
at least in the sense of no longer emitting malformed packets on the
network. however, due to other issues, the lookup will still fail.

fix thread leak on timer_create(SIGEV_THREAD) failure

After commit 5b74eed3b301e2227385f3bf26d3bb7c2d822cf8 the timer thread
doesn't check whether timer_create() actually created the timer,
proceeding to wait for a signal that might never arrive. We can't fix
this by simply checking for a negative timer_id after
pthread_barrier_wait() because we have no way to distinguish a timer
creation failure and a request to delete a timer with INT_MAX id if it
happens to arrive quickly (a variation of this bug existed before
5b74eed3b301e2227385f3bf26d3bb7c2d822cf8, where the timer would be
leaked in this case). So (ab)use cancel field of pthread_t instead.

re-enable vdso clock_gettime on arm (32-bit) with workaround

commit 4486c579cbf0d989080705f515d08cb48636ba88 disabled vdso
clock_gettime on arm due to a Linux kernel bug that was not understood
at the time, whereby the vdso function silently produced
catastrophically wrong results on some systems.

since then, the bug was tracked down to the way the arm kernel
disabled use of vdso clock_gettime on kernels where the necessary
timer was not available or was disabled. it simply patched out the
symbols, but it only did this for the legacy time32 functions, and
left the time64 function in place but non-operational. kernel commit
4405bdf3c57ec28d606bdf5325f1167505bfdcd4 (first present in 5.8)
provided the fix.

if this were a bug that impacted all users of the broken kernel
versions, we could probably ignore it and assume it had been patched
or replaced. however, it's very possible that these kernels appear in
the wild in devices running time32 userspace (glibc, musl 1.1.x, or
some other environment) where they appear to work fine, but where our
new binaries would fail catastrophically if we used the time64 vdso
function.

since the kernel has not (yet?) given us a way to probe for the
working time64 vdso function semantically, we work around the problem
by refusing to use the time64 one unless the time32 one is also
present. this will revert to not using vdso at all if the time32 one
is ever removed, but at least that's safe against wrong results and is
just a missed optimization.

process DT_RELR relocations in ldso-startup/static-pie

commit d32dadd60efb9d3b255351a3b532f8e4c3dd0db1 added DT_RELR
processing for programs and shared libraries processed by the dynamic
linker, but left them unsupported in the dynamic linker itseld and in
static pie binaries, which self-relocate via code in dlstart.c.

add the equivalent processing to this code path so that there are not
arbitrary restrictions on where the new packed relative relocation
form can be used.

fix fwprintf missing output to open_wmemstream FILEs

open_wmemstream's write method was written assuming no buffering,
since it sets the FILE up with buf_len of zero in order to avoid
issues with position/seeking. however, as a consequence of commit
bd57e2b43a5b56c00a82adbde0e33e5820c81164, a FILE being written to by
the printf core has a temporary local buffer for the duration of the
operation if it was unbuffered to begin with. since this was
disregarded by the wide memstream's write method, output produced
through this code path, particularly numeric fields, was missing from
the output wchar buffer.

copy the equivalent logic for using the buffered data from the
byte-oriented open_memstream.

dns: fail if ipv6 is disabled and resolv.conf has only v6 nameserves

if resolv.conf lists no nameservers at all, the default of 127.0.0.1
is used. however, another "no nameservers" case arises where the
system has ipv6 support disabled/configured-out and resolv.conf only
contains v6 nameservers. this caused the resolver to repeat socket
operations that will necessarily fail (sending to one or more
wrong-family addresses) while waiting for a timeout.

it would be contrary to configured intent to query 127.0.0.1 in this
case, but the current behavior is not conducive to diagnosing the
configuration problem. instead, fail immediately with EAI_SYSTEM and
errno==EAFNOSUPPORT so that the configuration error is reportable.

use kernel-provided AT_MINSIGSTKSZ for sysconf(_SC_[MIN]SIGSTKSZ)

use the legacy constant values if the kernel does not provide
AT_MINSIGSTKSZ (__getauxval will return 0 in this case) and as a
safety check if something is wrong and the provided value is less than
the legacy constant.

sysconf(_SC_SIGSTKSZ) returns SIGSTKSZ adjusted for the difference
between the legacy constant MINSIGSTKSZ and the runtime value, so that
the working space the application has on top of the minimum remains
invariant under changes to the minimum.

add sysconf keys/values for signal stack size

as a result of ISA extensions exploding register file sizes on some
archs, using a constant for minimum signal stack size no longer seems
viably future-proof. add sysconf keys allowing the kernel to provide a
machine-dependent minimum applications can query to ensure they
allocate sufficient space for stacks. the key names and indices align
with the same functionality in glibc.

see commit d5a5045382315e36588ca225889baa36ed0ed38f for previous
action on this subject.

ultimately, the macros MINSIGSTKSZ and SIGSTKSZ probably need to be
deprecated, but that is standards-amendment work outside the scope of
a single implementation.

fix fallback when ipv6 is disabled but resolv.conf has v6 nameserves

apparently this code path was never tested, as it's not usual to have
v6 nameservers listed on a system without v6 networking support. but
it was always intended to work.

when reverting to binding a v4 address, also revert the family in the
sockaddr structure and the socklen for it. otherwise bind will just
fail due to mismatched family/sockaddr size.

fix dns resolver fallback when v6 nameservers are listed by

epoll_create: fail with EINVAL if size is non-positive

This is a part of the interface contract defined in the Linux man
page (official for a Linux-specific interface) and asserted by test
cases in the Linux Test Project (LTP).

use alt signal stack when present for implementation-internal signals

a request for this behavior has been open for a long time. the
motivation is that application code, particularly under some language
runtimes designed around very-low-footprint coroutine type constructs,
may be operating with extremely small stack sizes unsuitable for
receiving signals, using a separate signal stack for any signals it
might handle.

progress on this was blocked at one point trying to determine whether
the implementation is actually entitled to clobber the alt stack, but
the phrasing "available to the implementation" in the POSIX spec for
sigaltstack seems to make it clear that the application cannot rely on
the contents of this memory to be preserved in the absence of signal
delivery (on the abstract machine, excluding implementation-internal
signals) and that we can therefore use it for delivery of signals that
"don't exist" on the abstract machine.

no change is made for SIGTIMER since it is always blocked when used,
and accepted via sigwaitinfo rather than execution of the signal
handler.

ldso: make exit condition clearer in fixup_rpath

breaking out of the switch-case when l==-1 means the conditional below
will necessarily be true (-1 >= buf_size, a size_t variable) and the
function will return 0. it is, however, somewhat unclear that that's
what's happening. simply returning there is simpler

freopen: reset stream orientation (byte/wide) and encoding rule

this is a requirement of the C language (orientation) and POSIX
(encoding rule) that was somehow overlooked.

we rely on the fact that the buffer pointers have been reset by
fflush, so that any future stdio operations on the stream will go
through the same code paths they would on a newly-opened file without
an orientation set, thereby setting the orientation as they should.

ldso: process RELR only for non-FDPIC archs

the way RELR is applied is not a meaningful operation for FDPIC (there
is no single "base" address). it seems unlikely RELR would ever be
added for FDPIC, but if it ever is, the behavior and possibly data
format will need to be different, so guard against calling the
non-FDPIC code.

ldso: support DT_RELR relative relocation format

this resolves DT_RELR relocations in non-ldso, dynamic-linked objects.

use syscall_arg_t and __scc macro for arguments to __alt_socketcall

otherwise, pointer arguments are sign-extended on x32, resulting in
EFAULT.

fix strings.h feature test macro usage due to missing features.h

fix ESRCH error handling for clock_getcpuclockid

the syscall used to probe availability of the clock fails with EINVAL
when the requested pid does not exist, but clock_getcpuclockid is
specified to use ESRCH for this purpose.

aarch64: add vfork

The generic vfork implementation uses clone(SIGCHLD) which has fork
semantics.

Implement vfork as clone(SIGCHLD|CLONE_VM|CLONE_VFORK, 0) instead which
has vfork semantics. (stack == 0 means sp is unchanged in the child.)

Some users rely on vfork semantics when memory overcommit is disabled
or when the vfork child runs code that synchronizes with the parent
process (non-conforming).

fix mishandling of errno in getaddrinfo AI_ADDRCONFIG logic

this code attempts to use the value of errno from failure of socket or
connect to infer availability of the requested address family (v4 or
v6). however, in the case where connect failed, there is an
intervening call to close between connect and the use of errno. close
is not required to preserve errno on success, and in fact the
__aio_close code, which is called whenever aio is linked and thus
always called in dynamic-linked programs, unconditionally clobbers
errno. as a result, getaddrinfo fails with EAI_SYSTEM and errno=ENOENT
rather than correctly determining that the address family was
unavailable.

this fix is based on report/patch by Jussi Nieminen, but simplified
slightly to avoid breaking the case where socket, not connect, failed.

early stage ldso: remove symbolic references via error handling function

while the error handling function should not be reached in stage 2
(assuming ldso itself was linked correctly), this was not statically
determinate from the compiler's perspective, and in theory a compiler
performing LTO could lift the TLS references (errno and other things)
out of the printf-family functions called in a stage where TLS is not
yet initialized.

instead, perform the call via a static-storage, internal-linkage
function pointer which will be set to a no-op function until the stage
where the real error handling function should be reachable.

inspired by commit 63c67053a3e42e9dff788de432f82ff07d4d772a.

in early stage ldso before __dls2b, call mprotect with __syscall

if LTO is enabled, gcc hoists the call to ___errno_location outside the
loop even though the access to errno is gated behind head != &ldso
because ___errno_location is marked __attribute__((const)). this causes
the program to crash because TLS is not yet initialized when called from
__dls2. this is also possible if LTO is not enabled; even though gcc 11
doesn't do it, it is still wrong to use errno here.

since the start and end are already aligned, we can simply call
__syscall instead of using global errno.

Fixes: e13a2b8953ef ("implement PT_GNU_RELRO support")

avoid limited space of random temp file names if clock resolution is low

this is not an issue that was actually hit, but I noticed it during
previous changes to __randname: if the resolution of tv_nsec is too
low, the space of temp file names obtainable by a thread could
plausibly be exhausted. mixing in tv_sec avoids this.

remove random filename obfuscation that leaks ASLR information

the __randname function is used by various temp file creation
interfaces as a backend to produce a name to attempt using. it does
not have to produce results that are safe against guessing, and only
aims to avoid unintentional collisions.

mixing the address of an object on the stack in a reversible manner
leaked ASLR information, potentially allowing an attacker who can
observe the temp files created and their creation timestamps to narrow
down the possible ASLR state of the process that created them. there
is no actual value in mixing these addresses in; it was just
obfuscation. so don't do it.

instead, mix the tid, just to avoid collisions if multiple
processes/threads stampede to create temp files at the same moment.
even without this measure, they should not collide unless the clock
source is very low resolution, but it's a cheap improvement.

if/when we have a guaranteed-available userspace csprng, it could be
used here instead. even though there is no need for cryptographic
entropy here, it would avoid having to reason about clock resolution
and such to determine whether the behavior is nice.

ensure distinct query id for parallel A and AAAA queries in resolver

assuming a reasonable realtime clock, res_mkquery is highly unlikely
to generate the same query id twice in a row, but it's possible with a
very low-resolution system clock or under extreme delay of forward
progress. when it happens, res_msend fails to wait for both answers,
and instead stops listening after getting two answers to the same
query (A or AAAA).

to avoid this, increment one byte of the second query's id if it
matches the first query's. don't bother checking if the second byte is
also equal, since it doesn't matter; we just need to ensure that at
least one byte is distinct.

mntent: fix potential mishandling of extremely long lines

commit 05973dc3bbc1aca9b3c8347de6879ed72147ab3b made it so that lines
longer than INT_MAX can in theory be read, but did not use a suitable
type for the positions determined by sscanf. we could change to using
size_t, but since the signature for getmntent_r does not admit lines
longer than INT_MAX, it does not make sense to support them in the
legacy thread-unsafe form either -- the principle here is that there
should not be an incentive to use the unsafe function to get added
functionality.

mntent: fix parsing lines with optional fields

According to fstab(5), the last two fields are optional, but this
wasn't accepted. After this change, only the first field is required,
which matches glibc's behaviour.

Using sscanf as before, it would have been impossible to differentiate
between 0 fields and 4 fields, because sscanf would have returned 0 in
both cases due to the use of assignment suppression and %n for the
string fields (which is important to avoid copying any strings). So
instead, before calling sscanf, initialize every string to the empty
string, and then we can check which strings are empty afterwards to
know how many fields were matched.

fix constraint violation in qsort wrapper around qsort_r

function pointer types do not implicitly convert to void *. a cast is
required here.

use __fstat instead of __fstatat with AT_EMPTY_PATH in __map_file

this isolates knowledge of the nonstandard AT_EMPTY_PATH extension to
one place and returns __map_file to its prior simplicity.

provide an internal namespace-safe __fstat

this avoids the need for implementation-internal callers to depend on
the nonstandard AT_EMPTY_PATH extension to use __fstatat and isolates
knowledge of that extension to the implementation of __fstat.

only use fstatat and others legacy stat syscalls if they exist

riscv32 and future architectures only provide statx.

drop direct use of stat syscalls in internal __map_file

this function is used to implement some baseline ISO C interfaces, so
it cannot call any of the stat functions by their public names. use
the namespace-safe __fstatat instead.

provide an internal namespace-safe __fstatat

this makes it so we can drop direct stat syscall use in interfaces
that can't use the POSIX namespace.

drop direct use of stat syscalls in fchmodat

instead, use the fstatat/stat functions, so that the logic for which
syscalls are present and usable is all in fstatat.

this results in a slight increase in cost for old kernels on 32-bit
archs: now statx will be attempted first rather than just using the
legacy time32 syscalls, despite us not caring about timestamps.
however, it's not even clear that the legacy syscalls *should* succeed
if the timestamps are out of range; arguably they should fail with
EOVERFLOW. as such, paying a small cost here on old kernels seems
well-motivated.

with this change, fchmodat itself is no longer blocking ports to new
archs that lack the legacy syscalls.

drop use of stat operation in temporary file name generation

this change serves two purposes:

1. it eliminates one of the few remaining uses of the kernel stat
structure which will not be present in future archs, avoiding the need
for growing ifdef logic here.

2. it potentially makes the operations less expensive when the
candidate exists as a non-symlink by avoiding the need to read the
inode (assuming the directory tables suffice to distinguish symlinks).

this uses the idiom I discovered while rewriting realpath for commit
29ff7599a448232f2527841c2362643d246cee36 of being able to use the
readlink operation as an inexpensive probe for file existence that
doesn't following symlinks.

only fallback to gettimeofday/settimeofday syscalls if they exist

riscv32 and future architectures only provide the clock_ functions.

only use getrlimit/setrlimit syscalls if they exist

riscv32 and future architectures only provide prlimit64.

don't remap internal-use syscall macros to nonexistent time32 syscalls

riscv32 and future architectures lack the _time32 variants entirely,
so don't try to use their numbers. instead, reflect that they're not
present.

remove ARMSUBARCH relic from configure

commit 0f814a4e57e80d2512934820b878211e9d71c93e removed its use.

add missing POSIX confstr keys for pthread CFLAGS/LDFLAGS

_CS_POSIX_V7_THREADS_CFLAGS and _CS_POSIX_V7_THREADS_LDFLAGS have been
missing for a long time, which is a conformance defect. we were
waiting on glibc to add them or at least agree on the numeric values
they will have so as to keep the numbering aligned. it looks like they
will be added to glibc with these numbers, and in any case, this list
does not have any significant churn that would result in the numbers
getting taken.

fix incorrect parameter name in internal netlink.h RTA_OK macro

the wrong name works only by accident.

release 1.2.3

accept null pointer as message argument to gettext functions

the change to support passing null was rejected in the past on the
grounds that GNU gettext documented it as undefined, on an assumption
that only glibc accepted it and that the standalone GNU gettext did
not. but it turned out that both explicitly accept it.

in light of this, since some software assumes null can be passed
safely, allow it.

fix invalid free of duplocale object when malloc has been replaced

newlocale and freelocale use __libc_malloc and __libc_free, but
duplocale used malloc. If malloc was replaced, this resulted in
invalid free using the wrong allocator when passing the result of
duplocale to freelocale.

Instead, use libc-internal malloc for duplocale.

This bug was introduced by commit
1e4204d522670a1d8b8ab85f1cfefa960547e8af.

fix __WORDSIZE on x32 sys/user.h

sys/reg.h already had it right as 32, to which it was explicitly
changed when commit 664cd341921007cea52c8891f27ce35927dca378 derived
x32 from x86_64. but the copy exposed in sys/user.h was missed.

sys/ptrace.h: add PTRACE_GET_RSEQ_CONFIGURATION from linux v5.13

see

linux commit 90f093fa8ea48e5d991332cee160b761423d55c1
rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

the struct type got __ prefix to follow existing practice.

sys/prctl.h: add PR_PAC_{SET,GET}_ENABLED_KEYS from linux v5.13

see

linux commit 201698626fbca1cf1a3b686ba14cf2a056500716
arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)

elf.h: add NT_ARM_PAC_ENABLED_KEYS from linux v5.13

see

linux commit 201698626fbca1cf1a3b686ba14cf2a056500716
arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)

netinet/in.h: add INADDR_DUMMY from linux v5.13

see

linux commit 321827477360934dc040e9d3c626bf1de6c3ab3c
icmp: don't send out ICMP messages with a source address of 0.0.0.0

"RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails."

bits/syscall.h: add landlock syscalls from linux v5.13

see

  linux commit a49f4f81cb48925e8d7cbd9e59068f516e984144
  arch: Wire up Landlock syscalls

  linuxcommit 17ae69aba89dbfa2139b7f8024b757ab3cc42f59
  Merge tag 'landlock_v34' of ... jmorris/linux-security

Landlock provides for unprivileged application sandboxing. The goal of
Landlock is to enable to restrict ambient rights (e.g. global filesystem
access) for a set of processes. Landlock is inspired by seccomp-bpf but
instead of filtering syscalls and their raw arguments, a Landlock rule
can restrict the use of kernel objects like file hierarchies, according
to the kernel semantic.

netinet/tcp.h: add tcp_zerocopy_receive fields from linux v5.12

see

  linux commit 7eeba1706eba6def15f6cb2fc7b3c3b9a2651edc
  tcp: Add receive timestamp support for receive zerocopy.

  linux commit 3c5a2fd042d0bfac71a2dfb99515723d318df47b
  tcp: Sanitize CMSG flags and reserved args in tcp_zerocopy_receive.

netinet/tcp.h: add TCP_NLA_* values up to linux v5.12

TCP_NLA_EDT was new in v5.9, see

  linux commit 48040793fa6003d211f021c6ad273477bcd90d91
  tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS

TCP_NLA_TTL is new in v5.12, see

  linux commit e7ed11ee945438b737e2ae2370e35591e16ec371
  tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS

s390x: add ptrace requests from linux v5.12

PTRACE_OLDSETOPTIONS is old, but it was missing, PTRACE_SYSEMU and
PTRACE_SYSEMU_SINGLESTEP are new, see

linux commit 56e62a73702836017564eaacd5212e4d0fa1c01d
s390: convert to generic entry

bits/syscall.h: add mount_setattr from linux v5.12

new syscall to change the properties of a mount or a mount tree using
file descriptors which the new mount api is based on, see

linux commit 2a1867219c7b27f928e2545782b86daaf9ad50bd
fs: add mount_setattr()

signal.h: add new sa_flags from linux v5.11

see

  linux commit a54f0dfda754c5cecc89a14dab68a3edc1e497b5
  signal: define the SA_UNSUPPORTED bit in sa_flags

  linux commit 6ac05e832a9e96f9b1c42a8917cdd317d7b6c8fa
  signal: define the SA_EXPOSE_TAGBITS bit in sa_flags

Note: SA_ is in the posix reserved namespace so these linux specific flags
can be exposed when compiling for posix.

signal.h: add SYS_USER_DISPATCH si_code value from linux v5.11

see

linux commit 1d7637d89cfce54a4f4a41c2325288c2f47470e8
signal: Expose SYS_USER_DISPATCH si_code type

signal.h: add si_code values for SIGSYS

unlike other si_code defines, SYS_ is not in the posix reserved namespace
which is likely the reason why SYS_SECCOMP was previously missing (was new
in linux v3.5).

netinet/tcp.h: add tcp zerocopy related changes from linux v5.11

see

  linux commit 18fb76ed53865c1b5d5f0157b1b825704590beb5
  net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.

  linux commit 94ab9eb9b234ddf23af04a4bc7e8db68e67b8778
  net-zerocopy: Defer vm zap unless actually needed.

netinet/if_ether.h: add ETH_P_CFM from linux v5.11

see

linux commit fbaedb4129838252570410c65abb2036b5505cbd
bridge: uapi: cfm: Added EtherType used by the CFM protocol.

sys/socket.h: add new SO_ socket options from linux v5.11

see

  linux commit 7fd3253a7de6a317a0683f83739479fb880bffc8
  net: Introduce preferred busy-polling

  linux commit 7c951cafc0cb2e575f1d58677b95ac387ac0a5bd
  net: Add SO_BUSY_POLL_BUDGET socket option

sys/prctl.h: add PR_SET_SYSCALL_USER_DISPATCH from linux v5.11

see

  linux commit 1446e1df9eb183fdf81c3f0715402f1d7595d4cb
  kernel: Implement selective syscall userspace redirection

  linux commit 36a6c843fd0d8e02506681577e96dabd203dd8e8
  entry: Use different define for selector variable in SUD

redirect syscalls to a userspace handler via SIGSYS, except for a specific
range of code. can be toggled via a memory write to a selector variable.
mainly for wine.

bits/syscall.h: add epoll_pwait2 from linux v5.11

see

  linux commit b0a0c2615f6f199a656ed8549d7dce625d77aa77
  epoll: wire up syscall epoll_pwait2

  linux commit 58169a52ebc9a733aeb5bea857bc5daa71a301bb
  epoll: add syscall epoll_pwait2

epoll_wait with struct timespec timeout instead of int. no time32 variant.

nice: return EPERM instead of EACCES

To comply with POSIX, change errno from EACCES to EPERM
when the caller did not have the required privilege.

protect stack canary from leak via read-as-string by zeroing second byte

This reduces entropy of the canary from 64-bit to 56-bit in exchange
for mitigating non-terminated C string overflows by setting the second
byte of the canary to nul, so that off-by-one write overflow with a
nul byte can still be detected.

Idea from GrapheneOS bionic commit 7024d880b51f03a796ff8832f1298f2f1531fd7b

math: avoid runtime conversions of floating-point constants

gcc-12 with -frounding-mode will do inexact constant conversions at
runtime according to the runtime rounding mode.

in the math library we want constants to be rounding mode independent
so this patch fixes cases where new runtime conversions happen with
gcc-12.

fortunately this only affects two minor cases, the fix uses global
initializers where rounding mode does not apply.

after the patch the same amount of conversions happen with gcc-12 as
with gcc-11.

fix spurious failures by fgetws when buffer ends with partial character

commit a90d9da1d1b14d81c4f93e1a6d1a686c3312e4ba made fgetws look for
changes to errno by fgetwc to detect encoding errors, since ISO C did
not allow the implementation to set the stream's error flag in this
case, and the fgetwc interface did not admit any other way to detect
the error. however, the possibility of fgetwc setting errno to EILSEQ
in the success path was overlooked, and in fact this can happen if the
buffer ends with a partial character, causing mbtowc to be called with
only part of the character available.

since that change was made, the C standard was amended to specify that
fgetwc set the stream error flag on encoding errors, and commit
511d70738bce11a67219d0132ce725c323d00e4e made it do so. thus, there is
no longer any need for fgetws to poke at errno to handle encoding
errors.

this commit reverts commit a90d9da1d1b14d81c4f93e1a6d1a686c3312e4ba
and thereby fixes the problem.

add missing strerror text for key management

fix out-of-bound read processing time zone data with distant-past dates

this bug goes back to commit 1cc81f5cb0df2b66a795ff0c26d7bbc4d16e13c6
where zoneinfo file support was first added. in scan_trans, which
searches for the appropriate local time/dst rule in effect at a given
time, times prior to the second transition time caused the -1 slot of
the index to be read to determine the previous rule in effect. this
memory was always valid (part of another zoneinfo table in the mapped
file) but the byte value read was then used to index another table,
possibly going outside the bounds of the mmap. most of the time, the
result was limited to misinterpretation of the rule in effect at that
time (pre-1900s), but it could produce a crash if adjacent memory was
not readable.

the root cause of the problem, however, was that the logic for this
code path was all wrong. as documented in the comment, times before
the first transition should be treated as using the lowest-numbered
non-dst rule, or rule 0 if no non-dst rules exist. if the argument is
in units of local time, however, the rule prior to the first
transition is needed to determine if it falls before or after it, and
that's where the -1 index was wrongly used.

instead, use the documented logic to find out what rule would be in
effect before the first transition, and apply it as the offset if the
argument was given in local time.

the new code has not been heavily tested, but no longer performs
potentially out-of-bounds accesses, and successfully handles the 1883
transition from local mean time to central standard time in the test
case the error was reported for.

fix potentially wrong-sign zero in cproj functions at infinity

these are specified to use the sign of the imaginary part of the input
as the sign of zero in the result, but wrongly copied the sign of the
real part.

make fseek detect and produce an error for invalid whence arguments

this is a POSIX requirement. we previously relied on the underlying fd
(or other backend) seek operation to produce the error, but since
linux lseek now supports other seek modes (SEEK_DATA and SEEK_HOLE)
which do not interact well with stdio buffering, this is insufficient.
instead, explicitly check whence before performing any operations.

add SEEK_DATA and SEEK_HOLE to unistd.h

these are linux specific constants. glibc exposes them behind
_GNU_SOURCE, but, since SEEK_* is reserved for the implementation, we
can simply define them. furthermore, since they can't be used with
fseek() and other functions that deal with FILE, we don't add them to
stdio.h.

fix failure to use add-cfi scripts on asm when building out-of-tree

use $srcdir in configure test for add-cfi script.

fix wcwidth of hangul combining (vowel/final) letters

these characters combine onto a base character (initial) and therefore
need to have width 0. the original binary-search implementation of
wcwidth handled them correctly, but a regression was introduced in
commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd by generating the new
tables from unicode without noticing that the classification logic in
use (unicode character category Mn/Me/Cf) was insufficient to catch
these characters.

fix mismatched signatures for strtod_l family

strtod_l, strtof_l, and strtold_l originally existed only as
glibc-ABI-compat symbols. as noted in the commit which added them,
17a60f9d327c6f8b5707a06f9497d846e75c01f2, making them aliases for the
non-_l functions was a hack and not appropriate if they ever became
public API.

unfortunately, commit 35eb1a1a9b97577e113240cd65bf9fc44b8df030 did
make them public without undoing the hack. fix that now by moving the
the _l functions to their own file as wrappers that just throw away
the locale_t argument.

define NULL as nullptr when used in C++11 or later

This should be safer for casting and more compatible with existing code
bases that wrongly assume it must be defined as a pointer.

fix hwcap access in powerpc-sf setjmp/longjmp

commit 7be59733d71ada3a32a98622507399253f1d5e48 introduced the
hwcap-based branches to support the SPE FPU, but wrongly coded them as
bitwise tests on the computed address of __hwcap, not a value loaded
from that address. replace the add with indexed load to fix it.

fix struct layout mismatch in sound ioctl time32 fallback conversion

the snd_pcm_mmap_control struct used with SNDRV_PCM_IOCTL_SYNC_PTR was
mistakenly defined in the kernel uapi with "before u32" padding both
before and after the first u32 member. our conversion between the
modern struct and the legacy time32 struct was written without
awareness of that mistake, and assumed the time64 version of the
struct was the intended form with padding to match the layout on
64-bit archs. as a result, the struct was not converted correctly when
running on old kernels, with audio glitches as the likely result.

this was discovered thanks to a related bug in the kernel, whereby
32-bit userspace running on a 64-bit kernel also suffered from the
types mismatching. the mistaken layout is now the ABI and can't be
changed -- or at least making a new ioctl to change it would just
result in a worse situation.

our conversion here is changed to treat the snd_pcm_mmap_control
substruct as two separate substructs at locations dependent on
endianness (since the displacement depends on endianness), using the
existing conversion framework.

add qsort_r and make qsort a wrapper around it

we make qsort a wrapper by providing a wrapper_cmp function that uses
the extra argument as a function pointer. should be optimized to a tail
call on most architectures, as long as it's built with
-fomit-frame-pointer, so the performance impact should be minimal.

to keep the git history clean, for now qsort_r is implemented in qsort.c
and qsort is implemented in qsort_nr.c. qsort.c also received a few
trivial cleanups, including replacing (*cmp)() calls with cmp().
qsort_nr.c contains only wrapper_cmp and qsort as a qsort_r wrapper
itself.

add SPE FPU support to powerpc-sf

When the soft-float ABI for PowerPC was added in commit
5a92dd95c77cee81755f1a441ae0b71e3ae2bcdb, with Freescale cpus using
the alternative SPE FPU as the main use case, it was noted that we
could probably support hard float on them, but that it would involve
determining some difficult ABI constraints. This commit is the
completion of that work.

The Power-Arch-32 ABI supplement defines the ABI profiles, and indeed
ATR-SPE is built on ATR-SOFT-FLOAT. But setjmp/longjmp compatibility
are problematic for the same reason they're problematic on ARM, where
optional float-related parts of the register file are "call-saved if
present". This requires testing __hwcap, which is now done.

In keeping with the existing powerpc-sf subarch definition, which did
not have fenv, the fenv macros are not defined for SPE and the SPEFSCR
control register is left (and assumed to start in) the default mode.

fix undefined behavior in getdelim via null pointer arithmetic and memcpy

both passing a null pointer to memcpy with length 0, and adding 0 to a
null pointer, are undefined. in some sense this is 'benign' UB, but
having it precludes use of tooling that strictly traps on UB. there
may be better ways to fix it, but conditioning the operations which
are intended to be no-ops in the k==0 case on k being nonzero is a
simple and safe solution.

fix excessively slow TLS performance on some mips models

commit 6d99ad91e869aab35a4d76d34c3c9eaf29482bad introduced this
regression as part of a larger change, based on an incorrect
assumption that rdhwr being part of the mips r2 ISA level meant that
the TLS register, known in the mips documentation as UserLocal, was
unconditionally present on chips providing this ISA level and would
not need trap-and-emulate. this turns out to be false.

based on research by Stanislav Kljuhhin and Abilio Marques, who
reported the problem as a performance regression on certain routers
using OpenWRT vs older uclibc-based versions, it turns out the mips
manuals document the UserLocal register as a feature that might or
might not be implemented or enabled, reflected by a cpu capability bit
in the CONFIG3 register, and that Linux checks for this and has to
explicitly enable it on models that have it.

thus, it's indeed possible that r2+ chips can lack the feature,
bringing us back to the situation where Linux only has a fast
trap-and-emulate path for the case where the destination register is
$3. so, always read the thread pointer through $3. this may incur a
gratuitous move to the desired final register on chips where it's not
needed, but it really doesn't matter.

fix error checking in pthread_getname_np

len is unsigned and can never be smaller than 0. though unlikely, an
error in read() would have lead to an out of bounds write to name.

Reported-by: Michael Forney <mforney@mforney.org>

fix libc-internal signal blocking on mips archs

due to historical reasons, the mips signal set has 128 bits rather
than 64 like on every other arch. this was special-cased correctly, at
least for 32-bit mips, at one time, but was inadvertently broken in
commit 7c440977db9444d7e6b1c3dcb1fdf4ee49ca4158, and seems never to
have been right on mips64/n32.

as consequenct of this bug, applications making use of high realtime
signal numbers on mips may have been able to execute application code
in contexts where doing so was unsafe.

fix broken struct shmid_ds on powerpc (32-bit)

the kernel structure has padding of the shm_segsz member up to 64
bits, as well as 2 unused longs at the end. somehow that was
overlooked when the powerpc port was added, and it has been broken
ever since; applications compiled with the wrong definition do not
correctly see the shm_segsz, shm_cpid, and shm_lpid members.

fixing the definition just by adding the missing padding would break
the ABI size of the structure as well as the position of the time64
shm_atime and shm_dtime members we added at the end. instead, just
move one of the unused padding members from the original end (before
time64) of the structure to the position of the missing padding. this
preserves size and preserves correct behavior of any compiled code
that was already working. programs affected by the wrong definition
need to be recompiled with the correct one.

math: fix fmaf not to depend on FE_TOWARDZERO

fix TZ parsing logic for identifying POSIX-form strings

previously, the contents of the TZ variable were considered a
candidate for a file/path name only if they began with a colon or
contained a slash before any comma. the latter was very sloppy logic
to avoid treating any valid POSIX TZ string as a file name, but it
also triggered on values that are not valid POSIX TZ strings,
including 3-letter timezone names without any offset.

instead, only treat the TZ variable as POSIX form if it begins with a
nonzero standard time name followed by +, -, or a digit.

also, special case GMT and UTC to always be treated as POSIX form
(with implicit zero offset) so that a stray file by the same name
cannot break software that depends on setting TZ=GMT or TZ=UTC.