nsz Git - musl/log

sys/random.h: add GRND_INSECURE from linux v5.6

added in

linux commit 75551dbf112c992bc6c99a972990b3f272247e23
random: add GRND_INSECURE to return best-effort non-cryptographic bytes

sys/prctl.h: add PR_{SET,GET}_IO_FLUSHER from linux v5.6

needed for storage drivers with userspace component that may
run in the IO path, see

linux commit 8d19f1c8e1937baf74e1962aae9f90fa3aeab463
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim

netinet/udp.h: add TCP_ENCAP_ESPINTCP from linux v5.6

The use of TCP_ in udp.h is not known, fortunately udp.h is not
specified by posix so there are no strict namespace rules, added in

linux commit e27cca96cd68fa2c6814c90f9a1cfd36bb68c593
xfrm: add espintcp (RFC 8229)

netinet/tcp.h: update for linux v5.6

TCP_NLA_TIMEOUT_REHASH queries timeout-triggered rehash attempts,
tcpm_ifindex limits the scope of TCP_MD5SIG* sockopt to a device.

see

  linux commit 32efcc06d2a15fa87585614d12d6c2308cc2d3f3
  tcp: export count for rehash attempts

  linux commit 6b102db50cdde3ba2f78631ed21222edf3a5fb51
  net: Add device index to tcp_md5sig

netinet/in.h: add IPPROTO_ macros from linux v5.6

add IPPROTO_ETHERNET and IPPROTO_MPTCP, see

  linux commit 2677625387056136e256c743e3285b4fe3da87bb
  seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number

  linux commit faf391c3826cd29feae02078ca2022d2f912f7cc
  tcp: Define IPPROTO_MPTCP

add pidfd_getfd and openat2 syscall numbers from linux v5.6

also added clone3 on sh and m68k, on sh it's still missing (not
yet wired up), but reserved so safe to add.

see

  linux commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179
  open: introduce openat2(2) syscall

  linux commit 9a2cef09c801de54feecd912303ace5c27237f12
  arch: wire up pidfd_getfd syscall

  linux commit 8649c322f75c96e7ced2fec201e123b2b073bf09
  pid: Implement pidfd_getfd syscall

  linux commit e8bb2a2a1d51511e6b3f7e08125d52ec73c11139
  m68k: Wire up clone3() syscall

netinet/tcp.h: update tcp_info for linux v5.5

see

linux commit 480274787d7e3458bc5a7cfbbbe07033984ad711
tcp: add TCP_INFO status for failed client TFO

use generic bits/fcntl.h for x86_64 and riscv64

these were only using a custom version because they needed the
"non-64" variants of the file locking command macros.

make generic bits/fcntl.h shareable with 64-bit archs

the fcntl file locking command macro values in the existing generic
bits/fcntl.h were the "64" variants, requiring 64-bit archs that use
the "plain" variants to have their own bits/fcntl.h, even if they
otherwise use the common definitions for everything.

since commit 7cc79d10afd43811a486fd5e9fcdf8e45ac599e0 exposed
__LONG_MAX to all bits headers, we can now make the generic one common
between 32- and 64-bit archs.

fix missing O_LARGEFILE values on x86_64, x32, and mips64

prior to commit 685e40bb09f5f24a2af54ea09c97328808f76990, x86_64 was
correctly passing O_LARGEFILE to SYS_open; it was removed (defined to
0 in the public header, and changed to use the public definition) as
part of that change, probably out of a mistaken belief that it's not
needed.

however, on a mixed system with 32-bit and 64-bit binaries, it's
important that all files be opened with O_LARGEFILE, even if the
opening process is 64-bit, in case a descriptor is passed to a 32-bit
process. otherwise, attempts to access past 2GB in the 32-bit process
could produce EOVERFLOW.

most 64-bit archs added later got this right alread, except for
mips64. x32 was also affected. there are now fixed.

fix missing newline in herror output

fix i386 __set_thread_area fallback

this code is only needed for pre-2.6 kernels, which are not actually
supported anyway, and was never tested. the fallback path using
SYS_modify_ldt failed to clear the upper bits of %eax (all ones due to
SYS_set_thread_area's return value being an error) before modifying
%al to attempt a new syscall.

restore h_errno ABI compatibility with ancient binaries

prior to commit e68c51ac46a9f273927aef8dcebc89912ab19ece, h_errno was
actually an external data object not a macro. bring back the symbol,
and use it as the storage for the main thread's h_errno.

technically this still doesn't provide full compatibility if the
application was multithreaded, but at the time there were no res_*
functions (and they did not set h_errno anyway), so any use of h_errno
would have been via thread-unsafe functions. thus a solution that just
fixes single-threaded applications seems acceptable.

clean up overinclusion in files using TIOCGWINSZ

now that struct winsize is available via sys/ioctl.h once again,
including termios.h is not needed.

fix regression with applications that expect struct winsize in ioctl.h

putting the (simple) definition in alltypes.h seems like the best
solution here. making sys/ioctl.h implicitly include termios.h is
probably excess namespace pollution.

configure: enable warnings by default

now that -Wall is not used and we control which warnings are enabled,
it makes sense to have the wanted ones on by default. hopefully this
will also discourage manually adding -Wall to CFLAGS and making
incorrect changes or bug reports based on the compiler's output.

configure: use additive warnings instead of subtracting from -Wall

-Wall varies too much by compiler and version. rather than trying to
track all the unwanted style warnings that need to be subtracted, just
enable wanted warnings.

also, move -Wno-pointer-to-int-cast outside --enable-warnings
conditional so that it always applies, since it's turning off a
nuisance warning that's on-by-default with most compilers.

configure: add further -Werror=... options to detected CFLAGS

these four warning options were overlooked previously, likely because
they're not part of GCC's -Wall. they all detect constraint violations
(invalid C at the source level) and should always be on in -Werror
form.

remove redundant pthread struct members repeated for layout purposes

dtv_copy, canary2, and canary_at_end existed solely to match multiple
ABI and asm-accessed layouts simultaneously. now that pthread_arch.h
can be included before struct __pthread is defined, the struct layout
can depend on macros defined by pthread_arch.h.

deduplicate __pthread_self thread pointer adjustment out of each arch

the adjustment made is entirely a function of TLS_ABOVE_TP and
TP_OFFSET. aside from avoiding repetition of the TP_OFFSET value and
arithmetic, this change makes pthread_arch.h independent of the
definition of struct __pthread from pthread_impl.h. this in turn will
allow inclusion of pthread_arch.h to be moved to the top of
pthread_impl.h so that it can influence the definition of the
structure.

previously, arch files were very inconsistent about the type used for
the thread pointer. this change unifies the new __get_tp interface to
always use uintptr_t, which is the most correct when performing
arithmetic that may involve addresses outside the actual pointed-to
object (due to TP_OFFSET).

deduplicate TP_ADJ logic out of each arch, replace with TP_OFFSET

the only part of TP_ADJ that was not uniquely determined by
TLS_ABOVE_TP was the 0x7000 adjustment used mainly on mips and powerpc
variants.

report res_query failures, including nxdomain/nodata, via h_errno

while it's not clearly documented anywhere, this is the historical
behavior which some applications expect. applications which need to
see the response packet in these cases, for example to distinguish
between nonexistence in a secure vs insecure zone, must already use
res_mkquery with res_send in order to be portable, since most if not
all other implementations of res_query don't provide it.

make h_errno thread-local

the framework to do this always existed but it was deemed unnecessary
because the only [ex-]standard functions using h_errno were not
thread-safe anyway. however, some of the nonstandard res_* functions
are also supposed to set h_errno to indicate the cause of error, and
were unable to do so because it was not thread-safe. this change is a
prerequisite for fixing them.

add tcgetwinsize and tcsetwinsize functions, move struct winsize

these have been adopted for future issue of POSIX as the outcome of
Austin Group issue 1151, and are simply functions performing the roles
of the historical ioctls. since struct winsize is being standardized
along with them, its definition is moved to the appropriate header.

there is some chance this will break source files that expect struct
winsize to be defined by sys/ioctl.h without including termios.h. if
this happens, further changes will be needed to have sys/ioctl.h
expose it too.

fix MUSL_LOCPATH search

all path elements but the last had the final byte truncated.

add gettid function

this is a prerequisite for addition of other interfaces that use
kernel tids, including futex and SIGEV_THREAD_ID.

there is some ambiguity as to whether the semantic return type should
be int or pid_t. either way, futex API imposes a contract that the
values fit in int (excluding some upper reserved bits). glibc used
pid_t, so in the interest of not having gratuitous mismatch (the
underlying types are the same anyway), pid_t is used here as well.

while conceptually this is a syscall, the copy stored in the thread
structure is always valid in all contexts where it's valid to call
libc functions, so it's used to avoid the syscall.

aarch64: fix setjmp return value

longjmp should set the return value of setjmp, but 64bit
registers were used for the 0 check while the type is int.

use the code that gcc generates for return val ? val : 1;

setjmp: optimize longjmp prologues

Use a branchless sequence that is one byte shorter on 64-bit, same size
on 32-bit. Thanks to Pete Cawley for suggesting this variant.

setjmp: optimize x86 longjmp epilogues

setjmp: avoid useless REX-prefix on xor %eax, %eax

setjmp: fix x86-64 longjmp argument adjustment

longjmp 'val' argument is an int, but the assembly is referencing 64-bit
registers as if the argument was a long, or the caller was responsible
for extending the argument. Though the psABI is not clear on this, the
interpretation in GCC is that high bits may be arbitrary and the callee
is responsible for sign/zero-extending the value as needed (likewise for
return values: callers must anticipate that high bits may be garbage).

Therefore testing %rax is a functional bug: setjmp would wrongly return
zero if longjmp was called with val==0, but high bits of %rsi happened
to be non-zero.

Rewrite the prologue to refer to 32-bit registers. In passing, change
'test' to use %rsi, as there's no advantage to using %rax and the new
form is cheaper on processors that do not perform move elimination.

prefer new socket syscalls, fallback to SYS_socketcall only if needed

a number of users performing seccomp filtering have requested use of
the new individual syscall numbers for socket syscalls, rather than
the legacy multiplexed socketcall, since the latter has the arguments
all in memory where they can't participate in filter decisions.

previously, some archs used the multiplexed socketcall if it was
historically all that was available, while other archs used the
separate syscalls. the intent was that the latter set only include
archs that have "always" had separate socket syscalls, at least going
back to linux 2.6.0. however, at least powerpc, powerpc64, and sh were
wrongly included in this set, and thus socket operations completely
failed on old kernels for these archs.

with the changes made here, the separate syscalls are always
preferred, but fallback code is compiled for archs that also define
SYS_socketcall. two such archs, mips (plain o32) and microblaze,
define SYS_socketcall despite never having needed it, so it's now
undefined by their versions of syscall_arch.h to prevent inclusion of
useless fallback code.

some archs, where the separate syscalls were only added after the
addition of SYS_accept4, lack SYS_accept. because socket calls are
always made with zeros in the unused argument positions, it suffices
to just use SYS_accept4 to provide a definition of SYS_accept, and
this is done to make happy the macro machinery that concatenates the
socket call name onto __SC_ and SYS_.

math: new software sqrtl

same approach as in sqrt.

sqrtl was broken on aarch64, riscv64 and s390x targets because
of missing quad precision support and on m68k-sf because of
missing ld80 sqrtl.

this implementation is written for quad precision and then
edited to make it work for both m68k and x86 style ld80 formats
too, but it is not expected to be optimal for them.

note: using fp instructions for the initial estimate when such
instructions are available (e.g. double prec sqrt or rsqrt) is
avoided because of fenv correctness.

math: add __math_invalidl

for targets where long double is different from double.

math: new software sqrtf

same method as in sqrt, this was tested on all inputs against
an sqrtf instruction. (the only difference found was that x86
sqrtf does not signal the x86 specific input-denormal exception
on negative subnormal inputs while the software sqrtf does,
this is fine as it was designed for ieee754 exceptions only.)

there is known faster method:
"Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation"
that computes sqrtf directly via pipelined polynomial evaluation
which allows more parallelism, but the design does not generalize
easily to higher precisions.

math: new software sqrt

approximate 1/sqrt(x) and sqrt(x) with goldschmidt iterations.
this is known to be a fast method for computing sqrt, but it is
tricky to get right, so added detailed comments.

use a lookup table for the initial estimate, this adds 256bytes
rodata but it can be shared between sqrt, sqrtf and sqrtl.
this saves one iteration compared to a linear estimate.

this is for soft float targets, but it supports fenv by using a
floating-point operation to get the final result. the result
is correctly rounded in all rounding modes. if fenv support is
turned off then the nearest rounded result is computed and
inexact exception is not signaled.

assumes fast 32bit integer arithmetics and 32 to 64bit mul.

in hosts file lookups, honor first canonical name regardless of family

prior to this change, the canonical name came from the first hosts
file line matching the requested family, so the canonical name for a
given hostname could differ depending on whether it was requested with
AF_UNSPEC or a particular family (AF_INET or AF_INET6). now, the
canonical name is deterministically the first one to appear with the
requested name as an alias.

in hosts file lookups, use only first match for canonical name

the existing code clobbered the canonical name already discovered
every time another matching line was found, which will necessarily be
the case when a hostname has both IPv4 and v6 definitions.

patch by Wolf.

release 1.2.1

add m68k sqrtl using native instruction

this is actually a functional fix at present, since the C sqrtl does
not support ld80 and just wraps double sqrt. once that's fixed it will
just be an optimization.

getentropy: fix UB if len==0

if len==0, an uninitalized variable would be returned

fix async-cancel-safety of pthread_cancel

the previous commit addressing async-signal-safety issues around
pthread_kill did not fully fix pthread_cancel, which is also required
(albeit rather irrationally) to be async-cancel-safe.

without blocking implementation-internal signals, it's possible that,
when async cancellation is enabled, a cancel signal sent by another
thread interrupts pthread_kill while the killlock for a targeted
thread is held. as a result, the calling thread will terminate due to
cancellation without ever unlocking the targeted thread's killlock,
and thus the targeted thread will be unable to exit.

make thread killlock async-signal-safe for pthread_kill

pthread_kill is required to be AS-safe. that requirement can't be met
if the target thread's killlock can be taken in contexts where
application-installed signal handlers can run.

block signals around use of this lock in all pthread_* functions which
target a tid, and reorder blocking/unblocking of signals in
pthread_exit so that they're blocked whenever the killlock is held.

fix C implementation of a_clz_32

this broke mallocng size_to_class on archs without a native
implementation of a_clz_32. the incorrect logic seems to have been
something i derived from a related but distinct log2-type operation.
with the change made here, it passes an exhaustive test.

as this function is new and presently only used by mallocng, no other
functionality was affected.

vfscanf: fix possible invalid free due to uninitialized variable use

vfscanf() may use the variable 'alloc' uninitialized when taking the
branch introduced by commit b287cd745c2243f8e5114331763a5a9813b5f6ee.
Spotted by clang.

make mallocng the default malloc implementation

add malloc implementation selection to configure

the intent here is to keep oldmalloc as an option, at least for the
short term, in case any users are negatively impacted in some way by
mallocng and need to fallback until their issues are resolved.

import mallocng

the files added come from the mallocng development repo, commit
2ed58817cca5bc055974e5a0e43c280d106e696b. they comprise a new malloc
implementation, developed over the past 9 months, to replace the old
allocator (since dubbed "oldmalloc") with one that retains low code
size and minimal baseline memory overhead while avoiding fundamental
flaws in oldmalloc and making significant enhancements. these include
highly controlled fragmentation, fine-grained ability to return memory
to the system when freed, and strong hardening against dynamic memory
usage errors by the caller.

internally, mallocng derives most of these properties from tightly
structuring memory, creating space for allocations as uniform-sized
slots within individually mmapped (and individually freeable)
allocation groups. smaller-than-pagesize groups are created within
slots of larger ones. minimal group size is very small, and larger
sizes (in geometric progression) only come into play when usage is
high.

all data necessary for maintaining consistency of the allocator state
is tracked in out-of-band metadata, reachable via a validated path
from minimal in-band metadata. all pointers passed (to free, etc.) are
validated before any stores to memory take place. early reuse of freed
slots is avoided via approximate LRU order of freed slots. further
hardening against use-after-free and double-free, even in the case
where the freed slot has been reused, is made by cycling the offset
within the slot at which the allocation is placed; this is possible
whenever the slot size is larger than the requested allocation.

add glue code for mallocng merge

this includes both an implementation of reclaimed-gap donation from
ldso and a version of mallocng's glue.h with namespace-safe linkage to
underlying syscalls, integration with AT_RANDOM initialization, and
internal locking that's optimized out when the process is
single-threaded.

add optimized aarch64 memcpy and memset

these are based on the ARM optimized-routines repository v20.05
(ef907c7a799a), with macro dependencies flattened out and memmove code
removed from memcpy. this change is somewhat unfortunate since having
the branch for memmove support in the large n case of memcpy is the
performance-optimal and size-optimal way to do both, but it makes
memcpy alone (static-linked) about 40% larger and suggests a policy
that use of memcpy as memmove is supported.

tabs used for alignment have also been replaced with spaces.

add big-endian support to ARM assembler memcpy

Allow the existing ARM assembler memcpy implementation to be used for
both big and little endian targets.

clear need_locks in child after fork

the child is single-threaded, but may still need to synchronize with
last changes made to memory by another thread in the parent, so set
need_locks to -1 whereby the next lock-taker will drop to 0 and
prevent further barriers/locking.

only use memcpy realloc to shrink if an exact-sized free chunk exists

otherwise, shrink in-place. as explained in the description of commit
3e16313f8fe2ed143ae0267fd79d63014c24779f, the split here is valid
without holding split_merge_lock because all chunks involved are in
the in-use state.

fix memset overflow in oldmalloc race fix overhaul

commit 3e16313f8fe2ed143ae0267fd79d63014c24779f introduced this bug by
making the copy case reachable with n (new size) smaller than n0
(original size). this was left as the only way of shrinking an
allocation because it reduces fragmentation if a free chunk of the
appropriate size is available. when that's not the case, another
approach may be better, but any such improvement would be independent
of fixing this bug.

fix invalid use of access function in nftw

access always computes result with real ids not effective ones, so it
is not a valid means of determining whether the directory is readable.
instead, attempt to open it before reporting whether it's readable,
and then use fdopendir rather than opendir to open and read the
entries.

effort is made here to keep fd_limit behavior the same as before even
if it was not correct.

add fallback a_clz_32 implementation

some archs already have a_clz_32, used to provide a_ctz_32, but it
hasn't been mandatory because it's not used anywhere yet. mallocng
will need it, however, so add it now. it should probably be optimized
better, but doesn't seem to make a difference at present.

only disable aligned_alloc if malloc was replaced but it wasn't

it both malloc and aligned_alloc have been replaced but the internal
aligned_alloc still gets called, the replacement is a wrapper of some
sort. it's not clear if this usage should be officially supported, but
it's at least a plausibly interesting debugging usage, and easy to do.
it should not be relied upon unless it's documented as supported at
some later time.

have ldso track replacement of aligned_alloc

this is in preparation for improving behavior of malloc interposition.

reintroduce calloc elison of memset for direct-mmapped allocations

a new weak predicate function replacable by the malloc implementation,
__malloc_allzerop, is introduced. by default it's always false; the
default version will be used when static linking if the bump allocator
was used (in which case performance doesn't matter) or if malloc was
replaced by the application. only if the real internal malloc is
linked (always the case with dynamic linking) does the real version
get used.

if malloc was replaced dynamically, as indicated by __malloc_replaced,
the predicate function is ignored and conditional-memset is always
performed.

move __malloc_replaced to a top-level malloc file

it's not part of the malloc implementation but glue with musl dynamic
linker.

switch to a common calloc implementation

abstractly, calloc is completely malloc-implementation-independent;
it's malloc followed by memset, or as we do it, a "conditional memset"
that avoids touching fresh zero pages.

previously, calloc was kept separate for the bump allocator, which can
always skip memset, and the version of calloc provided with the full
malloc conditionally skipped the clearing for large direct-mmapped
allocations. the latter is a moderately attractive optimization, and
can be added back if needed. however, further consideration to make it
correct under malloc replacement would be needed.

commit b4b1e10364c8737a632be61582e05a8d3acf5690 documented the
contract for malloc replacement as allowing omission of calloc, and
indeed that worked for dynamic linking, but for static linking it was
possible to get the non-clearing definition from the bump allocator;
if not for that, it would have been a link error trying to pull in
malloc.o.

the conditional-clearing code for the new common calloc is taken from
mal0_clear in oldmalloc, but drops the need to access actual page size
and just uses a fixed value of 4096. this avoids potentially needing
access to global data for the sake of an optimization that at best
marginally helps archs with offensively-large page sizes.

move oldmalloc to its own directory under src/malloc

this sets the stage for replacement, and makes it practical to keep
oldmalloc around as a build option for a while if that ends up being
useful.

only the files which are actually part of the implementation are
moved. memalign and posix_memalign are entirely generic. in theory
calloc could be pulled out too, but it's useful to have it tied to the
implementation so as to optimize out unnecessary memset when
implementation details make it possible to know the memory is already
clear.

move __expand_heap into malloc.c

this function is no longer used elsewhere, and moving it reduces the
number of source files specific to the malloc implementation.

rename memalign source file back to its proper name

rename aligned_alloc source file back to its proper name

reverse dependency order of memalign and aligned_alloc

this change eliminates the internal __memalign function and makes the
memalign and posix_memalign functions completely independent of the
malloc implementation, written portably in terms of aligned_alloc.

rename aligned_alloc source file

this is the first step of swapping the name of the actual
implementation to aligned_alloc while preserving history follow.

remove stale document from malloc src directory

this was an unfinished draft document present since the initial
check-in, that was never intended to ship in its current form. remove
it as part of reorganizing for replacement of the allocator.

rewrite bump allocator to fix corner cases, decouple from expand_heap

this affects the bump allocator used when static linking in programs
that don't need allocation metadata due to not using realloc, free,
etc.

commit e3bc22f1eff87b8f029a6ab31f1a269d69e4b053 refactored the bump
allocator to share code with __expand_heap, used by malloc, for the
purpose of fixing the case (mainly nommu) where brk doesn't work.
however, the geometric growth behavior of __expand_heap is not
actually well-suited to the bump allocator, and can produce
significant excessive memory usage. in particular, by repeatedly
requesting just over the remaining free space in the current
mmap-allocated area, the total mapped memory will be roughly double
the nominal usage. and since the main user of the no-brk mmap fallback
in the bump allocator is nommu, this excessive usage is not just
virtual address space but physical memory.

in addition, even on systems with brk, having a unified size request
to __expand_heap without knowing whether the brk or mmap backend would
get used made it so the brk could be expanded twice as far as needed.
for example, with malloc(n) and n-1 bytes available before the current
brk, the brk would be expanded by n bytes rounded up to page size,
when expansion by just one page would have sufficed.

the new implementation computes request size separately for the cases
where brk expansion is being attempted vs using mmap, and also
performs individual mmap of large allocations without moving to a new
bump area and throwing away the rest of the old one. this greatly
reduces the need for geometric area size growth and limits the extent
to which free space at the end of one bump area might be unusable for
future allocations.

as a bonus, the resulting code size is somewhat smaller than the
combined old version plus __expand_heap.

move malloc_impl.h from src/internal to src/malloc

this reflects that it is no longer intended for consumption outside of
the malloc implementation.

move declaration of interfaces between malloc and ldso to dynlink.h

this eliminates consumers of malloc_impl.h outside of the malloc
implementation.

reformat clock_adjtime with always-true condition removed

always use time64 syscall first for clock_adjtime

clock_adjtime always returns the current clock setting in struct
timex, so it's always possible that the time64 version is needed.

fix broken time64 clock_adjtime

the 64-bit time code path used the wrong (time32) syscall. fortunately
this code path is not yet taken unless attempting to set a post-Y2038
time.

fix unbounded heap expansion race in malloc

this has been a longstanding issue reported many times over the years,
with it becoming increasingly clear that it could be hit in practice.
under concurrent malloc and free from multiple threads, it's possible
to hit usage patterns where unbounded amounts of new memory are
obtained via brk/mmap despite the total nominal usage being small and
bounded.

the underlying cause is that, as a fundamental consequence of keeping
locking as fine-grained as possible, the state where free has unbinned
an already-free chunk to merge it with a newly-freed one, but has not
yet re-binned the combined chunk, is exposed to other threads. this is
bad even with small chunks, and leads to suboptimal use of memory, but
where it really blows up is where the already-freed chunk in question
is the large free region "at the top of the heap". in this situation,
other threads momentarily see a state of having almost no free memory,
and conclude that they need to obtain more.

as far as I can tell there is no fix for this that does not harm
performance. the fix made here forces all split/merge of free chunks
to take place under a single lock, which also takes the place of the
old free_lock, being held at least momentarily at the time of free to
determine whether there are neighboring free chunks that need merging.

as a consequence, the pretrim, alloc_fwd, and alloc_rev operations no
longer make sense and are deleted. simplified merging now takes place
inline in free (__bin_chunk) and realloc.

as commented in the source, holding the split_merge_lock precludes any
chunk transition from in-use to free state. for the most part, it also
precludes change to chunk header sizes. however, __memalign may still
modify the sizes of an in-use chunk to split it into two in-use
chunks. arguably this should require holding the split_merge_lock, but
that would necessitate refactoring to expose it externally, which is a
mess. and it turns out not to be necessary, at least assuming the
existing sloppy memory model malloc has been using, because if free
(__bin_chunk) or realloc sees any unsynchronized change to the size,
it will also see the in-use bit being set, and thereby can't do
anything with the neighboring chunk that changed size.

suppress unwanted warnings when configuring with clang

coding style warnings enabled by default in clang have long been a
source of spurious questions/bug-reports. since clang provides a -w
that behaves differently from gcc's, and that lets us enable any
warnings we may actually want after turning them all off to start with
a clean slate, use it at configure time if clang is detected.

restore lock-skipping for processes that return to single-threaded state

the design used here relies on the barrier provided by the first lock
operation after the process returns to single-threaded state to
synchronize with actions by the last thread that exited. by storing
the intent to change modes in the same object used to detect whether
locking is needed, it's possible to avoid an extra (possibly costly)
memory load after the lock is taken.

cut down size of some libc struct members

these are all flags that can be single-byte values.

don't use libc.threads_minus_1 as relaxed atomic for skipping locks

after all but the last thread exits, the next thread to observe
libc.threads_minus_1==0 and conclude that it can skip locking fails to
synchronize with any changes to memory that were made by the
last-exiting thread. this can produce data races.

on some archs, at least x86, memory synchronization is unlikely to be
a problem; however, with the inline locks in malloc, skipping the lock
also eliminated the compiler barrier, and caused code that needed to
re-check chunk in-use bits after obtaining the lock to reuse a stale
value, possibly from before the process became single-threaded. this
in turn produced corruption of the heap state.

some uses of libc.threads_minus_1 remain, especially for allocation of
new TLS in the dynamic linker; otherwise, it could be removed
entirely. it's made non-volatile to reflect that the remaining
accesses are only made under lock on the thread list.

instead of libc.threads_minus_1, libc.threaded is now used for
skipping locks. the difference is that libc.threaded is permanently
true once an additional thread has been created. this will produce
some performance regression in processes that are mostly
single-threaded but occasionally creating threads. in the future it
may be possible to bring back the full lock-skipping, but more care
needs to be taken to produce a safe design.

reorder thread list unlink in pthread_exit after all locks

since the backend for LOCK() skips locking if single-threaded, it's
unsafe to make the process appear single-threaded before the last use
of lock.

this fixes potential unsynchronized access to a linked list via
__dl_thread_cleanup.

fix incorrect SIGSTKFLT on all mips archs

signal 7 is SIGEMT on Linux mips* ABI according to the man pages and
kernel. it's not clear where the wrong name came from but it dates
back to original mips commit.

handle possibility that SIGEMT replaces SIGSTKFLT in strsignal

presently all archs define SIGSTKFLT but this is not correct. change
strsignal as a prerequisite for fixing that.

fix return value of res_send, res_query on errors from nameserver

the internal __res_msend returns 0 on timeout without having obtained
any conclusive answer, but in this case has not filled in meaningful
anslen. res_send wrongly treated that as success, but returned a zero
answer length. any reasonable caller would eventually end up treating
that as an error when attempting to parse/validate it, but it should
just be reported as an error.

alternatively we could return the last-received inconclusive answer
(typically servfail), but doing so would require internal changes in
__res_msend. this may be considered later.

fix handling of errors resolving one of paired A+AAAA query

the old logic here likely dates back, at least in inspiration, to
before it was recognized that transient errors must not be allowed to
reflect the contents of successful results and must be reported to the
application.

here, the dns backend for getaddrinfo, when performing a paired query
for v4 and v6 addresses, accepted results for one address family even
if the other timed out. (the __res_msend backend does not propagate
error rcodes back to the caller, but continues to retry until timeout,
so other error conditions were not actually possible.)

this patch moves the checks to take place before answer parsing, and
performs them for each answer rather than only the answer to the first
query. if nxdomain is seen it's assumed to apply to both queries since
that's how dns semantics work.

set AD bit in dns queries, suppress for internal use

the AD (authenticated data) bit in outgoing dns queries is defined by
rfc3655 to request that the nameserver report (via the same bit in the
response) whether the result is authenticated by DNSSEC. while all
results returned by a DNSSEC conforming nameserver will be either
authenticated or cryptographically proven to lack DNSSEC protection,
for some applications it's necessary to be able to distinguish these
two cases. in particular, conforming and compatible handling of DANE
(TLSA) records requires enforcing them only in signed zones.

when the AD bit was first defined for queries, there were reports of
compatibility problems with broken firewalls and nameservers dropping
queries with it set. these problems are probably a thing of the past,
and broken nameservers are already unsupported. however, since there
is no use in the AD bit with the netdb.h interfaces, explicitly clear
it in the queries they make. this ensures that, even with broken
setups, the standard functions will work, and at most the res_*
functions break.

fix undefined behavior from signed overflow in strstr and memmem

unsigned char promotes to int, which can overflow when shifted left by
24 bits or more. this has been reported multiple times but then
forgotten. it's expected to be benign UB, but can trap when built with
explicit overflow catching (ubsan or similar). fix it now.

note that promotion to uint32_t is safe and portable even outside of
the assumptions usually made in musl, since either uint32_t has rank
at least unsigned int, so that no further default promotions happen,
or int is wide enough that the shift can't overflow. this is a
desirable property to have in case someone wants to reuse the code
elsewhere.

remove arm (32-bit) support for vdso clock_gettime

it's been reported that the vdso clock_gettime64 function on (32-bit)
arm is broken, producing erratic results that grow at a rate far
greater than one reported second per actual elapsed second. the vdso
function seems to have been added sometime between linux 5.4 and 5.6,
so if there's ever been a working version, it was only present for a
very short window.

it's not clear what the eventual upstream kernel solution will be, but
something needs to be done on the libc side so as not to be producing
binaries that seem to work on older/existing/lts kernels (which lack
the function and thus lack the bug) but will break fantastically when
moving to newer kernels.

hopefully vdso support will be added back soon, but with a new symbol
name or version from the kernel to allow continued rejection of broken
ones.

fix undefined behavior in wcsto[ld] family functions

analogous to commit b287cd745c2243f8e5114331763a5a9813b5f6ee but for
the custom FILE stream type the wcstol and wcstod family use. __toread
could be used here as well, but there's a simple direct fix to make
the buffer pointers initially valid for subtraction, so just do that
to avoid pulling in stdio exit code in programs that don't use stdio.

fix sh fesetround failure to clear old mode

the sh version of fesetround or'd the new rounding mode onto the
control register without clearing the old rounding mode bits, making
changes sticky. this was the root cause of multiple test failures.

move __string_read into vsscanf source file

apparently this function was intended at some point to be used by
strto* family as well, and thus was put in its own file; however, as
far as I can tell, it's only ever been used by vsscanf. move it to the
same file to reduce the number of source files and external symbols.

remove spurious repeated semicolon in fmemopen

combine two calls to memset in fmemopen

this idea came up when I thought we might need to zero the UNGET
portion of buf as well, but it seems like a useful improvement even
when that turned out not to be necessary.

fix possible access to uninitialized memory in shgetc (via scanf)

shgetc sets up to be able to perform an "unget" operation without the
caller having to remember and pass back the character value, and for
this purpose used a conditional store idiom:

if (f->rpos[-1] != c) f->rpos[-1] = c

to make it safe to use with non-writable buffers (setup by the
sh_fromstring macro or __string_read with sscanf).

however, validity of this depends on the buffer space at rpos[-1]
being initialized, which is not the case under some conditions
(including at least unbuffered files and fmemopen ones).

whenever data was read "through the buffer", the desired character
value is already in place and does not need to be written. thus,
rather than testing for the absence of the value, we can test for
rpos<=buf, indicating that the last character read could not have come
from the buffer, and thereby that we have a "real" buffer (possibly of
zero length) with writable pushback (UNGET bytes) below it.

fix undefined behavior in scanf core

as reported/analyzed by Pascal Cuoq, the shlim and shcnt
macros/functions are called by the scanf core (vfscanf) with f->rpos
potentially null (if the FILE is not yet activated for reading at the
time of the call). in this case, they compute differences between a
null pointer (f->rpos) and a non-null one (f->buf), resulting in
undefined behavior.

it's unlikely that any observably wrong behavior occurred in practice,
at least without LTO, due to limits on what's visible to the compiler
from translation unit boundaries, but this has not been checked.

fix is simply ensuring that the FILE is activated for read mode before
entering the main scanf loop, and erroring out early if it can't be.

math: add x86_64 remquol

math: move x87-family fmod functions to C with inline asm

math: move x87-family remainder functions to C with inline asm

math: move x87-family rint functions to C with inline asm

math: move x87-family lrint functions to C with inline asm

math: move x86_64 (l)lrint(f) functions to C with inline asm