runtime: FreeBSD memory corruption involving fork system call #15658

derekmarcotte opened this Issue May 12, 2016 · 203 comments

derekmarcotte commented May 12, 2016 edited

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  2. What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  3. What did you do?
package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  4. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  5. What did you see instead?

Various types of panics caused by what looks to be corruption within the finalizer lists, which I am assuming is due to race conditions. These panics can happen in as little as 2 minutes, or take much longer; 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
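
(Toy illustration only, not the actual runtime code: the wedge described above is what a singly linked list walk, such as the one addspecial performs over the specials list, does once a node's next pointer has been corrupted to point back at the node itself.)

// Toy sketch only - not the runtime's code. It mimics a specials-list walk:
// if corruption ever makes a node's next pointer refer back to the node
// itself, the loop below never reaches nil and spins forever while the
// caller holds the lock, which matches the wedged processes seen here.
package main

type special struct {
        next *special
}

func walk(head *special) int {
        n := 0
        for x := head; x != nil; x = x.next { // never terminates if x.next == x
                n++
        }
        return n
}

func main() {
        a := &special{}
        a.next = a // simulated corruption: x.next pointing to x
        walk(a)    // infinite loop, analogous to the wedged process
}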

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.

Here are some sample messages:

  • fatal error: runtime.SetFinalizer: finalizer already set
  • runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty

1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt

This was run on an 8-core processor, and on a 4-core/8-thread processor with ECC RAM, with similar results.

Additionally, while this example is an extreme case, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!

Here's another panic experienced in mallocgc by the same sample code:

1463046620-malloc-panic.txt

bradfitz added this to the Go1.7Maybe milestone May 12, 2016

Owner

bradfitz commented May 12, 2016

@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)

And do you only see it on FreeBSD, or other operating systems as well?

/cc @aclements @ianlancetaylor

aclements was assigned by bradfitz May 12, 2016

Member

aclements commented May 12, 2016

@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?

Member

aclements commented May 12, 2016

On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).

@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.

Owner

bradfitz commented May 12, 2016

Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.

derekmarcotte commented May 13, 2016 edited

Just got a golang/go dev environment set up on my machine (my previous Go install was from FreeBSD packages). Will report back soon.

Here's the heads of a bunch of logs with the epoch at the start of the process, so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is the "finalizer already set" error that crashes the process. I thought maybe the GC was marking these objects as free (or otherwise touching them) before SetFinalizer had a chance to set their value.

I didn't include too many of them in my initial report, because I thought they were largely redundant.

@bradfitz: these logs are against master:

1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80


1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8


1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80


1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80

@aclements: I will try your patch next. One variable at a time.

Contributor

RLH commented May 14, 2016

The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was
seeing a few days ago were on the 1.8 dev.garbage branch and the result of
a TOC write barrier not marking an object as being published, TOC doing a
rollback, and reallocating a new object over an old one. Both objects had
finalizers and the bug was tripped.

None of the TOC code is in 1.7 or for that matter on TIP. I can force
myself to imagine that the 1.7 allocCache code could cause a similar
situation if the allocCache and bitmaps were not coherent. but if that were
the case it would expect lots of failures all over the place and across all
platforms, not just freeBSD.

derekmarcotte commented May 15, 2016 edited

@aclements: I've run it with your patch, although I haven't been able to babysit it too much.

The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.

This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:

1463270694.txt

Will keep trying to come up with more info.

derekmarcotte commented May 16, 2016 edited

@aclements: Here's some more logs from a binary build with the patch:

1463315708-finalizer-already-set.txt
1463349807-finalizer-already-set.txt
1463352601-workbuf-not-empty.txt
1463362849-workbuf-empty.txt
1463378745-wedged-gdb.txt

Please let me know if I can be of further assistance?

Member

aclements commented May 16, 2016

Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)

Member

aclements commented May 16, 2016

I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process.

I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)

derekmarcotte referenced this issue in derekmarcotte/hfm May 16, 2016

Open

FreeBSD instability: go runtime panics #1

derekmarcotte commented May 17, 2016 edited

Thanks @aclements !

One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.

1463429459.txt

I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.

After ~11 hours + 5 minutes, the process panic'd:

1463445934-invalid-p-state.txt

Member

aclements commented May 17, 2016

Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.

My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a Go runtime guy nor a FreeBSD systems-level guy - yet. I would like to be as helpful as possible.)

Thanks again!

derekmarcotte commented May 19, 2016 edited

I'm going to post a few more here. This'll be my last batch, unless requested. 😄 1463584079 is a new message.

 ==> 1463485780 <==
Starting a bunch of goroutines...

fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4bff02, 0x14)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
        /home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)

==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free

runtime stack:
runtime.throw(0x4c2528, 0x22)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6

==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)

==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)
Member

aclements commented May 19, 2016

Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?

That's correct! Thanks. Anyone else able to reproduce?

Member

aclements commented May 26, 2016

@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.

@aclements: both machines were 4 core, 8 thread CPUs:

Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the logs included were from this machine, but I could not keep my process (of my project, not the listing in this issue) running for very long

AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included

The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).

I suspect these hosts can chew through many more invocations in the same time than a 2-core machine could, which additionally increases the probability of contention/collisions at any given instant.

The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.

I can get dmesg/kldstat output for the Xeon and AMD machines if helpful (I would rather post those out of band).

Member

aclements commented May 31, 2016

@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).

Two more experiments to try:

  1. See if you can reproduce with GOMAXPROCS=2 (either 1.6.2 or master is fine).
  2. Try the same test program, but invoking an absolute path to a command that doesn't exist (and ignoring the error). This should get the exec out of the picture.

@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case.

I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.

I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.

I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to ask about the build tool in the issue report template, given the build-tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).

@aclements, rebuilding gb from source using the Go toolchain built from master, and then rebuilding the test with the new gb, creates nearly identical binaries (minus file-path TEXT entries); this is to be expected.

Perhaps there's something to this. Will keep you posted.

Contributor

josharian commented Jun 1, 2016

derekmarcotte commented Jun 2, 2016 edited

@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

func run(done chan struct{}) {
        cmd := exec.Command("doesnotexist")
        cmd.Wait()

        done <- struct{}{}
        return
}

@aclements, after the above change, it ran for 4 days without a panic.

Member

aclements commented Jun 6, 2016

@derekmarcotte, argh, I'm sorry. That wasn't a useful test. :( That eliminates the fork, the exec, and the finalizer, so there isn't really anything left. It needs to be something like:

func run(done chan struct{}) {
    cmd := exec.Command("/doesnotexist")
    cmd.Start()
    cmd.Wait()

    done <- struct{}{}
    return
}

It's important that the path be absolute. Otherwise, Go will try to resolve the path itself, fail at that point, and not do anything for the subsequent method calls. It's also important that the cmd.Start() still happen (even though it will fail). Again, without that, the Wait just returns nil immediately.
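
(A minimal sketch of the difference, assuming the stock os/exec behaviour: with a bare name the lookup fails up front and nothing is ever forked, while an absolute path goes all the way through fork/exec before failing.)

package main

import (
        "fmt"
        "os/exec"
)

func main() {
        // Bare name: exec.Command resolves it with LookPath up front; the
        // lookup fails, Start returns that error immediately, and nothing
        // is forked.
        rel := exec.Command("doesnotexist")
        fmt.Println(rel.Start()) // e.g. exec: "doesnotexist": executable file not found in $PATH

        // Absolute path: no lookup, so Start goes through the real
        // fork/exec and only then fails, which is what this reproducer
        // needs to exercise.
        abs := exec.Command("/doesnotexist")
        fmt.Println(abs.Start()) // e.g. fork/exec /doesnotexist: no such file or directory
}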

I'm sorry I didn't catch that earlier.

@aclements, at least we know that Go can spin for 4 days without crashing on my dev machine - so not a total waste.

I've recompiled with the above, and crashed after ~32 hours. Log attached:

doesnotexists.txt

Member

aclements commented Jun 9, 2016

@derekmarcotte, thanks! Was that with go or gb? With or without the SetFinalizer calls?

Member

aclements commented Jun 9, 2016

Also, 1.6.2 or master?

@aclements great questions. I was thinking we ought to recap all the variables at this point anyway.

Trying to narrow things down as much as possible. So, this is with go build using master (as of 7af2ce3), with this diff applied:

diff --git a/src/os/exec.go b/src/os/exec.go
index 239fd92..6a8eed5 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -5,7 +5,6 @@
 package os

 import (
-       "runtime"
        "sync/atomic"
        "syscall"
 )
@@ -19,7 +18,6 @@ type Process struct {

 func newProcess(pid int, handle uintptr) *Process {
        p := &Process{Pid: pid, handle: handle}
-       runtime.SetFinalizer(p, (*Process).Release)
        return p
 }

diff --git a/src/os/file_unix.go b/src/os/file_unix.go
index 9b64f21..a997e9e 100644
--- a/src/os/file_unix.go
+++ b/src/os/file_unix.go
@@ -54,7 +54,6 @@ func NewFile(fd uintptr, name string) *File {
                return nil
        }
        f := &File{&file{fd: fdi, name: name}}
-       runtime.SetFinalizer(f.file, (*file).close)
        return f
 }

with source:

package main

/* stdlib includes */
import (
    "fmt"
    "os/exec"
)

func run(done chan struct{}) {
    cmd := exec.Command("/doesnotexist")
    cmd.Start()
    cmd.Wait()

    done <- struct{}{}
    return
}

func main() {
    fmt.Println("Starting a bunch of goroutines...")

    // 8 & 16 are arbitrary
    done := make(chan struct{}, 16)

    for i := 0; i < 8; i++ {
        go run(done)
    }

    for {
        select {
        case <-done:
            go run(done)
        }
    }
}
Contributor

ianlancetaylor commented Jun 14, 2016

It sounds like this is FreeBSD-specific, so marking as such. Please correct me if I am wrong.

ianlancetaylor changed the title from runtime: garbage collector race condition (SetFinalizer, addfinalizer, addspecial, removespecial) to runtime: FreeBSD memory corruption involving finalizers Jun 14, 2016

@ianlancetaylor: thanks for this. I'm just building a Linux box dedicated to testing tonight. I haven't yet been able to test on Linux. I'll update the thread as I find out more.

adg modified the milestone: Go1.8, Go1.7Maybe Jun 20, 2016

kaey commented Jul 5, 2016

Looks like I'm having the same issue with github.com/influxdb/telegraf. Panic log attached.

acquirep: p->m=859531515904(14) p->status=1
fatal error: acquirep: invalid p state
acquirep: p->m=859535287296(8) p->status=1
fatal error: acquirep: invalid p state
% uname -a
FreeBSD 9.3-RELEASE-p5 FreeBSD 9.3-RELEASE-p5 #0: Mon Nov  3 22:38:58 UTC 2014     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
% cat /var/run/dmesg.boot | grep -i cpu
CPU: Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz (3492.14-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
go version go1.6.2 freebsd/amd64

I'll try to rebuild with go tip.

panic.txt

Contributor

stevenh commented Jul 18, 2016

@derekmarcotte are your test machines metal or VM's?

I've just kicked off a test on 10.2-RELEASE box here to see if I can repro locally.

Contributor

stevenh commented Jul 18, 2016 edited

It took many hours but I did get a panic in the end.

This is from go 1.6.2 (no patches) on 10.2-RELEASE amd64 (metal), no virtualisation.

./test 
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.SetFinalizer.func2()
        /usr/local/go/src/runtime/mfinal.go:372 +0x6f

goroutine 23221055 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc8202eca58 sp=0xc8202eca50
runtime.SetFinalizer(0x4f2f00, 0xc8204a19e0, 0x4d68a0, 0x54f550)
        /usr/local/go/src/runtime/mfinal.go:374 +0x4b5 fp=0xc8202ecbd8 sp=0xc8202eca58
os.NewFile(0x3, 0x51efd0, 0x9, 0x323200000000)
        /usr/local/go/src/os/file_unix.go:57 +0xfc fp=0xc8202ecc30 sp=0xc8202ecbd8
os.OpenFile(0x51efd0, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/file_unix.go:123 +0x1bd fp=0xc8202ecca8 sp=0xc8202ecc30
os.Open(0x51efd0, 0x9, 0xdeaddeaddeaddead, 0x0, 0x0)
        /usr/local/go/src/os/file.go:244 +0x48 fp=0xc8202ecce8 sp=0xc8202ecca8
os/exec.(*Cmd).stdin(0xc8203eac80, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:171 +0x6e fp=0xc8202ecd80 sp=0xc8202ecce8
os/exec.(*Cmd).Start(0xc8203eac80, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:316 +0x2f4 fp=0xc8202ecf68 sp=0xc8202ecd80
main.run(0xc82006c060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x50 fp=0xc8202ecfa8 sp=0xc8202ecf68
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8202ecfb0 sp=0xc8202ecfa8
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:31 +0x19b

kaey commented Jul 19, 2016 edited

are your test machines metal or VM's?

For me, telegraf crashes on both VMware VMs and bare-metal machines.

@stevenh: So it's random, and I think the probability of crashing is a function of the number of CPUs being used; while I say 8 is arbitrary, it's also the number of CPUs I had on hand. If you have more, by all means, set it higher.

Try it a few times, and you'll see. I just stuck it in a loop like while true; do date; ./ex-from-comment-225006179; done, and then you can see them over a day period.

Anecdotally, I've been able to run a longer-lived process with GOMAXPROCS=2, but will be doing more thorough testing.

Most of those dumps are from a jail running on bare metal. The Xeon was straight on the jail host (although there were jails running).

@stevenh: Oh, and the one that uses /doesnotexist crashes after 32 hours, but the one that uses true crashes much sooner (because it appears there are multiple issues going on).

Contributor

stevenh commented Jul 20, 2016

With GOMAXPROCS=1 and "true" it's been running for 20+ hours so far, so it does indeed seem to be affected by the number of procs available.

I've been running /doesnotexist with stock Go 1.6.2 for just shy of a week with GOMAXPROCS="2". Interestingly, there are 11 OS threads, which is surprising to me, but may be expected?

I've been able to cause the panic much more quickly by being more aggressive with the garbage collector. I've been using this script to test:

while true; do 
  echo "==== NEW RUN $(date) ===="
  echo
  time sh -c 'export GOGC=5; export GODEBUG=gctrace=2,schedtrace=100; ./15658-doesnotexit'
  echo
  echo
done >> ../logs/15658-doesnotexit-logs-gogc5-gtrace2 2>> ../logs/15658-doesnotexit-logs-gogc5-gtrace2

Here's a sample of the crashing interval:

==== NEW RUN Tue Aug 30 16:36:11 EDT 2016 ====
==== NEW RUN Tue Aug 30 16:37:18 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:13:02 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:28:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:54:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:04:29 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:36:36 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:57:48 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:09:12 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:37:33 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:42:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:52:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:03:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:07:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:10:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:25:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:00 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:28 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:34:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:38:16 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:40:09 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:56:57 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:31:54 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:39:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:51:06 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:28:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:32:35 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:53:50 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:55:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:57:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:58:08 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:59:58 EDT 2016 ====
==== NEW RUN Tue Aug 30 23:52:15 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:28:37 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:46:25 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:53:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:57:34 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:06:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:13:20 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:21:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:17:49 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:55:35 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:43:36 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:47:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:48:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:50:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:59:05 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:43:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:55:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:04:41 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:34:44 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:50:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:53:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:58:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:19:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:34:13 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:41:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:52:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:54:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:56:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:04 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:48 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:01:46 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:06:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:23:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:29:57 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:41:58 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:43:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:45:00 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:59:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 09:24:11 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:01:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:05:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:09:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:12:51 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:23:35 EDT 2016 ====
^^^^ process was idle/wedged, sent kill manually
==== NEW RUN Thu Sep 1 06:47:27 EDT 2016 ====
==== NEW RUN Thu Sep 1 06:49:27 EDT 2016 ====

$ go version
go version go1.7 freebsd/amd64

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH="/home/derek/dev/gopath"
GORACE=""
GOROOT="/home/derek/go"
GOTOOLDIR="/home/derek/go/pkg/tool/freebsd_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -gno-record-gcc-switches"
CXX="clang++"
CGO_ENABLED="1"

reezer commented Sep 2, 2016

I am trying to reproduce this on DragonFly BSD. So far I have been able to reproduce the idle state:

localhost% uname -a
DragonFly localhost.localdomain 4.6-RELEASE DragonFly v4.6.0-RELEASE #0: Mon Aug  1 12:46:25 EDT 2016     root@www.shiningsilence.com:/usr/obj/build/sys/X86_64_GENERIC  x86_64
localhost% go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="dragonfly"
GOOS="dragonfly"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/dragonfly_amd64"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build276771230=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
localhost% go version
go version go1.7 dragonfly/amd64

SCHED 783590ms: gomaxprocs=2 idleprocs=0 threads=8 spinningthreads=0 idlethreads=5 runqueue=8 [0 1]

I have been running this test inside Virtualbox, with 2 virtual CPUs.

reezer commented Oct 2, 2016

On DragonFly it's actually causing a kernel panic. Therefore I created a bug report there; it includes the dump.

https://bugs.dragonflybsd.org/issues/2949

Contributor

rsc commented Oct 20, 2016

Cannot reproduce on freebsd-amd64-gce101 gomote running FreeBSD 10.1.
The Dragonfly panic is suggestive that there may be a similar kernel problem in FreeBSD.
Unlikely we are going to solve this.

rsc modified the milestone: Go1.8Maybe, Go1.8 Oct 20, 2016

derekmarcotte commented Oct 20, 2016 edited

Thanks very much for taking a look. I'm very grateful for your time.

Are you able to elaborate more fully on what you mean by "Unlikely we are going to solve this"? Are you referring to the issue in general, or the DragonFly kernel crash? Is the "we" the Go Team?

I believe the implication of a kernel problem in FreeBSD, because of a panic in DragonFly, simply doesn't follow. I can give a lot more detail about this, but I fear it's redundant, unless requested.

For what it's worth, this issue is reliably reproducible enough that I've been able to bisect down to a commit that makes sense (golang/go@e6d8bfe), in my first go. Additionally, the commits leading up to the implicated commit in the bisect are documentation-only, and don't reproduce the behaviour. I have very high confidence in the bisect.

I haven't published this work yet, as I've been asking some new questions based on the bisect. I was hoping to be able to speak intelligently about this when I reported results, but I feel the pressure has now increased. I've been making progress slowly, but I am making progress (this isn't my daytime gig). I will post a repo later today with the work and the bisect log, so that others can reference it.

Here's my thoughts:

  • at first glance, the implicated commit seems to change how far down in the exec call tree the software gets in the /doesnotexist case
    • that raised the question, did it ever work?
      • starting with the tag go1, the /bin/true case has produced Go runtime panics, and the issue therefore has always existed

New lines of thought are:

  • This suggests that the conventions around how something is actually being exec'd are incorrect, for at least the FreeBSD case.
  • Presumably, other syscalls work fine, so the syscall interface is likely correct.
  • Does darwin exhibit the problem? (I will test, although I don't have access to an 8-thread darwin machine.) It shares a lot of Go runtime code with FreeBSD in the exec codepath.
  • How far down the exec call tree does this commit push the test, in comparison with the older code?

I plan to investigate these questions, but again, this isn't my daytime gig.

As I mentioned earlier, I'm very happy to work with others on this.

I'd be very interested in more details on the environment you were unable to reproduce in. Specifically, how many CPUs? Is the gomote a virtualized environment, or do you have direct access to hardware somewhere? Perhaps we can compare build artefacts? i.e. does your artefact crash in my environment, and vice versa?

Thanks again for taking a look.

Owner

bradfitz commented Oct 20, 2016

@derekmarcotte, thanks for investigating. gomote is our interface to freshly-created FreeBSD VMs on GCE.

See
https://github.com/golang/build/blob/master/env/freebsd-amd64/make.bash
https://github.com/golang/build/blob/master/dashboard/builders.go#L81

They're both n1-highcpu-4, which is https://cloud.google.com/compute/docs/machine-types#standard_machine_types ... "Standard 4 CPU machine type with 4 virtual CPUs and 15 GB of memory. ... For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge), 2.3 GHz Intel Xeon E5 v3 (Haswell), or 2.2 GHz Intel Xeon E5 v4 (Broadwell)"

So it's at least 2 cores.

Contributor

ianlancetaylor commented Oct 20, 2016

@derekmarcotte Thanks for continuing to investigate. I'm fairly convinced that this is a kernel problem, and your identification of that CL as the cause just makes that seem more likely to me. I have never been able to recreate the problem myself. I suspect that it requires real hardware, not a VM.

reezer commented Oct 20, 2016

@rsc Actually the kernel bug causing the panic existed only for a small amount of time on DragonFly. It most likely was introduced with DragonFly 4.6.0. Using 4.6.1 I fail to reproduce it. So I am sorry for laying a false trail there.

I will keep it running though, to make sure nothing happens, but I already tested it far longer than originally.

Also as a side note, since the topic of VMs came up: All my tests have been done using Virtualbox.

Here's the repo, as promised: https://github.com/derekmarcotte/goruntime-testing

@ianlancetaylor: Thanks for checking it out. Would you be interested in running 15658-doesnotexist.go to completion? Should be on the order of 8-12 hours, maybe more? If I could suggest pulling the command line from the bisect-15658, to get the environment a bit more hostile. Check the logs directory for the bisect.

If there were a kernel problem, I would expect others to have problems in the wild with, say, inetd (because it is a glorified fork/exec machine). This would be a pretty hot code path for many production FreeBSD uses. Additionally, if this were a kernel problem, I would expect the kernel data structures themselves to become corrupt (and the kernel to panic, as DragonFly did), but I see no evidence of this. My main testing machine has a current uptime of 105 days, with probably hundreds of runtime panics, yet no other services have faltered, and everything behaves flawlessly.

Contributor

stevenh commented Oct 21, 2016 edited

I've been running this on 1.7.1 for 12h+ now on a 24-core E5-2640 running 8.2-RELEASE, with no hang or panic:

package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
        "runtime"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        cmd.Start()
        cmd.Wait()

        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for range done {
                go run(done)
        }
}
Contributor

rsc commented Nov 2, 2016

It's not uncommon for Go programs to behave significantly differently from C programs, to the extent that they expose problems that have been latent in kernels for many years. I'm not saying I'm 100% sure that's what is going on here, but it's definitely possible, especially given that fork/exec seems to be required.

rsc modified the milestone: Go1.9, Go1.8Maybe Nov 11, 2016

reezer commented Nov 23, 2016

Using stevenh's code I received this seemingly new panic.

OS: FreeBSD 11.0
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz (2394.61-MHz K8-class CPU)

GOGC=5 ./test
Starting a bunch of goroutines...
GOMAXPROCS: 8
runtime: failed MSpanList_Remove 0x800528320 0x8005c7d20 0x53b670 0x53b660
fatal error: MSpanList_Remove

runtime stack:
runtime.throw(0x4c973e, 0x10)
	/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*mSpanList).remove(0x53b660, 0x800528320)
	/usr/local/go/src/runtime/mheap.go:1001 +0x19d
runtime.(*mcentral).cacheSpan(0x53b650, 0xc42001b228)
	/usr/local/go/src/runtime/mcentral.go:55 +0x3d0
runtime.(*mcache).refill(0x800524e10, 0xc40000001e, 0xc4200b82e0)
	/usr/local/go/src/runtime/mcache.go:121 +0xae
runtime.(*mcache).nextFree.func1()
	/usr/local/go/src/runtime/malloc.go:505 +0x33
runtime.systemstack(0xc42007cee0)
	/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1079

goroutine 643553 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc4201e68e0 sp=0xc4201e68d8
runtime.(*mcache).nextFree(0x800524e10, 0x1e, 0x0, 0x40f354, 0x80064429b)
	/usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc4201e6938 sp=0xc4201e68e0
runtime.mallocgc(0x380, 0x4a8020, 0x1, 0xc4200b82e0)
	/usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc4201e69d8 sp=0xc4201e6938
runtime.makeslice(0x4a8020, 0x338, 0x338, 0xc4200b82e0, 0x1f, 0x40e996)
	/usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc4201e6a30 sp=0xc4201e69d8
syscall.ByteSliceFromString(0xc420010000, 0x337, 0xc4201064d8, 0xc4200b82e0, 0x1f, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc4201e6a90 sp=0xc4201e6a30
syscall.BytePtrFromString(0xc420010000, 0x337, 0xc4200b82e0, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc4201e6ad8 sp=0xc4201e6a90
syscall.SlicePtrFromStrings(0xc42000c840, 0x15, 0x15, 0xc4200881f0, 0x2, 0x2, 0x0, 0x0)
	/usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc4201e6b48 sp=0xc4201e6ad8
syscall.forkExec(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0xc420162018, 0x40e996, 0xc4201e6cc8)
	/usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc4201e6c78 sp=0xc4201e6b48
syscall.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0x2, 0x4, 0x44ab20, 0xc4200f0280)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc4201e6cd0 sp=0xc4201e6c78
os.startProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0xc42000c840, 0x15, 0x15)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc4201e6db8 sp=0xc4201e6cd0
os.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64 fp=0xc4201e6e10 sp=0xc4201e6db8
os/exec.(*Cmd).Start(0xc42000c2c0, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc4201e6f60 sp=0xc4201e6e10
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68 fp=0xc4201e6fa8 sp=0xc4201e6f60
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4201e6fb0 sp=0xc4201e6fa8
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 1 [runnable]:
main.main()
	/home/reezer/test.go:30 +0x222

goroutine 643497 [runnable]:
sync.(*RWMutex).RUnlock(0x550950)
	/usr/local/go/src/sync/rwmutex.go:55
syscall.Environ(0xc4200fc000, 0x15, 0x15)
	/usr/local/go/src/syscall/env_unix.go:148 +0x210
os.Environ(0x0, 0x0, 0x0)
	/usr/local/go/src/os/env.go:116 +0x22
os/exec.(*Cmd).envv(0xc420076160, 0xc420126080, 0x0, 0x1)
	/usr/local/go/src/os/exec/exec.go:171 +0x38
os/exec.(*Cmd).Start(0xc420076160, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:356 +0x2d9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 643540 [runnable]:
syscall.Syscall(0x3, 0x1c, 0xc4201fabc8, 0x8, 0x0, 0x8, 0x0)
	/usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0x1c, 0xc4201fabc8, 0x8, 0x2, 0xc42010e000, 0x16)
	/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0xc4200ea020, 0x40e996, 0xc4201facc8)
	/usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0xc420076420, 0x15, 0x15)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c580, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 643472 [runnable]:
syscall.Syscall6(0x214, 0x0, 0xeb54, 0x0, 0x18, 0x0, 0x0, 0xeb54, 0x0, 0x0)
	/usr/local/go/src/syscall/asm_freebsd_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc420166090, 0xc420014060, 0xc420014060, 0x4)
	/usr/local/go/src/os/wait_wait6.go:29 +0x6b
os.(*Process).wait(0xc420166090, 0x0, 0xc4201ea160, 0x0)
	/usr/local/go/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc420166090, 0xc4201e8f08, 0x0, 0x0)
	/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201ea160, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
	/home/reezer/test.go:13 +0x76
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 643498 [semacquire]:
sync.runtime_Semacquire(0x550934)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0x550930)
	/usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0x550930)
	/usr/local/go/src/sync/rwmutex.go:86 +0x31
syscall.forkExec(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0xc420080030, 0x40e996, 0xc4201f7cc8)
	/usr/local/go/src/syscall/exec_unix.go:185 +0x25e
syscall.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0x2, 0x4, 0x44ab20, 0xc420108100)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0xc420132000, 0x15, 0x15)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044420, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 643440 [runnable]:
syscall.Syscall(0x3, 0xc, 0xc4201e7bc8, 0x8, 0x0, 0x8, 0x0)
	/usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0xc, 0xc4201e7bc8, 0x8, 0x2, 0xc420106160, 0x16)
	/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0xc420080018, 0x40e996, 0xc4201e7cc8)
	/usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0xc420132420, 0x15, 0x15)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c420, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 643499 [runnable]:
syscall.fcntl(0x4, 0x2, 0x1, 0x0, 0x0, 0xc4c0000000)
	/usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:255
syscall.forkExecPipe(0xc4201fbc18, 0x2, 0x2, 0xc42018e8f0, 0x16)
	/usr/local/go/src/syscall/exec_bsd.go:264 +0x87
syscall.forkExec(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0xc420148040, 0x40e996, 0xc4201fbcc8)
	/usr/local/go/src/syscall/exec_unix.go:188 +0x281
syscall.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0x2, 0x4, 0x44ab20, 0xc4200f00c0)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0xc4200446e0, 0x15, 0x15)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044580, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d
Contributor

ianlancetaylor commented Nov 23, 2016

@reezer Which version of Go?

reezer commented Nov 23, 2016

go1.7.1 from the FreeBSD package

Owner

bradfitz commented Nov 23, 2016

Go 1.7.1 has two notable bugs you could be hitting. See https://github.com/golang/go/issues?q=milestone%3AGo1.7.3

Can you try Go 1.7.3 or tip instead?

reezer commented Nov 26, 2016 edited

I will still test it longer, but so far it appears to be fixed.

Again, I am sorry for the noise. It was a test system I quickly set up and should have taken greater care to make sure it runs the latest version.

reezer commented Nov 27, 2016 edited

Okay, it takes a while on go1.7.3 freebsd/amd64 (FreeBSD 11.0), but I again receive one of the original errors. For people who want to try to reproduce it: GOGC=5 makes a big difference in the time it takes until the error appears.
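
(Side note, a small sketch in case it helps anyone scripting this: the same GC aggressiveness can be forced from inside the test program via runtime/debug, so it applies no matter how the binary is launched.)

package main

import (
        "fmt"
        "runtime/debug"
)

func main() {
        // Same effect as launching with GOGC=5: the next collection triggers
        // once the heap has grown 5% past what survived the previous one.
        old := debug.SetGCPercent(5)
        fmt.Println("GC percent was", old, "- now 5")
        // ... the rest of the reproducer (spawning the exec goroutines) would go here.
}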

cat out.log
Starting a bunch of goroutines...
GOMAXPROCS: 8
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4ca671, 0x14)
	/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
	/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x80052a260)
	/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.handoff(0x800585000, 0xc41fff49b0)
	/usr/local/go/src/runtime/mgcwork.go:419 +0x26
runtime.(*gcWork).balance(0xc420019c28)
	/usr/local/go/src/runtime/mgcwork.go:263 +0x54
runtime.gcDrainN(0xc420019c28, 0x10000, 0xc4fffffffe)
	/usr/local/go/src/runtime/mgcmark.go:1042 +0x195
runtime.gcAssistAlloc.func1()
	/usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc42017c380)
	/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1079

goroutine 681244 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc42014b8b0 sp=0xc42014b8a8
runtime.gcAssistAlloc(0xc420129520)
	/usr/local/go/src/runtime/mgcmark.go:499 +0x194 fp=0xc42014b938 sp=0xc42014b8b0
runtime.mallocgc(0x7, 0x4a80a0, 0x1, 0xc42000e220)
	/usr/local/go/src/runtime/malloc.go:563 +0x9a4 fp=0xc42014b9d8 sp=0xc42014b938
runtime.makeslice(0x4a80a0, 0x7, 0x7, 0xc42000e220, 0x16, 0x40e996)
	/usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc42014ba30 sp=0xc42014b9d8
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200a01a0, 0xc42000e220, 0x16, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc42014ba90 sp=0xc42014ba30
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc42000e220, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc42014bad8 sp=0xc42014ba90
syscall.SlicePtrFromStrings(0xc42012c300, 0x18, 0x18, 0xc42013c050, 0x2, 0x2, 0x0, 0x0)
	/usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc42014bb48 sp=0xc42014bad8
syscall.forkExec(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0xc42010a008, 0x40e996, 0xc42014bcc8)
	/usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc42014bc78 sp=0xc42014bb48
syscall.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0x2, 0x4, 0x44ab30, 0xc4201ce280)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc42014bcd0 sp=0xc42014bc78
os.startProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0xc42012c300, 0x18, 0x18)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc42014bdb8 sp=0xc42014bcd0
os.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64 fp=0xc42014be10 sp=0xc42014bdb8
os/exec.(*Cmd).Start(0xc4201962c0, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc42014bf60 sp=0xc42014be10
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68 fp=0xc42014bfa8 sp=0xc42014bf60
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42014bfb0 sp=0xc42014bfa8
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 1 [chan receive]:
main.main()
	/home/reezer/test.go:30 +0x222

goroutine 681182 [semacquire]:
syscall.forkAndExecInChild(0xc42012e150, 0xc42012e160, 0x2, 0x2, 0xc420194000, 0x19, 0x19, 0x0, 0x0, 0xc4201c8d60, ...)
	/usr/local/go/src/syscall/exec_bsd.go:54 +0x5e
syscall.forkExec(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0xc42013a010, 0x40e996, 0xc4201c8cc8)
	/usr/local/go/src/syscall/exec_unix.go:193 +0x334
syscall.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0x2, 0x4, 0x44ab30, 0xc420134000)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0xc420076180, 0x18, 0x18)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2420, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 681184 [semacquire]:
os/exec.(*Cmd).Start(0xc4201d2000, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:367 +0x466
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 681213 [semacquire]:
os.(*Process).wait(0xc42010e150, 0x0, 0xc4201b22c0, 0x0)
	/usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42010e150, 0xc4201adf08, 0x0, 0x0)
	/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201b22c0, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
	/home/reezer/test.go:13 +0x76
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 681183 [semacquire]:
os.newProcess(0x14bf2, 0x0, 0xc4200ec010)
	/usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0xc42000c780, 0x18, 0x18)
	/usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420196160, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 681161 [running]:
	goroutine running on other thread; stack unavailable
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 681214 [semacquire]:
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x16, 0x16, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc420180180, 0x18, 0x18, 0xc420198290, 0x2, 0x2, 0x0, 0x0)
	/usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0xc4201b8068, 0x40e996, 0xc4201c7cc8)
	/usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0x2, 0x4, 0x44ab30, 0xc42004e4c0)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0xc420180180, 0x18, 0x18)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0x0, 0x0, 0xd)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2160, 0x4, 0x0)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
	/home/reezer/test.go:12 +0x68
created by main.main
	/home/reezer/test.go:31 +0x24d

goroutine 681162 [running]:
	goroutine running on other thread; stack unavailable
created by main.main
	/home/reezer/test.go:31 +0x24d

runtime stack:
runtime.throw(0x4ca671, 0x14)
	/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
	/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
	/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420017228)
	/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc420017228, 0xc42017d380)
	/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d380, 0x0, 0x0, 0xc41fff4163, 0xc400000000, 0x800765860, 0xc420017228, 0xc)
	/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d380)
	/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200944a0, 0xc42017d380)
	/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
	/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc42003de70)
	/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200944a0, 0xc42017d380)
	/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200944a0, 0xc42017d380)
	/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d380, 0x0)
	/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420129040)
	/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0xc4200001a0)
	/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b

runtime stack:
runtime.throw(0x4ca671, 0x14)
	/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
	/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
	/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420018728)
	/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).tryGet(0xc420018728, 0x5364d0)
	/usr/local/go/src/runtime/mgcwork.go:144 +0xd9
runtime.gcDrainN(0xc420018728, 0x10000, 0xfffffffd)
	/usr/local/go/src/runtime/mgcmark.go:1052 +0x145
runtime.gcAssistAlloc.func1()
	/usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc420018a00)
	/usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1079

runtime stack:
runtime.throw(0x4ca671, 0x14)
	/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
	/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
	/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001b228)
	/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001b228, 0xc420153860)
	/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc420153860, 0x0, 0x0, 0xc41fff563c, 0xc400000000, 0x80052b2a0, 0xc42001b228, 0xf)
	/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc420153860)
	/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200950a0, 0xc420153860)
	/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
	/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc4200ade70)
	/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200950a0, 0xc420153860)
	/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200950a0, 0xc420153860)
	/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc420153860, 0x0)
	/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420001380)
	/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b

runtime stack:
runtime.throw(0x4ca671, 0x14)
	/usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
	/usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
	/usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001c728)
	/usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001c728, 0xc42017d040)
	/usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d040, 0x0, 0x0, 0xc41fff417d, 0xc400000000, 0x800765860, 0xc42001c728, 0xa)
	/usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d040)
	/usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc42002c8a0, 0xc42017d040)
	/usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
	/usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc420039e70)
	/usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc42002c8a0, 0xc42017d040)
	/usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc42002c8a0, 0xc42017d040)
	/usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d040, 0x0)
	/usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420045040)
	/usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
Contributor

stevenh commented Dec 1, 2016

@reezer how long did it take to reproduce for you with GOGC=5?

reezer commented Dec 1, 2016 edited

It varied quite a bit, sometimes minutes, but up to about two hours. It really does need the forking. What also speeds it up (very slightly) is using an absolute path for the binary. Go otherwise will iterate over all of the PATH directories, trying to execute true, so you have a few unnecessary syscalls. Of course you should also make sure that you really have GOMAXPROCS set to something > 1.
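
For illustration, the only difference in the test program is the argument to exec.Command (/usr/bin/true is assumed here as the usual FreeBSD location):

// Bare name: exec.Command calls exec.LookPath, which stats each $PATH
// directory looking for "true" - a few extra syscalls before every fork.
cmdViaPath := exec.Command("true")

// Absolute path: the lookup is skipped entirely.
cmdDirect := exec.Command("/usr/bin/true")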

See also the timings that derekmarcotte posted. Mine have always been higher, but not that much.

It looks like this bug doesn't appear in virtual environments. I agree that there is quite a bit of evidence that it's a bug somewhere outside of Go (i.e. the kernel). Another hint here is that it apparently takes longest the first time you run this. However, take that statement with a grain of salt, because I haven't really sampled this enough yet.

Contributor

stevenh commented Dec 1, 2016

Thanks, reproduced now :)

Contributor

stevenh commented Dec 1, 2016

Here's a possibly interesting one:

GOMAXPROCS: 24
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
main.main()
        /data/go/src/github.com/multiplay/go/apps/test/main.go:29 +0x222

goroutine 3560607 [semacquire]:
syscall.ByteSliceFromString(0x4c8a13, 0x9, 0x41aaf7, 0x800574199, 0x800574120, 0x4d, 0x4d)
        /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0x4c8a13, 0x9, 0x10, 0x4b0f60, 0x4aedc0)
        /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Open(0x4c8a13, 0x9, 0x100001, 0xc400000000, 0x40d8de, 0xc420326000, 0x8)
        /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:820 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0x800000000, 0x0, 0x0, 0xc42029ee00)
        /usr/local/go/src/os/file_unix.go:97 +0xa7
os/exec.(*Cmd).writerDescriptor(0xc420258000, 0x0, 0x0, 0x1, 0xc42026a4d0, 0xc420347e48)
        /usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stdout(0xc420258000, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc420258000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560699 [semacquire]:
os/exec.(*Cmd).Start(0xc42031a160, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:348 +0x7ac
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560559 [semacquire]:
os.newProcess(0xfc37, 0x0, 0xc4202ce450)
        /usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0xc4201843c0, 0x1d, 0x1d)
        /usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0x0, 0x0, 0xd)
        /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420240000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560649 [semacquire]:
os/exec.(*Cmd).writerDescriptor(0xc4202c4000, 0x0, 0x0, 0x1, 0xc4201188f0, 0xc42003b648)
        /usr/local/go/src/os/exec/exec.go:237 +0x665
os/exec.(*Cmd).stdout(0xc4202c4000, 0xc4202a6030, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc4202c4000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560772 [semacquire]:
syscall.ByteSliceFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x11, 0x11, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc4202565a0, 0x1d, 0x1d, 0xc4202e4390, 0x2, 0x2, 0x0, 0x0)
        /usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0xc420278028, 0x40e996, 0xc4202a1cc8)
        /usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0x2, 0x4, 0x44ab30, 0xc42017a180)
        /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0xc4202565a0, 0x1d, 0x1d)
        /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0x0, 0x0, 0xd)
        /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42024a000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560475 [semacquire]:
os.(*Process).wait(0xc42028c000, 0x0, 0xc42031a000, 0x0)
        /usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42028c000, 0xc420310f08, 0x0, 0x0)
        /usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc42031a000, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:12 +0x76
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560701 [semacquire]:
syscall.ByteSliceFromString(0xc4203321e0, 0xa, 0x4bb0c0, 0xc420332101, 0x100000000000a, 0x38, 0xb0)
        /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc4203321e0, 0xa, 0x8005260d0, 0xc420159d40, 0xa)
        /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Stat(0xc4203321e0, 0xa, 0xc420328038, 0xc4203321e0, 0xa)
        /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1130 +0x3f
os.Stat(0xc4203321e0, 0xa, 0x2, 0x2, 0xc4203321e0, 0xa)
        /usr/local/go/src/os/file_unix.go:169 +0x7d
os/exec.findExecutable(0xc4203321e0, 0xa, 0x2, 0xc4203321e0)
        /usr/local/go/src/os/exec/lp_unix.go:20 +0x39
os/exec.LookPath(0x4c8239, 0x4, 0x4c8239, 0x4, 0x1, 0x0)
        /usr/local/go/src/os/exec/lp_unix.go:53 +0x188
os/exec.Command(0x4c8239, 0x4, 0x0, 0x0, 0x0, 0xc420232420)
        /usr/local/go/src/os/exec/exec.go:135 +0x1ff
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:10 +0x55
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560650 [semacquire]:
os.NewFile(0x12, 0x4c8a13, 0x9, 0x0)
        /usr/local/go/src/os/file_unix.go:56 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0xc400000000, 0xc420374d70, 0x40eb85, 0xc420254230)
        /usr/local/go/src/os/file_unix.go:123 +0xff
os/exec.(*Cmd).writerDescriptor(0xc420178000, 0x0, 0x0, 0xc420254238, 0xc420268020, 0xc420268020)
        /usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stderr(0xc420178000, 0xc420268020, 0x1, 0x1)
        /usr/local/go/src/os/exec/exec.go:228 +0x4e
os/exec.(*Cmd).Start(0xc420178000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
     4006.16 real      3508.55 user     29207.58 sys
Contributor

stevenh commented Dec 1, 2016

Here's the selection of errors it triggered:

fatal error: all goroutines are asleep - deadlock!

fatal error: workbuf is empty

runtime: nelems=256 nfree=233 nalloc=23 previous allocCount=18 nfreed=65531
fatal error: sweep increased allocation count

runtime: failed MSpanList_Remove 0x800698500 0x800b46d40 0x53adb0 0x53ada0
fatal error: MSpanList_Remove

These are results from Go 1.7.3 on FreeBSD 11.0.

Contributor

stevenh commented Dec 2, 2016

Had it hang on a run last night; here's a full trace with GOTRACEBACK=crash, so it includes the runtime routines. I also have the crash dump if it's of use:
https://gist.github.com/stevenh/4b6d5bde0b81503c1a4bca461a5322ae

derekmarcotte commented Dec 2, 2016 edited

Just wanted to give a quick status-update:

As has been mentioned, the fork appears to be critical. In testing, I've removed all the pipe plumbing in the parent, and exited the child immediately after fork (using SYS_EXIT), for my /doesnotexist example. This reliably crashes the process with panics similar to those above. With the fork removed, everything is fine. This implies that the exec syscall is not implicated.

What is surprising to me is the nature of the call: it is surprising that a child might affect a parent process, but if the syscall itself is to blame, that is also surprising.

I can make this work available if anyone is interested.

I feel that my next step is to circumvent the regular runtime path that /doesnotexist takes, and write a second test using, say, the RawSyscall interface alone.
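
For reference, a fork-only diagnostic along those lines could look roughly like the sketch below (assumptions: FreeBSD/amd64, where the syscall package exposes SYS_FORK and SYS_EXIT and fork returns 0 in the child; it deliberately bypasses os/exec and is only meant to reproduce the corruption, not to be real code):

package main

import "syscall"

// forkOnly forks and has the child exit immediately, with no exec and no
// pipe plumbing in the parent.
func forkOnly() error {
        syscall.ForkLock.Lock()
        pid, _, errno := syscall.RawSyscall(syscall.SYS_FORK, 0, 0, 0)
        if errno != 0 {
                syscall.ForkLock.Unlock()
                return errno
        }
        if pid == 0 {
                // Child: only raw syscalls are safe here; exit straight away.
                syscall.RawSyscall(syscall.SYS_EXIT, 0, 0, 0)
        }
        syscall.ForkLock.Unlock()

        // Parent: reap the child so zombies don't accumulate.
        var ws syscall.WaitStatus
        _, err := syscall.Wait4(int(pid), &ws, 0, nil)
        return err
}

func main() {
        for i := 0; i < 8; i++ {
                go func() {
                        for {
                                if err := forkOnly(); err != nil {
                                        panic(err)
                                }
                        }
                }()
        }
        select {}
}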

Contributor

stevenh commented Dec 2, 2016

I'd add that it also seems to need GC: with GOGC=100 it always takes many hours here, with GOGC=2-5 it's usually between 20 minutes and 1 hour, and I've never had it happen with GOGC=off.

I'm going to leave a GOGC=off test running over the weekend to try and confirm that.

Perhaps the next step is to stop garbage collection as part of ForkLock? I'm going to play in here for a bit.

Contributor

stevenh commented Dec 4, 2016

I thought the same so I'm running the forks with GC disabled using GOGC=off and then manually calling GC via debug.FreeOSMemory().

No results yet, but here's the code for reference.

This adds additional synchronisation due to the use of persistent worker goroutines, which might also have an impact, but we'll have to see. It will be an interesting data point either way.

package main

import (
        "fmt"
        "os/exec"
        "runtime"
        "runtime/debug"
        "time"
)

var (
        gcPeriod     = time.Second * 10
        forkRoutines = 16
)

func run(work, done chan struct{}) {
        for range work {
                cmd := exec.Command("/usr/bin/true")
                cmd.Start()
                cmd.Wait()

                done <- struct{}{}
        }
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines)
        work := make(chan struct{}, forkRoutines)

        for i := 0; i < forkRoutines; i++ {
                work <- struct{}{}
                go run(work, done)
        }

        for {
                start := time.Now()
                forks := 0
                active := forkRoutines
        forking:
                for range done {
                        forks++
                        if time.Since(start) > gcPeriod {
                                active--
                                if active == 0 {
                                        break forking
                                }
                        } else {
                                work <- struct{}{}
                        }
                }

                fmt.Println("forks:", forks)
                debug.FreeOSMemory()

                for i := 0; i < forkRoutines; i++ {
                        work <- struct{}{}
                }
        }
}
Contributor

stevenh commented Dec 5, 2016 edited

Here's an alternative version which is as close to the known bad version as I think we can get without editing the runtime:

package main

import (
        "fmt"
        "os/exec"
        "runtime"
        "time"
)

var (
        gcPeriod     = time.Second * 10
        forkRoutines = 16
)

func run(done chan struct{}) {
        cmd := exec.Command("/usr/bin/true")
        cmd.Start()
        cmd.Wait()

        done <- struct{}{}
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines*2)

        for i := 0; i < forkRoutines; i++ {
                go run(done)
        }

        for {
                start := time.Now()
                active := forkRoutines
        forking:
                for range done {
                        if time.Since(start) > gcPeriod {
                                active--
                                if active == 0 {
                                        break forking
                                }
                        } else {
                                go run(done)
                        }
                }

                runtime.GC()

                for i := 0; i < forkRoutines; i++ {
                        go run(done)
                }
        }
}

I've been running these here for a few hours now with no errors; with GOGC=5 I usually see an error within 20 minutes to 1 hour.

This seems to indicate the issue may be centered around GC kicking in while there is a fork in progress.

Contributor

stevenh commented Dec 5, 2016 edited

Both sets of code have now been running for 12 hours with no errors, so it's looking more like this is indeed down to the GC interacting with active forks, causing corruption.

Owner

bradfitz commented Dec 5, 2016

@stevenh,

and then manually calling GC via debug.FreeOSMemory()

That doesn't do what you think it does. That's not how you manually call a GC. That would be runtime.GC(), but calling runtime.GC() can be super slow. (it does a full GC while blocking the process)

Just disable & reenable with:

   old := debug.SetGCPercent(0) // now disabled.
   // forklock..fork..exec...unlock
   debug.SetGCPercent(old) // now GC is re-enabled.
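
Applied to the test's run function, that pattern would look roughly like this (a sketch only; the runtime/debug docs describe a negative percentage as disabling collection entirely, and with several goroutines forking concurrently one goroutine's restore races with another's disable, which is the coordination problem discussed further down):

old := debug.SetGCPercent(-1) // a negative value disables collection
err := cmd.Start()            // the fork+exec happens here
debug.SetGCPercent(old)       // restore the previous setting
if err == nil {
        cmd.Wait()
}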
Contributor

stevenh commented Dec 5, 2016

I did actually check that, and saw that debug.FreeOSMemory() calls freeOSMemory, which in turn is:

func runtime_debug_freeOSMemory() {
        gcStart(gcForceBlockMode, false)
        systemstack(func() { mheap_.scavenge(-1, ^uint64(0), 0) })
}

Whereas runtime.GC() is:

func GC() {
        gcStart(gcForceBlockMode, false)
}

So it looks like FreeOSMemory does do a full GC, plus an attempt to get the OS to clean up. This is reinforced by the docs for FreeOSMemory:

FreeOSMemory forces a garbage collection followed by an attempt to return as much memory to the operating system as possible. (Even if this is not called, the runtime gradually returns memory to the operating system in a background task.)

I've also confirmed that memory usage is kept in check, which is not the case with just GOGC=off.

Am I really mistaken?

I could use

old := debug.SetGCPercent(0)
...
// Force GC
debug.SetGCPercent(old)
debug.SetGCPercent(0)

It's also worth noting that SetGCPercent directly calls runtime.GC(), so it would still result in a stop-the-world GC?

Contributor

stevenh commented Dec 5, 2016 edited

Also worth noting, @bradfitz, that while the first version used debug.FreeOSMemory(), the second uses runtime.GC(), as I was interested in the difference.

Owner

bradfitz commented Dec 5, 2016

I guess I missed or don't understand why you want to force a GC. Or why you'd want to return memory to the operating system.

I thought the whole point was to stop the GC so it's not running during a fork. That's much lighter weight than running a GC.

Contributor

stevenh commented Dec 5, 2016 edited

The idea of the tests is to confirm that we can fork without error if the GC was disabled.

As the original test used multiple routines doing the forking, to force the issue quicker, I wanted to maintain that. Given this, it seemed easier to just do a stop-the-world GC than to try to coordinate ensuring that GC was stopped across all the routines as they fork.

The first version used debug.FreeOSMemory() in case freeing memory to the OS was a contributing factor to the issue.

In the second version I switched to runtime.GC() to see if removing that factor and allowing the background free made any difference.

I did consider using debug.SetGCPercent(X), but after looking at it and noticing it actually calls runtime.GC() anyway, I concluded it would be almost identical to calling runtime.GC() while all the forking routines were blocked, which, as mentioned previously, was easier to coordinate.

Both test versions have now been running for 24 hours, so it seems highly likely that GC while forking is the cause of this issue.

I'll leave them running for now, but I need a way to confirm for sure and also to narrow down the exact piece of code which is causing the issue.

Any ideas gratefully received.

Contributor

ianlancetaylor commented Dec 5, 2016

Given that we believe that this only occurs on FreeBSD, and only occurs when running on real hardware, I would guess that the problem is not the GC as such, but rather the fact that another thread in the parent process is doing a lot of reads and writes to memory pages that are also being read and written by the child process.

Contributor

stevenh commented Dec 6, 2016 edited

Hmm why do you say it only occurs on real hardware @ianlancetaylor?

Reading back through the thread I see @kaey mention that telegraf crashes for him on VMware too, so I'm interested to know if there's another piece of the puzzle there.

I'm wondering if PCID may be at play here...

Member

minux commented Dec 6, 2016

Contributor

stevenh commented Dec 6, 2016

That's exactly what I just tried; unfortunately not :(

Contributor

ianlancetaylor commented Dec 6, 2016

@stevenh I said "real hardware" based on @reezer 's comment in #15658 (comment). My apologies if I misunderstood.

Contributor

stevenh commented Dec 6, 2016

In an attempt to produce a minimal reproduction I've tried creating a C version, but currently no joy.

Capturing here for reference.

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>

#define MEMSIZE 10240
#define RANDMAX 4
#define THREADS 16

#define handle_error_en(en, msg) \
        do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

#define handle_error(msg) \
        do { perror(msg); exit(EXIT_FAILURE); } while (0)

struct thread_info {
        pthread_t thread_id;
        int thread_num;
};

void
check(int val, int *i, int *mem)
{
        int j;

        if (*i != val) {
                printf("i: %d\n", *i);
                exit(EXIT_FAILURE);
        }

        for (j = 0; j < MEMSIZE; j++) {
                if (mem[j] != val) {
                        printf("mem: %d = %d\n", j, mem[j]);
                        exit(EXIT_FAILURE);
                }
        }
}

void
randomise(int val, int *i, int *mem)
{
        struct timespec ts;

        ts.tv_sec = 0;
        srandom((unsigned long)val);

        for (; val < RANDMAX; val+=2) {
                *i = val;
                for (int j = 0; j < MEMSIZE; j++) {
                        mem[j] = val;
                }

                check(val, i, mem);
        }
}

void
dofork()
{
        pid_t pid;
        int i;
        int mem[MEMSIZE];

        i = 0;
        for (int j = 0; j < MEMSIZE; j++) {
                mem[j] = 0;
        }

        pid = fork();
        switch (pid) {
        case -1:
                handle_error("fork");
        case 0: {
                // child
                randomise(1, &i, mem);
                exit(execv("/usr/bin/true", NULL));
        }
        default: {
                int status;
                struct rusage ru;
                // parent
                randomise(2, &i, mem);

                // wait6 followed by wait4 as that's what golang does        
                wait6(P_PID, pid, 0, WNOWAIT|WEXITED, 0, 0);
                wait4(pid, &status, 0, &ru);
        }}
}

static void *
thread_start(void *arg)
{
        struct thread_info *tinfo = arg;
        char info[1024], *ret;

        sprintf(info, "thread %d\n", tinfo->thread_num);

        ret = strdup(info);
        if (ret == NULL) {
                handle_error("strdup");
        }

        while (1==1) {
                dofork();
        }

        return ret;
}

int
main(int argc, char**argv)
{
        int s, tnum;
        struct thread_info *tinfo;
        pthread_attr_t attr;
        void *res;

        s = pthread_attr_init(&attr);
        if (s != 0) {
                handle_error_en(s, "pthread_attr_init");
        }

        tinfo = calloc(THREADS, sizeof(struct thread_info));
        if (tinfo == NULL) {
                handle_error("calloc");
        }

        for (tnum = 0; tnum < THREADS; tnum++) {
                tinfo[tnum].thread_num = tnum + 1;
                s = pthread_create(&tinfo[tnum].thread_id, &attr, &thread_start, &tinfo[tnum]);
                if (s != 0) {
                        handle_error_en(s, "pthread_create");
                }
        }

        s = pthread_attr_destroy(&attr);
        if (s != 0) {
                handle_error_en(s, "pthread_attr_destroy");
        }

        for (tnum = 0; tnum < THREADS; tnum++) {
                s = pthread_join(tinfo[tnum].thread_id, &res);
                if (s != 0) {
                        handle_error_en(s, "pthread_join");
                }
                printf("joined with thread %d; returned value was %s\n", tinfo[tnum].thread_num, (char *)res);
                free(res);
        }

        free(tinfo);
        exit(EXIT_SUCCESS);
}
Contributor

stevenh commented Dec 6, 2016

I've been looking through the FreeBSD-specific code in this area and I noticed that newosproc uses raw sigprocmask.

If it's possible that newosproc can be called from multiple threads at the same time, then it's possible that signals which are meant to be blocked during thr_new could be re-enabled unexpectedly, causing undefined behaviour.

This is why the SIGPROCMASK(2) man page states:
In threaded applications, pthread_sigmask(3) must be used instead of sigprocmask().

This may be unrelated but...

sean- commented Mar 8, 2017 edited

Well, hold on. The call to cpu.Info() does fork/exec three times, but cpu.Info() isn't being called from the goroutine doit(), it's only being used before the goroutine is spawned.

@tmm1 does removing the call to cpu.Info() before doit change the abort behavior in your environment? My suspicion is that it will, and the program will take a while to crash.

What @tmm1 is reporting is interesting to me because his test case is reproducible within seconds versus days. Without digging in, what stands out to me is that this test case has two concurrent fork calls happening at the start of the program, and it's calling a real program.

I'm not sure how this is different from @stevenh's snippet, but there's probably something in syscall.ForkOnlyBSDTest() that's different from the fork/exec call to /sbin/sysctl.

Contributor

stevenh commented Mar 8, 2017

The trace indicates the issue is in the init() of https://github.com/fancybits/gopsutil/blob/arm-mhz-fix/cpu/cpu.go#L59

The frequency of the panic is interesting.

@tmm1 could you post your spawntest script please, so we can try to repro?

Contributor

stevenh commented Mar 8, 2017

Ignore the above: I've reproduced it and it is very quick, so definitely very interesting.

tmm1 commented Mar 8, 2017 edited

You're actually hitting the fork issue using gopsutil, as it uses the sysctl command instead of just the syscall, which is silly, so for your case switching to raw syscalls would likely fix your issue.

Indeed, I'm not sure why it's using fork/exec instead of the syscall. Switching to the syscall should be an effective workaround.
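
For reference, the raw interface reads the value without forking at all; a minimal sketch ("hw.model" is just an example MIB name, not necessarily the one gopsutil needs):

// syscall.Sysctl returns string-typed sysctl values directly, with no
// /sbin/sysctl child process involved.
value, err := syscall.Sysctl("hw.model")
if err != nil {
        // handle the error
}
fmt.Println(value)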

@tmm1 does removing the call to cpu.Info() before doit change the abort behavior in your environment? My suspicion is that it will, and the program will take a while to crash.

Removing the cpu.Info() call makes the program take longer to crash: a few minutes instead of seconds.

The trace indicates the issue is in the init() of https://github.com/fancybits/gopsutil/blob/arm-mhz-fix/cpu/cpu.go#L59

Indeed, it almost always happens in cpu.init()

@tmm1 could you post your spawntest script please, so we can try to repro?

spawntest is what I called the compiled Go binary built from the source posted above.

I also tried this morning to run the same binary on a fresh FreeBSD VM, using the freebsd/FreeBSD-10.3-RELEASE vagrant box. I cannot seem to reproduce the crash there (at least not in the same short amount of time that I tried). The kernel there is:

FreeBSD fbsd 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

However, the FreeNAS 9.10 VM (running on the same macOS host) crashes readily:

FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 root@build.ixsystems.com:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64

My Go app, which I originally saw this memory corruption bug in, is designed to be run on FreeNAS appliances, and at least half a dozen users have reported the same crash on various pieces of hardware, all running FreeNAS 9.10. (The program auto-updates itself by downloading a new version, then calling syscall.Exec to replace itself. The crash always happens during cpu.init() after the new process takes over. Here's a gdb trace from a production crash, which usually exhibits as a SIGILL: https://gist.github.com/tmm1/86cd869d1bfdae66e4058698dd30aee9)

Contributor

rsc commented Mar 8, 2017

Anyone who can make it crash reliably, please try changing go/src/os/wait_wait6.go's func (p *Process) blockUntilWaitable() (bool, error) to have as its very first line:

return false, nil

That will fall back to not using the wait6 system call at all. If the crash still happens, then we know that wait6 is not the problem. If the crash stops, then we have somewhere to look. Thanks.
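
Concretely, the edit is just this (a sketch; the existing wait6-based body stays in place below the new first line and simply becomes unreachable):

// go/src/os/wait_wait6.go
func (p *Process) blockUntilWaitable() (bool, error) {
        return false, nil // skip wait6; (*Process).wait will use wait4 as before
        // ... existing implementation, now unreachable ...
}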

Contributor

rsc commented Mar 8, 2017

(And then of course 'go install os')

tmm1 commented Mar 8, 2017

Anyone who can make it crash reliably, please try changing go/src/os/wait_wait6.go's func (p *Process) blockUntilWaitable() (bool, error) to have as its very first line:

Made this change and my repro still crashed quickly:

fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]

runtime stack:
runtime.startpanic_m()
	/home/vagrant/go/src/runtime/panic.go:653 +0x19a
runtime.systemstack(0x525fc0)
	/home/vagrant/go/src/runtime/asm_amd64.s:343 +0xab
runtime.startpanic()
	/home/vagrant/go/src/runtime/panic.go:569 +0x1e
runtime.throw(0x5248d9, 0x2a)
	/home/vagrant/go/src/runtime/panic.go:595 +0x88
runtime.sigpanic()
	/home/vagrant/go/src/runtime/signal_unix.go:274 +0x2db
runtime.usleep(0x6e69622f7273752f, 0x2f3a6e6962732f3a, 0x6e6962732f727375, 0xc420070750, 0x2d, 0xc4200707b0, 0x2d, 0xc420070810, 0x21, 0xc420094820, ...)
	/home/vagrant/go/src/runtime/sys_freebsd_amd64.s:276 +0x40

goroutine 1 [runnable]:
syscall.Syscall6(0x7, 0x13612, 0xc420035a7c, 0x0, 0xc420074090, 0x0, 0x0, 0x13612, 0x0, 0x0)
	/home/vagrant/go/src/syscall/asm_unix_amd64.s:42 +0x5
syscall.wait4(0x13612, 0xc420035a7c, 0x0, 0xc420074090, 0x90, 0x516b00, 0x4bc401)
	/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:34 +0x88
syscall.Wait4(0x13612, 0xc420035acc, 0x0, 0xc420074090, 0xc420035b80, 0xc420014360, 0x0)
	/home/vagrant/go/src/syscall/syscall_bsd.go:129 +0x51
os.(*Process).wait(0xc420014360, 0x40e042, 0xc42000e660, 0xc420014300)
	/home/vagrant/go/src/os/exec_unix.go:38 +0xbb
os.(*Process).Wait(0xc420014360, 0x0, 0x0, 0x525c00)
	/home/vagrant/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc4200686e0, 0x0, 0x0)
	/home/vagrant/go/src/os/exec/exec.go:435 +0x62
os/exec.(*Cmd).Run(0xc4200686e0, 0xc42004a0f0, 0x2)
	/home/vagrant/go/src/os/exec/exec.go:280 +0x5c
os/exec.(*Cmd).Output(0xc4200686e0, 0xc, 0xc420035cb8, 0x2, 0x2, 0xc4200686e0)
	/home/vagrant/go/src/os/exec/exec.go:474 +0x11c
github.com/fancybits/gopsutil/internal/common.DoSysctrl(0x51e5bc, 0x7, 0x40, 0x40a1ec0000000000, 0x0, 0x0, 0x0)
	/home/vagrant/.go/src/github.com/fancybits/gopsutil/internal/common/common_freebsd.go:22 +0x120
github.com/fancybits/gopsutil/cpu.Info(0x51f162, 0xb, 0x0, 0x0, 0x0)
	/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:118 +0x1e4
main.main()
	/home/vagrant/spawntest/main.go:28 +0x67

goroutine 8 [runnable]:
syscall.Syscall(0x3, 0x6, 0xc4200da000, 0x8000, 0x0, 0x8000, 0x0)
	/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x6, 0xc4200da000, 0x8000, 0x8000, 0x80059e3a0, 0x402001, 0x10100c42001de08)
	/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x6, 0xc4200da000, 0x8000, 0x8000, 0x800598000, 0x0, 0xc42001de28)
	/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0x4f45e0, 0xc42001de01, 0xc4200da000)
	/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0x8000, 0x8000, 0x403fc0)
	/home/vagrant/go/src/os/file.go:101 +0x76
io.copyBuffer(0x599620, 0xc42004a0f0, 0x5994a0, 0xc42000c088, 0xc4200da000, 0x8000, 0x8000, 0xc4200a44d8, 0x0, 0x0)
	/home/vagrant/go/src/io/io.go:390 +0x100
io.Copy(0x599620, 0xc42004a0f0, 0x5994a0, 0xc42000c088, 0x4f3660, 0xc4200a4480, 0xc42001dfb0)
	/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc4200a4480)
	/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc4200686e0, 0xc42000a600)
	/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
	/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4

goroutine 7 [runnable]:
syscall.Syscall(0x3, 0x4, 0xc420072400, 0x200, 0x1, 0x200, 0x0)
	/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0x4, 0xc420072400, 0x200, 0x200, 0xc420024400, 0x8005984b0, 0x0)
	/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:893 +0x55
syscall.Read(0x4, 0xc420072400, 0x200, 0x200, 0x42561f, 0x5259e8, 0xc42001e5c0)
	/home/vagrant/go/src/syscall/syscall_unix.go:162 +0x49
os.(*File).read(0xc42000c070, 0xc420072400, 0x200, 0x200, 0x4a4e2b, 0x4f45e0, 0x200)
	/home/vagrant/go/src/os/file_unix.go:165 +0x6a
os.(*File).Read(0xc42000c070, 0xc420072400, 0x200, 0x200, 0x42e5be, 0xc4200a8680, 0xc42001e660)
	/home/vagrant/go/src/os/file.go:101 +0x76
bytes.(*Buffer).ReadFrom(0xc420048230, 0x5994a0, 0xc42000c070, 0x800638028, 0xc420048230, 0xc42001e701)
	/home/vagrant/go/src/bytes/buffer.go:179 +0x160
io.copyBuffer(0x5992a0, 0xc420048230, 0x5994a0, 0xc42000c070, 0x0, 0x0, 0x0, 0xc4200a44d8, 0x0, 0x0)
	/home/vagrant/go/src/io/io.go:384 +0x2cb
io.Copy(0x5992a0, 0xc420048230, 0x5994a0, 0xc42000c070, 0x4f3660, 0xc4200a4480, 0xc42001e7b0)
	/home/vagrant/go/src/io/io.go:360 +0x68
os/exec.(*Cmd).writerDescriptor.func1(0x4f3660, 0xc4200a4480)
	/home/vagrant/go/src/os/exec/exec.go:254 +0x4d
os/exec.(*Cmd).Start.func1(0xc4200686e0, 0xc42000a580)
	/home/vagrant/go/src/os/exec/exec.go:371 +0x27
created by os/exec.(*Cmd).Start
	/home/vagrant/go/src/os/exec/exec.go:372 +0x4e4
Contributor

rsc commented Mar 8, 2017

OK, that's good progress. Thanks. @tmm1, in the trace it says pc=0x451730. What is that pc in the binary? (for example, go tool objdump yourprog | grep -C 10 0x451730)

tmm1 commented Mar 8, 2017

$ go tool objdump spawntest | grep -C 10 0x451730
	sys_freebsd_amd64.s:267	0x45170b	48890424	MOVQ AX, 0(SP)
	sys_freebsd_amd64.s:268	0x45170f	b8e8030000	MOVL $0x3e8, AX
	sys_freebsd_amd64.s:269	0x451714	f7e2		MULL DX
	sys_freebsd_amd64.s:270	0x451716	4889442408	MOVQ AX, 0x8(SP)
	sys_freebsd_amd64.s:272	0x45171b	4889e7		MOVQ SP, DI
	sys_freebsd_amd64.s:273	0x45171e	31f6		XORL SI, SI
	sys_freebsd_amd64.s:274	0x451720	b8f0000000	MOVL $0xf0, AX
	sys_freebsd_amd64.s:275	0x451725	0f05		SYSCALL
	sys_freebsd_amd64.s:276	0x451727	488b6c2410	MOVQ 0x10(SP), BP
	sys_freebsd_amd64.s:276	0x45172c	4883c418	ADDQ $0x18, SP
	sys_freebsd_amd64.s:276	0x451730	c3		RET

TEXT runtime.settls(SB) /home/vagrant/go/src/runtime/sys_freebsd_amd64.s
	sys_freebsd_amd64.s:279	0x451740	4883ec10		SUBQ $0x10, SP
	sys_freebsd_amd64.s:279	0x451744	48896c2408		MOVQ BP, 0x8(SP)
	sys_freebsd_amd64.s:279	0x451749	488d6c2408		LEAQ 0x8(SP), BP
	sys_freebsd_amd64.s:280	0x45174e	4883c708		ADDQ $0x8, DI
	sys_freebsd_amd64.s:281	0x451752	48893c24		MOVQ DI, 0(SP)
	sys_freebsd_amd64.s:282	0x451756	4889e6			MOVQ SP, SI
	sys_freebsd_amd64.s:283	0x451759	48c7c781000000		MOVQ $0x81, DI
	sys_freebsd_amd64.s:284	0x451760	48c7c0a5000000		MOVQ $0xa5, AX

tmm1 commented Mar 8, 2017

Ignore the above: I've reproduced it and it is very quick, so definitely very interesting.

What version of FreeBSD did you repro on?

I'm still confused as to why my FreeBSD VM does not reproduce, but my FreeNAS VM does.

I noticed there was an update available to FreeNAS (9.10 -> 9.10.2). I updated, and the crash still happens.

before:

FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 455d13d(9.10-STABLE): Sun Jun 26 22:47:03 PDT 2016 root@build.ixsystems.com:/tank/home/nightlies/build-freenas9/_BE/objs/tank/home/nightlies/build-freenas9/_BE/trueos/sys/FreeNAS.amd64 amd64

after:

FreeBSD freenas.local 10.3-STABLE FreeBSD 10.3-STABLE #0 41eb257(9.10.2-STABLE): Mon Mar 6 17:03:14 UTC 2017 root@gauntlet:/freenas-9.10-releng/_BE/objs/freenas-9.10-releng/_BE/os/sys/FreeNAS.amd64 amd64

Contributor

stevenh commented Mar 8, 2017

@tmm1 FreeBSD 11.0-RELEASE no VM.

tmm1 commented Mar 8, 2017

Here's a different trace I got just now. Clearly some nasty memory corruption going on. It's almost as if the RawSyscall wrapper is wrong or we're using vfork instead of fork (neither of which appears to be the case).

runtime: nelems=6 nfree=4 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count

runtime stack:
runtime.throw(0x5231bf, 0x20)
	/home/vagrant/go/src/runtime/panic.go:596 +0x95
runtime.(*mspan).sweep(0x80059cda8, 0x80059cd00, 0xc420000b60)
	/home/vagrant/go/src/runtime/mgcsweep.go:288 +0x786
runtime.sweepone(0x42d876)
	/home/vagrant/go/src/runtime/mgcsweep.go:110 +0x1ad
runtime.gosweepone.func1()
	/home/vagrant/go/src/runtime/mgcsweep.go:125 +0x2b
runtime.systemstack(0xc420016000)
	/home/vagrant/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
	/home/vagrant/go/src/runtime/proc.go:1132

goroutine 3 [running]:
runtime.systemstack_switch()
	/home/vagrant/go/src/runtime/asm_amd64.s:281 fp=0xc420020f68 sp=0xc420020f60
runtime.gosweepone(0x0)
	/home/vagrant/go/src/runtime/mgcsweep.go:126 +0x53 fp=0xc420020f98 sp=0xc420020f68
runtime.bgsweep(0xc420048000)
	/home/vagrant/go/src/runtime/mgcsweep.go:59 +0xbb fp=0xc420020fd8 sp=0xc420020f98
runtime.goexit()
	/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420020fe0 sp=0xc420020fd8
created by runtime.gcenable
	/home/vagrant/go/src/runtime/mgc.go:212 +0x61

goroutine 1 [runnable]:
regexp.(*bitState).reset(0xc4203737a0, 0x38, 0xc)
	/home/vagrant/go/src/regexp/backtrack.go:95 +0xff
regexp.(*machine).backtrack(0xc420101680, 0x59a460, 0xc420101758, 0x0, 0x38, 0xc, 0x4f)
	/home/vagrant/go/src/regexp/backtrack.go:321 +0xa1
regexp.(*Regexp).doExecute(0xc42036fcc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc420098440, 0x38, 0x0, 0xc, ...)
	/home/vagrant/go/src/regexp/exec.go:439 +0x33a
regexp.(*Regexp).FindStringSubmatch(0xc42036fcc0, 0xc420098440, 0x38, 0x0, 0x0, 0x0)
	/home/vagrant/go/src/regexp/regexp.go:919 +0xec
github.com/fancybits/gopsutil/cpu.Info(0x51f162, 0xb, 0x0, 0x0, 0x0)
	/home/vagrant/.go/src/github.com/fancybits/gopsutil/cpu/cpu_freebsd.go:133 +0x3ef
main.main()
	/home/vagrant/spawntest/main.go:28 +0x67
Contributor

stevenh commented Mar 8, 2017

OK, this is interesting: I just got a crash even with GOGC=off and ktrace, e.g.
time sh -c 'export GOGC=off; export GOTRACEBACK=crash; ktrace -i -f ktrace-test.out ./test '

Contributor

rsc commented Mar 8, 2017

And?

tmm1 commented Mar 8, 2017 edited

Here's an updated version of the reproduction program, now with no external dependencies or goroutines:

package main

import (
	"bufio"
	"log"
	"os"
	"os/exec"
	"regexp"
	"syscall"
	"time"
)

func init() {
	exec.Command("true").Run()

	// do some work after the fork/exec, to expose memory corruption if it occurred
	f, _ := os.Open("/var/run/dmesg.boot")
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		regexp.MustCompile(`Origin\s*=\s*"(.+)"\s+Id\s*=\s*(.+)\s+Family\s*=\s*(.+)\s+Model\s*=\s*(.+)\s+Stepping\s*=\s*(.+)`).FindStringSubmatch(line)
	}
}

func main() {
	log.SetFlags(log.LstdFlags)
	log.Printf("started process")
	time.Sleep(1 * time.Millisecond)
	log.Printf("running exec...")
	syscall.Exec(os.Args[0], os.Args, os.Environ())
	log.Printf("exec failed!")
}

tmm1 commented Mar 8, 2017

This is interesting:

If I comment out the time.Sleep(50 * time.Millisecond) (or use time.Sleep(0)), I can no longer reproduce the crash.

If I change it to time.Sleep(1 * time.Millisecond), the crash happens reliably within a couple seconds of running.

Contributor

rsc commented Mar 8, 2017

@tmm1, nice. Sadly, still no crash for me with FreeBSD 10.1-RELEASE on a Google Cloud VM; tried both 1ms and 50ms sleeps.

I'm starting to wonder about threading behavior, especially since it seems like init is important. (I assume if it wasn't, you'd have simplified further already.)

Some things to try. First rename init to xinit, main to xmain. Then:

(1) Make sure this still crashes:

func init() { xinit() }
func main() { xmain() }

(2) Confirm that this doesn't crash:

func main() { xinit(); xmain() }

(3) See if this crashes:

func init() { runtime.LockOSThread() }
func main() { xinit(); xmain() }

(4) See if this crashes:

func init() { runtime.LockOSThread() }
func main() { xinit(); runtime.UnlockOSThread(); xmain() }

(5) See if this crashes:

func init() { runtime.LockOSThread() }
func main() { xinit(); go xmain(); select{} }

tmm1 commented Mar 8, 2017

@tmm1, nice. Sadly, still no crash for me with FreeBSD 10.1-RELEASE on a Google Cloud VM; tried both 1ms and 50ms sleeps.

Strange. One thing I just remembered is that I had trouble booting the FreeNAS VM in virtualbox and had to change some settings before it worked (drajen/freenas-vagrant#1). I will try making that same change on the FreeBSD VM to see if it causes the repro to work there.

I'm starting to wonder about threading behavior, especially since it seems like init is important. (I assume if it wasn't, you'd have simplified further already.)

Looks like the init is irrelevant, I just didn't get that far in the simplification process.

Here are the results from your tests:

  1. crashes
  2. crashes
  3. does not crash
  4. crashes
  5. crashes
Contributor

rsc commented Mar 8, 2017

Very interesting that (3) does not crash. That variant makes sure the fork and the final exec happen in the same thread (the main thread, it turns out).

Does (5) crash more reliably than the others? It forces the fork to happen in the main thread and then forces the exec to happen in a non-main thread.

Also interesting would be the reverse (fork forced in non-main thread, final exec forced in main thread):

(6)

func init() { runtime.LockOSThread() }
func main() {
    ch := make(chan int)
    go func() {
        xinit(); 
        ch <- 1
    }()
    <-ch
    xmain()
}

It's really tempting to think there is some kind of race in exec where the OS doesn't quite manage to stop all the threads in the old process before creating the new one, hence the memory fault or SIGILL trying to execute a return instruction. But I don't see where the race could be: the final exec doesn't happen until the first fork+exec is over and cleaned up. It's almost like a thread in the new process starts running and then accidentally gets torn down as if it were part of the old process.

tmm1 commented Mar 8, 2017

(6) does not crash.

(4) and (5) appear to crash more quickly than the rest.

Random sampling of timings (of course these are all highly variable):

[vagrant@freenas ~]$ time ./rsctest1
Illegal instruction

real	0m0.685s
user	0m0.213s
sys	0m0.427s

[vagrant@freenas ~]$ time ./rsctest2
Illegal instruction

real	0m8.332s
user	0m3.093s
sys	0m6.164s

[vagrant@freenas ~]$ time ./rsctest3
^C

real	0m25.811s
user	0m10.192s
sys	0m18.873s

[vagrant@freenas ~]$ time ./rsctest4
Illegal instruction

real	0m0.026s
user	0m0.010s
sys	0m0.020s

[vagrant@freenas ~]$ time ./rsctest5
Illegal instruction

real	0m0.146s
user	0m0.051s
sys	0m0.111s

[vagrant@freenas ~]$ time ./rsctest6
^C

real	0m33.222s
user	0m12.663s
sys	0m26.550s
Contributor

rsc commented Mar 8, 2017 edited

If you can, please try one more:

(7)

func init() { runtime.LockOSThread() }
func main() {
	xinit()
	ch := make(chan int)
	go func() {
		runtime.LockOSThread()
		ch <- 1
		<-ch
		xmain()
	}()
	<-ch
	runtime.UnlockOSThread()
	ch <- 1
	select {}
}
run | xinit                  | xmain                      | outcome
(1) | locked to main thread  | on any thread              | crash
(2) | on any thread          | on any thread              | crash
(3) | locked to main thread  | locked to main thread      | success
(4) | locked to main thread  | on any thread              | crash
(5) | locked to main thread  | on any non-main thread     | crash
(6) | on any non-main thread | locked to main thread      | success
(7) | locked to main thread  | locked to non-main thread  | ???

Clearly the problem goes away when xmain is locked to the main OS thread. But what's important? The fact that it's locked to a thread or the fact that it's the main OS thread? The new program (7) is halfway between (3) and (5): it runs xmain locked to a thread, just definitely not the main OS thread.

While Go certainly treats locked vs unlocked thread execution differently, it otherwise makes no distinctions or special cases for the main OS thread. If (7) succeeds, then the important detail is the Go runtime's thread locking, not the specific operating system thread, which would seem to implicate Go. If (7) crashes, then the important detail would appear to be the specific operating system thread used for the syscall.Exec, which would seem to implicate FreeBSD.

Also, if you remove the "do some work", does that affect crash frequency? It's not involved in any of these theories, nor in the SIGBUS/SIGILL above, but maybe it is important for getting the timing right.

If (7) crashes the next step is probably to try to convert it into a C program that does the fork+wait of /bin/true from the main thread and then starts a new pthread for the execve, leaving the main thread to block (maybe pthread_join on the new thread).

tmm1 commented Mar 8, 2017

(7) crashes.

Without the "do work" block, the crashes take longer to reproduce. For example, (2) took 46s and 90s to crash, and (4) took 27s and 21s to crash.

tmm1 commented Mar 8, 2017

I wrote this, but haven't been able to reproduce the crash with it yet:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>   /* for waitpid() */
#include <pthread.h>
#include <errno.h>

extern char **environ;

void *do_exec(void *arg) {
	fprintf(stdout, "running exec...\n");
	char **argv = (char**)arg;
	execve(argv[0], argv, environ);
	fprintf(stdout, "exec failed: %d!\n", errno);
	exit(1);
	return NULL;
}

int main(int argc, char *argv[]) {
	fprintf(stdout, "started process.\n");
	int pid = fork();
	if (pid == 0) { // child
		if (execve("/usr/bin/true", argv, environ) < 0) {
			fprintf(stderr, "execve(true) failed: %d\n", errno);
		}
		exit(0);
	}
	waitpid(pid, NULL, 0);

	usleep(1000);

	pthread_t thread;
	pthread_create(&thread, NULL, do_exec, argv);
	pthread_join(thread, NULL);
	return 0;
}

tmm1 commented Mar 8, 2017

I suspect that even if memory corruption were occurring in the C program, there's just not enough going on for the program to notice.

Contributor

rsc commented Mar 8, 2017

FWIW, this crash at least is not memory corruption, at least not in the "allocated memory / program heap" sense.

[signal SIGBUS: bus error code=0x3 addr=0x451730 pc=0x451730]

The fault here is trying to read from memory the instruction to be executed (addr = pc), which objdump identified as a valid PC that should have had a RET instruction mapped there. This is very strongly suggestive of the operating system doing process management wrong.

It's definitely possible that the C binary is so tiny compared to the Go binary that there are many fewer pages and less opportunity to get things wrong. Compiling with -static might be worth trying.

tmm1 commented Mar 8, 2017

I tried upgrading my 10.3-RELEASE VM to 10.3-RELEASE-p11 and still couldn't reproduce any crashes.

Then I started up a fresh new 11.0-RELEASE-p8 VM (based on @stevenh's comment earlier), and sure enough it crashes there.

FreeBSD 11.0-RELEASE-p8 FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

@rsc Maybe you can reproduce on GCE with --image freebsd-11-0-release-p1-amd64 --image-project=freebsd-org-cloud-dev

Owner

bradfitz commented Mar 8, 2017

@rsc, we have FreeBSD 11 gomote-accessible builders, too.

tmm1 commented Mar 8, 2017 edited

Behaviors on FreeBSD 11 are slightly different from my FreeNAS VM, so it's probably worth testing all the cases again.

For instance, now it crashes with and without the time.Sleep(), but only if the program uses stdout as well. I commented out the log.Printf lines and it wouldn't crash (quickly?) anymore.

Similarly, if I remove the "do work" section I can't get it to crash anymore on 11.0 either.

My original simplified repro still crashes quickly though. Definitely curious to hear if you see the same on GCE.

tmm1 commented Mar 8, 2017 edited

Here's the crash I'm seeing on FreeBSD 11:

runtime: nelems=6 nfree=4 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count

runtime stack:
runtime.throw(0x4fa91e, 0x20)
	/home/vagrant/go/src/runtime/panic.go:596 +0x95 fp=0xc42002fe58 sp=0xc42002fe38
runtime.(*mspan).sweep(0x800563b48, 0x800563b00, 0xc420000820)
	/home/vagrant/go/src/runtime/mgcsweep.go:288 +0x786 fp=0xc42002ff28 sp=0xc42002fe58
runtime.sweepone(0x42d276)
	/home/vagrant/go/src/runtime/mgcsweep.go:110 +0x1ad fp=0xc42002ff90 sp=0xc42002ff28
runtime.gosweepone.func1()
	/home/vagrant/go/src/runtime/mgcsweep.go:125 +0x2b fp=0xc42002ffb0 sp=0xc42002ff90
runtime.systemstack(0xc420016000)
	/home/vagrant/go/src/runtime/asm_amd64.s:327 +0x79 fp=0xc42002ffb8 sp=0xc42002ffb0
runtime.mstart()
	/home/vagrant/go/src/runtime/proc.go:1132 fp=0xc42002ffc0 sp=0xc42002ffb8

goroutine 3 [running]:
runtime.systemstack_switch()
	/home/vagrant/go/src/runtime/asm_amd64.s:281 fp=0xc420020f68 sp=0xc420020f60
runtime.gosweepone(0x0)
	/home/vagrant/go/src/runtime/mgcsweep.go:126 +0x53 fp=0xc420020f98 sp=0xc420020f68
runtime.bgsweep(0xc420014070)
	/home/vagrant/go/src/runtime/mgcsweep.go:59 +0xbb fp=0xc420020fd8 sp=0xc420020f98
runtime.goexit()
	/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420020fe0 sp=0xc420020fd8
created by runtime.gcenable
	/home/vagrant/go/src/runtime/mgc.go:212 +0x61

goroutine 1 [runnable]:
syscall.Syscall(0x4, 0x2, 0xc4203cae40, 0x24, 0x24, 0x24, 0x0)
	/home/vagrant/go/src/syscall/asm_unix_amd64.s:19 +0x5 fp=0xc420035cb8 sp=0xc420035cb0
syscall.write(0x2, 0xc4203cae40, 0x24, 0x40, 0x20, 0x13, 0x100000000000011)
	/home/vagrant/go/src/syscall/zsyscall_freebsd_amd64.go:1271 +0x55 fp=0xc420035d10 sp=0xc420035cb8
syscall.Write(0x2, 0xc4203cae40, 0x24, 0x40, 0x80055f000, 0x0, 0x14)
	/home/vagrant/go/src/syscall/syscall_unix.go:181 +0x49 fp=0xc420035d58 sp=0xc420035d10
os.(*File).write(0xc42000c020, 0xc4203cae40, 0x24, 0x40, 0x40, 0x14, 0x40)
	/home/vagrant/go/src/os/file_unix.go:186 +0x95 fp=0xc420035dc8 sp=0xc420035d58
os.(*File).Write(0xc42000c020, 0xc4203cae40, 0x24, 0x40, 0x23, 0xc4203cae40, 0x14)
	/home/vagrant/go/src/os/file.go:142 +0x7c fp=0xc420035e40 sp=0xc420035dc8
log.(*Logger).Output(0xc420040050, 0x2, 0xc42030ffd0, 0xf, 0x0, 0x0)
	/home/vagrant/go/src/log/log.go:168 +0x210 fp=0xc420035ee0 sp=0xc420035e40
log.Printf(0x4f7579, 0xf, 0x0, 0x0, 0x0)
	/home/vagrant/go/src/log/log.go:291 +0x80 fp=0xc420035f28 sp=0xc420035ee0
main.main()
	/home/vagrant/spawntest/main.go:28 +0x62 fp=0xc420035f88 sp=0xc420035f28
runtime.main()
	/home/vagrant/go/src/runtime/proc.go:185 +0x20a fp=0xc420035fe0 sp=0xc420035f88
runtime.goexit()
	/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420035fe8 sp=0xc420035fe0

goroutine 2 [force gc (idle)]:
runtime.gopark(0x4fd138, 0x573a30, 0x4f74a7, 0xf, 0x4fd014, 0x1)
	/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420020768 sp=0xc420020738
runtime.goparkunlock(0x573a30, 0x4f74a7, 0xf, 0xc420000114, 0x1)
	/home/vagrant/go/src/runtime/proc.go:277 +0x5e fp=0xc4200207a8 sp=0xc420020768
runtime.forcegchelper()
	/home/vagrant/go/src/runtime/proc.go:226 +0x9e fp=0xc4200207e0 sp=0xc4200207a8
runtime.goexit()
	/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200207e8 sp=0xc4200207e0
created by runtime.init.4
	/home/vagrant/go/src/runtime/proc.go:215 +0x35

goroutine 4 [finalizer wait]:
runtime.gopark(0x4fd138, 0x58e1f0, 0x4f72c1, 0xe, 0x14, 0x1)
	/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420021718 sp=0xc4200216e8
runtime.goparkunlock(0x58e1f0, 0x4f72c1, 0xe, 0x2000000014, 0x1)
	/home/vagrant/go/src/runtime/proc.go:277 +0x5e fp=0xc420021758 sp=0xc420021718
runtime.runfinq()
	/home/vagrant/go/src/runtime/mfinal.go:161 +0xb2 fp=0xc4200217e0 sp=0xc420021758
runtime.goexit()
	/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200217e8 sp=0xc4200217e0
created by runtime.createfing
	/home/vagrant/go/src/runtime/mfinal.go:142 +0x62

goroutine 5 [GC worker (idle)]:
runtime.gopark(0x4fcfd8, 0xc42030f860, 0x4f7654, 0x10, 0x14, 0x0)
	/home/vagrant/go/src/runtime/proc.go:271 +0x13a fp=0xc420021f50 sp=0xc420021f20
runtime.gcBgMarkWorker(0xc420016000)
	/home/vagrant/go/src/runtime/mgc.go:1491 +0x138 fp=0xc420021fd8 sp=0xc420021f50
runtime.goexit()
	/home/vagrant/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420021fe0 sp=0xc420021fd8
created by runtime.gcBgMarkStartWorkers
	/home/vagrant/go/src/runtime/mgc.go:1412 +0x98
Abort trap (core dumped)

If I run with GOGC=off, I eventually see the program quit with "Illegal instruction" and a coredump.

Reading symbols from spawntest...done.
[New LWP 100878]
[New LWP 100139]
[New LWP 100146]
Core was generated by `spawntest'.
Program terminated with signal SIGILL, Illegal instruction.
#0  syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
50		SYSCALL
[Current thread is 1 (LWP 100878)]
Loading Go Runtime support.

(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 100878        syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
  2    LWP 100139        runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
  3    LWP 100146        runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21

(gdb) thread apply all bt
Thread 3 (LWP 100146):
#0  runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
#1  0x0000000000423b3d in runtime.futexsleep1 (addr=0xc420024518, val=0, ns=-1) at /home/vagrant/go/src/runtime/os_freebsd.go:92
#2  0x000000000044b5ea in runtime.futexsleep.func1 () at /home/vagrant/go/src/runtime/os_freebsd.go:80
#3  0x000000000044cf3b in runtime.systemstack () at /home/vagrant/go/src/runtime/asm_amd64.s:343
#4  0x0000000000423acd in runtime.futexsleep (addr=0xc420024518, val=0, ns=-1) at /home/vagrant/go/src/runtime/os_freebsd.go:81
#5  0x000000000040c68b in runtime.notesleep (n=0xc420024518) at /home/vagrant/go/src/runtime/lock_futex.go:145
#6  0x000000000042b2cd in runtime.stopm () at /home/vagrant/go/src/runtime/proc.go:1650
#7  0x000000000042c0e4 in runtime.findrunnable (gp#10=0xc420016000, inheritTime=false) at /home/vagrant/go/src/runtime/proc.go:2102
#8  0x000000000042cc7c in runtime.schedule () at /home/vagrant/go/src/runtime/proc.go:2222
#9  0x000000000042cf7b in runtime.park_m (gp=0xc420000680) at /home/vagrant/go/src/runtime/proc.go:2285
#10 0x000000000044ce6b in runtime.mcall () at /home/vagrant/go/src/runtime/asm_amd64.s:269
#11 0x000000c420016000 in ?? ()
#12 0x0000000000000000 in ?? ()

Thread 2 (LWP 100139):
#0  runtime.sys_umtx_op () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:21
#1  0x0000000000423cae in runtime.futexwakeup (addr=0xc420024518, cnt=1) at /home/vagrant/go/src/runtime/os_freebsd.go:102
#2  0x000000000040c5a6 in runtime.notewakeup (n=0xc420024518) at /home/vagrant/go/src/runtime/lock_futex.go:135
#3  0x000000000042b4bd in runtime.startm (_p_=0xc420016000, spinning=false) at /home/vagrant/go/src/runtime/proc.go:1713
#4  0x000000000042b635 in runtime.handoffp (_p_=0xc420016000) at /home/vagrant/go/src/runtime/proc.go:1725
#5  0x000000000043151d in runtime.retake (now=4278381200320, ~r1=0) at /home/vagrant/go/src/runtime/proc.go:3909
#6  0x000000000043113a in runtime.sysmon () at /home/vagrant/go/src/runtime/proc.go:3836
#7  0x000000000042a39e in runtime.mstart1 () at /home/vagrant/go/src/runtime/proc.go:1179
#8  0x000000000042a274 in runtime.mstart () at /home/vagrant/go/src/runtime/proc.go:1149
#9  0x0000000000450927 in runtime.thr_start () at /home/vagrant/go/src/runtime/sys_freebsd_amd64.s:45
#10 0x0000000000000000 in ?? ()

Thread 1 (LWP 100878):
#0  syscall.Syscall6 () at /home/vagrant/go/src/syscall/asm_unix_amd64.s:50
#1  0x0000000000478abb in os.(*Process).blockUntilWaitable (p=0xc420012330, ~r0=96, ~r1=...) at /home/vagrant/go/src/os/wait_wait6.go:29
#2  0x00000000004763bd in os.(*Process).wait (p=0xc420012330, ps=0x8, err=...) at /home/vagrant/go/src/os/exec_unix.go:22
#3  0x000000000047594b in os.(*Process).Wait (p=0xc420012330, ~r0=0x0, ~r1=...) at /home/vagrant/go/src/os/exec.go:115
#4  0x000000000049fc52 in os/exec.(*Cmd).Wait (c=0xc420064000, ~r0=...) at /home/vagrant/go/src/os/exec/exec.go:435
#5  0x000000000049f08c in os/exec.(*Cmd).Run (c=0xc420064000, ~r0=...) at /home/vagrant/go/src/os/exec/exec.go:280
#6  0x00000000004be011 in main.init.1 () at /home/vagrant/spawntest/main.go:14
#7  0x00000000004be347 in main.init ()
#8  0x000000000042747a in runtime.main () at /home/vagrant/go/src/runtime/proc.go:173
#9  0x000000000044f9d1 in runtime.goexit () at /home/vagrant/go/src/runtime/asm_amd64.s:2197
#10 0x0000000000000000 in ?? ()

(gdb) l
45		MOVQ	a2+16(FP), SI
46		MOVQ	a3+24(FP), DX
47		MOVQ	a4+32(FP), R10
48		MOVQ	a5+40(FP), R8
49		MOVQ	a6+48(FP), R9
50		SYSCALL
51		JCC	ok6
52		MOVQ	$-1, r1+56(FP)	// r1
53		MOVQ	$0, r2+64(FP)	// r2
54		MOVQ	AX, err+72(FP)  // errno

(gdb) disas
Dump of assembler code for function syscall.Syscall6:
   0x000000000046c810 <+0>:	callq  0x42d9d0 <runtime.entersyscall>
   0x000000000046c815 <+5>:	mov    0x8(%rsp),%rax
   0x000000000046c81a <+10>:	mov    0x10(%rsp),%rdi
   0x000000000046c81f <+15>:	mov    0x18(%rsp),%rsi
   0x000000000046c824 <+20>:	mov    0x20(%rsp),%rdx
   0x000000000046c829 <+25>:	mov    0x28(%rsp),%r10
   0x000000000046c82e <+30>:	mov    0x30(%rsp),%r8
   0x000000000046c833 <+35>:	mov    0x38(%rsp),%r9
=> 0x000000000046c838 <+40>:	syscall
   0x000000000046c83a <+42>:	jae    0x46c859 <syscall.Syscall6+73>
   0x000000000046c83c <+44>:	movq   $0xffffffffffffffff,0x40(%rsp)
   0x000000000046c845 <+53>:	movq   $0x0,0x48(%rsp)
   0x000000000046c84e <+62>:	mov    %rax,0x50(%rsp)
   0x000000000046c853 <+67>:	callq  0x42ddf0 <runtime.exitsyscall>
   0x000000000046c858 <+72>:	retq
   0x000000000046c859 <+73>:	mov    %rax,0x40(%rsp)
   0x000000000046c85e <+78>:	mov    %rdx,0x48(%rsp)
   0x000000000046c863 <+83>:	movq   $0x0,0x50(%rsp)
   0x000000000046c86c <+92>:	callq  0x42ddf0 <runtime.exitsyscall>
   0x000000000046c871 <+97>:	retq
End of assembler dump.

tmm1 commented Mar 9, 2017

Ran with ktrace and it crashed after a few minutes. Here's the last iteration worth of events: https://gist.github.com/tmm1/f10330c43fe6dfebeabcf5188e3ff535

Not particularly interesting from the kernel's point of view.

Contributor

rsc commented Mar 9, 2017

Thanks @bradfitz. I only just realized that the numbers in freebsd-amd64-gce93 and freebsd-amd64-gce101 are FreeBSD versions. I thought they were GCE versions somehow.

Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That's progress.

Owner

bradfitz commented Mar 9, 2017

Yeah, I've started dropping "gce" in the suffixes due to the confusion. Also, the freebsd-amd64-110 gomote is quite new (9 days ago in #19097). They weren't available when y'all started debugging this.

tmm1 commented Mar 9, 2017 edited

Anyway, reproduced reliably on freebsd-amd64-110 gomote (FreeBSD 11.0). That's progress.

Great! Let me know if I can do anything else to help.

tmm1 commented Mar 9, 2017 edited

Booted up a fresh FreeBSD-10.3-STABLE VM this morning, and I'm not able to repro my crash there either. So far, here's where my crash can be reproduced:

version                                  outcome
10.1-RELEASE (GCE)                       not crashing
10.3-RELEASE                             not crashing
10.3-RELEASE-p11                         not crashing
10.3-STABLE (r314494)                    not crashing
10.3-STABLE (freenas-9.10 455d13d)       crashing
10.3-STABLE (freenas-9.10.2 41eb257)     crashing
11.0-RELEASE                             crashing
11.0-RELEASE-p8                          crashing
11.0-STABLE (r314493)                    crashing

Interestingly, the earlier repro was crashing for @derekmarcotte on 10.3-RELEASE, so I wonder if my issue is a different bug. Then again, the 10.3 repro was also taking days, so it's possible it's just not triggering as readily on the older kernels.

erikarn commented Mar 9, 2017

Contributor

stevenh commented Mar 9, 2017

E5-2640 here

tmm1 commented Mar 9, 2017

All my testing has been on VirtualBox VMs, on a darwin 15.6.0 host with an i5-2415M

erikarn commented Mar 9, 2017

tmm1 commented Mar 9, 2017

Booted up a 11.0-STABLE (r314493) VM and that crashes too.

(you can use cpuset to push the process+descendents onto a single core.)

I tried cpuset -l 0 ./spawntest and it still crashed.

erikarn commented Mar 9, 2017

Contributor

stevenh commented Mar 9, 2017

Tried it here: it still crashes, but takes much longer, and that's on the metal.

kaey commented Mar 10, 2017

In case this is architecture specific, can people include their actual CPU
details too?

FreeBSD 11.0-RELEASE-p2 #0: Mon Oct 24 06:55:27 UTC 2016
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): resolution 640x480
CPU: AMD G-T40E Processor (1000.02-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x500f20  Family=0x14  Model=0x2  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802209<SSE3,MON,SSSE3,CX16,POPCNT>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,IBS,SKINIT,WDT>
  SVM: NP,NRIP,NAsids=8
  TSC: P-state invariant, performance statistics

asomers commented Mar 10, 2017

Sorry if I'm asking a dumb question, but could this be as simple as a stack overflow? By default pthreads have smaller stacks than main threads, and a SIGBUS is the normal symptom of overflowing a thread's stack. So that's compatible with the observation that the crash only happens when the thread that calls execve is not the main thread of the process.

Contributor

stevenh commented Mar 10, 2017

The failure signal is pretty random, I've seen SIGILL, SIGBUS and SIGSEGV :(

Contributor

ianlancetaylor commented Mar 10, 2017

@asomers All the stacks here are goroutine stacks which never overflow.

asomers commented Mar 10, 2017

@ianlancetaylor goroutines never SIGILL either, except when they do. There are real stacks down there somewhere.

Contributor

ianlancetaylor commented Mar 10, 2017

@asomers These crashes are happening with pure Go programs. There are no pthreads involved anywhere. The stacks grow automatically: the function prologue ensures there is enough stack space before the function begins. I don't know what the problem is, but I am confident that it is not "as simple as a stack overflow."
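(Aside, for readers unfamiliar with this: a goroutine's stack starts small and the runtime grows it on demand, so even deep recursion doesn't overflow a fixed-size thread stack. A trivial standalone illustration, not taken from this thread:)

package main

import "fmt"

// depth recurses n times; each call consumes some stack, so the runtime has
// to grow the goroutine's stack repeatedly rather than overflow it.
func depth(n int) int {
	var pad [128]byte
	pad[0] = byte(n) // keep pad live so each frame really occupies space
	if n == 0 {
		return int(pad[0])
	}
	return 1 + depth(n-1)
}

func main() {
	// ~100k frames is far beyond the initial goroutine stack; this prints
	// 100000 and exits normally because the stack is grown on demand.
	fmt.Println(depth(100000))
}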

erikarn commented Mar 10, 2017

asomers commented Mar 10, 2017

@erikarn it's a multithreaded process. As @tmm1's ktrace output shows, there appear to be 3 threads.
@ianlancetaylor the bug is unlikely to lie in pure Go code. It's probably either in the OS or in the implementation of Go. Remember, everything is assembly in the end. I don't know if a stack overflow is the problem, but I do know that when investigating a bug like this we can't assume that Go's high-level safety guarantees are working.

Contributor

ianlancetaylor commented Mar 10, 2017

@asomers For background I've been working on Go since 2008 and I'm very familiar with Go's implementation. I agree that Go's high level safety guarantees are not relevant here. I am simply asserting with high confidence that this is not a stack overflow.

(Personally I've believed for months now that this is a bug in the FreeBSD kernel, but I've never been able to recreate the problem myself, so I can't prove it. I could certainly be wrong--it could be a bug in Go. If it is a bug in Go it is most likely due to some error in the way that Go invokes the fork system call in syscall/exec_bsd.go. Perhaps the system call needs special handling on FreeBSD in some way, as is true of, for example, vfork on GNU/Linux.)

Contributor

ianlancetaylor commented Mar 10, 2017

@erikarn

Does your Go runtime hook into any of the standard C runtime / linker stuff? initialisers, destructors, etc?

No.

Contributor

stevenh commented Mar 14, 2017

Another data point for this is that after 8 hours of testing the program from #15658 (comment) under go 1.4 it has still yet to fail.

The same under go 1.8 fails within a matter of minutes at the most.

Contributor

ianlancetaylor commented Mar 14, 2017

A key difference between Go 1.4 and Go 1.8 is that Go 1.4 always defaults to a GOMAXPROCS value of 1, whereas in 1.8 GOMAXPROCS defaults to the number of CPUs on the system. So when testing with Go 1.4, set the environment variable GOMAXPROCS=4 or however many CPUs you have.
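(A quick way to confirm what GOMAXPROCS actually resolved to under a given Go version and environment; a trivial snippet of my own, not from the thread. Run it with e.g. GOMAXPROCS=4 under Go 1.4 and it should report 4 rather than the old default of 1.)

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "NumCPU:", runtime.NumCPU())
}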

Contributor

stevenh commented Mar 14, 2017

Good point. This is subtly different from using a single core via cpuset, as IIRC there's an outstanding issue about having Go set GOMAXPROCS based on the number of cores available to the process rather than the total cores in the system, and I still got an error on 1.8 in that configuration.

I've set GOMAXPROCS=24 and am retesting.

Contributor

stevenh commented Mar 14, 2017

If go 1.4 still holds up, what do we think the chances are that a bug was introduced when moving the runtime from C to pure Go?

Contributor

ianlancetaylor commented Mar 14, 2017

@stevenh I think the issue about the number of cores available to the process is #15206, which was recently fixed on tip. The fix will be in 1.9.

There isn't a significant amount of C code involved here, so while anything is possible I think it is unlikely to be related to the conversion from C to Go. If it works with 1.4 and fails with 1.8 then a far more likely immediate cause is the concurrent garbage collector. Perhaps there is some problem when one thread reads or writes memory pages while a different thread is calling fork.

Contributor

stevenh commented Mar 14, 2017

@ianlancetaylor GC should be out of the picture as we've had failures with GOGC=off

Contributor

ianlancetaylor commented Mar 14, 2017

@stevenh Hmmm, right.

Since we are only seeing reports on FreeBSD I think any problem with the C to Go conversion would be in runtime/os_freebsd.c converting to runtime/os_freebsd.go. I don't see any problems there in a quick look, but who knows.

Contributor

stevenh commented Mar 14, 2017 edited

@ianlancetaylor Yeah, I had a quick look the other day too and couldn't spot anything either.

However, on a cursory review, libthr (FreeBSD's native thread library) does a lot more work when processing a fork than the Go runtime seems to, so I'm wondering if some needed protection is missing, allowing additional threads to perform actions during the fork and resulting in the memory corruption we're seeing.

libthr fork can be seen here:
https://svnweb.freebsd.org/base/head/lib/libthr/thread/thr_fork.c?view=markup#l135

One test that might help backup that premise, is if we could ensure that no other threads ran while a fork was in process.

tmm1 commented Mar 15, 2017

One test that might help backup that premise, is if we could ensure that no other threads ran while a fork was in process.

Any ideas on the best way to hack this in place to try it out?

Contributor

ianlancetaylor commented Mar 15, 2017

@tmm1 Have beforefork in runtime/proc.go call stopTheWorld, and have afterfork call startTheWorld.

tmm1 commented Mar 15, 2017

@ianlancetaylor Thanks!

I applied this patch and built a new binary with it, but the resulting binary still crashed quickly.

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f41672de73..55950bf64d 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2793,6 +2793,7 @@ func beforefork() {
 //go:linkname syscall_runtime_BeforeFork syscall.runtime_BeforeFork
 //go:nosplit
 func syscall_runtime_BeforeFork() {
+       stopTheWorld("beforefork")
        systemstack(beforefork)
 }

@@ -2814,6 +2815,7 @@ func afterfork() {
 //go:nosplit
 func syscall_runtime_AfterFork() {
        systemstack(afterfork)
+       startTheWorld()
 }

 // Allocate a new g, with a stack big enough for stacksize bytes.
Contributor

stevenh commented Mar 15, 2017

Update on the go 1.4 test: it's been running for over 20 hours now, no failures.

Contributor

stevenh commented Mar 17, 2017

Ok so I may have a lead on this.

Given go 1.4 appeared to be unaffected I tried go 1.5, which also appears to be unaffected; however, switching to go 1.6 it crashed within seconds.

After some bisecting I found that reverting golang/go@a7cad52 from go 1.6 made it stable again.

Digging some more, I see that after execve, sigaltstack is returning the details of a signal stack that was set before the execve, which should never happen.

Contributor

stevenh commented Mar 17, 2017

Ok reproduced it in C:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

extern char **environ;

static void
die(const char *s)
{
       perror(s);
       exit(EXIT_FAILURE);
}

static void
printsigstack(const char *prefix)
{
        stack_t ss;

        if (sigaltstack(NULL, &ss) < 0)
                die("sigaltstack get");
        printf("%s - ss_sp=%p ss_size=%lu ss_flags=0x%x\n", prefix, ss.ss_sp,
            (unsigned long)ss.ss_size, (unsigned int)ss.ss_flags);
}

static void
setstack(void *arg)
{
        stack_t ss;

        ss.ss_sp = malloc(SIGSTKSZ);
        if (ss.ss_sp == NULL)
                die("malloc");

        ss.ss_size = SIGSTKSZ;
        ss.ss_flags = 0;
        if (sigaltstack(&ss, NULL) < 0)
                die("sigaltstack set");
}


static void *
thread(void *arg)
{
        char *argv[] = { "./test-sigs", "no-more", 0 };

        printsigstack("thread - before setstack");
        setstack(arg);
        printsigstack("thread - after setstack + before execve");

        execve(argv[0], &argv[0], environ);
        die("exec failed");

        return NULL;
}

int
main(int argc, char** argv)
{
        int i, j;
        int alloc_size = 32 * 1024;
        pthread_t tid;
        char buf[1024];
        void *arg;

        printf("argc: %d, %s\n", argc, argv[0]);
        printsigstack("main");
        if (argc != 1)
                return 0;

        j = pthread_create(&tid, NULL, thread, NULL);
        if (j != 0) {
               errno = j;
               die("pthread_create");
        }

        j = pthread_join(tid, NULL);
        if (j != 0) {
                errno = j;
                die("pthread_join");
        }

        return 0;
}

Produces:

argc: 1, ./test-sigs
main - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - before setstack - ss_sp=0x0 ss_size=0 ss_flags=0x4
thread - after setstack + before execve - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0
argc: 2, ./test-sigs
main - ss_sp=0x8014106c0 ss_size=34816 ss_flags=0x0

As you can see from the second main output, which comes straight after the execve, the process already has a signal stack, which it shouldn't.

This is corroborated by execve(2): "The signal stack is reset to be undefined (see sigaction(2) for more information)."

So now to go kernel bug hunting....

Contributor

stevenh commented Mar 17, 2017

Just to be clear: if the execve is done from the main thread, then even if it has an alternative signal stack set, that stack is not present after the execve. Doing the execve from another thread is key to reproducing this, which may explain why it's not been identified in the past.

Contributor

stevenh commented Mar 17, 2017

Thanks to @kostikbel who knew exactly where to look for this, here's the kernel patch for people to test:

Index: sys/kern/kern_sig.c
===================================================================
--- sys/kern/kern_sig.c (revision 315303)
+++ sys/kern/kern_sig.c (working copy)
@@ -976,7 +976,6 @@ execsigs(struct proc *p)
         * and are now ignored by default).
         */
        PROC_LOCK_ASSERT(p, MA_OWNED);
-       td = FIRST_THREAD_IN_PROC(p);
        ps = p->p_sigacts;
        mtx_lock(&ps->ps_mtx);
        while (SIGNOTEMPTY(ps->ps_sigcatch)) {
@@ -1007,6 +1006,8 @@ execsigs(struct proc *p)
         * Reset stack state to the user stack.
         * Clear set of signals caught on the signal stack.
         */
+       td = curthread;
+       MPASS(td->td_proc == p);
        td->td_sigstk.ss_flags = SS_DISABLE;
        td->td_sigstk.ss_size = 0;
        td->td_sigstk.ss_sp = 0;

My test here which usually crashes in a few seconds has been running 20 mins now no issues, so I'm hopeful this is the fix for this.

Contributor

ianlancetaylor commented Mar 17, 2017

@stevenh Awesome work! Thanks!

tmm1 commented Mar 17, 2017 edited

Wow, incredible work @stevenh.

Looks like that patch was committed this morning: freebsd/freebsd@0350bb5

Does it make sense to workaround this bug in go until the upstream fix propagates out? (By reverting the os1_freebsd.go changes in a7cad52).

emaste commented Mar 17, 2017 edited

A huge thank you to @stevenh and everyone involved in tracking down this issue!

The FreeBSD change will make it into the stable/10 and stable/11 branches next week (by end of Mar 2017), but it will take more time to appear in new minor or errata releases; I suspect that it will be around August when we can count on fixed FreeBSD releases being widely available.

Is it possible to perform a run-time check, with the change in a7cad52 applying only to known fixed versions?

(Comment updated to clarify that it's the FreeBSD change that's making it into branches next week, and to provide a non-relative "next week".)
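(A minimal sketch of the kind of run-time check being asked about here, gating on the kern.osreldate sysctl; the cut-off value is a placeholder and this is only an illustration, not what the Go workaround ends up doing.)

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// kern.osreldate reports the running kernel's __FreeBSD_version number,
	// so a workaround could be skipped once it reaches a known-fixed value.
	osreldate, err := syscall.SysctlUint32("kern.osreldate")
	if err != nil {
		fmt.Println("sysctl kern.osreldate:", err)
		return
	}
	const firstFixed = 9999999 // placeholder: substitute the first fixed release's value
	fmt.Printf("kern.osreldate=%d, assumed fixed: %v\n", osreldate, osreldate >= firstFixed)
}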

Contributor

josharian commented Mar 17, 2017

Note that Aug is when Go 1.9 will be released: https://github.com/golang/go/wiki/Go-Release-Cycle

Contributor

ianlancetaylor commented Mar 17, 2017

I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.

CL https://golang.org/cl/38325 mentions this issue.

tmm1 commented Mar 17, 2017

I think we can work around the problem without requiring a version check. Can someone with an affected system please test whether https://golang.org/cl/38325 fixes the problem? Thanks.

Confirmed that patch fixes the crash on FreeBSD 11.0-RELEASE-p8

erikarn commented Mar 17, 2017

tmm1 added a commit to fancybits/go that referenced this issue Mar 17, 2017

ianlancetaylor + tmm1: runtime: clear signal stack on main thread
This is a workaround for a FreeBSD kernel bug. It can be removed when
we are confident that all people are using the fixed kernel. See #15658.

Fixes #15658.

Change-Id: I0ecdccb77ddd0c270bdeac4d3a5c8abaf0449075
f53460c
Contributor

stevenh commented Mar 17, 2017

After speaking to @kostikbel this afternoon we may yet still have an outstanding issue.

It's possible that the repro case identified by @tmm1 was independent of the original issue.

I'm rebuilding the original fork only test case and retesting.

Contributor

stevenh commented Mar 17, 2017

Green on the test suite?

erikarn commented Mar 17, 2017

tmm1 commented Mar 18, 2017

Its possible that the repo case identified by @tmm1 was independent of the original issue.

😲 If that is the case, I apologize profusely for hijacking this thread.

As I noted to @stevenh, I am definitely able to reproduce the original memory corruption issue on the patched system. Also, somebody mentioned that the test code was stripped to only use fork() and still caused the issue. In other words, an issue is still there.

Contributor

stevenh commented Mar 18, 2017 edited

@kostikbel could you share the exact setup of the test case that failed for you, as the three fork-only tests I've been running here haven't failed yet after over 10 hours of testing.

For the record, the three I've been running are:

Test 1 - No channels (with ktrace)
ktrace -tfw+ -i -f ktrace-test.out ./test
Test 2 - No channels (without ktrace)
./test2

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof"
        "runtime"
        "syscall"
)

var (
        forkRoutines = 24
)

func run() {
        for {
                if err := syscall.ForkOnlyBSDTest(); err != nil {
                        log.Fatal(err)
                }
        }
}

func main() {
        log.Printf("Starting %v forking goroutines...\n", forkRoutines)
        log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        for i := 0; i < forkRoutines; i++ {
                go run()
        }

        log.Println(http.ListenAndServe(":8081", nil))
}

Test 3 - With Channels (without ktrace)
./test3

package main

import (
        "log"
        "syscall"
)

func run(done chan struct{}) {
        if err := syscall.ForkOnlyBSDTest(); err != nil {
                log.Fatal(err)
        }

        done <- struct{}{}
        return
}

func main() {
        log.Println("Starting a bunch of goroutines...")

        done := make(chan struct{}, 24)

        for i := 0; i < 24; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}

Fork Only:

package syscall

func ForkOnlyBSDTest() (err error) {
        var r1 uintptr
        var pid int
        var err1 Errno
        var wstatus WaitStatus

        ForkLock.Lock()
        runtime_BeforeFork()

        r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
        if err1 != 0 {
                runtime_AfterFork()
                ForkLock.Unlock()
                return err1
        }

        if r1 == 0 {
                for {
                        RawSyscall(SYS_EXIT, 253, 0, 0)
                }
        }

        runtime_AfterFork()
        ForkLock.Unlock()

        pid = int(r1)

        _, err = Wait4(pid, &wstatus, 0, nil)
        for err == EINTR {
                _, err = Wait4(pid, &wstatus, 0, nil)
        }

        return
}

All test have the following env:

export GOGC=2
export GOTRACEBACK=crash
export GOMAXPROCS=24

Hardware:

Dec  6 18:08:23 head kernel: CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz (2500.05-MHz K8-class CPU)
Dec  6 18:08:23 head kernel: Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
Dec  6 18:08:23 head kernel: Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Dec  6 18:08:23 head kernel: Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
Dec  6 18:08:23 head kernel: AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
Dec  6 18:08:23 head kernel: AMD Features2=0x1<LAHF>
Dec  6 18:08:23 head kernel: XSAVE Features=0x1<XSAVEOPT>
Dec  6 18:08:23 head kernel: VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
Dec  6 18:08:23 head kernel: TSC: P-state invariant, performance statistics
Dec  6 18:08:23 head kernel: real memory  = 137438953472 (131072 MB)
Dec  6 18:08:23 head kernel: avail memory = 133429284864 (127248 MB)

OS:

FreeBSD 12.0-CURRENT #39 r315303M: 

Go:

go version go1.8 freebsd/amd64

I used a binary which was compiled from the source provided in the original message of this bug report. I am not sure which Go version was used, but it was not older than half a year, judging by the dates.

Contributor

stevenh commented Mar 19, 2017

I've not been able to repro a crash yet, but I have hit both failure modes (one core pegged at 100%, and everything sitting in locks at 0%) even with the patch, though only after ~24 hours.

I'm now retesting with the stop-the-world while holding the fork lock, to see if that helps narrow down whether this is a kernel or a Go runtime issue.

If anyone has any quicker reproduction cases I'd like to hear them, as 24 hours between tests is going to hamper investigation :(

Contributor

stevenh commented Mar 20, 2017

For reference, the following will detect whether your kernel has the signal stack issue.

It's different from the original, which would only detect the problem on FreeBSD 11.0+ because the order of thread inserts changed there; to detect it in all cases, three threads are needed.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>

extern char **environ;

static void
die(const char *s)
{
       perror(s);
       exit(EXIT_FAILURE);
}

static void
setstack(void *arg)
{
	stack_t ss;

	ss.ss_sp = malloc(SIGSTKSZ);
	if (ss.ss_sp == NULL)
		die("malloc");

	ss.ss_size = SIGSTKSZ;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) < 0)
		die("sigaltstack set");
}

static void *
thread_exec(void *arg)
{
	struct timespec ts = {0, 1000};
	char *argv[] = {"./test-sigs", "no-more", 0};

	setstack(arg);
	nanosleep(&ts, NULL);

	execve(argv[0], &argv[0], environ);
	die("exec failed");

	return NULL;
}

static void *
thread_sleep(void *arg)
{
	sleep(10);

	return NULL;
}


int
main(int argc, char** argv)
{
	int i, j;
	int alloc_size = 32 * 1024;
	pthread_t tid1, tid2;
	char buf[1024];
	void *arg;

	if (argc != 1) {
		stack_t ss;

		if (sigaltstack(NULL, &ss) < 0)
			die("sigaltstack get");

		if (ss.ss_sp != NULL || ss.ss_flags != SS_DISABLE ||
		    ss.ss_size != 0) {
			fprintf(stderr, "invalid signal stack after execve: "
			    "ss_sp=%p ss_size=%lu ss_flags=0x%x\n", ss.ss_sp,
			    (unsigned long)ss.ss_size,
			    (unsigned int)ss.ss_flags);
			return 1;
		}

		printf("valid signal stack is valid after execve\n");

		return 0;
	}

	// We have to use two threads to ensure that we can detect the
	// issue when new threads are added to the head (pre 269095)
	// and the tail of the process thread list.
	j = pthread_create(&tid1, NULL, thread_exec, NULL);
	if (j != 0) {
	       errno = j;
	       die("pthread_create");
	}

	j = pthread_create(&tid2, NULL, thread_sleep, NULL);
	if (j != 0) {
	       errno = j;
	       die("pthread_create");
	}

	j = pthread_join(tid1, NULL);
	if (j != 0) {
		errno = j;
		die("pthread_join");
	}

	j = pthread_join(tid2, NULL);
	if (j != 0) {
		errno = j;
		die("pthread_join");
	}

	return 0;
}

gopherbot pushed a commit that referenced this issue Mar 20, 2017

ianlancetaylor: runtime: clear signal stack on main thread
This is a workaround for a FreeBSD kernel bug. It can be removed when
we are confident that all people are using the fixed kernel. See #15658.

Updates #15658.

Change-Id: I0ecdccb77ddd0c270bdeac4d3a5c8abaf0449075
Reviewed-on: https://go-review.googlesource.com/38325
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
5dc14af

@stevenh I've got quite a few FreeBSD machines in various configurations (virtualized, hardware, jailed, 11.0, 12-CURRENT and so on, lots of CPUs, lots of RAM) and I would love to help test your fixes - let me know if there's any way I could help.

I've been able to reproduce the same crashes as before using the same methods with Ian's fixes added to the tree. Of three attempts, it wedged twice, panic'd once.

This strongly supports the case that the original issue stands.

Here's a log.

Here's a reference to the branch used to build the executable that produced the log.

Contributor

stevenh commented Mar 23, 2017

Yes indeed @derekmarcotte, the fix added by Ian was to address the issue raised by @tmm1, which unfortunately was a separate issue from the original.

Contributor

stevenh commented Mar 23, 2017 edited

Continuing to debug this, the evidence is building up that the issue is related to a running child during a GC pass.

Using the following code I'm reliably able to trigger a failure in ~10mins on FreeBSD 12.0 r315303M (I don't believe the FreeBSD version is important).

syscall/exec_bsd_debug.go

package syscall

// forks counts completed forks for the debug prints below.
var forks uint64

// runtime_GC is implemented in the runtime and linked here via go:linkname
// (see runtime/exec_debug.go below).
func runtime_GC()

func ForkOnlyBSDTest(gc bool) (err error) {
        var r1 uintptr
        var pid int
        var err1 Errno
        var wstatus WaitStatus

        ForkLock.Lock()
        print("forking: ", forks+1, "\n")
        runtime_BeforeFork()

        r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
        if err1 != 0 {
                runtime_AfterFork()
                ForkLock.Unlock()
                return err1
        }

        if r1 == 0 {
                for {
                        RawSyscall(SYS_EXIT, 253, 0, 0)
                }
        }

        forks++
        runtime_AfterFork()
        print("forked: ", forks, "\n")
        if gc {
                runtime_GC()
        }
        ForkLock.Unlock()

        pid = int(r1)
        _, err = Wait4(pid, &wstatus, 0, nil)
        for err == EINTR {
                print("wait4: EINTR\n")
                _, err = Wait4(pid, &wstatus, 0, nil)
        }

        return err
}

runtime/exec_debug.go

package runtime

//go:linkname syscall_runtime_GC syscall.runtime_GC
//go:nosplit
func syscall_runtime_GC() {
        GC()
}

main.go

package main

import (
        "log"
        "runtime"
        "syscall"
)

var (
        forkRoutines = 1
        gc           = true
)

func run() {
        for {
                if err := syscall.ForkOnlyBSDTest(gc); err != nil {
                        log.Fatal(err)
                }
        }
}

func main() {
        log.Printf("Starting %v forking goroutines...\n", forkRoutines)
        log.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "version:", runtime.Version()

        for i := 0; i < forkRoutines; i++ {
                go run()
        }

        done := make(chan struct{})
        <-done
}

env settings

export GOGC=off
export GOTRACEBACK=crash
export GOMAXPROCS=16

I've discovered a few interesting facts:

  1. If I add a delay (tried up to 1.5 seconds) before exiting the child, so the child exits after the GC pass, I still get the issue.
  2. If I move the GC after the Wait4 block, so the child exits before the GC pass, I've not had a failure after hours.
  3. If I add a delay (10ms) before the GC, so the child exits before the GC pass, I've not had a failure after hours.

I believe 1 rules out signal delivery during the GC pass as a potential culprit.

So the issue appears to be triggered by having a running child present during a GC pass.

The crash typically looks like:

fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x7d62ad, 0x14)
        /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0x7fffde9f2b58 sp=0x7fffde9f2b38
runtime.(*workbuf).checkempty(0x801166800)
        /usr/local/go/src/runtime/mgcwork.go:320 +0x4e fp=0x7fffde9f2b78 sp=0x7fffde9f2b58
runtime.getempty(0x7fffde9f2bd0)
        /usr/local/go/src/runtime/mgcwork.go:332 +0x96 fp=0x7fffde9f2bb0 sp=0x7fffde9f2b78
runtime.handoff(0x801166800, 0xc41fffa1b2)
        /usr/local/go/src/runtime/mgcwork.go:432 +0x26 fp=0x7fffde9f2be8 sp=0x7fffde9f2bb0
runtime.(*gcWork).balance(0xc42001f228)
        /usr/local/go/src/runtime/mgcwork.go:275 +0x4e fp=0x7fffde9f2c08 sp=0x7fffde9f2be8
runtime.gcDrain(0xc42001f228, 0x0)
        /usr/local/go/src/runtime/mgcmark.go:1066 +0x36c fp=0x7fffde9f2c40 sp=0x7fffde9f2c08
runtime.gchelper()
        /usr/local/go/src/runtime/mgc.go:1910 +0x94 fp=0x7fffde9f2c70 sp=0x7fffde9f2c40
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1653 +0xdd fp=0x7fffde9f2c98 sp=0x7fffde9f2c70
runtime.findrunnable(0xc420018000, 0x0)
        /usr/local/go/src/runtime/proc.go:2102 +0x2e4 fp=0x7fffde9f2d30 sp=0x7fffde9f2c98
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2222 +0x14c fp=0x7fffde9f2d70 sp=0x7fffde9f2d30
runtime.goschedImpl(0xc42014a000)
        /usr/local/go/src/runtime/proc.go:2300 +0x106 fp=0x7fffde9f2d90 sp=0x7fffde9f2d70
runtime.gopreempt_m(0xc42014a000)
        /usr/local/go/src/runtime/proc.go:2315 +0x36 fp=0x7fffde9f2da8 sp=0x7fffde9f2d90
runtime.newstack(0x0)
        /usr/local/go/src/runtime/stack.go:1081 +0x2f0 fp=0x7fffde9f2f20 sp=0x7fffde9f2da8
runtime.morestack()
        /usr/local/go/src/runtime/asm_amd64.s:398 +0x86 fp=0x7fffde9f2f30 sp=0x7fffde9f2f20

I also needed the following patch to ensure I got a core even with GOTRACEBACK=crash: it seems that in this case, where m.traceback is set to 1, the user preference is prevented from taking effect, so it was impossible to trigger a crash on panic.

--- ./runtime/runtime1.go.orig  2017-03-21 16:27:26.000000000 +0000
+++ ./runtime/runtime1.go       2017-03-21 16:27:26.000000000 +0000
@@ -36,14 +36,16 @@ var traceback_env uint32
 func gotraceback() (level int32, all, crash bool) {
        _g_ := getg()
        all = _g_.m.throwing > 0
-       if _g_.m.traceback != 0 {
-               level = int32(_g_.m.traceback)
-               return
-       }
        t := atomic.Load(&traceback_cache)
        crash = t&tracebackCrash != 0
        all = all || t&tracebackAll != 0
-       level = int32(t >> tracebackShift)
+       if _g_.m.traceback != 0 {
+               level = int32(_g_.m.traceback)
+       }
+
+       if int32(t>>tracebackShift) > level {
+               level = int32(t >> tracebackShift)
+       }
        return
 }
Contributor

stevenh commented Mar 23, 2017 edited

Ok, very interesting data point: it appears that if I compile with CGO_ENABLED=0 then the problem goes away. This would explain why we've never seen this in production, as all of our binaries are compiled with CGO disabled.

Scratch that, after a few test runs it happened :(

@stevenh: FWIW - I'm still getting panics with GOGC=off and without the explicit GC calls from your example. I don't believe that garbage collection is implicated here, unless it runs sometimes despite GOGC=off. On my last trawl through the runtime, it looked like GOGC=off kills the GC dead.

I've used a stripped down version of your example main.go, and my ForkOnlyBSDTest.

Contributor

stevenh commented Mar 27, 2017

@derekmarcotte I don't think GC is the cause, however I do think forcing GC between the fork and wait produces enough meaningful work in the Go runtime to detect the memory corruption.

Out of interest:

  1. What frequency do you get failures with your tests?
  2. Whats your setup?
  3. Could you post your exact main.go and ForkOnlyBSDTest?
Contributor

stevenh commented Mar 27, 2017

Forgot to ask, with GC disabled what crash did you get?

@stevenh:
In terms of extra work, I think the /bin/true case is pretty quick to fail. It's helpful to eliminate as much as possible, to narrow down the systems involved. The only things left in play, AFAIK, are the scheduler, stack expansion (perhaps), and the worker routines (e.g. fork/wait).

  1. Most recently, out of two tests I had one hang and one crash.
  2. Most of the testing I do is in a 10.3-RELEASE jail on a 10.3-RELEASE host (AMD 8350FX), the host is busy doing other things as well. Additionally, I've booted the 10.3-RELEASE memstick in single user mode on a Xeon+ECC, and have been able to crash in exactly the same fashion. It doesn't appear to be environment related (i.e. hardware/jails + background services).
  3. main.go, ForkOnlyBSDTest - these are intentionally bare bones to eliminate any other sources of memory corruption.
  4. Since moving to the ForkOnlyBSDTest, I have only panicked with fatal error: acquirep: invalid p state; here's a recent example:

Mon Mar 27 06:47:55 EDT 2017
acquirep: p->m=4895104(0) p->status=1
fatal error: acquirep: invalid p state
... stack traces
Mon Mar 27 11:54:57 EDT 2017

So that one was about 5 hours.

I haven't been meticulous about logging the timing any more though. If it's helpful, I'm sure I could script up something like this to detect the wedge, and start recording more times? I don't feel that would help much at this point though.

Contributor

stevenh commented Mar 28, 2017

Thanks @derekmarcotte, that's why I've been using the force-GC test recently: it blows up pretty reliably within 1-10 mins, which makes iterations much quicker.

@stevenh: I ended up scripting the thing, here's a sample log (with backtraces omitted):

=== Wed Apr  5 18:23:40 EDT 2017 - Killing wedged 42369
=== Wed Apr  5 18:23:45 EDT 2017 - Starting new instance
=== Thu Apr  6 02:45:19 EDT 2017 - Starting new instance
=== Thu Apr  6 12:52:40 EDT 2017 - Starting new instance
=== Thu Apr  6 18:24:49 EDT 2017 - Killing wedged 98836
=== Thu Apr  6 18:24:54 EDT 2017 - Starting new instance
=== Thu Apr  6 20:47:11 EDT 2017 - Starting new instance
=== Fri Apr  7 02:25:20 EDT 2017 - Killing wedged 53688
=== Fri Apr  7 02:25:25 EDT 2017 - Starting new instance
=== Fri Apr  7 02:49:28 EDT 2017 - Starting new instance
=== Fri Apr  7 03:16:05 EDT 2017 - Starting new instance
=== Fri Apr  7 06:42:34 EDT 2017 - Killing wedged 38753
=== Fri Apr  7 06:42:39 EDT 2017 - Starting new instance
=== Fri Apr  7 06:42:54 EDT 2017 - Killing wedged 24324
=== Fri Apr  7 06:42:59 EDT 2017 - Starting new instance
=== Fri Apr  7 06:43:04 EDT 2017 - Killing wedged 97972
=== Fri Apr  7 06:43:09 EDT 2017 - Starting new instance
=== Fri Apr  7 06:43:34 EDT 2017 - Killing wedged 23425
=== Fri Apr  7 06:43:39 EDT 2017 - Starting new instance
=== Fri Apr  7 06:44:29 EDT 2017 - Killing wedged 45400
=== Fri Apr  7 06:44:34 EDT 2017 - Starting new instance
=== Fri Apr  7 06:45:19 EDT 2017 - Killing wedged 86970
=== Fri Apr  7 06:45:24 EDT 2017 - Starting new instance
=== Fri Apr  7 06:45:39 EDT 2017 - Killing wedged 9083
=== Fri Apr  7 06:45:44 EDT 2017 - Starting new instance
=== Fri Apr  7 06:46:04 EDT 2017 - Killing wedged 78764
=== Fri Apr  7 06:46:09 EDT 2017 - Starting new instance
=== Fri Apr  7 06:46:14 EDT 2017 - Killing wedged 77503
=== Fri Apr  7 06:46:19 EDT 2017 - Starting new instance
=== Fri Apr  7 06:46:24 EDT 2017 - Killing wedged 2314
=== Fri Apr  7 06:46:29 EDT 2017 - Starting new instance
=== Fri Apr  7 06:46:59 EDT 2017 - Killing wedged 27186
=== Fri Apr  7 06:47:04 EDT 2017 - Starting new instance
=== Fri Apr  7 06:47:25 EDT 2017 - Killing wedged 73453
=== Fri Apr  7 06:47:30 EDT 2017 - Starting new instance
=== Fri Apr  7 06:47:35 EDT 2017 - Killing wedged 72424
=== Fri Apr  7 06:47:40 EDT 2017 - Starting new instance
=== Fri Apr  7 06:48:15 EDT 2017 - Killing wedged 96875
=== Fri Apr  7 06:48:20 EDT 2017 - Starting new instance
=== Fri Apr  7 06:49:15 EDT 2017 - Killing wedged 78286
=== Fri Apr  7 06:49:20 EDT 2017 - Starting new instance
=== Fri Apr  7 06:49:35 EDT 2017 - Killing wedged 57532
=== Fri Apr  7 06:49:40 EDT 2017 - Starting new instance
=== Fri Apr  7 06:50:40 EDT 2017 - Killing wedged 29647
=== Fri Apr  7 06:50:45 EDT 2017 - Starting new instance
=== Fri Apr  7 06:51:15 EDT 2017 - Killing wedged 41134
=== Fri Apr  7 06:51:20 EDT 2017 - Starting new instance
=== Fri Apr  7 06:51:40 EDT 2017 - Killing wedged 93192
=== Fri Apr  7 06:51:45 EDT 2017 - Starting new instance
=== Fri Apr  7 06:52:15 EDT 2017 - Killing wedged 86941
=== Fri Apr  7 06:52:20 EDT 2017 - Starting new instance
=== Fri Apr  7 06:52:35 EDT 2017 - Killing wedged 27716
=== Fri Apr  7 06:52:41 EDT 2017 - Starting new instance
=== Fri Apr  7 10:25:00 EDT 2017 - Killing wedged 6064
=== Fri Apr  7 10:25:05 EDT 2017 - Starting new instance
=== Fri Apr  7 16:12:26 EDT 2017 - Starting new instance
=== Fri Apr  7 18:25:48 EDT 2017 - Killing wedged 31596
=== Fri Apr  7 18:25:53 EDT 2017 - Starting new instance
=== Fri Apr  7 18:25:53 EDT 2017 - Killing wedged 53833
=== Fri Apr  7 18:25:58 EDT 2017 - Starting new instance
=== Fri Apr  7 19:03:51 EDT 2017 - Starting new instance
=== Sat Apr  8 04:17:57 EDT 2017 - Killing wedged 13586
=== Sat Apr  8 04:18:02 EDT 2017 - Starting new instance
=== Sat Apr  8 10:26:05 EDT 2017 - Killing wedged 44051
=== Sat Apr  8 10:26:10 EDT 2017 - Starting new instance
=== Sat Apr  8 18:25:48 EDT 2017 - Killing wedged 54803
=== Sat Apr  8 18:25:53 EDT 2017 - Starting new instance
=== Sat Apr  8 18:26:43 EDT 2017 - Killing wedged 6268
=== Sat Apr  8 18:26:48 EDT 2017 - Starting new instance
=== Sun Apr  9 00:12:15 EDT 2017 - Starting new instance
=== Sun Apr  9 01:02:04 EDT 2017 - Starting new instance
=== Sun Apr  9 02:26:56 EDT 2017 - Killing wedged 29694
=== Sun Apr  9 02:27:02 EDT 2017 - Starting new instance
=== Sun Apr  9 05:43:05 EDT 2017 - Starting new instance
=== Sun Apr  9 10:27:05 EDT 2017 - Killing wedged 90493
=== Sun Apr  9 10:27:10 EDT 2017 - Starting new instance
=== Sun Apr  9 18:27:27 EDT 2017 - Killing wedged 3961
=== Sun Apr  9 18:27:32 EDT 2017 - Starting new instance
=== Mon Apr 10 02:28:55 EDT 2017 - Killing wedged 57180
=== Mon Apr 10 02:29:00 EDT 2017 - Starting new instance
=== Mon Apr 10 02:29:00 EDT 2017 - Killing wedged 30622
=== Mon Apr 10 02:29:06 EDT 2017 - Starting new instance
=== Mon Apr 10 05:20:56 EDT 2017 - Starting new instance
=== Mon Apr 10 10:37:45 EDT 2017 - Killing wedged 39613
=== Mon Apr 10 10:37:50 EDT 2017 - Starting new instance
=== Mon Apr 10 11:30:20 EDT 2017 - Starting new instance
=== Mon Apr 10 18:28:54 EDT 2017 - Killing wedged 40438
=== Mon Apr 10 18:28:59 EDT 2017 - Starting new instance
=== Mon Apr 10 18:29:19 EDT 2017 - Killing wedged 97568
=== Mon Apr 10 18:29:24 EDT 2017 - Starting new instance
=== Mon Apr 10 22:43:26 EDT 2017 - Starting new instance
=== Tue Apr 11 00:32:26 EDT 2017 - Starting new instance
=== Tue Apr 11 01:53:53 EDT 2017 - Starting new instance
=== Tue Apr 11 02:29:11 EDT 2017 - Killing wedged 48237
=== Tue Apr 11 02:29:16 EDT 2017 - Starting new instance
=== Tue Apr 11 10:29:54 EDT 2017 - Killing wedged 28162
=== Tue Apr 11 10:29:59 EDT 2017 - Starting new instance
=== Tue Apr 11 13:44:17 EDT 2017 - Starting new instance
=== Tue Apr 11 13:46:22 EDT 2017 - Starting new instance
=== Tue Apr 11 16:25:46 EDT 2017 - Starting new instance
=== Tue Apr 11 18:30:02 EDT 2017 - Killing wedged 1645
=== Tue Apr 11 18:30:07 EDT 2017 - Starting new instance
=== Tue Apr 11 18:32:17 EDT 2017 - Killing wedged 25645
=== Tue Apr 11 18:32:22 EDT 2017 - Starting new instance
=== Tue Apr 11 20:44:39 EDT 2017 - Starting new instance
=== Tue Apr 11 20:44:59 EDT 2017 - Starting new instance
=== Wed Apr 12 02:30:15 EDT 2017 - Killing wedged 45351

If you see consecutive Starting new instance, this means that the first is a panic. So out of 64 runs, 44 wedged, and 20 panicked.

Hrm, on second thought, the wedge detection should be inverted, i.e. count active threads and kill when they dip below, say, 4. Having idle threads is no big deal. Sorry for the noise.
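(For reference, a watchdog along those lines could look roughly like the sketch below. This is only my illustration, not the script that produced the log above: the one-minute poll, the thread-count threshold of 4, and the use of procstat -t to count threads are all assumptions.)

package main

import (
	"log"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

const minThreads = 4 // below this we assume the test process has wedged

// threadCount parses `procstat -t <pid>` output: one header line plus one
// line per thread. Returns -1 if procstat fails (e.g. the process is gone).
func threadCount(pid int) int {
	out, err := exec.Command("procstat", "-t", strconv.Itoa(pid)).Output()
	if err != nil {
		return -1
	}
	return len(strings.Split(strings.TrimSpace(string(out)), "\n")) - 1
}

func main() {
	for {
		log.Println("Starting new instance")
		cmd := exec.Command("./spawntest")
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		done := make(chan error, 1)
		go func() { done <- cmd.Wait() }()

		running := true
		for running {
			select {
			case <-done:
				// The test panicked or exited on its own; just restart it.
				running = false
			case <-time.After(time.Minute):
				if n := threadCount(cmd.Process.Pid); n >= 0 && n < minThreads {
					log.Println("Killing wedged", cmd.Process.Pid)
					cmd.Process.Kill()
					<-done
					running = false
				}
			}
		}
	}
}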
