8.7 KiB
Delving into the unknown
This write-up summarises how to debug what appears to be a mischievous Lotus instance during our Testground tests. It also goes enumerates which assets are useful to report suspicious behaviours upstream, in a way that they are actionable.
Querying the Lotus RPC API
The local:docker
and cluster:k8s
map ports that you specify in the
composition.toml, so you can access them externally.
All our compositions should carry this fragment:
[global.run_config]
exposed_ports = { pprof = "6060", node_rpc = "1234", miner_rpc = "2345" }
This tells Testground to expose the following ports:
6060
=> Go pprof.1234
=> Lotus full node RPC.2345
=> Lotus storage miner RPC.
local:docker
-
Install the
lotus
binary on your host. -
Find the container that you want to connect to in
docker ps
.- Note that our container names are slightly long, and they're the last
field on every line, so if your terminal is wrapping text, the port
numbers will end up ABOVE the friendly/recognizable container name (e.g.
tg-lotus-soup-deals-e2e-acfc60bc1727-miners-1
). - The testground output displays the container ID inside coloured angle
brackets, so if you spot something spurious in a particular node, you can
hone in on that one, e.g.
<< 54dd5ad916b2 >>
.
⟩ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 54dd5ad916b2 be3c18d7f0d4 "/testplan" 10 seconds ago Up 8 seconds 0.0.0.0:32788->1234/tcp, 0.0.0.0:32783->2345/tcp, 0.0.0.0:32773->6060/tcp, 0.0.0.0:32777->6060/tcp tg-lotus-soup-deals-e2e-acfc60bc1727-clients-2 53757489ce71 be3c18d7f0d4 "/testplan" 10 seconds ago Up 8 seconds 0.0.0.0:32792->1234/tcp, 0.0.0.0:32790->2345/tcp, 0.0.0.0:32781->6060/tcp, 0.0.0.0:32786->6060/tcp tg-lotus-soup-deals-e2e-acfc60bc1727-clients-1 9d3e83b71087 be3c18d7f0d4 "/testplan" 10 seconds ago Up 8 seconds 0.0.0.0:32791->1234/tcp, 0.0.0.0:32789->2345/tcp, 0.0.0.0:32779->6060/tcp, 0.0.0.0:32784->6060/tcp tg-lotus-soup-deals-e2e-acfc60bc1727-clients-0 7bd60e75ed0e be3c18d7f0d4 "/testplan" 10 seconds ago Up 8 seconds 0.0.0.0:32787->1234/tcp, 0.0.0.0:32782->2345/tcp, 0.0.0.0:32772->6060/tcp, 0.0.0.0:32776->6060/tcp tg-lotus-soup-deals-e2e-acfc60bc1727-miners-1 dff229d7b342 be3c18d7f0d4 "/testplan" 10 seconds ago Up 9 seconds 0.0.0.0:32778->1234/tcp, 0.0.0.0:32774->2345/tcp, 0.0.0.0:32769->6060/tcp, 0.0.0.0:32770->6060/tcp tg-lotus-soup-deals-e2e-acfc60bc1727-miners-0 4cd67690e3b8 be3c18d7f0d4 "/testplan" 11 seconds ago Up 8 seconds 0.0.0.0:32785->1234/tcp, 0.0.0.0:32780->2345/tcp, 0.0.0.0:32771->6060/tcp, 0.0.0.0:32775->6060/tcp tg-lotus-soup-deals-e2e-acfc60bc1727-bootstrapper-0 aeb334adf88d iptestground/sidecar:edge "testground sidecar …" 43 hours ago Up About an hour 0.0.0.0:32768->6060/tcp testground-sidecar c1157500282b influxdb:1.8 "/entrypoint.sh infl…" 43 hours ago Up 25 seconds 0.0.0.0:8086->8086/tcp testground-influxdb 99ca4c07fecc redis "docker-entrypoint.s…" 43 hours ago Up About an hour 0.0.0.0:6379->6379/tcp testground-redis bf25c87488a5 bitnami/grafana "/run.sh" 43 hours ago Up 26 seconds 0.0.0.0:3000->3000/tcp testground-grafana cd1d6383eff7 goproxy/goproxy "/goproxy" 45 hours ago Up About a minute 8081/tcp testground-goproxy
- Note that our container names are slightly long, and they're the last
field on every line, so if your terminal is wrapping text, the port
numbers will end up ABOVE the friendly/recognizable container name (e.g.
-
Take note of the port mapping. Imagine in the output above, we want to query
54dd5ad916b2
. We'd uselocalhost:32788
, as it forwards to the container's 1234 port (Lotus Full Node RPC). -
Run your Lotus CLI command setting the
FULLNODE_API_INFO
env variable, which is a multiaddr:$ FULLNODE_API_INFO=":/ip4/127.0.0.1/tcp/$port/http" lotus chain list [...]
Alternatively, you could download gawk and setup a script in you .bashrc or .zshrc similar to:
lprt() {
NAME=$1
PORT=$2
docker ps --format "table {{.Names}}" | grep $NAME | xargs -I {} docker port {} $PORT | gawk --field-separator=":" '{print $2}'
}
envs() {
NAME=$1
local REMOTE_PORT_1234=$(lprt $NAME 1234)
local REMOTE_PORT_2345=$(lprt $NAME 2345)
export FULLNODE_API_INFO=":/ip4/127.0.0.1/tcp/$REMOTE_PORT_1234/http"
export STORAGE_API_INFO=":/ip4/127.0.0.1/tcp/$REMOTE_PORT_2345/http"
echo "Setting \$FULLNODE_API_INFO to $FULLNODE_API_INFO"
echo "Setting \$STORAGE_API_INFO to $STORAGE_API_INFO"
}
Then call commands like:
envs miners-0
lotus chain list
cluster:k8s
Similar to local:docker
, you pick a pod that you want to connect to and port-forward 1234 and 2345 to that specific pod, such as:
export PODNAME="tg-lotus-soup-ae620dfb2e19-miners-0"
kubectl port-forward pods/$PODNAME 1234:1234 2345:2345
export FULLNODE_API_INFO=":/ip4/127.0.0.1/tcp/1234/http"
export STORAGE_API_INFO=":/ip4/127.0.0.1/tcp/2345/http"
lotus-storage-miner storage-deals list
lotus-storage-miner storage-deals get-ask
Useful commands / checks
- Making sure miners are on the same chain: compare outputs of
lotus chain list
. - Checking deals:
lotus client list-deals
. - Sector queries:
lotus-storage-miner info
,lotus-storage-miner proving info
- Sector sealing errors:
STORAGE_API_INFO=":/ip4/127.0.0.1/tcp/53624/http" FULLNODE_API_INFO=":/ip4/127.0.0.1/tcp/53623/http" lotus-storage-miner sector info
STORAGE_API_INFO=":/ip4/127.0.0.1/tcp/53624/http" FULLNODE_API_INFO=":/ip4/127.0.0.1/tcp/53623/http" lotus-storage-miner sector status <sector_no>
STORAGE_API_INFO=":/ip4/127.0.0.1/tcp/53624/http" FULLNODE_API_INFO=":/ip4/127.0.0.1/tcp/53623/http" lotus-storage-miner sector status --log <sector_no>
Viewing logs of a particular container local:docker
This works for both started and stopped containers. Just get the container ID (in double angle brackets in Testground output, on every log line), and do a:
$ docker logs $container_id
Accessing the golang instrumentation
Testground exposes a pprof endpoint under local port 6060, which both
local:docker
and cluster:k8s
map.
For local:docker
, see above to figure out which host port maps to the
container's 6060 port.
Acquiring a goroutine dump
When things appear to be stuck, get a goroutine dump.
$ wget -o goroutine.out http://localhost:${pprof_port}/debug/pprof/goroutine?debug=2
You can use whyrusleeping/stackparse to extract a summary:
$ go get https://github.com/whyrusleeping/stackparse
$ stackparse --summary goroutine.out
Acquiring a CPU profile
When the CPU appears to be spiking/rallying, grab a CPU profile.
$ wget -o profile.out http://localhost:${pprof_port}/debug/pprof/profile
Analyse it using go tool pprof
. Usually, generating a png
graph is useful:
$ go tool pprof profile.out
File: testground
Type: cpu
Time: Jul 3, 2020 at 12:00am (WEST)
Duration: 30.07s, Total samples = 2.81s ( 9.34%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) png
Generating report in profile003.png
Submitting actionable reports / findings
This is useful both internally (within the Oni team, so that peers can help) and externally (when submitting a finding upstream).
We don't need to play the full bug-hunting game on Lotus, but it's tremendously useful to provide the necessary data so that any reports are actionable.
These include:
- test outputs (use
testground collect
). - stack traces that appear in logs (whether panics or not).
- output of relevant Lotus CLI commands.
- if this is some kind of blockage / deadlock, goroutine dumps.
- if this is a CPU hotspot, a CPU profile would be useful.
- if this is a memory issue, a heap dump would be useful.
When submitting bugs upstream (Lotus), make sure to indicate:
- Lotus commit.
- FFI commit.