* swarm/network: fix data races in TestInitialPeersMsg test
* swarm/network: add Kademlia.Saturation method with lock
* swarm/network: add Hive.Peer method to safely retrieve a bzz peer
* swarm/network: remove duplicate comment
* p2p/testing: prevent goroutine leak in ProtocolTester
* swarm/network: fix data race in newBzzBaseTesterWithAddrs
* swarm/network: fix goroutine leaks in testInitialPeersMsg
* swarm/network: raise number of peer check attempts in testInitialPeersMsg
* swarm/network: use Hive.Peer in Hive.PeerInfo function
* swarm/network: reduce the scope of mutex lock in newBzzBaseTesterWithAddrs
* swarm/storage: disable TestCleanIndex with race detector
* swarm/network: fix hive bug not sending shallow peers
- hive bug: needed shallow peers were not sent to nodes beyond connection's proximity order
- add extensive protocol exchange tests for initial subPeersMsg-peersMsg exchange
- modify bzzProtocolTester to allow pregenerated overlay addresses
* swarm/network: attempt to fix hive persistence test
* swarm/network: fix TestHiveStatePersistance (#1320)
* swarm/network: remove trace lines from the hive persistence test
* address PR review comments
* swarm/network: address PR comments on TestInitialPeersMsg
* eliminate *testing.T argument from bzz/hive protocoltesters
* add sorting (only runs in test code) on peersMsg payload
* add random (0 to MaxPeersPerPO) peers for each po
* add extra peers closer to pivot than control
* swarm/network: propagate span with ctx
* swarm/network: try to stop stream.send.request spans on time
* swarm/storage: add chunk ref as a log to netstore.fetcher span
These tests never ran, as the build tag excluded them from the CI
execution. As a result, the (dead) code got out of sync with other
parts of Swarm and the tests would not even compile anymore. => Removed.
resolves ethersphere/go-ethereum#1238
* swarm/storage: increase mget timeout in common_test.go
TestDbStoreCorrect_1k sometimes timed out with -race on Travis.
--- FAIL: TestDbStoreCorrect_1k (24.63s)
common_test.go:194: testStore failed: timed out after 10s
* swarm: remove unused vars from TestSnapshotSyncWithServer
nodeCount and chunkCount are returned from setupSim, and those are the
values we use.
* swarm: move race/norace helpers from stream to testutil
As we will need to use the flag in other packages, too.
* swarm: refactor TestSwarmNetwork case
Extract long running test cases for better visibility.
* swarm/network: skip TestSyncingViaGlobalSync with -race
As it panics on Travis:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7e351b]
* swarm: run TestSwarmNetwork with fewer nodes with -race
As otherwise we always get test failure with `network_test.go:374:
context deadline exceeded` even with raised `Timeout`.
* swarm/network: run TestDeliveryFromNodes with fewer nodes with -race
Test on Travis times out with 8 or more nodes if -race flag is present.
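The node-count reductions above all depend on the race/norace helper moved to testutil. A minimal sketch of how such a switch can be consumed, assuming a package-level constant `raceEnabled` and a helper `nodeCountFor` (both hypothetical names, not the testutil API) and illustrative node counts:

```go
package main

import "fmt"

// raceEnabled is a hypothetical stand-in for the race/norace helper.
// In the build-tag pattern it is declared in two files, one guarded by
// `//go:build race` (value true) and one by `//go:build !race`
// (value false), so `go test -race` flips it at compile time.
// Here it is simply hard-coded to false.
const raceEnabled = false

// nodeCountFor returns the number of simulated nodes to start.
// The concrete values are illustrative assumptions, not the counts
// used by the swarm tests.
func nodeCountFor(race bool) int {
	if race {
		// Stay below the 8 nodes at which the test timed out on Travis.
		return 4
	}
	return 16
}

func main() {
	fmt.Println("nodes:", nodeCountFor(raceEnabled)) // prints "nodes: 16"
}
```

Resolving the flag at compile time means the race-specific configuration costs nothing in regular test runs.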
* swarm/network: smaller node count for discovery tests with -race
TestDiscoveryPersistenceSimulationSimAdapter failed on Travis with
`-race` flag present. The failure was due to excessive memory usage,
coming from the CGO runtime. Using a smaller node count resolves the
issue.
=== RUN TestDiscoveryPersistenceSimulationSimAdapter
==7227==ERROR: ThreadSanitizer failed to allocate 0x80000 (524288) bytes of clock allocator (error code: 12)
FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:6976 "((0 && "unable to mmap")) != (0)" (0x0, 0x0)
FAIL github.com/ethereum/go-ethereum/swarm/network/simulations/discovery 804.826s
* swarm/network: run TestFileRetrieval with fewer nodes with -race
Otherwise we get a failure due to excessive memory usage, as the CGO
runtime cannot allocate more bytes.
=== RUN TestFileRetrieval
==7366==ERROR: ThreadSanitizer failed to allocate 0x80000 (524288) bytes of clock allocator (error code: 12)
FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:6976 "((0 && "unable to mmap")) != (0)" (0x0, 0x0)
FAIL github.com/ethereum/go-ethereum/swarm/network/stream 155.165s
* swarm/network: run TestRetrieval with fewer nodes with -race
Otherwise we get a failure due to excessive memory usage, as the CGO
runtime cannot allocate more bytes ("ThreadSanitizer failed to
allocate").
* swarm/network: skip flaky TestGetSubscriptionsRPC on Travis w/ -race
Test fails a lot with something like:
streamer_test.go:1332: Real subscriptions and expected amount don't match; real: 0, expected: 20
* swarm/storage: skip TestDB_SubscribePull* tests on Travis w/ -race
Travis just hangs...
ok github.com/ethereum/go-ethereum/swarm/storage/feed/lookup 1.307s
keepalive
keepalive
keepalive
or panics after a while.
Without these tests the race detector job is now stable. Let's
investigate these tests in a separate issue:
https://github.com/ethersphere/go-ethereum/issues/1245
* swarm/network: WIP: span request until delivery and put
* swarm/storage: Introduce new trace across a single fetcher's lifespan
* swarm/network: Put span ids for sendpriority in context value
* swarm: Add global span store in tracing
* swarm/tracing: Add context key constants
* swarm/tracing: Add comments
* swarm/storage: Remove redundant fix for filestore
* swarm/tracing: Elaborate constants comments
* swarm/network, swarm/storage, swarm/tracing: Minor cleanup
* swarm/network/stream: fix a goroutine leak in Registry
* swarm/network, swarm/network/stream: close Kademlia addr count and depth change chans
* swarm/network/stream: rename close channel to quit
* swarm/network/stream: fix sync between NewRegistry goroutine and Close method
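The leak fix, the `quit` rename and the NewRegistry/Close synchronization together amount to a common Go shutdown pattern. A minimal sketch, assuming a stripped-down, hypothetical `Registry` that only models its background goroutine (not the real stream.Registry):

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical, stripped-down stand-in for stream.Registry.
type Registry struct {
	quit chan struct{}  // closed by Close to stop the goroutine
	wg   sync.WaitGroup // lets Close wait for the goroutine to exit
}

// NewRegistry starts a goroutine that consumes events (for example,
// Kademlia depth changes) until Close is called or events is closed.
func NewRegistry(events <-chan int) *Registry {
	r := &Registry{quit: make(chan struct{})}
	r.wg.Add(1) // before `go`, so Close can never Wait before Add
	go func() {
		defer r.wg.Done()
		for {
			select {
			case e, ok := <-events:
				if !ok {
					return
				}
				_ = e // handle the event here
			case <-r.quit:
				return // without this case the goroutine would leak
			}
		}
	}()
	return r
}

// Close signals the goroutine and blocks until it has returned.
func (r *Registry) Close() {
	close(r.quit)
	r.wg.Wait()
}

func main() {
	events := make(chan int)
	r := NewRegistry(events)
	events <- 1 // processed by the background goroutine
	r.Close()
	fmt.Println("registry closed")
}
```

Calling `wg.Add` before starting the goroutine is what makes `Close` safe to call immediately after `NewRegistry` returns.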
* swarm/network: DRY out repeated giga comment
I do not necessarily agree with the way we wait for event propagation.
But I truly disagree with having duplicated giga comments.
* p2p/simulations: encapsulate Node.Up field so we avoid data races
The Node.Up field was accessed concurrently without "proper" locking.
There was a lock on Network and that was used sometimes to access
the field. Other times the locking was missed and we had
a data race.
For example: https://github.com/ethereum/go-ethereum/pull/18464
The case above was solved, but there were still intermittent/hard to
reproduce races. So let's solve the issue permanently.
resolves: ethersphere/go-ethereum#1146
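Hiding the flag behind its own lock means callers no longer need to remember to take `Network`'s lock. A minimal sketch, assuming a simplified `Node` with illustrative `Up`/`SetUp` accessors (the exact method set in p2p/simulations may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// Node is a simplified sketch; only the up flag is modeled.
type Node struct {
	upMu sync.RWMutex // guards up, independently of any Network lock
	up   bool         // unexported: only reachable through the accessors
}

// Up reports whether the node is running.
func (n *Node) Up() bool {
	n.upMu.RLock()
	defer n.upMu.RUnlock()
	return n.up
}

// SetUp updates the running state.
func (n *Node) SetUp(up bool) {
	n.upMu.Lock()
	defer n.upMu.Unlock()
	n.up = up
}

func main() {
	n := &Node{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); n.SetUp(true) }()
		go func() { defer wg.Done(); _ = n.Up() }()
	}
	wg.Wait()
	fmt.Println("up:", n.Up()) // up: true
}
```

Running this with `go run -race` reports no race, because every access goes through the mutex.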
* p2p/simulations: fix unmarshal of simulations.Node
Making Node.Up field private in 13292ee897e345045fbfab3bda23a77589a271c1
broke TestHTTPNetwork and TestHTTPSnapshot, because the default
UnmarshalJSON does not handle unexported fields.
Important: The fix is partial and not quite proper to my taste. But I
cut scope, as I think a proper fix may require a change to the current
serialization format. New ticket:
https://github.com/ethersphere/go-ethereum/issues/1177
* p2p/simulations: Add a sanity test case for Node.Config UnmarshalJSON
* p2p/simulations: revert back to defer Unlock() pattern for Network
It's a good pattern to call `defer Unlock()` right after `Lock()` so
(new) error cases won't forget to unlock. Let's get back to that pattern.
The pattern was abandoned in 85a79b3ad3,
while fixing a data race. That data race does not exist anymore,
since the Node.Up field got hidden behind its own lock.
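The `defer Unlock()` pattern guarantees that every return path releases the lock. A minimal sketch, assuming a simplified `Network` and a hypothetical `AddNode` method with an early error return:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Network is a simplified sketch; AddNode is a hypothetical method.
type Network struct {
	lock  sync.RWMutex
	nodes map[string]bool // node id -> up
}

// AddNode acquires the lock and defers the unlock immediately, so every
// return path, including error paths added later, releases the lock.
func (net *Network) AddNode(id string) error {
	net.lock.Lock()
	defer net.lock.Unlock()

	if _, exists := net.nodes[id]; exists {
		return errors.New("node already exists: " + id)
	}
	net.nodes[id] = false // new nodes start down
	return nil
}

func main() {
	net := &Network{nodes: make(map[string]bool)}
	fmt.Println(net.AddNode("a")) // <nil>
	fmt.Println(net.AddNode("a")) // node already exists: a
	// The error path above released the lock, so this call does not block.
	fmt.Println(net.AddNode("b")) // <nil>
}
```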
* p2p/simulations: consistent naming for Node.UnmarshalJSON test providers
* p2p/simulations: remove JSON annotation from private fields of Node
As unexported fields are not serialized.
* p2p/simulations: fix deadlock in Network.GetRandomDownNode()
Problem: GetRandomDownNode() locks -> getDownNodeIDs() ->
GetNodes() tries to lock -> deadlock
On Network type, unexported functions must assume that `net.lock`
is already acquired and should not call exported functions which
might try to lock again.
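The convention can be sketched as exported methods that lock once and unexported helpers that assume the lock is held. A minimal sketch, assuming a simplified `Network` whose method names mirror the ones in the message but whose bodies are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sort"
	"sync"
)

// Network is a simplified sketch: node id -> up flag.
type Network struct {
	lock  sync.RWMutex
	nodes map[string]bool
}

// GetNodes is exported, so it takes the lock itself.
func (net *Network) GetNodes() []string {
	net.lock.RLock()
	defer net.lock.RUnlock()
	return net.getNodes()
}

// getNodes is unexported and assumes net.lock is already held.
func (net *Network) getNodes() []string {
	ids := make([]string, 0, len(net.nodes))
	for id := range net.nodes {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic order
	return ids
}

// getDownNodeIDs assumes the lock is held, so it calls getNodes, not
// GetNodes: re-acquiring the lock here can deadlock (immediately with
// Lock, and with RLock as soon as a writer is waiting).
func (net *Network) getDownNodeIDs() []string {
	var down []string
	for _, id := range net.getNodes() {
		if !net.nodes[id] {
			down = append(down, id)
		}
	}
	return down
}

// GetRandomDownNode takes the lock once and only calls unexported helpers.
func (net *Network) GetRandomDownNode() (string, error) {
	net.lock.RLock()
	defer net.lock.RUnlock()
	down := net.getDownNodeIDs()
	if len(down) == 0 {
		return "", errors.New("no down nodes")
	}
	return down[rand.Intn(len(down))], nil
}

func main() {
	net := &Network{nodes: map[string]bool{"a": true, "b": false}}
	fmt.Println(net.GetNodes()) // [a b]
	id, err := net.GetRandomDownNode()
	fmt.Println(id, err) // b <nil>
}
```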
* p2p/simulations: ensure method conformity for Network
Connect* methods were moved to p2p/simulations.Network from
swarm/network/simulation. However these new methods did not follow
the pattern of Network methods, i.e., every exported method locks
the whole Network either for read or write.
* p2p/simulations: fix deadlock during network shutdown
`TestDiscoveryPersistenceSimulationSimAdapter` often got into deadlock.
The execution was stuck on two locks, i.e., `Kademlia.lock` and
`p2p/simulations.Network.lock`. The test reliably got stuck roughly
once in every 20 executions.
`Kademlia` was stuck in `Kademlia.EachAddr()` and `Network` in
`Network.Stop()`.
Solution: in `Network.Stop()` `net.lock` must be released before
calling `node.Stop()` as stopping a node (somehow - I did not find
the exact code path) causes `Network.InitConn()` to be called from
`Kademlia.SuggestPeer()` and that blocks on `net.lock`.
Related ticket: https://github.com/ethersphere/go-ethereum/issues/1223
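The fix follows a common shape: snapshot shared state under the lock, release it, then call code that may call back in. A minimal sketch, assuming simplified, hypothetical `Node`/`Network` types where stopping a node triggers an `InitConn` callback that needs `net.lock`:

```go
package main

import (
	"fmt"
	"sync"
)

// Node and Network are simplified, hypothetical stand-ins.
type Node struct {
	ID      string
	stopped bool
}

type Network struct {
	lock      sync.RWMutex
	nodes     []*Node
	initConns int // counts InitConn callbacks
}

// InitConn models a callback that needs the network lock (in the commit,
// Kademlia.SuggestPeer ends up calling Network.InitConn during shutdown).
func (net *Network) InitConn(id string) {
	net.lock.Lock()
	defer net.lock.Unlock()
	net.initConns++
}

// stop models node.Stop(): stopping a node may call back into the Network.
func (n *Node) stop(net *Network) {
	net.InitConn(n.ID)
	n.stopped = true
}

// Stop snapshots the node list under the lock, releases the lock, and only
// then stops the nodes. Holding net.lock across n.stop would deadlock,
// because InitConn tries to acquire the same lock.
func (net *Network) Stop() {
	net.lock.RLock()
	nodes := make([]*Node, len(net.nodes))
	copy(nodes, net.nodes)
	net.lock.RUnlock()

	for _, n := range nodes {
		n.stop(net)
	}
}

func main() {
	net := &Network{nodes: []*Node{{ID: "a"}, {ID: "b"}}}
	net.Stop()
	fmt.Println("InitConn calls:", net.initConns) // InitConn calls: 2
}
```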
* swarm/state: simplify if statement in DBStore.Put()
* p2p/simulations: remove faulty godoc from private function
The comment started with the wrong method name.
The method is simple and self explanatory. Also, it's private.
=> Let's just remove the comment.
* swarm/network: new saturation implementation
* swarm/network: re-added saturation func in Kademlia as it is used elsewhere
* swarm/network: saturation with higher MinBinSize
* swarm/network: PeersPerBin with depth check
* swarm/network: edited tests to pass new saturated check
* swarm/network: minor fix saturated check
* swarm/network/simulations/discovery: fixed renamed RPC call
* swarm/network: renamed to isSaturated and returns bool
* swarm/network: early depth check
* swarm/network: fix data race in stream.(*Peer).handleOfferedHashesMsg()
handleOfferedHashesMsg() contained a data race:
- read => in a goroutine, call to c.batchDone()
- write => in the main thread, write to c.sessionAt
c.batchDone() contained a call to c.AddInterval(), and AddInterval
had a value receiver. So on each c.AddInterval() call the whole client
struct got copied (read) while one of its fields was modified in
handleOfferedHashesMsg() (write).
fixes ethersphere/go-ethereum#1086
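Why a value receiver matters here: calling a value-receiver method copies the whole struct, which reads every field. A minimal sketch, assuming a hypothetical `client` with only the two relevant fields and an illustrative `addIntervalByValue` for contrast:

```go
package main

import "fmt"

// client is a hypothetical stand-in for the stream client; only the two
// fields relevant to the race are modeled.
type client struct {
	sessionAt uint64   // written by handleOfferedHashesMsg
	intervals []uint64 // updated by AddInterval
}

// addIntervalByValue has a value receiver: every call copies the whole
// struct, which reads every field, including sessionAt. Run concurrently
// with a write to sessionAt, that copy is a data race. The append also
// only changes the copy.
func (c client) addIntervalByValue(i uint64) {
	c.intervals = append(c.intervals, i)
}

// AddInterval has a pointer receiver: no copy is made, only the fields it
// touches are accessed, and the update is visible to the caller.
func (c *client) AddInterval(i uint64) {
	c.intervals = append(c.intervals, i)
}

func main() {
	c := &client{}
	c.addIntervalByValue(1)
	fmt.Println(len(c.intervals)) // 0: only the copy was modified
	c.AddInterval(2)
	fmt.Println(len(c.intervals)) // 1
}
```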
* swarm/network: simplify some trivial statements