## Issue Addressed
fixes lints from the last rust release
## Proposed Changes
Fix the lints, most of the lints by `clippy::question-mark` are false positives in the form of https://github.com/rust-lang/rust-clippy/issues/9518 so it's allowed for now
## Additional Info
## Issue Addressed
🐞 in which we don't actually unsubscribe from a random long lived subnet when it expires
## Proposed Changes
Remove code addressing a specific case in which we are subscribed to all subnets and handle the removal of the long lived subnet. I don't think the special case code is particularly important as, if someone is running with that many validators to be subscribed to all subnets, it should use `--subscribe-all-subnets` instead
## Additional Info
Noticed on some test nodes climbing bandwidth usage periodically (around 27hours, the time of subnet expirations) I'm running this code to test this does not happen anymore, but I think it should be good now
## Issue Addressed
NA
## Proposed Changes
This PR attempts to fix the following spurious CI failure:
```
---- store_tests::garbage_collect_temp_states_from_failed_block stdout ----
thread 'store_tests::garbage_collect_temp_states_from_failed_block' panicked at 'disk store should initialize: DBError { message: "Error { message: \"IO error: lock /tmp/.tmp6DcBQ9/cold_db/LOCK: already held by process\" }" }', beacon_node/beacon_chain/tests/store_tests.rs:59:10
```
I believe that some async task is taking a clone of the store and holding it in some other thread for a short time. This creates a race-condition when we try to open a new instance of the store.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
This PR removes duplicated block root computation.
Computing the `SignedBeaconBlock::canonical_root` has become more expensive since the merge as we need to compute the merke root of each transaction inside an `ExecutionPayload`.
Computing the root for [a mainnet block](https://beaconcha.in/slot/4704236) is taking ~10ms on my i7-8700K CPU @ 3.70GHz (no sha extensions). Given that our median seen-to-imported time for blocks is presently 300-400ms, removing a few duplicated block roots (~30ms) could represent an easy 10% improvement. When we consider that the seen-to-imported times include operations *after* the block has been placed in the early attester cache, we could expect the 30ms to be more significant WRT our seen-to-attestable times.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Fixes an issue introduced in #3574 where I erroneously assumed that a `crossbeam_channel` multiple receiver queue was a *broadcast* queue. This is incorrect, each message will be received by *only one* receiver. The effect of this mistake is these logs:
```
Sep 20 06:56:17.001 INFO Synced slot: 4736079, block: 0xaa8a…180d, epoch: 148002, finalized_epoch: 148000, finalized_root: 0x2775…47f2, exec_hash: 0x2ca5…ffde (verified), peers: 6, service: slot_notifier
Sep 20 06:56:23.237 ERRO Unable to validate attestation error: CommitteeCacheWait(RecvError), peer_id: 16Uiu2HAm2Jnnj8868tb7hCta1rmkXUf5YjqUH1YPj35DCwNyeEzs, type: "aggregated", slot: Slot(4736047), beacon_block_root: 0x88d318534b1010e0ebd79aed60b6b6da1d70357d72b271c01adf55c2b46206c1
```
## Additional Info
NA
## Proposed Changes
Improve the payload pruning feature in several ways:
- Payload pruning is now entirely optional. It is enabled by default but can be disabled with `--prune-payloads false`. The previous `--prune-payloads-on-startup` flag from #3565 is removed.
- Initial payload pruning on startup now runs in a background thread. This thread will always load the split state, which is a small fraction of its total work (up to ~300ms) and then backtrack from that state. This pruning process ran in 2m5s on one Prater node with good I/O and 16m on a node with slower I/O.
- To work with the optional payload pruning the database function `try_load_full_block` will now attempt to load execution payloads for finalized slots _if_ pruning is currently disabled. This gives users an opt-out for the extensive traffic between the CL and EL for reconstructing payloads.
## Additional Info
If the `prune-payloads` flag is toggled on and off then the on-startup check may not see any payloads to delete and fail to clean them up. In this case the `lighthouse db prune_payloads` command should be used to force a manual sweep of the database.
## Issue Addressed
https://github.com/ethereum/beacon-APIs/pull/222
## Proposed Changes
Update Lighthouse's randao verification API to match the `beacon-APIs` spec. We implemented the API before spec stabilisation, and it changed slightly in the course of review.
Rather than a flag `verify_randao` taking a boolean value, the new API uses a `skip_randao_verification` flag which takes no argument. The new spec also requires the randao reveal to be present and equal to the point-at-infinity when `skip_randao_verification` is set.
I've also updated the `POST /lighthouse/analysis/block_rewards` API to take blinded blocks as input, as the execution payload is irrelevant and we may want to assess blocks produced by builders.
## Additional Info
This is technically a breaking change, but seeing as I suspect I'm the only one using these parameters/APIs, I think we're OK to include this in a patch release.
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/3556
## Proposed Changes
Delete finalized execution payloads from the database in two places:
1. When running the finalization migration in `migrate_database`. We delete the finalized payloads between the last split point and the new updated split point. _If_ payloads are already pruned prior to this then this is sufficient to prune _all_ payloads as non-canonical payloads are already deleted by the head pruner, and all canonical payloads prior to the previous split will already have been pruned.
2. To address the fact that users will update to this code _after_ the merge on mainnet (and testnets), we need a one-off scan to delete the finalized payloads from the canonical chain. This is implemented in `try_prune_execution_payloads` which runs on startup and scans the chain back to the Bellatrix fork or the anchor slot (if checkpoint synced after Bellatrix). In the case where payloads are already pruned this check only imposes a single state load for the split state, which shouldn't be _too slow_. Even so, a flag `--prepare-payloads-on-startup=false` is provided to turn this off after it has run the first time, which provides faster start-up times.
There is also a new `lighthouse db prune_payloads` subcommand for users who prefer to run the pruning manually.
## Additional Info
The tests have been updated to not rely on finalized payloads in the database, instead using the `MockExecutionLayer` to reconstruct them. Additionally a check was added to `check_chain_dump` which asserts the non-existence or existence of payloads on disk depending on their slot.
## Issue Addressed
NA
## Proposed Changes
I have observed scenarios on Goerli where Lighthouse was receiving attestations which reference the same, un-cached shuffling on multiple threads at the same time. Lighthouse was then loading the same state from database and determining the shuffling on multiple threads at the same time. This is unnecessary load on the disk and RAM.
This PR modifies the shuffling cache so that each entry can be either:
- A committee
- A promise for a committee (i.e., a `crossbeam_channel::Receiver`)
Now, in the scenario where we have thread A and thread B simultaneously requesting the same un-cached shuffling, we will have the following:
1. Thread A will take the write-lock on the shuffling cache, find that there's no cached committee and then create a "promise" (a `crossbeam_channel::Sender`) for a committee before dropping the write-lock.
1. Thread B will then be allowed to take the write-lock for the shuffling cache and find the promise created by thread A. It will block the current thread waiting for thread A to fulfill that promise.
1. Thread A will load the state from disk, obtain the shuffling, send it down the channel, insert the entry into the cache and then continue to verify the attestation.
1. Thread B will then receive the shuffling from the receiver, be un-blocked and then continue to verify the attestation.
In the case where thread A fails to generate the shuffling and drops the sender, the next time that specific shuffling is requested we will detect that the channel is disconnected and return a `None` entry for that shuffling. This will cause the shuffling to be re-calculated.
## Additional Info
NA
## Issue Addressed
#3285
## Proposed Changes
Adds support for specifying histogram with buckets and adds new metric buckets for metrics mentioned in issue.
## Additional Info
Need some help for the buckets.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
We were unable to update lighthouse by running `cargo update` because some of the `mev-build-rs` deps weren't pinned. But `mev-build-rs` is now pinned here and includes it's own pinned commits for `ssz-rs` and `etheruem-consensus`
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
Closes#3514
## Proposed Changes
- Change default monitoring endpoint frequency to 120 seconds to fit with 30k requests/month limit.
- Allow configuration of the monitoring endpoint frequency using `--monitoring-endpoint-frequency N` where `N` is a value in seconds.
## Issue Addressed
Resolves https://github.com/sigp/lighthouse/issues/3517
## Proposed Changes
Adds a `--builder-profit-threshold <wei value>` flag to the BN. If an external payload's value field is less than this value, the local payload will be used. The value of the local payload will not be checked (it can't really be checked until the engine API is updated to support this).
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
Add a flag that can increase count unrealized strictness, defaults to false
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: sean <seananderson33@gmail.com>
## Issue Addressed
When requesting an index which is not active during `start_epoch`, Lighthouse returns:
```
curl "http://localhost:5052/lighthouse/analysis/attestation_performance/999999999?start_epoch=100000&end_epoch=100000"
```
```json
{
"code": 500,
"message": "INTERNAL_SERVER_ERROR: ParticipationCache(InvalidValidatorIndex(999999999))",
"stacktraces": []
}
```
This error occurs even when the index in question becomes active before `end_epoch` which is undesirable as it can prevent larger queries from completing.
## Proposed Changes
In the event the index is out-of-bounds (has not yet been activated), simply return all fields as `false`:
```
-> curl "http://localhost:5052/lighthouse/analysis/attestation_performance/999999999?start_epoch=100000&end_epoch=100000"
```
```json
[
{
"index": 999999999,
"epochs": {
"100000": {
"active": false,
"head": false,
"target": false,
"source": false
}
}
}
]
```
By doing this, we cover the case where a validator becomes active sometime between `start_epoch` and `end_epoch`.
## Additional Info
Note that this error only occurs for epochs after the Altair hard fork.
## Issue Addressed
We currently subscribe to attestation subnets as soon as the subscription arrives (one epoch in advance), this makes it so that subscriptions for future slots are scheduled instead of done immediately.
## Proposed Changes
- Schedule subscriptions to subnets for future slots.
- Finish removing hashmap_delay, in favor of [delay_map](https://github.com/AgeManning/delay_map). This was the only remaining service to do this.
- Subscriptions for past slots are rejected, before we would subscribe for one slot.
- Add a new test for subscriptions that are not consecutive.
## Additional Info
This is also an effort in making the code easier to understand
## Issue Addressed
NA
## Proposed Changes
As we've seen on Prater, there seems to be a correlation between these messages
```
WARN Not enough time for a discovery search subnet_id: ExactSubnet { subnet_id: SubnetId(19), slot: Slot(3742336) }, service: attestation_service
```
... and nodes falling 20-30 slots behind the head for short periods. These nodes are running ~20k Prater validators.
After running some metrics, I can see that the `network_recv` channel is processing ~250k `AttestationSubscribe` messages per minute. It occurred to me that perhaps the `AttestationSubscribe` messages are "washing out" the `SendRequest` and `SendResponse` messages. In this PR I separate the `AttestationSubscribe` and `SyncCommitteeSubscribe` messages into their own queue so the `tokio::select!` in the `NetworkService` can still process the other messages in the `network_recv` channel without necessarily having to clear all the subscription messages first.
~~I've also added filter to the HTTP API to prevent duplicate subscriptions going to the network service.~~
## Additional Info
- Currently being tested on Prater
## Issue Addressed
Partly resolves#3518
## Proposed Changes
Change the slot notifier to use `duration_to_next_slot` rather than an interval timer. This makes it robust against underlying clock changes.
## Issue Addressed
NA
## Proposed Changes
Adds more `debug` logging to help troubleshoot invalid execution payload blocks. I was doing some of this recently and found it to be challenging.
With this PR we should be able to grep `Invalid execution payload` and get one-liners that will show the block, slot and details about the proposer.
I also changed the log in `process_invalid_execution_payload` since it was a little misleading; the `block_root` wasn't necessary the block which had an invalid payload.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
This PR is motivated by a recent consensus failure in Geth where it returned `INVALID` for an `VALID` block. Without this PR, the only way to recover is by re-syncing Lighthouse. Whilst ELs "shouldn't have consensus failures", in reality it's something that we can expect from time to time due to the complex nature of Ethereum. Being able to recover easily will help the network recover and EL devs to troubleshoot.
The risk introduced with this PR is that genuinely INVALID payloads get a "second chance" at being imported. I believe the DoS risk here is negligible since LH needs to be restarted in order to re-process the payload. Furthermore, there's no reason to think that a well-performing EL will accept a truly invalid payload the second-time-around.
## Additional Info
This implementation has the following intricacies:
1. Instead of just resetting *invalid* payloads to optimistic, we'll also reset *valid* payloads. This is an artifact of our existing implementation.
1. We will only reset payload statuses when we detect an invalid payload present in `proto_array`
- This helps save us from forgetting that all our blocks are valid in the "best case scenario" where there are no invalid blocks.
1. If we fail to revert the payload statuses we'll log a `CRIT` and just continue with a `proto_array` that *does not* have reverted payload statuses.
- The code to revert statuses needs to deal with balances and proposer-boost, so it's a failure point. This is a defensive measure to avoid introducing new show-stopping bugs to LH.
## Proposed Changes
This PR has two aims: to speed up attestation packing in the op pool, and to fix bugs in the verification of attester slashings, proposer slashings and voluntary exits. The changes are bundled into a single database schema upgrade (v12).
Attestation packing is sped up by removing several inefficiencies:
- No more recalculation of `attesting_indices` during packing.
- No (unnecessary) examination of the `ParticipationFlags`: a bitfield suffices. See `RewardCache`.
- No re-checking of attestation validity during packing: the `AttestationMap` provides attestations which are "correct by construction" (I have checked this using Hydra).
- No SSZ re-serialization for the clunky `AttestationId` type (it can be removed in a future release).
So far the speed-up seems to be roughly 2-10x, from 500ms down to 50-100ms.
Verification of attester slashings, proposer slashings and voluntary exits is fixed by:
- Tracking the `ForkVersion`s that were used to verify each message inside the `SigVerifiedOp`. This allows us to quickly re-verify that they match the head state's opinion of what the `ForkVersion` should be at the epoch(s) relevant to the message.
- Storing the `SigVerifiedOp` on disk rather than the raw operation. This allows us to continue track the fork versions after a reboot.
This is mostly contained in this commit 52bb1840ae5c4356a8fc3a51e5df23ed65ed2c7f.
## Additional Info
The schema upgrade uses the justified state to re-verify attestations and compute `attesting_indices` for them. It will drop any attestations that fail to verify, by the logic that attestations are most valuable in the few slots after they're observed, and are probably stale and useless by the time a node restarts. Exits and proposer slashings and similarly re-verified to obtain `SigVerifiedOp`s.
This PR contains a runtime killswitch `--paranoid-block-proposal` which opts out of all the optimisations in favour of closely verifying every included message. Although I'm quite sure that the optimisations are correct this flag could be useful in the event of an unforeseen emergency.
Finally, you might notice that the `RewardCache` appears quite useless in its current form because it is only updated on the hot-path immediately before proposal. My hope is that in future we can shift calls to `RewardCache::update` into the background, e.g. while performing the state advance. It is also forward-looking to `tree-states` compatibility, where iterating and indexing `state.{previous,current}_epoch_participation` is expensive and needs to be minimised.
## Issue Addressed
#3465
## Proposed Changes
Filter out any validator registrations for validators that are not `active` or `pending`. I'm adding this filtering the beacon node because all the information is readily available there. In other parts of the VC we are usually sending per-validator requests based on duties from the BN. And duties will only be provided for active validators so we don't have this type of filtering elsewhere in the VC.
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
#3032
## Proposed Changes
Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync
#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html) which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.
#### Pause Sync
Sync has four components, pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach since we don't need to deal with new sync states and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among those. Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.
#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pause range does is not sending batches for processing, resume makes all chains that are holding read-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.
## Additional Info
**QUESTION**: Originally I made this to notify and change on synced state, but @pawanjay176 on talks with @paulhauner concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up to date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review if the extra check can be removed
Next gen of #3094
Will work best with #3439
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
## Proposed Changes
Update the merge migration docs to encourage updating mainnet configs _now_!
The docs are also updated to recommend _against_ `--suggested-fee-recipient` on the beacon node (https://github.com/sigp/lighthouse/issues/3432).
Additionally the `--help` for the CLI is updated to match with a few small semantic changes:
- `--execution-jwt` is no longer allowed without `--execution-endpoint`. We've ended up without a default for `--execution-endpoint`, so I think that's fine.
- The flags related to the JWT are only allowed if `--execution-jwt` is provided.
## Issue Addressed
NA
## Proposed Changes
Bump versions to v3.0.0
## Additional Info
- ~~Blocked on #3439~~
- ~~Blocked on #3459~~
- ~~Blocked on #3463~~
- ~~Blocked on #3462~~
- ~~Requires further testing~~
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
Run fork choice when the head is 256 slots from the wall-clock slot, rather than 4.
The reason we don't *always* run FC is so that it doesn't slow us down during sync. As the comments state, setting the value to 256 means that we'd only have one interrupting fork-choice call if we were syncing at 20 slots/sec.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Adds some metrics so we can track payload status responses from the EE. I think this will be useful for troubleshooting and alerting.
I also bumped the `BecaonChain::per_slot_task` to `debug` since it doesn't seem too noisy and would have helped us with some things we were debugging in the past.
## Additional Info
NA
## Proposed Changes
Match the timeouts from the `execution-apis` spec. Our existing values were already quite close so I don't imagine this change to be very disruptive.
The spec sets the timeout for `engine_getPayloadV1` to only 1 second, but we were already using a longer value of 2 seconds. I've kept the 2 second timeout as I don't think there's any need to fail faster when producing a payload.
There's no timeout specified for `eth_syncing` so I've matched it to the shortest timeout from the spec (1 second). I think the previous value of 250ms was likely too low and could have been contributing to spurious timeouts, particularly for remote ELs.
## Additional Info
The timeouts are defined on each endpoint in this document: https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md
## Issue Addressed
Fixes an issue whereby syncing a post-merge network without an execution endpoint would silently stall. Sync swallows the errors from block verification so previously there was no indication in the logs for why the node couldn't sync.
## Proposed Changes
Add an error log to the merge-readiness notifier for the case where the merge has already completed but no execution endpoint is configured.
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2962
## Proposed Changes
Build all caches on the checkpoint state before storing it in the database.
Additionally, fix a bug in `signature_verify_chain_segment` which prevented block verification from succeeding unless the previous epoch cache was already built. The previous epoch cache is required to verify the signatures of attestations included from previous epochs, even when all the blocks in the segment are from the same epoch.
The comments around `signature_verify_chain_segment` have also been updated to reflect the fact that it should only be used on a chain of blocks from a single epoch. I believe this restriction had already been added at some point in the past and that the current comments were just outdated (and I think because the proposer shuffling can change in the next epoch based on the blocks applied in the current epoch that this limitation is essential).
## Issue Addressed
NA
## Proposed Changes
Start issuing merge-readiness logs 2 weeks before the Bellatrix fork epoch. Additionally, if the Bellatrix epoch is specified and the use has configured an EL, always log merge readiness logs, this should benefit pro-active users.
### Lookahead Reasoning
- Bellatrix fork is:
- epoch 144896
- slot 4636672
- Unix timestamp: `1606824023 + (4636672 * 12) = 1662464087`
- GMT: Tue Sep 06 2022 11:34:47 GMT+0000
- Warning start time is:
- Unix timestamp: `1662464087 - 604800 * 2 = 1661254487`
- GMT: Tue Aug 23 2022 11:34:47 GMT+0000
The [current expectation](https://discord.com/channels/595666850260713488/745077610685661265/1007445305198911569) is that EL and CL clients will releases out by Aug 22nd at the latest, then an EF announcement will go out on the 23rd. If all goes well, LH will start alerting users about merge-readiness just after the announcement.
## Additional Info
NA
## Proposed Changes
Enable multiple database backends for the slasher, either MDBX (default) or LMDB. The backend can be selected using `--slasher-backend={lmdb,mdbx}`.
## Additional Info
In order to abstract over the two library's different handling of database lifetimes I've used `Box::leak` to give the `Environment` type a `'static` lifetime. This was the only way I could think of using 100% safe code to construct a self-referential struct `SlasherDB`, where the `OpenDatabases` refers to the `Environment`. I think this is OK, as the `Environment` is expected to live for the life of the program, and both database engines leave the database in a consistent state after each write. The memory claimed for memory-mapping will be freed by the OS and appropriately flushed regardless of whether the `Environment` is actually dropped.
We are depending on two `sigp` forks of `libmdbx-rs` and `lmdb-rs`, to give us greater control over MDBX OS support and LMDB's version.
## Issue Addressed
N/A
## Proposed Changes
Fix clippy lints for latest rust version 1.63. I have allowed the [derive_partial_eq_without_eq](https://rust-lang.github.io/rust-clippy/master/index.html#derive_partial_eq_without_eq) lint as satisfying this lint would result in more code that we might not want and I feel it's not required.
Happy to fix this lint across lighthouse if required though.
## Issue Addressed
Solves #3390
So after checking some logs @pawanjay176 got, we conclude that this happened because we blacklisted a chain after trying it "too much". Now here, in all occurrences it seems that "too much" means we got too many download failures. This happened very slowly, exactly because the batch is allowed to stay alive for very long times after not counting penalties when the ee is offline. The error here then was not that the batch failed because of offline ee errors, but that we blacklisted a chain because of download errors, which we can't pin on the chain but on the peer. This PR fixes that.
## Proposed Changes
Adds a missing piece of logic so that if a chain fails for errors that can't be attributed to an objectively bad behavior from the peer, it is not blacklisted. The issue at hand occurred when new peers arrived claiming a head that had wrongfully blacklisted, even if the original peers participating in the chain were not penalized.
Another notable change is that we need to consider a batch invalid if it processed correctly but its next non empty batch fails processing. Now since a batch can fail processing in non empty ways, there is no need to mark as invalid previous batches.
Improves some logging as well.
## Additional Info
We should do this regardless of pausing sync on ee offline/unsynced state. This is because I think it's almost impossible to ensure a processing result will reach in a predictable order with a synced notification from the ee. Doing this handles what I think are inevitable data races when we actually pause sync
This also fixes a return that reports which batch failed and caused us some confusion checking the logs