Oddbean

▲ ▼

 A Bitcoin node will detect bugs in the matrix

since I'm not seeing any serious memory or disk issues, and assuming there are no bugs in bitcoin-core, I wonder if my bitcoin node is getting hit by cosmic rays and bits are getting flipped every now and then, leading to leveldb checksum errors.

I looked into ECC RAM, but I may need to build a home server, since it's not common on desktop motherboards.

Is this why you’ve been going on about ECC memory @semisol? I found a post about it from Greg Maxwell 10 years ago as well, saying all his non-laptop machines use ECC memory:

https://www.reddit.com/r/Bitcoin/comments/2jpk54/risks_of_running_bitcoin_client_on_a_computer/cle3qyb/ nostr:note1ktns7scrqr00h9a400eajn8k23hcxzzp35syfr7j4tvzjkdpjjdsj4z0sf

▲ ▼

 I run ECC RAM on my hard nodes

▲ ▼

 gonna have to setup a server rack in my house @Alex Gleason 🐍🚬

▲ ▼

 i haven't had a rack in my house in a decade... i could be persuaded though...

▲ ▼

 another water heater… :)

▲ ▼

 ryzen supports ecc

▲ ▼

 Need motherboard support though which is rare for consumer motherboards

▲ ▼

 I read up you need a ryzen pro apu

▲ ▼

 I have bought 2 random ASUS boards, one meant for “professional use” and one meant to be lower end (B)

Both work

▲ ▼

 Can you link?

▲ ▼

 Will do in a bit

▲ ▼

 It is detected as ECC too.

G series CPUs don’t support it. Don’t ask how I know (I tried to add ECC RAM to a system with one and it wasn’t detected as ECC).

▲ ▼

 It is supported but not “validated” (don’t blame us if it doesn’t work) for consumer platforms

▲ ▼

 /r/homelab sends their regards

Doesn't need to be a power hungry monster homelab though, IIRC some of the newer Lenovo thinkcentre tiny nodes support ECC and have two NVME slots

▲ ▼

 In many years I've never had a Bitcoin Core leveldb corruption without a non-RAID disk failure (bad sectors)

▲ ▼

 Yeah time to get serious about this. Gonna do a proper zfs raid and ecc setup

▲ ▼

 Yes.

I have never had a corrupt DB with ECC memory, even with a few hard power cuts. I use RAID 1, though without checksumming (mdadm), so disk issues, if any, should surface about 50% of the time.

▲ ▼

 You’re more likely hitting hardware issues that only crop up when running the machine hot. Try running y-cruncher and memtest. Similarly try testing your disk (don’t know any applications that test if it corrupts at high rate, I know they exist tho)

▲ ▼

 Maybe but i’m never really running this machine hot

▲ ▼

 and its water cooled

▲ ▼

 i already tried to convince Will of this a few days ago, don't bother
nostr:nevent1qvzqqqqqqypzqz4rnedwlxdqqznmmv95ny2cey4uf23qldjexxjj6p2mt6mdlaecqy28wumn8ghj7mn0wd68ytncxpnzummjvuhszythwden5te0dehhxarj9emkjmn99uqzplvg592v9d62fx4qc3zsuv2h2sml4x98kenrm4hf3e4kekfr0t3m02ylks

▲ ▼

 I have been running many tests since then, nothing has come up yet. Started documenting my experience here:

https://github.com/bitcoin/bitcoin/issues/30692#issuecomment-2450235448

Give me some time 🙄

▲ ▼

 I also noticed i had a pretty small dbcache which could have possibly increased the frequency at which these errors might occur due to frequent flushes.

▲ ▼

 Hit it again… running memtest now. Will see 🤔 https://i.nostr.build/3nHp1Ev3iLaEzzrS.jpg

▲ ▼

 my brain is fried from election nonsense.  read this as 'memetest'.  I will show myself out now.

▲ ▼

 Nothing yet… running hot as well. https://i.nostr.build/ERa0xSlAk8elJIQv.jpg

▲ ▼

 I have two nodes. About 7 days ago, one started crashing every 15-20 hours. That one’s on a raspberry pi 4 with 1tb ssd. The errors are always i/o related. df -h says I have 147gb available. But thinking the ssd is going down. The other node is on Debian and hasn’t had any problems.

▲ ▼

 yeah I thought it was my drive so I got a new ssd but still having issues. ran tests on that and didn't find anything. trying memtest now.

▲ ▼

 i'd still be kind of interested in a copy of the corrupted state to see what kind of glitch this is

▲ ▼

 it just happened again so I can send it

▲ ▼

 i assume it's another corruption in the chainstate ?
most useful would be to zip up the entire chainstate dir i think (i'm not sure just the one corrupted ldb file will tell much)
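
A minimal sketch of zipping up the whole chainstate dir, as suggested above. The function name `archive_chainstate` and the output name are hypothetical, and it assumes the node has been stopped first so the leveldb files are not changing while they are copied:

```python
import shutil
from pathlib import Path


def archive_chainstate(datadir, out_base):
    """Zip <datadir>/chainstate into <out_base>.zip and return the archive path.

    Assumption: bitcoind is not running, so the files are in a stable state.
    """
    chainstate = Path(datadir) / "chainstate"
    if not chainstate.is_dir():
        raise FileNotFoundError(f"no chainstate directory under {datadir}")
    # root_dir/base_dir keep the "chainstate/" prefix inside the archive,
    # so every .ldb, MANIFEST and LOG file is preserved with its name.
    return shutil.make_archive(
        str(out_base), "zip", root_dir=str(datadir), base_dir="chainstate"
    )
```

For example, `archive_chainstate(Path.home() / ".bitcoin", "corrupt-chainstate")` would cover a default Linux datadir; adjust the path if the node uses a custom one. The chainstate is several gigabytes, so expect this to take a while.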

▲ ▼

 yeah chainstate again

▲ ▼

 not sure what to test next https://i.nostr.build/uQ19fvhzp9LcK1rc.jpg https://i.nostr.build/DtM2hiS379ZZhFhP.jpg https://i.nostr.build/MXBTUyRSN5dENt9v.jpg

▲ ▼

 what are you using for storage?

▲ ▼

 was using a hdd, switched to a 4tb ssd recently. I suspect it was due to a low dbcache (32) value but not sure yet

▲ ▼

 did not see until after that you swapped storage. if you do in fact have an intermittent hardware problem the best way in my experience is to swap parts where you can. since you have the storage out of the way, if you have multiple memory modules I would run one at a time and see how that turns out

▲ ▼

 You may want to keep an eye on uptime to be sure something isn’t hard power cycling the host, and dmesg to be sure the bitcoind process isn’t getting OOM killed. 

I’ve had issues with the latter on an RPi4. JSON-RPC calls from Fulcrum or Electrs syncs also seem to be able to use an unbounded amount of memory…

I think ungraceful stops of the bitcoind process at the right time can cause the chainstate DB to come unsynced from the blockchain DB in ways that require reindexing. The really bad throughput of the RPi4’s USB3 ports for block storage may exacerbate this
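
A rough sketch of the dmesg check mentioned above: scan kernel log text for OOM-killer lines that name bitcoind. The helper names are hypothetical, and the regex assumes the usual Linux "Killed process <pid> (<name>)" wording, which can vary between kernel versions:

```python
import re
import subprocess

# Assumed kernel wording, e.g. "Out of memory: Killed process 1234 (bitcoind) ..."
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text, process_name="bitcoind"):
    """Return (pid, line) pairs for OOM kills of process_name in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.strip()))
    return hits


def read_kernel_log():
    """Return the output of `dmesg` (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return result.stdout
```

Calling `find_oom_kills(read_kernel_log())` after a crash gives an empty list if the kernel never killed the process. The kernel ring buffer is cleared on reboot, so this only covers the current boot; comparing against `uptime` tells you whether the host itself restarted.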

▲ ▼

 This machine is on a UPS and doesn’t ever randomly restart. It’s never unstable. I don’t run it hot. If it can get corrupted from ungraceful stops then that’s a bug.

▲ ▼

 56C isn’t hot? Memtest generally doesn’t get a CPU hot, just a bit warm.

▲ ▼

 The Udoo Bolt Gear mini PC supports ECC RAM, that’s why I got one 3 years ago and it’s been running non stop ever since. Silent, powerful. Not super cheap but a reasonable price for what you get

▲ ▼

 You are probably experiencing memory corruption or something similar without noticing it.

The only times I've ever had a leveldb database get corrupted were due to bad memory. I spent a fair bit of ₿ upgrading my desktop to ECC memory a few years ago due to a run-in with bad memory that corrupted files.

▲ ▼

 I’ve had many bad memory issues in the past and it usually leads to system instability. If it is memory it must be a very minor issue that somehow doesn’t cause anything else to crash.

▲ ▼

 It's very easy for minor memory issues to result in disk corruption rather than overall system instability. You just need something like a single bad bit that only shows up sometimes, eg while hot as nostr:nprofile1qqsr6tj32zrfn7v0pu4aheaytdnnc6rluepq73ndc2tdjzus34gat9qpz4mhxue69uhhyetvv9ujuerpd46hxtnfduhswulwwv pointed out.

▲ ▼

 Hmm will run more tests to confirm

▲ ▼

 Write a memory test program that stores 4092 bytes and a CRC

▲ ▼

 memtest86 does this more thoroughly… ive always used that in the past

▲ ▼

 You want real workloads

▲ ▼

 Yeah but its harder to test every physical address, isn’t that the point

▲ ▼

 Due to interleaving and similar you already mostly achieve that

▲ ▼

 Memtest doesn’t tend to get your CPU hot, though. Different things can fail at different utilization levels…

▲ ▼

 There was that prime-something program i remember using a long time ago for that, not sure if there are more modern solutions

▲ ▼

 ycruncher?

▲ ▼

 I remember prime95 but maybe that was like 20 years ago

▲ ▼

 damn im old

▲ ▼

 whatever type of error you are having must be a more widespread issue than a few pages to be triggered so often

the ideal program would have a small pool it nonstop allocates to and deallocates from and a pool it very slowly checks and rotates allocations in/out

write random data and CRC it as I said

you then want to stop the process and get a debugger if there’s a mismatch and see the physical location along with identifying the RAM module
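
A minimal sketch of the test described above, assuming the CRC is a 4-byte CRC32 so that 4092 bytes of data plus the checksum fill one 4096-byte page. All names are hypothetical. It only shows the write-random-data-and-CRC part with a fast pool and a slowly rotated pool; Python cannot report the physical address of a mismatch, so the debugger step is not covered:

```python
import os
import zlib

BLOCK_SIZE = 4092  # payload bytes; plus a 4-byte CRC32 = 4096 (assumed page size)


def make_block():
    """Allocate one block: random payload followed by its CRC32."""
    payload = os.urandom(BLOCK_SIZE)
    return bytearray(payload + zlib.crc32(payload).to_bytes(4, "little"))


def block_ok(block):
    """Recompute the CRC32 of the payload and compare with the stored one."""
    stored = int.from_bytes(block[BLOCK_SIZE:], "little")
    return zlib.crc32(bytes(block[:BLOCK_SIZE])) == stored


def scan(pool):
    """Return the indexes of blocks whose contents no longer match their CRC."""
    return [i for i, block in enumerate(pool) if not block_ok(block)]


def run(rounds, slow_blocks=1024, fast_blocks=64):
    """Churn a small pool every round, rotate the large pool one block per round.

    Returns the total number of CRC mismatches seen.
    """
    slow = [make_block() for _ in range(slow_blocks)]
    mismatches = 0
    for r in range(rounds):
        # Fast pool: allocated, checked and dropped again every round.
        fast = [make_block() for _ in range(fast_blocks)]
        mismatches += len(scan(fast))
        # Slow pool: checked in full, then one block is rotated out.
        mismatches += len(scan(slow))
        slow[r % slow_blocks] = make_block()
    return mismatches
```

Any non-zero result from `run(...)` on a long run would point at memory (or CPU) trouble. To cover a meaningful share of RAM the slow pool would need to be far larger than the default here, and an interpreter adds its own overhead, so treat this as an illustration of the idea rather than a replacement for memtest86.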

▲ ▼

 Prime95 will definitely get a CPU toasty, but you would need to ensure it's running with a mix of large and small FFT and for a while (e.g. hours/days). Small FFT maximizes CPU heat, but isn't the best for finding instabilities working with RAM. Large FFT helps there. In general, I find the overclocking community sometimes works on "vibes" rather than completely proven test methodologies, so take this with a grain of salt.

▲ ▼

 not that you're overclocking, but that group tends to accumulate tribal knowledge of CPU/RAM stability tests. Hardware can be "fun". A ton of variables. Even things like bios versions can cause instabilities.