Reading the forum, it looks like an increasing number of users are facing issues with the eMMC, and it is not clear whether any of these cases are caused by having run LXC or Nextcloud on the eMMC instead of on another storage medium.
Hence, a few questions:
What is the life expectancy of the eMMC in the TO, i.e. how many write/read operations can it sustain?
Is there any tool available in the TOS repo for monitoring the health/wear level of the eMMC, e.g. something similar to common NAND SSD monitoring tools?
Are apps other than LXC stressing the eMMC with extensive write operations, e.g. data collection (Pakon) or Sentinel?
If the eMMC is worn out (EOL), can it be de-soldered and replaced with another eMMC unit, or does it require a new mainboard altogether?
To be honest, the TO has so many connectors and options, even SFP, which hardly anyone uses, that I don't understand why we don't have a card reader for this. Anyone could then simply replace the card, or even put in a bigger one like 128 GB.
I would also need a tool like iotop, which is unfortunately not available on the TO. Does anyone know a similar tool/way to check disk usage (eMMC and NAS HDD)?
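Until something like iotop lands in the repo, a rough substitute is to sample the kernel's per-device I/O counters yourself. This is only a sketch: it sums field 7 of /sys/block/*/stat (sectors written since boot) across all block devices, and the 2-second window is just an example.

```shell
#!/bin/sh
# Estimate the current write rate by sampling sysfs I/O counters twice.
# Field 7 of /sys/block/<dev>/stat is "sectors written"; sectors are 512 bytes.
read_written() {
  total=0
  for stat in /sys/block/*/stat; do
    w=$(awk '{ print $7 }' "$stat")
    total=$(( total + w ))
  done
  echo "$total"
}

before=$(read_written)
sleep 2
after=$(read_written)
echo "bytes written in 2 s: $(( (after - before) * 512 ))"
```

This shows aggregate activity only; unlike iotop it cannot attribute writes to individual processes.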
Booting from the flash surely doesn't require too many writes, even with some updates… but perhaps it's not suitable for those who have already worn it out.
Thanks for the inputs, which though seem to be sliding off topic. Could we please stay on the eMMC and the questions posted, and not veer off to U-Boot and USB?
Whilst the TO data sheet does not specify the brand/type of the eMMC, I came across a hint that it might be an SK hynix eMMC. Is that correct?
The flash memory in the TO is (to my knowledge) SK hynix eMMC 4.5, for which the manufacturer specifies, at a density of 4 GB, 2.4 TB total bytes written before EOL.
eMMC on its own does not track write cycles the way standard drives do, and it has no mechanism like SMART. You can use mmc extcsd read /dev/mmcblk0 from the mmc-utils package to take a peek at the amount of used reserved blocks: that is the field EXT_CSD_DEVICE_LIFE_TIME_EST. The value is in ranges of ten percent, so 0x01 means 0%-10% of reserved blocks used, 0x02 means 10%-20%, and so on. When the amount of used reserved blocks is close to 100%, the NAND is pretty much at EOL.
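For illustration, that decoding can be scripted. The here-doc below stands in for real mmc extcsd read /dev/mmcblk0 output (the sample line and its 0x01 value are made up, and the field name assumes a recent mmc-utils); on a real router you would pipe the actual command instead.

```shell
#!/bin/sh
# Sample stands in for: mmc extcsd read /dev/mmcblk0
sample=$(cat <<'EOF'
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
EOF
)
# Pull out the hex value and translate it into its 10% wear bucket.
val=$(printf '%s\n' "$sample" | sed -n 's/.*LIFE_TIME_EST_TYP_A\]: //p')
low=$(( (val - 1) * 10 ))
high=$(( val * 10 ))
echo "estimated wear: ${low}%-${high}% of reserved blocks used"
```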
You should know that this is not linear. It is a secondary statistic and tells you nothing about overall wear. Common wear is an avalanche effect: nothing happens for a long time, and then all blocks start failing at once, meaning the percentage of used reserved blocks can start rising very quickly.
Note that you need TOS 4.0+ to see the required field. The version of mmc-utils in TOS 3.x is not capable of reading it.
We also have in the pipeline a tool called healthcheck that is intended as a monitoring tool for potential problems on the router.
None that we know about. There might be, and a user can easily create one just by misconfiguring some standard application. Nonetheless, all applications we know of that by design write to the FS are pointed to /srv, and the storage plugin can be used to mount /srv on external storage.
That is pretty much impossible. It is a 153-ball FBGA. I am not saying it can't be done, and you can find companies and individuals able to do it, but it is almost never worth it. Unless you know someone who can do this kind of repair cheaply, it is not worth it. And because you have to heat up the board to desolder and solder the chip, you also risk causing some other malfunction.
How about schnapps export, which seems pretty write-intensive? And also schnapps rollback and frequent medkit installations (when testing RC or 4.x, or testing tweaked settings)?
Bad Block Management mode [SEC_BAD_BLK_MGMNT]: 0x00
Does that mean Bad Block Management is turned off (assuming it would read 0x01 if turned on)? And if so, would turning it on benefit the health management of the eMMC?
4.x only, simply because I am the one who is going to implement it, and before that I have the migration from 3.x in the pipeline.
It depends on where you export it. In general it creates a single tar file of approximately 100 MB (depending on what you have installed on your system). If you want to protect the flash, you should export it to /tmp instead of /root. This is up to you.
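A sketch of the flash-friendly variant (the snapshot number and target path are placeholders, the exact schnapps export arguments may differ between TOS versions, and the command is guarded so the example degrades gracefully off-router):

```shell
#!/bin/sh
# /tmp is RAM-backed tmpfs on the router, so exporting there avoids the
# ~100 MB eMMC write that exporting into /root would cause.
if command -v schnapps >/dev/null 2>&1; then
  schnapps export 1 /tmp/      # snapshot number 1 is illustrative
  exported=yes
else
  echo "schnapps not present on this host (sketch only)"
  exported=no
fi
```

Remember that /tmp does not survive a reboot, so copy the exported tarball off the router before restarting.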
This is just an inode update; no huge write happens there.
This is once again only a write of circa 100 MB. BTRFS pretty much sets itself up with minimal overhead; it only writes its header. So every medkit is around 100 MB written, which means you can do roughly 24 thousand of them before you reach EOL.
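The arithmetic, for the record (both figures are the approximations given earlier in the thread, so treat the result as an order-of-magnitude bound, not a promise):

```shell
#!/bin/sh
# Naive upper bound on medkit reflashes before the quoted 2.4 TB TBW is hit.
# Ignores all other writes the system does in the meantime.
tbw_mb=2400000     # 2.4 TB endurance expressed in MB
medkit_mb=100      # approximate data written per medkit flash, in MB
echo "max medkit flashes: $(( tbw_mb / medkit_mb ))"
```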
As I wrote: to my knowledge. I went by some old datasheets, and the memory in production is probably newer, it seems. I am not the one handling the hardware side. The number is just to give you an approximation of what you can expect.
It is turned off because this is handled not by the hardware but by the FS driver of the OS. At least, that is what I think (not 100% sure).
Perhaps it is worth investigating and clarifying, and, if beneficial for the eMMC's health, turning it on. Thus far my reading on the subject indicates that BBM is not supported in BTRFS.
BTRFS makes the perfectly reasonable assumption that you’re not trying to use known bad hardware. It’s not alone in this respect either, pretty much every Linux filesystem makes the exact same assumption (and almost all non-Linux ones too), because it really is a perfectly reasonable assumption. The only exception is ext[234], but they only support it statically (you can set the bad block list at mkfs time, but not afterwards, and they don’t update it at runtime), and it’s a holdover from earlier filesystems which originated at a time when storage was sufficiently expensive and unreliable that you kept using disks until they were essentially completely dead.
Btrfs can initiate a check of the entire file system by triggering a file system scrub job that is performed in the background. The scrub job scans the entire file system for integrity and automatically attempts to report and repair any bad blocks it finds along the way. The file system only checks and repairs the portions of disks that are in use—this is much faster than scanning all the disks in a logical volume or storage pool.
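For reference, a scrub as described above can be kicked off like this (guarded, since it only makes sense where the root filesystem is actually btrfs, as it is on the Omnia):

```shell
#!/bin/sh
# Start a scrub of the root filesystem and report its status. A scrub only
# verifies checksums of allocated data/metadata, so it is safe to run live.
if command -v btrfs >/dev/null 2>&1 && btrfs filesystem df / >/dev/null 2>&1; then
  btrfs scrub start -B /   # -B: stay in foreground, print stats when done
  btrfs scrub status /     # or query progress of a background scrub
  scrubbed=yes
else
  echo "no btrfs root here (sketch only)"
  scrubbed=no
fi
```

Note that on a single-device filesystem a scrub can detect and report corruption but has no second copy of data to repair it from (metadata in DUP profile is the exception).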
I don't think it is, and I suspect you don't want to play with it. I haven't looked into it, but I suspect it handles the mapping of reserved sectors over failed ones. On Linux this is, I think, handled in software; I suspect the block layer or the mmc driver handles it (maybe in cooperation).
I think this option exists to allow usage with microcontrollers. Having reserved-block mapping handled in hardware eases integration with microcontrollers that have limited program memory and speed.
This is something different. They are comparing it to the ext bad-blocks exclusion list, which is not the same thing as mapping reserved blocks over bad blocks; it is bad-block avoidance at the FS level. The idea is to record which blocks are bad (with the badblocks command) and avoid using them in the FS. In the ext filesystems this was introduced to support storage media with manufacturing faults; the intended target media were diskettes. Those have no logic in them to manage failed blocks, so the FS did it instead.

The difference with current flash memories is that they now have such logic, and they also have reserved blocks. It is common for some blocks to be faulty from fabrication, which is why you can get new flash memories with something like 0-30% of reserved blocks already used. So the problem they are talking about simply is not there with new storage media: it is hidden and mapped over with reserved blocks.

The only situation in which you might want to use it is when you have no reserved blocks left. In that case you could map the bad blocks and avoid using them. The problem is the avalanche effect I already noted. Modern filesystems rotate writes over the whole storage so as not to wear out one specific location, which means that when one block goes, the rest are close behind. I suspect that avoiding bad blocks would buy you just a few hours of run time on modern storage before another block fails and the FS is corrupted, and even with dynamic avoidance I would expect only a few extra weeks of lifetime. I think that is not worth it, and that implementing it for BTRFS would be a waste of resources. It is a solution to an issue that is no longer valid: you simply should not use storage that is at the end of its life.