eMMC is broken?

Qwiz · September 7, 2018, 6:11pm

Hi all!

The filesystem on my Omnia got corrupted and now the router doesn’t boot. Restoring with medkit didn’t help.

I connected to UART and saw a message: ERROR: Did not find a cmdline Flattened Device Tree

In the end I found out there is something weird with the eMMC.
I booted into the rescue shell and tried to recreate the partition on /dev/mmcblk0, using the commands from the rescue.sh script.

So, recreating partitions with fdisk fails with an error message:
/dev/mmcblk0: close device failed: I/O error

It seems the eMMC has degraded too much and is no longer usable, but I can’t prove this.

Is it possible to get the router working again, or is it completely broken and I need to replace the eMMC?

Nones · September 7, 2018, 7:50pm

First … there is an important question.
And the question is: “Do you use the LXC containers or NextCloud application on your Omnia router?”

Qwiz · September 7, 2018, 7:59pm

No, I didn’t.
I tried creating some and playing around with them, but never kept them running for long.

jiberjaber · September 8, 2018, 12:03pm

I believe I have the same issue here. I have my containers and RRD on an external 1 TB drive.

The system started to degrade (some networking fell over), so I decided to reboot the router; since then, nothing. Prior to the reboot, syslog showed errors similar to:
“BTRFS warning (device mmcblk0p1): csum failed ino 1144950 off 942080 csum 1102237448 expected csum 3388852613”
I initially tried a rollback to the previous snapshot, then a rollback to factory, and finally medkit. I’ve connected to the serial port, installed a TFTP server locally and followed the instructions to network-boot Debian to see if I can recover the mmc filesystem, but nothing.

From the serial output I can see two issues. One is the following, and I see it at every reset/reboot in the first few lines:
“SF: Detected S25FL164K with page size 256 Bytes, erase size 64 KiB, total 8 MiB
*** Warning - bad CRC, using default environment”

And secondly, when I try the medkit, it is detected fine and repartitions the mmc, but then fails to mount, resulting in a reboot.

Terminal capture here: https://pastebin.com/gv7YT0dS

I have managed to resuscitate an old router to get me back on line but this is quite disappointing…

Pepe · September 8, 2018, 2:14pm

Hello guys,

I’m not aware of any eMMC failures that weren’t caused by LXC containers, and I’d like to explain what happened and why. If you have or had any LXC containers, I need to explain why running them on the internal storage is a really bad idea.

Common GNU/Linux distributions in LXC containers are not designed with a router in mind: their logs, and potentially their databases, write to storage at a very high frequency. That’s why we keep system logs in RAM. This is covered in our documentation: first see Error/bug reporting and then LXC containers.

Keep in mind that LXC containers are not enabled by default. Using them requires at least some knowledge of how to install one of the available Linux distribution images as an LXC container and how to use it.

Why did it happen?
The internal storage in the router is eMMC, which is flash memory, the same kind used in micro SD cards, USB flash drives and so on. My point is that all of them have a limited number of write cycles and are not designed for excessive amounts of writes, which wear them out; it is just a matter of time. The advantage of eMMC, as I see it, is that it is more reliable and faster than the devices I mentioned.

From both serial console outputs, I can see that both eMMCs are dead, which means all of your data is gone and we can’t recover it. The eMMC is soldered to the board and is not easy to replace; swapping it is almost impossible without expensive equipment. In our case, hopefully, the repair will be done by a third-party company, and it will be a paid repair, because we can’t cover this under warranty as it’s not the manufacturer’s fault.

If you’d like to have LXC containers on your router, that is completely fine, but you’ll need external removable storage. Even a USB drive would be OK for that, because they’re cheap and very easily replaceable, just plug and play. Since Turris OS 3.10 you can use the Storage plugin in Foris to avoid any misconfiguration.

If you’d like to get your router working ASAP, you can boot from an mSATA SSD. The mSATA SSD should be inserted into the rightmost slot, next to the heatsink. Our documentation has more details on how to boot from an mSATA SSD. In the future, we’d like to offer the option to boot from a USB stick.

The most recent warning I can think of right now was in the release notes for Turris OS 3.10, which stated that this situation can happen. The release notes are available in our documentation, where you can also find the Errata, a list of known bugs.

Once you created LXC containers in the CLI or GUI, you received a notification in Foris; if you have configured Foris to send notifications to your email address, you can find it in your email as well. This warning has been there for a long time, and it is described in the documentation for LXC containers. For the next version, Turris OS 3.10.6, which we’d like to release soon, we implemented more prominent warnings that are shown when the system detects LXC containers running on eMMC and that tell you to use e.g. the Storage plugin.

Qwiz · September 8, 2018, 3:23pm

Thanks for the answer.

I had some containers, but I hadn’t used them for a long time and they had been shut down for about a year. So I don’t think they caused the eMMC failure.

Btw, replacing a dead eMMC is not a big problem for me. It may even be cheaper than buying an mSATA SSD.
Finding exactly the same chip is much harder. So is it possible to use flash with a different capacity instead of the stock 8 GB? Maybe there are similar chips with nearly the same specifications that I can use as a replacement?

jiberjaber · September 8, 2018, 5:55pm

So to be clear, even though my containers were stored on an external drive, they still caused the eMMC failure?

anon50890781 · April 22, 2020, 11:05am

Some basic info on the eMMC health could be gained with (not sure if the app is actually included in the mtd’s rescue image, probably not though)

There has also been mention of

But I am not sure whether that ever materialised.

The U-Boot console provides some limited (e)MMC commands:

=> mmc
mmc - MMC sub system

Usage:
mmc info - display info of the current MMC device
mmc read addr blk# cnt
mmc write addr blk# cnt
mmc erase blk# cnt
mmc rescan
mmc part - lists available partition on current mmc device
mmc dev [dev] [part] - show or set current mmc device [partition]
mmc list - lists available devices
mmc hwpartition [args...] - does hardware partitioning
  arguments (sizes in 512-byte blocks):
    [user [enh start cnt] [wrrel {on|off}]] - sets user data area attributes
    [gp1|gp2|gp3|gp4 cnt [enh] [wrrel {on|off}]] - general purpose partition
    [check|set|complete] - mode, complete set partitioning completed
  WARNING: Partitioning is a write-once setting once it is set to complete.
  Power cycling is required to initialize partitions after set to complete.
mmc setdsr <value> - set DSR register value

From what has been posted, the eMMC does not seem to exhibit the symptoms of exhaustion compared to what has been reported elsewhere, e.g.

It seems more like the partition table is suffering some issue, after

Those errors might have been useful in troubleshooting

peci1 · April 22, 2020, 2:55pm

Thanks for the tips. I only started playing with the partition table after several failed attempts at the 4-LED reflash.

This is what I see when I try to recreate the partition table:

/ # fdisk /dev/mmcblk0
Command (m for help): p
Disk /dev/mmcblk0: 7.3 GiB, 7818182656 bytes, 15269888 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbf9fe0e9

Device         Boot Start      End  Sectors  Size Id Type
/dev/mmcblk0p1       2048 15269887 15267840  7.3G 83 Linux


Command (m for help): o
[   47.988028] random: fdisk urandom read with 12 bits of entropy available
Created a new DOS disklabel with disk identifier 0x79b79083.

Command (m for help): p
Disk /dev/mmcblk0: 7.3 GiB, 7818182656 bytes, 15269888 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x79b79083



Command (m for help): w
[   56.441683]  mmcblk0: p1
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

/ # fdisk /dev/mmcblk0

Welcome to fdisk (util-linux 2.25.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): p
Disk /dev/mmcblk0: 7.3 GiB, 7818182656 bytes, 15269888 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbf9fe0e9

Device         Boot Start      End  Sectors  Size Id Type
/dev/mmcblk0p1       2048 15269887 15267840  7.3G 83 Linux

It still seems I just can’t get rid of the partition…

peci1 · April 22, 2020, 2:59pm

Another attempt:

/ # dd if=/dev/zero of=/dev/mmcblk0 bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512B) copied, 0.000984 seconds, 508.1KB/s
/ # fdisk /dev/mmcblk0

Welcome to fdisk (util-linux 2.25.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): x

Expert command (m for help): d

First sector: offset = 0, size = 512 bytes.
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
000001b0  00 00 00 00 00 00 00 00  e9 e0 9f bf 00 00 00 00
000001c0  01 20 83 03 d0 ff 00 08  00 00 00 f8 e8 00 00 00
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa
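
If the dd above had reached the chip, the first sector should read back as all zeros. A minimal Python sketch, assuming the standard MBR layout (disk identifier at offset 0x1b8, first partition entry at 0x1be), decodes the non-zero bytes of the dump; the function and constant names are only illustrative:

```python
import struct

# Bytes 0x1b0-0x1cf of the first sector, copied from the dump above.
DUMP = (
    "00 00 00 00 00 00 00 00  e9 e0 9f bf 00 00 00 00 "
    "01 20 83 03 d0 ff 00 08  00 00 00 f8 e8 00 00 00"
)
BASE = 0x1B0  # sector offset of the first byte in DUMP


def decode_mbr_fragment(hex_text, base=BASE):
    """Return (disk_id, partition_type, start_lba, sector_count)."""
    raw = bytes.fromhex(hex_text)
    # Disk identifier: 4 bytes, little-endian, at offset 0x1b8.
    (disk_id,) = struct.unpack_from("<I", raw, 0x1B8 - base)
    # Partition entry 1 at 0x1be: status (1 byte), CHS start (3),
    # type (1), CHS end (3), start LBA (4), sector count (4).
    entry = 0x1BE - base
    part_type = raw[entry + 4]
    start_lba, sector_count = struct.unpack_from("<II", raw, entry + 8)
    return disk_id, part_type, start_lba, sector_count


disk_id, part_type, start_lba, sector_count = decode_mbr_fragment(DUMP)
print(hex(disk_id), hex(part_type), start_lba, sector_count)
```

The decoded values (identifier 0xbf9fe0e9, type 0x83, start 2048, 15267840 sectors) are the same ones fdisk printed before the table was rewritten, so neither the new disklabel nor the zeroed sector appears to have been stored.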

peci1 · April 22, 2020, 3:28pm

I retrieved the logs from an HDD where I also stored them from the dying router. BTRFS started exhibiting some errors a few days before the router died for good. And I didn’t see any reason for that in the logs, no suspicious process or logline or whatever… Just out of nothing.

03:00:01 info kernel[]: [6042129.748264] BTRFS info (device mmcblk0p1): quota is enabled
03:00:41 warning kernel[]: [6042169.392647] csum_tree_block: 15 callbacks suppressed
03:00:41 warning kernel[]: [6042169.392664] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 9D38AFE9 level 0
03:00:41 warning kernel[]: [6042169.396643] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 464E0DD2 level 0
03:00:41 err kernel[]: [6042169.396671] BTRFS error (device mmcblk0p1): error loading props for ino 103526 (root 281): -5
03:00:41 warning kernel[]: [6042169.409603] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 16BCA50E level 0
03:00:41 warning kernel[]: [6042169.413832] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 211E9922 level 0
03:00:41 warning kernel[]: [6042169.417829] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 8864B62D level 0
03:00:41 warning kernel[]: [6042169.421765] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found DDDF6C22 level 0
03:00:41 warning kernel[]: [6042169.425788] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 108E4AB2 level 0
03:00:41 warning kernel[]: [6042169.429787] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 2CA0142C level 0
03:00:41 warning kernel[]: [6042169.433808] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found D12EE038 level 0
03:00:41 warning kernel[]: [6042169.437832] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found F404A49 level 0
03:02:07 warning kernel[]: [6042255.560674] csum_tree_block: 11 callbacks suppressed
03:02:07 warning kernel[]: [6042255.560686] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 741B5F8D level 0
03:02:07 err kernel[]: [6042255.560704] BTRFS error (device mmcblk0p1): error loading props for ino 103526 (root 320): -5
03:02:07 warning kernel[]: [6042255.573839] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found D6858EED level 0
03:02:07 warning kernel[]: [6042255.577848] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 9419CE28 level 0
03:02:07 warning kernel[]: [6042255.581746] BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 8E96FCA level 0
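
One detail in these lines is easier to see when they are grouped by block: the "wanted" value stays the same while the "found" value changes on every attempt. A small Python sketch that collects the distinct "found" values per block; the regular expression and function name are illustrative and only match the message format shown above:

```python
import re
from collections import defaultdict

# Matches the "checksum verify failed" warnings in the format shown above.
PATTERN = re.compile(
    r"checksum verify failed on (\d+) wanted ([0-9A-F]+) found ([0-9A-F]+)"
)


def found_checksums_by_block(lines):
    """Map each failing block address to the set of 'found' checksums."""
    seen = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            block, _wanted, found = match.groups()
            seen[int(block)].add(found)
    return dict(seen)


# Three of the warnings from the log above.
SAMPLE = [
    "BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 9D38AFE9 level 0",
    "BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 464E0DD2 level 0",
    "BTRFS warning (device mmcblk0p1): mmcblk0p1 checksum verify failed on 2056929280 wanted 67B2369E found 16BCA50E level 0",
]

for block, found in found_checksums_by_block(SAMPLE).items():
    print(block, len(found), "distinct 'found' values")
```

A single block yielding a different checksum on each read suggests that the data coming back is not stable between reads, rather than one fixed bad copy on disk. On its own this does not show whether the chip, the controller or the driver is responsible.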

anon50890781 · April 22, 2020, 3:28pm

Since there are no I/O errors, it seems (optimistically) that the eMMC is not worn.

with x you moved into the expert mode and therein d means

print the raw data of the first sector from the device

whilst in standard mode d means

delete a partition

fdisk /dev/mmcblk0
d (that should print)

Selected partition 1
Partition 1 has been deleted.

w
reboot
try 4-led medkit

If I’m not mistaken, this is (likely) a BTRFS issue rather than the eMMC being exhausted (bad blocks). BTRFS in TOS 3.x (kernel 4.4) is rather immature, and the code has undergone significant development in more recent kernel versions (which may not have been backported to kernel 4.4).

peci1 · April 22, 2020, 8:39pm

I think this might “prove” that there is really a problem with the memory chip:

/ # mkfs.btrfs -f /dev/mmcblk0p1
btrfs-progs v4.5.1
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want to force metadata duplication.
Performing full device TRIM (7.28GiB) ...
Warning, could not drop caches
Warning, could not drop caches
[ 1470.178202] BTRFS: device fsid 19dc81d1-5bae-4808-9c59-0315776a9a19 devid 1 transid 3 /dev/mmcblk0p1
Label:              (null)
UUID:               19dc81d1-5bae-4808-9c59-0315776a9a19
Node size:          16384
Sector size:        4096
Filesystem size:    7.28GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     7.28GiB  /dev/mmcblk0p1

Warning, could not drop caches

The “could not drop caches” message means that fsync failed: https://patchwork.kernel.org/patch/9603285/
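
To tell a failing fsync apart from writes that are silently dropped, the two cases can be checked separately. The following is only a sketch in Python, so it assumes a system where Python is available (the rescue shell most likely is not one). The function name is hypothetical, the call overwrites data at the given offset, and on a real block device the read-back may be served from the page cache unless caches are dropped or the device is reopened first:

```python
import os


def write_and_verify(path, offset=0, size=512):
    """Write a random pattern, force it to storage, then read it back.

    Returns "ok", "fsync failed: <reason>" or "readback mismatch".
    DESTRUCTIVE: overwrites `size` bytes at `offset` in `path`.
    """
    pattern = os.urandom(size)
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, pattern, offset)
        try:
            # fsync raises OSError (e.g. EIO) if the device rejects the write.
            os.fsync(fd)
        except OSError as exc:
            return f"fsync failed: {exc.strerror}"
        # Caveat: this read may hit the page cache rather than the device.
        data = os.pread(fd, size, offset)
    finally:
        os.close(fd)
    return "ok" if data == pattern else "readback mismatch"
```

On a scratch file, `write_and_verify("/tmp/scratch.img")` should return "ok"; pointing it at /dev/mmcblk0 would destroy whatever is stored at that offset.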