Jun 25 2019 11:19 AM
I discovered the new Storage Migration tool in Windows Server 2019, and we have an old 2003 server we needed to migrate. Both are member servers of the same domain.
I have run it multiple times, twice from scratch, also deltas.
It is a royal mess. The servers are on the same LAN, connected by 1 Gb Ethernet, and both are virtualized. It takes FOREVER to run, despite only having about 700 GB and around 500,000 files. That's OK for the initial copy, but the deltas need to go faster (the last one took 3 days), as otherwise we can never cut over. We need a way to get a final delta in an hour, or at most a few.
But that is not the real problem. The job says "Couldn't transfer some devices" and every share says "couldn't transfer some files". The D$ share is the only one that shows activity (probably appropriate as it is only that drive and doing other share names explicitly would be duplicates), and shows 96,645 failed files.
There appears to be no way I can find to gain an understanding of WHY they are not copied. Looking in the debug logs I see all sorts of errors like "An unexpected network error occurred" or "the specified network name is no longer available" or "the handle is invalid" or "access is denied".
I've done many hours of transfers with Goodsync and with Robocopy as a comparison, and not one network error. So if these are being caused, with Storage Migration, by network errors, I cannot see why.
I ASSUME it is using the backup API; we do indeed have some files that are protected against Domain Admin, but I am running it with an account that is both Domain Admin and explicitly Backup Operator (though I think that is assumed as an administrator). Many if not most of the files I spot checked, that gave errors, ARE on the new server, and look correct.
But missing almost 100,000 files is a real issue.
I installed the latest version (1.25.0 the first time I think, later installed 1.42.0) which purports to have better error messages, and reran (seemed even slower) but saw no more useful messages. When I look at files that gave errors, nothing jumps out at me. I can copy them manually for example, so there's not a disk corruption issue.
I don't really know where to start to get a handle on the problem.
I did set it up to use CRC64 for checksum validation, and did not set it up to backup the folders to be overwritten, and set 10080 minutes duration, 3 retries and 60 seconds between.
Most files I see in the debug log as failures are actually present (yet I did not see a debug entry for a successful copy in the few I dug deeper for). Some files not present are on the source protected against admin access, BUT it should be using the backup API, right? But other files similarly protected are moved.
There seems no rhyme or reason to what works and does not, but 100,000 or so failures is too much to deal with individually. I'm not quite sure where to go.
And robocopy, just tried a bit ago on a subset, works perfectly, not one error. But I would rather use Storage Migration as it moves the shares themselves. I may have to revert to old school, though.
Any advice? What sorts of things could I do to try to debug the failures? The debug log shows LOTS of failures, but no real cause of them. I can go copy the same file, manually, that shows a failure. And again, running robocopy (with 8 threads) does not show a single error (on a subset).
Thanks in advance,
Jun 26 2019 07:40 AM
I ran this again on a more limited set of 24,205 files, and it failed on about 60. I grabbed one and looked in depth without finding anything wrong. I can copy the same file pulled from the destination system with just a "copy" statement, so it's not being blocked by anti-virus or disk corruption.
I tracked it down in the event log and the error is simply:
(64) The specified network name is no longer available.
Bear in mind that every test I have done with other tools, like robocopy or goodsync, has succeeded without error. If there are network issues they are not showing up in other tools.
But we are running 12.1.1101.401 RU1 MP1, so it's much later. It is running only on the source, not the destination. I cannot turn it off due to policy (this is a production site in 24x7 use). And again, no other tools are hitting that error.
I am running it again to see if it fails on the same files, but it takes an astoundingly long time. These folders took about an hour with robocopy (I did not time it precisely), but took Storage Migration about 10 hours. And I ran SMS in a relatively idle time, and robocopy during prime business hours.
And yes, I am running SMS on Windows Server 2019, fresh install, fully patched, with proxy server, with the SMS component updated to the latest version.
Any ideas? Is it really this slow? Is it really this unreliable?
Jun 26 2019 07:41 AM
I should have included the full error text:
"06/26/2019-02:17:41.225 [Erro] Transfer error for \\***pathRedacted***\PTWin32\Archive-032206\Data\Acctcode.px: (64) The specified network name is no longer available.
at Microsoft.StorageMigration.Proxy.Service.Transfer.FileDirUtils.GetTargetFile(String path)
at Microsoft.StorageMigration.Proxy.Service.Transfer.FileDirUtils.GetTargetFile(FileInfo file)
at Microsoft.StorageMigration.Proxy.Service.Transfer.FileTransfer.TryTransfer() [d:\os\src\base\dms\proxy\transfer\transferproxy\FileTransfer.cs::TryTransfer::55]"
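With tens of thousands of entries like this, it may help to group them by error code before looking at individual files. The following is only a rough sketch in Python, assuming every failure in the debug log has the same shape as the line quoted above (`[Erro] Transfer error for <path>: (<code>) <message>`); the function name is made up for illustration and is not part of SMS.

```python
import re
from collections import Counter

# Assumed line shape, taken from the single example quoted in this thread:
# "<timestamp> [Erro] Transfer error for <path>: (<code>) <message>"
ERROR_LINE = re.compile(
    r"\[Erro\] Transfer error for (?P<path>.+): \((?P<code>\d+)\) (?P<message>.+)$"
)


def summarize_transfer_errors(lines):
    """Count transfer errors by (code, message) and collect the failing paths."""
    counts = Counter()
    paths = {}
    for line in lines:
        match = ERROR_LINE.search(line.rstrip())
        if match is None:
            continue  # stack-trace lines and other log entries are skipped
        key = (int(match["code"]), match["message"])
        counts[key] += 1
        paths.setdefault(key, []).append(match["path"])
    return counts, paths
```

Feeding it the lines of a saved debug log and printing `counts.most_common()` would show at a glance whether the failures are dominated by error 64 or spread across "access is denied" and the others.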
Jul 18 2019 01:43 PM
Jul 22 2019 04:50 PM
@Ned Pyle thanks. We gave up and moved to robocopy.
I had tried it again without CRC without significant impact on performance.
We had the OS fully patched in late June, so I assume it had the April update, unless you are saying that update is not in the normal stream of Windows updates.
For us, in our situation, for whatever reason it was simply not usable -- too many files disappeared, and there was no good way to account for them and ensure we could find and fix all the issues. It was too risky to use the tool the way it works, notably the way it reports errors. There should be a consolidated summary (and an as-of summary after reruns as well) that shows open issues: files not synced and why. You shouldn't have to spend hours trying to find them in event logs, nor should you have to resort (as I did) to other tools to produce complete directory listings and checksums and then diff the old and new drives.
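For anyone wanting to do the same kind of after-the-fact check, here is a minimal Python sketch of the idea: hash every file under both roots and report what is missing or different on the destination. It is not the tool I actually used, and the function names are invented for the example. It assumes both trees are readable by the account running it (a file it cannot open will raise), and it compares relative paths case-sensitively.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root):
    """Map each file's path relative to root to its content digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(source_root, dest_root):
    """Return (missing, different) relative paths, judged from the source side."""
    source = index_tree(source_root)
    dest = index_tree(dest_root)
    missing = sorted(set(source) - set(dest))
    different = sorted(
        rel for rel in source.keys() & dest.keys() if source[rel] != dest[rel]
    )
    return missing, different
```

Hashing every file on both sides reads all the data twice, so on a tree of this size it is slow; comparing size and timestamp first and hashing only the mismatches would be a reasonable shortcut.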
The risk of data loss given its poor performance, combined with poor error reporting, was just too great.
I hope there's a version 2 that would work - robocopy and checksum tools are not great tools. But they work. The simplicity of a log file that only shows errors (and very few of them, like locked files) is reassuring.
Jul 22 2019 05:00 PM
@Linwood Thanks. In the future, please do not give up and move to robocopy - open a support case so we can investigate. If there's a bug, the case is free. Otherwise, nothing will improve (robocopy spent 20 years being improved through support cases :) ). We have had tens of thousands of migrations, moved 10PB of data, and no one has reported the exact symptoms you are reporting, so we'd like to understand what happened here.
That turning off CRC didn't help performance means there was something very wrong going on in this migration; the difference will always be dramatic. Regarding the logs: did you look at the CSV logs that you can download after transfers and find them inadequate? You shouldn't need to look at event logs; we have transfer logs for this reason, both complete and error-only.
Jul 22 2019 05:40 PM
In the future I will consider it, but opening a support case with Microsoft has historically been extremely frustrating if you are not a huge corporation with good contacts. I haven't opened a case in many years, and not for cost reasons -- it is just too awful an experience. Besides, the recommendation in the FAQ has that as the last option. It says:
To get support:
So I picked the first option.
Anyway, I logged into the client to check a few things. 2019-06 CU had been installed, which should have included it. To make sure I downloaded the individual update and ran it, it says "is not applicable to your computer". So I presume we are current on that.
My memory is fading on the CSV files. I went back to look but had deleted the migration job when we gave up. So I can't comment on why I didn't use them -- didn't try, didn't work, couldn't make sense, stupidity, ignorance? I will tell you I spent HOURS on log files and just ended up confused.
I posted one of the errors above, the network name not found -- so what is that? Ok, I retract that question, I understand the error in principle, but why does Robocopy never fail with it, and LOTS of errors in storage migration? What is it doing differently that causes so many more failures on Storage Migration?
I really wish I could run a delta and give you more details, but we have things staged and keeping it updated with robocopy pending them getting an application ready to move, and if I start over just getting the first run with migration will take days, probably most of the week, and I don't think I can afford that.
I've got another 2019 server on that LAN staged for exchange that is held up for unrelated reasons, let me see, maybe I could pick a tree that had problems and just try migrating part of the drive.
More later if I can.
Jul 22 2019 05:47 PM
Jul 22 2019 05:53 PM
@Linwood Hi. Yep, I wrote that article. It's just ordered by likelihood of needing to pay, so I start with free options that don't often require logs and a case. If you open a support case for anything I own - Storage Migration Service, Storage Replica, SMB, DFSR, etc. and aren't getting anywhere, pinging me here or emailing me at firstname.lastname@example.org will always get a reaction of my boot in someone's butt, I promise. :)
Yes, you have that fix.
I've found some folks simply don't find the logs (they pop out of the Details menu after transfers are done, which I don't consider super intuitive, and I am making some changes for easier discovery). So this is useful feedback for sure; if you don't remember, we did a bad job of making it visible.
The error is possibly a bug in SMS error handling, since as you point out, Robocopy doesn't see it (or handles it more robustly). We've found a number of conditions in customer networks and servers that cause that error, so we've started adding retry code in SMS to get around it - originally we'd just quit and move on, meaning that SMS started exposing a long existing underlying problem but didn't try to deal with it. But we've also found at least one straight up bug there in our code previously for certain conditions, so there's room for more to be found.
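To make the "retry instead of quit and move on" idea concrete, here is a generic Python sketch. It is purely illustrative and is not the SMS code: the function name and the `transfer` callable are hypothetical, and the defaults simply mirror the 3 retries and 60 seconds mentioned earlier in this thread. It treats only Win32 error 64 as transient and lets everything else, such as access denied, fail immediately.

```python
import time

# Win32 error 64: "The specified network name is no longer available."
ERROR_NETNAME_DELETED = 64


def transfer_with_retry(transfer, attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call transfer(); on error 64, wait and try again, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return transfer()
        except OSError as error:
            # On Windows, OSError carries the Win32 code in .winerror.
            transient = getattr(error, "winerror", None) == ERROR_NETNAME_DELETED
            if not transient or attempt == attempts:
                raise  # not a transient network error, or retries exhausted
            sleep(delay_seconds)
```

The `sleep` parameter is there only so the wait can be swapped out when experimenting; a real copy engine would also need to decide whether a partially written destination file must be cleaned up before the next attempt.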
If you can grab logs with https://aka.ms/smslogs when you see the issue on your next server and open a case, then say "Ned Pyle told me he wanted to see your logs and maybe debug my server", we can short circuit the usual bureaucracy and figure out what's going on here. I can personally have one of my supportability PMs non-decrement the case if we find a bug.
Jul 22 2019 05:54 PM
@Linwood No worries. I can hold out for someone else to see this conversation, I hope :)
Nov 10 2019 07:35 AM
@Ned Pyle I'm having very similar problems.
My source server also stores the DFS root folder. Will DFS operations be affected after the cutover?