Troubleshooting DRS in Configmgr

In this article, I will attempt to peel back some of the mystery around replication in Configmgr and provide some additional help for resolving replication issues, because they will occur.

The Basics of Replication in Configmgr

Configmgr underwent a shift several years ago as Microsoft tried to fix the backlog issue that customers often encountered with file-based replication.  The hierarchy architecture was flattened and replication moved from files to SQL, for the most part at least.  This is well covered on TechNet, so I will spare you the details.

There are some additional basics of replication that should be stated for completeness of the topic.

Configmgr does not use the built-in SQL replication; it uses components of SQL and manages the replication of data on its own.  The easiest way I can state this is that the product team coded their own replication into the product rather than relying on the built-in features of Microsoft SQL Replication.

There are two main types of data replicated between sites: global data and site data.  Global data is replicated from the CAS downward, while site data flows the opposite way, up to the CAS.

Global data includes data such as collection rules.

Site data includes things like collection results and client data.

Each data type contains several replication groups that logically group data from different tables together.  These replication groups are classified as global and site replication groups.

select * from vReplicationData is a query to display all replication groups.

If you want to see what replication groups are included as part of global or site data, simply modify the previous query by specifying the type of data, similar to these.

select * from vReplicationData where Replicationpattern = 'global'

select * from vReplicationData where Replicationpattern = 'site'

Want to see what data is part of a replication group from the output of the previous two queries?  The first column of the output should be ID.  Pick the ID of the replication group you want to investigate and use the following query, replacing 30 with that ID.

select * from vArticleData where ReplicationID = 30

ID 1 – 14 is global data, 15 is cloud data, 16 – 30 is for site data, 31 – 33 is for secondary site replication data, beyond 33 is typically more site data replication groups.

[Screenshots: RepGroups-Global and Cloud 1-15; RepGroups-Sites 17-30; RepGroups-SecondarySites; RepGroups-34 plus]
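The ID ranges above can be captured in a small helper when you are scanning query output by hand.  This is only a sketch in Python: the function name is hypothetical, and the boundaries are simply the ones quoted above, which may differ in your install.

```python
def classify_replication_group(replication_id: int) -> str:
    """Map a replication group ID to the data type quoted in the article.

    The ranges are taken from the text above and are an assumption for
    any given site; always confirm against vReplicationData.
    """
    if replication_id < 1:
        raise ValueError("replication group IDs start at 1")
    if replication_id <= 14:
        return "global"
    if replication_id == 15:
        return "cloud"
    if replication_id <= 30:
        return "site"
    if replication_id <= 33:
        return "secondary site"
    # Beyond 33 is typically more site data replication groups.
    return "site (additional)"


if __name__ == "__main__":
    for rid in (1, 15, 30, 32, 40):
        print(rid, classify_replication_group(rid))
```

Treat the result as a hint only; the Replicationpattern column from the earlier queries is the real source of truth.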

Since we are almost to the real meat of what data is being sent, why not look at which tables are included in a group?  And what better to look at than HWINV data?  This query returns the list of tables that are replicated as part of any replication group whose name starts with Hardware_Inventory.

select ArticleName from vArticleData where ReplicationID in (select ID from vReplicationData where ReplicationGroup like 'Hardware_Inventory%')

UPDATE: A friend was gracious enough to send me an email regarding the previous query and a better way to see what data is in a replication group.  I recommend using his query below.  He is easily one of the foremost experts on DRS and how to troubleshoot replication.  He and his team have one of the largest implementations of SCCM globally and it is magnificently run.  From an efficiency standpoint, his query is much better and more elegant.

“HINV replication groups vary by install and by how much you extend the mof.  I don’t think any 2 groups look alike or contain the same data.

To see what data is in each I use this:”

SELECT Rep.ReplicationGroup, App.ArticleName, App.ReplicationID
FROM vArticleData AS App
INNER JOIN vReplicationData AS Rep ON App.ReplicationID = Rep.ID
ORDER BY Rep.ReplicationGroup, App.ArticleName


Another important aspect of DRS is the SQL Service Broker (SSB), which handles incoming and outgoing messages and guarantees their delivery by storing them in asynchronous queues.  If SSB cannot function, the flow of replication data into and out of SQL will stop.

Site Data Processing – An Example

When a client runs its scheduled hardware inventory, it stores the output of the WMI data in an XML file.  The client then copies that XML file up to its management point (MP).  Assuming the MP is not a primary site server, the MP message handler processes the client's XML file and converts it into a MIF file.  The MP File Dispatch Manager then uploads the MIF file into the dataldr inbox folder on the client's primary site server.  Finally, the data loader component reads the MIF file and inserts the data into the SQL database.

Technically, there are more than two types of data being replicated, and technically Configmgr can also use the built-in SQL replication.

Additional info from TechNet:

Plan for Database Replication Thresholds

How to Monitor Database Replication Links and Replication Status

Procedures for Monitoring Database Replication

Advanced Troubleshooting of Replication in Configmgr

When the Data Replication Service (DRS) stops working it can be a nightmare.  If replication breaks and you cannot fix it within a certain amount of time, you will lose data, and if you have to re-initialize your replication it will generate gigs and gigs of network traffic as you replicate all that data to each site server, again.  Below are some additional tips to help when troubleshooting DRS when the Replication Link Analyzer (RLA) doesn't do the trick.

Running RLA in a script or from the cmd prompt:

"%path%\Microsoft Configuration Manager\AdminConsole\bin\Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.Wizard.exe" <source site server FQDN> <destination site server FQDN>
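If you want to launch RLA from a script, it helps to build the command as an argument list so the spaces in the install path do not need manual quoting.  The sketch below is an illustration in Python; the function name, the install root, and the host names are hypothetical, and only the folder layout and executable name come from the command above.

```python
from pathlib import PureWindowsPath
from typing import List

# Executable name as given in the command above.
RLA_EXE = "Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.Wizard.exe"


def build_rla_command(install_root: str, source_fqdn: str, dest_fqdn: str) -> List[str]:
    """Return the RLA command as an argument list.

    install_root stands in for the %path% placeholder; it is whatever
    folder contains "Microsoft Configuration Manager" on your server.
    """
    exe = PureWindowsPath(
        install_root, "Microsoft Configuration Manager", "AdminConsole", "bin", RLA_EXE
    )
    return [str(exe), source_fqdn, dest_fqdn]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_rla_command(r"C:\Program Files", "cas.example.com", "pri.example.com"))
```

On a machine where RLA is installed, the returned list can be handed to subprocess.run as-is.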

From TechNet: About the Replication Link Analyzer

 If RLA fails any remediation actions while it is running, the log files contain more detail than the XML file.  Also, ensure that it was able to restart the SMS_SITE_COMPONENT_MANAGER and SMS_EXECUTIVE services.

Verbose Logging

The first step is to get more information from the log files.  This is accomplished the same way as for all other logs in CM: by changing the logging level through registry keys.  There are two different log types we are going to use.  The first is your standard text-based log files that you typically view with Notepad, Splunk, or CMTrace.  The second is viewed in SQL, because some components of Configmgr are SQL managed and their logging is stored in SQL tables rather than in .log files like the majority of Configmgr's components.  You can add this to the long list of reasons why SQL should be ON BOX and CM admins should have sysadmin forever on their SQL instance(s).  Moving on.

Replication Configuration Monitor Log

Rcmctrl.log – Replication Configuration Monitoring (RCM) log file that shows an overview of sync status, site status and stored procs used.

Rcmctrl Regkey:  HKEY_LOCAL_MACHINE\Software\Microsoft\SMS\Components\SMS_REPLICATION_CONFIGURATION_MONITOR\Verbose logging

Default is Value 0

DWORD Value 0 = Errors and key messages
DWORD Value 1 = Errors, key messages, and more general information*
DWORD Value 2 = Everything (Verbose)*

*Make sure you return this back to 0 after you have resolved your DRS issues.
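Because it is easy to forget to set the value back, it can help to generate both the "turn on" and "turn off" commands at the same time.  The sketch below is in Python and builds a reg add command line.  It assumes that "Verbose logging" is a DWORD value under the SMS_REPLICATION_CONFIGURATION_MONITOR key, which is how I read the path above; the function name is hypothetical.

```python
# Key path from the article; treating "Verbose logging" as the value name
# under this key is an assumption.
RCM_KEY = r"HKLM\Software\Microsoft\SMS\Components\SMS_REPLICATION_CONFIGURATION_MONITOR"

LEVELS = {
    0: "Errors and key messages (default)",
    1: "Errors, key messages, and more general information",
    2: "Everything (verbose)",
}


def reg_add_verbose_logging(level: int) -> str:
    """Return a reg add command that sets the Rcmctrl logging level."""
    if level not in LEVELS:
        raise ValueError("level must be 0, 1, or 2")
    return f'reg add "{RCM_KEY}" /v "Verbose logging" /t REG_DWORD /d {level} /f'


if __name__ == "__main__":
    print(reg_add_verbose_logging(2))  # while troubleshooting
    print(reg_add_verbose_logging(0))  # when you are done
```

Run the printed commands from an elevated prompt on the site server, and keep the level 0 command handy for when the issue is resolved.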

By default, the two replication groups that record messages to the Rcmctrl.log are Site Control Data and Configuration Data.  If you see errors with another replication group, you can include it in the logging by adding it to the following registry key.


*Make sure you return this back to the default after you have resolved your DRS issues. The default is: Site Control Data,Configuration Data

SQL Components Logging

vLogs (stored in SQL views) – Because some components of RCM are running in SQL Server hosted managed code, each of these components is provided a table in CM SQL db to record log messages to.  Logging is recorded in: vDrsReceivedMessages, vDrsReceivedHistory, vDrsSendHistory, vDrsSentMessages, vDrsSyncHistory, and vDrsSyncMessages


Default is Value 1

DWORD Value 0 = Errors and critical information only
DWORD Value 1 = Errors, critical information, warnings, and general information
DWORD Value 2 = Everything (Verbose)*

*Make sure you return this back to 1 after you have resolved your DRS issues.

Now that additional details are being logged here are two queries to run that will display details of the vLogs information.  If you cannot determine the source of the problem using this information I have listed below some additional queries and troubleshooting information.

Query the SQL vLog

Now we are ready to query the vLog and get extra details.  A word of caution first: I recommend that you use the first query, which only returns the most recent 1000 messages.  By running the query without a limit, you risk making things worse by adding pressure to your SQL db when it may already be in a degraded state and using the maximum amount of resources it has available.

This query returns the last thousand messages which have been logged, ordered by the time they were written into the database.

select Top 1000 * from vLogs order by LogTime desc

This query does not limit the messages returned and may cause your server to fall over!  This is your second and final warning.

select * from vLogs where LogTime > GETDATE()-1 and ProcedureName <> 'spDRSSendChangesForGroup' ORDER BY LogTime DESC
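A middle ground between the two queries is to keep the filter from the second one but cap the row count like the first.  The Python sketch below only builds the query text; the function and parameter names are hypothetical, and the column names are the ones used in the queries above.

```python
def build_vlogs_query(top: int = 1000,
                      hours_back: int = 24,
                      exclude_procedure: str = "spDRSSendChangesForGroup") -> str:
    """Build a capped, filtered vLogs query as a T-SQL string."""
    if top < 1 or hours_back < 1:
        raise ValueError("top and hours_back must be positive")
    # Only allow simple identifiers so the name can be embedded safely.
    if not exclude_procedure.replace("_", "").isalnum():
        raise ValueError("unexpected characters in procedure name")
    return (
        f"select top {int(top)} * from vLogs "
        f"where LogTime > DATEADD(hour, -{int(hours_back)}, GETDATE()) "
        f"and ProcedureName <> '{exclude_procedure}' "
        "order by LogTime desc"
    )


if __name__ == "__main__":
    print(build_vlogs_query())
```

Paste the printed statement into SQL Server Management Studio against the site database; the row cap keeps the load on a struggling server predictable.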

spdiagDRS (stored procedure) – This stored procedure provides an overview of the state of DRS replication at the site, including status, messages in queue, messages processed, messages sent, conflicts, current link status, last sync for each replication group, and versions.  It will give you most of the SQL-related information you need for a lot of replication troubleshooting, or at least point you in the best direction to look deeper.

exec spdiagdrs


Replication and Troubleshooting Certificates

Often, when the basics of DRS troubleshooting have not gotten to root cause and I get contacted, it ends up being an issue related to certificates.  A couple of common issues are that the certificates do not match or are missing.  Another issue can come from changing the account that SQL is running under.  If you install and use the SYSTEM account to start the SQL services and later change to a service account or domain account, replication will break because the new account does not have the rights to read the original master key used to generate and validate the certs.

“Connection handshake failed.  Error 15581 occurred while initializing the private key corresponding to the certificate….State 88” or

“Service Broker login attempt failed with error: ‘Connection handshake failed.  The certificate used by the peer is invalid…State 89’”

Let’s next verify we are headed down the correct rabbit hole by running the following query.  Within the results of this query, the transmission_status column should display any errors that are related to network communications, such as firewalls blocking replication or authentication errors.

select * from sys.transmission_queue

In the case of Error 15581 you should see the following or something similar.

“Service Broker login attempt failed with error: ‘Connection handshake failed. An error occurred while receiving data: ‘10054(An existing connection was forcibly closed by the remote host.)'”

To resolve this you have two options: delete the current master key and generate a new one in SQL, or give the new account full control of the MachineKeys folder.  Giving the new account full control of the folder and all of its child objects is the faster and easier solution.  If you want to go the route of generating a new key, you will need to use the stored procedure spCreateAndBackupSQLCert to build the new key and then copy the certs to all of the site servers participating in replication.  The steps for that route are shown below, along with some good information on additional certificate troubleshooting.

exec spCreateAndBackupSQLCert

Query to display certificates


select * from vSMS_SC_SiteDefinition_Properties where name='SQLServerSSBCertificateThumbprint'

use master

select name, cert_serial_number  from sys.certificates

Here you can see the output from my CAS and its child primary site server.  The first example is the CAS server, where row 9 shows the endpoint cert from the child primary site (MSC).  Notice that the serial numbers in the second column match for rows 8 and 9, which shows the sites have properly exchanged the correct certificate versions.

SQL Command to view certificates-CAS-01

SQL Command to view certificates-PRI-01

You should see the same certificate thumbprint listed in this registry key.  If not, that is a problem; see the script below on exporting and importing certificates.

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\SMS\SQL Server\SSBCertificateThumbprint
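When comparing the thumbprint from the query with the one in the registry, differences in case, spacing, or a leading 0x can make two identical values look different at a glance.  Here is a minimal Python sketch of a normalized comparison; the function names and sample values are hypothetical.

```python
def normalize_thumbprint(value: str) -> str:
    """Remove whitespace and an optional 0x prefix, then upper-case."""
    cleaned = "".join(value.split())
    if cleaned.lower().startswith("0x"):
        cleaned = cleaned[2:]
    return cleaned.upper()


def thumbprints_match(from_sql: str, from_registry: str) -> bool:
    """True when both values are non-empty and equal after normalizing."""
    a = normalize_thumbprint(from_sql)
    b = normalize_thumbprint(from_registry)
    return bool(a) and a == b


if __name__ == "__main__":
    # Made-up sample values, not real thumbprints.
    print(thumbprints_match("0xAB12CD34", "ab 12 cd 34"))
    print(thumbprints_match("AB12CD34", "AB12CD35"))
```

Copy the two values in from the query output and the registry, and only start exporting and importing certificates if the comparison really fails.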

Script to Export or Import Certificates

This will make a backup of your endpoint certificate locally to the root of the C drive.

use Master

backup certificate ConfigMgrEndpointCert to file = 'C:\CM-EndPoint.CER'

Some of the certificates you may not have the rights to export, but you should be able to export the ConfigMgrEndpointCert(s) and the following certificates:





In this example I am inserting the certificate exported from my CAS (CAS) server to my primary site server (MSC).

use CM_MSC

exec dbo.spCreateSSBLogin @EndPointLogin='ConfigMgrEndPointLoginCAS', @DestSiteCode='CAS', @DestSiteCertFile='C:\CM-EndPoint.CER', @EndPointName='ConfigMgrEndpoint'

Useful SQL Queries

If you are still not able to determine the source of your replication issues here are a few SQL queries that may help isolate the problem.

Query to find a LockID of locked resources

select * from SEDO_LockState where LockStateID = 1

Query to list DRS Conflicts

select * from DrsConflictInfo

SQL Query for Replication Data Conflict-CAS-01

SQL Query for Replication Data Conflict-CAS-cont-01

Query for the link status

select * from RCM_ReplicationLinkStatus

SQL Query for Link Status-CAS-01

Query for Service Broker status

select * from sys.tcp_endpoints where type_desc = 'SERVICE_BROKER'

SQL Routes

select * from sys.routes

Query to view the first 1000 messages in the DRS queue

select top 1000 *, casted_message_body =
case message_type_name when 'X'
then cast(message_body AS NVARCHAR(MAX))
else message_body
end
from [CM_CAS].[dbo].[ConfigMgrDRSQueue] with(NOLOCK)

Query for all sites replication status

select * from ServerData

There are so many things that can go wrong with replication, and it is such a large topic, that it is difficult to cover it all.  Hopefully this information, while a little disorganized, is helpful in your replication troubleshooting efforts.  One final piece of advice: if you have to reinitialize your replication, make sure you fully understand how much data is going to be sent before you do it.  Feel free to contact me if you have questions.





(c) 2015  All rights reserved.  You may not copy and post more than a single paragraph without written authorization from the author.  You may not copy and paste this article on any other blog or website without written authorization from the author.

Critical Error Your Start Menu Isn’t Working

I have been working on a post regarding DRS the last couple days and only had to test some custom SQL before finishing it but this morning I ran into this…

[Screenshot: Critical Error - Your Start Menu Isn't Working]

This was the result of clicking on the start button in Windows 10 build 1511.

Seems I am not alone, and the culprit seems to be Dropbox, or at least those are the rumors (here and here).  I had reinstalled Dropbox on my PC maybe a week or two earlier, so that seemed reasonable.  I uninstalled Dropbox and rebooted, but no joy, same issue.

In the Microsoft forums a Microsoft Support Engineer suggested this and it seemed to work for quite a few people.

“Please follow the below steps and check if it helps in resolving the issue.

  1. Open the Task manager. Here’s a tip: Press CTRL+Shift+ESC.
  2. Click File > Run New Task.
  3. Make sure you have a check mark beside “Create this task with administrative privileges”.
  4. Type CMD.
  5. Type the following 4 commands at the CMD prompt:

    dism /online /cleanup-image /restorehealth

    sfc /scannow

    powershell

    Get-AppXPackage -AllUsers |Where-Object {$_.InstallLocation -like "*SystemApps*"} | Foreach {Add-AppxPackage -DisableDevelopmentMode -Register "$($_.InstallLocation)\AppXManifest.xml"}

  6. Close the CMD window.”

Not for me though.  I was getting “Error: 0x800f081f” or “The source files could not be found” when I ran the DISM command “dism /online /cleanup-image /restorehealth”

Seems the source files were corrupt, so here is how I got it to work.

Note: You will need your Windows 10 installation source, I have an ISO but a DVD, USB key, or the extracted files on a disk works just as well.

  1. Open PowerShell or a command prompt as an administrator.  Since you can’t use the start menu, you can either right click on the start button and select Command Prompt (Admin), or use good old Ctrl-Alt-Del, open Task Manager, select File, Run new task, check the box to create the task with admin rights, and type PowerShell.

2. Either insert your Windows 10 installation disk into your DVD drive or, if you have the ISO, simply right click on it and choose Mount from the menu.  Note the drive letter where your Windows 10 bits are now.

3. Replace the X in the command below with the drive letter you noted in step 2, then copy and paste it into your cmd prompt or PowerShell window and hit enter.

DISM /Online /Cleanup-Image /RestoreHealth /source:WIM:X:\Sources\Install.wim:1 /LimitAccess

Instead of it sitting at 20% you should see it slowly progress and it should complete after several minutes.

Since you have a few minutes, here is what changed.  This is the same command that errored out above (dism /online /cleanup-image /restorehealth) with two additions.  /source:WIM:X:\Sources\Install.wim:1 tells DISM where to find our Install.wim file, and /LimitAccess at the end tells DISM not to try to get the files from Windows Update, so we don’t waste time waiting.

4. Once that completes successfully, using the same window, run this command to repair any damaged system files.

sfc /scannow

5.  After that completes you can close the window and reboot.
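If you need to repeat this on more than one machine, the DISM command from step 3 can be generated from the drive letter so nobody forgets to replace the X.  This is a small Python sketch; the function name is hypothetical, and the image index of 1 is simply the one used in the command above.

```python
import string


def build_dism_restore_command(drive_letter: str, image_index: int = 1) -> str:
    """Return the DISM /RestoreHealth command for a mounted Windows 10 source."""
    # Accept "e", "E:", or "E:\" and reduce it to a single upper-case letter.
    letter = drive_letter.strip().rstrip(":\\").upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError("drive_letter must be a single letter such as 'E'")
    if image_index < 1:
        raise ValueError("image_index must be 1 or greater")
    return (
        "DISM /Online /Cleanup-Image /RestoreHealth "
        f"/source:WIM:{letter}:\\Sources\\Install.wim:{image_index} /LimitAccess"
    )


if __name__ == "__main__":
    print(build_dism_restore_command("E:"))
```

Paste the printed command into the elevated prompt from step 1.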

After you reboot, your start menu and Cortana should be working again.  I did not need to run the last command suggested by the Support Engineer or the one he suggested after his solution.  That Get-AppXPackage command re-registers the built-in system apps, but it wasn’t needed in my case.  Hopefully it continues to work, as some have reported that the error seems to come back.

Good luck!


Your error may look like one of these:

Critical Error
Your start menu is damaged. We will try to fix it on start up.

Critical Error
Your Start Menu isn’t working. We’ll try to fix it the next time you sign in.

Critical Error
Start menu and Cortana aren’t working. We’ll try to fix it the next time you sign in.

And in event viewer you may see these events:

Activation of app Microsoft.Windows.ShellExperienceHost_cw5n1h2txyewy!App failed with error: This app does not support the contract specified or is not installed. See the Microsoft-Windows-TWinUI/Operational log for additional information.

Event ID 5973

 Activation of app Microsoft.Windows.Cortana_cw5n1h2txyewy!CortanaUI failed with error: This app does not support the contract specified or is not installed. See the Microsoft-Windows-TWinUI/Operational log for additional information

Event ID 5973

Configmgr vNext Visio Shapes

I have been using these shapes for my architecture designs after getting tired of using shapes that are based on Vista.  These are more modern shapes and include two variations for servers and alternative shapes for most roles in Configmgr to fit different styles.  I have included some other miscellaneous shapes and shapes for containers, BranchCache, PeerCache, Nano server and others.  In total, there are about 125 different shapes.

Configmgr vNext Shapes


Also on TechNet here

Package Creation Internals in Configmgr

In this article, I will describe in depth how a package and program is created.  Why the older package and program process and not the newer application model?  The application method may come next, but mostly I wrote this to share the information with a peer and because it goes into more depth than most articles I see on the subject.  I will describe the process from the time the Configmgr admin creates the package in the console up until the newly created package and program(s) are distributed; distribution of a package warrants its own article at this depth.  Specifically, I will show how the different Configmgr components are involved, what logs are written to, and what you should see in the logs, as well as additional details that are not typically discussed, like the inboxes and the files created for processing.  If you are new to Configmgr, this is not a good article to start with, as I assume some basic understanding of Configmgr terminology and its processes.


10 Things You Must Know Before Your Next Windows Deployment

This is the first article in a series on Windows 10 deployment and management.  In this first article, I will cover some of the basics, as a refresher or in case Windows 10 is your first time deploying an OS.  I will also cover what has changed or been updated recently as well as what is new.  This series will get progressively deeper into deployment, and then I will cover management of your newly deployed Windows 10 computers and devices, starting with the basics again followed by deeper technical articles.  With Windows 10 being available on July 29th and the paradigm shift in Windows and Windows deployment, my intention is to provide enough information for others to confidently deploy and manage Windows 10 before the 29th of July.


Series of articles:

  1. What is new in Windows deployment
  2. Options for deploying Windows 10
  3. How-to successfully deploy Windows 10
  4. What is new in Windows systems management
  5. Options for managing Windows 10
  6. How-to manage Windows 10


Advanced Threat Analytics Now Included in EMS

ZDNet ran a story on this yesterday with an update today, but it looks like it is confirmed at this point.  There is no additional cost to the EMS license for this, which is interesting, and I am not sure I believe that part.

“Update (June 23): Microsoft has removed the blog post about EMS getting Advanced Threat Analytics, and also has removed the webinar registration post. (My guess is they weren’t yet ready to announce this.)”

From Brad Anderson’s Ignite summary post on ATA

The problems caused by compromised user credentials is the #1 issue we hear reported by organizations all over the world.

The reason for this problem is twofold:

  • First, many end users are still getting up to speed when it comes to understanding the importance of credential security.
  • Second, the existing security tools are just too cumbersome – they create way too many false positives, they take years to fine tune, and the reports they generate are nearly impossible to read and understand quickly.

Perhaps the most problematic issue of all is how traditional IT security solutions operate once a breach occurs. Getting a massive data dump when you’re trying to identify and isolate the intrusion can take far too long at a time when every second can make or break your organization. It’s counterproductive to have your security software hand you a haystack when you really need a needle.

  • You can detect advanced security threats fast via behavioral analytics that leverage Machine Learning.
  • Now you can adapt to the changing nature of cyber-security threats with a technology that is continuously learning.
  • You can narrow down the most important factors using the simple attack timeline.
  • ATA’s innovative technology reduces false positive fatigue and raises red flags only when needed.

More info on ATA here.