Several years ago Kent A. and someone else put out a couple spreadsheets on database sizing and server sizing for Configmgr. I used them off and on for a few years and over time I have revised them based on the newer SQL server best practices (SQL 2012 R2+).
In the days of SQL 2008 and earlier, the common db practice was to have multiple files for the configmgr database files (MDF & NDF files), while the TempDB was configured to use a single file (MDF file). And all databse transaction log files (LDF Files) were configured to use single files alongside each database.
TempDB or tempdb database: Is a globally used file available and is used for holding temporary objects such as local and global tables, stored procedures, table variables, row versions, and query results. The tempdb is temporary, it gets recreated at each reboot or SQL service start. It is literally empty at each start, all data is purged and the db cannot be backed up. The tempdb.mdf is the data file, templog.ldf is the log file for the tempdb. The tempdb autogrowth setting can cause serious performance issues if the size of the db is too small and it is constantly growing the db. A common practice when running a SQL server in an Azure VM is to put the temdb files on the D drive since the drive is non-persistent during reboots.
The best practice for SQL 2012 R2 and beyond is to use a single database file for the Configmgr database and multiple files for the TempDB files (MDF & NDF files). The transaction log files (LDF files) are still configured for a single file as before. My buddy Steve Thompson points this out in a recent post where he discusses the proper tempdb creation practices. Steve is also a former SQL MVP, now a Configmgr MVP so he knows both well.
The topic of using a CAS or multiple primaries for a customer came up on a discussion list today and when I replied I had screen shots of my sql sizing spreadsheet and my site sizing spreadsheet and a few people asked me for copies of them. They can be downloaded now and as they get updated I will do my best to make the new versions available.
In this article, I will attempt to peel back some of the mystery that goes on with replication in Configmgr and provide some additional help when it comes to resolving replication issues, and they will occur.
The Basics of Replication in Configmgr
Configmgr underwent a shift several years ago as Microsoft tried to fix that backlog issue that customers often times encountered with the file based replication style. The hierarchy architecture was flattened and replication went from files to SQL, for the most part at least. This is well covered on TechNet so I will spare you the details.
There are some additional basics of replication that should be stated for completeness of the topic.
Configmgr does not use the built-in SQL replication it uses components of SQL and manages the replication of data on its own. The easiest way I can state this is, the product team coded their own replication into the product, it does not use the built in features of Microsoft SQL Replication.
There are mainly two types of data replication between sites, Global data and Site data. Global data is replicated from the CAS downward, while site data is the opposite.
Global data includes data such as collection rules.
Site data includes things like collection results and client data.
Each data type contains several replication groups that logically groups data from different tables together. These replication groups are classified as global and site replication groups.
select * from vReplicationData is a query to display all replication groups.
If you want to see what replication groups are included as part of global or site data, simply modify the previous query by specifying the type of data, similar to these.
select * from vReplicationData where Replicationpattern = ‘global‘
select * from vReplicationData where Replicationpattern = ‘site‘
Want to see what data is part of that replication group from the output of the previous two queries? The first column of the output should be ID, simply select the matching ID from the replication group you want to investigate and use the following query. Replace 30 with the ID number you are interested in.
select * from vArticleData where ReplicationID = 30
ID 1 – 14 is global data, 15 is cloud data, 16 – 30 is for site data, 31 – 33 is for secondary site replication data, beyond 33 is typically more site data replication groups.
Since we are almost to the real meat of what data is being sent why not look and see what tables are included in a group? And what better to look at than HWINV data? This query will return the list of tables that is being replicated as part of any of the replication groups with Hardware_Inventory in the name of the group.
select ArticleName from ArticleData where ReplicationID in (select ID from vReplicationData where ReplicationGroup like ‘Hardware_Inventory%’)
UPDATE: A friend was gracious enough to send me an email regarding the previous query and how to better see what data is in a replication group I recommend using his below. He is easily one of the foremost experts on DRS and how to troubleshoot replication. He and his team have one of the largest implementations of SCCM globally and it is magnificently run. Just from an efficiency standpoint, his query is much better and more elegant.
“HINV replication groups vary by install and by how much you extend the mof. I don’t think any 2 groups look alike or contain the same data.
To see what data is in each I use this:”
FROM vArticleData AS App
INNER JOIN v_ReplicationData AS Rep ON App.ReplicationID = Rep.ID
ORDER BY Rep.ReplicationGroup, App.ArticleName
Another important aspect of DRS is the SQL Service Broker (SSB) which handles incoming and outgoing messages, guaranteeing their delivery by allowing them to use an asynchronous queue to store the messages in. If this cannot function the flow of data into and out of SQL replication will stop.
Site Data Processing – An Example
When a client runs its scheduled hardware inventory it stores the output of the WMI data in an XML file. Then the client copies that XML file up to its management point (MP). Assuming the MP is not a primary site server, the MP message handler processes the clients XML file and it gets converted into a MIF file. Then the MP File Dispatch Manager takes the MIF file and uploads into the clients primary site servers dataldr inbox folder. The MIF file is then read by the data loader component and the data is inserted in the SQL database.
Technically, there are more than two types of data being replicated, but technically Configmgr can also use the built-in SQL replication too, technically.
Advanced Troubleshooting of Replication in Configmgr
When the Data Replication Service (DRS) stops working it can be a nightmare. If replication breaks and you cannot fix it within a certain amount of time you will lose data and if you have to re-initialize your replication it generate gigs and gigs of network traffic across the wire as you replicate all that data to each site server, again. Below are some additional tips to help when troubleshooting DRS when the Replication Link Analyzer (RLA) doesn’t do the trick.
Running RLA in a script or from the cmd prompt:
%path%\Microsoft Configuration Manager\AdminConsole\bin\Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.Wizard.exe <source site server FQDN> <destination site server FQDN>
If RLA fails any remediation actions while it is running, the log files contain more detail than the XML file. Also, ensure that it was able to restart the SMS_SITE_COMPONENT_MANAGER and SMS_EXECUTIVE services.
The first step is to get more information from the log files, this is accomplished the same way as all other logs in CM, by modifying the amount of logging through registry keys. There are two different log types we are going to use, your standard text based log files you typically view with notepad, splunk, or cmtrace, while the other type of log files are viewed in SQL because they are SQL managed components of Configmgr and the logging is stored in SQL tables not like the .log files for the majority of Configmgr’s components. You can add this to the long list of reasons why SQL should be ON BOX and CM admins should have sysadmin forever on their SQL instance(s). Moving on.
Replication Configuration Monitor Log
Rcmctrl.log – Replication Configuration Monitoring (RCM) log file that shows an overview of sync status, site status and stored procs used.
DWORD Value 0 = Errors and key messages DWORD Value 1 = Errors, key messages, and more general information* DWORD Value 2 = Everything (Verbose)*
*Make sure you return this back to 0 after you have resolved your DRS issues.
By default the two replication groups that record messages to the Rcmctrl.log are Site Control Data and Configuration Data, if you see errors with another replication group you can include it in the logging by adding it to the following registry key.
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\SMS\COMPONENTS\SMS_REPLICATION_CONFIGURATION_MONITOR\DRS Replication Group Message Logging
*Make sure you return this back to the default after you have resolved your DRS issues. The default is: Site Control Data,Configuration Data
SQL Components Logging
vLogs (stored in SQL views) – Because some components of RCM are running in SQL Server hosted managed code, each of these components is provided a table in CM SQL db to record log messages to. Logging is recorded in: vDrsReceivedMessages, vDrsReceivedHistory, vDrsSendHistory, vDrsSentMessages, vDrsSyncHistory, and vDrsSyncMessages
DWORD Value 0 = Errors and critical information only DWORD Value 1 = Errors, critical information, warnings, and general information DWORD Value 2 = Everything (Verbose)*
*Make sure you return this back to 1 after you have resolved your DRS issues.
Now that additional details are being logged here are two queries to run that will display details of the vLogs information. If you cannot determine the source of the problem using this information I have listed below some additional queries and troubleshooting information.
Query the SQL vLog
Now we are ready to query the vLog and get extra details. A word of caution first, I recommend that you use the first query that only returns the first 1000 messages, by running the query and not limiting it you risk making things worse by adding additional pressure to your SQL db when it may already be in a degraded state and using the maximum amount of resources it has available.
This query returns the last thousand messages which have been logged, ordered by the time they were written into the database.
select Top 1000 * from vLogs order by LogTime desc
This query does not limit the messages returned and may cause your server to fall over! This is your second and final warning.
select * from vLogs where LogTime > GETDATE()-1 and ProcedureName <> ‘spDRSSendChangesForGroup’ ORDER BY LogTime DESC
spdiagDRS (stored procedure) – This stored procedure provides an overview of the state of DRS replication at the site including status, messages in queue, messages processed, messages sent, conflicts, current link status, last sync for each replication group, and versions. This storedproc will give you most of the SQL related information you will need for a lot of replication troubleshooting, or at least point you in the best direction to look deeper.
Replication and Troubleshooting Certificates
Often times when the basics of DRS troubleshooting have not gotten to root cause and I get contacted it ends up being issues related to certificates. A couple common issues are that the certificates do not match or are missing. Another issue can be from changing the account that SQL is running under, if install and use the SYSTEM account to start the SQL services and later on change the accounts to a service account or domain account it will break because the new account does not have the rights to read the original master key used to generate and validate the certs.
“Connection handshake failed. Error 15581 occurred while initializing the private key corresponding to the certificate….State 88” or
“Service Broker login attempt failed with error: ‘Connection handshake failed. The certificate used by the peer is invalid…State89“
Let’s next verify we are headed down the correct rabbit hole by first running the following query. Withing the results of this query, the transmission_status column should display any errors that are related to network communications, such as firewalls blocking replication or authentication errors.
select * from sys.transmission_queue
In the case of Error 15581 you should see the following or something similar.
“Service Broker login attempt failed with error: ‘Connection handshake failed. An error occurred while receiving data: ‘10054(An existing connection was forcibly closed by the remote host.)'”
To resolve this you have two options, delete the current master key and generate a new master key in SQL or assign the new account full control to the MachineKeys folder. Giving the new account full control to all and the child objects as well is the faster and easier solution. If you want to go the route of generating a new key you will need to use the storedproc spCreateAndBackupSQLCert to build the new key and copy the certs to all of the site servers participating in replication. You can see more on how to accomplish the second method below but it also provides some good information on additional troubleshooting certs.
Query to display certificates
select * from vSMS_SC_SiteDefinition_Properties where name=’SQLServerSSBCertificateThumbprint’
select name, cert_serial_number from sys.certificates
Here you can see the output from my CAS and it’s child primary site server. The first example is the CAS server row 9 shows the endpoint cert from the child primary site (MSC). Notice the thumbprints in row 8 and 9 show they have properly exchanged the correct certificate versions as the serial numbers in the second column match.
You should see the same certificate thumbprint listed in this registry key. If not that is a problem and see the script below on exporting and import certificates.
If you are still not able to determine the source of your replication issues here are a few SQL queries that may help isolate the problem.
Query to find a LockID of locked resources
select * from SEDO_LockState where LockStateID = 1
Query to list DRS Conflicts
select * from DrsConflictInfo
Query for the link status
select * from RCM_ReplicationLinkStatus
Query for Service Broker status
select * from sys.tcp_endpoints where type_desc = ‘SERVICE_BROKER’
select * from sys.routes
Query to view the first 1000 messages in the DRS queue
select top 1000 *, casted_message_body = case message_type_name when ‘X’ then cast(message_body AS NVARCHAR(MAX)) else message_body end from [CM_CAS].[dbo].[ConfigMgrDRSQueue] with(NOLOCK)
Query for all sites replication status
select * from ServerData
There is so many things that can go wrong with replication and it is such a large topic it is difficult to cover it. Hopefully this information, while a little disorganized, is helpful in your troubleshooting efforts of replication. One final piece of advice, if you have to reinitialize your replication make sure you fully understand how much data is going to be sent before you do it. Feel free to contact me if you have questions.
(c) 2015 configmgr.com. All rights reserved. You may not copy and post more than a single paragraph without written authorization from the author. You may not copy and paste this article on any other blog or website without written authorization from the author.