Fixing the AppFabric Cache Cluster in SharePoint 2013

I ran into this at a client site recently and wanted to blog my experience. I had a number of things not working as expected in the cache (including Event ID 1000 and Event ID 1026), and at the end of the day it boiled down to two things. First, the cache cluster was improperly configured, so I ended up wiping out the cluster and rebuilding it. Then, after much pain, I found that one of the servers in the cluster, the one that constantly complained about not being able to start properly, was still misconfigured (using the wrong account). Stopping the cluster, exporting the config, fixing it, reimporting it, and restarting the cluster finally solved the problem for good.

I started with this blog post http://www.sharepointconsultant.ch/2013/03/07/adding-a-local-sharepoint-2013-development-server-as-a-cache-host-to-appfabrics-cache-cluster/.

That gave me the following knowledge:

1) There’s a Windows Service named “AppFabric Caching Service”, which matches 1:1 to each server in the cluster (i.e. every server that’s part of the cluster has this service on it, and if it’s healthy it should be set to “Automatic” startup and be running).

2) The key PowerShell you’ll need to know is as follows.

** Always run your PowerShell window as Administrator when working with the AppFabric Cache **

Start with the following line of PowerShell to let it know who’s boss.

PS C:\> Use-CacheCluster

Next, find out the details about your individual host.  (It’s most likely configured on port 22233)

PS C:\> Get-CacheHostConfig -ComputerName [yourServerName] -CachePort 22233

That should return the details for this server in the cluster.  Something like below.

HostName        : [Your Server Name]
ClusterPort     : 22234
CachePort       : 22233
ArbitrationPort : 22235
ReplicationPort : 22236
Size            : 400 MB
ServiceName     : AppFabricCachingService
HighWatermark   : 99%
LowWatermark    : 90%
IsLeadHost      : True

If, however, you’re getting an error along the lines of:

Get-AFCacheHostConfiguration : ErrorCode<ERRCAdmin010>:SubStatus<ES0001>:Specified host is not present in cluster.

You can register your host in the cluster as follows.

PS C:\> Register-CacheHost -Provider [yourProvider] -ConnectionString [yourConnectionString] -Account "NT Authority\Network Service" -CachePort 22233 -ClusterPort 22234 -ArbitrationPort 22235 -ReplicationPort 22236 -HostName [yourServerName]

You’ll need 3 pieces of information to properly run the statement above.

yourProvider & yourConnectionString – Can be found in the registry under HKLM\SOFTWARE\Microsoft\AppFabric\V1.0\Configuration, or in the DistributedCacheService.exe.config file in C:\Program Files\AppFabric 1.1 for Windows Server.

yourServerName – The name of your server

(Optionally you can change the account, but I would recommend you leave the Network Service account in place – this seems to keep SharePoint 2013 happy)
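If you would rather not click through regedit, the registry key above can be read from the same PowerShell window. This is a minimal sketch that simply dumps every value under the key; I would expect the provider and connection string to appear among them, but the exact value names are an assumption, so check the output rather than trusting me.

```powershell
# Dump all values stored under the AppFabric configuration key mentioned above.
# The provider and connection string should be among the values listed.
$key = 'HKLM:\SOFTWARE\Microsoft\AppFabric\V1.0\Configuration'
Get-ItemProperty -Path $key | Format-List
```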

Now when you run this command:

PS C:\> Get-CacheHost

You should see the following.

HostName : CachePort         Service Name            Service Status Version Info
--------------------         ------------            -------------- ------------
MyServer1.domain.com:22233   AppFabricCachingService UP             3 [3,3][1,3]
MyServer2.domain.com:22233   AppFabricCachingService UP             3 [3,3][1,3]

At the very least, you should see both servers in the cluster at this point.  If you see this above, you’re done, and don’t need the rest of this article.  However, if you’re unlucky, and one or more of the servers are down (Service Status = Down, or Starting) keep reading.
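Before going further, it can help to confirm what the Windows service itself is doing on the affected server. The following is a minimal sketch, assuming the service's internal name is AppFabricCachingService (the ServiceName shown in the output above) and using the standard Win32_Service WMI class.

```powershell
# Show the state, startup mode, and logon account of the AppFabric Caching Service
# on the local server. A healthy cache host would be expected to show
# State = Running and StartMode = Auto.
Get-WmiObject -Class Win32_Service -Filter "Name='AppFabricCachingService'" |
    Select-Object -Property Name, State, StartMode, StartName
```

StartName is the account the service logs on as, which is worth comparing across the servers in the cluster.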

At this point, one of my servers was not started (DOWN), so I went ahead and ran the following.

PS C:\> Start-CacheHost -ComputerName [yourServerName] -CachePort 22233

If that failed, like it did for me, I would recommend exporting your cache cluster configuration, and seeing if anything is wrong.  To do this, run the following.

PS C:\> Export-CacheClusterConfig [path to output filename]

So, for example…

PS C:\> Export-CacheClusterConfig c:\file.txt

When looking at the file, down near the bottom, I noticed that the account that MyServer1 was running under was all goofy (usernames shouldn’t have tildes in them).

<hosts>
     <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
         hostId="1909348767" size="800" leadHost="true" account="DOMAIN\appsrv1~"
         cacheHostName="AppFabricCachingService" name="MyServer1.domain.com"
         cachePort="22233" />
     <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
         hostId="1634054989" size="400" leadHost="true" account="DOMAIN\spService"
         cacheHostName="AppFabricCachingService" name="MyServer2.domain.com"
         cachePort="22233" />
</hosts>
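With more than a couple of hosts, reading the accounts by eye gets tedious. Here is a rough sketch of pulling the host entries out of the exported file so that mismatched accounts stand out. Get-CacheHostAccount is a hypothetical helper name of my own, not an AppFabric cmdlet, and the sketch assumes the export is well-formed XML containing a hosts element like the one above.

```powershell
# Hypothetical helper (not part of AppFabric): list each <host> entry in an
# exported cluster configuration together with the account it is registered under.
function Get-CacheHostAccount {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    # Load the exported cluster configuration as XML.
    $config = New-Object System.Xml.XmlDocument
    $config.Load((Resolve-Path -Path $Path).ProviderPath)

    # Match elements by local name so the query works with or without an XML namespace.
    $xpath = "//*[local-name()='hosts']/*[local-name()='host']"
    foreach ($node in $config.SelectNodes($xpath)) {
        New-Object PSObject -Property @{
            Name     = $node.GetAttribute('name')
            Account  = $node.GetAttribute('account')
            LeadHost = $node.GetAttribute('leadHost')
            Size     = $node.GetAttribute('size')
        }
    }
}

# Example: Get-CacheHostAccount -Path 'C:\file.txt' | Format-Table -AutoSize
```

Any host whose Account differs from the others is the first place I would look.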

WARNING: MAKE A BACKUP BEFORE YOU MAKE ANY CHANGES!!!
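One simple way to take that backup is to copy the exported file before touching it. This is a minimal sketch, assuming the export was written to C:\file.txt as in the example above; the backup file name is just an illustration.

```powershell
# Keep an untouched, timestamped copy of the exported configuration before editing it.
$export = 'C:\file.txt'
$backup = 'C:\file.{0}.bak.txt' -f (Get-Date -Format 'yyyyMMdd-HHmmss')
Copy-Item -Path $export -Destination $backup
```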

I fixed the account name (to match the service account on the other server, DOMAIN\spService) and then had to import the configuration back in.

** BUT WAIT – There’s more! **

Before you try to import your configuration, you’ll need to go into your Windows “Services” application and disable the “AppFabric Caching Service”, and then stop the service on each server in the cluster.

To do this, open the Services console, find the “AppFabric Caching Service”, and double-click it.

Next, follow this order exactly: set the startup type to Disabled, then stop the service (this is the same as running the PowerShell to shut down the AppFabric host).

Repeat the above steps on each server in the cache cluster.
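If you have several servers, the same disable-then-stop sequence can be scripted instead of clicked through. This is only a sketch: the server names are hypothetical, the service name is taken from the ServiceName shown earlier, and it relies on the -ComputerName parameter that Set-Service and Get-Service have in Windows PowerShell (your account needs administrative rights on each server).

```powershell
# Hypothetical server names; replace with the hosts in your cache cluster.
$servers = 'MyServer1', 'MyServer2'

foreach ($server in $servers) {
    # Disable first so nothing restarts the service, then stop it.
    Set-Service -ComputerName $server -Name 'AppFabricCachingService' -StartupType Disabled
    Get-Service -ComputerName $server -Name 'AppFabricCachingService' | Stop-Service
}
```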

Finally, once you’re done, you can import the file as shown below.

PS C:\> Import-CacheClusterConfig C:\file.txt

Confirm
Are you sure you want to perform this action?
Performing operation "Replace cluster configuration." on Target "Cluster configuration.".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): y

If you shut down the cluster properly (like I describe above), your configuration should take at this point. 

If you see the following error, ensure that you’ve shut down the service on all servers in the cluster (seen above).

Import-AFCacheClusterConfiguration : ErrorCode<ERRCAdmin001>:SubStatus<ES0001>:Hosts are already running in the
cluster.
At line:1 char:1
+ Import-AFCacheClusterConfiguration C:\file.txt
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     + CategoryInfo          : NotSpecified: (:) [Import-AFCacheClusterConfiguration], DataCacheException
     + FullyQualifiedErrorId : ERRCAdmin001,Microsoft.ApplicationServer.Caching.Commands.ImportAFCacheClusterConfigurat
    ionCommand

Go back to your services window and set your AppFabric service back to Automatic. Now all you should need to do is start the cluster, and you’ll be good.
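The startup type can also be put back from PowerShell. As before, this is a sketch with hypothetical server names, and it assumes Windows PowerShell, where Set-Service accepts -ComputerName. It only changes the startup type and does not start the service.

```powershell
# Hypothetical server names; replace with the hosts in your cache cluster.
$servers = 'MyServer1', 'MyServer2'

foreach ($server in $servers) {
    # Re-enable the service so the cluster is allowed to start it again.
    Set-Service -ComputerName $server -Name 'AppFabricCachingService' -StartupType Automatic
}
```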

PS C:\> Start-CacheCluster

And all your servers should be UP at this point.  You can also check the cluster health with the following.

PS C:\> Get-CacheClusterHealth

You can also check the Cache status with the following command.

PS C:\> Get-Cache
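To confirm in one go that every host really is UP, you can filter the Get-CacheHost output. This is a small sketch, assuming the objects returned by Get-CacheHost expose the service status in a property named Status; if that assumption does not hold on your version, pipe Get-CacheHost to Get-Member to find the right property.

```powershell
# Connect to the cluster, then collect any host whose status is not UP.
Use-CacheCluster
$notUp = @(Get-CacheHost | Where-Object { $_.Status -ne 'Up' })

if ($notUp.Count -eq 0) {
    Write-Host 'All cache hosts report UP.'
} else {
    # Show the hosts that still need attention.
    $notUp | Format-Table -AutoSize
}
```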

Don’t forget, you can always see all the valid PowerShell commands using the following.

PS C:\> Get-Help *Cache*

I hope this helps others where I was pulling my hair out.

30 responses to “Fixing the AppFabric Cache Cluster in SharePoint 2013”

  1. This is a fantastic article and very well written. Thanks – helped me solve my problem.

    1. This article literally saved me a tremendous amount of time!! Thank you to the author!

    2. I agree! I’ve spent days trying to fix this. For me, just exporting and re-importing the config file, then starting things back up, finally fixed the problem. GREAT blog post!!

  2. Hi,
    All the settings mentioned in the article are fine on my server, but I am still getting the error below very frequently. Can you please help me with this?

    ViewStateLog: Failed to write to the velocity cache: http://server:2731/default.aspx

    Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage ‘DistributedViewStateCache’ – Exception ‘Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out.. Additional Information : The client was trying to communicate with the server : net.tcp://server:22233 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody) at Microsoft.ApplicationServer.Caching.DataCacheFactory.GetCacheProperties(RequestBody request, IClientChannel channel) at Microsoft.ApplicationServer.Caching.DataCacheFactory.GetCache(String cacheName) at Microsoft.SharePoint.DistributedCaching.SPDistributedCachePointerWrapper.InitializeDataCacheFactory()’.

    Below is the info from my dev server

    ps>> Get-Cachehostconfig with host details gives me

    HostName : server.corp.domain.com
    ClusterPort : 22234
    CachePort : 22233
    ArbitrationPort : 22235
    ReplicationPort : 22236
    Size : 819 MB
    ServiceName : AppFabricCachingService
    HighWatermark : 99%
    LowWatermark : 90%
    IsLeadHost : True

    ps>>get-cachehost

    PS C:\Users\gnfoip02> Get-CacheHost

    HostName : CachePort         Service Name            Service Status Version Info
    --------------------         ------------            -------------- ------------
    server.corp.domain.com:22233 AppFabricCachingService UP             3 [3,3][1,3]

    Hosts under exported files is having

  3. Great article, helped me out with a quirk in our SP installation! Thanks a lot!

  4. Thanks for this, I was drawing a blank not knowing that you need to execute Use-CacheCluster first. Why oh why does the Technet documentation not mention this? It turned out after exporting the config that I also had the wrong account configured

  5. Great post, helped me solve an annoying issue.

    What I found out to resolve my issue, was that the cache-host causing problems had the pre-windows 2000 hostname in the host name attribute. Changing that to the FQDN and importing the config fixed the problem.

  6. This post is truly a gold mine.
    I had this issue and found a couple of things

    When you register FQDN for the server … otherwise it creates a dummy entry which you need to remove. Also, if you accidentally create another service instance you get an error, which is easy to fix.

    PS C:\Users\dwesterdale> Add-SPDistributedCacheServiceInstance
    Add-SPDistributedCacheServiceInstance : Cannot start service AppFabricCachingService on computer ‘.’.
    At line:1 char:1
    + Add-SPDistributedCacheServiceInstance
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidData: (Microsoft.Share…ServiceInstance:SPCmdletAddDist…ServiceInstance) [Add-SPDistributedCacheServiceInstance], InvalidOperationException
    + FullyQualifiedErrorId : Microsoft.SharePoint.PowerShell.SPCmdletAddDistributedCacheServiceInstance

    # now delete the old instance
    PS C:\Users\dwesterdale> Remove-SPDistributedCacheServiceInstance

    PS C:\Users\dwesterdale> Add-SPDistributedCacheServiceInstance

    PS C:\Users\dwesterdale> Start-CacheCluster

    — this shows an ‘error’ because the host has already started, but I think there are no real issues.

    Thereafter I unregister the dummy entry I mentioned above ..
    AppFabric service is now happy!

  7. This post really nails this problem. Great. Why doesn’t Microsoft have blogs like this?

    1. My response would be, that’s why they have the MVP program. It helps them get the best of the community contributors and reward them by saying, you helped us out and figured this out on our behalf.

  8. Colin – thank you, the Export-CacheClusterConfig helped immensely. I used the MS powershell to change the Cache Svc from the farm account to a managed service account, and everything was fine until I removed the farm account from local admin – at which point, the Dynamic Cache service crashed continually.
    When I ran the export I found that 2 groups were involved: securityProperties>

    The managed account was already in the WSS_WPG group but not in WSS_ADMIN_WPG
    I added the account and all is well.
    That still doesn’t explain why the farm account being in the local admin group would affect the DCS (which was running under another service account).

  9. Great write-up, helped me enormously. I had two cache hosts in the cluster, one of which was starting (forever) and the other down. Exporting the config, manually matching up the service account and port values, and then importing the single file to both hosts did the trick.
    Interestingly, my dodgy config came from autospinstaller installation of the service!

  10. Excellent post and great time and life saver 🙂

    It helped me understand as well as troubleshoot the problem. My problem was a bit different: one of the hosts in the cluster was suffering from ping loss, and that took the services down.

    Is it okay to have only one host running the cache services while I disable the Windows service on the others?

    Thank you very much Colin!!

  11. Hi,
    I am working on Windows Server AppFabric. I have been trying to add a second cache host to my cluster.
    What I did:
    I created a cluster and added my local machine as a cache host on cache port 22233.
    Then I installed AppFabric on a second machine and, while configuring it, joined it to the existing cluster.
    When I ran Get-CacheHost, it showed my machine with a service status of UP and the second machine with a service status of UNKNOWN.
    Also, when I viewed the config of each host, both were shown as lead host.

    Please help me in adding 2nd cache host in a cluster.

  12. Colin,

    Excellent Post… Thanks for sharing the knowledge.
    This helped me solve a rather perplexing issue!

  13. Hello – thank you for your article which almost helped me. 😉 I have 2 servers in a cluster where one has a status of UNKNOWN. The App Fabric service stops and disables itself immediately on being started. The event viewer shows a failure in KERNELBASE.dll. My errors have resisted your fixes. Any ideas?
    1. Get-CacheHostConfig : ErrorCode:SubStatus:The requested name is valid, but no data of the requested type was found
    2. Register-CacheHost : ErrorCode:SubStatus:The requested name is valid, but no data of the requested type was found
    3. Start-CacheHost : ErrorCode:SubStatus:The requested name is valid, but no data of the requested type was found

    1. I must use the NetBIOS name of the server, so if your server name is over 15 characters, truncate it in the PowerShell commands. Also check you have a DNS entry for the shorter name.

  14. This all requires that your cluster somehow runs, i.e. that there is at least one server that is OK. If that is not the case, say you have a one-server SharePoint farm and the distributed cache is broken on it, then the very first command, “Use-CacheCluster”, already fails (“Failed to connect hosts in the cluster.”). I’m looking for help on how to rebuild the cluster from absolute scratch, where the original, one and only SharePoint server has gone and you have a clone of it under a different name. (This was *not* my idea.)

    Any pointers appreciated,

    TIA
