Monday, March 26, 2012

CMS Replication issues

While installing Lync Server 2010, I ran into an issue where one Front End server would not replicate with the Central Management Store (CMS). I could ping the server and telnet to port 445, so connectivity wasn't the issue. Running netstat on the problem server and filtering on port 445, I received the errors below:

A quick internet search on CMS replication issues turned up an excellent post titled Troubleshooting CMS Replication from The Lync Guy's Blog (formerly OCSGuy). The Lync Guy ran into this exact issue, where access was denied to the CMS database while trying to write to the RTCReplicaRoot folder on the problem server. In his case, a security script had broken NTFS permissions and a complete rebuild (including the OS) was warranted.

My case was somewhat different: I wasn't using a security script, and I wanted to avoid reinstalling the OS, primarily because I had already done so once due to another issue in which Roles and Features in Windows Server 2008 R2 displayed an error message in Server Manager. Additionally, roles and features could not be changed using PowerShell. I followed the steps here to no avail and rebuilt the OS, but I digress.


I initially tried the steps listed on Louis UC Blog to repair the share permissions to the RTCReplicaRoot folder but this did not work in my case.

So I removed the problem server from the Lync topology, performed a bootstrap /scorch (to remove Lync), deleted the RTCReplicaRoot folder (to ensure that Lync setup would recreate the folder with the correct permissions), added the server back to the topology, and re-ran Lync setup.

This solved my CMS replication issue!



Friday, March 23, 2012

One Case of Dialed Number not in Service

A trouble ticket came in where a user could not dial a specific number. The client normalized the number but would not dial it. Additionally, the number could be dialed via a PBX or PSTN phone. Since the number normalized and the CSCP test voice route showed the number routing to the gateway, it was easy to guess that the AudioCodes gateways did not have the 630 prefix defined (as shown below).



Note: If this had not been the case I would have looked at the gateway logs to determine if the number was being manipulated correctly and if so, engage the switch room team.

After the prefix was added, Lync clients could dial it successfully. 

Why did the PBX dialing from our organization work? Because the PBXs were updated with the new prefix. The issue was that there was no internal process to update the gateways whenever the PBXs were updated with new prefixes. This issue has been corrected. 





Friday, March 16, 2012

Polycom CX Phone reboot issue - Resolved

During the Lync pilot, users complained that the CX600 Lync phones would completely reboot after being awakened from sleep mode. After engaging Cisco, it was determined that this was an issue with the 3750-X series switches that were in use.

Here are the notes from the Cisco engineer who assisted with troubleshooting this issue:

"You have class-2 PoE IP telephones that are using approximately 2.6 -
3.0 Watts when in sleep mode.  When resuming operation, the devices
request a power value of 65 via LLDP.  During this transition, the
phones are being reset by the Cisco switch.  Debugging this issue on the
switch we found the following:

014173: May 16 16:21:32.550: ilpower_powerman_request_from_lldp: rx lldp
power class tlv from port GigabitEthernet0/19
014174: May 16 16:21:32.550: Shut down interface (Gi0/19) since
consumption power 6500 is greater than allocation 3000

Power value of 65 allows utilization between 3.84 and 6.49 W.  The
problem isn't that the device is requesting more power than allowed by
the class, rather that it is attempting to use LLDP to request higher
power.  According to the ANSI standard, this type of request requires a
reload of the device.

###################



10.2.5.4 Power value

The Power Value field shall represent the maximum power consumption of
an Endpoint Device or the minimum power available from a PSE port of a
Network Connectivity Device when the LLDPDU is transmitted. It shall not
be used to indicate instantaneous variations of power required or
offered at the time of transmission of the LLDPDU.

A PD Endpoint Device that requires more power than previously
advertised, due to a hardware reconfiguration, shall not advertise a
higher power value than is currently advertised by the PSE Network
Connectivity Device it is connected to on that port.

If the PD Endpoint Device requires power greater than either the power
value currently advertised by the PSE Network Connectivity Device or the
upper power limit of the corresponding IEEE 802.3af power class, then it
must renegotiate this new power limit by resetting the port (i.e. by
triggering renegotiation of power at the IEEE 802.3af level).

################### "


Therefore, the reboot is expected behavior. As a workaround, the power TLV was disabled in LLDP, as the Cisco engineer suggested.
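The arithmetic behind the switch debug output can be sketched in a few lines of Python. This is only an illustration: it assumes the LLDP power value is expressed in units of 0.1 W (which is consistent with the log, where a request of 65 shows up as 6500), and the function names are hypothetical.

```python
# Hypothetical helper names; the numbers come from the Cisco debug output above.

def lldp_power_value_to_mw(power_value: int) -> int:
    """Convert an LLDP power value (assumed to be in 0.1 W units) to milliwatts."""
    return power_value * 100


def switch_would_shut_port(requested_mw: int, allocated_mw: int) -> bool:
    """Mirror the condition in the log: consumption greater than allocation."""
    return requested_mw > allocated_mw


requested = lldp_power_value_to_mw(65)  # value the phone requests when it wakes up
allocated = 3000                        # allocation reported by the switch, in mW

print(requested)                                     # 6500
print(switch_would_shut_port(requested, allocated))  # True
```

With the power TLV disabled, the phone no longer sends the higher request over LLDP, so this comparison is never triggered.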


RESOLVED with the Lync Server CU4 (November) patch!!!

One case of immediate fast busy

This issue occurred early in the Lync pilot stage but was inherited from the OCS 2007 R2 infrastructure. Symptom: when a user makes a MOC/Lync call, it sometimes goes to an immediate fast busy. Common user experiences are below:
· Users were able to make the call on a typical desk phone immediately after the failure.
· Peers were able to call the number successfully immediately after a failure.
· The call fails with 5-digit or 10-digit dialing.
· The issue was intermittent in all cases and difficult or impossible to duplicate.
· Users with OCS desk phones and headsets experienced the issue.

All client ucc logs contained the highlighted text in this excerpt:

Aug 12 14:56:53 10.1.137.98 (122-124-138) CC -APP|ACU_CLEAR_IN   |   0:61:   1#68:hungup,cause 41h=65=Bearer capability not implemented 00 00 00 00 00 00 00 00 68 41 00 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 ff ff 00 00 00 ff 00 00 00 00 00 00 00 00 00 00 00
Aug 12 14:56:53 10.1.137.98 (   lgr_psbrdex)(6136914   )  pstn recv <-- CALL_DISCONNECTED Trunk:0 Conn:1 RetCause:104 NetCause:65 [Time: 19:56:53]
Aug 12 14:56:53 10.1.137.98 (121-122-138) DIT-PHD|PH_IT_XMIT_IN  |   0: 0:   0|4: 00 01 01 c6
Aug 12 14:56:53 10.1.137.98 (   lgr_psbrdif)(6136915   )  Abnormal Disconnect cause:65#GWAPP_BC_NOT_IMPLEMENTED Trunk:0 Conn:1 [Time: 19:56:53]

Bearer capability not implemented can be caused by one of the following occurrences:
· You need to change the PCM Type value to the setting appropriate for your country. This is the most common cause, especially in countries where G.711 A-law companding is the standard. If your gateway is configured for µ-law and the service provider or PBX is expecting A-law, you will see calls disconnected with this cause code.
· The central office (CO) does not understand an information element in the setup message.
· You are connected to a PBX and you are sending out a network type number when the switch accepts only Unknown or National.
· You are selecting European PRI and you have the progress indicators turned on when they should be off.
None of the above were applicable in our case.

The AudioCodes M2K gateway logs showed the following: 

Good call setup message (note the mu-law):
0:   1|40: Q.931:  PD=8 CR=30228(0x7614,Orig) SETUP(05):
    Bc(04):         3.1kHz-Audio/64k/mu-law    90 90 a2
    ChanId(18):     B10/excl(PRI)    a9 83 8a
    CallingNb(6c):  type:natl,plan:ISDN "6153340986"    a1 36 31 35 33 34 33 30 39 38 36
    CalledNb(70):   type:unknown,plan:unknown "99475224"    80 39 39 34 37 35 32 32 34...
    Hex dump: 08 02 76 14 05 04 03 90 90 a2 18 03 a9 83 8a 6c 0b a1 36 31 35 33 34 33 30 39 38 36 70 09 80 39 39 34 37 35 32 32 34 a1
Nov  1 05:40:29 10.1.137.98 (122-123-138) DLD-PHD|PH_DA_RQ       |   0: 0:   0|44:I[52,122]p0 tei=0  02 01 68 f4 08 02 76 14 05 04 03 90 90 a2 18 03 a9 83 8a 6c 0b a1 36 31 35 33 34 33 30 39 38 36 70 09 80 39 39 34 37 35 32 32 34 a1
Nov  1 05:40:29 10.1.137.98 (122-123-138) NS -MNS|MNS_EVENT_IN   |   0: 0
------------------------------------------------------------------------------------------------------------
Bad call with A-law:
0:   1|40: Q.931:  PD=8 CR=17691(0x451b,Orig) SETUP(05):
    Bc(04):         3.1kHz-Audio/64k/A-law    90 90 a3
    ChanId(18):     B15/excl(PRI)    a9 83 8f
    CallingNb(6c):  type:natl,plan:ISDN "6158579738"    a1 36 31 35 38 37 35 39 37 33 38
    CalledNb(70):   type:unknown,plan:unknown "92922426"    80 39 32 39 32 32 34 32 36...
    Hex dump: 08 02 45 1b 05 04 03 90 90 a3 18 03 a9 83 8f 6c 0b a1 36 31 35 38 37 35 39 37 33 38 70 09 80 39 32 39 32 32 34 32 36 a1
Cause(08):          cause 41h=65=Bearer capability not implemented    81 c1 04
    Hex dump: 08 02 c5 1b 45 08 03 81 c1 04
----------------------------------------------------------------------------------------------------------
Using A-law causes the bearer capability problem and the call will fail.

Normal calls should have been using mu-law (used for North America and Japan) but on the failed immediate fast busy calls they were trying to use A-law (European coding). 
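To spot the difference without reading the traces by eye, the Bearer Capability element can be pulled out of the SETUP hex dump programmatically. The Python below is a minimal sketch, not a full Q.931 parser: it assumes the message starts at the protocol discriminator (as in the hex dumps above), that octets 3 and 4 of the element are not extended, and it only names the two layer 1 protocols seen in these traces. The function names are hypothetical.

```python
# Hex dumps copied from the gateway traces above.
GOOD_SETUP = ("08 02 76 14 05 04 03 90 90 a2 18 03 a9 83 8a 6c 0b a1 36 31 35 33 "
              "34 33 30 39 38 36 70 09 80 39 39 34 37 35 32 32 34 a1")
BAD_SETUP = ("08 02 45 1b 05 04 03 90 90 a3 18 03 a9 83 8f 6c 0b a1 36 31 35 38 "
             "37 35 39 37 33 38 70 09 80 39 32 39 32 32 34 32 36 a1")

BEARER_CAPABILITY_IE = 0x04
# Low five bits of the layer 1 octet; only the values seen in the traces are listed.
LAYER1_PROTOCOLS = {0x02: "G.711 mu-law", 0x03: "G.711 A-law"}


def find_ie(message: bytes, ie_id: int):
    """Return the content of the first information element with this id, or None."""
    call_ref_len = message[1] & 0x0F
    # Skip protocol discriminator, call reference length, call reference, message type.
    pos = 2 + call_ref_len + 1
    while pos < len(message):
        ident = message[pos]
        if ident & 0x80:            # single-octet element: no length field
            pos += 1
            continue
        length = message[pos + 1]
        if ident == ie_id:
            return message[pos + 2:pos + 2 + length]
        pos += 2 + length
    return None


def bearer_companding(hex_dump: str) -> str:
    """Name the layer 1 protocol carried in the Bearer Capability element."""
    content = find_ie(bytes.fromhex(hex_dump), BEARER_CAPABILITY_IE)
    if content is None or len(content) < 3:
        return "no layer 1 protocol present"
    return LAYER1_PROTOCOLS.get(content[2] & 0x1F, "other")


print(bearer_companding(GOOD_SETUP))  # G.711 mu-law
print(bearer_companding(BAD_SETUP))   # G.711 A-law
```

The only byte that differs between the two Bearer Capability elements is the last one: a2 on the good call and a3 on the failed call.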

After changing a bit in the AudioCodes gateway to use mu-law on outbound calls, the failures continued.

AudioCodes Support was engaged, and a firmware patch was developed for this case.

The "Bearer capability not implemented" error has not returned since the firmware upgrade.




Thursday, March 15, 2012

Script to compare an Enterprise Voice number to the assigned one in AD


Import-Module ActiveDirectory
Import-Module Lync

# For every Enterprise Voice user, compare the AD telephoneNumber with the Lync LineURI
Get-CsUser | Where-Object { $_.EnterpriseVoiceEnabled } | ForEach-Object {
    $csUser = $_

    # Get-ADUser throws when the identity is not found, so treat that as "no match"
    try {
        $u = Get-ADUser -Properties telephoneNumber -Identity $csUser.Identity.ToString() -ErrorAction Stop
    }
    catch {
        $u = $null
    }

    if ($u -ne $null) {
        # Strip common formatting characters: dots, whitespace, parentheses and hyphens
        $rawdigits = ([string]$u.telephoneNumber) -replace '[.\s()-]', ''

        # Strip the tel:+1 prefix so only the national number remains
        $LyncLineUri = $csUser.LineURI -replace 'tel:\+1', ''

        if ($rawdigits -eq $LyncLineUri) {
            Write-Host $u.DistinguishedName " telephoneNumber attribute - " $rawdigits " matches the LineUri " $LyncLineUri
        }
        else {
            Write-Host $u.DistinguishedName " telephoneNumber attribute - " $rawdigits " does not match the LineUri " $LyncLineUri
        }
    }
    else {
        Write-Host $csUser.Identity.ToString() " no user match found"
    }
}

Troubleshooting Certificates in Lync with CAPI2 logging

Certificate issues in Lync 2010 can be complex to troubleshoot. In my experience, the Lync Server event logs are not always helpful enough to solve the issue. Enter the CAPI2 log (which is not enabled by default).

First Enable the log:  

Let the CAPI2 events collect:


Now investigate:


Other certificate troubleshooting can involve the following:

Enabling the root certificate for all purposes:


Verifying Enhanced Key Usage (depending on Server role check here):


Validating Lync Server access to root certificate revocation lists (CRLs). This is easily done by copying the CRL Distribution Point URL into a web browser and confirming that you are prompted for a download:

     




F5 web services configuration guidance for Lync

Microsoft guidance for configuring cookie-based persistence for hardware load-balanced Lync Web Services connections can be found here.

Hardware Load Balancer Requirements for Web Services:
· For External Web Services virtual IPs (VIPs), set cookie-based persistence on a per-port basis for external ports 4443 and 8080 on the hardware load balancer. For Lync Server 2010, cookie-based persistence means that multiple connections from a single client are always sent to one server to maintain session state. To configure cookie-based persistence, the load balancer must decrypt and re-encrypt SSL traffic. Therefore, any certificate assigned to the external Web service FQDN must also be assigned to the 4443 VIP of the hardware load balancer.
· For Internal Web Services VIPs, set source_addr persistence (internal ports 80 and 443) on the hardware load balancer. For Lync Server 2010, source_addr persistence means that multiple connections coming from a single IP address are always sent to one server to maintain session state.
· Use a TCP idle timeout of 1800 seconds.
· On the firewall between the reverse proxy and the next hop pool’s hardware load balancer, create a rule to allow HTTPS traffic on port 4443 from the reverse proxy to the hardware load balancer. The hardware load balancer must be configured to listen on ports 80, 443, and 4443.

Cookie persistence requires SSL termination on the load balancer. Otherwise, the load balancer cannot inspect HTTP traffic to insert cookies. To enable cookie-based persistence, you need to enable client_ssl and server_ssl profiles in addition to any existing profiles that are already enabled. Since client_ssl enables decryption of packets, the certificate assigned to the client_ssl profile must match the certificate requirements for the Lync FE pool or Standard Edition FE server. Server_ssl enables re-encryption of packets before routing them to the FE pool. (NOTE: Front End servers don’t accept unencrypted HTTP traffic.)

In addition to the client_ssl and server_ssl profiles, a OneConnect profile (or similar) must also be used to properly load balance requests for External Lync Web Services.  Using a OneConnect profile for HLB requests is explained in detail on the F5 support website here and is described below:

HTTP parsing without a OneConnect profile
If the virtual server does not reference a OneConnect profile, the BIG-IP system performs load balancing for each TCP connection. Once the TCP connection is load balanced, the system sends all requests that are part of the connection to the same pool member. For example, if the virtual server does not reference a OneConnect profile, and the BIG-IP system initially sends a client request to node A in pool A, the system inserts a cookie for node A. Then, within the same TCP connection, if the BIG-IP system receives a subsequent request that contains a cookie for node B in pool B, the system ignores the cookie information and incorrectly sends the request to node A instead.

HTTP parsing using a OneConnect profile
Using a OneConnect type of profile solves the problem. If the virtual server references a OneConnect profile, the BIG-IP system can perform load balancing for each request within the TCP connection. That is, when an HTTP client sends multiple requests within a single connection, the BIG-IP system is able to process each HTTP request individually. The BIG-IP system sends the HTTP requests to different destination servers if necessary.
For example, if the virtual server references a OneConnect profile and the client request is initially sent to node A in pool A, the BIG-IP system inserts a cookie for node A. Then, within the same TCP connection, if the BIG-IP system receives a subsequent request that contains a cookie for node B in pool B, the system uses that cookie information and correctly sends the request to node B.
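
The difference between the two behaviors can be shown with a toy model in Python. This is not F5 code and uses no BIG-IP API; it is a simplified sketch of the node A / node B example above, with a hypothetical function name, in which each request on a single TCP connection either carries a persistence cookie naming a node or carries none.

```python
def route_connection(request_cookies, default_node, per_request):
    """Return the node chosen for each request on one TCP connection.

    request_cookies: the persistence cookie carried by each request (None if absent).
    per_request=False models a load-balancing decision made once per TCP connection.
    per_request=True models a decision made for every request, as described for OneConnect.
    """
    chosen = []
    connection_node = None
    for cookie in request_cookies:
        if per_request or connection_node is None:
            # Honor the cookie if present, otherwise fall back to the balancer's pick.
            connection_node = cookie or default_node
        chosen.append(connection_node)
    return chosen


# First request has no cookie and lands on node A; the second carries a cookie for node B.
requests = [None, "B"]

print(route_connection(requests, default_node="A", per_request=False))  # ['A', 'A']
print(route_connection(requests, default_node="A", per_request=True))   # ['A', 'B']
```

In the first case the cookie for node B is ignored because the connection was already pinned to node A; in the second the cookie is honored.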

Note: The latest Lync Server 2010 configuration guide from F5 does not include the steps for configuring a OneConnect profile for load balancing requests for External Lync Web Services. You can find a copy of this configuration guide here.

Using this configuration, Application sharing will fail and you may see failures like these in the Edge event log:


Additionally, Edge A/V sessions will not work using the currently documented F5 Edge A/V settings:

We had to configure SNAT on the Edge A/V VIPs in order to get them functional.

One other note: If you are not using the reverse proxy role, the F5 documentation does not show the FE pool VIP needing port 80. If you do not allow port 80 (HTTP) to the FE pool VIP, then Lync phone devices will not be able to download their root certificates!!!

Reference:
http://technet.microsoft.com/en-us/library/gg398478.aspx
http://support.f5.com/kb/en-us/solutions/public/7000/200/sol7208.html
http://www.f5.com/pdf/deployment-guides/f5-lync-dg.pdf

UPDATE 3/15:  Document Version 1.9 has been updated to reflect corrections to the Edge Web Services VIPs. However, the Edge A/V documentation remains the same!!!







Wednesday, March 14, 2012

Lync Server 2010 - Device Update error codes


After each reboot, and occasionally afterward, a UC phone will attempt to contact the Device Update website to download any available firmware updates.
  • Phone POSTs manufacturer, model, and current firmware version.
  • Server checks the set of Device Update Rules for two things:
      a) That a rule exists for the specific manufacturer, model, and revision of phone
      b) Whether a more recent update file has been approved for the given device
  • Server will respond as appropriate:
      a) (0x0 / 200) - No update exists
      b) Internal / External URL for the firmware update (http for internal, https for external -- hardcoded!)
      c) (0x# / 0x#####) - Error
  • Device will stop any existing download of the Address Book and delete all of the ABS files (to clear disk space)
  • Device will GET firmware using either the internal or external URL provided by the Device Update server
  • Device will apply the image, then reboot after 5 minutes of inactivity (idle)
Check the System Information screen for key pieces of troubleshooting information:
Version - Current firmware version (e.g. 4.0.7576.0 = Lync RTM) and Bootloader version (1.23 or later required)
Last Update - Last time a firmware was downloaded/installed successfully
Last Update Request - Should be today, or moments ago if phone was rebooted
Last Update Status - Status code indicating success/failure (details below)

Format:  (0x#/0x#####)
Possible error codes highlighted in red are as follows:
0x0 = OK, e.g. 0x0/200 means the update request was sent successfully

0x2 = Certificate validation failed (e.g. on a SHIP device, when a test-signed or unsigned image, or an image signed with the wrong certificate, is downloaded, the image update fails because the image is not signed with the correct certificate. Images prior to RTM were signed with an MD5 certificate. RTM and post-RTM images are signed with a SHA1 certificate.)

0xd = Failure to write the image to NAND flash after download. This happens due to a multi-bit error in NAND flash. RTM software will not be able to update if this error occurs. However, this will be fixed in CU1: if a multi-bit error is found in the OS region during the image update, the image update software will handle the error properly and complete the update successfully.

0x7 = Failure to download image update files because the file system is corrupted (e.g. a multi-bit error in the file system) or full. In this case, the user should perform a hard reset, which reformats the file system and fixes the problem. After the hard reset, the image update will succeed.

0x5/401 = Authentication failure during the image update request

Error codes highlighted in yellow are HTTP related and can be found here:
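
When collecting Last Update Status values from many phones, the two-part format can be split and annotated with a short Python sketch. The meanings in the table are only the ones listed in this post, and treating the second field as an HTTP status code is an assumption based on the 0x0/200 and 0x5/401 examples; the function names are hypothetical.

```python
from http import HTTPStatus

# Meanings copied from the list above; anything else is reported as not listed.
UPDATE_CODES = {
    0x0: "OK",
    0x2: "certificate validation failed",
    0x7: "file system corrupted or full",
    0xD: "failure to write the image to NAND flash",
}


def parse_update_status(status: str):
    """Split a status such as '0x5/401' or '(0x0/200)' into two integers."""
    first, second = status.strip().strip("()").split("/")
    # Base 0 accepts both '0x...' hexadecimal and plain decimal text.
    return int(first, 0), int(second, 0)


def describe_update_status(status: str) -> str:
    """Return a one-line, human-readable description of a status value."""
    code, http_code = parse_update_status(status)
    meaning = UPDATE_CODES.get(code, "code not listed in this post")
    try:
        http_text = HTTPStatus(http_code).phrase
    except ValueError:
        http_text = "not a standard HTTP status"
    return f"{code:#x}: {meaning}; {http_code}: {http_text}"


print(describe_update_status("(0x0/200)"))  # 0x0: OK; 200: OK
print(describe_update_status("0x5/401"))    # 0x5: code not listed in this post; 401: Unauthorized
```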

Exchange 2010 UM implementation steps


1. Build Exchange 2010 SP2 UM server
2. Replace the self-signed cert with a cert that’s trusted by all Exchange servers. Cert CN = FQDN of the Exchange UM server; no SANs required. (Make sure any required intermediate and root certs are installed.)
3. Run the exchucutil.ps1 PowerShell script on the 2010 Exchange server.
4. Add the Exchange 2010 server to the existing Dial plan

NOTE: ocsumutil.exe is run once for each dial plan. In a migration scenario the dial plan already exists and there is no need to perform this step as part of the Exchange 2010 UM implementation

Reference:

 Here’s a screenshot for adding the 2010 UM server to the existing dial plan:


Procedure to create a cert request and not have pool names

1. Run "Request, Install or Assign Certificates".
2. Deselect Server default and Web services internal.
3. With only Web services external selected, click Request.
4. Finish the wizard to create the CSR.

Live meeting button integration behavior during MOC to Lync upgrade

The first screenshot shows Outlook with the Live Meeting add-in. Pressing the “Schedule a Live Meeting” button creates a Live Meeting appointment, as one would expect (red box).
We can also see that the meeting link uses the meet:// verb, meaning that the meeting is Live Meeting based (the OCS A/V server will host it).


After running a default install of Lync (the Lync installation program may prompt you to close Outlook), we can see that the Live Meeting button has been replaced in Outlook by the “Online Meeting” button (red box), and the meeting invite link is now https://...., indicating that this is now a Lync-based meeting. Should this user need to join a meeting that was created in OCS, the Live Meeting client will be used. We can see below that the Live Meeting client is still installed.




However, to use the Live Meeting add-in alongside Lync Online Meetings (to ease the coexistence phase), set the registry entry below (with a value of 0) on the client computer to restore functionality:


HKLM\Software\Microsoft\Communicator\W13AddinSwitch

Value = 0 : w13 has full UI.
Value = 1 : no w13 OCS functions. W13 Only shows Live meeting service functions.
Value = 2 : no w13 UI at all.

UI = user Interface
W13= OCS 2007 R2


Lync Vanity Simple URLs issue

Documentation for Lync Server 2010 states that simple URLs can be configured one of three ways:

Option #1 (Default):
https://meet.contoso.com
https://dialin.contoso.com
https://admin.contoso.com

Option #2:
https://lync.contoso.com/Meet
https://lync.contoso.com/Dialin
https://lync.contoso.com/Admin

Option #3:
https://lync.contoso.com/contosoSIPdomain/Meet 
https://lync.contoso.com/contosoSIPdomain/Dialin
https://lync.contoso.com/contosoSIPdomain/Admin 

My company decided to use the second choice, although formatted in a different way. https://ucmeet.contoso.com/Meet worked perfectly. However, the other two produced this error:
NOTE: If not patched to CU5, the error will say "The Meeting URL is invalid..."

Front-End server event logs showed this error:


I reran step 2 of the deployment wizard on the servers that service the simple URLs (Front Ends and Directors) and performed an iisreset.

Still the errors persisted.

https://129.x.x.x/Dialin and https://fepool.contoso.com/dialin both worked. Everything worked except the chosen vanity URLs, https://ucmeet.contoso.com/Dialin and https://ucmeet.contoso.com/Admin.

A support case was opened and 4.5 hours later (along with two support engineers - one being an IIS specialist) it was determined that this is a bug because the Lync setup configures the IIS rewrites which were not behaving as expected. The case is still open and has been sent to the Lync development team. I'll keep you posted on this!!!

UPDATE: The Microsoft development team has determined that there is an issue (not yet defined as a bug) when using meet in the vanity URL in which the rewrite rules do not know how to handle the request. If any word besides meet is used, then the vanity URL works perfectly. For example: 

https://join.contoso.com/Dialin or https://join.contoso.com/Admin works fine but
https://ucmeet.contoso.com/Dialin or https://ucmeet.contoso.com/Admin throws an error
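
Until the issue is fixed, a planned set of simple URLs can be screened for this pattern before the topology is published. The Python below is a rough sketch based only on the behavior reported above (a host name containing "meet" combined with a Dialin or Admin path); it is not a documented Lync rule, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def simple_url_at_risk(url: str) -> bool:
    """Flag a Dialin or Admin simple URL whose host name contains 'meet'."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.strip("/").lower()
    return "meet" in host and path in {"dialin", "admin"}


for url in (
    "https://join.contoso.com/Dialin",
    "https://ucmeet.contoso.com/Meet",
    "https://ucmeet.contoso.com/Dialin",
    "https://ucmeet.contoso.com/Admin",
):
    print(url, simple_url_at_risk(url))
```

Only the last two URLs are flagged, which matches what we saw in our environment.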





Lync Error Code -2146762487

I patched the Lync 2010 environment to the latest and greatest CU5 update yesterday and received this error on one of the Front-End pool servers:
These other services would not start either:
A quick Internet search did not return any relevant results concerning error code -2146762487. An examination of the Lync Server Event viewer logs showed these errors:
I then launched certificates.mmc and found that the GeoTrust intermediate certificate had vanished from the affected server. After importing the GeoTrust intermediate certificate into the Intermediate Certification Authorities store, all services started and all event log errors cleared.