U4-7869 - Examine & SQL Azure - Timeouts

Created by Jeavon Leopold 01 Feb 2016, 09:59:30 Updated by Jeavon Leopold 03 Feb 2016, 12:46:19

Relates to: U4-7887

Relates to: U4-623

Relates to: U4-2491

Relates to: U4-6480

Relates to: U4-7486

We have noticed that sites running on Azure have quite a few log entries about media missing from the Examine index and therefore falling back, e.g.

2016-02-01 09:32:11,534 [P2020/D2/T95] WARN Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedMediaCache - Could not retrieve media 10477 from Examine index, reverting to looking up media via legacy library.GetMedia method

Attempting to rebuild the index we frequently see these errors:

2016-02-01 09:42:02,721 [P2020/D2/T76] ERROR Umbraco.Web.WebServices.ExamineManagementApiController - An error occurred rebuilding index
System.Data.SqlClient.SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. This failure occurred while attempting to connect to the routing destination. The duration spent while attempting to connect to the original server was - [Pre-Login] initialization=105; handshake=31; [Login] initialization=0; authentication=0; [Post-Login] complete=1; ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at StackExchange.Profiling.Data.ProfiledDbCommand.ExecuteDbDataReader(CommandBehavior behavior) in c:\Code\github\SamSaffron\MiniProfiler\StackExchange.Profiling\Data\ProfiledDbCommand.cs:line 235
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader()
   at Umbraco.Core.Persistence.Database.<Query>d__7`1.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Umbraco.Core.Persistence.Database.Page[T](Int64 page, Int64 itemsPerPage, String sql, Object[] args)
   at Umbraco.Core.Persistence.Database.Page[T](Int64 page, Int64 itemsPerPage, Sql sql)
   at Umbraco.Core.Persistence.Repositories.VersionableRepositoryBase`2.GetPagedResultsByQuery[TDto,TContentBase](IQuery`1 query, Int64 pageIndex, Int32 pageSize, Int64& totalRecords, Tuple`2 nodeIdSelect, Func`2 processQuery, String orderBy, Direction orderDirection, Func`1 defaultFilter)
   at Umbraco.Core.Persistence.Repositories.ContentRepository.GetPagedResultsByQuery(IQuery`1 query, Int64 pageIndex, Int32 pageSize, Int64& totalRecords, String orderBy, Direction orderDirection, String filter)
   at Umbraco.Core.Services.ContentService.GetPagedDescendants(Int32 id, Int64 pageIndex, Int32 pageSize, Int64& totalChildren, String orderBy, Direction orderDirection, String filter)
   at UmbracoExamine.UmbracoContentIndexer.PerformIndexAll(String type)
   at UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild()
   at Umbraco.Web.WebServices.ExamineManagementApiController.PostRebuildIndex(String indexerName)
ClientConnectionId:xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
Error Number:-2,State:0,Class:11
ClientConnectionId before routing:xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
Routing Destination:xxxxxxxxx.westeurope1-a.worker.database.windows.net,11187

I'm wondering if Examine is not utilizing the Azure Transient Fault Handling stuff built into Umbraco...?

Any other ideas?

Comments

Shannon Deminick 02 Feb 2016, 11:10:14

The lookup starts with ContentService.GetPagedDescendants; all queries at the Service (repository) layer use PetaPoco, and we definitely have Azure Transient Fault Handling applied there. The page size it queries with is 1000, which shouldn't cause a timeout. How many content/media nodes do you have? Does this error occur every time you try to rebuild? Are your database server and website in the same Azure region?
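
For reference, the rebuild's paging loop looks roughly like the sketch below. This is inferred from the stack trace above, not the actual UmbracoContentIndexer source; the root id, orderBy, orderDirection and filter arguments are assumed values for illustration.

// Sketch only: page through descendants in chunks of 1000, as the indexer does.
using Umbraco.Core.Models;
using Umbraco.Core.Persistence.DatabaseModelDefinitions;
using Umbraco.Core.Services;

public class RebuildSketch
{
    public void IndexAllContent(IContentService contentService)
    {
        const int pageSize = 1000;   // the page size mentioned above
        long pageIndex = 0;
        long total = long.MaxValue;  // replaced with the real count on the first call

        while (pageIndex * pageSize < total)
        {
            // Signature taken from the stack trace:
            // GetPagedDescendants(int id, long pageIndex, int pageSize, out long totalChildren, ...)
            var page = contentService.GetPagedDescendants(
                -1, pageIndex, pageSize, out total,
                "Path", Direction.Ascending, string.Empty);

            foreach (IContent content in page)
            {
                // each node would be written into the Lucene index here (omitted)
            }

            pageIndex++;
        }
    }
}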


Jeavon Leopold 02 Feb 2016, 11:32:52

That's interesting, as I've noticed that there are only ever 1000 or 2000 nodes (almost always 2000) in the Examine index after I attempt a rebuild. There are 10,500 content/media nodes in total. Yes, the exception occurs every time I attempt a rebuild, and yes, the website and SQL server are in the same Azure region.


Jeavon Leopold 02 Feb 2016, 12:11:07

We have Connection Timeout=300 in our DB connection string, but it seems to me that somewhere this is being ignored and a 15 or maybe 30 second timeout is being used instead.


Shannon Deminick 02 Feb 2016, 12:32:19

I can have a look later to double/triple check that the query executed from GetPagedDescendants uses the fault handling retry logic. Otherwise, if you've got time, you can trace the Umbraco code to see whether the query in this stack trace does in fact go through it:

2016-02-01 09:42:02,721 [P2020/D2/T76] ERROR Umbraco.Web.WebServices.ExamineManagementApiController - An error occurred rebuilding index
System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.  This failure occurred while attempting to connect to the routing destination. The duration spent while attempting to connect to the original server was - [Pre-Login] initialization=105; handshake=31; [Login] initialization=0; authentication=0; [Post-Login] complete=1;   ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at StackExchange.Profiling.Data.ProfiledDbCommand.ExecuteDbDataReader(CommandBehavior behavior) in c:\Code\github\SamSaffron\MiniProfiler\StackExchange.Profiling\Data\ProfiledDbCommand.cs:line 235
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader()
   at Umbraco.Core.Persistence.Database.<Query>d__7`1.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Umbraco.Core.Persistence.Database.Page[T](Int64 page, Int64 itemsPerPage, String sql, Object[] args)
   at Umbraco.Core.Persistence.Database.Page[T](Int64 page, Int64 itemsPerPage, Sql sql)


Jeavon Leopold 02 Feb 2016, 12:42:42

I was just getting into that. I noticed that this exception's stack contains no FaultHandling.RetryPolicy, but I have found other logged SQL exceptions that do contain it, such as the one below:

2016-02-02 12:05:42,509 [P2020/D3/T221] ERROR Umbraco.Core.Persistence.UmbracoDatabase - Database exception occurred
System.Data.SqlClient.SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. This failure occurred while attempting to connect to the routing destination. The duration spent while attempting to connect to the original server was - [Pre-Login] initialization=9; handshake=39; [Login] initialization=0; authentication=0; [Post-Login] complete=2; ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async, Int32 timeout, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
   at StackExchange.Profiling.Data.ProfiledDbCommand.ExecuteNonQuery() in c:\Code\github\SamSaffron\MiniProfiler\StackExchange.Profiling\Data\ProfiledDbCommand.cs:line 266
   at Umbraco.Core.Persistence.PetaPocoCommandExtensions.<>c__DisplayClass1.b__0()
   at Umbraco.Core.Persistence.FaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func)
   at Umbraco.Core.Persistence.Database.Execute(String sql, Object[] args)
ClientConnectionId:xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
Error Number:-2,State:0,Class:11
ClientConnectionId before routing:xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
Routing Destination:xxxxx.tr3.westeurope1-a.worker.database.windows.net,11187

I've also been discussing this with Azure Support, and their (not very helpful) suggestion is to change the retry strategy settings, such as the retry interval and count, as they assume we are using the Enterprise Library. I guess there might be some similar values in the Umbraco handling...?


Jeavon Leopold 02 Feb 2016, 13:26:05

@Shandem I'm still stepping through it all, but I think it could be this one: https://github.com/umbraco/Umbraco-CMS/blob/30f0c96052864262577c024db15b650ebbfdfab7/src/Umbraco.Core/Persistence/PetaPoco.cs#L837 which should call ExecuteReaderWithRetry() instead of ExecuteReader().
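
For anyone following along, this is roughly what the WithRetry extensions do. A minimal self-contained sketch (not Umbraco's actual FaultHandling code), retrying on the SQL timeout error number -2 seen in the logs above:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

public static class RetryReaderExtensions
{
    // Illustrative only: retry ExecuteReader a few times with exponential back-off
    // when the SQL client reports a timeout (SqlException.Number == -2, as logged above).
    public static IDataReader ExecuteReaderWithRetrySketch(this IDbCommand cmd, int maxRetries = 4)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return cmd.ExecuteReader();
            }
            catch (SqlException ex) when (ex.Number == -2 && attempt < maxRetries)
            {
                // back off 1s, 2s, 4s, 8s before the next attempt
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}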


Morten Christensen 02 Feb 2016, 14:08:59

Looks like an oversight that this particular Query method does not use the ExecuteReaderWithRetry extension method. All methods that do the actual reading and writing within PetaPoco should use the "WithRetry" variants.


Shannon Deminick 02 Feb 2016, 14:34:39

Updated in rev: 0de3be42cf64a472433177e1ebc51ddaeb46b44b

@crumpled_jeavon you are a legend mate! This was quite an oversight, and fixing it will solve a lot of mysterious issues in this area.


Jeavon Leopold 02 Feb 2016, 14:48:03

Wow thanks @Shandem


Jeavon Leopold 02 Feb 2016, 19:42:36

@Shandem @sitereactor Ok, so changing to ExecuteReaderWithRetry has certainly helped, however it did not solve all of my SQL Azure timeouts. I've continued to delve into PetaPoco today and discovered that, regardless of what I put in the connection string, PetaPoco will time out if a query takes longer than 30 seconds. The Examine indexer queries nodes in chunks of 1000; I tried executing those queries in SQL Server Management Studio and the first two came in under 30 seconds, but the third took longer, which is why I'm only seeing 2000 nodes in Examine after a rebuild.

I have managed to fix it with a hack! In PetaPoco.cs there is a CommandTimeout property; if I hard-code its getter to 300 seconds, everything works and there are no more timeouts at all:

public int CommandTimeout { get { return 300; } set { } }

I've been trying to find out how to set this properly but thus far haven't worked it out, perhaps one of you knows...? At the moment I am overjoyed to have full Examine indexes again :)
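
The best I've come up with so far is something like the sketch below, which sets the property on the current database instance rather than editing PetaPoco.cs. Untested and full of assumptions: the appSetting key is made up, and the UmbracoDatabase instance is scoped per request, so this would have to run in the same request that does the heavy work.

using System.Configuration;
using Umbraco.Core;

public static class CommandTimeoutHelper
{
    // Sketch only: raise the command timeout on the current UmbracoDatabase instance.
    public static void Apply()
    {
        // "ExampleSqlCommandTimeout" is a hypothetical appSetting key, for illustration only.
        var value = ConfigurationManager.AppSettings["ExampleSqlCommandTimeout"];
        var seconds = string.IsNullOrEmpty(value) ? 300 : int.Parse(value);

        // CommandTimeout comes from the PetaPoco Database base class (see the property above).
        ApplicationContext.Current.DatabaseContext.Database.CommandTimeout = seconds;
    }
}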


Jeavon Leopold 02 Feb 2016, 20:13:02

This also illustrates the low performance of S0 SQL Azure compared to standard SQL Server. I will try executing the same queries with more DTUs and see what the execution time is like.


Carl Bussema 03 Feb 2016, 01:24:26

FYI, the connection string timeout only applies to making the initial connection; after that you are stuck with the default command timeout of 30 seconds unless you, or the library authors, explicitly change it. There are other issues that this timeout impacts (e.g. Courier), and it sucks that there's no way to just raise it in a config file. The whole of the Azure S tier for SQL can't use parallelism, which means some queries that work perfectly fine on a local SQL Server instance perform terribly when you run them in the cloud.
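
To illustrate the distinction with plain ADO.NET (server, database and table are placeholders): "Connection Timeout" in the connection string only governs opening the connection, while each command carries its own CommandTimeout that defaults to 30 seconds unless set explicitly.

using System.Data.SqlClient;

public static class TimeoutDemo
{
    public static int CountNodes()
    {
        // Placeholder connection string: "Connection Timeout" = time allowed to *open* the connection.
        const string cs = "Server=tcp:example.database.windows.net;Database=umbraco;User ID=user;Password=pass;Connection Timeout=300;";

        using (var conn = new SqlConnection(cs))
        using (var cmd = new SqlCommand("SELECT COUNT(*) FROM umbracoNode", conn))
        {
            // Without this line the query would still abort after the default 30 seconds,
            // regardless of the Connection Timeout value above.
            cmd.CommandTimeout = 300;
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }
}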


Jeavon Leopold 03 Feb 2016, 10:41:58

Ok, so here are my benchmarks running the same query against the different SQL Azure tiers:

S0 - 37 seconds - 100% DTU utilization
S1 - 19 seconds - 80% DTU utilization
S2 - 6 seconds - 33% DTU utilization
S3 - 5 seconds - 30% DTU utilization
P1 - 4 seconds - 29% DTU utilization

As a comparison, I fired up a new Azure Virtual Machine (DS2 with P20 storage) running SQL Server Web Edition, imported the database and executed the same query; the result was that the query took 0 seconds and used 0.3% CPU.

This is just one longish-running query from Umbraco; I have come across others that take a genuinely long time, such as running the XML content cache repair.


Shannon Deminick 03 Feb 2016, 10:55:57

Thanks @crumpled_jeavon, this is certainly useful information. We'll be implementing some SQL monitoring for some sites on UaaS over a period of time to see if we can find some quick wins. Hopefully we'll have something going in a week or so. Of course in the meantime, I'm aware that there are probably some really good ways we can optimize some of these queries; we just need to take the time to analyze what they are.


Jeavon Leopold 03 Feb 2016, 11:00:52

@Shandem using UaaS sounds like a great way to gather intel on which queries might be optimizable. In the meantime what do you think about fixing the CommandTimeout in PetaPoco to 300 seconds?


Shannon Deminick 03 Feb 2016, 11:02:46

Yeah, we can look at making that configurable somehow. Can you open a feature request for that and link it here?


Jeavon Leopold 03 Feb 2016, 11:07:29

Absolutely, what type of SQL Azure does UaaS provision (is it configurable)?


Morten Christensen 03 Feb 2016, 12:04:40

@crumpled_jeavon we use elastic pools so it's a lot more flexible. Once we have some more intel we'll be better able to provide guidance on which SQL Azure tier to use, and as Shannon mentioned we'll add some query monitoring, which can be used to improve the Core.


Jeavon Leopold 03 Feb 2016, 12:27:17

@sitereactor Elastic pools are a great thing for UaaS for sure! My feeling is that S2 should probably be the recommended SQL tier for normal Umbraco, but with U4-7887 it should work fine with all tiers again :)


Jeavon Leopold 03 Feb 2016, 12:46:19

I've added a feature request for making the CommandTimeout configurable here http://issues.umbraco.org/issue/U4-7887


Priority: Normal

Type: Bug

State: Fixed

Assignee:

Difficulty: Normal

Category:

Backwards Compatible: True

Fix Submitted:

Affected versions: 7.3.4

Due in version: 7.3.7

Sprint:

Story Points:

Cycle: