I have some data showing how long it took someone to play N games of solitaire. For this particular set of data, the number of wins varies, but the number of losses is always 3. Here are a few results:

Code:
  #   Minutes   Wins   Games   Mins/Game
  5    82.42     20      23       3.58
  4    33.09      6       9       3.68
  3    35.07      8      11       3.19
  2    27.54      6       9       3.06
  1    16.61      3       6       2.77

Using only this data, I would like to calculate the average times for each winning game and each losing game. I believe that losing games take longer, because they require more machinations, but they could take less time, because winning games have to be played all the way to the end.

I started with the general formula for the game times:

Code:
  Tt = W*Tw + L*Tl    (1)

Where

Code:
  W  = The number of games won.             (From the data)
  L  = The number of games lost.            (From the data)
  Tt = The average time for a game.         (From the data)
  Tw = The average time for a winning game. (To be calculated)
  Tl = The average time for a losing game.  (To be calculated)

Solving (1) for Tw we get

Code:
  Tw = Tt/W - (L/W)*Tl

The constants Tt/W and L/W can be calculated for each data point and then averaged. For simplicity, let's replace them with A and B.

Code:
                                           (A)           (B)
  #   Minutes   Wins   Games   Mins/Game   Tt/W   Ave    L/W   Ave
  5    82.42     20      23       3.58     4.12   4.83   0.15  0.51
  4    33.09      6       9       3.68     5.52   5.01   0.50  0.59
  3    35.07      8      11       3.19     4.38   4.84   0.38  0.63
  2    27.54      6       9       3.06     4.59   5.06   0.50  0.75
  1    16.61      3       6       2.77     5.54   5.54   1.00  1.00

Now I get stuck. Is this approach workable? If so, what's the next step?
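As a quick sanity check on the rearrangement, here is a small sketch with made-up numbers (not the fitted values) showing that solving (1) for Tw recovers the value you started with:

```python
# Numerical sanity check of solving Tt = W*Tw + L*Tl for Tw.
# W, L, Tw, Tl below are arbitrary made-up values, not fitted ones.
W, L = 6, 3
Tw, Tl = 4.0, 2.0
Tt = W * Tw + L * Tl                  # 30.0
Tw_recovered = Tt / W - (L / W) * Tl
print(Tw_recovered)                   # 4.0, matching the Tw we put in
```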

Your initial formula is wrong. If Tt = average time for a game, Tw = average time for a win, Tl = average time for a loss, and W and L are the numbers of wins and losses, then Tw*W + Tl*L = total time played, not the average for a game, as your formula suggests.

But to answer your question: given the data you have provided, you cannot determine the average time for a win or a loss. Each line of your data can be resolved by a range of times for wins and losses. How do you then average them when you don't know where in that range the actual times are? For example, your first line could be the result of wins taking 1 minute and losses taking 20.81 minutes, or of losses taking 1 minute and wins taking 3.97 minutes. Your second line could be wins taking 1 minute and losses 9.03, or losses taking 1 minute and wins taking 5.02 minutes.

Because there is no strict relationship between each set of data, it is not possible to arrive at an average for a win vs. a loss. You could of course come up with a range of answers, simply by summing the total time (194.73), total wins (43), and total losses (15) and plotting the line of

43x + 15y = 194.73

You could also plot the lines of each individual data set in a similar fashion, and see where they approximately congregate. But extracting the actual averages from your dataset is not possible, I don't think, unless you make some further assumptions.
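The range argument can be illustrated in a few lines of Python (using the rows from the table above):

```python
# Each data row only pins (win_time, loss_time) to a line:
# W * win_time + L * loss_time = total_minutes
rows = [(20, 3, 82.42), (6, 3, 33.09), (8, 3, 35.07), (6, 3, 27.54), (3, 3, 16.61)]

# If wins averaged exactly 1 minute in the first row, losses would have to average:
W, L, total = rows[0]
loss_time_if_win_is_1 = (total - W * 1.0) / L
print(round(loss_time_if_win_is_1, 2))   # 20.81, as in the example above

# The aggregate constraint, 43x + 15y = 194.73:
total_wins = sum(r[0] for r in rows)
total_losses = sum(r[1] for r in rows)
total_minutes = round(sum(r[2] for r in rows), 2)
print(total_wins, total_losses, total_minutes)   # 43 15 194.73
```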

Suppose I had timed each individual game, win or lose, in this data set. I could then easily calculate the average times for wins and for losses. So it seems to me that there is an average time for wins and an average time for losses. My approach may be off, but it seems to me that there ought to be a way to estimate those two values from the data provided. Your plotting suggestion may point the way. Looking at the first two results, we get a system of 2 equations in 2 unknowns:

Code:
  3x + 3y = 16.61    (1)
  6x + 3y = 27.54    (2)

Subtracting (1) from (2) we get

Code:
  3x = 10.93
   x = 3.6433

Substituting back into either (1) or (2), we get

Code:
  y = (16.61 - 3x)/3 = 1.8933
  y = (27.54 - 6x)/3 = 1.8933

Based on just 2 results, the estimated times are 3.6433 minutes for a win and 1.8933 minutes for a loss. I could repeat this for results 2 & 3, 3 & 4, etc. I could write a program to calculate all n*(n-1)/2 pairings and average those values. But there must be a better way.
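The pairing idea can be sketched directly in Python. One wrinkle worth noting: results 2 and 4 both have 6 wins and 3 losses, so that pair has no unique solution and must be skipped.

```python
from itertools import combinations

# (wins, losses, total_minutes) for results 1 through 5
rows = [(3, 3, 16.61), (6, 3, 27.54), (8, 3, 35.07), (6, 3, 33.09), (20, 3, 82.42)]

solutions = []
for (w1, l1, t1), (w2, l2, t2) in combinations(rows, 2):
    det = w1 * l2 - w2 * l1
    if det == 0:
        continue  # results 2 and 4 are both (6, 3): the pair is singular
    x = (t1 * l2 - t2 * l1) / det   # estimated minutes per win
    y = (w1 * t2 - w2 * t1) / det   # estimated minutes per loss
    solutions.append((x, y))

print(len(solutions))   # 9 of the 10 pairs are solvable
avg_x = sum(s[0] for s in solutions) / len(solutions)
avg_y = sum(s[1] for s in solutions) / len(solutions)
print(avg_x, avg_y)
```

The first pair reproduces the 3.6433 / 1.8933 worked out above; the averages over all solvable pairs come out somewhat different from the regression answer below, since each pair is weighted equally regardless of how much data backs it.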

Tt is the total time for the play given W wins and L losses. 166973/43800 ≈ 3.81217 is the expected minutes per win and 269867/131400 ≈ 2.05378 is the expected minutes per loss.

Code:
  Tt = (166973/43800) W + (269867/131400) L
     = 3.81217 W + 2.05378 L
     = 3.81217 W + 6.16135        (with L = 3)

Which you can compute exactly in Mathematica as

Code:
  PseudoInverse[{{20, 3}, {6, 3}, {8, 3}, {6, 3}, {3, 3}}] . {{82.42}, {33.09}, {35.07}, {27.54}, {16.61}}

Or approximately in Matlab as:

Code:
  [ 20, 3 ; 6, 3 ; 8, 3 ; 6, 3 ; 3, 3 ] \ [ 82.42 ; 33.09 ; 35.07 ; 27.54 ; 16.61 ]

These are linear models in the sense that they are strictly proportional to their inputs, so they naturally predict 0 total minutes for 0 wins and 0 losses.

https://en.wikipedia.org/wiki/Moore–Penrose_pseudoinverse
http://mathworld.wolfram.com/Moore-PenroseMatrixInverse.html

We can (as it happens) get the exact same result longhand via simple least squares linear regression and the simplifying choice of allocating the constant term B to the three constant losses per trial. This is an affine model where X is the number of wins, Y is the total minutes, and B is the number of total minutes with zero wins, which we attribute to having 3 losses.

Code:
  N     = 5
  SumX  = 43
  SumY  = 194.73
  SumXX = 545
  SumXY = 2342.57
  SumYY = 10152.2531

  M = ( N × SumXY − SumX × SumY ) / ( N × SumXX − SumX × SumX )
    = ( 5 × 2342.57 − 43 × 194.73 ) / ( 5 × 545 − 43 × 43 )
    = 166973/43800

  B = ( SumY − M × SumX ) / N
    = ( 194.73 − (166973/43800) × 43 ) / 5
    = 269867/43800

  B/3 = 269867/131400

https://en.wikipedia.org/wiki/Simple_linear_regression
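The longhand sums above can be checked with a few lines of Python (a sketch; the data are the five rows from the thread):

```python
# Verify the longhand simple-regression sums and the resulting M and B/3.
X = [20, 6, 8, 6, 3]                       # wins per batch of games
Y = [82.42, 33.09, 35.07, 27.54, 16.61]    # total minutes per batch
N = len(X)
SumX, SumY = sum(X), sum(Y)
SumXX = sum(x * x for x in X)
SumXY = sum(x * y for x, y in zip(X, Y))
M = (N * SumXY - SumX * SumY) / (N * SumXX - SumX * SumX)
B = (SumY - M * SumX) / N
print(SumX, SumXX)                    # 43 545
print(round(M, 5), round(B / 3, 5))   # 3.81217 2.05378
```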

I played over 4000 games of the Brainier iPad solitaire perfectly (with unlimited undo) and won over 66% of them. After playing about 2000 of those games, I became able to easily predict whether a game was winnable (with unlimited undo) after playing about half of the moves of a maximally challenging game, by inspecting whether a column of cards could be moved. And my final average play time (win / lose) was just about 4 min 12 sec.

Wow. Very impressive. I don't have Mathematica or Matlab, even though I have wanted one of them for a long time. Is there a similar linear model available in Excel?

I have developed a Solitaire strategy that gets at least 70% wins with just a single undo and unlimited passes through the deck, one card at a time. It's a bit of a complicated strategy that requires keeping track of all of the visible cards, and I make some mistakes. I think a computer simulation could get close to 75% or possibly even more. With unlimited undos, I would think a much higher percentage should be possible, maybe 80%?

Excel can do the simple linear regression (see, e.g., the LINEST worksheet function). Python and NumPy came with my Mac and are easy to get on many systems for free. And while it works with approximate math like Matlab, in this case it can recover the exact ratios the same as Mathematica.

Code:
  >>> import numpy as np
  >>> a = np.array([ [ 20, 3 ], [ 6, 3 ], [ 8, 3 ], [ 6, 3 ], [ 3, 3 ] ])
  >>> b = np.array([ [ 82.42 ], [ 33.09 ], [ 35.07 ], [ 27.54 ], [ 16.61 ] ])
  >>> x, residuals, rank, s = np.linalg.lstsq(a, b)
  >>> x.flatten().tolist()
  [3.8121689497716913, 2.0537823439878222]
  >>> from fractions import Fraction
  >>> print Fraction.from_float(x.flatten().tolist()[0]).limit_denominator()
  166973/43800
  >>> print Fraction.from_float(x.flatten().tolist()[1]).limit_denominator()
  269867/131400

The actual percentage depends heavily on how statistically "random" the cards are shuffled when you begin play. Some earlier versions of games I tried by other vendors were not shuffled randomly, and instead repeated games that were selected in order to make the results appear random when they really weren't. I'm pretty certain the game I selected to play (for, like, a solid month out of my life) was really statistically random, and that the play was not padded with known repetitive variations of unwinnable games simply to keep the percentage of wins within expected or prespecified bounds. 66 ± 3 % is my final answer, arrived at empirically. Math people elsewhere have told me that an exact solution to the problem by means of calculation is close to impossible, because the necessary factorial calculations on the order of 52 cards result, for all intents and purposes, in an effectively infinite number of possible games.

Python, take two

Code:
  import numpy as np
  from fractions import Fraction

  original_data = np.array([[ 5, 82.42, 20, 23, 3.58 ],
                            [ 4, 33.09,  6,  9, 3.68 ],
                            [ 3, 35.07,  8, 11, 3.19 ],
                            [ 2, 27.54,  6,  9, 3.06 ],
                            [ 1, 16.61,  3,  6, 2.77 ]])

  columnOfTotalTime = original_data[:,1:2]
  columnOfWins      = original_data[:,2:3]
  columnOfLosses    = original_data[:,3:4] - columnOfWins
  WLcolumns = np.concatenate((columnOfWins, columnOfLosses), axis=1)

  modelCoefficients, ss_residuals, rank, s = np.linalg.lstsq(WLcolumns, columnOfTotalTime)

  print 'Model Coefficients:'
  print modelCoefficients[0,0], u'\u2248', Fraction.from_float(modelCoefficients[0,0]).limit_denominator()
  print modelCoefficients[1,0], u'\u2248', Fraction.from_float(modelCoefficients[1,0]).limit_denominator()

  # Alternate way to get ss_residuals
  predictions = modelCoefficients[0,0] * columnOfWins + modelCoefficients[1,0] * columnOfLosses
  residuals = columnOfTotalTime - predictions
  print ''
  print 'Sum of Squares of Residuals from Predictions:'
  print ss_residuals[0], "=", np.linalg.norm(residuals)**2

  # Alternate way to get modelCoefficients, via singular value decomposition.
  svd_padding = WLcolumns.shape[0] - WLcolumns.shape[1]
  svd_u, svd_s, svd_v = np.linalg.svd(WLcolumns)
  svd_padded_s = np.pad(np.diag(svd_s), ((0,svd_padding),(0,0)), 'constant')
  svd_inv_v = np.linalg.inv(svd_v)
  svd_inv_s = np.linalg.inv(np.diag(svd_s))
  svd_inv_u = np.linalg.inv(svd_u)
  svd_wlc = svd_u.dot(svd_padded_s).dot(svd_v)
  svd_wlc_pseudoinverse = np.pad(svd_inv_v.dot(svd_inv_s), ((0,0),(0,svd_padding)), 'constant').dot(svd_inv_u)
  svd_mc = svd_wlc_pseudoinverse.dot(columnOfTotalTime)

  # The differences reflect tiny differences in how approximate floating point math is done.
  print ''
  print 'Differences between methods magnified by 2.25e15:'
  print ( ss_residuals[0] - np.linalg.norm(residuals)**2 ) * 2**51
  print ( svd_s - s ) * 2**51
  print ( svd_wlc - WLcolumns ) * 2**51
  print ( svd_mc - modelCoefficients ) * 2**51

Which produces:

Code:
  Model Coefficients:
  3.81216894977 ≈ 166973/43800
  2.05378234399 ≈ 269867/131400

  Sum of Squares of Residuals from Predictions:
  22.1813757991 = 22.1813757991

  Differences between methods magnified by 2.25e15:
  16.0
  [ 0.  0.]
  [[-8. -2.]
   [-2.  1.]
   [-2.  0.]
   [-2.  0.]
   [ 0.  1.]]
  [[ 0.]
   [-2.]]

Based on this linear model, and assuming the only errors are the ±0.005 measurement error, we can be 95% sure that the "true" model for this data lies in a narrow ellipse.

Code:
  >>> errors = np.concatenate([ svd_wlc_pseudoinverse.dot(np.random.random((5,1)) * 0.01 - 0.005).transpose() for k in range(100000) ])
  >>> cov = np.cov(errors, rowvar=0)
  >>> eigenvalues, eigenvectors = np.linalg.eig(np.cov(errors, rowvar=0))
  >>> print cov
  [[  4.75699801e-08  -1.36816119e-07]
   [ -1.36816119e-07   5.78973220e-07]]
  >>> print eigenvalues
  [  1.44137722e-08   6.12129428e-07]
  >>> print eigenvectors
  [[-0.97186854  0.23552397]
   [-0.23552397 -0.97186854]]

http://www.wolframalpha.com/input/?...(y - 269867/131400))^2/6.12129428e-07 = 5.991

One such point inside this ellipse is x = 61/16, y = 39/19, which gives the model (61/16) W + (39/19) L, which differs from the prediction of the best model by no more than 0.004 minutes.
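That closing claim is easy to double-check directly (a sketch; the fractions are the ones quoted above, and the comparison is over the five observed (W, L) rows):

```python
# Compare the least-squares model against the nearby rational model
# (61/16) W + (39/19) L over the five observed (wins, losses) rows.
best_w, best_l = 166973 / 43800, 269867 / 131400
alt_w, alt_l = 61 / 16, 39 / 19
rows = [(20, 3), (6, 3), (8, 3), (6, 3), (3, 3)]
max_diff = max(abs((best_w - alt_w) * w + (best_l - alt_l) * l) for w, l in rows)
print(max_diff < 0.004)   # True
```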

It should be possible to write a computer program to play most (perhaps all) solitaire games. With such a program, you could play millions of games and estimate how long it would take a human to play. Without the undo option it should be an easy job.