Friday, February 24, 2017

Python Optimization Modeling: optlang

An example with the transportation model from (1), taken from the optlang docs:

from optlang import Variable, Constraint, Objective, Model

# Define problem parameters
# Note this can be done using any of Python's data types. Here we have chosen dictionaries
supply = {"Seattle": 350, "San_Diego": 600}
demand = {"New_York": 325, "Chicago": 300, "Topeka": 275}

distances = {  # Distances between locations in thousands of miles
    "Seattle": {"New_York": 2.5, "Chicago": 1.7, "Topeka": 1.8},
    "San_Diego": {"New_York": 2.5, "Chicago": 1.8, "Topeka": 1.4}
}

freight_cost = 9  # Cost per case per thousand miles

# Define variables
variables = {}
for origin in supply:
    variables[origin] = {}
    for destination in demand:
        # Construct a variable with a name, bounds and type
        var = Variable(name="{}_to_{}".format(origin, destination), lb=0, type="integer")
        variables[origin][destination] = var

# Define constraints
constraints = []
for origin in supply:
    const = Constraint(
        sum(variables[origin].values()),
        ub=supply[origin],
        name="{}_supply".format(origin)
    )
    constraints.append(const)
for destination in demand:
    const = Constraint(
        sum(row[destination] for row in variables.values()),
        lb=demand[destination],
        name="{}_demand".format(destination)
    )
    constraints.append(const)

# Define the objective
obj = Objective(
    sum(freight_cost * distances[ori][dest] * variables[ori][dest] for ori in supply for dest in demand),
    direction="min"
)
# We can print the objective and constraints
print(obj)
print("")
for const in constraints:
    print(const)

print("")

# Put everything together in a Model
model = Model()
model.add(constraints)  # Variables are added implicitly
model.objective = obj

# Optimize and print the solution
status = model.optimize()
print("Status:", status)
print("Objective value:", model.objective.value)
print("")
for var in model.variables:
    print(var.name, ":", var.primal)
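
The generic optlang classes above pick whichever solver backend happens to be installed. optlang also exposes solver-specific interfaces; here is a minimal, hypothetical sketch (my addition, not from the original post) that pins the backend to GLPK, assuming optlang's GLPK support (via swiglpk) is installed:

# Hypothetical snippet: force the GLPK backend instead of the generic classes.
# This import fails if the GLPK backend (swiglpk) is not installed.
from optlang.glpk_interface import Model, Variable, Constraint, Objective

x = Variable("x", lb=0)
tiny = Model(name="tiny")
tiny.add(Constraint(x, ub=10))
tiny.objective = Objective(x, direction="max")
print(tiny.optimize(), x.primal)  # expected: optimal 10.0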

Some other Python-based options are PuLP, Pyomo, and PyMathProg (2).

References
  1. G.B.Dantzig, Linear Programming and Extensions, Princeton University Press, 1963
  2. PyMathProg: another Python based modeling tool: http://yetanothermathprogrammingconsultant.blogspot.com/2016/12/pymathprog-another-python-based.html 

Tuesday, February 21, 2017

Multiplication of binary with continuous variable between zero and one

A small tidbit. The standard linearization for the multiplication of two binary variables \(z=x \cdot y\) with \(x,y \in \{0,1\}\) is:
\[\boxed{\begin{align}&z\le x\\&z \le y\\& z\ge x+y-1\\&z \in \{0,1\}\end{align}}\]

What if \(y\) is continuous between zero and one, \(y\in[0,1]\) (while \(x\) remains a binary variable)? In that case we can use the formulation from (1), where we consider a continuous variable \(y \in [y^{lo},y^{up}]\):

\[\boxed{\begin{align}&\min\{0,y^{lo}\} \le z \le \max\{0,y^{up}\} \\
&y^{lo} \cdot x \le z \le y^{up} \cdot x \\
&y - y^{up} \cdot (1 - x) \le z \le y - y^{lo} \cdot (1 - x)\end{align}}\]

When we plug in \(y^{lo}=0\) and \(y^{up}=1\), we end up with the same thing:

\[\boxed{\begin{align}&z\le x\\&z \le y\\& z\ge x+y-1\\&z \in [0,1]\end{align}}\]

We can easily verify the correctness by checking what happens when \(x=0\) or \(x=1\):

\[\begin{align}
    &x=0 & \implies & \begin{cases}
                              z\le 0\\
                              z \le y &\text{(NB)}\\
                              z \ge y-1 &\text{(NB)} \\
                              z \in [0,1]
                             \end{cases} & \implies  & z=0 \\
    &x=1 & \implies & \begin{cases}
                              z\le 1 & \text{(NB)}\\
                              z \le y \\
                              z \ge y \\
                              z \in [0,1]
                             \end{cases} & \implies  & z=y
\end{align}\]

where \(\text{NB}\) means non-binding by which I mean we can ignore this constraint.
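
As a small extra check (my addition, not part of the original derivation), a brute-force scan in Python confirms that these four constraints pin \(z\) to \(x\cdot y\):

# For binary x and a grid of y in [0,1], the tightest bounds implied by
# z <= x, z <= y, z >= x+y-1, z in [0,1] collapse to the single point z = x*y.
for x in (0, 1):
    for i in range(11):
        y = i / 10.0
        zlo = max(0.0, x + y - 1.0)   # tightest lower bound on z
        zup = min(1.0, float(x), y)   # tightest upper bound on z
        assert abs(zlo - x * y) < 1e-12 and abs(zup - x * y) < 1e-12
print("all cases: z is forced to x*y")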

References
  1. Multiplication of a continuous and a binary variable, http://yetanothermathprogrammingconsultant.blogspot.com/2008/05/multiplication-of-continuous-and-binary.html

Saturday, February 18, 2017

Solving Kakuro puzzles as a MIP

The Kakuro puzzle is another logical puzzle similar to Sudoku, KenKen and Takuzu. A small example from (1) is:

[image: small example Kakuro puzzle from (1)]

The rules are as follows:

  1. Each (white) cell must be filled with a value \(1,\dots,9\).
  2. Each (white) cell belongs to a horizontal and/or a vertical block of contiguous cells.
  3. The values of the cells in each block add up to a given constant. These constants are to the left (or top) of the block. A notation a\b is used: a is the sum for the block below and b is the sum for the block to the right.
  4. The values in each block must be unique (i.e. no duplicates).
Model

The actual model is rather simple. Similar to (2) and (3) I define a binary decision variable:

\[x_{i,j,k} = \begin{cases}1 & \text{if cell $(i,j)$ has the value $k$} \\
     0& \text{otherwise}\end{cases}\]

where \(k\in\{1,\dots,9\}\). Of course, we can select only one value in a cell, i.e.:

\[ \sum_k x_{i,j,k} = 1 \>\>\forall i,j|IJ(i,j)\]

where \(IJ(i,j)\) is the set of white cells we need to fill in. To make the model simpler we also define an integer variable \(v_{i,j}\) with the actual value:

\[v_{i,j} = \sum_k k\cdot x_{i,j,k}\]

The values in each block need to add up to the given constant. So:

\[\sum_{i,j|B(b,i,j)} v_{i,j} = C_b \>\>\forall b\]

where the set \(B(b,i,j)\) is the mapping between block number and cells \((i,j)\). The uniqueness constraint is not very difficult at all:

\[\sum_{i,j|B(b,i,j)} x_{i,j,k} \le 1 \>\>\forall b,k\]

Note that (1) uses a different numbering scheme for the cells, but otherwise the models are very similar. My version of the model is smaller in terms of nonzero elements in the matrix, due to my use of intermediate variables \(v_{i,j}\).
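
To make the three constraint families concrete, here is a minimal sketch in Python using PuLP (my addition; the tiny 2x2 instance with its cells and block sums is made up for illustration and is not the puzzle above):

# Minimal PuLP sketch of the Kakuro MIP on a hypothetical tiny instance.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpStatus

cells = [(1,1), (1,2), (2,1), (2,2)]          # white cells IJ(i,j)
blocks = [([(1,1),(1,2)],  4),                # (member cells, block sum C_b)
          ([(2,1),(2,2)], 12),
          ([(1,1),(2,1)], 10),
          ([(1,2),(2,2)],  6)]
K = range(1, 10)                              # values 1..9

m = LpProblem("kakuro", LpMinimize)
x = {(i,j,k): LpVariable(f"x_{i}_{j}_{k}", cat="Binary")
     for (i,j) in cells for k in K}
v = {(i,j): lpSum(k * x[i,j,k] for k in K) for (i,j) in cells}

for (i,j) in cells:                           # exactly one value per cell
    m += lpSum(x[i,j,k] for k in K) == 1
for members, total in blocks:
    m += lpSum(v[i,j] for (i,j) in members) == total   # block sum
    for k in K:                               # no duplicate values in a block
        m += lpSum(x[i,j,k] for (i,j) in members) <= 1

m += 0                                        # dummy objective
m.solve()
print(LpStatus[m.status],
      {c: sum(k for k in K if x[c[0],c[1],k].value() > 0.5) for c in cells})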

Data entry

A compact way to enter the data for the example problem in GAMS is:

parameter
  block(b,i,j)
/
   block1.r1.(c1*c2)   16  ,   block13.(r1*r3).c1  23
   block2.r1.(c5*c7)   24  ,   block14.(r6*r7).c1  11
   block3.r2.(c1*c2)   17  ,   block15.(r1*r4).c2  30
   block4.r2.(c4*c7)   29  ,   block16.(r6*r7).c2  10
   block5.r3.(c1*c5)   35  ,   block17.(r3*r7).c3  15
   block6.r4.(c2*c3)    7  ,   block18.(r2*r3).c4  17
   block7.r4.(c5*c6)    8  ,   block19.(r5*r6).c4   7
   block8.r5.(c3*c7)   16  ,   block20.(r1*r5).c5  27
   block9.r6.(c1*c4)   21  ,   block21.(r1*r2).c6  12
   block10.r6.(c6*c7)   5  ,   block22.(r4*r7).c6  12
   block11.r7.(c1*c3)   6  ,   block23.(r1*r2).c7  16
   block12.r7.(c6*c7)   3  ,   block24.(r5*r7).c7   7
/

The sets \(IJ(i,j)\) and parameter \(C_b\) can be easily extracted from this.

Results

After adding a dummy objective, we can solve the model as a MIP (Mixed-Integer Programming) model. The output looks like:

----     75 VARIABLE v.L 

            c1          c2          c3          c4          c5          c6          c7

r1           9           7                                   8           7           9
r2           8           9                       8           9           5           7
r3           6           8           5           9           7
r4                       6           1                       2           6
r5                                   4           6           1           3           2
r6           8           9           3           1                       1           4
r7           3           1           2                                   2           1

It is noted that although I declared \(x_{i,j,k}\) over all cells, only the subset of cells that we actually need to fill in is used in the equations. GAMS only generates the variables that actually appear in the equations, so all the “black” cells are not part of the model.

Cplex can solve the model in zero iterations and zero nodes: it solves the model completely in the presolve.

Interestingly, somewhat disappointing performance results are reported in (1). They use CBC and a solver called eplex (never heard of that one). I checked the above small instance with CBC, and it also solves it in zero nodes, just like Cplex.

A large, difficult example

This example is taken from (4):

[image: large, difficult Kakuro instance from (4)]

For this problem Cplex has to do a little bit of work. The log looks like:

---   3,665 rows  4,921 columns  19,189 non-zeroes
---   4,920 discrete-columns

. . .

Dual simplex solved model.

Root relaxation solution time = 0.44 sec. (102.53 ticks)

        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap

      0     0        0.0000   325                      0.0000     1727        
      0     0        0.0000   188                    Cuts: 39     2562        
      0     0        0.0000   233                    Cuts: 49     3018        
      0     0        0.0000   227                    Cuts: 18     3403        
      0     0        0.0000   147                    Cuts: 64     4037        
*     0+    0                            0.0000        0.0000             0.00%
Found incumbent of value 0.000000 after 4.42 sec. (1071.33 ticks)
      0     0        cutoff              0.0000        0.0000     4423    0.00%
Elapsed time = 4.44 sec. (1072.36 ticks, tree = 0.01 MB, solutions = 1)

. . .

Proven optimal solution.

MIP Solution:            0.000000    (4423 iterations, 0 nodes)

To be complete, CBC takes a little more time:

Search completed - best objective 0, took 55117 iterations and 140 nodes (42.00 seconds)
Strong branching done 1748 times (104829 iterations), fathomed 9 nodes and fixed 122 variables
Maximum depth 18, 0 variables fixed on reduced cost

This model does not solve completely in the presolve. However, the computation times look quite good to me, especially when compared to what is reported in (1).

The solution looks like:

[image: solution of the large Kakuro instance]

Proving uniqueness

We can prove the uniqueness of this solution by adding the constraint:

\[\sum_{i,j,k|IJ(i,j)} x^*_{i,j,k} x_{i,j,k} \le |IJ| - 1 \]

where \(x^*\) is the previously found optimal solution, and \(|IJ|\) is the cardinality of the set \(IJ(i,j)\) (i.e. the number of “white” cells). Note that \(\sum x^*_{i,j,k}=\sum x_{i,j,k} = |IJ|\). With this constraint we should either find a different solution (so the solution was not unique), or the model should be declared “infeasible”.
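
Continuing the hypothetical PuLP sketch from earlier, this cut could be added as follows (again just a sketch):

# No-good cut on the incumbent: at least one x[i,j,k] that was 1 must change.
incumbent = [key for key in x if x[key].value() > 0.5]
m += lpSum(x[key] for key in incumbent) <= len(cells) - 1
m.solve()   # either a different solution appears, or the model is infeasible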

References
  1. Helmut Simonis, Kakuro as a Constraint Problem, Cork Constraint Computation Centre (4C), University College Cork, 2008.
  2. MIP Modeling: From Sukodu to Kenken via Logarithms, http://yetanothermathprogrammingconsultant.blogspot.com/2016/10/mip-modeling-from-sudoku-to-kenken.html 
  3. Solving Takuzu puzzles as a MIP using xor, http://yetanothermathprogrammingconsultant.blogspot.com/2017/01/solving-takuzu-puzzles-as-mip-using-xor.html
  4. http://pagesperso.g-scop.grenoble-inp.fr/~cambazah/page1/page7/assets/data_generated_hard.ecl has some larger data sets.
  5. http://www.gecode.org/doc-latest/MPG.pdf, Chapter 21 is about solving Kakuro using Gecode, a constraint programming solver.
  6. José Barahona da Fonseca, A novel linear MILP model to solve Kakuro puzzles, 2012. This has a (partial) GAMS model (stylistically I would write things differently; the model can be written more cleanly).

A script to generate power pivot tables

In (1) I provided a GAMS/VBS script to generate an Excel spreadsheet containing a pivot table that used a CSV file as an external data source. Essentially, the pivot table provides a view of the data in the CSV file. Here I show a script that stores the data inside Excel's data model: no external CSV file is needed once the data has been loaded into the data model.

$ontext

  Example code that creates a pivot table with large data loaded
  into Excel's data model.

$offtext

sets c,cty,yr,attr;
parameter psd(c,cty,yr,attr) 'production, supply and distribution';

$gdxin psd_alldata.gdx
$loaddc c=dim1 cty=dim2 yr=dim3 attr=dim4 psd

parameter TotalProduction(c,yr);
TotalProduction(c,yr) = sum(cty, psd(c,cty,yr,'Production'));

alias (Commodities,Countries,Year,Attributes,*);
parameter Cube(Commodities,Countries,Year,Attributes);

Cube(c,cty,yr,attr) = psd(c,cty,yr,attr);
Cube(c,'World',yr,'Production') = TotalProduction(c,yr);

$set gdxfile         cube.gdx
$set csvfile         cube.csv

execute_unload "%gdxfile%",Cube;
execute "gdxdump %gdxfile% output=%csvfile% format=csv symb=Cube";

$set xlsfile         %system.fp%pivotcube.xlsx
$set scriptname      script.vbs

execute "cscript %scriptname% //Nologo";


$onecho > %scriptname%
Wscript.Echo "Creating pivot table..."
Wscript.Echo "This will take a minute."
Wscript.Echo ""
Set xl = CreateObject("Excel.Application")
xl.DisplayAlerts=False
Wscript.Echo "  Creating Workbook..."
Set wb = xl.Workbooks.Add

Wscript.Echo "  Creating Query..."
wb.Queries.Add "cube", _
  "let Source = Csv.Document(File.Contents(""%system.fp%%csvfile%""),[Delimiter="",""])," & _
  "  #""Headers"" = Table.PromoteHeaders(Source)," & _
  "  #""Type"" = Table.TransformColumnTypes(#""Headers"",{{""Val"", Number.Type}}) in #""Type"""

Wscript.Echo "  Creating Connection (loading data)..."
Set con = wb.Connections.Add2("Query - cube", _
  "Connection to the 'cube' query in the workbook.", _
  "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=cube", _
  """cube""", 6, True, False)

Wscript.Echo "  Creating PivotCache..."
Set pc = wb.PivotCaches.Create(2,con,6)

If wb.Sheets.count = 0 Then
  Set sh = wb.Sheets.Add()
Else
  Set sh = wb.Sheets(1)
End If

Wscript.Echo "  Creating PivotTable..."
Set pt = pc.CreatePivotTable(sh.Range("A1"))
pt.SmallGrid = False

Wscript.Echo "  Initial Layout..."
pt.CubeFields("[cube].[Commodities]").Orientation=3
pt.CubeFields("[cube].[Attributes]").Orientation=3
pt.CubeFields("[cube].[Year]").Orientation=2
pt.CubeFields("[cube].[Countries]").Orientation=1

xlSum = -4157
call pt.CubeFields.GetMeasure("[cube].[Val]",xlSum,"Sum of Val")
call pt.AddDataField(pt.CubeFields("[Measures].[Sum of Val]"), "Sum of Val")

pt.PivotFields("[cube].[Commodities].[Commodities]").ClearAllFilters
pt.PivotFields("[cube].[Commodities].[Commodities]").CurrentPageName = _
  "[cube].[Commodities].&[Coffee, Green]"
pt.PivotFields("[cube].[Attributes].[Attributes]").ClearAllFilters
pt.PivotFields("[cube].[Attributes].[Attributes]").CurrentPageName = _
  "[cube].[Attributes].&[Production]"

pt.TableStyle2 = "PivotStyleDark7"

Wscript.Echo "  Saving %xlsfile%..."
wb.SaveAs("%xlsfile%")
wb.Close
xl.Quit
$offecho

The steps are as follows:

  1. Load data into GAMS.
  2. Create a 4-dimensional cube parameter suited for display as a pivot table. The dimensions are commodity, country, year, and attribute.
  3. Export the cube as GDX file.
  4. Convert the GDX file to a CSV file.
  5. Run a VBScript that talks to Excel and performs the following steps:
    1. Load the data into the data model.
    2. Create a pivot table for this data.
    3. Create an initial layout that makes sense for a user of the data (as opposed to an empty pivot table).

The result will look like:

[screenshot: the generated workbook with pivot table]

We see three regions. On the left we see the actual pivot table:

[screenshot: the pivot table]

The rightmost panel allows changing the layout of the pivot table:

[screenshot: the pivot table field layout panel]

Note that our 4-dimensional cube parameter has become a 5-column table: the values are a separate column.

The smaller panel in the middle shows the queries in the data model:

[screenshot: the workbook queries panel]

Data model

The step to load the CSV data into the data model of Excel is actually written in what is called the M language:

  1. Source = Csv.Document(File.Contents("cube.csv"),[Delimiter=","])
  2. #"Headers" = Table.PromoteHeaders(Source)
  3. #"Type" = Table.TransformColumnTypes(#"Headers",{{"Val", Number.Type}}) in #"Type"

Interestingly, we can follow these steps after the fact in the Excel Query Editor. After the first step, Source, we see that the raw CSV data has been imported:

[screenshot: raw CSV data after the Source step]

The second step promotes the headers:

[screenshot: table after promoting the headers]

and the last step changes the type of the value column to numeric:

[screenshot: Val column converted to numeric]

File Sizes

The method in (1) requires both an .xlsx file and a .csv file to be present. For this example, we end up with:

02/18/2017  12:49 PM         6,397,317 pivotcube.xlsx
02/18/2017  12:48 PM        62,767,308 cube.csv

The method shown here produces a self-contained .xlsx file. We have:

02/18/2017  12:52 PM         2,958,573 pivotcube.xlsx

I am not sure why this version is even smaller than the pivot table with the external CSV data, but apparently the internal data model can be squeezed very efficiently. Some of it may have to do with the internal VertiPaq compression engine, which allows very efficient columnar storage of the data in the data model (3).

References
  1. http://yetanothermathprogrammingconsultant.blogspot.com/2016/04/export-large-result-sets-to-pivot-table.html contains a script to generate a pivot table with a CSV file as an external data source.
  2. http://yetanothermathprogrammingconsultant.blogspot.com/2016/07/excel-pivot-table-and-large-data-sets.html shows how to load CSV data into the data model and create a pivot table by hand.
  3. Alberto Ferrari and Marco Russo, The VertiPaq Engine in DAX, https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3 (chapter from The Definitive Guide to DAX, Business intelligence with Microsoft Excel, SQL Server Analysis Server, and Power BI, Microsoft Press, 2015).

Thursday, February 9, 2017

New book: Testing R Code

Testing R Code book cover

Table of contents

  1. Introduction
  2. Run-Time Testing with assertive
  3. Development-Time Testing with testthat
  4. Writing Easily Maintainable and Testable Code
  5. Integrating Testing into Packages
  6. Advanced Development-Time Testing
  7. Writing Your Own Assertions and Expectations

190 pages


Price: $0.26 per page (Amazon)

I liked it at first sight when leafing through the pages. Will try to use this in my next R project.

Wednesday, February 8, 2017

Failing to use optimal solution as initial point

In a somewhat complex nonlinear programming model, I observed the following. Here I do two solves in a row:

solve m2 minimizing zrf using dnlp;
solve m2 minimizing zrf using dnlp;

We solve the model, and then in the second solve we use the optimal solution from the first solve as an initial point. This should make the second solve rather easy! But what I see is somewhat different.

The first solve has no problems (I fed it a close-to-optimal initial point). The log looks like:

--- Generating DNLP model m2
--- testmodel6.gms(83981) 156 Mb  22 secs
--- testmodel6.gms(84048) 160 Mb
---   139,443 rows  139,450 columns  746,438 non-zeroes
---   8,504,043 nl-code  467,552 nl-non-zeroes
--- testmodel6.gms(84048) 116 Mb
--- Executing CONOPT: elapsed 0:01:44.102
CONOPT 3         24.7.3 r58181 Released Jul 11, 2016 WEI x86 64bit/MS Windows
 
 
    C O N O P T 3   version 3.17A
    Copyright (C)   ARKI Consulting and Development A/S
                    Bagsvaerdvej 246 A
                    DK-2880 Bagsvaerd, Denmark
 
 
   Iter Phase Ninf   Infeasibility   RGmax    NSB   Step InItr MX OK
      0   0        8.7137917110E-04 (Input point)
 
                   Pre-triangular equations:   9720
                   Post-triangular equations:  129722
 
      1   0        3.0000000000E-05 (After pre-processing)
      2   0        4.6875000000E-07 (After scaling)
      3   0     0  7.3256941462E-17               1.0E+00      F  T
 
** Feasible solution. Value of objective =   0.905233579345
 
   Iter Phase Ninf     Objective     RGmax    NSB   Step InItr MX OK
      4   3        9.0523357374E-01 1.9E-03     6 1.0E+00    3 T  T
      5   3        9.0523357341E-01 1.0E-03     3 1.0E+00    2 T  T
      6   3        9.0523357341E-01 1.3E-12     3
 
** Optimal solution. Reduced gradient less than tolerance.

We are hopeful that the second solve will be even easier, but we see:

--- Generating DNLP model m2
--- testmodel6.gms(83981) 69 Mb
*** Error at line 83981: overflow in * operation (mulop)
--- testmodel6.gms(83981) 155 Mb 1 Error  22 secs
--- testmodel6.gms(84049) 116 Mb 1 Error
*** SOLVE aborted

How can an optimal solution be an invalid initial point?

My analysis is the following:

  • The equation we cannot generate is part of what Conopt calls “post-triangular equations”. These equations are set aside by Conopt when solving the model and are reintroduced just before reporting the solution, because Conopt knows they will not change the solution. Typical examples are “accounting rows”: constraints that could have been implemented as assignments in post-processing, as they just calculate some additional information about the solution. In most cases these post-triangular equations do not require the evaluation of gradients.
  • When we generate the model for the second time, GAMS will try to evaluate the gradients, and it is here that we encounter the overflow. In this model things are far more complicated, but think of the function \(f(x)=\sqrt{x}\) at \(x=0\): the function itself can be evaluated at \(x=0\), but the gradient cannot. (Actually, GAMS will patch the derivative of \(\sqrt{x}\) at \(x=0\) to fix this, but it is not smart enough to do this for my functions; see the small illustration after this list.)
  • This also means there is an architectural flaw here: we should presolve before evaluating gradients. Here GAMS evaluates the gradients, and subsequently CONOPT presolves the model; this is exactly the wrong order. Note that AMPL performs a presolve during model generation, as opposed to GAMS, which leaves it to the solver. I believe AMPL is doing the right thing here.
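
To make the \(\sqrt{x}\) analogy concrete, a small illustration (my sketch, not from the model itself):

# f(x) = sqrt(x) can be evaluated at x = 0, but its gradient 1/(2*sqrt(x))
# cannot: evaluating derivatives at a boundary point blows up even though
# the function value is fine -- the same effect that kills the second solve.
import math

x = 0.0
print(math.sqrt(x))                    # 0.0, no problem
try:
    print(1.0 / (2.0 * math.sqrt(x)))
except ZeroDivisionError:
    print("gradient evaluation fails at x = 0")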

A work-around is:

solve m2 minimizing zrf using dnlp;
x.l(jd) = max(x.l(jd),1e-6);
solve m2 minimizing zrf using dnlp;

By moving the levels away from zero before the second solve, the gradients can be evaluated when the model is generated again.