Friday, December 9, 2016

Reading CSV files in R: read.csv vs read_csv

There are a number of very fast CSV file readers available in R and Python. Lets have a quick test to see how they compare.

Generating CSV file

I generated a very simple, but large CSV file with 100 million records using a GAMS script as follows:


parameter d(i,j,k,l);
d(i,j,k,l) = uniform(0,1);

$setenv gdxcompress 1
execute_unload "d.gdx"
execute "gdxdump d.gdx output=d.csv symb=d delim=comma format=csv"

The generated CSV file looks like:

D:\tmp\csv>head d.csv

D:\tmp\csv>dir d.*
Volume in drive D is My Passport
Volume Serial Number is 74B7-6DCC

Directory of D:\tmp\csv

12/08/2016  03:42 PM     3,656,869,678 d.csv
12/08/2016  03:30 PM       806,199,476 d.gdx
               2 File(s)  4,463,069,154 bytes
               0 Dir(s)  1,099,214,491,648 bytes free


We also see the CSV file is much larger than the intermediate (compressed) GAMS GDX file.

R read.csv

This is the default CSV reader in R.

> system.time(d<-read.csv("d.csv"))
   user  system elapsed 
1361.61   50.56 1434.39 

R read_csv

read_csv is from the readr package, and it is much faster for large CSV files:

> system.time(d<-read_csv("d.csv"))
Parsed with column specification:
  i = col_character(),
  j = col_character(),
  k = col_character(),
  l = col_character(),
  Val = col_double()
|================================================================================| 100% 3487 MB
   user  system elapsed 
 186.23    5.66  196.20 

Would it help to read a compressed CSV file?

> system.time(d<-read_csv("d2.csv.gz"))
Error in .Call("readr_read_connection_", PACKAGE = "readr", con, chunk_size) : 
  negative length vectors are not allowed
Timing stopped at: 57.53 4.43 62.29 

Bummer. I have no idea what went wrong here. May be we hit some size limit (note the CSV file is larger than 2 gb; other compression formats gave the same result).

Python pandas.read_csv

Quite fast:


The paratext library should be even faster.

  1. readr 1.0.0,
  2. Damian Eads, ParaText: CSV parsing at 2.5 GB per second,

Thursday, December 8, 2016

Management Science on Vox

Fuel is cheap. Why are we still paying to check bags?


  1. Mariana Nicolae and Mazhar Arıkan and Vinayak Deshpande and Mark Ferguson, Do Bags Fly Free? An Empirical Analysis of the Operational Implications of Airline Baggage Fees, Management Science, August 2016 (online), 

Tuesday, December 6, 2016

[VIDEO] A Huge Debate: R vs. Python for Data Science

Interesting video: [Video] A Huge Debate: R vs. Python for Data Science (presentation by Eduardo Ariño de la Rubia)

Packages discussed:



Friday, December 2, 2016

Cplex 12.7 and Benders Decomposition

The new version of Cplex (12.7) has some interesting new facilities to make it easier to use a Benders Decomposition approach (1,3). Paul Rubin shares some experiences with this in (2).


The original paper by Jacques Benders can be found in (4). These old papers are always an interesting read. The style and notation is often very different from how papers look like these days, although in this case the notation is quite modern.

  1. Paul Shaw, What’s new in Cplex 12.7?,
  2. Paul Rubin, Support for Benders Decomposition in CPLEX
  3. Xavier Nodet, IBM CPLEX Optimization Studio 12.7 - Benders, Modeling Assistance, etc.,
  4. J.F. Benders, Partitioning procedures for solving mixed-variables programming problems, Numerische Mathematik 4, pp. 238-252, 1962,

Thursday, December 1, 2016

Table Assignment for Kindergartners

From this post we have a problem stated as:

My wife teaches AM and PM kindergarten classes. AM has 14 students and PM 11. At the beginning of each month, she puts out a new seating chart where she rotates students in such a way that they (ideally) sit at a different table and with different students for that month.

There are 3 students per table, but if numbers force the issue, the last may have more or less. We realize that, by the end of the year, there will be some unavoidable situations where students end up sitting with each other or at the same tables again, and this is okay.

Something like this I imagine:


A somewhat related problem is discussed in (1). As usual we have a set of binary variables that indicate the assignment of students to tables: 

\[x_{s,t,m} = \begin{cases}1&\text{if student $s$ is seated at table $t$ in month $m$}\\

The first set of equations are somewhat straightforward. Each student sits at exactly one table:

\[\sum_t x_{s,t,m} = 1 \>\>\forall s,m\]

We cannot exceed the capacity of a table:

\[\sum_s x_{s,t,m} \le cap_t \>\> \forall t,m\]

In the complete model below I used an equality constraint for this: there is no slack capacity. Secondly we want to keep track of how many times students sit at the same table. We introduce binary variables \(meet_{s1,s2,t,m}\in\{0,1\}\) indicating if two students \(s1\) and \(s2\) sit at the same table. The obvious nonlinear equation would be:

\[meet_{s1,s2,t,m} = x_{s1,t,m} \cdot x_{s2,t,m}\>\> \forall s1<s2,t,m\]

Note that we can exploit symmetry here: if we already compared \(s1,s2\) we do not need to bother about \(s2,s1\). The standard linearization is

\[\begin{align} & meet_{s1,s2,t,m} \le x_{s1,t,m}\\
& meet_{s1,s2,t,m} \le x_{s2,t,m}\\
& meet_{s1,s2,t,m} \ge x_{s1,t,m}+x_{s1,t,m}-1\end{align}\]

As we push variable \(meet\) down, we can drop the first two inequalities and just keep the last. Note that variable \(meet\) is now a bound instead of the product \(x_{s1,t,m} \cdot x_{s2,t,m}\). When reporting we should use the underlying variables \(x\).

A count of the number of times students \(s1,s2\) sit at the same table is a simple aggregation:

\[meetcount_{s1,s2} = \sum_{t,m} meet_{s1,s2,t,m}\]

An obvious objective is to minimize the expression \(\max meetcount_{s1,s2}\). This is easily linearized:

&maxmeetcount \ge meetcount_{s1,s2} \>\>\forall s1<s2\end{align}\]

To test this we use 14 students, 11 months (I assume one month vacation period). The capacities of the 5 tables is 3 except for one table with 2 students (this is to make the capacities sum up to 14).


The equations are exactly the same as discussed above:


To help the solver we fix the table assignment for the first month. In addition I assume student 1 is always sitting at table 1. This makes the model easier to solve. Note that I only fix some variables to 1. As a result a lot of other variables should be zero. E.g. if \(x_{s1,t1,m}=1\) we know that \(x_{s1,t2,m}=0\) and similar for other tables than \(t1\).  Instead of fixing all these these myself to zero, I leave it to the MIP presolver to do that for me. This fixing step helps a lot.

We achieve a quick solution with an objective of 2. I.e. two students will sit at most twice at the same table.


In the above model we minimized the maximum number of times students sit at the same table, The optimal value is 2. However this means that we allow the number of times two students meet (the meet count, “meets” in the tables below) to float between 0 and 2 without a further incentive to decrease the average meet count within this band width. This leads to a distribution as follows:


On average two students sit at the same table with another student 1.57 times. Detail: note that we need to base the calculations on the optimal values of \(x\) instead of \(meetcount\) as \(meetcount\) is using variable \(meet\) which is just a bound on the product \(x_{s1,t,m} \cdot x_{s2,t,m}\).

We can try to bring the average down by minimizing the sum of the meet counts. The results show, this will not actually bring the average down. But the distribution is certainly different. This objective will produce a few student pairs that have a high meet count:


We can try to minimize the number of student pairs meeting twice or more while still pushing down the maximum by using a composite objective:

\[\min maxmeetcount + \frac{count2}{100}\]

where \(count2\) is the number of pairs meeting twice or more. This gives a slightly different distribution:


A depiction of this distribution:


Another interesting objectives:

  1. Use a quadratic cost \(\sum meetcount_{s1,s2}^2\). This will penalize larger meet counts. This objective is somewhere between minimizing the sum and minimizing the max. MIQPs (Mixed Integer Quadratic Programming problems) are often more difficult to solve however compared to linear models.
  2. Try to have students meet each other for the second time only after some months, i.e. separate pairs sitting at the same table twice over time. This is not so simple to model. The constraint to enforce say: at least 2 months between before sitting at the same table again is not so difficult (see below), but really maximizing the space in between is not so easy.
When do we meet again?

When we make a picture of when student pairs sit at the same table we get something like this:


We color coded the cells whether we have 0, 1, 2 or 3 or more months in between (the cells with a ‘3’ indicate that meets are separated by 3 or more months). Note we did not display the pairs that only sit at the same table once.

To forbid the cases 0 and 1 in the above solution we can add the constraints:

&meetm_{i1,i2,m} = \sum_t meet_{i1,i2,t,m}\\
&meetm_{i1,i2,m}+meetm_{i1,i2,m+1}+meetm_{i1,i2,m+2} \le 1\end{align} \]                 

Here \(meetm\) is a simple aggregation of the variable \(meet\). The second constraint only allows one time the pair \((i1,i2)\) sit at the same table in each three month period.

Note that now we have the variable \(meetm\) we can optimize constraint that calculates the meet count:

\[meetcount_{s1,s2} = \sum_m meetm_{s1,s2,m}\]

This will reduce the number of nonzero elements in this equation, so this is a good idea.

The new picture


shows only 2s and 3s (where again 3 means at least 3 months in between). I.e. we successfully removed all cases where student pairs sit at the same table with zero or one month in between.

  1. A more complex table seating problem: 
  2. A related problem:'s_schoolgirl_problem

Hypo-Variance Brain Storm Optimization?

I never heard of this before, but here is a Matlab application. Looks like “Brain Storm Optimization” is yet another heuristic. (Not so easy to keep track, every academic seems to have the urge to invent his/her own heuristic these days).

  2. Yuhui Shi, Brain Storm Optimization Algorithm, 2011 [link]
  3. Cheng, S., Qin, Q., Chen, J. et al, Brain storm optimization algorithm: a review, 2016 [link]

Monday, November 28, 2016

Reformulation of non-differentiable term

In a non-linear programming model with a rather complicated objective I see a term

\[\frac{p}{2}\left(|x| + x\right)\]

that is minimized. It is also noted that \(p>0\).


From a picture of \(y=\displaystyle\frac{|x|+x}{2}\) we can see this is just a more complicated way of writing:

\[p \max\{0,x\}\]

I don’t see much advantage in using the first form. Both these expressions are non-differentiable at \(x=0\). Almost always a better formulation is to replace \(\max\{0,x\}\) by a new positive variable \(y\ge 0\), and add a constraint:

\[y \ge x\]

Although we are making the model slightly larger (we add a variable and an equation) in general this is good trade-off: we now have a simpler, even linear term in the objective.

Sunday, November 27, 2016


NEOS allows you to run GAMS jobs on a remote server, so you don’t need a GAMS system installed. The web based submission is a useful tool for self-contained models. That is when everything is in the GAMS model: data, equations, solve statement and reporting through display statements.

For larger jobs where data is coming from different sources and where we need to do more complex post-processing you may need other approaches. If you have GAMS/BASE license you can develop and run the GAMS part locally and only do the solver part remotely by using the kestrel solver. Basically add to the model something like option LP=kestrel and run the model. If you want to change some defaults (e.g. by specifying a non-default solver), you can create a kestrel.opt option file.

If you don’t have a GAMS license file and want to run more complicated models we need to split the complete model in three parts:


To run the data step and report step we need an installed GAMS system but we need no license file.

Data Step

The data step is a GAMS file that basically contains set and parameter declarations and data manipulation steps. It can also read data from spreadsheets and databases. We add at the end the following statements:

$setenv gdxcompress 1
execute_unload "in.gdx";

This will save all data in a compressed GDX file. This file can be run just from the command line or from the GAMS IDE on the local machine.

Model Step

The model step first needs to read the GDX file in.gdx. The quickest way to do this is to run gdxdump in.gdx nodata and copy the output of this tool into the .gms file containing the model. Unfortunately NEOS does not allow include files, so we need to physically copy the lines into the .gms file.

The rest of the file can contain variable and equation declarations and the equation definitions. Finally we can add the model and solve statements.

NEOS will save everything in a GDX file call out.gdx so we don’t have to add this ourselves.

To run the model step, upload the .gms file and the in.gdx file to NEOS. I used the following options:



For larger results the e-mail will not provide a link to the zip file with the gdx file out.gdx so I did not bother to fill out the e-mail address. After waiting a little bit you will see


   NEOS Server Version 5.0
   Job#     : 5150902
   Password : EwfLRsCT
   Solver   : nco:MINOS:GAMS
   Start    : 2016-11-27 05:52:35
   End      : 2016-11-27 05:52:55
   Host     : NEOS HTCondor Pool


   This information is provided without any express or
   implied warranty. In particular, there is no warranty
   of any kind concerning the fitness of this
   information  for any particular purpose.
[. . .] 
**** SOLVER STATUS     1 Normal Completion         
**** MODEL STATUS      2 Locally Optimal           
**** OBJECTIVE VALUE            12942.8050
[. . .]

Additional Output:

You may want to check the solver and model status, and then download the zip file. The zip file will contain the file out.gdx with all the results.

Report Step

The report step needs to read the file out.gdx. The simplest is again to do gdxdump out.gdx nodata and include the output of this command in the report file. Now you can locally run the report.

How to script this

There are probably two ways to automate this:

  1. Use the NEOS API to write a script that executes the model on NEOS
  2. Use something like COM automation for Internet Explorer or a web scraping tool to script the browser steps

Monday, November 21, 2016

Cord Diagrams in R


A nice Cord Diagram produced with just two statements in R (see (1)):

par(mar = c(1, 1, 1, 1), bg="violetred4")
circlize::chordDiagram(matrix(1, 20, 20),
                       symmetric = TRUE,
                       transparency = 0.85,
                       annotationTrack = NULL)

A collection of more meaningful graphs is in (2). Particularly impressive is: