Abstract

The progress of metaheuristic techniques, big data, and the Internet of Things creates opportunities for performance improvements in complex industrial systems. This article explores the application of big data techniques to the implementation of metaheuristic algorithms, with the purpose of supporting decision-making in industrial processes. This exploration intends to evaluate the quality of the results and the convergence times of the algorithm under different conditions regarding the number of solutions and the processing capacity. Under what conditions can we obtain acceptable results in an adequate number of iterations? In this article, we propose a binary cuckoo search algorithm using the MapReduce programming paradigm implemented in the Apache Spark tool. The algorithm is applied to different instances of the crew scheduling problem. The experiments show that the conditions for obtaining suitable results and iteration counts are specific to each problem and are not always satisfactory.

1. Introduction

With the increase of different kinds of electronic devices, social networks, and the Internet of Things, datasets are growing fast in volume, variety, and complexity. Big data is currently emerging as a trend, and working with large datasets typically aims at extracting useful knowledge from them. To address this problem, different programming models have been developed, of which MapReduce is one of the most powerful [1].

In complex industrial systems, engineers face daily challenges in which their job is to decide how to improve production and reduce costs. They are continuously selecting where, how, when, and what must be done to achieve efficiency in the processes. Normally, these decisions are based on an optimization problem. On the other hand, a greater quantity of data is available nowadays, and therefore we can build robust optimization models that support these decisions. However, this increase in data volume and variety implies an increase in the complexity of the calculations and therefore in the convergence time of the algorithms.

Moreover, computational intelligence, and particularly metaheuristics, has been successful in solving complex industrial problems. In the literature, we find metaheuristics that have satisfactorily solved problems of resource allocation [2, 3], vehicle routing [4], scheduling [5], reshuffling operations at maritime container terminals [6], antenna positioning [7], and covering [8, 9], as well as bioinformatics problems such as protein structure prediction, molecular docking, and gene expression analysis [10]. However, in the big data era, the integration of metaheuristics into the decision-making process presents two fundamental difficulties. The first is to obtain suitable results and convergence times from computational intelligence algorithms when dealing with large datasets, because many of the decisions must be made close to real time. The second relates to the differences between the programming models usually used in computational intelligence and in big data algorithms. These difficulties motivate the design and study of computational intelligence algorithms in programming models used in big data.

A recent framework in the big data area is Apache Spark, which has been widely used to solve industry problems [11]. This framework has advantages over the traditional MapReduce model, since it uses an abstraction called resilient distributed dataset (RDD). This abstraction allows operations to be carried out in memory with high fault tolerance, making it well suited to iterative algorithms [12]. This work focuses mainly on the performance analysis of metaheuristic algorithms implemented with the big data tool Apache Spark. The specific objective is the reduction of their convergence times, to support decision-making in complex industrial systems at the right times. For the design of the experiments, we will vary the population size of the metaheuristic and the number of executors within Apache Spark. To perform the evaluation, the average value, the number of iterations, and the speedup will be used. The following scenarios will be studied:
(1) The evaluation of the average value through the variation of the number of solutions.
(2) The evaluation of the number of iterations through the number of solutions used to solve problems.
(3) The evaluation of algorithm scalability through the number of executors.

These analyses aim to understand under which conditions, related to the number of solutions and executors, a metaheuristic algorithm can obtain suitable results and times to support the decision-making process in complex industrial problems. For this study, it was decided to use the cuckoo search metaheuristic; however, the method presented in this article could be applied to different problems of complex industrial systems.

Cuckoo search is a relatively new metaheuristic that has been widely used in solving different types of optimization problems [13]. Some examples of problems solved by the cuckoo search algorithm are satellite image segmentation [14], resource allocation problems [3, 15], optimal power system stabilizer design problems [16], and the optimal allocation of wind-based distributed generators [17], among others.

In order to carry out the experiments, two types of datasets were chosen. The first is a benchmark dataset associated with the well-known set covering problem, and the second is associated with large-scale railway crew scheduling problems, where the number of columns fluctuates between fifty thousand and one million. The results show that adequate scalability and convergence times are not always obtained, which depends on the dataset type and the number of solutions being used.

The remainder of this paper is organized as follows. Section 2 briefly introduces the crew scheduling problem. Section 3 details the cuckoo search algorithm. The state of the art of binarization techniques is described in Section 4. In Section 5, we explain the Apache Spark framework. In Sections 6 and 7, we detail the binary and distributed versions of our algorithm. The results of numerical experiments are presented in Section 8. Finally, we provide the conclusions of our work in Section 9.

2. Crew Scheduling Problems

In the crew scheduling problem (CrSP), a group of crew members is assigned to a set of scheduled trips. This allocation must be such that all trips are covered, while the safety rules and collective agreements are respected. This allocation and these restrictions make the CrSP one of the most difficult problems to solve in the transportation industry [18].

A bibliographic search shows that the CrSP is a problem of great current importance, with variations of the original problem appearing that are mainly associated with the restrictions. As an example, we find the CrSP applied to railways. In [19], a CrSP with attendance rates was solved; a version of the CrSP with fairness preferences was solved in [20]. Crew scheduling applications were also found for airlines and bus transportation. For public bus transport, a variation of the CrSP was solved in [21]. A new heuristic was proposed in [22] to solve a crew pairing problem with base constraints. In [23], a large-scale integrated fleet assignment and crew pairing problem was solved.

In this work, due to the addition of big data concepts, we will approach the CrSP in its original form. The problem is defined as follows: a timetable of transport services is given, which are executed every day over a certain period of hours. Each service is divided into a sequence of trips. A trip is performed by a crew and is characterized by a departure station, a departure time, an arrival time, and an arrival station. Given a period of time, a crew performs a roster. This is defined as a cyclical sequence of trips, and each roster is assigned a cost.

The CrSP then consists of finding a subset of rosters that covers all trips, satisfying the constraints imposed, at a minimal cost. The problem is broken down into two phases:
(1) Pairing generation: a very large number of feasible pairings is generated. A pairing is defined as a sequence of trips which can be assigned to a crew in a short working period. A pairing starts and ends in the same depot and is associated with a cost.
(2) Pairing optimization: a selection is made of the best subset of all the generated pairings, guaranteeing that all the trips are covered at minimum cost. This phase follows quite a general approach, based on the solution of set-covering or set-partitioning problems.

In this research, we will assume that the pairing generation phase has already been performed, because we will use a benchmark dataset. Therefore, we will focus our efforts on resolving the pairing optimization phase. The pairing optimization phase requires the determination of a min-cost subset of the generated pairings covering all the trips and satisfying additional constraints. It is usually solved through the set covering problem, to which, depending on the specific problem being modeled, some types of constraints are added.

The set covering problem (SCP) is well known to be NP-hard [24]. Nevertheless, different algorithms for solving it have been developed. There exist exact algorithms that generally rely on branch-and-bound and branch-and-cut methods to obtain optimal solutions [25, 26]. The effort these methods need to solve an SCP instance, however, grows exponentially with the problem size. Thus, even medium-sized problem instances often become intractable and can no longer be solved using exact algorithms. To overcome this issue, the use of different heuristics has been proposed [27, 28].

For example, [28] presented a number of greedy algorithms based on a Lagrangian relaxation (called Lagrangian heuristics); Caprara et al. [29] introduced relaxation-based Lagrangian heuristics applied to the SCP. Metaheuristics have also been applied to solve the SCP; some examples are genetic algorithms [30], simulated annealing [31], and ant colony optimization [32]. More recently, swarm-based metaheuristics such as cat swarm [33], artificial bee colony [34], and black hole [9] were also proposed.

The SCP can be formally defined as follows. Let $A = (a_{ij})$ be an $m \times n$ zero-one matrix, where a column $j$ covers a row $i$ if $a_{ij} = 1$; besides, a column $j$ is associated with a nonnegative real cost $c_j$. Let $M = \{1, \dots, m\}$ and $N = \{1, \dots, n\}$ be the row and column sets of $A$, respectively. The SCP consists in searching a minimum cost subset $S \subseteq N$ for which every row $i \in M$ is covered by at least one column $j \in S$, that is,

$$\text{minimize } f(x) = \sum_{j \in N} c_j x_j \quad \text{subject to } \sum_{j \in N} a_{ij} x_j \ge 1 \;\; \forall i \in M, \qquad (1)$$

where $x_j = 1$ if $j \in S$ and $x_j = 0$ otherwise.
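To make the formulation concrete, the following minimal Python sketch (Python being the language used for our implementation) evaluates the cost and feasibility of a candidate cover on a toy instance; the names evaluate_scp, A, costs, and solution are ours and not part of the benchmark interface.

import numpy as np

def evaluate_scp(A, costs, solution):
    # A is an m x n zero-one matrix, costs a length-n vector,
    # solution a length-n binary vector selecting columns
    covered = A @ solution >= 1        # each row covered by at least one chosen column
    feasible = bool(covered.all())
    cost = float(costs @ solution)     # sum of the costs of the selected columns
    return cost, feasible

# Toy instance: 3 rows, 4 columns
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
costs = np.array([2.0, 3.0, 4.0, 1.0])
print(evaluate_scp(A, costs, np.array([0, 0, 1, 1])))  # (5.0, True)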

3. Cuckoo Search Algorithm

The cuckoo search is a bioinspired algorithm derived from the behaviour of some cuckoo bird species with obligate brood parasitism, which lay their eggs in the nests of other bird species [13]. For simplicity, the cuckoo search algorithm is described using the following idealized rules:
(1) Each cuckoo lays one egg at a time and dumps it in a randomly chosen nest.
(2) The best nests with high-quality eggs will be carried over to the next generations.
(3) The number of available host nests is fixed, and the egg laid by a cuckoo is discovered by the host bird with a probability p_a. In this case, the host bird can either get rid of the egg or simply abandon the nest and build a completely new nest.

The basic steps of the CS can be summarized in the pseudocode shown in Algorithm 1.

 Objective function: f(x), x = (x_1, ..., x_d)
 Generate an initial population of m host nests x_i
while (t < MaxGeneration) or (stop criterion)
  Get a cuckoo randomly (say, i) and replace its solution by performing Lévy flights
  Evaluate its fitness F_i
  Choose a nest among m (say, j) randomly
  if F_i < F_j then
   Replace j by the new solution
  end if
  A fraction p_a of the worse nests are abandoned and new ones are built
  Keep the best nests
  Rank the nests and find the current best
  Pass the current best solutions to the next generation
end while

The cuckoo search solutions are updated as shown in (2):

$$x_{i}(t+1) = x_{i}(t) + \alpha \oplus \text{Lévy}(\kappa), \qquad (2)$$

in which $\alpha > 0$ corresponds to the step size and $\oplus$ corresponds to the entry-wise multiplication. A random number denominated Lévy($\kappa$) is drawn from the distribution shown in (3):

$$\text{Lévy}(\kappa) \sim u = t^{-\kappa}, \quad 1 < \kappa \le 3. \qquad (3)$$
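As an illustration of (2) and (3), the following sketch performs one cuckoo position update, generating the Lévy-distributed steps with Mantegna's algorithm; the concrete step size and exponent values are illustrative, not the tuned values of Table 2.

import numpy as np
from math import gamma, pi, sin

def levy_step(dim, kappa=1.5):
    # Mantegna's algorithm for Levy-stable steps with exponent kappa
    sigma = (gamma(1 + kappa) * sin(pi * kappa / 2)
             / (gamma((1 + kappa) / 2) * kappa * 2 ** ((kappa - 1) / 2))) ** (1 / kappa)
    u = np.random.normal(0.0, sigma, dim)
    v = np.random.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / kappa)

def cuckoo_update(x, alpha=0.01, kappa=1.5):
    # Equation (2): x(t+1) = x(t) + alpha (entry-wise) Levy(kappa)
    return x + alpha * levy_step(len(x), kappa)

print(cuckoo_update(np.zeros(5)))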

The search engine of the cuckoo search algorithm operates naturally in continuous spaces. Nevertheless, crew scheduling problems are solved in discrete or binary spaces, forcing the adaptation of the original algorithm. A state of the art of the main techniques used in the binarization of continuous swarm intelligence metaheuristics is presented in Section 4.

4. Binarization Methods

There exist two main categories of binarization techniques [35]. The first group corresponds to general binarization frameworks, in which there exists a mechanism that allows the binary transformation of any continuous metaheuristic without altering its operators. The most used of these frameworks are the transfer functions and the angle modulation. The second group corresponds to binarizations designed specifically for a given metaheuristic; it includes techniques such as the set-based approach and the quantum binarization.

The most used binarization method is the transfer function, introduced by [36]. This function is an inexpensive operator that provides probability values and models the transitions of the solution positions. The transfer function is the first step of the binarization method: it maps the solutions of $\mathbb{R}^n$ into $[0,1]^n$ probability values. The S-shaped and V-shaped are the most used transfer functions, well described in [37, 38]. The next step is applying a rule to binarize the transfer function results, which may be one of the binarization rules elitist, static probability, complement, or roulette [37].
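A minimal sketch of this two-step scheme, assuming the classical sigmoid as the S-shaped function and the complement rule for the second step (the helper names are ours):

import numpy as np

def s_shape(v):
    # S-shaped transfer function: maps a velocity to a probability in [0, 1]
    return 1.0 / (1.0 + np.exp(-v))

def binarize_complement(x, v, rng=np.random):
    # Complement rule: flip the bit when a uniform draw falls below T(v)
    flip = rng.random(len(x)) < s_shape(v)
    return np.where(flip, 1 - x, x)

x = np.array([0, 1, 1, 0])
v = np.array([-2.0, 0.5, 3.0, -0.1])
print(binarize_complement(x, v))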

The sizing optimization of the capacitor banks in radial distribution feeders was performed previously using a binary particle swarm optimization [39]. For the reliability analysis of the bulk power system, a transfer function based on swarm intelligence was used [40]. A binary coded firefly algorithm that solves the set covering problem was performed using the same transfer function [37]. A binary cuckoo search algorithm for solving the set covering problem was applied previously [41]. An improved firefly and particle swarm optimization hybrid algorithm was applied to the unit commitment problem [38]. A cryptanalytic attack on the knapsack cryptosystem was approached using the binary firefly algorithm [42]. The network and reliability constrained unit commitment problem was solved using a binary real coded firefly algorithm [43]. Similarly, using the firefly algorithm, the knapsack problem was solved [44].

The angle modulation method uses four parameters which control the frequency and shift of a trigonometric function, as shown in (4):

$$g(x) = \sin\big(2\pi(x - a)\, b \cos\big(2\pi(x - a)\, c\big)\big) + d. \qquad (4)$$

Using a set of benchmark functions, the angle modulation method was first applied in particle swarm optimization. Assume an n-dimensional binary problem and $x = (x_1, \dots, x_n)$ as a solution. The first step uses a four-dimensional search space, in which each dimension corresponds to one of the coefficients of (4). Each four-dimensional solution is thus linked to a trigonometric function, and the rule shown in (6) is used for each element $x_j$: the binary value $b_j$ is set to 1 if $g(x_j) \ge 0$ and to 0 otherwise.

Now, for each four-dimensional initial solution (a, b, c, d), we obtain a feasible n-dimensional binarized solution for our n-binary problem. Several applications of the angle modulation method have been developed. These include the implementation of angle modulation using a binary PSO to solve network reconfiguration problems [45]. Another implementation is a binary adaptive evolution algorithm applied to multiuser detection in multicarrier CDMA wireless broadband systems [46]. An angle modulated binary bat algorithm was also previously applied for the mapping of functions when handling binary problems using continuous-variable-based metaheuristics [47].
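A sketch of the angle modulation mapping, under the assumption that the generating function takes the form given in (4); the sampling points and coefficient values are illustrative:

import numpy as np

def angle_modulation(coeffs, n):
    # coeffs = (a, b, c, d): the four-dimensional continuous solution
    a, b, c, d = coeffs
    xs = np.linspace(0.0, 1.0, n)   # one sampling point per bit (an assumption)
    g = np.sin(2 * np.pi * (xs - a) * b * np.cos(2 * np.pi * (xs - a) * c)) + d
    return (g >= 0).astype(int)     # bit is 1 where g >= 0, as in the rule above

print(angle_modulation((0.1, 1.0, 0.5, -0.2), n=8))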

Evolutionary computing (EC) and quantum computing are two research areas whose combination involves three algorithm categories [48]:
(1) Quantum evolutionary algorithms: these algorithms focus on the application of EC algorithms in a quantum-computing environment.
(2) Evolutionary-designed quantum algorithms: these algorithms try to automate the generation of new quantum algorithms using evolutionary algorithms.
(3) Quantum-inspired evolutionary algorithms: these algorithms concentrate on the generation of new EC algorithms using some concepts and principles of quantum computing.

The quantum binary approach is part of this last category, in which the algorithms are adapted to run on classical computers, integrating the concepts of q-bits and superposition. In this method, each achievable solution has a position $x$ and a quantum q-bits vector $Q$. $Q_j$ stands for the probability that $x_j$ takes the value 1. For each dimension $j$, a random number between [0,1] is obtained and compared with $Q_j$: if $rand < Q_j$, then $x_j = 1$; else $x_j = 0$. The mechanism for updating the $Q$ vector is distinct for each metaheuristic.
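A minimal sketch of sampling a binary solution from a q-bits vector Q, as described above (the helper name is ours):

import numpy as np

def sample_from_qbits(Q, rng=np.random):
    # Q[j] is the probability that bit j takes the value 1
    return (rng.random(len(Q)) < Q).astype(int)

Q = np.array([0.9, 0.1, 0.5, 0.7])
print(sample_from_qbits(Q))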

The application of quantum swarm optimization has been used in different problems including combinatorial optimization [49], cooperative approach [50], knapsack problem [51], and power quality monitoring in [52]. The application of quantum differential evolution is also observed in the knapsack problem [53], combinatorial problems [54], and methods of image thresholding [55]. A quantum algorithm using cuckoo search metaheuristic was applied to the knapsack problem [56] and bin packing problem [57]. An application to image thresholding using quantum ant colony optimization is reported in [55]. Two quantum binarization applications to the knapsack problem are reported previously using harmony search in [58] and monkey algorithm in [59].

The unsupervised learning K-means clustering method is used to perform binarization in different problems, as shown in Figure 1. This method starts with the cuckoo search algorithm generating the pair $(x(t+1), v(t+1))$ in a continuous space, in which $x$ is the position and $v$ the velocity of the solution (Figure 1(a)). The modules of all the velocity elements are considered, and K-means is applied to them (Figure 1(b)). For each cluster $k$, we link a value of the transition probability (Figure 1(b)). Finally, the transition is performed using (6), in which $\bar{x}_j$ corresponds to the complement of $x_j$: if a random number falls below the transition probability of the cluster to which the element belongs, the complement is applied. These transitions occur in the binary space (right panel of Figure 1).

In a previous work, we solved the knapsack problem by applying the transition probability function shown in (7) [3]. In this equation, $\alpha = 0.1$, $\beta = 1$, and $N(x_i)$ corresponds to the cluster to which $x_i$ belongs. $P_{tr}(x_i)$ corresponds to the transition probability and takes values between $\alpha$ and $\alpha + \beta$. The initial probability is governed by $\alpha$, and then $\beta$ controls the probability jump between the different groups.
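The following sketch illustrates this clustering-based transition scheme; the linear mapping from cluster rank to probability follows the description of (7) with α = 0.1 and β = 1, but the normalization by K is our assumption (the sketch requires scikit-learn):

import numpy as np
from sklearn.cluster import KMeans

def transition_probabilities(deltas, K=5, alpha=0.1, beta=1.0):
    # Cluster the displacement magnitudes (one-dimensional K-means)
    mags = np.abs(np.asarray(deltas)).reshape(-1, 1)
    km = KMeans(n_clusters=K, n_init=10).fit(mags)
    # Rank clusters by centroid so that rank 0 holds the smallest displacements
    order = np.argsort(km.cluster_centers_.ravel())
    rank = np.empty(K, dtype=int)
    rank[order] = np.arange(K)
    N = rank[km.labels_]              # N(x_i): rank of the cluster of each element
    return alpha + beta * N / K       # assumed linear form of (7)

deltas = np.random.normal(0, 1, 30)
print(transition_probabilities(deltas))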

5. Spark Distributed Framework

The purpose of this section is to present the Spark distributed framework, which is designed to work with large volumes of data. This framework will be used later in Section 7.

The Spark framework provides a friendly interface that allows making good use of the storage, memory, and CPU of a set of servers, with the purpose of processing large amounts of data in memory [11]. The requirement of processing large amounts of data has grown recently, driven principally by the falling cost of data storage, which leads to a new need: obtaining knowledge from the information gathered across time. This need, arising from the available storage capacity, opened a new line of action for researchers, since the amount, diversity, and complexity of the data [60–64] cannot be tackled by traditional machine learning methods.

Spark has a high performance in parallel computing, being used in machine learning algorithms [65], image processing [66], bioinformatics [67], computational intelligence [9], astronomy [68], medical information [69], and so on.

A pioneer in addressing the treatment of bulk data, based on the principle of data locality [70], was the MapReduce framework [1], which has the disadvantage of being insufficient for applications that need to share information across several steps or for iterative algorithms [71]. The Spark framework has been very successful, becoming a platform for generic use, covering batch processing, iterative processing, interactive analysis, stream processing, machine learning, and graph processing.

The central data units of Spark are the resilient distributed datasets (RDDs). These units are distributed and immutable, that is, the transformation of an RDD produces a new RDD, and they provide a fault-tolerant memory abstraction. There are principally two types of operations: transformations, which take RDDs and produce RDDs, and actions, which take RDDs and produce values. To execute Spark, several cluster management options can be used, from the simple standalone Spark solution to Apache Mesos and Hadoop YARN [72].
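For instance, a minimal PySpark snippet contrasting the two operation types:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")
rdd = sc.parallelize(range(10))             # distribute a local collection as an RDD
squares = rdd.map(lambda x: x * x)          # transformation: RDD -> RDD (lazy)
total = squares.reduce(lambda a, b: a + b)  # action: RDD -> value (triggers execution)
print(total)                                # 285
sc.stop()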

In our case, and based on the engineering applications, we decided to use the Hadoop YARN manager, as it is the latest implementation used in cloud computing [73]. Cloud computing has the characteristic of making available a large number of devices to provide services such as computation and storage on demand, which represents a lower cost of hardware, software, and maintenance [73].

6. Binary Cuckoo Search Algorithm

The general operation of the binary cuckoo search algorithm is summarized in this section. First, the algorithm creates the initial solutions with the initialization operator (Section 6.1). Once this happens, the algorithm evaluates compliance with the stop criterion. The two stop criteria are reaching the maximum iteration number and obtaining the optimal value. While neither criterion is met, the K-means transition operator is executed to perform the binarization (detailed in Section 6.2). Once the transitions are obtained, a repair operator is applied whenever the solutions do not satisfy the problem constraints (detailed in Section 6.3). This iterative process continues until the stop criterion is met. A general diagram of the described process is shown in Figure 2.

6.1. Initial Solution Operator

To obtain a new solution, the process begins with the random choice of a column. It is then queried whether the current solution covers all rows. If it does not, the heuristic operator (Section 6.4) is run to add new columns until all rows are covered. The final step is to delete redundant columns, that is, columns all of whose rows are covered by more than one column. The initialization process is detailed in Algorithm 2.

1: Function Initialization()
2: Input: none
3: Output: initialized solution Sout
4: S ← SelectRandomColumn()
5: while all rows are not covered do
6:   S.append(Heuristic(S))
7: end while
8: S ← deleteRepeatedItem(S)
9: Sout ← S
10: return Sout
6.2. K-Means Transition Operator

Cuckoo search is a continuous swarm intelligence metaheuristic. Due to its iterative nature, the position of the solutions needs to be updated at each iteration. When the metaheuristic is continuous, this update is performed in $\mathbb{R}^n$. The solution position update can be expressed in a general form for any continuous metaheuristic as shown in (8):

$$x^{i}(t+1) = x^{i}(t) + \Delta^{i}(t+1). \qquad (8)$$

In this equation, $x^{i}(t+1)$ corresponds to the position of solution $i$ at time $t+1$. This position is obtained from the position at time $t$ plus a function $\Delta^{i}$ calculated at time $t+1$. The function $\Delta^{i}$ depends on each metaheuristic and generates values in $\mathbb{R}^n$. In cuckoo search, for example, $\Delta^{i}(t+1) = \alpha \oplus \text{Lévy}(\kappa)$; in black hole, $\Delta^{i}(t+1) = \text{rand} \times (x_{bh} - x^{i}(t))$; and in the firefly, bat, and PSO algorithms, $\Delta^{i}$ can be expressed in simplified form as the velocity $v^{i}(t+1)$.

The K-means transition operator considers the movements generated by the cuckoo search algorithm in each dimension for all solutions. $\Delta^{j}_{i}(t)$ is the displacement of solution $j$ in the ith position at time $t$. Using $|\Delta^{j}_{i}(t)|$, the magnitude of the displacement, the displacements are subsequently grouped. The K-means method is used to do this, where K represents the number of clusters used. In the final step, a generic function given in (9) is proposed to assign a transition probability to the group of each $\Delta^{j}_{i}$, where each element of the resulting partition identifies one of the clusters. Since the result of this function is a probability, it takes values in [0,1].

Through the function $P_{tr}$, a transition probability is assigned to each group. We use the linear function given in (10) as a first approximation. In this equation, the group location $k$ corresponds to the cluster to which $\Delta^{j}_{i}$ belongs. The coefficient $\alpha$ defines the transition probability values for all the clusters, and the probability increases proportionally to $k$. In our particular case, $k = 0$ corresponds to elements belonging to the group that has the lowest $|\Delta|$ values, and therefore smaller transition probabilities will be assigned to them.

The K-means transition operator begins with the calculation of $\Delta^{j}_{i}$ for each solution (Algorithm 3). The solutions are then grouped using K-means clustering with $|\Delta^{j}_{i}|$ as the distance magnitude. We obtain the transition probability from the group assigned to each solution using (10). Subsequently, the transition of each solution is performed. The rule shown in (12) is used to perform the transition for the cuckoo search, where $\bar{x}_i$ is the complement of $x_i$. In the final step, each solution is completed using the repair operator detailed in Algorithm 4.

1: Function K-meansTransition(ListX(t))
2: Input: list of solutions at t (ListX(t))
3: Output: list of solutions at t + 1 (ListX(t + 1))
4: ∆iList ← get∆i(ListX(t), MH)
5: XiGroups ← K-means(∆iList, K)
6: for X(t) in ListX(t)
7:   for Xi(t) in X(t)
8:     XiGroup ← getXiGroup(i, X(t), XiGroups)
9:     Pi ← getTransitionProbability(XiGroup)
10:    Xi(t + 1) ← applyTransitionRule(Pi, Xi(t))
11:  end for
12: end for
13: for X(t + 1) in ListX(t + 1)
14:   X(t + 1) ← Repair(X(t + 1))
15: end for
16: return ListX(t + 1)
1: Function Repair(Sin)
2: Input: input solution Sin
3: Output: the repaired solution Sout
4: S ← Sin
5: while needRepair(S) == True do
6:   S.append(Heuristic(S))
7: end while
8: S ← deleteRepeatedItem(S)
9: Sout ← S
10: return Sout
6.3. Repair Operator

The objective of the repair operator is to repair the solutions generated by the K-means transition and perturbation operators. The operator has as input parameter the solution Sin to repair and as output parameter the repaired solution Sout. To execute the process, we iteratively use the heuristic operator, which specifies the column that must be added. Once all the rows are covered, the deletion is applied to the columns that have all their rows covered by other columns.

6.4. Heuristic Operator

The heuristic operator is used to repair the solutions that do not comply with the constraints. It aims to select a new column for the cases where a solution needs to be built or repaired. The operator considers as input parameter the solution Sin which needs to be completed; in the case of a new solution, Sin is empty. With the list of columns belonging to Sin, we get the set of rows R not covered by the solution. With the set of rows not covered and using (12), we obtain in line 4 the best 10 rows to be covered. With this list of rows (listRows), on line 5 we obtain the list of the best columns according to the heuristic indicated in (13). Finally, through a random process, we obtain in line 6 the column to incorporate.

1: Function Heuristic(Sin)
2: Input: input solution Sin
3: Output: the new column Cout
4: listRows ← getBestRows(Sin, N = 10)
5: listColumns ← getBestColumns(listRows, M = 5)
6: Cout ← getColumn(listColumns)
7: return Cout
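Since the row and column scores of (12) and (13) are specific to the original formulation, the following sketch uses the classical greedy cost-effectiveness criterion (cost per newly covered row) as a stand-in for (13); all helper names are ours.

import numpy as np

def heuristic_column(A, costs, S, n_best=5, rng=np.random):
    # Rows not yet covered by the partial solution S (a list of column indices)
    covered = A[:, S].sum(axis=1) >= 1 if S else np.zeros(A.shape[0], dtype=bool)
    gain = A[~covered].sum(axis=0)              # rows each column would newly cover
    ratio = np.where(gain > 0, costs / np.maximum(gain, 1), np.inf)
    best = np.argsort(ratio)[:n_best]           # the n_best most cost-effective columns
    return int(rng.choice(best))                # random pick among them (line 6)

A = np.array([[1, 0, 1], [0, 1, 1]])
costs = np.array([1.0, 1.0, 1.5])
print(heuristic_column(A, costs, S=[]))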

7. Binary Cuckoo Search Big Data Algorithm

In this section, we describe the distributed version of the algorithm developed with Apache Spark. The key in each of the map transformations and collect actions used corresponds to the solution identifier, which will be denoted by idS. When the identifier is used as a key during the execution, it allows the calculations associated with a solution to always be executed in the same partition across the different stages and therefore to be more efficient regarding data transfer between different workers. In Figure 3, the flow diagram for the distributed algorithm is shown, and in Algorithm 6, the pseudocode of an iteration is detailed.

1: Function distributedBinaryCuckoo(lSol)
2: Input: list of solutions (lSol)
3: Output: iterated list of solutions (lSol)
4: lSol ← lSol.map(lambda Sol: (idS, iteratedSolution(Sol)))
5: lSol ← lSol.map(lambda Sol: (idS, K-meansTransition(Sol)))
6: lSol ← lSol.map(lambda Sol: (idS, repair(Sol)))
7: lSol ← lSol.collect()
8: return lSol

lSol contains the solution list to be iterated with the cuckoo search algorithm. Each of these solutions carries its position and velocity information. The first step is to iterate the solutions using the cuckoo search algorithm; this is done at line 4. The key corresponds to the idS particle identifier, and the returned value corresponds to the iterated solution Sol, in which the velocity values have been updated. The next step is to perform an iteration of the positions. For this, the K-means transition operator described in Section 6.2 and executed at line 5 of Algorithm 6 is used. With the K-means transition operator, the velocities obtained in the previous step are used to get the new binary values of the solution position. Subsequently, since there is a possibility that the iterated solutions do not meet the constraints, a repair operator is applied. This operator acts on the positions and updates them to fulfill the constraints. The detail of the repair algorithm is described in Section 6.3. Finally, the list of solutions is collected and stored for further analysis.
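A hedged PySpark sketch of one such iteration, with trivial stand-ins for the operators of Section 6:

from pyspark import SparkContext

# Stand-ins for the operators of Section 6 (illustrative only)
def iterated_solution(sol):  # Levy-flight move in the continuous space
    return sol

def kmeans_transition(sol):  # K-means binarization (Section 6.2)
    return sol

def repair(sol):             # feasibility repair (Section 6.3)
    return sol

def one_iteration(sc, solutions, num_partitions=8):
    # solutions: list of (idS, solution) pairs held on the driver
    rdd = sc.parallelize(solutions).partitionBy(num_partitions)
    rdd = rdd.mapValues(iterated_solution)
    rdd = rdd.mapValues(kmeans_transition)
    rdd = rdd.mapValues(repair)
    return rdd.collect()

if __name__ == "__main__":
    sc = SparkContext(appName="bcsba-sketch")
    sols = [(i, [0, 1, 0]) for i in range(100)]
    print(len(one_iteration(sc, sols)))
    sc.stop()

mapValues is used here instead of map because it preserves the partitioning by the idS key across the three stages, which matches the data-locality rationale above.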

8. Results

In this section, we present computational experiments with the proposed Spark binary cuckoo search algorithm. We test the algorithm on two classes of well-known problems:
(1) OR-Library benchmarks: this class includes 65 small and medium-sized randomly generated problems that are frequently used in the literature. They are available in the OR-Library and are described in Table 1.
(2) Railway scheduling problems: this class includes seven large-scale railway crew scheduling problems from Italian railways, also available in the OR-Library.

The binary cuckoo search big data algorithm was implemented in Python using the Spark libraries. It was executed on the Azure platform, using Spark 1.6.1 and Hadoop 2.4.1. To perform the statistical analysis in this study, the Wilcoxon signed-rank nonparametric test was used. For the results, each problem was executed 30 times.

The first stage corresponds to the configuration of the parameters used by the algorithm. To develop this activity, the methodology described in [3] was used. In this methodology, four standard measures are used: the worst case, the best case, the average case, and the average execution time. With these four measurements, the area under the radar chart curve is obtained to define the best configuration. The dataset used to determine the best configuration corresponds to the first problem of each group. The results are shown in Table 2. In this table, the range column corresponds to the evaluated ranges and the value column to the value that will be used. The values of the parameters γ, κ, and the iteration number correspond to those frequently used by the cuckoo search algorithm in the literature. The parameters α and K are specific to the K-means binarization method and refer to (10).
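As a sketch of how such an area can be computed, assuming the four normalized measures are placed on equally spaced radar axes:

import math

def radar_area(measures):
    # Area of the polygon drawn by the measures on equally spaced radar axes
    n = len(measures)
    return 0.5 * math.sin(2 * math.pi / n) * sum(
        measures[i] * measures[(i + 1) % n] for i in range(n))

# worst case, best case, average case, average time, normalized to [0, 1]
print(radar_area([0.8, 0.9, 0.85, 0.6]))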

8.1. Evaluation of Result Quality through the Variation of the Solution Number

The goal of this section is to evaluate the number of solutions to be used by the binary cuckoo search big data algorithm (BCSBA) with respect to the quality of the results. For the execution of this experiment, the other parameters used by the algorithm took the values described in the value column of Table 2. In Table 3, the results are shown for cases that consider 5, 10, 20, 50, 100, and 500 solutions using the OR-Library dataset. From the table, we observe that the results for the cases 50, 100, and 500 are superior to the rest; nevertheless, among themselves, they are very similar. Additionally, to assess the significance, the Wilcoxon test was performed, comparing BCSBA-5 with the other cases, obtaining that in all cases there is a significant difference. To complement the above analysis, violin charts were used to compare the distributions of the results through their shapes and interquartile ranges. The results are shown in Figure 4. The x-axis corresponds to the number of solutions used to solve the problem and the y-axis to the measure defined in (14). In the distributions, the superiority of the cases 50, 100, and 500 over the rest is appreciated. When we compare the cases 50, 100, and 500 among themselves, we see there is a similarity in the shape of their distributions as well as in the interquartile ranges.

In Table 4, the results for the railway scheduling problems are displayed. In this table, a behavior similar to the previous analysis is observed. The cases using 100 and 500 solutions obtained better results than the other cases. When comparing BCSBA-100 with BCSBA-500, similar results are observed.

8.2. Evaluation of Algorithm Convergence Time through the Solution Number

In this section, the convergence of the BCSBA algorithm with respect to the number of solutions is evaluated. For this analysis, the problems were grouped into 4 groups: the small group, which considers problems 4, 5, and 6; the medium group, which considers problems A, B, C, D, E, and F; problem group G; and problem group H. Table 5 and Figures 5 and 6 show the results for the different groups. In the table, it is observed that BCSBA converges better in the cases 100 and 500 than in the rest, the results being very similar between 100 and 500. In Figures 5 and 6, the x-axis corresponds to the average number of iterations and the y-axis to the average of the measure defined in (14). The data was collected every 10 iterations in the small and medium groups and every 20 iterations in the G and H groups. For the small and medium groups, although the convergence curves are better in the cases 100 and 500, the difference is quite small, which does not justify the increase in the number of solutions. For groups G and H, this difference becomes much more noticeable.

8.3. Evaluation of Algorithm Scalability through Core Number

This last experiment aims to evaluate the scalability of our algorithm when considering more than one core for the calculation. In Sections 8.1 and 8.2, we saw that increasing the number of solutions improves the results and decreases the number of iterations. However, the increase in the number of solutions has a computation cost. In this section, we evaluate whether the cost of computing can be diminished by the use of more processing cores.

For the Spark configuration, three parameters were considered: num-executors, which controls the number of executors requested; executor-cores, which controls the number of concurrent tasks an executor can run; and executor-memory, which corresponds to the memory per executor. For the proper use of an executor, it is recommended to use between 3 and 5 cores. The considered Spark settings are shown in Table 6.
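For illustration, one such setting could be expressed programmatically as follows; the concrete values are examples, not the exact rows of Table 6:

from pyspark import SparkConf, SparkContext

# Illustrative setting: 8 executors with 4 cores and 8 GB of memory each
conf = (SparkConf()
        .setAppName("bcsba")
        .set("spark.executor.instances", "8")   # num-executors
        .set("spark.executor.cores", "4")       # executor-cores
        .set("spark.executor.memory", "8g"))    # executor-memory
sc = SparkContext(conf=conf)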

In Figures 7, 8, and 9, we show the speedup charts for BCSBA using different numbers of solutions and considering between 1 and 16 executors. From the charts, it is observed that the best scalability is obtained for the case of 100 particles and in the problems G and H. For smaller problems, scalability is significantly reduced. The worst scalability was obtained for the algorithm using 5 particles. Another interesting fact is observed in the 500-particle chart, where scalability was better in the G and H problems than in the rest; however, the performance is lower than in the case of 100 particles.

9. Conclusions

In this work, we have presented a binary cuckoo search big data algorithm applied to different instances of the crew scheduling problem. We used an unsupervised learning method based on the K-means technique to perform the binarization. Later, to develop the distributed version of the algorithm, Apache Spark was used as the framework. The quality, convergence, and scalability of the results were evaluated in terms of the number of solutions used by the algorithm. It was found that quality, convergence, and scalability are affected by the number of solutions; however, they additionally depend on the problem being solved. In particular, it is observed that for medium-sized problems the effects are not very relevant, as opposed to large problems such as G and H, where the effect of the number of solutions is much more significant. On the other hand, when evaluating the scalability, we observe that it also depends on the number of solutions used by the algorithm and the size of the problems. The best performances were obtained for problems G and H considering between 20 and 500 solutions.

As future work, it would be interesting to apply the proposed algorithm to other NP-hard problems, with the intention of observing whether behaviours similar to those found for the CrSP arise. We also want to investigate the performance of autonomous search tuning algorithms [74] in big data environments. Finally, we want to explore the performance of other metaheuristics in big data frameworks.

Conflicts of Interest

The authors declare that they have no conflicts of interest.