Assessment of arsenic concentration in stream water using neuro fuzzy networks with factor analysis

https://doi.org/10.1016/j.scitotenv.2014.06.133

Highlights

  • A novel hybrid model (ANFIS-Gamma test) is used to estimate arsenic concentration.

  • Data scarcity is overcome by key factor selection and cross-validation technique.

  • Gamma test identifies 3 key input factors by evaluating factor occurrence frequency.

  • Impacts of key factors on arsenic variation are drawn by ANFIS membership degree.

  • The proposed method gives a quick and reliable way of estimating arsenic concentration.

Abstract

We propose a systematic approach to assessing arsenic concentration in a river in three steps: extraction of important factors by a nonlinear factor analysis; estimation of arsenic concentration by a neuro-fuzzy network; and assessment of the impact of the important factors on arsenic concentration through the membership degrees of the constructed neuro-fuzzy network. The arsenic-contaminated Huang Gang Creek in northern Taiwan is used as a case study. Results indicate that rainfall, nitrite nitrogen and temperature are the important factors, and that the proposed estimation model, ANFIS(GT), outperforms the two comparative models, improving RMSE by 50% over ANFIS(CC) and by 52% over ANFIS(all). Results also reveal that arsenic concentration is highest under lower temperature, higher nitrite nitrogen concentration and larger one-month antecedent rainfall, and lowest under higher temperature, lower nitrite nitrogen concentration and smaller one-month antecedent rainfall. These three selected factors are easy to collect. The proposed methodology is effective and can be adapted to other similar settings, with other parameters or study areas of interest, to reliably model water quality. It gives a quick and reliable way to estimate arsenic concentration and thus contributes to water environment management.

Introduction

The potential effects of arsenic (As) on human health and ecosystems have raised serious concerns. Effectively modeling the temporal variation of As concentration in a river remains a great challenge because the hydro-geochemical behavior of As is inherently complex and water samples are limited. Arsenic is a ubiquitous metalloid naturally present in rocks, soils and water. However, large amounts of As can be introduced into freshwater by anthropogenic activities such as the use of agricultural fertilizers, combustion of fossil fuels, and/or domestic waste incineration. Arsenic has become a major environmental and human-health concern because of its high bioavailability and toxicity. High As concentration in natural water has become a global problem, reported in the USA, China, Bangladesh, Taiwan, Mexico, Argentina, the Czech Republic, Canada, Japan and India (Mandal and Suzuki, 2002, Mohan and Pittman, 2007, Armienta et al., 2008, Kulp et al., 2008, Kuo and Chang, 2009, Miyashita et al., 2009, Concha et al., 2010, Novak et al., 2010). Millions of people in South and Southeast Asia routinely consume groundwater containing unsafe levels of As, with concentrations above the permissible limit of 10 μg/l (Polizzotto et al., 2008). In Yun-Lin County, Taiwan, Blackfoot disease (BFD) is known to be caused by directly drinking As-contaminated groundwater (Chang et al., 2010). Previous studies indicated that the rapid infiltration of surface water increases groundwater elevation in wet seasons and produces reducing conditions, inducing a reductive dissolution of As-bearing Fe (hydr)oxides (Wang et al., 2011). High As concentration is found in the shallow groundwater (< 60 m) of the southern Choushui River alluvial fan in Yun-Lin County, where reducing conditions created by infiltrated rainwater release As ions via the reductive dissolution of As-rich Fe oxy-hydroxides (Costa Goncalves et al., 2007, Wang et al., 2011).
In turn, nearby biota are exposed to As, and certain As compounds tend to accumulate in animal tissues (Peshut et al., 2008). The release of As into water, soil or biological media results from both geologic sources (e.g., As is a major element in different types of ore deposits) and anthropogenic sources (e.g., As derived from the percolation of fertilizer residues) (Amini et al., 2008). Arsenic is a trace element of particular interest for water quality assessment. A better understanding of As in river systems is essential for water quality modeling and water resources management. Nevertheless, hydro-geochemical processes are usually very complex and highly nonlinear, with high degrees of spatial and temporal variability. Modeling such complicated processes with unknown factors is a very challenging task.

There are basically two approaches to modeling: the theory-driven (conceptual and physically-based) approach and the data-driven (empirical and statistical) approach. Theory-driven models represent general internal sub-processes and physical mechanisms; their parameters are usually site-specific and are generally assumed to be lumped representations of basic characteristics. Building such water quality models, however, requires extensive surveys and vast information on various hydrological sub-processes to calibrate the models and compute final results. Data-driven models are commonly implemented with techniques developed in areas such as statistics, soft computing, computational intelligence and machine learning, and they explore and establish the relationships between historical inputs and outputs. Artificial neural networks (ANNs), a class of data-driven techniques, have been recognized as an alternative to traditional methods for modeling dynamic nonlinear systems whose input–output mechanisms may not be explicitly known. ANNs have been applied with success in many fields, such as hydrological systems (May and Sivakumar, 2009, Alvisi and Franchini, 2011, Rajaee, 2011, Adeloye et al., 2012, Cavalcante et al., 2013), groundwater issues (Nikolos et al., 2008, Chang et al., 2010), air pollution prediction (Heo and Kim, 2004), and water quality assessment (Singh et al., 2009).

An important strength of ANNs is their ability to infer complex relationships without prior knowledge of a system. Noori et al. (2009) indicated that high-dimensional, irrelevant, redundant or noisy variables might be meaningless and that the degree of influence of each variable might not be explicit in observed data sets. An appropriate selection of variables can enhance the effectiveness and the domain interpretability of an inference model, and reducing the number of variables can improve prediction performance and yield a more efficient predictor. Consequently, selecting the input variables most relevant to the outputs is a crucial step in ANN modeling, especially in environmental science, which usually has to handle extremely complex nonlinear relations between parameters with monitoring data of limited size. In addition to constructing estimation models, this study also focuses on selecting subsets of features that improve prediction performance and enhance the understanding of the underlying concepts in the models.

In feature selection, it has been recognized that a combination of individually good features does not necessarily lead to good performance. That is to say, “the m best features are not the best m features” (Peng et al., 2005). There are direct and indirect means to reduce the redundancy among features and select features with minimal redundancy. Factor analysis is a methodology that uses a subset of variables to describe the variability among the variables. It yields information on the interdependency among variables in a data set, which can then be used to reduce the dimension of the data set. Factor analysis is therefore a tool for turning high-dimensional problems into problems with simpler structures, and it has been implemented in hydrogeological systems, biosciences, and other applied sciences that deal with large numbers of variables (Love et al., 2004, Bandalos and Boehm-Kaufman, 2009). The Gamma test (GT) is a nonlinear factor selection tool that is widely used to assess the input–output relationship in a numerical data set in order to identify the best combination of model inputs (Tsui et al., 2002, Noori et al., 2011, Chang et al., 2013).
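To make the idea of the Gamma test concrete, the following is a minimal sketch of its usual near-neighbour formulation, not the implementation used in this study. For each of the first p nearest neighbours it computes the mean squared input distance and half the mean squared output difference, then takes the intercept of the regression line through these points as an estimate of the output noise variance. The function names, the default p = 10, and the exhaustive subset search are illustrative assumptions; the occurrence-frequency evaluation mentioned in the highlights is not reproduced here.

```python
from itertools import combinations

import numpy as np


def gamma_test(X, y, p=10):
    """Return (Gamma, slope) for inputs X of shape (M, d) and outputs y of shape (M,).

    Gamma is the intercept of the regression of gamma(k) on delta(k) and
    approximates the variance of the part of y that X cannot explain.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    M = len(y)
    p = min(p, M - 1)  # cannot use more neighbours than other samples

    # Pairwise squared Euclidean distances between input vectors.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

    # Column k holds the index of each point's (k+1)-th nearest neighbour.
    order = np.argsort(d2, axis=1)[:, :p]
    rows = np.arange(M)

    delta = np.empty(p)
    gamma = np.empty(p)
    for k in range(p):
        nb = order[:, k]
        delta[k] = np.mean(d2[rows, nb])          # mean squared input distance
        gamma[k] = np.mean((y[nb] - y) ** 2) / 2  # half mean squared output gap

    slope, intercept = np.polyfit(delta, gamma, 1)
    return intercept, slope


def rank_subsets(X, y, names, p=10):
    """Rank every non-empty input subset by |Gamma| (smaller is better).

    Exhaustive search is an assumption made for illustration; it is only
    practical for a small number of candidate inputs.
    """
    X = np.asarray(X, dtype=float)
    results = []
    for r in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), r):
            g, _ = gamma_test(X[:, list(cols)], y, p)
            results.append((abs(g), [names[c] for c in cols]))
    return sorted(results, key=lambda item: item[0])
```

In this sketch an input combination with a small |Gamma| is one from which a smooth model could, in principle, predict the output well. Inputs should normally be scaled to comparable ranges before the distances are computed.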

In this study, we propose a systematic process that incorporates the GT into the adaptive network-based fuzzy inference system (ANFIS) to form the ANFIS(GT) model. The model is intended to effectively identify the important factors affecting As concentration and to reliably estimate As concentration in the Huang Gang Creek (Taipei, Taiwan), a hot spring creek, based on limited hydrological and water quality data collected at environmental monitoring stations in the river basin. The behavior of the important factors is further analyzed to deliver a better understanding of the complex composition of As pollution in the creek.
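As a rough illustration of how an ANFIS turns inputs into an estimate and into membership degrees, the sketch below implements the forward pass of a generic first-order Sugeno-type ANFIS with Gaussian membership functions and grid-partitioned rules. The membership function shape, the number of membership functions per input, and all parameter values are assumptions made for illustration; this excerpt does not state the configuration used in the study, and parameter training is omitted.

```python
import itertools

import numpy as np


def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of x for a fuzzy set (assumed MF shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)


def anfis_forward(x, centers, sigmas, consequents):
    """Forward pass of a first-order Sugeno ANFIS.

    x           : (n_inputs,) input vector
    centers     : (n_inputs, n_mfs) membership function centres
    sigmas      : (n_inputs, n_mfs) membership function widths
    consequents : (n_rules, n_inputs + 1) linear coefficients and bias per rule,
                  with rules ordered as itertools.product over MF indices
    Returns (estimate, membership degrees, normalised firing strengths).
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    consequents = np.asarray(consequents, dtype=float)
    n_inputs, n_mfs = centers.shape

    # Layer 1: membership degree of each input in each of its fuzzy sets.
    mu = gaussian_mf(x[:, None], centers, sigmas)

    # Layer 2: rule firing strength = product of the selected memberships.
    rules = list(itertools.product(range(n_mfs), repeat=n_inputs))
    idx = np.arange(n_inputs)
    w = np.array([np.prod(mu[idx, list(rule)]) for rule in rules])

    # Layer 3: normalise firing strengths so they sum to one.
    w_bar = w / w.sum()

    # Layer 4: each rule's linear consequent evaluated at x.
    f = consequents[:, :-1] @ x + consequents[:, -1]

    # Layer 5: overall output is the firing-strength-weighted sum.
    return float(np.dot(w_bar, f)), mu, w_bar
```

With three inputs and two fuzzy sets per input (for example "low" and "high"), this layout gives eight rules. The returned membership degrees are the quantities one would inspect to relate input conditions to the estimated output.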

Methodologies

The purpose of this study is to model As concentration from limited water samples with inherently complex relations by selecting a subset of features useful for building a good predictor. The proposed approach comprises two parts: extracting the important factors affecting As concentration with the GT; and configuring an estimation model of As concentration with the ANFIS coupled with cross-validation techniques. The architecture of the proposed approach is illustrated in Fig. 1, and

Study material

According to previous investigations, the Huang Gang Creek has the highest As concentration in the Tamsui River basin in northern Taiwan (Fig. 2).

As concentration ranges from 1 to 30 μg/l in this area. The creek is clearly As-contaminated, with concentrations higher than the world average value (0.62 μg/l; Gaillardet et al., 2003) and even exceeding the World Health Organization's (WHO) drinking water standard (10 μg/l) (WHO, 2011) for certain periods (such as drought periods and/or irregular

Results and discussion

This study explores the modeling of As concentration with the ANFIS under limited water samples. The results of the ANFIS(GT) model and the two comparative models are discussed as follows.
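The abstract reports RMSE improvements of 50% and 52% for ANFIS(GT) over the two comparative models. A minimal sketch of how such a comparison can be computed is given here, assuming that "improvement" means the relative reduction in RMSE with respect to the comparative model; the function names are hypothetical and no values from the study are reproduced.

```python
import numpy as np


def rmse(observed, estimated):
    """Root mean square error between observed and estimated values."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((observed - estimated) ** 2)))


def rmse_improvement(rmse_reference, rmse_proposed):
    """Relative RMSE reduction (%) of a proposed model over a reference model.

    Assumed definition: 100 * (reference - proposed) / reference.
    """
    return 100.0 * (rmse_reference - rmse_proposed) / rmse_reference
```

Under this definition, a proposed model whose RMSE is half that of the reference model shows a 50% improvement.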

Conclusion

There is a growing need to suitably model As concentration in watersheds by efficiently determining important factors for environmental resources management. Nevertheless, it is difficult to accurately model the complex nonlinear relationship between As and hydrological variables/water quality parameters; in particular, As has only a weak linear relationship with water quality parameters in the study area (absolute correlation coefficients range between 0.062 and 0.246). This study proposes

Acknowledgments

This study was partially supported by the Ministry of Science and Technology, Taiwan, ROC (Grant No. 101-2923-B-002-001-MY3) in collaboration with the ‘Agence Nationale de la Recherche’, project TWIN-RIVERS (ANR-11-IS56-0003). The authors sincerely thank Professor Filip M. G. Tack and the anonymous reviewers for their valuable comments and constructive suggestions.

References (51)

  • Y.H. Kao et al.

    Hydrochemical, mineralogical and isotopic investigation of arsenic distribution and mobilization in the Guandu wetland of Taiwan

    J Hydrol

    (2013)
  • D. Love et al.

    Factor analysis as a tool in groundwater quality management: two southern African case studies

    Phys Chem Earth

    (2004)
  • B.K. Mandal et al.

    Arsenic round the world: a review

    Talanta

    (2002)
  • D.B. May et al.

    Prediction of urban stormwater quality using artificial neural networks

    Environ Model Software

    (2009)
  • S. Miyashita et al.

    Rapid determination of arsenic species in freshwater organisms from the arsenic-rich Hayakawa River in Japan using HPLC-ICP-MS

    Chemosphere

    (2009)
  • D.D. Mohan et al.

    Arsenic removal from water/wastewater using adsorbents—a critical review

    J Hazard Mater

    (2007)
  • R. Noori et al.

    Results uncertainty of solid waste generation forecasting by hybrid of wavelet transform-ANFIS and wavelet transform-neural network

    Expert Syst Appl

    (2009)
  • R. Noori et al.

    Uncertainty analysis of developed ANN and ANFIS models in prediction of carbon monoxide daily concentration

    Atmos Environ

    (2010)
  • R. Noori et al.

    Assessment of input variables determination on the SVM model performance using PCA, Gamma test, and forward selection techniques for monthly stream flow prediction

    J Hydrol

    (2011)
  • M. Novak et al.

    Increasing arsenic concentrations in runoff from 12 small forested catchments (Czech Republic, Central Europe): patterns and controls

    Sci Total Environ

    (2010)
  • P.J. Peshut et al.

    Arsenic speciation in marine fish and shellfish from American Samoa

    Chemosphere

    (2008)
  • T. Rajaee

    Wavelet and ANN combination model for prediction of daily suspended sediment load in rivers

    Sci Total Environ

    (2011)
  • R. Rodriguez et al.

    Groundwater arsenic variations: the role of local geology and rainfall

    Appl Geochem

    (2004)
  • C. Shu et al.

    Regional flood frequency analysis at ungauged sites using the adaptive neuro-fuzzy inference system

    J Hydrol

    (2008)
  • K.P. Singh et al.

    Artificial neural network modeling of the river water quality—a case study

    Ecol Model

    (2009)