We present cna, a package for performing Coincidence Analysis (CNA). CNA is a configurational comparative method for the identification of complex causal dependencies—in particular, causal chains and common cause structures—in configurational data. After a brief introduction to the method’s theoretical background and main algorithmic ideas, we demonstrate the use of the package by means of an artificial and a real-life data set. Moreover, we outline planned enhancements of the package that will further increase its applicability.
Configurational comparative methods (CCMs) subsume techniques for the identification of complex causal dependencies in configurational data using the theoretical framework of Boolean algebra and its various extensions (Rihoux and Ragin 2009). For example, Qualitative Comparative Analysis (QCA; Ragin 1987, 2000, 2008), hitherto the most prominent representative of CCMs, has been applied in areas as diverse as business administration (e.g., Chung 2001), environmental science (Vliet et al. 2013), evaluation (Cragun et al. 2014), political science (Thiem 2011), public health (Longest and Thoits 2012) and sociology (Crowley 2013). Besides three stand-alone programs based on graphical user interfaces, three R packages for QCA are currently available, each with a different scope of functionality: QCA (Thiem and Duşa 2013b,c; Duşa and Thiem 2014), QCA3 (Huang 2014) and SetMethods (Quaranta 2013; an add-on package to Schneider and Wagemann 2012).
A novel technique called Coincidence Analysis (CNA) has recently joined the family of CCMs (Baumgartner 2009a,b, 2013b). Like QCA, CNA searches for rigorously minimized sufficient and necessary conditions of causally modeled outcomes, and it implements the same regularity theory of causation as QCA, that is, the theory most prominently advanced by Mackie (1974). Contrary to QCA, however, CNA can treat any number of factors in a processed data set as endogenous (outcomes), and it eliminates redundancies from sufficient and necessary conditions not by means of Quine-McCluskey optimization (Quine 1959; McCluskey 1965), but by means of an optimization algorithm that is custom-built for causal modeling. As a direct consequence of these differences, CNA can identify common cause and causal chain structures. Moreover, the algorithm does not need to be told which factors are endogenous and which ones exogenous; it can infer that from the data. What is more, limited data diversity does not force CNA to resort to counterfactual additions to the data. Finally, while the QCA programs fs/QCA (Ragin and Davey 2014), fuzzy (Longest and Vaisey 2008), Tosmana (Cronqvist 2011) and Kirq (Reichert and Rubinson 2014) often fail to find all data-fitting models (Thiem and Duşa 2013a; cf. Baumgartner and Thiem 2014; Thiem 2014b), the R implementation of CNA presented in this paper (cna; Ambuehl et al. 2015) not only ensures, as does QCA, that all single-outcome models are identified but additionally recovers the whole space of multiple-outcome models that fit the data.
After an introduction to the theoretical and algorithmic background of CNA, we demonstrate the potential of the cna R package by means of an artificial and a real-life data set. In the final section, we outline planned enhancements of cna that will further increase its applicability.
CCMs search for causal dependencies as defined by so-called regularity theories of causation, whose development dates back to David Hume (1711-1776) and John Stuart Mill (1806-1873). By implementing techniques of Boolean algebra, modern regularity theories spell out the notion of causal relevance in terms of redundancy-free (minimized) sufficiency and necessity relations among the elements of analyzed sets of factors (Mackie 1974; Graßhoff and May 2001; Baumgartner 2008, 2013a).
The crucial component of the regularity theoretic definiens of causal
relevance is the notion of a minimal theory. A minimal theory of a
factor
A minimal theory represents the causally interpretable dependencies of
sufficiency and necessity among the factors contained in a data set
CNA aims to infer minimal theories from
Similarly, to determine whether a complex necessary condition
CNA does not presuppose that certain factors in
Recovered minimal theories of the elements of
As causally analyzed data tend to be noisy, that is, confounded by
uncontrolled (unmeasured) causes of endogenous factors, it often happens
that no configuration of factors is strictly sufficient or necessary for
a given
The cna package by Ambuehl et al. (2015) implements the methodological protocol of CNA as sketched above. For more details on the background assumptions of CNA, its minimization algorithm, and its relation to other configurational methods such as QCA, we refer interested readers to Baumgartner (2009a).
In the following, we illustrate the main steps in using the cna
package. First, we employ a hypothetical data set from Baumgartner (2009a)
to investigate the causal dependencies among five factors hypothesized
to constitute a causal structure behind the overall level of education
in western democratic countries. These five factors are “strong unions”
(
Case | U | D | L | G | E |
1 | 1 | 1 | 1 | 1 | 1 |
2 | 1 | 1 | 1 | 0 | 1 |
3 | 1 | 0 | 1 | 1 | 1 |
4 | 1 | 0 | 1 | 0 | 1 |
5 | 0 | 1 | 1 | 1 | 1 |
6 | 0 | 1 | 1 | 0 | 1 |
7 | 0 | 0 | 0 | 1 | 1 |
8 | 0 | 0 | 0 | 0 | 0 |
The cna package comes with an integrated bundle of six data sets from
various areas of the social sciences. That bundle also includes the data
in Table 1 as the data frame d.educate. Accordingly, the first step to
causally model Table 1 by means of CNA is to load the cna package along
with the d.educate data.
> library(cna)
> data(d.educate)
The heart of the cna package is the cna() function, which identifies and
minimizes dependencies of sufficiency and necessity in the data. The
data can be given to cna() either as a Boolean data frame or as a truth
table as produced by the truthTab() function. Essentially, truthTab()
merges multiple rows of a data frame featuring the same configuration
into one row, such that each row of the resulting truth table
corresponds to one determinate configuration. The number of occurrences
(cases) and an enumeration of the cases are saved as attributes n and
cases, respectively. As Table 1 does not contain multiple rows with
identical configurations, applying truthTab() is unnecessary and we can
pass d.educate directly to cna(). Moreover, let us assume that we have
no prior knowledge about the underlying causal structure, such that we
cannot additionally supply a causal ordering. The following is the
default output returned by cna().
> cna(d.educate)
--- Coincidence Analysis (CNA) ---
Factors: U, D, L, G, E
Minimally sufficient conditions:
--------------------------------
Outcome D:
condition consistency coverage
L*u -> D 1.000 0.500
E*g*u -> D 1.000 0.250
Outcome E:
condition consistency coverage
L -> E 1.000 0.857
D -> E 1.000 0.571
G -> E 1.000 0.571
U -> E 1.000 0.571
Outcome G:
condition consistency coverage
d*E*u -> G 1.000 0.250
E*l -> G 1.000 0.250
Outcome L:
condition consistency coverage
D -> L 1.000 0.667
U -> L 1.000 0.667
E*g -> L 1.000 0.500
Outcome U:
condition consistency coverage
d*L -> U 1.000 0.500
d*E*g -> U 1.000 0.250
Atomic solution formulas:
-------------------------
Outcome E:
condition consistency coverage
D + G + U <-> E 1.000 1.000
G + L <-> E 1.000 1.000
Outcome L:
condition consistency coverage
D + U <-> L 1.000 1.000
Complex solution formulas:
--------------------------
condition consistency coverage
(D + G + U <-> E) * (D + U <-> L) 1.000 1.000
(G + L <-> E) * (D + U <-> L) 1.000 1.000
First, cna() lists all minimally sufficient conditions of all factors
in d.educate; second, it reports the atomic solution formulas for the
factors that can be modeled as endogenous; and third, it specifies the
resulting complex solution formulas. All solution types come with
corresponding consistency and coverage scores. In the case of Table 1,
these scores reach their maximal values for both atomic and complex
solution formulas. Thus, the d.educate data are as good as
configurational data can possibly get.
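The consistency and coverage scores in the output above can be checked by hand. The following base R sketch re-enters Table 1 manually (column order U, D, L, G, E, as listed in the cna() output) and uses the standard crisp-set definitions: consistency is the share of cases instantiating the condition that also instantiate the outcome, and coverage is the share of cases instantiating the outcome that also instantiate the condition. The helper `con_cov()` is a hypothetical name introduced for illustration; it is not part of the cna package.

```r
# Table 1 re-entered by hand (columns U, D, L, G, E)
educate <- data.frame(
  U = c(1, 1, 1, 1, 0, 0, 0, 0),
  D = c(1, 1, 0, 0, 1, 1, 0, 0),
  L = c(1, 1, 1, 1, 1, 1, 0, 0),
  G = c(1, 0, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0)
)

# Hypothetical helper (not a cna function): crisp-set consistency and
# coverage of "cond -> outcome", both given as 0/1 vectors over the cases.
con_cov <- function(cond, outcome) {
  both <- sum(cond == 1 & outcome == 1)
  c(consistency = both / sum(cond == 1),
    coverage    = both / sum(outcome == 1))
}

# L -> E, reported above with consistency 1.000 and coverage 0.857
le <- with(educate, con_cov(L, E))

# L*u -> D, where lowercase u stands for the absence of U
lud <- with(educate, con_cov(L * (1 - U), D))

# Minimality check: neither conjunct of L*u is sufficient for D on its own
l_alone <- with(educate, con_cov(L, D))
u_alone <- with(educate, con_cov(1 - U, D))

print(round(rbind("L -> E"   = le,
                  "L*u -> D" = lud,
                  "L -> D"   = l_alone,
                  "u -> D"   = u_alone), 3))
```

The first two rows reproduce the scores listed by cna() for these conditions. The last two rows have consistency below 1, which illustrates why L*u counts as minimally sufficient for D: dropping either conjunct destroys sufficiency.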
The above results show that the causal structure generating
Table 1 features two endogenous factors,
viz. “strong left parties” (cna()
infers that the d.educate
data can be modeled in
terms of the two complex structures depicted in
Figure 1.
Graph 1(left) represents a common cause
structure, in which “high level of disparity” (
Figure 1 (image not reproduced): common cause structure (left) and
causal chain (right) for the d.educate data.
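That both complex solution formulas fit Table 1 equally well can be verified directly. The base R sketch below re-enters Table 1 by hand and tests whether each biconditional from the atomic solution formulas holds in every case. The helper `holds()` is a hypothetical name used for illustration only. On our reading, (D + G + U <-> E) * (D + U <-> L) corresponds to the common cause structure and (G + L <-> E) * (D + U <-> L) to the causal chain.

```r
# Table 1 re-entered by hand (columns U, D, L, G, E)
educate <- data.frame(
  U = c(1, 1, 1, 1, 0, 0, 0, 0),
  D = c(1, 1, 0, 0, 1, 1, 0, 0),
  L = c(1, 1, 1, 1, 1, 1, 0, 0),
  G = c(1, 0, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0)
)

# Hypothetical helper: TRUE if "lhs <-> rhs" holds in every case
holds <- function(lhs, rhs) all((lhs == 1) == (rhs == 1))

# For 0/1 vectors, pmax() acts as Boolean disjunction ("+")
fits <- with(educate, c(
  "D + U <-> L"     = holds(pmax(D, U), L),
  "G + L <-> E"     = holds(pmax(G, L), E),
  "D + G + U <-> E" = holds(pmax(D, G, U), E)
))
print(fits)
```

All three biconditionals hold in all eight cases, which is consistent with the consistency and coverage scores of 1.000 reported for both complex solution formulas.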
This subsection illustrates further functionalities of the cna package
on the basis of a real-life data set. To this end, we choose the study
by Lam and Ostrom (2010), who analyze the effects of an irrigation experiment in the
course of development interventions on the Indrawati River watershed in
the central hills of Nepal. Among other things, the authors investigate
the causal relevance of five exogenous factors on “persistent
improvement in water adequacy at the tail end in winter” (d.irrigate
.
> data(d.irrigate)
> d.irrigate
A R F L C W
1 0 1 0 1 1 1
2 0 1 0 1 1 0
3 0 1 1 1 1 1
.. . . . . . .
<rest omitted>
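Unlike Table 1, a real-life data frame may contain several cases with the same configuration; the condition() output further below, for instance, lists cases 3 and 4 in a single row. The merging of identical rows described for truthTab() can be sketched in base R as follows. The data frame `toy` holds hypothetical values (it is not d.irrigate), and the sketch stores the counts and case labels as ordinary columns rather than as attributes, so it is an illustration of the idea and not the package's implementation.

```r
# Toy Boolean data frame with hypothetical values (not d.irrigate)
toy <- data.frame(
  A = c(1, 1, 0, 0, 1),
  B = c(0, 0, 1, 1, 0),
  W = c(1, 1, 0, 0, 1)
)

# One key per case describing its configuration, e.g. "1 0 1"
key <- do.call(paste, toy)

# Keep one row per distinct configuration, in order of first appearance
tt <- toy[!duplicated(key), , drop = FALSE]

# Case numbers grouped by configuration, in the same order as the rows of tt
idx <- split(seq_len(nrow(toy)), factor(key, levels = unique(key)))

tt$n     <- unname(lengths(idx))                                     # cases per configuration
tt$cases <- unname(vapply(idx, paste, character(1), collapse = ","))  # enumeration of cases
rownames(tt) <- NULL
print(tt)
```

Each row of `tt` now corresponds to one determinate configuration, with `n` giving the number of cases and `cases` listing them.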
Lam and Ostrom (2010) assume that cna()
by
means of the argument ordering
, which takes a list of character
vectors referring to the factors in the data frame as input. In case of
d.irrigate
, the intended ordering is this:
ordering = list(c("A", "R", "F", "L", "C"), "W")
. It determines that
cov
) to csf()
.
> sol1 <- cna(d.irrigate, ordering = list(c("A", "R", "F", "L", "C"), "W"), cov = 0.9)
> csf(sol1)
condition consistency coverage
1 (a + f*R + L <-> C) * (A*C + a*f*r + F*R + l*R <-> W) 1.000 0.917
2 (a + f*R + L <-> C) * (A*C + a*l + F*R <-> W) 1.000 0.917
3 (a + f*R + L <-> C) * (A*C + C*f*r + F*R + l*R <-> W) 1.000 0.917
4 (a + f*R + L <-> C) * (A*C + C*l + F*R <-> W) 1.000 0.917
5 (a + f*R + L <-> C) * (a*f*r + A*L + F*R + l*R <-> W) 1.000 0.917
6 (a + f*R + L <-> C) * (a*f*r + A*R + F*R + l*R <-> W) 1.000 0.917
7 (a + f*R + L <-> C) * (a*l + A*L + F*R + l*R <-> W) 1.000 0.917
8 (a + f*R + L <-> C) * (a*l + A*R + F*R <-> W) 1.000 0.917
9 (a + f*R + L <-> C) * (A*L + C*f*r + F*R + l*R <-> W) 1.000 0.917
10 (a + f*R + L <-> C) * (A*L + C*l + F*R <-> W) 1.000 0.917
11 (a + f*R + L <-> C) * (A*R + C*f*r + F*R + l*R <-> W) 1.000 0.917
12 (a + f*R + L <-> C) * (A*R + C*l + F*R <-> W) 1.000 0.917
This output of cna()
shows that not only cna()
returns one atomic solution formula for
To generate models for negative outcomes, cna()
provides the argument
notcols
, which takes a character vector of factors to be negated as
input. In the following analysis, we set cov
to print()
function, which provides arguments determining the number of solutions
to print (nsolutions
) and what elements of the solution to print
(what
). The what
argument takes a character vector as input, where
"t"
prints the truth table, "m"
the minimally sufficient conditions,
"a"
the atomic solution formulas, "c"
the complex solution formulas,
and "all"
returns all solution elements.
> sol2 <- cna(d.irrigate, ordering = list(c("A", "R", "F", "L", "c"), "w"),
notcols = c("C", "W"), cov = 0.66)
> print(sol2, nsolutions = 3, what = "a,c")
--- Coincidence Analysis (CNA) ---
Causal ordering:
A, R, F, L, c < w
Atomic solution formulas:
-------------------------
Outcome R:
condition consistency coverage
A*C + f*L <-> R 1.000 0.667
A*F + f*L <-> R 1.000 0.667
A*L + f*L + F*l <-> R 1.000 0.667
Outcome w:
condition consistency coverage
A*r + F*r <-> w 1.000 0.667
A*r + L*r <-> w 1.000 0.667
c*f + F*r <-> w 1.000 0.667
... (total no. of formulas: 6)
Complex solution formulas:
--------------------------
condition consistency coverage
(A*C + f*L <-> R) * (A*r + F*r <-> w) 1.000 0.667
(A*F + f*L <-> R) * (A*r + F*r <-> w) 1.000 0.667
(A*L + f*L + F*l <-> R) * (A*r + F*r <-> w) 1.000 0.667
... (total no. of formulas: 18)
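The effect of negating factors can be illustrated without the package. The base R sketch below takes the first three rows of d.irrigate as printed earlier, replaces the values of C and W by their complements, and switches the column names to lowercase, mirroring the lowercase c and w in the output above. The helper `negate_cols()` is a hypothetical name; how notcols is implemented inside cna() is an assumption not confirmed by the text.

```r
# First three rows of d.irrigate as printed earlier in this section
irr <- data.frame(
  A = c(0, 0, 0),
  R = c(1, 1, 1),
  F = c(0, 0, 1),
  L = c(1, 1, 1),
  C = c(1, 1, 1),
  W = c(1, 0, 1)
)

# Hypothetical helper (not a cna function): complement the given 0/1
# columns and rename them to lowercase to mark them as negated.
negate_cols <- function(x, cols) {
  x[cols] <- lapply(x[cols], function(v) 1 - v)
  names(x)[match(cols, names(x))] <- tolower(cols)
  x
}

neg <- negate_cols(irr, c("C", "W"))
print(neg)
```

In the resulting data frame, a value of 1 in column w marks a case in which W is absent, so models for w are models for the negative outcome.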
Finally, the condition() function helps to inspect the properties of
sufficient and necessary conditions in a data frame, most notably of the
minimally sufficient and necessary conditions that appear in solution
formulas returned by cna(). It takes a vector of strings specifying
Boolean functions as input, reveals which configurations and cases
instantiate a given condition or solution, and lists consistency,
coverage, as well as unique coverage scores (cf. Ragin 2008, 63–68).
Below, we investigate the properties of the first atomic solution for
outcome w.
> condition("A*r + F*r <-> w", d.irrigate)
A*r+F*r -> w :
A*r+F*r w n cases
0 0 1 1
0 1 1 2
0 0 2 3,4
0 0 2 5,6
0 0 2 7,8
0 0 1 9
0 0 1 10
1 1 1 11
1 1 1 12
0 0 1 13
0 0 1 14
0 0 1 15
Consistency: 1.000 (2/2)
Coverage: 0.667 (2/3)
Total no. of cases: 15
Unique Coverages: A*r : 0.333 (1/3)
F*r : 0.333 (1/3)
The first two columns of the table returned by condition()
indicate
the configurations instantiating (1
) and not instantiating (0
) the
disjunction
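The summary scores below the condition() table follow from its rows once each row is weighted by its number of cases. The base R sketch below re-enters the three numeric columns of the table by hand (value of A*r + F*r, value of w, and n) and recomputes consistency, coverage and the total number of cases. The unique coverages are not recomputed, because they require the individual factor values that are omitted from the printed data.

```r
# Rows of the condition() table above: value of the disjunction A*r + F*r,
# value of the outcome w, and number of cases n per configuration
tab <- data.frame(
  cond = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
  w    = c(0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
  n    = c(1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1)
)

# Cases instantiating both the condition and the outcome
both <- with(tab, sum(n[cond == 1 & w == 1]))

consistency <- both / with(tab, sum(n[cond == 1]))  # share of condition cases with w
coverage    <- both / with(tab, sum(n[w == 1]))     # share of w cases with the condition
total       <- sum(tab$n)                           # total number of cases

print(round(c(consistency = consistency, coverage = coverage, total = total), 3))
```

The results agree with the values printed by condition(): a consistency of 2/2, a coverage of 2/3, and 15 cases in total.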
We have presented cna, an R package implementing Coincidence Analysis (CNA), which is a method for the identification of multi-outcome structures in configurational data. CNA differs from QCA, the standard method of configurational causal modeling, not only by relaxing the single-outcome restriction but also by not drawing on Quine-McCluskey optimization for the elimination of redundancies from sufficient and necessary conditions. Instead, CNA employs its own minimization algorithm that is custom-built for causal modeling purposes.
At this stage of development, cna still requires bivalent variables. Planned future enhancements include the capability to process multivalent factors that generate crisp sets (Thiem 2013) and bivalent factors with fuzzy sets (Smithson and Verkuilen 2006). Possibilities to merge these constructs in multivalent factors with fuzzy sets, as has recently been suggested in the context of QCA (Thiem 2014c), will be explored as well. In this connection, aspects of alternative procedures proposed in the context of minimization with fuzzy sets may be incorporated where appropriate (Eliason and Stryker 2009). Finally, functionality for sensitivity diagnostics that facilitates robustness tests is envisaged (Thiem 2014a).
Complex causal structures are communicated most effectively to readers of scientific articles in the form of graphs rather than formulas. This is all the more true for multivalent factors. In this regard, functionality that translates cna solutions into corresponding graphs enjoys high priority on the list of future enhancements.
This work was generously supported by the Swiss National Science Foundation, grant number PP00P1_144736.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Baumgartner & Thiem, "Identifying Complex Causal Dependencies in Configurational Data with Coincidence Analysis", The R Journal, 2015
BibTeX citation
@article{RJ-2015-014, author = {Baumgartner, Michael and Thiem, Alrik}, title = {Identifying Complex Causal Dependencies in Configurational Data with Coincidence Analysis}, journal = {The R Journal}, year = {2015}, note = {https://rjournal.github.io/}, volume = {7}, issue = {1}, issn = {2073-4859}, pages = {170-184} }