This document walks the reader through a simple example, from setting up project directories, through loading data and performing analyses. We assume that the package has been installed in an environment with all the necessary components (as described in Installation.md). As an example, we will load the xml results file from Georgia in the repository at tests/000_data_for_pytest/2020-General/Georgia/GA_detail_20201120_1237.xml.
The package offers a fair amount of flexibility in the directory structures used. For this sample session, we assume the user will call the program from a working directory with the following structure and files:
.
+-- input_directory
| +-- Georgia
| | +-- GA_detail_20201120_1237.xml
+-- archive_directory
+-- reports_and_plots_directory
+-- run_time.ini
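The empty directories in this layout can be created by hand or with a few lines of Python. The following is a minimal sketch using only the standard library; the function name make_working_directory is hypothetical and not part of the package, and the results file and run_time.ini still need to be put in place separately.

```python
from pathlib import Path
from typing import List


def make_working_directory(root: Path) -> List[Path]:
    """Create the empty directories used in this sample session under root."""
    subdirs = [
        root / "input_directory" / "Georgia",
        root / "archive_directory",
        root / "reports_and_plots_directory",
    ]
    for subdir in subdirs:
        # parents=True creates input_directory along with Georgia;
        # exist_ok=True makes the call safe to repeat
        subdir.mkdir(parents=True, exist_ok=True)
    return subdirs
```

For example, make_working_directory(Path.cwd()) would create the three directories under the current directory.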
Note that during processing the package uses information from the repository. In other words, the repository contains not only the code necessary to compile the package, but also files called by the package as it functions -- files with information about jurisdictions, mungers, results and result files. So the user will need to know the absolute path to the repository content root src. Below we will call this path <path/to/src>. The file run_time.ini in the working directory should contain the following:
[electiondata]
results_dir=input_directory
archive_dir=archive_directory
repository_content_root=<path/to/src>
reports_and_plots_dir=reports_and_plots_directory
[postgresql]
host=localhost
port=5432
dbname=ga_test
user=postgres
password=
You may wish to check that these postgresql credentials will work on your system via the command psql -h localhost -p 5432 -U postgres postgres. If this command fails, or if it prompts you for a password, you will need to find the correct connection parameters specific to your postgresql instance. (Note that the dbname parameter is arbitrary, and determines only the name of the postgresql database created by the package.)
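If you prefer not to retype the connection parameters, the check command can be assembled from the [postgresql] section itself. This is a sketch using the standard configparser module; the helper name psql_check_command is hypothetical, and the sample text below stands in for your own run_time.ini.

```python
import configparser

# Stand-in for the [postgresql] section of run_time.ini
SAMPLE_POSTGRESQL_SECTION = """
[postgresql]
host=localhost
port=5432
dbname=ga_test
user=postgres
password=
"""


def psql_check_command(ini_text: str) -> list:
    """Build the psql command line that tests the host, port and user."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    pg = parser["postgresql"]
    # Connect to the default "postgres" database, because the database
    # named by dbname may not exist until the package creates it.
    return ["psql", "-h", pg["host"], "-p", pg["port"], "-U", pg["user"], "postgres"]


print(" ".join(psql_check_command(SAMPLE_POSTGRESQL_SECTION)))
# prints: psql -h localhost -p 5432 -U postgres postgres
```

The returned list could be passed to subprocess.run if you want to run the check from Python rather than from a shell.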
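Copy the results file GA_detail_20201120_1237.xml from the repository (tests/000_data_for_pytest/2020-General/Georgia/GA_detail_20201120_1237.xml) into input_directory/Georgia. Then, from the working directory, load the data in Python: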
>>> import electiondata as ed
>>> ed.load_or_reload_all()
After this command executes successfully, you will see a database in your postgres instance whose name matches the value of dbname given in run_time.ini. The file structure will now be:
.
+-- archive_directory
| +-- Georgia_2020-11-10
| | +-- GA_detail_20201120_1237.xml
+-- input_directory
+-- reports_and_plots_directory
| +-- compare_to_Georgia_xxxx_xxxx
| | +-- parameters.txt
| | +-- not_found_in_db.tsv
| | +-- ok.tsv
| | +-- wrong.tsv
| +-- load_or_reload_all_xxxx_xxxx
| | +-- Georgia_jurisdiction_dictionary.txt.warnings
+-- run_time.ini
Note that the Georgia results folder has been moved to the archive directory, with a date stamp from the date of the results file. The reports_and_plots_directory contains:
- a subdirectory compare_to_Georgia_xxxx_xxxx with the results of the comparison to the reference results
- a subdirectory load_or_reload_all_xxxx_xxxx with warnings from the data uploading. You may wish to look at the file Georgia_jurisdiction_dictionary.txt.warnings, which lists the contests present in the xml results file that were not recognized during processing. If these contests and their candidates had been added to the Georgia-specific information in the repository (see "Creating Jurisdiction files" below, or in the User Guide), these warnings would not appear.
To pull data out, you will need to use the Analyzer class:
>>> an = ed.Analyzer()
You can export results in tabular form:
>>> an.export_election_to_tsv("GA_results.tsv", "2020 General")
The file GA_results.tsv containing the tab-separated results will be created in the working directory:
.
+-- archive_directory
| +-- Georgia_2020-11-10
| | +-- GA_detail_20201120_1237.xml
+-- GA_results.tsv
+-- input_directory
+-- reports_and_plots_directory
| +-- compare_to_Georgia_xxxx_xxxx
| | +-- parameters.txt
| | +-- not_found_in_db.tsv
| | +-- ok.tsv
| | +-- wrong.tsv
| +-- load_or_reload_all_xxxx_xxxx
| | +-- Georgia_jurisdiction_dictionary.txt.warnings
+-- run_time.ini
The program (v.2.0.1 and higher) can also produce a string of data in the NIST Common Data Format Version 2.0, in either json or xml format:
>>> an.export_nist_xml_as_string("2020 General", "Georgia")
>>> an.export_nist_json_as_string("2020 General", "Georgia")
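Because these methods return strings rather than writing files, you may want to check the string and save it yourself. The sketch below, using only the standard library, parses the text before writing it; the function name save_if_well_formed is hypothetical. It checks only that the text is well-formed JSON or XML, not that it conforms to the NIST schema.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def save_if_well_formed(text: str, path: Path) -> None:
    """Parse text as JSON or XML (chosen by the file suffix), then write it."""
    if path.suffix == ".json":
        json.loads(text)      # raises json.JSONDecodeError if malformed
    elif path.suffix == ".xml":
        ET.fromstring(text)   # raises ET.ParseError if malformed
    else:
        raise ValueError(f"unexpected suffix: {path.suffix}")
    path.write_text(text, encoding="utf-8")


# Hypothetical usage with the Analyzer instance created earlier:
# save_if_well_formed(an.export_nist_xml_as_string("2020 General", "Georgia"),
#                     Path("GA_nist.xml"))
```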
To draw pictures automatically, you will need orca installed on your system. Plots will be exported to the reports_and_plots_directory and may also appear in a browser window.
If orca is not installed, the system will not automatically create pictures. Note that in any case the routines return text information sufficient to create plots in any system you may wish to use.
You can create scatter plots of results by county. For example, create a jpeg comparing Biden's vote totals to Trump's vote totals with:
>>> biden_v_trump = an.scatter("Georgia","2020 General","Candidate total","Joseph R. Biden","2020 General","Candidate total","Donald J. Trump",fig_type="jpeg")
Or compare Biden's votes on election day with votes on absentee mail ballots:
>>> biden_eday_v_abs = an.scatter("Georgia","2020 General","Candidate election-day","Joseph R. Biden","2020 General","Candidate absentee-mail","Joseph R. Biden",fig_type="jpeg")
In each case the value returned is a string (following the python dictionary format) with all the information necessary to draw the plot. To return such a string without calling orca, simply omit the fig_type argument, e.g.,
>>> biden_eday_v_abs = an.scatter("Georgia","2020 General","Candidate election-day","Joseph R. Biden","2020 General","Candidate absentee-mail","Joseph R. Biden")
The arguments (in order) are:
- the jurisdiction ("Georgia")
- three items to define the horizontal-axis count:
  - the election ("2020 General")
  - the count category, one of:
      Candidate absentee-mail
      Candidate early
      Candidate election-day
      Candidate provisional
      Candidate total
      Contest absentee-mail
      Contest early
      Contest election-day
      Contest provisional
      Contest total
      Party absentee-mail
      Party early
      Party election-day
      Party provisional
      Party total
  - the specific count within the category
- three items to define the vertical-axis count (same as for horizontal)
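Seven positional strings in a row are easy to misorder. One way to keep the two axes straight is to group each axis into a named triple and flatten the triples only at the call. This is a sketch; AxisCount and scatter_positional_args are hypothetical helpers, not part of the package.

```python
from typing import NamedTuple, Tuple


class AxisCount(NamedTuple):
    """One axis of the scatter plot: election, count category, specific count."""
    election: str
    category: str
    count: str


def scatter_positional_args(
    jurisdiction: str, horizontal: AxisCount, vertical: AxisCount
) -> Tuple[str, ...]:
    """Flatten the pieces into the positional order described above."""
    return (jurisdiction, *horizontal, *vertical)


args = scatter_positional_args(
    "Georgia",
    AxisCount("2020 General", "Candidate total", "Joseph R. Biden"),
    AxisCount("2020 General", "Candidate total", "Donald J. Trump"),
)
print(args)
```

With the Analyzer instance from earlier, the call would then be an.scatter(*args, fig_type="jpeg").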
Note that once the category has been chosen, e.g., "Party absentee-mail", the list of possibilities for the specific count can be obtained from this line of code:
>>> [entry["name"] for entry in an.display_options("count",["2020 General","Georgia","Party absentee-mail"])]
Use any category name in place of "Party absentee-mail" to see counts available for that category.
Categories starting with "Contest" give the number of votes tallied in that contest in each county, lumping all candidates together. Categories starting with "Party" give the number of votes tallied for members of that party in a particular contest type (e.g., "Libertarian congressional").
The system attempts to find interesting one-county outliers within the election results. The specific algorithm is described in an article by Singer, Srungavarapu & Tsai in MAA Focus, Feb/March 2021, pp. 10-13.
For example:
>>> outliers = an.bar("2020 General","Georgia",contest_type="congressional",fig_type="png")
This will export up to three bar charts (to reports_and_plots_directory), each showing a pair of candidates for which the vote shares in one county, for some type of ballot, differ significantly from the vote shares in other counties in the same district. The output of the command above includes a chart showing how Clarke County differs from other Georgia counties in the 9th US House District.
The output variable outliers contains even more information, including the contest margin and an estimate of the votes at stake if the outlier were brought in line with the other counties.
Options for contest_type for this data set are: congressional, state (for statewide contests), state-house and state-senate.
The program offers difference-in-difference analysis where results are available by vote type, following Herron's analysis of congressional contests. The following code will create a tab-separated file GA_diff_in_diff.tsv in the working directory:
>>> (did_frame, missing) = an.diff_in_diff_dem_vs_rep("2020 General")
>>> did_frame.to_csv("GA_diff_in_diff.tsv", sep="\t")
(Note: the variable missing is a list of diff-in-diff comparisons that failed.)
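To make the idea concrete, here is a small worked sketch of a difference-in-difference calculation on invented numbers. It uses one common formulation: the gap in Democratic two-party share between two vote types in one contest, minus the same gap in a second contest. This formulation is an assumption for illustration and may differ in detail from what diff_in_diff_dem_vs_rep computes; all function names and counts below are hypothetical.

```python
def dem_share(dem_votes: int, rep_votes: int) -> float:
    """Democratic share of the two-party vote."""
    return dem_votes / (dem_votes + rep_votes)


def diff_in_diff(contest_a: dict, contest_b: dict, type_1: str, type_2: str) -> float:
    """Share gap between vote types in contest A minus the same gap in contest B.

    Each contest maps a vote type to a (dem_votes, rep_votes) pair.
    """
    gap_a = dem_share(*contest_a[type_1]) - dem_share(*contest_a[type_2])
    gap_b = dem_share(*contest_b[type_1]) - dem_share(*contest_b[type_2])
    return gap_a - gap_b


# Invented counts for a single county: (Democratic votes, Republican votes)
us_house = {"absentee-mail": (600, 400), "election-day": (450, 550)}
us_senate = {"absentee-mail": (550, 450), "election-day": (450, 550)}

result = diff_in_diff(us_house, us_senate, "absentee-mail", "election-day")
print(round(result, 3))  # prints 0.05
```

Here the absentee-mail share exceeds the election-day share by 0.15 in the first contest and by 0.10 in the second, so the difference-in-difference is 0.05.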
The sample session above uses the information already in the repository about Georgia and the particular results file. If you wish to create these files yourself from scratch, follow these optional steps.
The data loading process includes checking contest totals against reference totals in src/reference_results/Georgia.tsv. You may wish to add some reference totals to this file.
If you wish to practice creating a munger file, follow the steps below.
Your munger file should live in the folder src/mungers, and its name needs to have the extension .munger.
- (Optional) Delete the munger file ga_xml.munger from your local copy of the repository. This step is not strictly necessary, but will help ensure that you don't accidentally use the existing munger.
- Make a copy of src/mungers/000_template.munger named my_Georgia_test.munger, inside the same folder.
- Fill in the required parameter values to specify the organization of the results file GA_detail_20201120_1237.xml:
  * file_type=xml indicates that the file is in xml format.
  * count_location=ElectionResult/Contest/Choice/VoteType/County.votes indicates where the vote counts are to be found within the xml nesting structure.
- Specify the location of the info defining each vote count. Each location must start with one of the nodes in count_location.
  * Because the file contains a variety of contests, candidates, parties, vote types (a.k.a. CountItemTypes) and geographies (a.k.a. ReportingUnits), there is no need to use the constant_over_file parameter.
  * In the [munge formulas] section, specify where the other information is found. The ReportingUnit, CandidateContest and CountItemType can be read directly from quoted strings in the file (ReportingUnit is in County.name, CandidateContest is in Contest.text and CountItemType is in VoteType.name), but the Candidate and Party must both be read out of Choice.text with python's regular expression ("regex") syntax. In the package syntax, the location of the string in the file is enclosed in angle brackets <>, and if a regular expression is needed, the location and the regular expression are given as a pair within braces {}.
The relevant lines of the munger file are:
[format]
file_type=xml
count_location=ElectionResult/Contest/Choice/VoteType/County.votes
[munge formulas]
ReportingUnit=<County.name>
Party={<Choice.text>,^.* \((.*)\)$}
CandidateContest=<Contest.text>
Candidate={<Choice.text>,^(.*) \(.*\)$}
CountItemType=<VoteType.name>
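To see what these formulas extract, the sketch below applies the same node paths and the same two regular expressions to a tiny invented XML fragment, using only the standard library. It assumes that a location such as County.votes refers to the votes attribute of a County node; the fragment and the function name munge are hypothetical, and this is not the package's own parsing code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment with the nesting named in count_location (not real results)
SAMPLE_XML = """<ElectionResult>
  <Contest text="President of the United States">
    <Choice text="Joseph R. Biden (Dem)">
      <VoteType name="Election Day Votes">
        <County name="Appling" votes="10"/>
        <County name="Atkinson" votes="20"/>
      </VoteType>
    </Choice>
  </Contest>
</ElectionResult>"""

# The two regular expressions from the [munge formulas] section
PARTY_RE = re.compile(r"^.* \((.*)\)$")
CANDIDATE_RE = re.compile(r"^(.*) \(.*\)$")


def munge(xml_text):
    """Return one dict per County node, following the munge formulas."""
    rows = []
    root = ET.fromstring(xml_text)  # the ElectionResult node
    for contest in root.findall("Contest"):
        for choice in contest.findall("Choice"):
            choice_text = choice.get("text")
            for vote_type in choice.findall("VoteType"):
                for county in vote_type.findall("County"):
                    rows.append(
                        {
                            "ReportingUnit": county.get("name"),
                            "Party": PARTY_RE.match(choice_text).group(1),
                            "CandidateContest": contest.get("text"),
                            "Candidate": CANDIDATE_RE.match(choice_text).group(1),
                            "CountItemType": vote_type.get("name"),
                            "Count": int(county.get("votes")),
                        }
                    )
    return rows


rows = munge(SAMPLE_XML)
print(rows[0])
```

Because both patterns begin with a greedy .*, the party is taken from the last parenthesized group in the choice text and the candidate is everything before it.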
If you wish to practice creating an initialization file for results, follow the steps below. Otherwise the system will use the initialization file ga20g_20201120_1237.ini.
Because your input_directory folder has a subfolder Georgia, the dataloading routines will look to the folder src/ini_files_for_results/Georgia for information about any results files in input_directory/Georgia in your working directory.
- Delete ga20g_20201120_1237.ini from the repository.
- Copy src/ini_files_for_results/single_election_jurisdiction_template.ini to a file with extension .ini in the folder src/ini_files_for_results/Georgia.
- Define the parameters in the [election_results] section.
  - If you did not create a new munger file, use munger_list=ga_xml instead of munger_list=my_Georgia_test
  - Use the actual download date instead of 2021-11-09
[election_results]
results_file=Georgia/GA_detail_20201120_1237.xml
munger_list=my_Georgia_test
jurisdiction=Georgia
election=2020 General
results_short_name=any_alphanumeric_string
results_download_date=2021-11-09
results_source=electiondata repository
results_note=
is_preliminary=False
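Before loading, it can help to confirm that the initialization file parses and that its date and boolean values are well-formed. The following is a sketch using the standard configparser and datetime modules; the function name read_election_results is hypothetical and the package performs its own checks.

```python
import configparser
from datetime import date

# The sample [election_results] section shown above
SAMPLE_RESULTS_INI = """
[election_results]
results_file=Georgia/GA_detail_20201120_1237.xml
munger_list=my_Georgia_test
jurisdiction=Georgia
election=2020 General
results_short_name=any_alphanumeric_string
results_download_date=2021-11-09
results_source=electiondata repository
results_note=
is_preliminary=False
"""


def read_election_results(ini_text: str) -> dict:
    """Parse the [election_results] section, converting the date and boolean."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["election_results"]
    params = dict(section)  # every value as a string, empty values as ""
    # getboolean raises ValueError on anything other than true/false-like text
    params["is_preliminary"] = section.getboolean("is_preliminary")
    # fromisoformat raises ValueError on a malformed YYYY-MM-DD date
    params["results_download_date"] = date.fromisoformat(
        section["results_download_date"]
    )
    return params


params = read_election_results(SAMPLE_RESULTS_INI)
print(params["results_download_date"], params["is_preliminary"])
# prints: 2021-11-09 False
```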
If you choose to create your Georgia jurisdiction files from scratch (rather than using the ones provided in the repository) you will need to specify various information about Georgia in a file called jurisdiction_prep.ini in your working directory. The working directory will have this structure:
.
+-- jurisdiction_prep.ini
+-- run_time.ini
+-- input_directory
| +-- Georgia
| | +-- GA_detail_20201120_1237.xml
+-- archive_directory
+-- reports_and_plots_directory
and the contents of jurisdiction_prep.ini will be:
[electiondata]
name=Georgia
reporting_unit_type=state
abbreviated_name=GA
count_of_state_house_districts=180
count_of_state_senate_districts=56
count_of_us_house_districts=14
Because the input_directory folder has a subfolder Georgia, the dataloading routines will look for information specific to Georgia in the repository folder src/jurisdictions/Georgia.
- Delete the folder src/jurisdictions/Georgia.
- Navigate to your working directory.
- Follow the instructions in the "Create or Improve a Jurisdiction" section of the User_Guide. As you work through them, the following examples may be helpful.
- Sample county lines in ReportingUnit.txt:
Georgia;Appling County county
Georgia;Atkinson County county
Georgia;Bacon County county
- Sample county lines in GA_starter_dictionary.txt (or dictionary.txt):
ReportingUnit Georgia;Appling County Appling
ReportingUnit Georgia;Atkinson County Atkinson
ReportingUnit Georgia;Bacon County Bacon
- Try loading your results!
>>> import electiondata as ed
>>> ed.load_or_reload_all(move_files=False)
The move_files=False option prevents the system from archiving files, which is useful while you are still dealing with warnings.
Errors and warnings are par for the course, so don't be discouraged if the result is something like this:
Jurisdiction Georgia did not load. See .error file.
Jurisdiction errors written to reports_and_plots_directory/load_or_reload_all_1113_0757/_jurisdiction_Georgia.errors
>>>
Take a look at the file(s) indicated, which should give you a good idea of what needs to be fixed. (If it doesn't, please report the unhelpful message(s) and your question as an issue on GitHub.) Repeat this step -- loading and looking at errors -- as often as necessary! (In rare cases, you may want to start over with a new database, either by erasing the existing ga_test from your postgres instance, or changing the value of dbname in the run_time.ini file.)
If no contests were recognized, or no candidates were recognized, the system reports an error:
Jurisdiction errors (Georgia):
Could not add totals for 2020 General because no data was found
No contest-selection pairs recognized via munger my_Georgia_test.munger
Focus on one contest, and one candidate in that contest. Look in the .errors and .warnings files. If the name of the contest or the candidate appears, the file will tell you what went wrong. If the name of the contest or the candidate does not appear in the .errors or .warnings file, then there is an issue with the munger named in the results initialization file.
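Searching the report files for one name can be done by hand or with a short script. The sketch below, standard library only, collects every line in the .errors and .warnings files under a directory that mentions a given string; the function name find_mentions is hypothetical.

```python
from pathlib import Path


def find_mentions(report_dir, name):
    """Map each .errors/.warnings file name under report_dir to its lines containing name."""
    hits = {}
    for path in sorted(Path(report_dir).rglob("*")):
        if path.is_file() and path.suffix in {".errors", ".warnings"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            lines = [line for line in text.splitlines() if name in line]
            if lines:
                hits[path.name] = lines
    return hits


# Hypothetical usage from the working directory:
# find_mentions("reports_and_plots_directory", "US Senate")
```

An empty result for both the contest and the candidate points toward the munger, as described above.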
Contests that were parsed from the file but not recognized in the dictionary will be listed in a .warnings file, e.g.:
CandidateContests (found with munger my_Georgia_test) not found in dictionary.txt :
US Senate (Perdue)
US Senate (Loeffler) - Special
Statewide Referendum A
State Senate District 39 - Special Democratic Primary
State Senate Dist 9/Senador Estatal Dist 9
You may very well choose to omit certain contests (such as Statewide Referendum A and the Special Democratic Primary). For the other contests, you will need to make an entry in dictionary.txt -- and maybe CandidateContest.txt and Office.txt as well.
* 'State Senate Dist 9/Senador Estatal Dist 9': if you see this after following the instructions in this sample session, the corresponding Office and CandidateContest 'GA Senate District 9' should already exist, so simply add another line to the dictionary.txt file: CandidateContest GA Senate District 9 State Senate Dist 9/Senador Estatal Dist 9. (It may save time to add this version of all the GA Senate districts at this point.)
* 'US Senate (Perdue)' and 'US Senate (Loeffler) - Special': Be sure to disambiguate these, either by creating two separate Offices and corresponding CandidateContests for the two US Senate positions, or by creating two separate CandidateContests corresponding to the single office 'US Senate GA'. We choose the latter, adding a row to CandidateContest.txt (US Senate GA (partial term) 1 US Senate GA) and two rows to dictionary.txt:
CandidateContest US Senate GA (partial term) US Senate (Loeffler) - Special
CandidateContest US Senate GA US Senate (Perdue)
Candidates not found in the dictionary will be listed in a .warnings file. Copy and paste these names into the Candidate.txt file as a single BallotName column:
BallotName
Zulma Lopez
Zachary Perry
Yasmin Neal
and add corresponding rows to dictionary.txt:
Candidate Zulma Lopez Zulma Lopez
Candidate Zachary Perry Zachary Perry
Candidate Yasmin Neal Yasmin Neal
You are free to choose another convention for the candidate names in the database, as long as you specify the mapping in the dictionary. E.g., you could instead have
BallotName
Lopez, Zulma
Perry, Zachary
Neal, Yasmin
if your dictionary has
Candidate Lopez, Zulma Zulma Lopez
Candidate Perry, Zachary Zachary Perry
Candidate Neal, Yasmin Yasmin Neal
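With many unrecognized candidates, generating both sets of lines at once avoids mismatches between Candidate.txt and dictionary.txt. The following is a sketch; the function name candidate_rows is hypothetical, it assumes the dictionary.txt fields are tab-separated, and the "Last, First" conversion is naive (it splits at the final space, so names with suffixes such as "Jr." need manual attention).

```python
def candidate_rows(raw_names, last_first=False):
    """Return (Candidate.txt lines, dictionary.txt lines) for names from a .warnings file."""
    candidate_lines = ["BallotName"]  # header of the single column
    dictionary_lines = []
    for raw in raw_names:
        internal = raw
        if last_first and " " in raw:
            # naive split: everything before the final space is the first name
            first, _, last = raw.rpartition(" ")
            internal = f"{last}, {first}"
        candidate_lines.append(internal)
        # element type, internal database name, name as it appears in the results file
        dictionary_lines.append("\t".join(("Candidate", internal, raw)))
    return candidate_lines, dictionary_lines


names = ["Zulma Lopez", "Zachary Perry", "Yasmin Neal"]
candidate_lines, dictionary_lines = candidate_rows(names, last_first=True)
print("\n".join(candidate_lines))
print("\n".join(dictionary_lines))
```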
Parties not recognized will also be listed in the .warnings file, e.g.:
Rep
Lib
Ind
Grn
Dem
All these should be mapped in dictionary.txt
Party Republican Party Rep
Party Libertarian Party Lib
Party Independent Party Ind
Party Green Party Grn
Party Democratic Party Dem
to your chosen internal database names, which should be in the name column of Party.txt:
Name
Democratic Party
Republican Party
Libertarian Party
Green Party
Independent Party
Note: Some of the routines in the analyze submodule assume that every party name ends with ' Party'.
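Since that naming assumption is easy to violate by accident, a quick check while building the dictionary rows can catch it early. This is a sketch; the function name party_dictionary_lines is hypothetical, and it assumes the dictionary.txt fields are tab-separated.

```python
# Abbreviations from the .warnings file mapped to internal database names
PARTY_MAP = {
    "Rep": "Republican Party",
    "Lib": "Libertarian Party",
    "Ind": "Independent Party",
    "Grn": "Green Party",
    "Dem": "Democratic Party",
}


def party_dictionary_lines(party_map):
    """Build dictionary.txt rows, refusing internal names that lack the ' Party' ending."""
    bad = [name for name in party_map.values() if not name.endswith(" Party")]
    if bad:
        raise ValueError(f"internal party names must end with ' Party': {bad}")
    return ["\t".join(("Party", internal, raw)) for raw, internal in party_map.items()]


print("\n".join(party_dictionary_lines(PARTY_MAP)))
```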