Assignment 2

library(haven)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 is already attached by library(tidyverse) above
# Read the Stata file directly from GitHub
TEDS_2016 <- read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")

Exploratory Data Analysis

In my exploratory data analysis, I created frequency graphs for several of the variables recorded in the dataset: Sex, Age, and Edu.

ggplot(TEDS_2016, aes(x = Sex)) +
  geom_bar()

1 = Male, 2 = Female

ggplot(TEDS_2016, aes(x = Age)) +
  geom_histogram(binwidth = 0.5)

1 = 20-29 age range

2 = 30-39 age range

3 = 40-49 age range

4 = 50-59 age range

5 = 60-69 age range (and above? Not specified in the data set)

The data set seems to skew toward category 5.

ggplot(TEDS_2016, aes(x = Edu)) +
  geom_histogram(binwidth = 0.5)

1 = Below elementary

2 = Junior High

3 = Senior High

4 = College

5 = ? The data set does not seem to specify

9 = ? I assume this is a numerical error, since 9 does not seem to be defined in the dataset

The values cluster most often around senior high education (3) and 5, which I assume stands for Masters/graduate education.
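One way to check which codes actually occur, and which of them carry a label, is to tabulate the raw values and compare them with the variable's "labels" attribute, which is where haven stores Stata value labels. The following is only a minimal sketch on a small made-up vector: edu_toy and its values are hypothetical stand-ins, not the TEDS data.

```r
# Toy stand-in for a labelled survey variable; the values are made up.
# haven keeps Stata value labels in the "labels" attribute of the column.
edu_toy <- structure(
  c(1, 3, 3, 5, 9, 2, 5, 4),
  labels = c("Below elementary" = 1, "Junior high" = 2,
             "Senior high" = 3, "College" = 4)
)

# Frequency of every code that occurs, including any NA
code_counts <- table(as.vector(edu_toy), useNA = "ifany")
print(code_counts)

# Codes that occur in the data but have no label attached
observed_codes  <- sort(unique(as.vector(edu_toy)))
labelled_codes  <- unname(attr(edu_toy, "labels"))
undefined_codes <- setdiff(observed_codes, labelled_codes)
print(undefined_codes)
```

On the real data, the equivalent checks would presumably be attr(TEDS_2016$Edu, "labels") and a table() of the column, though I have not verified what labels the file actually carries.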

Problems Encountered

A significant problem with the data set is the lack of definitions for certain values. For example, for the age-range responses, values 1-4 are defined but 5 is not. For the education responses, values 1-4 are again defined, but 5 and 6 are not (and for some reason a 9 exists).

As a result, it is much more difficult to interpret the data properly, especially since much of the data is descriptive (categorical) rather than numerical. Consequently, it is also harder to formulate sound research questions when we cannot tell what the values mean.

How To Work With Missing Values

Because of the missing definitions described above, we can sometimes make educated guesses about what a value means (e.g., my assumption that "5" in Edu stands for graduate education). Such guesses are unreliable, however, so I believe these variables should only be used in the context of the whole dataset, where the values can be interpreted relative to one another even without exact definitions.

For example, although response values 5-6 in Edu are undefined, we can still make the basic assumption that they stand for higher levels of education than values 1-4. As such, we can still use the entire dataset for comparative statistical analysis (such as covariation) of other factors, with the caveat that analysis of each individual variable may not be worthwhile given lack of exact definition of some values.
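The idea of using the undefined codes as ordered but unlabelled values can be sketched as follows. This assumes that code 9 is not a real education level and so is set to NA, and that codes 1-6 run from lowest to highest. Both are assumptions rather than documented facts, and the two toy vectors are made up for illustration.

```r
# Made-up example values; not taken from TEDS_2016.
edu_toy   <- c(1, 2, 3, 3, 4, 5, 6, 9, 5, 2)
other_toy <- c(2, 1, 3, 4, 4, 6, 5, 3, 5, 2)  # hypothetical second ordinal variable

# Assumption: 9 is a non-substantive code, so recode it to missing
edu_clean <- replace(edu_toy, edu_toy == 9, NA)

# How many values were set to missing
n_missing <- sum(is.na(edu_clean))

# Spearman's rank correlation only uses the ordering of the codes,
# so it does not require knowing exactly what 5 and 6 mean.
rho <- cor(edu_clean, other_toy, method = "spearman", use = "complete.obs")
print(n_missing)
print(rho)
```

A rank-based measure is used here because it depends only on the order of the categories, which is the one property of the undefined codes we are willing to assume.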

Exploring Relationship Between Tondu and Variables

# female is stored as a number, so convert it to a factor;
# otherwise ggplot drops the fill aesthetic with a warning
ggplot(TEDS_2016, aes(x = Tondu, fill = factor(female))) +
  geom_bar()

0 = Not female (presumably male), 1 = Female
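The same relationship can also be read as a table. The sketch below cross-tabulates two toy vectors (tondu_toy and female_toy are hypothetical, made-up values) and converts the counts to row proportions, so that each Tondu code shows its share of female and non-female respondents.

```r
# Made-up values standing in for TEDS_2016$Tondu and TEDS_2016$female
tondu_toy  <- c(1, 2, 3, 3, 3, 4, 5, 5, 6, 3)
female_toy <- c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1)

# Counts of each Tondu code by sex
counts <- table(Tondu = tondu_toy, female = female_toy)

# Row proportions: within each Tondu code, the share of each sex
shares <- prop.table(counts, margin = 1)
print(counts)
print(round(shares, 2))
```

Row proportions make the categories comparable even when some Tondu responses are much more common than others.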

Tondu Variables:

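# Convert Tondu to plain numeric codes. The labels argument documents the
# intended meaning of the codes, in order; it does not appear to change the values.
TEDS_2016$Tondu <- as.numeric(TEDS_2016$Tondu, labels = c("Unification now", "Status quo, unif. in future", "Status quo, decide later", "Status quo forever", "Status quo, indep. in future", "Independence now", "No response"))
ggplot(TEDS_2016, aes(x = Tondu)) +
  geom_freqpoly(binwidth = 1/4)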

ggplot(data = TEDS_2016) +
  geom_bar(mapping = aes(x = Tondu))
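To show category names instead of numeric codes, the codes could be turned into a factor. This sketch assumes the codes are 1 to 6 plus 9 for no response, in the order of the labels listed above. That coding is an assumption and should be checked against the dataset's own value labels first; tondu_toy is made up.

```r
# Assumed coding (check against the dataset's own value labels first)
tondu_codes  <- c(1, 2, 3, 4, 5, 6, 9)
tondu_labels <- c("Unification now", "Status quo, unif. in future",
                  "Status quo, decide later", "Status quo forever",
                  "Status quo, indep. in future", "Independence now",
                  "No response")

# Made-up responses, not taken from TEDS_2016
tondu_toy <- c(3, 3, 5, 1, 9, 4, 3, 6, 2, 5)

# factor() maps each code to its label and keeps the category order
tondu_factor <- factor(tondu_toy, levels = tondu_codes, labels = tondu_labels)
print(table(tondu_factor))
```

A factor like this can be passed to geom_bar() directly, whereas geom_freqpoly() bins a continuous variable and so needs the numeric version.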