The word "data" is a general-purpose word denoting a collection of measurements. "Data points" refer to individual instances of data. A "data set" is a well-structured set of data points. Data points can be of several "data types," such as numbers, or text, or date-times. When we collect data on similar objects in similar formats, we bundle the data points into a "variable." We could give a variable a name such as 'age,' which could represent the list of ages of everyone in a room. The data points associated with a variable are called the "values" of the variable. These concepts are foundational to understanding data science. There is some quirkiness in the way variables are treated in the R programming language.
The Wiktionary defines data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another.
The Wiktionary defines datum as a measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from 1 to 10 by ones, 1 to 100 by 0.1, or simply true or false, on or off, yes, no, or maybe, etc.; and as a fact known from direct observation.
For our purposes, the key components of these definitions are that data are observations that are measured and communicated in such a way as to be intelligible to both the recorder and the reader. So, you as a person are not data, but recorded observations about you are data. For example, your name when written down is data; a digital recording of you speaking your name is data; and a digital photograph of your face or a video of you dancing is data.
Rather than call a single measurement by the formal word "datum," we will use what the Wikipedia calls a data point. We may talk about a single data point or several data points. Just remember that when we talk of "data," what we mean is a set of aggregated data points.
The Wiktionary, unhelpfully, defines a data set as a "set of data." Let us define a data set as a collection of data points that has been observed on similar objects and formatted in similar ways. Thus, a compilation of the written names and the written ages of a room full of people is a data set. In computing, a data set is stored in a file on a disk. Storing the data set in a file makes it accessible to analysis.
As illustrated earlier, data can exist in many forms, such as text, numbers, images, audio, and video. People who work with data have taken great care to define different data types precisely. They do this because they want to compute various operations on the data, and those operations only make sense for particular data types. For example, addition is an operation we can compute on integer data types (2 + 2 = 4), but not on text data types ("two" + "two" is undefined). Concatenation is an operation we can compute on text. To concatenate means to put together, so: concatenate("two", "two") = "twotwo". For the purposes of this introduction, we will just concern ourselves with simple numeric and simple text data types and leave more complex data types, like images, audio, and video, to more advanced courses. Data scientists use the various data types from mathematics, statistics, and computer science to communicate with each other.
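The difference between addition and concatenation can be tried directly in R. The following is a minimal sketch; the object names `result` and `joined` are made up for illustration, and `tryCatch()` is used only so that the failed addition does not stop the script.

```r
# Addition is defined for integer values.
2L + 2L

# Addition is not defined for text. R signals an error, which is
# captured here as a message instead of halting the script.
result <- tryCatch("two" + "two", error = function(e) conditionMessage(e))
result

# Concatenation is defined for text. paste0() joins strings
# without inserting a separator between them.
joined <- paste0("two", "two")
joined
```

R's `paste()` function does the same job but inserts a space between the pieces by default.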
We will introduce just the most commonly used data types in Mathematics. There are many more, but we'll save those for more advanced courses.
Data scientists understand that the kind of mathematical operations they may perform depends on the data types reflected in their data.
We will introduce just the most commonly used data types in statistics, as defined in the Wikipedia. There are a few more data types in statistics, but we'll save those for more advanced courses.
Data scientists know that the kind of statistical analysis they will perform is determined by the kinds of data types they will be analyzing.
We will introduce just the most commonly used data types in Computer Science, as defined in the Wikipedia. There are many more, but we'll save those for more advanced courses.
Key | Value |
---|---|
1 | Get haircut |
2 | Buy groceries |
3 | Take shower |
Lists can become and often do become very complex. The keys do not have to be numeric, but could be words, such as "one," "two," and "three." The values do not have to be a single data point. The value could be a series of numbers, or a matrix of numbers, or a paragraph. For example the first key in a list could be "Romeo and Juliet," and the first value in the list could be the entire play of Romeo and Juliet. The second key in the list could be "Macbeth," and the second value in the list could be the entire play of Macbeth. Finally, a value in a list could even be another list. At this point do not go down the rabbit hole of "a list within a list within a list . . ." We will leave that to graduate students in computer science.
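As a rough illustration of the key-value idea, the sketch below builds the to-do list from the table in R, using words as keys. The names `todo` and `mixed`, and the values stored in `mixed`, are invented for this example.

```r
# A list whose keys are words and whose values are single strings.
todo <- list(one = "Get haircut", two = "Buy groceries", three = "Take shower")

names(todo)      # the keys
todo[["two"]]    # look up one value by its key

# Values do not have to be single data points: here one value is a
# series of numbers and another value is itself a list.
mixed <- list(
  label  = "example",
  values = c(90, 85, 77),
  nested = list(a = 1, b = "x")
)

mixed$values       # the series of numbers
mixed$nested$b     # a value inside the list within the list
```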
Data scientists understand the importance of how data is represented in computer science, because it affects the results they are generating. This is especially true when small rounding errors accumulate over a large number of iterations.
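A small sketch of this effect in R: decimal fractions such as 0.1 cannot be stored exactly as binary floating-point numbers, so tiny errors appear and then add up. The variable name `total` and the choice of 1000 iterations are arbitrary.

```r
# Two values that look equal on paper are not exactly equal in storage.
0.1 + 0.2 == 0.3                    # FALSE

# all.equal() compares within a small tolerance instead.
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Adding 0.1 a thousand times lets the small errors accumulate.
total <- 0
for (i in 1:1000) {
  total <- total + 0.1
}
total == 100          # exact comparison with the expected result
abs(total - 100)      # size of the accumulated error
```

Comparing with a tolerance, as `all.equal()` does, is the usual way to test floating-point results for equality.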
There are at least 24 data types in the R language. [ 2 ] We will introduce only the 9 most commonly used data types. As you will see, they are a blend of the data types that exist in mathematics, statistics, and computer science, which is just what a data scientist would expect. The nine are:
You can get R to tell you what type a particular data object is by using the typeof() command, which reports how R stores the object internally. If you want to know what a particular data object was called in the original definition of the S language [ 3 ] you can use the mode() command. If you want to know the object class of a particular data object, which R uses to decide how functions such as print() and summary() should treat it, you can use the class() command. For the purposes of this book, we will mostly use the typeof() command.
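The three commands can give three different answers for the same object. The sketch below compares them on an integer and on a date; the object names `x` and `d` are arbitrary.

```r
# An integer value (the L suffix asks R for an integer, not a double).
x <- 1L
typeof(x)   # "integer"
mode(x)     # "numeric"
class(x)    # "integer"

# A date is stored as a number of days but carries the class "Date".
d <- as.Date("2012-07-04")
typeof(d)   # "double"
mode(d)     # "numeric"
class(d)    # "Date"
```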
Data scientists must know exactly how their data are being represented in the analysis package, so they can apply the correct mathematical operations and statistical analysis.
Let us start by noting the opposite of a variable is a constant. If we declare that the symbol "X" is a constant and assign it a value of 5, then X = 5. It does not change; X will always be equal to 5. Now, if we declare the symbol "Y" to be a variable, that means Y can have more than one value (see the Wiktionary entry for "variable"). For example, in the mathematical equation Y^2 = 4 (Y squared equals 4), the variable Y can have either the value 2 or the value -2 and satisfy the equation.
Imagine we take a piece of paper and make two columns. At the top of the first column we put the label "name" and the top of the second column we put the label "age." We then ask a room full of 20 people to each write down their name and age on the sheet of paper in the appropriate columns. We will end up with a list of 20 names and 20 ages. Let us use the label "name" to represent the entire list of 20 names and the label "age" to represent the entire list of 20 ages. This is what we mean by the term "variable." The variable "name" has 20 data points (the list of 20 names), and the variable "age" has 20 data points (the list of 20 ages). A variable is a symbol that represents multiple data points which we also call values. Other words that have approximately the same meaning as "value" are measurement and observation. Data scientists use these four terms (data point, value, measurement, and observation) interchangeably when they communicate with each other.
The word "variable" is a general purpose word used in many disciplines. However, various disciplines also use more technical terms that mean approximately the same thing. In mathematics another word that approximates the meaning of the term "variable" is vector. In computer science, another word that approximates the meaning of the term "variable" is array. In statistics, another word that approximates the meaning of the term "variable" is distribution. Data scientists will often use these four words (variable, vector, array, and distribution) interchangeably when they communicate with each other.
Let us think again of the term data set (defined above). A data set is usually two or more variables (and their associated values) combined together. Once our data is organized into variables, combined into a data set, and stored in a file on a disk, it is ready to be analyzed.
The R programming language is a little quirky when it comes to data types, variables, and data sets. In R we sometimes use the term "vector" instead of "variable." When we combine and store multiple vectors (variables) into a data set in R, we call it a data frame. When R stores vectors in a data frame, it assigns each one a role that indicates how the data will be used in subsequent statistical analyses. For example, the "character" data type is assigned the role of Factor (this was the default behavior before R 4.0.0; in later versions it happens only when stringsAsFactors = TRUE is passed to data.frame()). The "double" data type is assigned the role of num, and the "integer" data type is assigned the role of int. The "logical" data type is shown as logi, and date/time values keep their own classes, such as Date or POSIXct. (The "complex" data type is assigned the role of "cplx," but don't worry about that now.) These roles correspond to the statistical data types as follows: Factor = nominal, int = ordinal, and num = interval. (We usually transform the ratio data type into an interval data type before doing statistical analysis. This is normally done by taking the logarithm of the ratio data. More on this in later chapters.) We can discover the roles each variable will play within a data frame by using the structure command in R: str() . We will explain what "factors" are in later chapters.
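The following sketch shows how str() reports the Factor, int, and num roles. The data frame name `survey`, its column names, and the `score` values are invented for illustration, and `stringsAsFactors = TRUE` is passed explicitly so that the text column becomes a factor in any version of R.

```r
# A small data frame with one text, one integer, and one double column.
survey <- data.frame(
  name  = c("Maria", "Fred", "Sakura"),
  age   = as.integer(c(24, 19, 21)),
  score = c(3.5, 4.0, 2.5),
  stringsAsFactors = TRUE
)

# str() prints the role of every column (Factor, int, num).
str(survey)

# The same information as a named character vector.
roles <- sapply(survey, class)
roles
```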
This assignment should be done in a group of 3 or 4 students. The groups need to be composed of different people from the previous two homework groups. All should interact with the R programming language. The group can help each other both learn the concepts and figure out how to make R work. Practice with R by trying out different ways of using the commands that are described below.
Use the typeof() command to verify data types. See if you can guess what the output will look like before you press the enter key.
> a <- as.integer(1)
> typeof(a)
> a
> b <- as.double(1)
> typeof(b)
> b
> d <- as.character(1)
> typeof(d)
> d
> e <- as.logical("true")
> typeof(e)
> e
> f <- as.complex(-25)
> typeof(f)
> f
> g <- as.null(0)
> typeof(g)
> g
> h <- as.Date("2012-07-04")
> typeof(h)
> class(h)
> h
> i <- as.POSIXct("2012/07/04 10:15:59")
> typeof(i)
> class(i)
> i
> j <- as.POSIXlt("2012/07/04 10:15:59")
> typeof(j)
> class(j)
> j
> k <- list("Get haircut", "Buy Groceries", "Take shower")
> typeof(k)
> k
If you don't explicitly specify a data type through the as.* commands, R tries to figure out what data type you intended. It does not always guess correctly. Play around with R, assigning some values to some variables, and then use the typeof() command to see the automatic assignments of data types that R made for you. Then see if you can convert a value from one data type to another.
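As a starting point for this exercise, here is a short sketch of R's automatic type assignment and of a few conversions. The object names are arbitrary, and `suppressWarnings()` is used only to keep the failed conversion quiet.

```r
# What R assumes when no as.* command is used.
typeof(1)       # "double": plain numbers are doubles, not integers
typeof(1L)      # "integer": the L suffix requests an integer
typeof("1")     # "character"
typeof(TRUE)    # "logical"

# Converting between data types.
n <- as.integer("42")     # text to integer
s <- as.character(42L)    # integer to text
p <- as.numeric("3.14")   # text to double

# A conversion that makes no sense yields NA (with a warning).
bad <- suppressWarnings(as.integer("forty-two"))
bad
```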
The R language is based on an object-oriented programming language. Thus, things in R are called objects. So, when we assign a value to the letter "X," in R we would say we have assigned a value to the object "X." Objects in R may have different properties from each other, depending on how they are used. For this exercise, we will concern ourselves with objects that behave like variables. Those types of objects are called vector objects. So, when we talk—in the language of data science—about the variable "X," in R we could call it the vector "X." As you remember, a variable is something that varies. Let's create a character vector in R and assign it three values. We will use the concatenate c() command in R. Let's also create an integer vector using the same concatenate command.
> name <- c("Maria", "Fred", "Sakura")
> typeof(name)
> name
> age <- as.integer(c(24,19,21))
> typeof(age)
> age
Both vectors now have three values each. The character string "Maria" is in the first position of the vector "name," "Fred" is in the second position, and "Sakura" is in the third position. Similarly, the integer 24 is in the first position of the vector "age," 19 is in the second position, and 21 is in the third position. Let's examine each of these individually.
> name[1]
> name[2]
> name[3]
> age[1]
> age[2]
> age[3]
The number within the brackets is called the index or the subscript.
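An index does not have to be a single number. The sketch below, which recreates the `name` vector so that it runs on its own, shows a few other ways to use the brackets; the names `first_and_last`, `without_second`, and `beyond` are invented for this example.

```r
name <- c("Maria", "Fred", "Sakura")

length(name)                     # how many values the vector holds

first_and_last <- name[c(1, 3)]  # several positions at once
first_and_last

without_second <- name[-2]       # a negative index drops that position
without_second

beyond <- name[4]                # a position past the end gives NA
beyond
```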
If we had observed the actual names and ages of three people so that name[1] corresponded to age[1] , we would have a data set that looks like the following.
Name | Age |
---|---|
Maria | 24 |
Fred | 19 |
Sakura | 21 |
Let us put our data set into an R data frame object. We need to think of a name for our data frame object. Let's call it "project." After we put our data set into the data frame, we will inspect it using R's "typeof," "class," "ls," and "structure" commands, str() . Remember, upper and lower cases are meaningful.
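> # stringsAsFactors = TRUE was the default before R 4.0.0; it is stated here so that current versions of R behave the same way
> project <- data.frame(name, age, stringsAsFactors = TRUE)
> typeof(project)
> class(project)
> ls(project)
> str(project)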
The typeof() function told us we had created a list object. The class() function told us it is a special type of list object known as a data.frame. The ls() function tells us what "key-value" pairs exist inside our list object. Please don't worry too much about all of that detail right now. What is important is what the str() function tells us.
The structure command tells us we have three observations and two variables. That is great. It tells us the names of the variables are $name and $age . This tells us that when we put a data set into an R data frame list object, we need to reference the variable WITHIN the data frame as follows: project$name and project$age . The structure command also tells us that project$name was assigned the role of a "Factor" variable and that project$age was assigned the role of "int." These correspond to the "nominal" and "ordinal" data types that statisticians use. R needs to know the role variables play in order to perform the correct statistical functions on the data. One might argue that the age variable is more like the statistical interval data type than the statistical ordinal data type. We would then have to change the R data type from integer to double. This will change its role to "num" within the data frame.
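Because a data frame is a list, its variables can be reached the same way list values are. The following sketch rebuilds the data frame so that it runs on its own; `stringsAsFactors = TRUE` is an assumption made here so that the name column is a factor regardless of the R version.

```r
name <- c("Maria", "Fred", "Sakura")
age <- as.integer(c(24, 19, 21))
project <- data.frame(name, age, stringsAsFactors = TRUE)

is.list(project)    # a data frame is a special kind of list
names(project)      # the keys, which are the variable names

# Two equivalent ways to reference a variable WITHIN the data frame.
project$age
project[["age"]]

# The dimensions reported by str(): observations and variables.
nrow(project)
ncol(project)
```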
Rather than change the data type of project$age , it is good practice to create a new variable, so the original is not lost. We will call the new variable project$age.n , so we can tell that it is the transformed project$age variable.
> project$age.n <- as.double(project$age)
> str(project)
We can now see that project$age and the project$age.n variables play different roles in the data frame, one as "int" and one as "num." Now, confirm that the complete data set has been properly implemented in R by displaying the data frame object.
> project
    name age age.n
1  Maria  24    24
2   Fred  19    19
3 Sakura  21    21
Now let's double check the data types.
> typeof(project$name)
> typeof(project$age)
> typeof(project$age.n)
Whoops! We see some of the quirkiness of R. When we created the variable "name," it had a data type of "character." When we put it into a data frame not only did R assign it the role of a "Factor" but it also changed its data type to "integer." What is going on here? This is more than you want to know right now. We will explain it now, but you really don't have to understand it until later.
> project$name.c <- as.character(project$name)
> typeof(project$name.c)
> str(project)
'data.frame':   3 obs. of  4 variables:
 $ name  : Factor w/ 3 levels "Fred","Maria",..: 2 1 3
 $ age   : int  24 19 21
 $ age.n : num  24 19 21
 $ name.c: chr  "Maria" "Fred" "Sakura"
We can now see that project$name.c has a data type of character, and has been assigned a data frame role of "chr."
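The integer type reported for a factor comes from the way R stores it: each distinct text value is kept once as a "level," and the vector itself holds integer codes that point to those levels. The sketch below, with the made-up object names `name_f` and `rebuilt`, shows both parts.

```r
# A factor built from the three names.
name_f <- factor(c("Maria", "Fred", "Sakura"))

typeof(name_f)       # "integer": the stored codes
class(name_f)        # "factor": the role R gives the object
levels(name_f)       # the distinct text values, sorted alphabetically
as.integer(name_f)   # the codes: 2 1 3, matching the str() output above

# Looking the codes up in the levels recovers the original text,
# which is what as.character() does for a factor.
rebuilt <- levels(name_f)[as.integer(name_f)]
rebuilt
```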