Stata Tidbits

These tidbits contain bits and pieces of information I hope you find helpful to use Stata more effectively.
Tuesday, June 15, 2010

Floats and Doubles

This week we will look at two commonly used data types for storing numeric data, float and double. These are the two data types used for storing non-whole numbers (i.e., numbers that can have a value after the decimal point). Numbers stored as a float have about 7 digits of accuracy. This means that a number like 1234567 or 1.234567 can be stored accurately; the placement of the decimal point is not important. A number like 123456789 or 1.23456789, if stored as a float, would lose precision. By contrast, numbers stored using the double type have about 16 digits of accuracy. Let's explore these two data types using the gasctrysmall data file.

. use gasctrysmall

Using the describe command we can see that the variables gas and infl are stored as type double and the variables ctry and year are stored as type float.

. describe

Contains data from gasctrysmall.dta
  obs:             8
 vars:             4                          26 Jan 2010 14:28
 size:           256 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
ctry            float  %9.0g                  Country ID
year            float  %9.0g                  year
gas             double %9.0g                  Gas price
infl            double %9.0g                  Inflation factor
-------------------------------------------------------------------------------
Sorted by:  ctry

This dataset is so small that we can list the entire file, as shown below.

. list

     +--------------------------+
     | ctry   year   gas   infl |
     |--------------------------|
  1. |    1   1974   .78   1.32 |
  2. |    1   1975   .83    1.4 |
  3. |    2   1971   .69   1.15 |
  4. |    2   1972   .77    1.2 |
  5. |    2   1973   .89   1.29 |
     |--------------------------|
  6. |    3   1974   .42   1.14 |
  7. |    4   1974   .82   1.12 |
  8. |    4   1975   .94   1.18 |
     +--------------------------+

Suppose that we use the generate command to make a copy of the variable infl, calling it infl2. Note that the original variable is of type double, but since the default data type is float, the copy is created as a float.

. generate infl2 = infl

. describe infl infl2

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
infl            double %9.0g                  Inflation factor
infl2           float  %9.0g

Let's list these variables out side by side.

. list ctry infl infl2

     +---------------------+
     | ctry   infl   infl2 |
     |---------------------|
  1. |    1   1.32    1.32 |
  2. |    1    1.4     1.4 |
  3. |    2   1.15    1.15 |
  4. |    2    1.2     1.2 |
  5. |    2   1.29    1.29 |
     |---------------------|
  6. |    3   1.14    1.14 |
  7. |    4   1.12    1.12 |
  8. |    4   1.18    1.18 |
     +---------------------+

These look identical. Let's list out the cases where these two variables are equal to each other.

. list if infl == infl2

Oh dear! This is kind of perplexing. I would have expected all of the observations to have been displayed, but actually none are displayed. Let's try displaying the variables infl and infl2 but displaying many more digits after the decimal point.

. format infl infl2 %25.20f

. list ctry infl infl2

     +--------------------------------------------------------+
     | ctry                     infl                    infl2 |
     |--------------------------------------------------------|
  1. |    1   1.32000000000000010000   1.32000005245208740000 |
  2. |    1   1.39999999999999990000   1.39999997615814210000 |
  3. |    2   1.14999999999999990000   1.14999997615814210000 |
  4. |    2   1.20000000000000000000   1.20000004768371580000 |
  5. |    2   1.29000000000000000000   1.28999996185302730000 |
     |--------------------------------------------------------|
  6. |    3   1.13999999999999990000   1.13999998569488530000 |
  7. |    4   1.12000000000000010000   1.12000000476837160000 |
  8. |    4   1.17999999999999990000   1.17999994754791260000 |
     +--------------------------------------------------------+

Now we see why no observations were displayed. When we look at 20 digits after the decimal point, we can see that these numbers are not exactly the same. This is the nature of storing fractional numbers on computers: values are stored in binary, and most decimal fractions (such as 1.32) have no exact binary representation. There is usually a little bit of slop, but it is so tiny that it is rarely a problem. Consider the display command below that shows the value of 1/10. There is a tiny amount of imprecision.

. display %25.20f 1/10
0.10000000000000001000

The above number is shown using double precision. But consider the amount of imprecision if the number is displayed as a float using the float() function.

. display %25.20f float(1/10)
0.10000000149011612000
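The same float() function can be used to check the "about 7 digits" rule of thumb. The lines below are a minimal sketch that assumes nothing beyond the built-in display command, the float() function, and the %21x hexadecimal display format. The particular numbers are illustrative choices, not values from the example data.

```stata
* Whole numbers up to 16,777,216 (2^24) fit in a float exactly
display %12.0f float(16777216)

* The next whole number does not; it is rounded to a value a float can hold
display %12.0f float(16777217)

* A nine-digit number keeps only its leading digits when stored as a float
display %12.0f float(123456789)

* The %21x format shows the stored binary value without decimal rounding,
* so the double and float versions of 1/10 can be compared bit for bit
display %21x 1/10
display %21x float(1/10)
```

Because %21x shows the value exactly as stored, it can be a convenient way to see whether two numbers that print identically under %9.0g really are the same.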

This imprecision in double or float values is not a problem unless we try to compare two values that were stored with different precision. When you compare 0.1 (with double precision) to 0.1 (with float precision), the two values are not the same. What if we compare the value of a variable to a specific number? For example, let's list the observations where infl is equal to 1.12.

. list ctry infl if infl == 1.12

     +-------------------------------+
     | ctry                     infl |
     |-------------------------------|
  7. |    4   1.12000000000000010000 |
     +-------------------------------+

Now let's try the same comparison for infl2.

. list ctry infl infl2 if infl2 == 1.12

When we type a number (like 1.12 above), this is represented as a double precision value. So, this compares the value 1.12 (stored as double) with the value of 1.12 (stored as float), and none of the observations meet this condition. Instead, let's make this comparison by asking for 1.12 to be represented using float precision, by specifying float(1.12).

. list ctry infl infl2 if infl2 == float(1.12)

     +--------------------------------------------------------+
     | ctry                     infl                    infl2 |
     |--------------------------------------------------------|
  7. |    4   1.12000000000000010000   1.12000000476837160000 |
     +--------------------------------------------------------+

Now, the variable infl2 is represented with float precision, and 1.12 is represented with float precision, and the comparison successfully finds the equal value. Likewise, we can compare infl2 to float(infl) and we see that all of these are equal.

. list ctry infl infl2 if infl2 == float(infl)

     +--------------------------------------------------------+
     | ctry                     infl                    infl2 |
     |--------------------------------------------------------|
  1. |    1   1.32000000000000010000   1.32000005245208740000 |
  2. |    1   1.39999999999999990000   1.39999997615814210000 |
  3. |    2   1.14999999999999990000   1.14999997615814210000 |
  4. |    2   1.20000000000000000000   1.20000004768371580000 |
  5. |    2   1.29000000000000000000   1.28999996185302730000 |
     |--------------------------------------------------------|
  6. |    3   1.13999999999999990000   1.13999998569488530000 |
  7. |    4   1.12000000000000010000   1.12000000476837160000 |
  8. |    4   1.17999999999999990000   1.17999994754791260000 |
     +--------------------------------------------------------+
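With a larger dataset, listing every matching observation is impractical. One possible alternative, sketched below, is to use count and assert. The sketch builds its own three-observation dataset (the first three infl values from the listing, typed in by hand) so that it can be run on its own, and it declares the copy as float explicitly so that it does not depend on the current set type setting.

```stata
* Toy data: the first three infl values from the listing above
clear
input double infl
1.32
1.4
1.15
end
generate float infl2 = infl

* count reports how many observations satisfy the condition
count if infl2 == infl
count if infl2 == float(infl)

* assert says nothing when the condition holds for every observation
* and stops with an error otherwise
assert infl2 == float(infl)
```

The first count finds no matches and the second finds all three, which mirrors the two list commands shown earlier.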

Rather than demoting the precision of the created variable (infl2), we could have specified that we wanted the copy of infl to be stored as a double. Below we create infl3 that is stored with double precision.

. generate double infl3 = infl

Now, let's show the observations where infl is equal to infl3.

. format %25.20f infl3

. list ctry infl infl3 if infl3 == infl

     +--------------------------------------------------------+
     | ctry                     infl                    infl3 |
     |--------------------------------------------------------|
  1. |    1   1.32000000000000010000   1.32000000000000010000 |
  2. |    1   1.39999999999999990000   1.39999999999999990000 |
  3. |    2   1.14999999999999990000   1.14999999999999990000 |
  4. |    2   1.20000000000000000000   1.20000000000000000000 |
  5. |    2   1.29000000000000000000   1.29000000000000000000 |
     |--------------------------------------------------------|
  6. |    3   1.13999999999999990000   1.13999999999999990000 |
  7. |    4   1.12000000000000010000   1.12000000000000010000 |
  8. |    4   1.17999999999999990000   1.17999999999999990000 |
     +--------------------------------------------------------+
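A related point is that the double type has to be requested when the variable is created. The sketch below, which uses a made-up one-observation dataset with hypothetical variable names x and xf, suggests that promoting a float to double afterwards with recast does not bring back the digits that were already lost.

```stata
* One made-up observation: x is a double, xf is a float copy of it
clear
set obs 1
generate double x = 1.12
generate float xf = x

* Promote the float copy to double after the fact
recast double xf
format x xf %25.20f
list x xf

* The promoted copy still holds the float-rounded value
assert xf != x
assert xf == float(x)
```

In other words, recast double changes how much room the variable takes up, not the value that was stored in it.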

Alternatively, we could make double the default storage type for newly created variables by using the set type double command.

. set type double

After issuing this command, subsequent variables that are created (for the duration of this Stata session) would be created using type double. If we wanted to adopt this as a permanent setting, we could add the permanently option.

. set type double, permanently
(set type preference recorded)

You can return to the default setting with the following command.

. set type float, permanently
(set type preference recorded)
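To try the session-level setting without changing anything permanently, something like the following sketch could be used. It relies on c(type), which holds the current default storage type. The variable name y and the local macro name oldtype are made up for this illustration.

```stata
clear
set obs 1

* Remember the current default so it can be restored afterwards
local oldtype = c(type)
display "default type was: `oldtype'"

* New variables are now created as doubles unless a type is given
set type double
generate y = 1.12
describe y

* A double variable compares equal to the typed number 1.12
assert y == 1.12

* Put the session default back the way it was
set type `oldtype'
```

Because permanently is not specified, nothing here carries over to the next Stata session.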

You might be rightly concerned that storing all variables using double precision might be overkill. Does a dummy (0/1) variable need to be stored with 16 digits of precision? Does birth year (a whole number) need to be stored with 16 digits of precision? The answer, of course, is no. But, the compress command is very handy for taking variables and storing them with the smallest storage type that will not lead to any loss of information. Let's apply this to the current dataset.

. compress
ctry was float now byte
year was float now int

The variable ctry was stored as a float but is now a byte, and the variable year was a float and is now an int. Using set type double, permanently is, I think, a great way to store variables with the highest level of precision possible. Then, later, you can use the compress command to identify and convert variables to a more frugal storage type. For more details, you can see help data types and help set type.

You can download the example data files from this tidbit (as well as all of the other tidbits) as shown below. These commands will download all of the example data files into the current folder on your computer. (If you have done this before, you may need to specify net get stowdata, replace to overwrite the existing files.)

net from http://www.MichaelNormanMitchell.com/storage/stowdata
net get stowdata
If you have thoughts on this Stata Tidbit of the Week, you can post a comment. You can also send me an email at MichaelNormanMitchell and then the at sign and gmail dot com. If you are receiving this tidbit via email, you can find the web version at http://www.michaelnormanmitchell.com/ .

Reader Comments (3)

Readers may also want to try the statement -clonevar infl3=infl- and see what -list if infl == infl3- returns.

June 15, 2010 | Unregistered Commenter Martin Weiss

Other useful sources on this issue are:

William Gould (2006) "Mata matters: precision" The Stata Journal, 6(4):550-560.
http://www.stata-journal.com/article.html?article=pr0025

Nicholas J. Cox (2006) "Stata tip 33: Sweet sixteen: Hexadecimal formats and precision problems" The Stata Journal, 6(2): 282-283.
http://www.stata-journal.com/article.html?article=dm0022

Jean Marie Linhart (2008) "Mata matters: Overflow, underflow and the IEEE floating-point format" The Stata Journal, 8(2): 255-268.
http://www.stata-journal.com/article.html?article=pr0038

http://www.stata.com/support/faqs/data/prec.html

http://www.stata.com/support/faqs/data/float.html

http://www.ats.ucla.edu/stat/stata/faq/longid.htm

June 15, 2010 | Unregistered Commenter Maarten Buis

Dear Martin - Indeed, that is a good suggestion, and foreshadows the tidbit for next week :) .

Dear Maarten - Thank you so much for the great list of additional resources on the issues of precision and doubles vs. floats. I would encourage readers to check out these great links.

Thanks!

Michael Mitchell

June 15, 2010 | Unregistered Commenter Michael