Statistics is a type of data analysis whose practice includes planning, summarizing, and interpreting observations of a system, possibly followed by predicting or forecasting future events based on a mathematical model of the system being observed. Statistics is a branch of applied mathematics, grounded in statistical theory, which uses probability theory in its mathematical models. Because these models are probabilistic, statistical results cannot by themselves establish definitive cause-and-effect relationships; they can only show correlations.
The basic tenet of statistics is that a population can be represented by a sample when the sample is sufficiently large and is composed of units (persons, components, or whatever else makes up the population) selected at random from the population. Statistical theory provides methods for determining how large a sample must be to yield statistically significant results.
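As one illustration of such a method, a common textbook formula for the sample size needed to estimate a proportion to within a margin of error e is n = z^2 p(1 - p) / e^2, where z is the normal critical value for the chosen confidence level. The sketch below assumes simple random sampling from a large population and a normal approximation; the function name and default values are illustrative assumptions, not taken from the text above.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95,
                               expected_proportion=0.5):
    """Approximate sample size for estimating a proportion (hypothetical helper).

    Assumes simple random sampling from a large population.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # p = 0.5 maximizes p * (1 - p), so it is the most conservative choice.
    p = expected_proportion
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Round up: a fractional respondent is not possible.
    return math.ceil(n)


# A margin of error of 3 percentage points at 95% confidence.
print(sample_size_for_proportion(0.03))
```

Under these assumptions the required sample size depends on the desired precision rather than on the size of the population, which is why national polls can use samples of roughly a thousand respondents.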
There are two major types of statistics, descriptive statistics and inferential statistics. Descriptive statistics describe or summarize the observed measurements of a system. Inferential statistics are used to infer, predict, or forecast future outcomes, tendencies, and behaviors of a system.
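The distinction can be sketched in a few lines of code. The measurements below are made up for illustration, and the confidence interval uses a normal approximation; for a sample this small a t-distribution would usually be more appropriate, so this is a sketch rather than a recommended procedure.

```python
import math
from statistics import NormalDist, mean, stdev


def describe(values):
    """Descriptive statistics: summarize the observations themselves."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


def mean_confidence_interval(values, confidence=0.95):
    """Inferential statistics: estimate a range for the population mean.

    Uses a normal approximation (an assumption of this sketch).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return center - half_width, center + half_width


# Hypothetical measurements, not real data.
measurements = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7]

print(describe(measurements))
print(mean_confidence_interval(measurements))
```

The first function says something only about the eight observed values; the second makes a hedged claim about the larger system from which they were drawn.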
Statistics are used in a wide variety of academic disciplines, especially the social sciences, the biological sciences, and other areas of study involving complex systems. Statistics are used in business for statistical process control, quality control, marketing, and other day-to-day activities. Statistics are used in sports to describe the expertise and abilities of participants. Statistics are used in government for a variety of purposes. The best-known statistical procedure performed by government is the census, in which various statistics are collected about the national population.
Polling a sample of a population is a typical use of statistics by political parties and the news media to gauge popular opinion on particular topics. The respondents (the people answering the questions of the poll) must be selected at random from the general population for the sample to have a chance of representing the population from which it is drawn.
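A minimal simulation shows the idea of random selection. The population here is invented (52% support is an arbitrary choice), and the function name and seed are assumptions made for this sketch.

```python
import random


def poll(population, sample_size, seed=None):
    """Estimate the share of supporters from a simple random sample."""
    rng = random.Random(seed)
    # Every member of the population is equally likely to be selected.
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


# Hypothetical population: 1 = supports the proposal, 0 = does not.
population = [1] * 5200 + [0] * 4800

estimate = poll(population, 1000, seed=1)
print(f"Estimated support: {estimate:.1%} (true value in this simulation: 52.0%)")
```

If respondents were instead chosen in a way that favors one group, for example only the first 1,000 entries of the list, the estimate would no longer reflect the population.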
Statistics is a partial foundation of the relatively new field of data mining, which uses a combination of statistics, pattern recognition, artificial intelligence, and other algorithms to find meaningful information in large sets of data.
Origin
The word statistics ultimately derives from the modern Latin term statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state. It acquired the meaning of the collection and classification of data generally in the early nineteenth century. It was introduced into English by Sir John Sinclair. Thus, the original principal purpose of statistics was to supply data for use by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services; in particular, censuses provide regular information about the population. Today, however, the use of statistics has broadened far beyond the service of a state or government to include business, the natural and social sciences, and medicine, among other areas.
Statistical methods
The basic goal of a statistical research project is to draw a conclusion about the effect of changes in an independent variable on a dependent variable. There are two major types of statistical studies: experimental studies and post-facto (after-the-fact) studies. In both types, the effect of changes in an independent variable on the behavior of the dependent variable is observed. The difference between the two lies in how the study is conducted.
An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation may have modified the values of the measurements. A post-facto study involves reviewing existing data and making a determination about a correlation between two measurements.
An example of an experimental study is the famous Hawthorne studies, which attempted to test changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in whether increased illumination would increase the productivity of the assembly line workers. They first measured productivity in the plant, then modified the illumination in one area of the plant to see whether the change would affect productivity. Because of errors in experimental procedure, specifically the lack of a control group, the researchers could not draw the conclusion they had planned, but their work gave the world the Hawthorne effect: the tendency of subjects to change their behavior because they know they are being observed.
An example of a post-facto study is a study that explores the correlation between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis on them. In this case, the researchers would collect observations of both smokers and non-smokers and then compare the number of cases of lung cancer in each group.
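The comparison of two groups can be sketched as follows. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the function computes the rate in each group, their ratio (the relative risk), and Pearson's chi-square statistic for the 2x2 table without a continuity correction.

```python
def two_by_two_summary(cases_a, total_a, cases_b, total_b):
    """Compare the rate of an outcome in two groups (illustrative helper)."""
    risk_a = cases_a / total_a
    risk_b = cases_b / total_b

    # Rows are groups; columns are (outcome present, outcome absent).
    table = [
        [cases_a, total_a - cases_a],
        [cases_b, total_b - cases_b],
    ]
    n = total_a + total_b
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    # Pearson chi-square: sum of (observed - expected)^2 / expected,
    # where "expected" assumes the outcome is unrelated to the group.
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (table[i][j] - expected) ** 2 / expected

    return {
        "risk_a": risk_a,
        "risk_b": risk_b,
        "relative_risk": risk_a / risk_b,
        "chi_square": chi_square,
    }


# Hypothetical counts, NOT real data: 30 cases among 1000 smokers,
# 10 cases among 1000 non-smokers.
result = two_by_two_summary(30, 1000, 10, 1000)
print(result)
```

A chi-square value above about 3.84 (one degree of freedom, 5% significance level) suggests that the difference between the groups is unlikely to be due to chance alone. As with any post-facto study, such a result shows an association and does not by itself establish cause and effect.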
There are four types of measurements, or measurement scales, used in statistics. The four types or levels of measurement (nominal, ordinal, interval, and ratio) have different degrees of usefulness in statistical research. Ratio measurement, in which both a meaningful zero value and the distances between measurements are defined, provides the greatest flexibility in the statistical methods that can be used for analysing the data. Interval measurement, with meaningful distances between measurements but no meaningful zero value (such as IQ measurements or temperature measurements in degrees Celsius), is also used in statistical research.
The basic steps of any statistical research project are to:
- plan the research including determining information sources, research subject selection, and ethical considerations for the proposed research and method,
- design the experiment concentrating on the system model and the interaction of independent and dependent variables,
- summarize a collection of observations to feature their commonality by suppressing details (descriptive statistics),
- reach consensus about what the observations tell us about the world we observe (statistical inference),
- document the results of the study.
Some well-known statistical tests and procedures for research observations are:
Probability
The probability of an event is usually defined as a number between zero and one. In reality, however, virtually nothing has a probability of exactly 1 or 0. You could say that the sun will certainly rise in the morning, but what if an extremely unlikely event destroys the sun? What if there is a nuclear war and the sky is covered in ash and smoke?
We often round the probability of such things up or down because they are so likely or so unlikely to occur that it is easier to treat them as having a probability of one or zero.
However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10^-4 and a probability of 10^-9, despite the very practical difference between them. If you expect to cross the road about 10^5 or 10^6 times in your life, then reducing your risk of being run over per road crossing to 10^-9 will make you safe for your whole life, while a risk per road crossing of 10^-4 will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.
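The arithmetic behind this comparison can be checked directly. If each crossing carries the same risk p and crossings are treated as independent (a simplifying assumption of this sketch), the probability of at least one accident in n crossings is 1 - (1 - p)^n. The function name is illustrative.

```python
import math


def lifetime_risk(per_event_risk, n_events):
    """Probability of at least one accident in n independent events.

    Computes 1 - (1 - p)^n using log1p/expm1 to stay accurate when p is tiny.
    """
    return -math.expm1(n_events * math.log1p(-per_event_risk))


for p in (1e-4, 1e-9):
    for n in (10**5, 10**6):
        print(f"p = {p:g}, n = {n:d}: lifetime risk = {lifetime_risk(p, n):.6f}")
```

With a per-crossing risk of one in ten thousand, an accident over a lifetime of crossings is almost certain; with a risk of one in a billion, the lifetime risk stays at or below roughly one in a thousand.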
Using a prior probability of 0 (or 1) causes problems in Bayesian statistics, since the posterior probability is then forced to be 0 (or 1) as well, whatever the data show. In other words, the data are not taken into account at all. As Lindley puts it, if a coherent Bayesian attaches a prior probability of zero to the hypothesis that the Moon is made of green cheese, then even whole armies of astronauts coming back bearing green cheese cannot convince him. Lindley advocates never using prior probabilities of 0 or 1. He calls this Cromwell's rule, after a letter Oliver Cromwell wrote to the General Assembly of the Church of Scotland on 3 August 1650, in which he said, "I beseech you, in the bowels of Christ, think it possible you may be mistaken."
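The effect follows directly from Bayes' theorem, in which the posterior is proportional to the prior multiplied by the likelihood. The sketch below uses invented likelihood values for the green cheese example; only the structure of the calculation matters.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' theorem for a single hypothesis and its negation.

    prior: probability of the hypothesis before seeing the evidence.
    likelihood_if_true: probability of the evidence if the hypothesis holds.
    likelihood_if_false: probability of the evidence if it does not hold.
    Raises ZeroDivisionError if the evidence has total probability zero.
    """
    numerator = prior * likelihood_if_true
    evidence = numerator + (1 - prior) * likelihood_if_false
    return numerator / evidence


# Hypothetical numbers: astronauts return with green cheese, which is
# very likely if the Moon is made of cheese and very unlikely otherwise.
dogmatic = posterior(0.0, 0.99, 1e-9)      # prior of exactly zero
open_minded = posterior(1e-6, 0.99, 1e-9)  # tiny but non-zero prior

print(dogmatic)
print(open_minded)
```

A prior of exactly zero multiplies the evidence away, so the posterior stays at zero no matter how strong the evidence is, while even a one-in-a-million prior can be overturned by sufficiently strong evidence.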
Important contributors to statistics
See also list of statisticians.
Specialized disciplines
Some sciences use applied statistics so extensively that they have specialized terminology. These disciplines include:
Statistics is also a key tool in business and manufacturing. It is used to understand the variability of measurement systems, to control processes (as in statistical process control, or SPC), to summarize data, and to make data-driven decisions. In these roles it is a key tool, and perhaps the only reliable one.
Software
Modern statistics relies on computers to perform some of the very large and complex calculations required.
Whole branches of statistics, such as neural networks, have been made possible by computing.
The computer revolution has implications for the future of statistics, with a new emphasis on 'experimental' statistics.
Statistical packages in common use include:
References
Lindley, D. (1985). Making Decisions, 2nd edition. John Wiley. ISBN 0471908088