Hey! Here’s a tutorial video for you R enthusiasts out there! Well, I assume that data scientists use whichever tools the job at hand requires. In this very nice R tutorial you will learn how to carry out negative binomial regression using the R statistical programming environment. Enjoy!
I recently asked on Twitter whether there are any good sources of free and open data to use for learning Python (and R):
— freddy (@freddy1876) April 30, 2016
In this post I will list the suggestions I have received so far.
- Awesome Public Datasets: A huge collection of public datasets. Categorized by field (e.g., biology, economics, machine learning).
- UCI Machine Learning Repository: “…currently maintain 349 data sets as a service to the machine learning community”
- https://www.kaggle.com/datasets: Also a list of publicly available datasets.
- Government data: governmental data. Everything from agriculture to science & research. Very interesting.
- Google Public Data: Huge collection of different data sources that are public. Seems really nice.
- Amazon public data sets: “AWS hosts a variety of public data sets that anyone can access for free.” Seems interesting.
- MovieLens: “Learn more about movies with rich data, images, and trailers. Browse movies by community-applied tags, or apply your own tags. Explore the database with expressive search tools.” MovieLens is not really a data source in the way that I asked for. However, the suggestion was that one could use the movie ratings to learn Hadoop/Spark/MapReduce. I may give this a try, if I ever get time.
These were the data sources that people on Twitter suggested in reply to my tweet. I have myself found this one very interesting: Open Psychology data. This is a journal that describes open and re-usable psychology data. If you are interested in playing around with personality data it can be found here. Finally, the APA has links to open data sets: Data Links.
I know, the title is wrong: I gave you a huge amount of different data sources to use. Some may contain overlapping links to data but I would assume that we now have data to play around with for quite some time. Do you know any more data sources that are open and free? Please leave a comment!
A few weeks ago, as I was doodling with some Companies House director network mapping code and simple Companies House chatbot ideas, I tweeted an example of Iron Maiden’s company structure based on co-director relationships. Depending on how the original search is seeded, the maps may also include elements of band members’ own personal holdings/interests. The […]
James (1890, pp. 403-404):
”Everyone knows what attention is. It is the taking possession of the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought. Focalization, concentration of consciousness are of its essence. It implies withdrawal from some things in order to deal effectively with others …”
A little more recently, Shiffrin (1988, p. 739):
“Attention has been used to refer to all those aspects of human cognition that the subject can control … and to all aspects of cognition having to do with limited resources or capacity, and methods of dealing with such constraints”
And even more recently, Cowan (1995) writes about ‘selective attention’ in a sense that is close to James’ definition of attention. Selective attention is not necessarily voluntary. Selective attention is also a limited-capacity process.
Major Confounding factors
Maturation – Mainly concerns longitudinal studies (and children) – as subjects grow older between pre- and posttreatment/test it may affect the results. Children, for instance, might get more sophisticated, more experienced, bigger, stronger, and so on, as they age. Natural maturation also happens in other subjects. When in a new environment, adults make predictable changes or adjustments over time. Diseases usually have predictable courses. As a result, observed changes over time may be due to maturation rather than the independent variable.
History – During the course of a study, independent events that will affect the outcome can occur. Generally, threats to internal validity are due to history when there are long times between pre- and posttest measurements.
Testing – repeated testing of participants can threaten the internal validity, because the participants might get more skilled through repeated training on the measurement instrument.
Instrumentation – Findings can be due to changes in the measuring instrument over time rather than due to the independent variable (IV).
Regression to the Mean – when subjects are selected because their scores on a measure are extremely high or low, they are usually not that extreme in a second testing. That is, their scores will regress toward the mean. The amount of regression is contingent upon how much the performance on the test is due to variable factors. These variable factors can be, e.g., amount of study. More variable factors equals more regression.
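Regression to the mean is easy to see in a small simulation. The sketch below is only an illustration under assumptions of my own: each score is modeled as a stable ability plus a random "variable factor", and all the numbers (means, standard deviations, the top 10 % cut-off) are made up for the example.

```python
import random
import statistics

def simulate_regression_to_mean(n=10_000, noise_sd=10.0, top_fraction=0.1, seed=1):
    """Return the mean first and second score of the top scorers on test 1.

    Hypothetical model: score = stable ability + variable factor (noise).
    """
    rng = random.Random(seed)
    ability = [rng.gauss(100, 10) for _ in range(n)]        # stable part
    test1 = [a + rng.gauss(0, noise_sd) for a in ability]   # first testing
    test2 = [a + rng.gauss(0, noise_sd) for a in ability]   # second testing

    # Select the participants with the most extreme (highest) first scores.
    k = int(n * top_fraction)
    top = sorted(range(n), key=lambda i: test1[i], reverse=True)[:k]

    mean1 = statistics.mean(test1[i] for i in top)
    mean2 = statistics.mean(test2[i] for i in top)
    return mean1, mean2

# Small vs. large contribution of variable factors to the score.
low_noise = simulate_regression_to_mean(noise_sd=2.0)
high_noise = simulate_regression_to_mean(noise_sd=15.0)
print("low noise :", low_noise)
print("high noise:", high_noise)
```

In both runs the selected group scores closer to the overall mean the second time, and the drop is larger when the variable factors account for more of the score.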
Selection – This confounding factor appears when, for instance, comparing groups that are not equivalent before the manipulation begins.
Attrition – Attrition occurs when participants drop out of the study due to some biasing factor. For instance, if participants drop out from one group but not from another (or not as much), one can lose important characteristics etc. It is important not to create situations or use procedures that can bias some participants against completing the study, thereby changing the outcome.
Diffusion of Treatment – If participants from different experimental conditions are able to talk with each other, some can expose the procedures to others. Test participants might talk to control participants, who might not be aware that they are in a control group. These types of information exchanges are called diffusion of treatment and can affect the data such that the differences between groups disappear.
Sequence effects – experiences with one condition might affect responses to later conditions. If the condition order is always ABC, systematic confounding can occur. For instance, performance in B and C might reflect both the effect of the condition and the effect of having already been exposed to A. To get rid of sequence effects one uses more than one order.
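Using more than one order can be sketched in a few lines of Python. This is a minimal illustration of complete counterbalancing, not a prescribed procedure: the function names and the participant labels are hypothetical, and with many conditions one would probably use a subset of orders instead of all of them.

```python
import itertools
import random

def counterbalanced_orders(conditions):
    """Return every possible presentation order of the conditions."""
    return [list(order) for order in itertools.permutations(conditions)]

def assign_orders(participants, conditions, seed=0):
    """Spread participants as evenly as possible over all orders."""
    orders = counterbalanced_orders(conditions)
    shuffled = list(participants)
    random.Random(seed).shuffle(shuffled)  # random who-gets-which-order
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}

orders = counterbalanced_orders(["A", "B", "C"])
plan = assign_orders([f"p{i}" for i in range(12)], ["A", "B", "C"])
for participant, order in sorted(plan.items()):
    print(participant, "".join(order))
```

With three conditions there are six orders, so twelve participants give two participants per order.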
Subject and Experimenter Effects
Expectations and biases of both the experimenter and the subjects can systematically affect the results of a study in subtle ways, thus reducing validity of the study.
Subject Effects – Participants in an experiment are not completely naïve. That is, they will have understandings, ideas and maybe misunderstandings about what to expect in the study. Different people have different reasons for participating. These reasons can be money, course credit, etc. Others might participate because they hope to learn something. Participants volunteer and carry out their role based on different motivations, understandings, expectations, and biases, which all can affect the outcome of a study. An experimental setting is not natural. When being observed, people might behave differently than if they were not observed. This can lead to subject effects. Subject effects refer to any changes in behavior that are due to being part of an experiment rather than to the experimental variables. Demand characteristics are cues that tell participants how they are expected to behave (according to hypotheses, etc). Demand characteristics usually occur unintentionally. The placebo effect, a related phenomenon, occurs when participants are expecting a specific effect.
Experimenter effects – concern any biasing effects that are due to the actions of the researcher. Experimenter expectancies – the experimenter’s expectations about the outcome of the study. These expectations might cause researchers to bias results in many ways. The experimenter can influence the participants’ behavior in favor of the hypotheses, cherry-pick data and statistical methods, and interpret results in a biased manner.
Examples of ways the experimenter can influence the participant: presenting cues in the form of intonation, facial expressions, or changes in posture, verbally reinforcing some responses and not others, or incorrectly recording participants’ responses.
Pre-posttest with control group controls for history and maturation.
- Systematic between-groups variance
- Difference between groups could be due to
- Effect of the independent variable (experimental variance which is what we want!)
- Effects of confounding variables (extraneous variance)
- A combination of the two effects above
Natural variability due to sampling error will also increase the between-groups variability somewhat.
- Nonsystematic Within-Groups Variance
- Error Variance – non-systematic within-groups variability.
Due to random factors affecting some participants more than others within a group rather than systematically affecting all members of a group. Error variance can be increased by factors that are not stable, such as a participant feeling ill or uncomfortable participating… Experimenter and equipment variations can also cause measurement errors for some participants.
“In experimentation, each study is designed so as to maximize experimental variance, control extraneous variance, and minimize error variance.”
Maximizing experimental variance. Experimental variance is due to the effect of the independent variables (IV) on the dependent variables (DV). At least two levels of the IV should be present in an experiment. Experimental conditions need to be distinct! It can be useful to have a manipulation check to see that the manipulation had the planned effect on the participants. One way to check is to use ratings.
To efficiently control for extraneous variables and minimize their possible different effects on the groups we must be sure that (1) the two groups (experimental and control) are AS similar as possible, (2) the groups are treated in exactly the same way EXCEPT for the IV manipulation.
Ways to control extraneous variance:
- Random assignment to groups decreases probability that the groups will differ – Best method
- Homogeneous sample
- Confounding variables can be built into the experiment as an additional IV
- Matching or within-subjects design
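Random assignment, the first method in the list above, is simple to do in code. The sketch below is one possible way, assuming two equally sized groups; the function name, the group labels, and the participant IDs 1 to 20 are made up for the example.

```python
import random

def randomly_assign(participants, groups=("experimental", "control"), seed=None):
    """Shuffle the participants and deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    assignment = {g: [] for g in groups}
    for i, participant in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(participant)
    return assignment

# Hypothetical participant IDs 1..20; the seed only makes the example repeatable.
assignment = randomly_assign(range(1, 21), seed=42)
for group, members in assignment.items():
    print(group, sorted(members))
```

Dealing out a shuffled list keeps the group sizes equal (or within one of each other), which a coin flip per participant would not guarantee.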
Minimizing Error Variance.
Large error variance can hide differences between conditions that are due to the experimental manipulations. Measurement error is one source of error variance. If participants do not respond consistently from trial to trial due to such factors, the instrument is unreliable. To minimize sources of error variance, use carefully controlled conditions of measurement and reliable instruments. Another source of error variance is individual differences. This type of variance is minimized by within-subjects designs.
Experimental designs – Randomize when possible!
The four basic designs to test single IV using independent groups:
- Randomized, posttest-only, control-group design
Here we have two groups: Group A and Group B. The groups are compared in the posttest only. This is done to test the hypothesis that the IV affects the dependent measurements.
Random selection will protect external validity. Furthermore, attrition and regression to the mean are also reduced by random assignment of participants (i.e., both groups will have [roughly] the same amount of extremes). Threats to internal validity from instrumentation, history, and maturation are minimized due to the inclusion of a control group.
- Randomized, pretest-posttest, control-group design
An improvement on the randomized, posttest-only, control-group design (the one above). Pretreatment/test
- Multilevel, completely randomized, between-subjects design
- Solomon’s four-group design. Pretests will affect participants’ responses to the treatment or to the posttest. Pretest can interact with the experimental manipulation which will produce confounding interaction effects.
The t-test evaluates the size of the difference between the means of the two groups. The difference between the two means is divided by an error term. The error term is a function of the variance of the scores within each group and the sample sizes. Easily applied, common, and useful for testing differences between two groups.
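To make the "difference divided by an error term" idea concrete, here is a minimal sketch of the independent-samples t statistic. It assumes the classic pooled-variance (equal variances) version of the test, and the two small data sets are invented for the example.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n_a, n_b = len(group_a), len(group_b)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)

    # Pooled within-group variance, weighted by each group's degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)

    # Error term: depends on the within-group variance and the sample sizes.
    error_term = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return mean_diff / error_term, n_a + n_b - 2

treatment = [5, 6, 7, 8, 9]   # made-up scores
control = [1, 2, 3, 4, 5]     # made-up scores
t, df = independent_t(treatment, control)
print(f"t({df}) = {t:.2f}")
```

For these numbers the mean difference is 4 and the error term is 1, so t is 4 with 8 degrees of freedom. In practice one would look up (or compute) the p-value for that t and df, for instance with a statistics package.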
Analysis of Variance (ANOVA)
For multilevel designs with more than two groups. One-way ANOVA – only one independent variable. ANOVA uses both the within-groups variance and the between-groups variance. Within-groups variance is a measure of nonsystematic variation within a group – error or chance variation among individual participants within a group. It is due to factors such as individual differences and measurement errors. Between-groups variance represents how variable the group means are. It is a measure of both systematic factors that affect the groups differently and variation due to sampling error. The systematic factors include experimental variance and extraneous variance. Approximately equal group means give a small between-groups variance; large differences in group means give a large between-groups variance.
The F-test is used to get statistical significance from an ANOVA. The F-test involves the ratio of the between-group mean square to the within-groups mean square.
F= mean square between groups/mean square within groups
The ratio can be increased either by increasing the between-groups mean square or by decreasing the within-groups mean square. The between-groups mean square increases by maximizing the differences between groups. The within-groups mean square is minimized by controlling as many potential sources of random error as possible. Maximization of experimental variance and minimization of error variance is what we want!
We do not reject the hypothesis that there are no systematic differences between groups UNLESS the F-ratio is larger than we would expect by chance alone.
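The F-ratio for a one-way ANOVA can be computed by hand in a few lines. The sketch below only illustrates the ratio of mean squares described above; the three small groups are invented, and it stops at the F value (no p-value).

```python
import statistics

def one_way_f(groups):
    """Return (F, mean square between groups, mean square within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.mean(x for g in groups for x in g)

    # Between groups: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each score lies from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, ms_between, ms_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # made-up scores for three conditions
f_ratio, ms_between, ms_within = one_way_f(groups)
print(f"F = {f_ratio:.2f} (MS between = {ms_between:.2f}, MS within = {ms_within:.2f})")
```

Here the mean square between groups is 13 and the mean square within groups is 1, so F is 13. Pulling the group means further apart raises the numerator, and less noisy scores within each group shrink the denominator.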
UPDATE: I found an exceptional post on how to do one-way ANOVA using Python. In fact, there are 4 different Python methods for doing a Python ANOVA: One-Way ANOVA in Python.
Planned comparisons are done to probe possible significant differences between the means. The F-ratio will only tell us that there IS a difference, not in which direction or between which groups. Finding that out is done by means of planned comparisons (also called a priori comparisons or contrasts).
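A planned contrast can be sketched like this. This is a minimal illustration assuming the usual textbook form, where the contrast weights sum to zero and the within-groups mean square from the ANOVA is used as the error term; the data and the weights are made up.

```python
import math
import statistics

def planned_contrast(groups, weights):
    """t statistic and degrees of freedom for a contrast between group means."""
    if len(groups) != len(weights):
        raise ValueError("need exactly one weight per group")
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")

    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [statistics.mean(g) for g in groups]

    # Error term based on the within-groups mean square.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_within = ss_within / (n_total - k)

    estimate = sum(w * m for w, m in zip(weights, means))
    std_error = math.sqrt(ms_within * sum(w ** 2 / len(g) for w, g in zip(weights, groups)))
    return estimate / std_error, n_total - k

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # made-up scores for three conditions
# Hypothetical a priori question: does group 3 differ from the average of groups 1 and 2?
contrast_t, contrast_df = planned_contrast(groups, [-1, -1, 2])
print(f"t({contrast_df}) = {contrast_t:.2f}")
```

The sign of t gives the direction of the difference, which is exactly what the overall F-ratio cannot tell us.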
Very mind blowing and intriguing discussions on the mind. A must see!