Data Analysis #1: Data Collection and Sampling

May 19, 2010 10:47

The introduction was pleasant. The presenter is funny and well spoken, and my impression is that he knows his stuff and wants to make it easy for us to learn it. He was very clear about his expectations for attendance, assignments, and the final exam, and gave us a lot of ways to contact him. He loses points[1] for using 'gestapo' to describe the exam monitors, printing half a tree of handouts, not using the smart-board (he prefers white-boards and has a vast pen collection), and for being relentlessly blokey.

The text is Statistics for Managers Using Microsoft Excel, and you need an Excel plugin called PHStat, which I will muck around with this evening. I look forward to dumping a really big dataset in it and clicking 'Go.'

Why do Managers need to know about Sadistics Statistics?
  • Presenting information
  • Drawing conclusions about information
  • Forecasting information
  • Using information to improve things (like picking out problem areas of productivity that are outside normal deviation and resolving them)
Yes, you're surprised too, aren't you :)

Terminology:
  • A population (universe) is the collection of units under consideration. eg Entire Australian population
  • A sample is a portion of the population selected for analysis. eg MBA students in Australia
  • A parameter is a summary measure computed to describe a characteristic of the population. eg: the proportion of the entire Australian population that is female
  • A statistic is a summary measure computed to describe a characteristic of the sample eg: How many Female MBA students in Australia graduate within 3 years
  • A dataset can be both a population and a sample, depending on the question being asked.
Thankfully these terms make sense to me so I'm not going to be frantically trying to translate them from 'natural language' to 'statistics' in my head all the time.
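
To pin the terminology down, here is a minimal Python sketch. It is not from the lecture: the ages are made up, and the names `population`, `sample`, `population_mean` and `sample_mean` are just illustrative. It computes a parameter from a whole population and the matching statistic from a sample of it:

```python
import random
import statistics

# Hypothetical population: made-up ages for every unit under consideration.
rng = random.Random(42)
population = [rng.randint(18, 90) for _ in range(10_000)]

# Parameter: a summary measure computed from the whole population.
population_mean = statistics.mean(population)

# Sample: a portion of the population selected for analysis.
sample = rng.sample(population, 200)

# Statistic: the same summary measure, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers will usually be close but not identical, which is the whole reason the distinction matters.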

Data sources:
  • Primary
    • Observation - look at it
    • Experimentation - poke it
    • Survey - ask it questions
  • Secondary
    • Print - read it
    • Electronic - read it some more but with less photocopying
Types of Variables:
  • Categorical (qualitative)
    • Nominal (has no logical ranking eg: eye colour)
    • Ordinal (ranked eg: Likert scale)
  • Numerical (quantitative)
    • Interval (differences between values are meaningful but there is no true zero eg: temperature in degrees Celsius)
    • Ratio (has a true zero, so ratios are meaningful eg: number of people in a room)
Pause for a class exercise to practice identifying variables: which variables would you investigate to predict median house prices in a suburb?

Ways to slice data:
  • Time-Series Data: values recorded in a meaningful sequence such as days, quarters or years.
    • Y (forecast variable) = T x S x C x I
    • T = trend (is the response to some Facebook photos linked to age?)
    • S = seasonal (is more fanfiction written on hiatus?)
    • C = cyclical (is porn writing in fandom cyclical?)
    • I = irregular / random (acts of god - or terrorism)
  • Cross-Sectional Data: values recorded at a single point in time, with no meaningful sequence, such as sales figures for multiple companies
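
The multiplicative model Y = T x S x C x I can be sketched in a few lines of Python. Everything here is an assumption for illustration: the eight quarters of numbers are made up, and the variable names are my own rather than anything from the lecture or PHStat.

```python
# Made-up components for 8 quarters; none of these numbers come from the lecture.
trend = [100 + 5 * t for t in range(8)]                     # T: steady long-run growth
seasonal = [1.2, 0.9, 0.8, 1.1] * 2                         # S: repeats every 4 quarters
cyclical = [1.0, 1.0, 1.05, 1.05, 1.0, 1.0, 0.95, 0.95]     # C: slow multi-period swing
irregular = [1.0, 1.02, 0.98, 1.0, 1.01, 0.99, 1.0, 1.0]    # I: random noise

# Multiplicative model: Y = T x S x C x I
y = [t * s * c * i for t, s, c, i in zip(trend, seasonal, cyclical, irregular)]

# Because the model is multiplicative, dividing Y by three of the components
# gets back the one that is left (here, the seasonal factor).
recovered_seasonal = [
    value / (t * c * i) for value, t, c, i in zip(y, trend, cyclical, irregular)
]

for quarter, value in enumerate(y, start=1):
    print(f"quarter {quarter}: Y = {value:.1f}")
```

In real data you only observe Y, so the components have to be estimated rather than divided out exactly; the sketch only shows how the pieces multiply together.
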
Discussion of the project assignment (2,000 words): we can pick any data set or create it ourselves, and can work in pairs or solo. Examples of previous assignment topics include: house buying, bank queue waiting times, wine sales, first week of takings for Johnny Depp movies, Eurovision rankings and bar waiting times - I'm picking out the more amusing ones here.

Sampling Methods:
  • Probability Samples
    • Simple Random (lotto!) - simple to use but may not give a good representation of the population
    • Systematic (grab every 5th person)
    • Stratified (split the population into groups by some significant quality eg: gender, nationality, location, then sample randomly from each group) - may be time consuming and costly
    • Cluster (split the population into clusters that each represent the larger population, then randomly select whole clusters) - may be more cost-effective, but less efficient
  • Non-Probability Samples
    • Judgment
    • Quota (must ask 1,000 people!)
    • Chunk
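
A rough Python sketch of the four probability sampling methods, using only the standard `random` module. The population of 100 people, their `gender` and `suburb` fields, and all the function names are hypothetical and only there for illustration. The non-probability methods depend on human judgment, so they are not sketched.

```python
import random

rng = random.Random(0)

# Hypothetical population of 100 people, each with a made-up gender and suburb.
population = [
    {"id": i, "gender": "F" if i % 2 == 0 else "M", "suburb": f"suburb_{i % 10}"}
    for i in range(100)
]

def simple_random(pop, n):
    """Every unit has the same chance of being selected (lotto!)."""
    return rng.sample(pop, n)

def systematic(pop, k):
    """Pick a random start within the first k units, then take every k-th unit."""
    start = rng.randrange(k)
    return pop[start::k]

def stratified(pop, key, n_per_stratum):
    """Split the population into strata by `key`, then sample randomly within each."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, n_per_stratum))
    return picked

def cluster(pop, key, n_clusters):
    """Split the population into clusters by `key`, then keep whole random clusters."""
    clusters = {}
    for unit in pop:
        clusters.setdefault(unit[key], []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

srs = simple_random(population, 10)
every_fifth = systematic(population, 5)
by_gender = stratified(population, "gender", 5)
two_suburbs = cluster(population, "suburb", 2)
```

Note the difference between the last two: stratified sampling takes a few people from every group, while cluster sampling takes everybody from a few groups.
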
Stuff to review for this week: 1.2, 1.3, 2.1, 7.1 & 7.2; do questions 2.6 and 2.12 (p. 47) & 7.12 (p. 259)
Stuff to read for next week: 2.4 - 2.7 & 3.1 - 3.4

Then we ran away.

[1] Points will be added/subtracted continuously; the current score is +3. The purpose of my points system is to measure first impressions versus final impressions. Stay tuned for the 12 week update :p.
This entry was originally posted at http://samvara.dreamwidth.org/472695.html, where there are
comments.

sotbi:data analysis, sent off to be improved
