Solutions to Assignment 2: Data Warehousing / Data Preparation
Start date 28 January, due 4 February at the beginning of class.
Exercises from the Book
Complete the following exercises from Chapters 2 and 3 of the book.
- 2.5 a) Star schema. Some people missed that spectator was defined
to have a rate and type - attributes weren't given for the others.
(1 point)
b) Slice/dice on spectator.type=student, location=GM_Place, and year=2000;
then roll up to remove game and date. (1 point)
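
As a rough illustration of 2.5 b), the sketch below applies the dice and then the roll-up to a handful of made-up fact rows. The rows, the field layout, and the `charge` measure are assumptions for illustration only; none of them come from the exercise itself.

```python
# Hypothetical fact rows: (date, year, game, location, spectator_type, charge).
# Every value here is invented for illustration.
facts = [
    ("2000-03-01", 2000, "hockey", "GM_Place", "student", 20.0),
    ("2000-03-01", 2000, "hockey", "GM_Place", "adult", 45.0),
    ("2000-05-12", 2000, "basketball", "GM_Place", "student", 15.0),
    ("2000-05-12", 2000, "basketball", "Other_Arena", "student", 12.0),
    ("2001-01-20", 2001, "hockey", "GM_Place", "student", 22.0),
]

# Slice/dice: keep only student spectators at GM_Place in 2000.
diced = [
    row for row in facts
    if row[4] == "student" and row[3] == "GM_Place" and row[1] == 2000
]

# Roll-up: remove the game and date dimensions by summing the measure over them.
total_charge = sum(row[5] for row in diced)
print(total_charge)
```

Only two of the five rows survive the dice, and the roll-up collapses them into a single total.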
- 2.9 a) Many good examples. Most were based on fine time granularities,
but there are others - location granularity was another. (1 point)
- 3.2: See Han Section 3.2.1. I was looking to see some level of
independent understanding. (1 point)
- 3.3 a) Bins and means (rounded to the nearest integer):
(13+15+16)/3 = 14.67, rounded to 15; (16+19+20)/3 = 18.33, rounded to 18;
then 21, 24, 27, 34, 35, 40, 56. Replace each value by its bin mean:
15, 15, 15, 18, 18, 18, 21, 21, 21, ... (0.75 point)
b) One approach would be to identify
values that differ significantly from their bin mean. One problem
is that an item may be far from its bin mean, but closer to the mean
of another bin (e.g., 46, which is 10 from its own bin mean of 56 but
only 6 from the neighboring mean of 40) - so distance from the closest
bin would make more sense. "Significant" is a challenge, though - an
age of 7 would certainly seem an outlier in this data, but 46 doesn't
seem to be - yet both are 6 from the nearest bin. (0.25 point)
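
The smoothing in 3.3 a) and the nearest-bin idea in b) can be sketched as follows. The age list is assumed to be the one from the textbook exercise (it reproduces the bin means listed above); the helper name `nearest_bin_distance` is made up for this sketch.

```python
# Age values assumed from the textbook exercise, already sorted.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
DEPTH = 3  # equal-depth bins of three values each

# Partition the sorted values into consecutive bins of DEPTH items.
bins = [ages[i:i + DEPTH] for i in range(0, len(ages), DEPTH)]

# Mean of each bin, rounded to the nearest integer as in the solution.
rounded_means = [round(sum(b) / len(b)) for b in bins]

# Smoothing by bin means: every value is replaced by its bin's mean.
smoothed = [m for b, m in zip(bins, rounded_means) for _ in b]

def nearest_bin_distance(value, means):
    """Distance from a value to the closest bin mean (hypothetical helper)."""
    return min(abs(value - m) for m in means)

print(rounded_means)
print(smoothed[:9])
print(nearest_bin_distance(46, rounded_means))
```

Choosing a cutoff on that distance is still the hard part; the code only makes the comparison explicit.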
- 3.5 (0.5 point each)
a) norm(x) = (x-13)/(70-13) = (35-13)/57 = 0.39
b) norm(x) = (x-avg)/sdev = (35-30)/13 = 0.38
c) norm(x) = x/100 = 0.35
d) My preference would be decimal scaling, as it preserves concepts
of minimum and maximum age and relative distance. However, this would
fail if anyone in the data set were over 100: the divisor would become
1000, so all ages would now be far from the maximum normalized value.
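
A minimal sketch of the three normalizations in 3.5, using the figures quoted above (min 13, max 70, mean 30, standard deviation 13, x = 35). The function names are chosen for this sketch, and the last call uses a made-up maximum age of 101 to show the problem described in d).

```python
def min_max(x, lo, hi):
    """Min-max normalization onto [0, 1]."""
    return (x - lo) / (hi - lo)

def z_score(x, mean, sdev):
    """Z-score normalization."""
    return (x - mean) / sdev

def decimal_scaling(x, max_abs):
    """Divide by the smallest power of 10 that brings max_abs below 1."""
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return x / 10 ** j

print(round(min_max(35, 13, 70), 2))   # 0.39
print(round(z_score(35, 30, 13), 2))   # 0.38
print(decimal_scaling(35, 70))         # 0.35
# Hypothetical: one person aged 101 pushes the divisor from 100 to 1000.
print(decimal_scaling(35, 101))        # 0.035
```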
- 3.7 a) Several possible answers, depending on what you
chose as the base age. One is (courtesy Mike Hilligoss):
(1 point)
