6. Time Partitioning
• _PARTITIONTIME というのがあるらしいが、よく知らない…
• https://cloud.google.com/bigquery/docs/partitioned-tables
• 今の旧来のテーブルならば
という形で取れます
SELECT ...
FROM `my_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20160720' AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
①
②
6
9. WITH構⽂
SELECT ...
FROM (
SELECT ...
FROM (
SELECT ...
FROM original_data
) as A
) as B
WITH A as (
SELECT ...
FROM original_data
)
, B as (
SELECT ...
FROM A
)
SELECT ...
FROM B
■ 従来記法 ■ WITH記法
⼊れ⼦地獄から解放され、上から下に処理を書けるようになった 9
10. Analytic Function
• 以前の Window Function (今回呼び名変わった?)
• ※新しい機能ではないが紹介
• 普通のSQL(例えばMySQLの)ではできない、レコード間の演算
ができる
10
11. 11
user action time
1 A 1
1 B 1
1 X 5
1 B 8
1 Z 9
2 A 2
2 B 5
2 D 10
例えば、ユーザ毎にアクセス順序でのSequence IDを付けたい
user action time seq
1 A 1 1
1 B 1 2
1 X 5 3
1 B 8 4
1 Z 9 5
2 A 2 1
2 B 5 2
2 D 10 3
SELECT *, RANK() OVER (PARTITION BY user ORDER BY time) as seq
FROM data
ORDER BY user, seq
RANK():
1から順に連番をふる
12. 12
user action time
1 A 1
1 B 1
1 X 5
1 B 8
1 Z 9
2 A 2
2 B 5
2 D 10
例えば、ユーザ毎に前回アクセスからの経過時刻をとりたい
WITH add_last_time as (
SELECT *,
LAG(time) OVER (PARTITION BY user ORDER BY time) as last_time
FROM data
)
, add_span as (
SELECT *,
time - last_time as span
FROM add_last_time
)
select * from add_span ORDER BY user, time
user action time last_time span
1 A 1 null null
1 B 1 1 0
1 X 5 1 4
1 B 8 5 3
1 Z 9 8 1
2 A 2 null null
2 B 5 2 3
2 D 10 5 5
LAG(X):
⼀つ前のXの値を取得する
13. 13
user action time
1 A 1
1 B 1
1 X 5
1 B 8
1 Z 9
2 A 2
2 B 5
2 D 10
例えば、ユーザ毎に最初の2つのデータを残したい
WITH add_seq as (
SELECT *, RANK() OVER (PARTITION BY user ORDER BY time) as seq
FROM data
)
, omit_records as (
SELECT * FROM add_seq WHERE seq <= 2
)
SELECT * FROM omit_records
ORDER BY user, seq
user action time seq
1 A 1 1
1 B 1 2
2 A 2 1
2 B 5 2
15. 15
例えば、Actionを⾏ったユーザをAction毎にまとめる(ユーザの重複OKの場合)
user action time
1 A 1
1 B 3
1 X 5
1 B 20
1 Z 22
2 A 2
2 B 30
2 X 60
3 X 4
4 B 2
4 A 4
Row action users
1 A 1
2
4
2 B 1
1
2
4
3 X 1
2
3
4 Z 1
重複しているよ?
WITH action_users as (
SELECT action, ARRAY_AGG(user) as users
FROM data
GROUP by action
)
SELECT * FROM action_users
ORDER BY action
ARRAY_AGG():
GROUP BYと共に使い、
値をARRAY型の1レコードにまとめる
16. 16
例えば、Actionを⾏ったユーザをAction毎にまとめる(ユニークユーザの場合)
user action time
1 A 1
1 B 3
1 X 5
1 B 20
1 Z 22
2 A 2
2 B 30
2 X 60
3 X 4
4 B 2
4 A 4
WITH action_users as (
SELECT action, ARRAY_AGG(user) as users
FROM data
GROUP by action
)
, unique_action_users as (
SELECT action,
ARRAY(
SELECT DISTINCT user FROM UNNEST(users) as user
) as users
FROM action_users
)
SELECT * FROM unique_action_users ORDER BY action
Row action users
1 A 1
2
4
2 B 1
2
4
3 X 1
2
3
4 Z 1
ARRAY():
Sub Queryが単⼀列の複数(0~N)レコード
を返す場合に使う.
UNNEST():
ARRAY型を展開する。
JOIN的になる場合と
グループ単位で処理したい場合で
少し意味合いが違う感じ.
Row action users
1 A 1
2
4
2 B 1
1
2
4
3 X 1
2
3
4 Z 1
17. ■Group By する必要がある場合: ARRAY_AGG(STRUCT(...)) ... GROUP BY
SELECT ...,
ARRAY_AGG(STRUCT(col1, col2, ...))
FROM ...
GROUP BY X
複数列をArray型に⼊れる⽅法
■Group By する必要がない場合 or 何か計算処理する場合: ARRAY(SELECT STRUCT(...))
SELECT ...,
ARRAY(
SELECT STRUCT(col1 * 2, SUM(col2) OVER ... , ...) FROM ...
)
FROM ... STRUCT():
レコード型を作る。
レコード型という単⼀列になる。
19. 19
間隔ベースのセッションIDを付ける
user action time
1 A 1
1 B 3
1 X 5
1 B 20
1 Z 22
2 A 2
2 B 30
2 X 35
やりたいこと: ユーザ毎にアクセス間隔が10以上なら別のSessionIDを付けるようにしたい
1
2
1
2
user action time last_time new_session session_seq
1 A 1 null 0 1
1 B 3 1 0 1
1 X 5 3 0 1
1 B 20 5 1 2
1 Z 22 20 0 2
2 A 2 null 0 1
2 B 30 2 1 2
2 X 35 30 0 2
WITH add_last_time as (
SELECT *,
LAG(time) OVER (PARTITION BY user ORDER BY time) as last_time
FROM data
)
, add_span as (
SELECT *, IF(time - last_time >= 10, 1, 0) as new_session
FROM add_last_time
)
SELECT *, 1+SUM(new_session) OVER (PARTITION BY user ORDER BY time) as session_seq
FROM add_span ORDER BY user, time
20. user session_seq session_start_time session_end_time session_time_span actions.action actions.time
1 1 1 5 4 A 1
B 3
X 5
1 2 20 22 2 B 20
Z 22
2 1 2 2 0 A 2
2 2 30 35 5 B 30
20
ユーザやセッションの属性とイベントデータを⼀つのテーブルにする: Step1
user action time
1 A 1
1 B 3
1 X 5
1 B 20
1 Z 22
2 A 2
2 B 30
2 X 35
やりたいこと:ユーザやセッションの属性とイベントデータを⼀つのテーブルにする
WITH ... 略 ...
, add_session_seq as (
SELECT *, 1+SUM(new_session) OVER (PARTITION BY user ORDER BY time) as session_seq
FROM add_span
)
SELECT user, session_seq,
MIN(time) as session_start_time,
MAX(time) as session_end_time,
MAX(time) - MIN(time) as session_time_span,
ARRAY_AGG(STRUCT(action, time)) as actions
FROM add_session_seq
GROUP BY user, session_seq
21. user session_seq ... actions.action actions.time
1 1 A 1
B 3
X 5
1 2 B 20
Z 22
2 1 A 2
2 2 B 30
21
ユーザやセッションの属性とイベントデータを⼀つのテーブルにする: Step2
やりたいこと:ユーザやセッションの属性とイベントデータを⼀つのテーブルにする
WITH ... 略 ...
, group_by_user as (
SELECT user,
MAX(session_seq) as session_num,
ARRAY_AGG(STRUCT(
session_start_time, session_end_time, session_time_span, actions
)) as sessions
FROM group_by_session GROUP BY user
)
SELECT * FROM group_by_user
ORDER BY user
user session_num sessions.session_start_time sessions.session_end_time sessions.session_time_span sessions.actions.action sessions.actions.time
1 2 1 5 4 A 1
B 3
X 5
20 22 2 B 20
Z 22
2 2 2 2 0 A 2
30 35 5 B 30
X 35
22. 22
ユーザ毎に, あるAction Xが発⽣した後、時刻T以内にAction Zが発⽣する割合の集計
やりたいこと: Xの後に時刻10以内にZが発⽣する割合を求める
user action time
1 A 1
1 X 5
1 B 9
1 Z 11
2 A 2
2 X 15
2 Z 40
3 Z 3
3 X 6
4 X 4
4 X 5
4 Z 7
1
0 (時刻10以内でない)
0 (ZがXの後で発⽣していない)
1 (最初のXに対してのみ計算するとする)
X=4, Z=2 → 50% とカウントしたい
23. 23
ユーザ毎に, あるAction Xが発⽣した後、時刻T以内にAction Zが発⽣する割合の集計
user action time
1 A 1
1 X 5
1 B 9
1 Z 11
2 A 2
2 X 15
2 Z 40
3 Z 3
3 X 6
4 X 4
4 X 5
4 Z 7
user first_x_time action_times.action action_times.time
1 5 X 5
Z 11
2 15 X 15
Z 40
3 6 Z 3
X 6
4 4 X 4
X 5
Z 7
WITH group_by_user as (
SELECT user, ARRAY_AGG(STRUCT(action, time)) as action_times
FROM data WHERE action in ('X', 'Z')
GROUP BY user
),
pickup_first_x_timeas (
SELECT user, action_times,
(
SELECT MIN(time)
FROM UNNEST(action_times) WHERE action='Xʼ
) as first_x_time
FROM group_by_user
)
SELECT * FROM pickup_first_x_time ORDER BY user
24. 24
ユーザ毎に, あるAction Xが発⽣した後、時刻T以内にAction Zが発⽣する割合の集計
user first_x_time action_times.action action_times.time
1 5 X 5
Z 11
2 15 X 15
Z 40
3 6 Z 3
X 6
4 4 X 4
X 5
Z 7
WITH ...
, count_z as (
SELECT user, first_x_time,
(
SELECT COUNT(*)
FROM UNNEST(action_times) as a
WHERE a.action = 'Zʼ
AND a.time BETWEEN first_x_time AND first_x_time + 10
) as z_count,
action_times
FROM pickup_first_x_time
)
SELECT * FROM count_z ORDER BY user
user first_x_time z_count action_times.action action_times.time
1 5 1 X 5
Z 11
2 15 0 X 15
Z 40
3 6 0 Z 3
X 6
4 4 1 X 4
X 5
Z 7
25. 25
WITH ...
SELECT
COUNT(*) as x_cnt,
COUNTIF(z_count > 0) as z_cnt
FROM count_z
user first_x_time z_count action_times.action action_times.time
1 5 1 X 5
Z 11
2 15 0 X 15
Z 40
3 6 0 Z 3
X 6
4 4 1 X 4
X 5
Z 7
ユーザ毎に, あるAction Xが発⽣した後、時刻T以内にAction Zが発⽣する割合の集計
x_cnt z_cnt
4 2