背景:
这个问题可能有点宽泛,但希望任何使用关系数据、R、Power BI 或以上所有内容的人都会感兴趣。
我试图重新创建数据集关系模型nycflights13
书中所描述的R用科学的数据通过Wickham and Grolemund
。我正在尝试同时使用 R 和 Power BI。数据集由5个表airlines
,ariports
,flights
,weather
和planes
。在部分中13.2 nycflights13
有一段话指出:
航班通过起点(位置)和年、月、日和小时(时间)连接到天气。
关系如下图所示:
问题 1:如何在 Power BI 中设置此模型?
使用以下 R 脚本将使文件夹中的数据集可用于 Power BI c:/data
:
# install.packages("tidyverse")
# install.packages("nycflights13")
library(tidyverse)
library(nycflights13)
setwd("C:/data/")
#getwd()
airlines
df_airlines <- data.frame(airlines)
df_airports <- data.frame(airports)
df_planes <- data.frame(planes)
df_weather <- data.frame(weather)
df_flights <- data.frame(flights)
write.csv(df_airlines, file = "C:/data/airlines.txt", row.names = FALSE)
write.csv(df_airports, file = "C:/data/airports.txt", row.names = FALSE)
write.csv(df_planes, file = "C:/data/planes.txt", row.names = FALSE)
write.csv(df_weather, file = "C:/data/weather.txt", row.names = FALSE)
write.csv(df_flights, file = "C:/data/flights.txt", row.names = FALSE)
在 Power BI 中导入表后,我正在尝试建立以下关系Relationships tab
:
而我能够这样做在一定程度上,但是当我尝试连接flights
到weather
使用例如year
,我收到以下错误信息:
您无法在这两个列之间创建关系,因为其中一列必须具有唯一值。
而且我知道发生这种情况是因为主键必须包含唯一值并且不能包含空值。但是,如何在 Power BI 中建立由多个字段组成的主键?
Question 2: If there's no answer to question 1, how can you do this in R instead?
I really love this book, and It may even be described there already, but how do you establish a relationship like this in R? Or perhaps you don't need to since you can join on multiple columns or composite key using dplyr without there being 'established' a relatinship at all?
Put another way, are the relationship illustrated by the figure aboe with the arrows:
and in Power BI with the lines:
really not necessary in R as long as you have the required verbs and there actually does exist a relatinship between the data in the different tables?
Question 3 - Why is flight highlighted in the flights table:
I thought that a highlighted column name indicated that there had been established a connection between tables using that column. But as far as I can tell, that is not the case here, and there is no arrow pointing to it:
Does it perhaps indicate that it is a primary key in the flights
table without any connection to another table?
I know this is a bit broad, but I'm really curious about these things so I'm hoping some of you will find it interesting!
I can comment on Power BI part.
The key issue here is that Power BI requires Dimensional Model, not relational one. There is a huge difference.
As described, the model from the book is not suitable for BI tools, it must be redesigned. For example, table "Weather" in the book is presented as a "dimension", while in reality it must be a fact table (similar to table "Flights"). As a result, "Flights" and "Weather" should never have direct connections - they must share common dimensions, such as:
Similarly, multiple keys and multiple connections between tables are very rare exceptions and are frowned upon (usually, they are indications of design mistakes). In a properly designed model, you should never see them.
If you want to understand the issue more, read this book: Star Schema Complete Reference
为了具体回答您的 Q3,在维度建模中,“航班”(我假设它是航班号)被称为“退化维度”。通常,它本来是维度表的键,但如果它不存在,它将作为孤立键保留在事实表中。这种情况常见于订单号、发票号等。
总体而言,您走在正确的轨道上 - 如果您弄清楚如何将书中的模型转换为合适的星型模式,然后在 R 和 PowerBI 中使用它,您会对新功能印象深刻 - 这是值得的。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句