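Do you want to generate meaningful insights from your location data? Are you trying to run those queries at petabyte scale? Join this talk to learn how to scale Esri's geospatial expertise with Databricks.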
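In light of the 2020 global coronavirus pandemic, we will look at how to analyze movement data and determine the impact of human movement during this period. In the talk we will demonstrate several key technical concepts: using geographic indexing for dimensionality reduction, using Delta Lake for geospatial query performance, and using a human movement index to quantify the risk posed by human movement.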
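By the end of this session, you will have a better understanding of how to gain insight into human movement at scale, a repeatable pattern that is highly applicable across industries.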
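- Okay, welcome to Spark Summit. First, I'd like to welcome everyone to this session. A quick introduction: my name is Jim Young. I'm the business development lead and partner lead for Esri's commercial business. I live in Portland, Oregon, and with me today is Joel McCune, who represents our GeoAI team, which works on solutions that combine geography with artificial intelligence. Today we're going to talk about how we combined the power of Databricks and Esri to build a COVID risk index based on the movement of people. Let's get started.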
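We all know that location data is everywhere. Whether you are building a simple app that just finds coffee on a map, or trying to do something more complex, like keeping airplanes in the air, location matters. More and more data scientists, like you, are starting to see the value of applying a geographic perspective to their analysis, and that's great. Geography is what really brings context and understanding to your data, whether by visualizing it on a map, by building location data into your explanatory variables, or even just by bringing in contextual data that helps you see your data in the right light. That is all geography; it is an additional perspective.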
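So although you don't need a map to take advantage of geography, a map is a powerful metaphor for understanding, because our brains have evolved to be wired to understand complex 2D and 3D datasets spatially and physically. Consuming data this way is natural. So whether you are trying to understand the effect of weather on sales, work out where to site your next cell tower based on RF propagation, or even plan your logistics, a map can help you understand and decide. As Joel likes to say, the map is the original infographic.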
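- So when we think about the question of geography, particularly in the context of a pandemic such as the COVID crisis we currently face: COVID cannot spread unless there is person-to-person contact. Jim is in Portland right now and I'm in Olympia, Washington, and unless we are face to face, the disease will not spread. Obviously, that is why social distancing works. But ultimately, if we want to understand how risky an area is, whether from the standpoint of going there or of where people are coming from, we need to understand several factors. We have been studying this and examining the existing social distancing metrics, and one of the challenging things is that many of them are normalized. That means they can tell us how well people in Kansas, in the Midwest, are doing at social distancing compared with New York City. But from a risk standpoint, those are two very different places, because of the population density and the number of interactions taking place. So when we want to quantify risk from a pandemic standpoint, we need to consider a few different things. We want to calculate this risk index by considering the volume, meaning the number of interactions, and also the distance people travel for those interactions, because the farther the distance, the more likely it is to connect two geographies that would otherwise be unrelated.
- So when we started thinking about this risk index, our research showed that current social distancing metrics do not adequately account for geography. In many cases they are normalized by population, which removes the volume Joel described and takes population density out of the equation. Likewise, as a measure of distance, the farther a person travels, the greater the risk. If I go to downtown Portland, that's one level of risk. If I travel to São Paulo and interact with a population that is distinct from my own, that connectivity represents a much higher level of risk. And most models also don't cover this idea of significant group clusters. So we've been focused on building a risk index that considers distance and the volume of people moving.
- So when we talk about these movement risk factors, there are two things we want to consider: the distance over which an interaction takes place, and the volume of interactions taking place. What we are investigating, and want to quantify, is human movement data.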
More simply, we can just call it cell phone tracking data. Everyone has a cell phone, and what you can do is obtain data that tracks the location of these devices. The way this happens, particularly with the Veraset data we are using here, is through background app tracking. When you install, say, a weather app, it asks whether you will allow the app to track your location. That location tracking is what Veraset uses. As you can imagine, this does not represent everyone. As things stand, we see market penetration of around 8%, depending on the location and time frame you look at. So we can call it a representative sample, and as a representative sample it conveys a great deal of information. Even at less than 10% market penetration, these are individual records showing the location of a phone, and that is a lot of data: on the order of billions of records per day. With this in mind, we wanted to understand where people are coming from and where they are going, and to put it on a map so that we could understand it. But with billions and billions of records per day, and an analysis starting from the beginning of March, you can imagine: hundreds of billions of records. There was really no other way to do it than using Spark in a scalable environment, and in this case we're using Databricks to do that.
- Okay, so this is essentially the general workflow and approach that we took. The top three items here are powered by Databricks and our data toolkit, which, again, sits natively in the cluster. The JAR that gets loaded there is similar to our open source engine that powers things like AWS Athena and Presto, but it has been enhanced to be even more performant inside Databricks. The bottom three are powered by ArcGIS. Essentially, we start from raw data, apply this hexagon-based index to generalize the data, build up the summarized origin-destination pairs per hexagon, and then bring that much smaller dataset into Esri, where we append our demographic data, visualize it, and ultimately publish an interactive dashboard.
- So when we look at this panel data, the raw data, as we were saying before, each one of these records is really nothing more than when it happened, a unique identifier, the location, and finally how accurate that location is.
Ultimately, there is a margin of error in how precisely you know where a device is, so this gives us a starting point. There is still a lot of work to be done to understand the relationships in this data so that we can get to our index. And since there are so many records, the first step is to get the data into some sort of generalizable form. We used a hexagon index for this. It lets us group records by an area roughly two-thirds the size of a city block in New York City, just to give you a rough idea of what we're looking at, and from there it lets us understand the relationship between origin and destination. In this case, what we call the origin is where the device, and by proxy a person, resides during the nighttime hours, and everywhere they go outside the nighttime hours becomes a trip. We also specifically examine how fast the device is moving, because we don't want to be looking at people driving down the interstate. What we want is the locations where people are relatively static, because that's when people are at rest and have the potential to interact with other people somewhere other than their home.
- Okay, so let's get to the good stuff. Here's what's really happening inside Databricks, and our workflow for building up that risk index. Essentially, we take the raw data and filter it down to significant dwells, where a device is seen multiple times in a given location. We bin those into these hexagons, as Joel said, and we're doing this at level nine, which is about a city block.
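As an illustration of the two ideas just described (keeping only pings where a device is roughly stationary, and treating the place a device spends the night as its origin), here is a minimal Python sketch. It is not the notebook from the talk: the function names, the 1.5 m/s speed threshold, and the 22:00 to 06:00 night window are all assumptions made for the example, and a real run would do this in Spark rather than on Python lists.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def stationary_pings(pings, max_speed_mps=1.5):
    """Keep pings whose speed since the previous ping is at or below a threshold.

    `pings` is a time-sorted list of (timestamp_seconds, lat, lon) tuples for
    one device. The threshold is a hypothetical value, not one from the talk.
    """
    kept = []
    for prev, cur in zip(pings, pings[1:]):
        dt = cur[0] - prev[0]
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = haversine_m(prev[1], prev[2], cur[1], cur[2]) / dt
        if speed <= max_speed_mps:
            kept.append(cur)
    return kept


def home_cell(cell_hours, night_start=22, night_end=6):
    """Return the cell seen most often during night hours, or None.

    `cell_hours` is an iterable of (cell_id, local_hour) pairs. The night
    window is an assumption chosen for illustration.
    """
    counts = Counter(
        cell for cell, hour in cell_hours
        if hour >= night_start or hour < night_end
    )
    return counts.most_common(1)[0][0] if counts else None
```

A device that sits still, jumps about a kilometer in a minute, and then sits still again keeps only its two at-rest pings; the fast hop in between is dropped, which is the "not driving down the interstate" filter in miniature.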
We take those hexagons and build up an origin cell, based on where the device sleeps, and a destination cell, and we total those up so that for each hexagon we have a cumulative trip count for every origin-destination pair. From there, we calculate the risk index, which is simply the number of trips times the distance, and finally we output this much smaller, reduced, clean dataset as a processed table for use by the GIS. Now, let's take a look at the actual notebook.
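The aggregation described above, counting trips per origin-destination pair and multiplying by distance, can be sketched in a few lines of plain Python. This is a hedged illustration rather than the notebook shown in the session: the cell IDs, the `centroids` lookup, the use of great-circle distance between cell centers, and the choice to sum the index per destination cell are all assumptions.

```python
from collections import Counter, defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    phi1, phi2 = radians(a[0]), radians(b[0])
    dphi = phi2 - phi1
    dlmb = radians(b[1] - a[1])
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def od_trip_counts(trips):
    """Count trips for each (origin_cell, destination_cell) pair."""
    return Counter(trips)


def risk_by_destination(trips, centroids):
    """Sum trips x distance over all origins, per destination cell.

    `centroids` maps a cell id to its (lat, lon) center. Summing per
    destination is an assumption about how the index is rolled up.
    """
    risk = defaultdict(float)
    for (origin, dest), n_trips in od_trip_counts(trips).items():
        distance_km = haversine_km(centroids[origin], centroids[dest])
        risk[dest] += n_trips * distance_km
    return dict(risk)
```

Note that a trip that stays within the home cell has zero distance and so adds nothing, while a few long trips can outweigh many short ones, which is the behavior the speakers argue a population-normalized metric loses.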
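Again, this is multidimensional data in an interactive dashboard. I can see this is Detroit, I can see the risk areas, and I can see the contributions here, broken down by these different Tapestry segments, and I can switch back and forth between several cities. Next I'll go to Boston. We see a very different pattern. Every city is unique. There are some clusters here. Down here there is a large cluster of people who may be at risk. Let's jump to New York. Here we see a story about Manhattan, and we see these different contributors, but what's interesting is if we take these High Rise Renters as an example and filter on them. Again, I can filter the dashboard to show the different segments. In the northern part of the city we see these High Rise Renters, and there is a large cluster here. I may want to think about who these people are and, as a policy maker or decision maker, how I might message to them. So we just look at this chart here; this little infographic shows who those High Rise Renters are, from the segment. We see a median age of 32, and we see that they are relatively low income, much of which goes to rent. We see many single parents, and we can explore who the people occupying that cell are and then, as I said, make certain decisions, policies, or messaging about how to reduce that risk.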
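Now, coming back to the workflow, what I want to say is that this analysis would not have been possible without this combination, this magic combination, of distributed processing on the front end in Databricks and the enrichment and visualization capabilities of the GIS. It lets us take very raw, very large volumes of data, bring some meaning and understanding to that data, and then easily share it within the community in a consumable form. The same workflow, from raw data to these visualizations, can be applied with the same approach to a large number of industries, whether that's telco data or looking at movement data, as Joel has done, for things like retail store openings and site selection. It really is like a human weather pattern that you are analyzing, and I have to say, I love the collaborative capabilities of Databricks and being able to build these notebooks.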
But honestly, Joel is the one who spent most of the time in the notebooks. Joel, how was your experience?
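- So I think, in closing, there is one more important thing to emphasize, which Jim hinted at, and he always likes to tease me about it: I didn't get into this field because I was interested in big data. What I really am is a geographer. What I always want to emphasize is that if I can do this, you can too, and the reason is that my degree is in parks, recreation, and tourism. I discovered geography almost by accident, and then I somewhat inadvertently got into big data analytics, because ultimately, more than anything, I am a geographer. I had a problem to solve. I needed to understand where people were coming from, where they were going, and to what degree, in a way that would let us quantify risk. I was able to pose a geographic question and I needed to solve it. The only way to do that was to use a scalable architecture, together with Esri's technology, to distill it down into something meaningful that we could then bring into a GIS and add more context to. Because ultimately the name of the game is being able to take data and extract information from it, and that really starts with understanding the problem first. What I want to emphasize here is that I was able to do this. I'm not a data expert. I will freely concede that. I started doing this about five weeks ago, and ultimately I was able to put it all together and get something up and running. I didn't do it alone. Jim obviously helped me a lot, but this really was an idea that came to fruition because I had a need, then reached out and found the right technologies and, ultimately, the people to help me get over the humps. So this was the type of thing where the combination of the two is greater than the sum of the parts, because we have scalability together with the context of geography to understand this problem in a very meaningful way. So, with that, thank you so much for your time.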
ESRI
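Joel specializes in using geography to find answers, particularly in getting actionable information out of geographic data. Almost all data has a geographic component. Defining geography in the right context to uncover the right geographic relationships, however, is somewhat more challenging. Joel has spent most of his career working with geographic information systems (GIS) to mine information from these geographic relationships. As the scale of data has grown, the demands on technology have grown exponentially, and the technology has had to keep evolving. This brought Joel into the world of big data, where he continues to apply geography, but at a much larger scale.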
ESRI
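Jim Young is a business development lead at Esri, focused on big data and artificial intelligence. He works with technology companies and developers to explore the use of location-aware APIs and spatial analytics in their products and applications. His passion is the intersection of the physical and the digital, with a focus on computer vision, sensor networks, and location services. A pioneer in mobile social networking, Jim founded the location-based Jambo Networks before joining Esri. He holds a master's degree in geographic information systems from the University of Cambridge and a bachelor's degree in history and economics from Southern Methodist University.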