Python location, show distance from closest other location
Answer a question
I am a location in a dataframe, underneath lat lon column names. I want to show how far that is from the lat lon of the nearest train station in a separate dataframe.
So for example, I have a lat lon of (37.814563 144.970267), and i have a list as below of other geospatial points. I want to find the point that is closest and then find the distance between those points, as an extra column in the dataframe in suburbs.
This is the example of the train dataset
<bound method NDFrame.to_clipboard of STOP_ID STOP_NAME LATITUDE \
0 19970 Royal Park Railway Station (Parkville) -37.781193
1 19971 Flemington Bridge Railway Station (North Melbo... -37.788140
2 19972 Macaulay Railway Station (North Melbourne) -37.794267
3 19973 North Melbourne Railway Station (West Melbourne) -37.807419
4 19974 Clifton Hill Railway Station (Clifton Hill) -37.788657
LONGITUDE TICKETZONE ROUTEUSSP \
0 144.952301 1 Upfield
1 144.939323 1 Upfield
2 144.936166 1 Upfield
3 144.942570 1 Flemington,Sunbury,Upfield,Werribee,Williamsto...
4 144.995417 1 Mernda,Hurstbridge
geometry
0 POINT (144.95230 -37.78119)
1 POINT (144.93932 -37.78814)
2 POINT (144.93617 -37.79427)
3 POINT (144.94257 -37.80742)
4 POINT (144.99542 -37.78866) >
and this is an example of the suburbs
<bound method NDFrame.to_clipboard of postcode suburb state lat lon
4901 3000 MELBOURNE VIC -37.814563 144.970267
4902 3002 EAST MELBOURNE VIC -37.816640 144.987811
4903 3003 WEST MELBOURNE VIC -37.806255 144.941123
4904 3005 WORLD TRADE CENTRE VIC -37.822262 144.954856
4905 3006 SOUTHBANK VIC -37.823258 144.965926>
Which I am trying to show, is the distance from the lat lon to the closet train station in a new column for the suburb list.
Using a solution get a weird output, wondering if it's correct?
With both solutions shown,
from sklearn.neighbors import NearestNeighbors
from haversine import haversine
NN = NearestNeighbors(n_neighbors=1, metric='haversine')
NN.fit(trains_shape[['LATITUDE', 'LONGITUDE']])
indices = NN.kneighbors(df_complete[['lat', 'lon']])[1]
indices = [index[0] for index in indices]
distances = NN.kneighbors(df_complete[['lat', 'lon']])[0]
df_complete['closest_station'] = trains_shape.iloc[indices]['STOP_NAME'].reset_index(drop=True)
df_complete['closest_station_distances'] = distances
print(df_complete)
The output here,
<bound method NDFrame.to_clipboard of postcode suburb state lat lon Venues Cluster \
1 3040 aberfeldie VIC -37.756690 144.896259 4.0
2 3042 airport west VIC -37.711698 144.887037 1.0
4 3206 albert park VIC -37.840705 144.955710 0.0
5 3020 albion VIC -37.775954 144.819395 2.0
6 3078 alphington VIC -37.780767 145.031160 4.0
#1 #2 #3 \
1 Café Electronics Store Grocery Store
2 Fast Food Restaurant Café Supermarket
4 Café Pub Coffee Shop
5 Café Fast Food Restaurant Grocery Store
6 Café Park Bar
#4 ... #6 \
1 Coffee Shop ... Bakery
2 Grocery Store ... Italian Restaurant
4 Breakfast Spot ... Burger Joint
5 Vietnamese Restaurant ... Pub
6 Pizza Place ... Vegetarian / Vegan Restaurant
#7 #8 #9 \
1 Shopping Mall Japanese Restaurant Indian Restaurant
2 Portuguese Restaurant Electronics Store Middle Eastern Restaurant
4 Bar Bakery Gastropub
5 Chinese Restaurant Gym Bakery
6 Italian Restaurant Gastropub Bakery
#10 Ancestry Cluster ClosestStopId \
1 Greek Restaurant 8.0 20037
2 Convenience Store 5.0 20032
4 Beach 6.0 22180
5 Convenience Store 5.0 20004
6 Coffee Shop 5.0 19931
ClosestStopName \
1 Essendon Railway Station (Essendon)
2 Glenroy Railway Station (Glenroy)
4 Southern Cross Railway Station (Melbourne City)
5 Albion Railway Station (Sunshine North)
6 Alphington Railway Station (Alphington)
closest_station closest_station_distances
1 Glenroy Railway Station (Glenroy) 0.019918
2 Southern Cross Railway Station (Melbourne City) 0.031020
4 Alphington Railway Station (Alphington) 0.023165
5 Altona Railway Station (Altona) 0.005559
6 Newport Railway Station (Newport) 0.002375
And the second function.
def ClosestStop(r):
# Cartesin Distance: square root of (x2-x2)^2 + (y2-y1)^2
distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
# Stop with minimum Distance from the Suburb
closestStationId = distances[distances == distances.min()].index.to_list()[0]
return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]
df_complete[['ClosestStopId', 'ClosestStopName']] = df_complete.apply(ClosestStop, axis=1)
This is giving different answers oddly enough, and leads me to think that there is an issue with this code. the KM's seem wrong as well.
Completely unsure how to approach this problem - would love some guidance here, thanks!
Answers
A few key concepts
- do a Cartesian product between two data frames to get all combinations (joining on identical value between two data frames is approach to this
foo=1) - once both sets of data is together, have both sets of lat/lon to calculate distance) geopy has been used for this
- cleanup the columns, use
sort_values()to find smallest distance - finally a
groupby()andagg()to get first values for shortest distance
There are two data frames for use
dfdistcontains all the combinations and distancesdfnearestwhich contains result
dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
'STOP_NAME': ['Royal Park Railway Station (Parkville)',
'Flemington Bridge Railway Station (North Melbo...',
'Macaulay Railway Station (North Melbourne)',
'North Melbourne Railway Station (West Melbourne)',
'Clifton Hill Railway Station (Clifton Hill)'],
'LATITUDE': ['-37.781193',
'-37.788140',
'-37.794267',
'-37.807419',
'-37.788657'],
'LONGITUDE': ['144.952301',
'144.939323',
'144.936166',
'144.942570',
'144.995417'],
'TICKETZONE': ['1', '1', '1', '1', '1'],
'ROUTEUSSP': ['Upfield',
'Upfield',
'Upfield',
'Flemington,Sunbury,Upfield,Werribee,Williamsto...',
'Mernda,Hurstbridge'],
'geometry': ['POINT (144.95230 -37.78119)',
'POINT (144.93932 -37.78814)',
'POINT (144.93617 -37.79427)',
'POINT (144.94257 -37.80742)',
'POINT (144.99542 -37.78866)']})
dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
'postcode': ['3000', '3002', '3003', '3005', '3006'],
'suburb': ['MELBOURNE',
'EAST MELBOURNE',
'WEST MELBOURNE',
'WORLD TRADE CENTRE',
'SOUTHBANK'],
'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})
import geopy.distance
# cartesian product so we get all combinations
dfdist = (dfsub.assign(foo=1).merge(dfstat.assign(foo=1), on="foo")
# calc distance in km between each suburb and each train station
.assign(km=lambda dfa: dfa.apply(lambda r:
geopy.distance.geodesic(
(r["LATITUDE"],r["LONGITUDE"]),
(r["lat"],r["lon"])).km, axis=1))
# reduce number of columns to make it more digestable
.loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
# sort so shortest distance station from a suburb is first
.sort_values(["postcode","suburb","km"])
# good practice
.reset_index(drop=True)
)
# finally pick out stations nearest to suburb
# this can easily be joined back to source data frames as postcode and STOP_ID have been maintained
dfnearest = dfdist.groupby(["postcode","suburb"])\
.agg({"STOP_ID":"first","STOP_NAME":"first","km":"first"}).reset_index()
print(dfnearest.to_string(index=False))
dfnearest
output
postcode suburb STOP_ID STOP_NAME km
3000 MELBOURNE 19973 North Melbourne Railway Station (West Melbourne) 2.564586
3002 EAST MELBOURNE 19974 Clifton Hill Railway Station (Clifton Hill) 3.177320
3003 WEST MELBOURNE 19973 North Melbourne Railway Station (West Melbourne) 0.181463
3005 WORLD TRADE CENTRE 19973 North Melbourne Railway Station (West Melbourne) 1.970909
3006 SOUTHBANK 19973 North Melbourne Railway Station (West Melbourne) 2.705553
an approach to reducing size of tested combinations
# pick nearer places, based on lon/lat then all combinations
dfdist = (dfsub.assign(foo=1, latr=dfsub["lat"].round(1), lonr=dfsub["lon"].round(1))
.merge(dfstat.assign(foo=1, latr=dfstat["LATITUDE"].round(1), lonr=dfstat["LONGITUDE"].round(1)),
on=["foo","latr","lonr"])
# calc distance in km between each suburb and each train station
.assign(km=lambda dfa: dfa.apply(lambda r:
geopy.distance.geodesic(
(r["LATITUDE"],r["LONGITUDE"]),
(r["lat"],r["lon"])).km, axis=1))
# reduce number of columns to make it more digestable
.loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
# sort so shortest distance station from a suburb is first
.sort_values(["postcode","suburb","km"])
# good practice
.reset_index(drop=True)
)
更多推荐

所有评论(0)