Comunitat Valenciana pharmacies import
This Jupyter notebook contains the script for importing the pharmacies of Comunitat Valenciana into OSM, together with the documentation of the whole process in a single file, making it easier to review the process, the results, and the decisions taken.
The goal is to manually merge and import all the pharmacy information provided by the Generalitat Valenciana, while testing the data preparation scripts.
Data Sources
License
We have requested authorization due to the COVID-19 emergency. The data is released under a CC license.
Import type
This import will be done manually, using JOSM to edit the data. Consider using the Task Manager.
Data preparation
All data preparation is done automatically in this notebook.
import numpy as np
import pandas as pd
import geopandas as gpd
import geopy
from osmi_helpers import data_gathering as osmi_dg
# Define Data Sources
DATA_RAW = 'data/interim/ListadoOficinasFarmacia_clean.csv'
CSV_PARSER = 'fields_mapping.csv'
Fields mapping
# Read CSV file with fields' mapping and description.
fields_mapping = pd.read_csv(CSV_PARSER)
# Display table.
fields_mapping
Data gathering
Run the cell below to read the original data source and convert it into a dataframe.
# Read the raw CSV file into a dataframe.
df_raw = pd.read_csv(DATA_RAW)
df_raw.head(10)
Data conversion
Run the cell below to convert the raw data into an OSM-friendly structure, according to the fields mapping defined in the file referenced by the CSV_PARSER variable.
# Select and rename fields according to the CSV parser.
df = osmi_dg.csv_parser(df_raw, CSV_PARSER)
# Convert all-uppercase values to title case.
df['operator'] = df['operator'].str.title()
df['addr:province'] = df['addr:province'].str.title()
df['addr:full'] = df['addr:full'].str.title()
# Split the full address into street, house number, and postcode.
df[['addr:street', 'addr:housenumber', 'addr:postcode']] = df['addr:full'].str.split(',', n=2, expand=True)
# Remove 'S/N' ('sin número', i.e. no house number) values.
df['addr:housenumber'] = df['addr:housenumber'].replace(regex='S/N', value='')
# Create some hardcoded fields.
df['source'] = "Opendata Generalitat Valenciana"
df['amenity'] = 'pharmacy'
df.head(10)
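As an optional sanity check (not part of the original workflow), the snippet below counts the rows where the address could not be split into the three expected parts:
# Optional sanity check: count rows with an incomplete address split.
addr_cols = ['addr:street', 'addr:housenumber', 'addr:postcode']
incomplete = df[addr_cols].isna().any(axis=1) | (df[addr_cols] == '').any(axis=1)
print(f"Rows with an incomplete address split: {incomplete.sum()}")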
Geocode dataframe
# Geocode the addresses with the Photon geocoder.
from geopy.geocoders import Photon
geolocator = Photon(timeout=10, user_agent="myGeolocator")
# df = df.iloc[0:25, :]  # Uncomment to work on the first 25 rows only (useful for testing).
df['addr_full'] = df['addr:street'] + ', ' + df['addr:city'] + ', ' + df['addr:province']
# Mark empty house numbers as missing values so they are not exported as empty tags.
df['addr:housenumber'] = df['addr:housenumber'].replace(r'^\s*$', np.nan, regex=True)
df['gcode'] = df.addr_full.apply(geolocator.geocode)
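Photon is a public service, so large batches may benefit from throttling. A minimal sketch using geopy's RateLimiter, as an optional alternative to the apply call above (not part of the original script):
# Optional: throttle geocoding requests to be gentle with the public Photon API.
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df['gcode'] = df.addr_full.apply(geocode)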
# Store rows that have not been geolocated.
df_not_found = df[df['gcode'].isnull()]
df_not_found
# Proceed with the geolocated rows only.
df_loc = df[df['gcode'].notna()].copy()
# Generate `lat` and `lon` columns with the latitude and longitude values.
df_loc['lat'] = [g.latitude for g in df_loc.gcode]
df_loc['lon'] = [g.longitude for g in df_loc.gcode]
df_loc
df_not_found
Export clean data
If the attributes above are correct, we can proceed to export the data into CSV and GeoJSON files that can be used in the Task Manager project.
# Drop unnecessary fields.
df_loc = df_loc.drop(columns=['addr:full', 'addr_full', 'gcode'])
# Generate the CSV files.
df_loc.to_csv('data/processed/pharmacies_cval.csv', index = False)
df_not_found.to_csv('data/processed/pharmacies_not_found_cval.csv', index = False)
# Convert the dataframe into a GeoDataframe (Photon returns WGS84 coordinates).
gdf = gpd.GeoDataFrame(
    df_loc,
    geometry=gpd.points_from_xy(df_loc.lon, df_loc.lat),
    crs="EPSG:4326")
# Export to geojson.
gdf.to_file('data/processed/pharmacies_cval.geojson', driver='GeoJSON')
The following files are generated in the data/processed folder:
- /data/processed/pharmacies_cval.geojson: a GeoJSON file with all pharmacies that have been geocoded.
- /data/processed/pharmacies_cval.csv: a CSV file with all pharmacies that have been geocoded.
- /data/processed/pharmacies_not_found_cval.csv: a CSV file with all pharmacies that have NOT been geocoded.
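To double-check the export, the GeoJSON can be read back with GeoPandas. This verification step is optional and not part of the original workflow:
# Optional verification: read the exported GeoJSON back and compare the feature count.
check = gpd.read_file('data/processed/pharmacies_cval.geojson')
print(f"Exported features: {len(check)} (expected: {len(df_loc)})")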
TODOs:
- Drop the latitude and longitude fields from the GeoJSON, Issue #19 (help appreciated!)
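A possible starting point for this TODO, untested and using the column names created in this notebook, is to drop the redundant columns from the GeoDataFrame just before exporting:
# Possible approach for issue #19 (untested): drop the redundant lat/lon columns,
# keeping only the geometry column, before writing the GeoJSON.
gdf_out = gdf.drop(columns=['lat', 'lon'])
gdf_out.to_file('data/processed/pharmacies_cval.geojson', driver='GeoJSON')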