How to remove accents from names for web scraping using Python
I have a list of names, some of which contain accented characters. I want to find each person's page without manually stripping the accents from the name, because the accents prevent the search from working. Is there a way to do this?
import requests
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame

base_url = 'https://basketball.realgm.com'
player_names=['Ante Žižić','Anžejs Pasečņiks', 'Dario Šarić', 'Dāvis Bertāns', 'Jakob Pöltl']

# Empty DataFrame
result = pd.DataFrame()

for name in player_names:
    url = f'{base_url}/search?q={name.replace(" ", "+")}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if url == response.url:
        # Get all NBA players
        for player in soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'):
            response = requests.get(base_url + player['href'])
            player_soup = BeautifulSoup(response.content, 'lxml')
            player_data = get_player_stats(search_name=player.text, real_name=name, player_soup=player_soup)
            result = result.append(player_data, sort=False).reset_index(drop=True)
    else:
        player_data = get_player_stats(search_name=name, real_name=name, player_soup=soup)
        result = result.append(player_data, sort=False).reset_index(drop=True)
Answers
python-slugify can handle both the spaces and the Unicode characters: it transliterates the name to lowercase ASCII and joins the words with hyphens. Since you're building a search string, just convert each - to + with a simple replace('-', '+').
from slugify import slugify

base_url = "https://basketball.realgm.com"
player_names = [
    "Ante Žižić",
    "Anžejs Pasečņiks",
    "Dario Šarić",
    "Dāvis Bertāns",
    "Jakob Pöltl",
]

for name in player_names:
    url = f"{base_url}/search?q={slugify(name).replace('-', '+')}"
    print(url)
Output:
https://basketball.realgm.com/search?q=ante+zizic
https://basketball.realgm.com/search?q=anzejs+pasecniks
https://basketball.realgm.com/search?q=dario+saric
https://basketball.realgm.com/search?q=davis+bertans
https://basketball.realgm.com/search?q=jakob+poltl
Granted, the unidecode module the others have mentioned will work as well.
from unidecode import unidecode

for name in player_names:
    url = f"{base_url}/search?q={unidecode(name).replace(' ', '+')}"
    print(url)
The search doesn't seem to care whether the names are lowercase or title case. unidecode keeps the original capitalization, so the output is:
https://basketball.realgm.com/search?q=Ante+Zizic
https://basketball.realgm.com/search?q=Anzejs+Pasecniks
https://basketball.realgm.com/search?q=Dario+Saric
https://basketball.realgm.com/search?q=Davis+Bertans
https://basketball.realgm.com/search?q=Jakob+Poltl
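If installing a third-party package is not an option, the standard library's unicodedata module can do a similar job for names like these. The sketch below is only an illustration: strip_accents is a hypothetical helper name, not part of any library. It decomposes each character with NFKD and drops the combining marks. It assumes every accented letter has such a decomposition, which is true for the five names here but not for letters such as ł or ø, which would pass through unchanged.

```python
import unicodedata

base_url = "https://basketball.realgm.com"
player_names = [
    "Ante Žižić",
    "Anžejs Pasečņiks",
    "Dario Šarić",
    "Dāvis Bertāns",
    "Jakob Pöltl",
]


def strip_accents(text: str) -> str:
    """Hypothetical helper: remove combining marks after NFKD decomposition."""
    # NFKD splits e.g. "ž" into "z" followed by a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in player_names:
    url = f"{base_url}/search?q={strip_accents(name).replace(' ', '+')}"
    print(url)
```

For these five names this should print the same title-case URLs as the unidecode version. unidecode and python-slugify cover far more characters, so they remain the safer choice for arbitrary input.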
Here are the links so you can verify that it works.
- https://basketball.realgm.com/search?q=ante+zizic
- https://basketball.realgm.com/search?q=anzejs+pasecniks
- https://basketball.realgm.com/search?q=dario+saric
- https://basketball.realgm.com/search?q=davis+bertans
- https://basketball.realgm.com/search?q=jakob+poltl
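Rather than swapping separators by hand with replace, the query string can also be built with urllib.parse.urlencode from the standard library, which turns spaces into + and escapes any other reserved characters. This is a minimal sketch: build_search_url is a hypothetical helper name, and it assumes the name has already been converted to ASCII by one of the approaches above.

```python
from urllib.parse import urlencode

base_url = "https://basketball.realgm.com"


def build_search_url(name: str) -> str:
    """Hypothetical helper: build the search URL for an accent-free name."""
    # urlencode quotes the value with quote_plus, so spaces become "+".
    return f"{base_url}/search?{urlencode({'q': name})}"


print(build_search_url("Ante Zizic"))
# https://basketball.realgm.com/search?q=Ante+Zizic
```

When fetching the page with requests, passing params={"q": name} to requests.get lets the library encode the query itself, though it may encode the space differently from the + form shown here.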