Answer a question

I made this code to extrat lyrics from a website informing the artist and the music name.

The code is working, the problem is that I have a DataFrame (named years_1920_2020) with 10000 musics, and it took 1:30h to retrieve all these lyrics .

Is there a way to do it faster?

def url_lyric(music,artist):
 url_list = ("https://www.letras.mus.br/", str(artist),"/", str(music),"/")
 url = ''.join(url_list)
 req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
 try:
   webpage = urlopen(req).read()
   bs = BeautifulSoup(webpage, 'html.parser')
   lines =bs.find('div', {'class':'cnt-letra p402_premium'})
   final_lines = lines.find_all('p')
   return final_lines
 except:
     return 0


final_lyric_series = pd.Series(name = "lyrics")

for year in range (1920,2021):
  lyrics_serie = lyrics_from_year(year)
  final_lyric_series = pd.concat([final_lyric_series, lyrics_serie])
  print(year)

the function lyrics_from_year(year) uses the function url_lyric, perform some re tasks and return a pd.series with all the lyrics of the chosen year

Answers

We can get the solution using the pythons asyncio module. Please refer to this Article It's not an exact solution but similar to your problem.

import asyncio
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def url_lyric(music, artist):
    pass


def lyrics_from_year(year):
    music = None
    artist = None
    return url_lyric(music, artist)


async def get_work_done():
    with ThreadPoolExecutor(max_workers=10) as executor:
        loop = asyncio.get_event_loop()
        tasks = [
            loop.run_in_executor(
                executor,
                lyrics_from_year,
                *(year)  # Allows us to pass in arguments to `lyrics_from_year`
            )
            for year in range(1920, 2021)
        ]

    return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(get_work_done())
loop.run_until_complete(future)

final_lyric_series = pd.Series(name="lyrics")


for result in future:
    final_lyric_series = pd.concat([final_lyric_series, result])
    print(result)
Logo

Python社区为您提供最前沿的新闻资讯和知识内容

更多推荐