Answer a question

I have a dictionary as follows:

{'abc':100,'xyz':200,'def':250 .............}

It is a dictionary with keys as a name of a entity and the value is count of that entity. I need to return the top 10 elements of from the dictionary.

I can write a heap to do it, but I'm not sure how to do value to key mapping as certain values will be equal.

Is there any other data structure to do this?

Answers

Using heapq you probably want to do something like this:

heap = [(-value, key) for key,value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]

Note that since heapq implements only a min heap it's better to invert the values, so that bigger values become smaller.

This solution will be slower for small sizes of the heap, for example:

>>> import random
>>> import itertools as it
>>> def key_generator():
...     characters = [chr(random.randint(65, 90)) for x in range(100)]
...     for i in it.count():
...             yield ''.join(random.sample(characters, 3))
... 
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
...     items = [(-value, key) for key, value in the_dict.items()]
...     smallest = heapq.nsmallest(10, items)
...     return [-value for value, key in smallest]
... 
>>> def with_sorted(the_dict):
...     return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
... 
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902

With 3000 values it's just slightly faster than the sorted version, which is O(nlogn) instead of O(n + mlogn). If we increase the size of the dict to 10000 the heapq version becomes even faster:

>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549

The timings probably depends also on the machine on which you are running. You should probably profile which solution works best in your case. If the efficiency is not critical I'd suggest to use the sorted version because it's simpler.

Logo

Python社区为您提供最前沿的新闻资讯和知识内容

更多推荐