Tutorial¶
Tries¶
There are several trie classes in this package:
marisa_trie.BinaryTrie |
|
marisa_trie.Trie |
|
marisa_trie.RecordTrie |
|
marisa_trie.BytesTrie |
marisa_trie.Trie¶
Create a new trie from a list of keys:
>>> import marisa_trie
>>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
Check if a key is present:
>>> u'key1' in trie
True
>>> u'key20' in trie
False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys in a trie:
>>> trie[u'key2']
1
Note that you can’t assign a value to a marisa_trie.Trie
key,
but can use the returned ID to store values in a separate data structure
(e.g. in a Python list or NumPy array).
An ID can be mapped back to the corresponding key:
>>> trie.restore_key(1)
u'key2'
Query a trie
Find all trie keys which are prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
Find all trie keys which start with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
The latter is complemented by
items()
which returns all matching(key, ID)
pairs.
All query methods have generator-based versions prefixed with iter
.
Note
If you’re looking for a trie with bytes keys, check out
BinaryTrie
.
marisa_trie.RecordTrie¶
Create a new trie from a list of (key, data)
pairs:
>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [(1, 2), (2, 1), (3, 3), (2, 1)]
>>> fmt = "<HH" # two short integers.
>>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))
Each data tuple would be converted to bytes using struct.pack()
. Take a
look at available format strings
here.
Check if a key is present:
>>> u'foo' in trie
True
>>> u'spam' in trie
False
marisa_trie.RecordTrie
allows duplicate keys. Therefore __getitem__
and
get
return a list of values.
>>> trie[u'bar']
[(2, 1)]
>>> trie[u'foo']
[(1, 2), (2, 1)]
>>> trie.get(u'bar', 123)
[(2, 1)]
>>> trie.get(u'BAAR', 123) # default value.
123
Similarly, keys()
and
items()
take into account key multiplicities:
>> trie.keys(u'fo')
[u'foo', u'foo', u'foobar']
>> trie.items(u'fo')
[(u'foo', (1, 2)), (u'foo', (2, 1), (u'foobar', (3, 3))]
marisa_trie.BytesTrie¶
BytesTrie
is similar to RecordTrie
, but the values are raw bytes,
not tuples:
>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2']
>>> trie = marisa_trie.BytesTrie(zip(keys, values))
>>> trie[u'bar']
[b'bar-value']
Persistence¶
Trie objects supports saving/loading, pickling/unpickling and memory mapped I/O.
Save trie to a file:
>>> trie.save('my_trie.marisa')
Load trie from a file:
>>> trie2 = marisa_trie.Trie()
>>> trie2.load('my_trie.marisa')
Note
You may also build a trie using marisa-build
command-line
utility (provided by underlying C++ library; it should be
downloaded and compiled separately) and then load the trie
from the resulting file using load
.
Trie objects are picklable:
>>> import pickle
>>> data = pickle.dumps(trie)
>>> trie3 = pickle.loads(data)
Memory mapped I/O¶
It is possible to use memory mapped file as data source:
>>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')
This way the whole dictionary won’t be loaded fully to memory; memory mapped I/O is an easy way to share dictionary data among processes.
Warning
Memory mapped trie might cause lots of random disk accesses which considerably increases the search time.
Storage options¶
marisa-trie C++ library provides some configuration options for trie storage; See “Enumeration Constants” section in the library docs.
These options are exposed as order
, num_tries
, cache_size
and binary
keyword arguments for trie constructors.
For example, set order
to marisa_trie.LABEL_ORDER
in order to
make trie functions return results in alphabetical oder:
>>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)
Note that two tries constructed from identical data but with different order
arguments will compare unequal:
>>> t1 = marisa_trie.Trie(order=marisa_trie.LABEL_ORDER)
>>> t2 = marisa_trie.Trie(order=marisa_trie.WEIGHT_ORDER)
>>> t1 == t2
False