# بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

We need to turn characters into numbers. We can do that with Unicode like this.

# Unicode

On Ubuntu Linux press [CTRL][Alt][T] to open a terminal, and in the terminal:

**schmuck@Schmoe:~$** `python3`

to get

**>>>** `ord('h')`

'h' has the Unicode *code point* 104. `ord` can only take a single character, so to get the code points of many characters:

`[ord(x) for x in "إِنَّ اللَّهَ اصْطَفَىٰ آدَمَ وَنُوحًا وَآلَ إِبْرَاهِيمَ وَآلَ عِمْرَانَ"]`

[1573, 1616, 1606, 1617, 1614, 32, 1575, 1604, 1604, 1617, 1614, 1607, 1614, 32, 1575, 1589, 1618, 1591, 1614, 1601, 1614, 1609, 1648, 32, 1570, 1583, 1614, 1605, 1614, 32, 1608, 1614, 1606, 1615, 1608, 1581, 1611, 1575, 32, 1608, 1614, 1570, 1604, 1614, 32, 1573, 1616, 1576, 1618, 1585, 1614, 1575, 1607, 1616, 1610, 1605, 1614, 32, 1608, 1614, 1570, 1604, 1614, 32, 1593, 1616, 1605, 1618, 1585, 1614, 1575, 1606, 1614]

But the Unicode standard keeps growing as new characters are added, and its code points cover a very large range, so raw code points are not very good for us. We can use a Unicode encoding called *UTF-8*, which turns our characters into *binary data* or *byte streams* like this:

`"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8")`

b'\xd8\xa8\xd9\x90\xd8\xb3\xd9\x92\xd9\x85\xd9\x90 \xd8\xa7\xd9\x84\xd9\x84\xd9\x91\xd9\x8e\xd9\x87\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x92\xd9\x85\xd9\x8e\xd9\xb0\xd9\x86\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x90\xd9\x8a\xd9\x85\xd9\x90'

It is not very pretty, so we can turn it into useful numbers to work with like this:

`list("بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8"))`

[216, 168, 217, 144, 216, 179, 217, 146, 217, 133, 217, 144, 32, 216, 167, 217, 132, 217, 132, 217, 145, 217, 142, 217, 135, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 146, 217, 133, 217, 142, 217, 176, 217, 134, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 144, 217, 138, 217, 133, 217, 144]

## Getting the data

Go to https://tanzil.net/download/ and choose
'Uthmani' under *Quran text type:*, 'Text' for *Output file format:* and tick all boxes except 'Include sequential tanweens', then **Download** to get the file. Then open it with python3:

`text = open("quran-uthmani.txt", 'r').read()`

`print(text)`

We can encode in UTF-8 and get some ugly binary:

`tokens = text.encode("utf-8")`

`print(tokens)`

A neater way is to turn the bytes into a list of numbers, each a byte value in the 0-255 range:

`tokens = list(map(int, tokens))`

`quit()` to get out of >>> and back to $

## Google Colab

If you have a Google account, or if you have a throwaway SIM card to get a new Google account using a dumb phone for SMS verification, you can start using Google's Colab for **free**: https://colab.research.google.com/

Yes, for **free**, so when you reach the payment screen, click away from it to get the free lab.

## Adding your file to Google Colab

Click the file icon and then the dog-eared page with an up arrow and add your *quran-uthmani.txt* file from 'Getting the data' (above).

![image](https://yakihonne.s3.ap-east-1.amazonaws.com/c0de404505c2a1c45168bf84aea1302005ff0432942cfef3a40cb1db6ee577fe/files/1711232985090-YAKIHONNES3.png)

It *will be deleted* when the session ends, so you have to add it every time you start Colab. Keep it saved somewhere.

## Our actor example applied to Python

So, we talked earlier about Feynman and James Clear's 'Atomic Habits' and how actors learn their script. We are now going to use that to learn our own 'scripting', that is, to learn programming:

*Type **everything out** and **never ever copy** and paste code. And never say never.*

We said we would not type or write by going back and forth, but in this case we will, except that we type everything out. It will first enter our memory so we know where to find it when we need it in the future. For speed we will **type first**, because searching for something we saw somewhere months ago takes longer than trying out muscle memory and just typing it.
We say 'never say never' because there are times when there is no point in typing out long paragraphs of code that are 'boilerplate', meaning code that is always used as is everywhere. In those **rare** cases, copy and paste.

## Getting the most common pairs

Type in Google Colab

`text = open("quran-uthmani.txt", 'r').read()`

and then press [SHIFT][ENTER] to run the 'code cell'. This will `open()` the file `"quran-uthmani.txt"` as read-only `'r'` and save its contents as `text`.

`tokens = text.encode("utf-8")`

Take `text` and `encode()` it with `"utf-8"` and save the result as `tokens`.

`tokens = list(map(int, tokens))`

Turn the bytes into a list of numbers, each in the 0-255 range.

```
def get_stats(ids):
    # Count how often each pair of neighbouring tokens occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):  # walk over consecutive pairs
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print(stats)
```

{(216, 168): 11603, (168, 217): 11593, (217, 144): 46642, (144, 216): 7778, (216, 179): 6122, (179, 217): 6122, (217, 146): 37372, (146, 217): 14675, (217, 133): 27071, (133, 217): 25740, (144, 32): 7712, (32, 217): 45082, (217, 177): 13819, (177, 217): 25239, (217, 132): 38550, (132, 217): 36124, (217, 145): 23016, (145, 217): 23016, (217, 142): 123396, (142, 217): 53930, (217, 135): 14962, (135, 217): 14961, (132, 216): 2316, (216, 177): 12627, (142, 216): 50698, (216, 173): 4364, (173, 217): 4364, (217, 128): 6848, (128, 217): 6808, (217, 176): 9838, (176, 217): 10000, (217, 134): 27380, (134, 217): 22530, (144, 217): 29600, (217, 138): 18334, (138, 217): 16706, (144, 10): 595, (10, 217): 4481, (146, 216): 13852, (216, 175): 5991, (175, 217): 5945, (217, 143): 37320, (143, 32): 6675, (32, 216): 26762, (216, 185): 9405, (185, 217): 9403, (142, 10): 2843, (217, 131): 10497, (131, 217): 10497, (217, 136): 24970, (136, 217): 20377, (10, 216): 1555, (216, 165): 5088, (165, 217): 5260, (216, 167): 25184, (167, 217): 6829, (142, 32): 15924, (143, 216): 5806, (216, 170): 10520, (170, 217): 10504, (143, 10): 331, (167, 32): 10720, (216, 181): 2074, (181, 217): 2071, (176, 216): 3255, (216, 183): 1273, (183, 217): 1266, (217, 130): 7034, (130, 217):
7034, (216, 176): 4932, (216, 163): 8900, (163, 217): 8901, (146, 32): 8751, (216, 186): 1221, (186, 217): 1221, (216, 182): 1686, (182, 217): 1686, (143, 217): 23252, (136, 216): 4289, (217, 147): 5376, (147, 217): 90, (147, 10): 76, (32, 219): 4379, (219, 155): 12, (155, 32): 12, (217, 129): 8747, (129, 217): 8746, (217, 139): 3741, (139, 217): 93, (217, 137): 6603, (137, 32): 3035, (216, 164): 706, (164, 217): 706, (216, 169): 2344, (169, 217): 2344, (216, 178): 1599, (178, 217): 1599, (147, 32): 2459, (134, 216): 1499, (134, 32): 3081, (217, 148): 773, (148, 217): 773, (167, 216): 2953, (216, 174): 2497, (174, 217): 2497, (136, 219): 255, (219, 159): 3988, (159, 217): 268, (147, 216): 2751, (216, 166): 921, (166, 217): 1085, (137, 217): 3531, (176, 32): 1337, (219, 150): 1682, (150, 32): 1682, (167, 219): 3789, (159, 32): 3704, (216, 161): 2782, (161, 217): 2782, (217, 140): 2519, (140, 32): 1777, (216, 180): 2124, (180, 217): 2124, (216, 184): 853, (184, 217): 853, (140, 10): 605, (133, 32): 1328, (139, 216): 2976, (140, 219): 134, (219, 162): 510, (162, 32): 338, (219, 151): 603, (151, 32): 603, (177, 216): 1197, (170, 32): 17, (216, 172): 3317, (172, 217): 3317, (216, 171): 1414, (171, 217): 1414, (143, 219): 1256, (219, 165): 1257, (165, 32): 1042, (217, 141): 2633, (141, 32): 2080, (219, 154): 1972, (154, 32): 1972, (138, 216): 1618, (139, 32): 556, (144, 219): 957, (219, 166): 957, (166, 32): 791, (219, 153): 68, (153, 32): 68, (10, 219): 199, (219, 158): 199, (158, 32): 199, (219, 152): 22, (152, 32): 22, (134, 219): 270, (162, 216): 158, (132, 32): 110, (141, 10): 454, (141, 219): 99, (219, 173): 99, (173, 32): 84, (168, 32): 9, (136, 32): 49, (128, 219): 40, (219, 167): 38, (139, 219): 106, (175, 216): 38, (181, 219): 3, (219, 156): 7, (156, 217): 2, (165, 216): 19, (175, 32): 8, (219, 160): 66, (160, 32): 62, (159, 216): 14, (177, 32): 9, (138, 219): 10, (167, 10): 931, (159, 10): 2, (140, 216): 3, (162, 10): 14, (183, 216): 7, (137, 219): 1, (171, 
32): 1, (219, 169): 15, (169, 10): 15, (177, 219): 1, (219, 170): 1, (142, 219): 1, (219, 171): 1, (129, 32): 1, (156, 10): 2, (137, 10): 36, (185, 32): 2, (135, 10): 1, (176, 10): 178, (146, 10): 94, (219, 168): 1, (168, 216): 1, (160, 10): 4, (173, 10): 15, (156, 32): 3, (219, 172): 1, (172, 216): 1, (133, 10): 3, (219, 163): 1, (139, 10): 10, (165, 10): 24, (166, 10): 2, (168, 10): 1, (10, 10): 2, (10, 35): 28, (35, 32): 18, (32, 80): 6, (80, 76): 1, (76, 69): 1, (69, 65): 1, (65, 83): 1, (83, 69): 2, (69, 32): 3, (32, 68): 1, (68, 79): 1, (79, 32): 1, (32, 78): 2, (78, 79): 2, (79, 84): 2, (84, 32): 4, (32, 82): 1, (82, 69): 1, (69, 77): 1, (77, 79): 1, (79, 86): 1, (86, 69): 1, (32, 79): 2, (79, 82): 1, (82, 32): 1, (32, 67): 6, (67, 72): 2, (72, 65): 2, (65, 78): 2, (78, 71): 3, (71, 69): 1, (32, 84): 9, (84, 72): 1, (72, 73): 1, (73, 83): 2, (83, 32): 3, (67, 79): 1, (79, 80): 1, (80, 89): 1, (89, 82): 1, (82, 73): 1, (73, 71): 1, (71, 72): 1, (72, 84): 1, (32, 66): 1, (66, 76): 1, (76, 79): 2, (79, 67): 1, (67, 75): 1, (75, 10): 1, (35, 61): 2, (61, 61): 134, (61, 10): 2, (35, 10): 8, (32, 32): 29, (84, 97): 4, (97, 110): 19, (110, 122): 6, (122, 105): 6, (105, 108): 7, (108, 32): 9, (32, 81): 3, (81, 117): 3, (117, 114): 4, (114, 97): 5, (110, 32): 11, (84, 101): 1, (101, 120): 6, (120, 116): 6, (116, 32): 9, (32, 40): 3, (40, 85): 1, (85, 116): 1, (116, 104): 6, (104, 109): 1, (109, 97): 2, (110, 105): 3, (105, 44): 1, (44, 32): 6, (32, 86): 1, (86, 101): 1, (101, 114): 7, (114, 115): 2, (115, 105): 3, (105, 111): 5, (111, 110): 9, (32, 49): 1, (49, 46): 1, (46, 49): 1, (49, 41): 1, (41, 10): 1, (67, 111): 2, (111, 112): 7, (112, 121): 4, (121, 114): 2, (114, 105): 7, (105, 103): 3, (103, 104): 3, (104, 116): 3, (40, 67): 1, (67, 41): 1, (41, 32): 2, (32, 50): 1, (50, 48): 2, (48, 48): 1, (48, 55): 1, (55, 45): 1, (45, 50): 1, (48, 50): 1, (50, 52): 1, (52, 32): 1, (80, 114): 3, (114, 111): 9, (111, 106): 3, (106, 101): 3, (101, 99): 5, (99, 116): 3, 
(116, 10): 1, (32, 76): 1, (76, 105): 1, (105, 99): 4, (99, 101): 5, (101, 110): 2, (110, 115): 2, (115, 101): 4, (101, 58): 1, (58, 32): 2, (67, 114): 1, (114, 101): 4, (101, 97): 3, (97, 116): 11, (116, 105): 9, (105, 118): 2, (118, 101): 5, (101, 32): 13, (111, 109): 2, (109, 109): 1, (109, 111): 2, (115, 32): 17, (32, 65): 2, (65, 116): 1, (116, 116): 2, (116, 114): 3, (105, 98): 2, (98, 117): 3, (117, 116): 3, (32, 51): 1, (51, 46): 1, (46, 48): 1, (48, 10): 1, (84, 104): 3, (104, 105): 6, (105, 115): 12, (32, 99): 12, (99, 111): 7, (121, 32): 9, (32, 111): 8, (111, 102): 6, (102, 32): 6, (32, 116): 16, (104, 101): 3, (116, 101): 12, (32, 105): 10, (99, 97): 4, (97, 114): 2, (101, 102): 1, (102, 117): 1, (117, 108): 1, (108, 108): 5, (108, 121): 5, (32, 112): 3, (112, 114): 5, (111, 100): 2, (100, 117): 2, (117, 99): 2, (101, 100): 10, (100, 44): 2, (32, 104): 2, (104, 108): 1, (32, 10): 7, (32, 118): 3, (105, 102): 1, (102, 105): 2, (105, 101): 3, (100, 32): 12, (32, 97): 13, (110, 100): 5, (110, 116): 4, (105, 110): 9, (110, 117): 1, (117, 111): 1, (111, 117): 3, (117, 115): 3, (115, 108): 1, (32, 109): 2, (105, 116): 3, (116, 111): 5, (111, 114): 4, (32, 98): 5, (98, 121): 1, (97, 32): 2, (32, 103): 2, (103, 114): 2, (117, 112): 3, (112, 32): 1, (32, 115): 5, (115, 112): 1, (112, 101): 1, (99, 105): 1, (105, 97): 3, (97, 108): 6, (108, 105): 3, (115, 116): 3, (116, 115): 2, (116, 46): 2, (46, 10): 4, (84, 69): 1, (69, 82): 1, (82, 77): 1, (77, 83): 1, (79, 70): 1, (70, 32): 1, (32, 85): 1, (85, 83): 1, (69, 58): 1, (58, 10): 1, (32, 45): 3, (45, 32): 3, (80, 101): 1, (114, 109): 1, (109, 105): 1, (115, 115): 1, (111, 32): 4, (32, 100): 2, (100, 105): 2, (114, 98): 2, (98, 97): 2, (105, 109): 2, (109, 32): 3, (112, 105): 2, (101, 115): 6, (116, 44): 2, (71, 73): 1, (73, 78): 1, (71, 32): 1, (32, 73): 2, (73, 84): 1, (65, 76): 1, (76, 76): 1, (79, 87): 1, (87, 69): 1, (69, 68): 1, (68, 46): 1, (98, 101): 3, (32, 117): 3, (110, 121): 1, (32, 119): 1, (119, 
101): 1, (101, 98): 1, (98, 115): 2, (114, 32): 2, (97, 112): 2, (112, 112): 2, (112, 108): 1, (110, 44): 1, (111, 118): 1, (118, 105): 1, (105, 100): 1, (100, 101): 4, (104, 97): 4, (115, 111): 1, (114, 99): 1, (40, 84): 1, (116, 41): 1, (99, 108): 2, (108, 101): 4, (114, 108): 1, (32, 108): 1, (110, 107): 1, (107, 32): 3, (97, 100): 1, (116, 97): 4, (108, 46): 2, (46, 110): 2, (110, 101): 2, (101, 116): 2, (32, 101): 1, (110, 97): 1, (97, 98): 1, (98, 108): 1, (32, 107): 1, (107, 101): 1, (101, 101): 1, (101, 112): 2, (112, 10): 1, (97, 99): 1, (99, 107): 2, (99, 104): 2, (110, 103): 2, (103, 101): 1, (115, 46): 1, (32, 110): 1, (110, 111): 1, (111, 116): 1, (115, 104): 2, (110, 99): 1, (108, 117): 1, (117, 100): 1, (32, 114): 1, (101, 108): 1, (32, 102): 2, (102, 114): 1, (97, 105): 1, (103, 32): 1, (115, 117): 1, (117, 98): 1, (112, 111): 1, (114, 116): 1, (80, 108): 1, (97, 115): 1, (112, 100): 2, (100, 97): 2, (116, 58): 1, (116, 112): 1, (112, 58): 1, (58, 47): 1, (47, 47): 1, (47, 116): 1, (116, 47): 1, (47, 117): 1, (115, 47): 1, (47, 10): 1}

## The most common pair is ...
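Printing every pair produces a very long list. As a rough sketch of a shorter view, the standard library's `collections.Counter` can report only the top few pairs. This is not part of the original walkthrough: the names `sample`, `pair_counts` and `top_three` are illustrative, and a short sample string stands in for the full file so the cell runs on its own.

```python
from collections import Counter

# Assumption: a short sample replaces the full quran-uthmani.txt file
# so that this sketch is self-contained.
sample = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
ids = list(sample.encode("utf-8"))

# Count consecutive byte pairs, the same idea as get_stats.
pair_counts = Counter(zip(ids, ids[1:]))

# most_common(n) returns the n highest counts as (pair, count) tuples.
top_three = pair_counts.most_common(3)
for pair, count in top_three:
    print(pair, count)
```

Note that `most_common` yields `(pair, count)` tuples, the reverse of the `(count, pair)` order used when sorting by value by hand.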
`print(sorted(((v,k) for k,v in stats.items()), reverse=True))` `sorted` by value`v` to get the most common pairs first [(123396, (217, 142)), (53930, (142, 217)), (50698, (142, 216)), (46642, (217, 144)), (45082, (32, 217)), (38550, (217, 132)), (37372, (217, 146)), (37320, (217, 143)), (36124, (132, 217)), (29600, (144, 217)), (27380, (217, 134)), (27071, (217, 133)), (26762, (32, 216)), (25740, (133, 217)), (25239, (177, 217)), (25184, (216, 167)), (24970, (217, 136)), (23252, (143, 217)), (23016, (217, 145)), (23016, (145, 217)), (22530, (134, 217)), (20377, (136, 217)), (18334, (217, 138)), (16706, (138, 217)), (15924, (142, 32)), (14962, (217, 135)), (14961, (135, 217)), (14675, (146, 217)), (13852, (146, 216)), (13819, (217, 177)), (12627, (216, 177)), (11603, (216, 168)), (11593, (168, 217)), (10720, (167, 32)), (10520, (216, 170)), (10504, (170, 217)), (10497, (217, 131)), (10497, (131, 217)), (10000, (176, 217)), (9838, (217, 176)), (9405, (216, 185)), (9403, (185, 217)), (8901, (163, 217)), (8900, (216, 163)), (8751, (146, 32)), (8747, (217, 129)), (8746, (129, 217)), (7778, (144, 216)), (7712, (144, 32)), (7034, (217, 130)), (7034, (130, 217)), (6848, (217, 128)), (6829, (167, 217)), (6808, (128, 217)), (6675, (143, 32)), (6603, (217, 137)), (6122, (216, 179)), (6122, (179, 217)), (5991, (216, 175)), (5945, (175, 217)), (5806, (143, 216)), (5376, (217, 147)), (5260, (165, 217)), (5088, (216, 165)), (4932, (216, 176)), (4481, (10, 217)), (4379, (32, 219)), (4364, (216, 173)), (4364, (173, 217)), (4289, (136, 216)), (3988, (219, 159)), (3789, (167, 219)), (3741, (217, 139)), (3704, (159, 32)), (3531, (137, 217)), (3317, (216, 172)), (3317, (172, 217)), (3255, (176, 216)), (3081, (134, 32)), (3035, (137, 32)), (2976, (139, 216)), (2953, (167, 216)), (2843, (142, 10)), (2782, (216, 161)), (2782, (161, 217)), (2751, (147, 216)), (2633, (217, 141)), (2519, (217, 140)), (2497, (216, 174)), (2497, (174, 217)), (2459, (147, 32)), (2344, (216, 169)), (2344, (169, 
217)), (2316, (132, 216)), (2124, (216, 180)), (2124, (180, 217)), (2080, (141, 32)), (2074, (216, 181)), (2071, (181, 217)), (1972, (219, 154)), (1972, (154, 32)), (1777, (140, 32)), (1686, (216, 182)), (1686, (182, 217)), (1682, (219, 150)), (1682, (150, 32)), (1618, (138, 216)), (1599, (216, 178)), (1599, (178, 217)), (1555, (10, 216)), (1499, (134, 216)), (1414, (216, 171)), (1414, (171, 217)), (1337, (176, 32)), (1328, (133, 32)), (1273, (216, 183)), (1266, (183, 217)), (1257, (219, 165)), (1256, (143, 219)), (1221, (216, 186)), (1221, (186, 217)), (1197, (177, 216)), (1085, (166, 217)), (1042, (165, 32)), (957, (219, 166)), (957, (144, 219)), (931, (167, 10)), (921, (216, 166)), (853, (216, 184)), (853, (184, 217)), (791, (166, 32)), (773, (217, 148)), (773, (148, 217)), (706, (216, 164)), (706, (164, 217)), (605, (140, 10)), (603, (219, 151)), (603, (151, 32)), (595, (144, 10)), (556, (139, 32)), (510, (219, 162)), (454, (141, 10)), (338, (162, 32)), (331, (143, 10)), (270, (134, 219)), (268, (159, 217)), (255, (136, 219)), (199, (219, 158)), (199, (158, 32)), (199, (10, 219)), (178, (176, 10)), (158, (162, 216)), (134, (140, 219)), (134, (61, 61)), (110, (132, 32)), (106, (139, 219)), (99, (219, 173)), (99, (141, 219)), (94, (146, 10)), (93, (139, 217)), (90, (147, 217)), (84, (173, 32)), (76, (147, 10)), (68, (219, 153)), (68, (153, 32)), (66, (219, 160)), (62, (160, 32)), (49, (136, 32)), (40, (128, 219)), (38, (219, 167)), (38, (175, 216)), (36, (137, 10)), (29, (32, 32)), (28, (10, 35)), (24, (165, 10)), (22, (219, 152)), (22, (152, 32)), (19, (165, 216)), (19, (97, 110)), (18, (35, 32)), (17, (170, 32)), (17, (115, 32)), (16, (32, 116)), (15, (219, 169)), (15, (173, 10)), (15, (169, 10)), (14, (162, 10)), (14, (159, 216)), (13, (101, 32)), (13, (32, 97)), (12, (219, 155)), (12, (155, 32)), (12, (116, 101)), (12, (105, 115)), (12, (100, 32)), (12, (32, 99)), (11, (110, 32)), (11, (97, 116)), (10, (139, 10)), (10, (138, 219)), (10, (101, 100)), (10, (32, 
105)), (9, (177, 32)), (9, (168, 32)), (9, (121, 32)), (9, (116, 105)), (9, (116, 32)), (9, (114, 111)), (9, (111, 110)), (9, (108, 32)), (9, (105, 110)), (9, (32, 84)), (8, (175, 32)), (8, (35, 10)), (8, (32, 111)), (7, (219, 156)), (7, (183, 216)), (7, (114, 105)), (7, (111, 112)), (7, (105, 108)), (7, (101, 114)), (7, (99, 111)), (7, (32, 10)), (6, (122, 105)), (6, (120, 116)), (6, (116, 104)), (6, (111, 102)), (6, (110, 122)), (6, (104, 105)), (6, (102, 32)), (6, (101, 120)), (6, (101, 115)), (6, (97, 108)), (6, (44, 32)), (6, (32, 80)), (6, (32, 67)), (5, (118, 101)), (5, (116, 111)), (5, (114, 97)), (5, (112, 114)), (5, (110, 100)), (5, (108, 121)), (5, (108, 108)), (5, (105, 111)), (5, (101, 99)), (5, (99, 101)), (5, (32, 115)), (5, (32, 98)), (4, (160, 10)), (4, (117, 114)), (4, (116, 97)), (4, (115, 101)), (4, (114, 101)), (4, (112, 121)), (4, (111, 114)), (4, (111, 32)), (4, (110, 116)), (4, (108, 101)), (4, (105, 99)), (4, (104, 97)), (4, (100, 101)), (4, (99, 97)), (4, (84, 97)), (4, (84, 32)), (4, (46, 10)), (3, (181, 219)), (3, (156, 32)), (3, (140, 216)), (3, (133, 10)), (3, (117, 116)), (3, (117, 115)), (3, (117, 112)), (3, (116, 114)), (3, (115, 116)), (3, (115, 105)), (3, (111, 117)), (3, (111, 106)), (3, (110, 105)), (3, (109, 32)), (3, (108, 105)), (3, (107, 32)), (3, (106, 101)), (3, (105, 116)), (3, (105, 103)), (3, (105, 101)), (3, (105, 97)), (3, (104, 116)), (3, (104, 101)), (3, (103, 104)), (3, (101, 97)), (3, (99, 116)), (3, (98, 117)), (3, (98, 101)), (3, (84, 104)), (3, (83, 32)), (3, (81, 117)), (3, (80, 114)), (3, (78, 71)), (3, (69, 32)), (3, (45, 32)), (3, (32, 118)), (3, (32, 117)), (3, (32, 112)), (3, (32, 81)), (3, (32, 45)), (3, (32, 40)), (2, (185, 32)), (2, (166, 10)), (2, (159, 10)), (2, (156, 217)), (2, (156, 10)), (2, (121, 114)), (2, (117, 99)), (2, (116, 116)), (2, (116, 115)), (2, (116, 46)), (2, (116, 44)), (2, (115, 104)), (2, (114, 115)), (2, (114, 98)), (2, (114, 32)), (2, (112, 112)), (2, (112, 105)), (2, (112, 
100)), (2, (111, 109)), (2, (111, 100)), (2, (110, 115)), (2, (110, 103)), (2, (110, 101)), (2, (109, 111)), (2, (109, 97)), (2, (108, 46)), (2, (105, 118)), (2, (105, 109)), (2, (105, 98)), (2, (103, 114)), (2, (102, 105)), (2, (101, 116)), (2, (101, 112)), (2, (101, 110)), (2, (100, 117)), (2, (100, 105)), (2, (100, 97)), (2, (100, 44)), (2, (99, 108)), (2, (99, 107)), (2, (99, 104)), (2, (98, 115)), (2, (98, 97)), (2, (97, 114)), (2, (97, 112)), (2, (97, 32)), (2, (83, 69)), (2, (79, 84)), (2, (78, 79)), (2, (76, 79)), (2, (73, 83)), (2, (72, 65)), (2, (67, 111)), (2, (67, 72)), (2, (65, 78)), (2, (61, 10)), (2, (58, 32)), (2, (50, 48)), (2, (46, 110)), (2, (41, 32)), (2, (35, 61)), (2, (32, 109)), (2, (32, 104)), (2, (32, 103)), (2, (32, 102)), (2, (32, 100)), (2, (32, 79)), (2, (32, 78)), (2, (32, 73)), (2, (32, 65)), (2, (10, 10)), (1, (219, 172)), (1, (219, 171)), (1, (219, 170)), (1, (219, 168)), (1, (219, 163)), (1, (177, 219)), (1, (172, 216)), (1, (171, 32)), (1, (168, 216)), (1, (168, 10)), (1, (142, 219)), (1, (137, 219)), (1, (135, 10)), (1, (129, 32)), (1, (119, 101)), (1, (118, 105)), (1, (117, 111)), (1, (117, 108)), (1, (117, 100)), (1, (117, 98)), (1, (116, 112)), (1, (116, 58)), (1, (116, 47)), (1, (116, 41)), (1, (116, 10)), (1, (115, 117)), (1, (115, 115)), (1, (115, 112)), (1, (115, 111)), (1, (115, 108)), (1, (115, 47)), (1, (115, 46)), (1, (114, 116)), (1, (114, 109)), (1, (114, 108)), (1, (114, 99)), (1, (112, 111)), (1, (112, 108)), (1, (112, 101)), (1, (112, 58)), (1, (112, 32)), (1, (112, 10)), (1, (111, 118)), (1, (111, 116)), (1, (110, 121)), (1, (110, 117)), (1, (110, 111)), (1, (110, 107)), (1, (110, 99)), (1, (110, 97)), (1, (110, 44)), (1, (109, 109)), (1, (109, 105)), (1, (108, 117)), (1, (107, 101)), (1, (105, 102)), (1, (105, 100)), (1, (105, 44)), (1, (104, 109)), (1, (104, 108)), (1, (103, 101)), (1, (103, 32)), (1, (102, 117)), (1, (102, 114)), (1, (101, 108)), (1, (101, 102)), (1, (101, 101)), (1, (101, 98)), (1, (101, 
58)), (1, (99, 105)), (1, (98, 121)), (1, (98, 108)), (1, (97, 115)), (1, (97, 105)), (1, (97, 100)), (1, (97, 99)), (1, (97, 98)), (1, (89, 82)), (1, (87, 69)), (1, (86, 101)), (1, (86, 69)), (1, (85, 116)), (1, (85, 83)), (1, (84, 101)), (1, (84, 72)), (1, (84, 69)), (1, (82, 77)), (1, (82, 73)), (1, (82, 69)), (1, (82, 32)), (1, (80, 108)), (1, (80, 101)), (1, (80, 89)), (1, (80, 76)), (1, (79, 87)), (1, (79, 86)), (1, (79, 82)), (1, (79, 80)), (1, (79, 70)), (1, (79, 67)), (1, (79, 32)), (1, (77, 83)), (1, (77, 79)), (1, (76, 105)), (1, (76, 76)), (1, (76, 69)), (1, (75, 10)), (1, (73, 84)), (1, (73, 78)), (1, (73, 71)), (1, (72, 84)), (1, (72, 73)), (1, (71, 73)), (1, (71, 72)), (1, (71, 69)), (1, (71, 32)), (1, (70, 32)), (1, (69, 82)), (1, (69, 77)), (1, (69, 68)), (1, (69, 65)), (1, (69, 58)), (1, (68, 79)), (1, (68, 46)), (1, (67, 114)), (1, (67, 79)), (1, (67, 75)), (1, (67, 41)), (1, (66, 76)), (1, (65, 116)), (1, (65, 83)), (1, (65, 76)), (1, (58, 47)), (1, (58, 10)), (1, (55, 45)), (1, (52, 32)), (1, (51, 46)), (1, (50, 52)), (1, (49, 46)), (1, (49, 41)), (1, (48, 55)), (1, (48, 50)), (1, (48, 48)), (1, (48, 10)), (1, (47, 117)), (1, (47, 116)), (1, (47, 47)), (1, (47, 10)), (1, (46, 49)), (1, (46, 48)), (1, (45, 50)), (1, (41, 10)), (1, (40, 85)), (1, (40, 84)), (1, (40, 67)), (1, (32, 119)), (1, (32, 114)), (1, (32, 110)), (1, (32, 108)), (1, (32, 107)), (1, (32, 101)), (1, (32, 86)), (1, (32, 85)), (1, (32, 82)), (1, (32, 76)), (1, (32, 68)), (1, (32, 66)), (1, (32, 51)), (1, (32, 50)), (1, (32, 49))]

```
top_pair = max(stats, key=stats.get)
top_pair
```

Our most common pair is (217, 142), which sits at the top of our list as `(123396, (217, 142))`: it occurs 123396 times.
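The numbers 217 and 142 are byte values, so they only make sense as a character when read together. A minimal sketch, assuming we just want to know which character the pair encodes (the name `char` is illustrative and not from the original), is to decode the two bytes as UTF-8:

```python
import unicodedata

top_pair = (217, 142)  # the most common pair found above

# Treat the two numbers as raw bytes and decode them together as UTF-8.
char = bytes(top_pair).decode("utf-8")

print(ord(char))               # 1614
print(unicodedata.name(char))  # ARABIC FATHA
```

This matches the code point 1614 that shows up again and again in the `ord` list near the start.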
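`chr(217), chr(142)`

('Ù', '\x8e')

`chr` reads 217 and 142 as two separate code points, which is why the result does not look like Arabic: the numbers are two bytes that together form the UTF-8 encoding of a single character.

## Swapping the pair for a single token

```
def get_stats(ids):
    # Count how often each pair of neighbouring tokens occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of `pair` in `ids` with the single token `idx`.
    newids = []
    i = 0
    while i < len(ids):
        # A match needs two tokens, so never look past the last element.
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2  # skip both tokens of the pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))
```

Replace a `pair` (6, 7) in a list of numbers called `ids`, here `[5, 6, 6, 7, 9, 1]`, with a single token `idx`, here 99:

[5, 6, 99, 9, 1]

## We have tokens 0-255, so we replace the most common pair with a new token 256:

```
tokens2 = merge(tokens, top_pair, 256)
#print(tokens2)
print("length: ", len(tokens2))
```

length: 1237147

Now we repeat this: each round finds the most common pair and merges it into a new token, until the vocabulary has 276 entries.

```
vocab_size = 276  # target vocabulary size
num_merges = vocab_size - 256  # 20 merges on top of the 256 byte values
ids = list(tokens)  # copy so the original list is not changed
merges = {}  # maps a (token, token) pair to its new token
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    idx = 256 + i
    print(f'merging {pair} into a new token {idx}')
    ids = merge(ids, pair, idx)
    merges[pair] = idx
```

## Source: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode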