Trying 言語処理100本ノック 2015 (NLP 100 Exercises), Chapter 2 - "Diary" インターネットさんへの恩返し

This is way too much fun. I worked through Chapter 2. www.cl.ecei.tohoku.ac.jp

10. Count the number of lines

【Program】

#!/usr/bin/env python
# coding:utf-8

i=0
for line in open('hightemp.txt', 'r'):
        i+=1
print i

【Run & result】

$ sudo python 10.py
24

【Check】

$ wc -l hightemp.txt
24 hightemp.txt
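
For reference, the same count can be written without a manual counter. This is only a minimal sketch in Python 3 syntax (the scripts in this post are Python 2); the name `count_lines` is made up for illustration, and it accepts any iterable of lines, including an open file.

```python
def count_lines(lines):
    """Count the items in any iterable of lines (e.g. an open file)."""
    return sum(1 for _ in lines)


# Illustrative in-memory sample instead of hightemp.txt
sample = ["a\tb\n", "c\td\n", "e\tf\n"]
print(count_lines(sample))
```

With a real file this would be called as `count_lines(open('hightemp.txt'))`.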

11. Replace tabs with spaces

【Program】

#!/usr/bin/env python
# coding:utf-8

for line in open('hightemp.txt', 'r'):
    print line[:-1].replace('\t', '  ')

【Run & result】

$ sudo python 11.py
高知県  江川崎  41  2013-08-12
埼玉県  熊谷  40.9  2007-08-16
岐阜県  多治見  40.9  2007-08-16
・
・
・
<omitted>

【Check】

$ expand -t 2 hightemp.txt
高知県 江川崎 41  2013-08-12
埼玉県 熊谷  40.9  2007-08-16
岐阜県 多治見 40.9  2007-08-16
・
・
・
<omitted>
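
The two outputs differ because `expand` pads each tab up to the next tab stop, while the script swaps every tab for a fixed two spaces. Python's built-in `str.expandtabs` pads to tab stops in the same way. A small sketch in Python 3 syntax, using one data line from above as a literal:

```python
line = "埼玉県\t熊谷\t40.9\t2007-08-16"

# Fixed-width replacement: every tab becomes exactly two spaces.
fixed = line.replace("\t", "  ")

# Tab-stop expansion: each tab pads to the next multiple of 2 columns,
# counting one column per character.
stops = line.expandtabs(2)

print(fixed)
print(stops)
```

Which of the two is "right" depends on how the task is read; the sketch only shows why the script and `expand -t 2` do not agree.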

12. Save the first column to col1.txt and the second column to col2.txt

【Program】

#!/usr/bin/env python
# coding:utf-8

str1=""
str2=""

for line in open('hightemp.txt', 'r'):
        itemList = line[:-1].split('\t')
        str1 += itemList[0] + "\n"
        str2 += itemList[1] + "\n"

f = open('col1.txt', 'w')
f.write(str1)
f.close()

f = open('col2.txt', 'w')
f.write(str2)
f.close()

【Run & result】

$ sudo python 12.py
$ cat col1.txt
高知県
埼玉県
岐阜県
・
・
・
<omitted>

$ cat col2.txt
江川崎
熊谷
多治見
・
・
・
<omitted>

【Check】

$ cut -f 1 hightemp.txt
高知県
埼玉県
岐阜県
・
・
・
<omitted>

$ cut -f 2 hightemp.txt
江川崎
熊谷
多治見
・
・
・
<omitted>

13. Merge col1.txt and col2.txt

【Program】

#!/usr/bin/env python
# coding:utf-8

temp_arr=[]
temp_str=""

for (a,b) in zip(open('col1.txt', 'r'),open('col2.txt', 'r')):
        temp_arr.append([a[:-1],b[:-1]])

for i in temp_arr:
        temp_str += i[0] + "\t" + i[1] + "\n"

f = open('col1-2.txt', 'w')
f.write(temp_str)
f.close()

【Run & result】

$ sudo python 13.py
$ cat col1-2.txt
高知県  江川崎
埼玉県  熊谷
岐阜県  多治見
・
・
・
<omitted>

【Check】

$ paste col1.txt col2.txt
高知県  江川崎
埼玉県  熊谷
岐阜県  多治見
・
・
・
<omitted>
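
One thing to watch with `a[:-1]`: it removes the last character even when a line has no trailing newline, which can happen on the final line of a file. A hedged sketch of the same merge in Python 3 syntax, using `rstrip("\n")` instead; `merge_columns` is a name invented here, and unlike `paste` it stops at the shorter input:

```python
def merge_columns(left_lines, right_lines):
    """Yield tab-joined pairs of lines, stopping at the shorter input."""
    for left, right in zip(left_lines, right_lines):
        # rstrip("\n") only removes a newline if one is actually there
        yield left.rstrip("\n") + "\t" + right.rstrip("\n")


col1 = ["高知県\n", "埼玉県\n"]
col2 = ["江川崎\n", "熊谷"]  # last line deliberately has no newline
for merged in merge_columns(col1, col2):
    print(merged)
```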

14. Output the first N lines

【Program】

#!/usr/bin/env python
# coding:utf-8

import sys

argvs = sys.argv
num   = int(argvs[1])
i = 0

for a in open('hightemp.txt', 'r'):
        if num > i:
                print a[:-1]
        else:
                break
        i+=1

【Run & result】

$ sudo python 14.py 3
高知県  江川崎  41      2013-08-12
埼玉県  熊谷    40.9    2007-08-16
岐阜県  多治見  40.9    2007-08-16
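
The matching shell command for this one would be `head -n 3 hightemp.txt`. On the Python side, `itertools.islice` can stand in for the counter and `break`. A minimal sketch in Python 3 syntax; the function name `head` is just an illustration:

```python
from itertools import islice


def head(lines, n):
    """Return the first n lines of an iterable, without trailing newlines."""
    return [line.rstrip("\n") for line in islice(lines, n)]


print(head(["a\n", "b\n", "c\n", "d\n"], 3))
```

Because `islice` stops reading after n items, the rest of the file is never touched.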

15. Output the last N lines

【Program】

#!/usr/bin/env python
# coding:utf-8

import sys

argvs = sys.argv
num   = int(argvs[1])

i=0
for line in open('hightemp.txt', 'r'):
        i+=1

j = 0
for a in open('hightemp.txt', 'r'):
        #print str(i) + " " + str(num) + " " + str(j)
        if (i - num) <= j:
                print a[:-1]
        j+=1

【Run & result】

$ sudo python 15.py 4
大阪府  豊中    39.9    1994-08-08
山梨県  大月    39.9    1990-07-19
山形県  鶴岡    39.9    1978-08-03
愛知県  名古屋  39.9    1942-08-02

【Check】

$ tail -n 4 hightemp.txt
大阪府  豊中    39.9    1994-08-08
山梨県  大月    39.9    1990-07-19
山形県  鶴岡    39.9    1978-08-03
愛知県  名古屋  39.9    1942-08-02
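
The script reads the file twice, once to count and once to print. A `collections.deque` with `maxlen` keeps only the last N items, so a single pass is enough. A sketch in Python 3 syntax, with `tail` as a made-up function name:

```python
from collections import deque


def tail(lines, n):
    """Return the last n lines of an iterable in one pass."""
    # A bounded deque discards the oldest item whenever it is full.
    return [line.rstrip("\n") for line in deque(lines, maxlen=n)]


print(tail(["a\n", "b\n", "c\n", "d\n"], 2))
```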

16. Split the file into N parts

【Program】

#!/usr/bin/env python
# coding:utf-8

import sys

argvs = sys.argv
num   = int(argvs[1])
temp_arr1 = []
temp_arr2 = []

for line in open('hightemp.txt', 'r'):
        temp_arr1.append(line[:-1])

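# Lines per output file: len / num, rounded up
if len(temp_arr1)%num > 0:
        split_num = len(temp_arr1)//num + 1
else:
        split_num = len(temp_arr1)//num

# range() has to run to len(temp_arr1) so that a final one-line chunk is kept
temp_arr2 = [temp_arr1[i:i+split_num] for i in range(0,len(temp_arr1),split_num)]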

i=0
for a in temp_arr2:
        f = open('16-' + str(i) + '.txt', 'w')
        temp = ""
        for b in a:
                temp += b + "\n"
        f.write(temp)
        i+=1
        f.close()

【Run & result】

$ sudo python 16.py 5
$

$ ls | grep 16-
16-0.txt
16-1.txt
16-2.txt
16-3.txt
16-4.txt

$ cat -n 16-0.txt
     1  高知県  江川崎  41      2013-08-12
     2  埼玉県  熊谷    40.9    2007-08-16
     3  岐阜県  多治見  40.9    2007-08-16
     4  山形県  山形    40.8    1933-07-25
     5  山梨県  甲府    40.7    2013-08-10

・
・
<omitted>
・
・
$ cat -n 16-4.txt
     1  大阪府  豊中    39.9    1994-08-08
     2  山梨県  大月    39.9    1990-07-19
     3  山形県  鶴岡    39.9    1978-08-03
     4  愛知県  名古屋  39.9    1942-08-02

$ split -n 5 hightemp.txt

$ ls | grep xa
xaa
xab
xac
xad
xae

$ cat -n xaa
     1  高知県  江川崎  41      2013-08-12
     2  埼玉県  熊谷    40.9    2007-08-16
     3  岐阜県  多治見  40.9    2007-08-16
     4  山形県  山形    40.8    1933-07-25
     5  山梨県  甲府    40.7    2013

$ cat -n xae
     1  玉県    鳩山    39.9    1997-07-05
     2  大阪府  豊中    39.9    1994-08-08
     3  山梨県  大月    39.9    1990-07-19
     4  山形県  鶴岡    39.9    1978-08-03
     5  愛知県  名古屋  39.9    1942-08-02

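# The files from split look odd when cat'ed. That is because `split -n 5` divides the file into 5 parts by byte count, so lines get cut in the middle. `split -n l/5 hightemp.txt` (GNU split) splits without cutting lines.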
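
To make the difference concrete: `split -n 5` works on bytes, while the script chunks by whole lines. Here is a hedged sketch of the line-based chunking in Python 3 syntax. `split_lines` is a hypothetical helper, and it returns at most n chunks of ceil(len / n) lines each, which is the same rule the script uses.

```python
def split_lines(lines, n):
    """Split a list of lines into at most n chunks of ceil(len/n) lines."""
    if not lines:
        return []
    size = -(-len(lines) // n)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# 24 dummy lines, standing in for hightemp.txt
chunks = split_lines(["line%d" % i for i in range(24)], 5)
print([len(c) for c in chunks])
```

No line is ever cut, but the last chunk can be shorter than the others.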

17. Distinct strings in the first column

【Program】

#!/usr/bin/env python
# coding:utf-8

temp_arr=[]

for line in open('hightemp.txt', 'r'):
        itemList = line[:-1].split('\t')
        temp_arr.append(itemList[0])

li_uniq = list(set(temp_arr))

for a in li_uniq:
        print a

【Run & result】

$ sudo python 17.py
愛知県
山形県
岐阜県
千葉県
埼玉県
高知県
群馬県
山梨県
和歌山県
愛媛県
大阪府
静岡県

【Check】

$ cut -f1 hightemp.txt | sort | uniq
埼玉県
岐阜県
山形県
和歌山県
山梨県
愛媛県
大阪府
千葉県
愛知県
群馬県
静岡県
高知県
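
The Python output and the `sort | uniq` output list the same 12 prefectures in different orders, because a `set` has no defined order. Two ways to get a predictable order, sketched in Python 3 syntax (`unique_in_order` is a made-up name). Note that `sorted` compares by code point, which may still differ from the locale-aware order that `sort` uses.

```python
def unique_in_order(items):
    """Return the distinct items, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


prefectures = ["高知県", "埼玉県", "岐阜県", "埼玉県", "高知県"]
print(unique_in_order(prefectures))  # order of first appearance
print(sorted(set(prefectures)))      # code point order
```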

18. Sort the lines in descending numeric order of the third column

【Program】

#!/usr/bin/env python
# coding:utf-8

temp_arr=[]

for line in open('hightemp.txt', 'r'):
        itemList = line[:-1].split('\t')
        temp_arr.append(itemList)

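# NOTE: x[2] is a string and this sort is ascending. The task asks for
# descending numeric order, i.e. key=lambda x:float(x[2]), reverse=True
temp_arr.sort(key=lambda x:x[2])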

for a in temp_arr:
        temp = ""
        for b in a:
                temp+=b + "\t"
        print temp

【Run & result】

$ sudo python 18.py
千葉県  茂原    39.9    2013-08-11
埼玉県  鳩山    39.9    1997-07-05
大阪府  豊中    39.9    1994-08-08
山梨県  大月    39.9    1990-07-19
山形県  鶴岡    39.9    1978-08-03
・
・
・
<omitted>

【Check】

$ sort -k 3 hightemp.txt
愛知県  名古屋  39.9    1942-08-02
山形県  鶴岡    39.9    1978-08-03
山梨県  大月    39.9    1990-07-19
大阪府  豊中    39.9    1994-08-08
・
・
・
<omitted>
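
Both outputs above start from 39.9, so they are in ascending order, and the third column is compared as text. Sorting as strings and sorting as numbers can give different results. A small sketch in Python 3 syntax with made-up rows (9.5 is not in hightemp.txt; it is only there to show the difference). On the shell side, `sort -k3,3nr hightemp.txt` should be the numeric, reversed equivalent.

```python
def sort_by_temperature(rows):
    """Sort rows by the third column as a number, highest first."""
    return sorted(rows, key=lambda r: float(r[2]), reverse=True)


rows = [
    ["A", "x", "9.5", "2000-01-01"],   # made-up row
    ["B", "y", "40.9", "2007-08-16"],
    ["C", "z", "41", "2013-08-12"],
]

as_strings = sorted(rows, key=lambda r: r[2])  # text comparison, ascending
as_numbers = sort_by_temperature(rows)         # numeric, descending

print([r[2] for r in as_strings])
print([r[2] for r in as_numbers])
```

As text, "9.5" sorts after "41" because the comparison goes character by character.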

19. Find how often each string appears in the first column and list them from most to least frequent

【Program】

#!/usr/bin/env python
# coding:utf-8

temp_arr=[]

for line in open('hightemp.txt', 'r'):
        itemList = line[:-1].split('\t')
        temp_arr.append(itemList[0])

temp_arr.sort()
pre_str = ""
temp=[]

for a in temp_arr:
        if pre_str == a:
                temp[len(temp)-1] = [a,temp[len(temp)-1][1]+1]
        else:
                pre_str = a
                temp.append([a,1])

temp.sort(reverse=True,key=lambda x:x[1])
temp_str=""
for a in temp:
        temp_str+= a[0] + " " + str(a[1]) + "\n"

print temp_str

【Run & result】

$ sudo python 19.py
埼玉県 3
山形県 3
山梨県 3
群馬県 3
千葉県 2
岐阜県 2
愛知県 2
静岡県 2
和歌山県 1
大阪府 1
愛媛県 1
高知県 1

【Check】

$ cut -f 1 hightemp.txt | sort | uniq -c | sort -k 1 -r
      3 群馬県
      3 山梨県
      3 山形県
      3 埼玉県
      2 静岡県
      2 愛知県
      2 千葉県
      2 岐阜県
      1 高知県
      1 大阪府
      1 愛媛県
      1 和歌山県
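
The sort-then-count loop can also be handed to `collections.Counter` from the standard library. A minimal sketch in Python 3 syntax; `column_frequencies` and its `column` parameter are names invented for this example:

```python
from collections import Counter


def column_frequencies(lines, column=0):
    """Count the values in one tab-separated column, most frequent first."""
    counts = Counter(line.rstrip("\n").split("\t")[column] for line in lines)
    return counts.most_common()


sample = ["埼玉県\t熊谷\n", "高知県\t江川崎\n", "埼玉県\t鳩山\n"]
for name, count in column_frequencies(sample):
    print(name, count)
```

The order among values with the same count can differ from what the `sort | uniq -c | sort` pipeline prints, just as it does in the two outputs above.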

azwoo.hatenablog.com