README.md
duplicate_counts.txt
fashion_quries__high_quality.from_tags
fashion_quries__high_quality.txt
fashion_quries__high_quality.txt.v2
fashion_quries__high_quality.txt.v2.uniq
fashion_quries__high_quality.txt.v2.uniq.trans
lowercase_counts.txt
lowercase_words.txt
prompts.txt
queries.txt
queries.txt.formated
tag体系结构 2.md
tag体系结构 2_formatted.txt
tag体系结构 3.md
tag体系结构 3_formatted.txt
tag体系结构 4.md
tag体系结构 4_formatted.txt
tag体系结构 5.md
tag体系结构 5_formatted.txt
tag体系结构 6.md
tag体系结构 6_formatted.txt
tag体系结构 7.md
tag体系结构 7_formatted.txt
tag体系结构.md
tag体系结构1.md
tag体系结构1_formatted.txt
tag体系结构_formatted.txt

README.md

The final output is three files:

queries.txt.formated fashion_quries__high_quality.txt.v2.uniq fashion_quries__high_quality.from_tags fashion_quries__high_quality.txt.v2.uniq.trans

cat *rmatted.txt | grep ", " | sed 's/, /\n/g' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -k1rn | awk '$1 > 2 {print}' > lowercase_counts.txt

awk '{$1=""; print $0}' /data/tw/SearchEngine/docs/dataset/lowercase_counts.txt | sed 's/^ *//' | grep -v '^$' > lowercase_words.txt
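
The two one-liners above can be wrapped as small functions that read from stdin, which makes them easy to try on a few sample lines before running them on the real files. This is only a sketch: the function names `count_tags` and `strip_counts` are hypothetical, it assumes GNU sed (for `\n` in the replacement), and it assumes the last filter is meant to drop empty lines (`grep -v '^$'`).

```shell
# Hypothetical helpers; the README itself only gives the two one-liners.

# Split comma-separated lines into one tag per line, lowercase them,
# count occurrences, and keep tags seen more than `min` times (default 2).
count_tags() {
  min="${1:-2}"
  grep ", " \
    | sed 's/, /\n/g' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -k1rn \
    | awk -v min="$min" '$1 > min {print}'
}

# Drop the leading count column and any empty lines, leaving only the tags.
strip_counts() {
  awk '{$1=""; print $0}' | sed 's/^ *//' | grep -v '^$'
}

# Example usage on the real files (same inputs as the commands above):
#   cat *rmatted.txt | count_tags 2 > lowercase_counts.txt
#   strip_counts < lowercase_counts.txt > lowercase_words.txt
```

Feeding a handful of lines such as `Dress, Red, Summer` through `count_tags 2 | strip_counts` is a quick way to check that only tags occurring at least three times survive.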