Vi Regular Expressions and Data Processing (1)
Data processing is an important step before data mining.
- For example, suppose you are going to mine LinkedIn public profiles:
- 1. Find the number of people who have
- education - experience - skills - all three
- 2. Find all the skills and chart the number of users who have each one
- 3. Find the number of companies
- 4. Find all the people who are in the US
- 5. Find the education level of every person
- e.g. undergraduate, graduate, PhD
- 6. Find the number of jobs in each person's experience
- e.g. 1 job, 2 jobs, 3 jobs
Regular expressions in Vim
- Delete the lines that match a particular format
- e.g. blank out every line that starts with 'name:', a tab and 'current'
- :%s/^name:\tcurrent.*$//g
- this leaves an empty line behind; :g/^name:\tcurrent/d removes the line entirely
- e.g. remove every 'skill_num:0' line together with its line break
- :%s/skill_num:0\n//g
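The same kind of line deletion can be done outside Vim. A minimal sketch, assuming the tab-separated 'name:' lines shown above; the function name drop_matching_lines is only illustrative:

```shell
# Hypothetical helper: print a file without the lines that match a pattern.
# $1 = basic regular expression, $2 = input file
drop_matching_lines() {
    # grep -v keeps only the lines that do NOT match
    grep -v -e "$1" "$2"
}
```

For example, drop_matching_lines "^name:$(printf '\t')current" profile.txt prints profile.txt (a made-up file name) without the matching lines; the printf call is there because grep's basic regular expressions do not portably understand \t.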
Count the numbers
Collect the number of education entries, experience entries and skills of each profile
- the number of skills
- grep 'skill_num' *.txt > ./checklist/skill_num.txt
- delete the tag before the number (in Vim, inside skill_num.txt)
- :%s/output[1-9][0-9]*\.txt:skill_num://
- the number of experience entries
- grep 'experience_num' *.txt > ./checklist/exp_num.txt
- delete the tag before the number
- :%s/output[1-9][0-9]*\.txt:experience_number://
- the number of education entries
- grep 'education_nun' *.txt > ./checklist/edu_num.txt
- delete the tag before the number
- :%s/output[1-9][0-9]*\.txt:education_nun://
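The grep step and the tag-stripping step can also be combined in one pass. This is only a sketch: it assumes each count sits on its own line in the form 'tag:number' (for example skill_num:3), and extract_counts is a made-up name:

```shell
# Hypothetical helper: print only the numbers that follow a tag.
# $1 = tag name (e.g. skill_num), remaining arguments = input files
extract_counts() {
    tag=$1
    shift
    # -n suppresses normal output; the p flag prints only the lines where
    # the tag was found and removed, so no file name prefix ever appears
    sed -n "s/^${tag}://p" "$@"
}
```

Something like extract_counts skill_num output*.txt > ./checklist/skill_num.txt would then write the bare numbers, with no cleanup needed in Vim afterwards.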
Use the education, experience and skill counts to plot the trend
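Before plotting, the raw counts usually need to be turned into a frequency table. A possible sketch, assuming a file with one number per line; count_distribution is a hypothetical name and the plotting tool itself is left open:

```shell
# Hypothetical helper: turn a file of numbers into "value frequency" pairs.
# $1 = file with one number per line
count_distribution() {
    # sort numerically so that equal values are adjacent, count each run,
    # then swap the columns so that the value comes first
    sort -n "$1" | uniq -c | awk '{ print $2, $1 }'
}
```

Each output line reads "value frequency", for example "3 2" when two profiles have three skills, which most plotting tools can read as x and y columns.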
Diff the files
- diff whole_file subfile | grep '^< ' | sed 's/^< //'
- this command prints the lines of whole_file that are not in subfile
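The diff pipeline can be wrapped in a small function. A sketch under the assumption that subfile contains a subset of whole_file's lines in the same order; lines_only_in_whole is an illustrative name:

```shell
# Hypothetical helper: print the lines that are in the first file only.
# $1 = whole file, $2 = sub file
lines_only_in_whole() {
    # in diff's default output, lines starting with "< " come from the
    # first file; sed keeps just those and strips the marker
    diff "$1" "$2" | sed -n 's/^< //p'
}
```

With a whole file holding a, b, c, d and a sub file holding a, c (one per line), this prints b and d.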
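If you want to split one very long string into lines
- in Vim, replace "," with "\r", not "\n", e.g. :%s/,/\r/g
- or use tr; with GNU tr this turns every comma and every space into a line break, so a ", " separator leaves an empty line behind
tr ", " "\n"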
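If the values themselves may contain spaces, splitting on the comma alone is safer. A minimal sketch; split_on_commas is a made-up name and it reads from standard input:

```shell
# Hypothetical helper: put every comma-separated value on its own line.
split_on_commas() {
    # tr maps each comma to a newline; sed then removes the spaces
    # that followed the commas
    tr ',' '\n' | sed 's/^ *//'
}
```

For instance, printf 'java, python,sql\n' | split_on_commas prints java, python and sql on three separate lines.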
Erase the empty lines
sed '/^$/d' file
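Lines that hold only spaces or tabs look empty as well but are not matched by ^$. A small sketch that drops both kinds; drop_blank_lines is an illustrative name:

```shell
# Hypothetical helper: remove empty and whitespace-only lines.
# Reads the named files, or standard input when no file is given.
drop_blank_lines() {
    # [[:space:]]* matches zero or more blanks, so truly empty lines
    # and lines made only of spaces or tabs are both deleted
    sed '/^[[:space:]]*$/d' "$@"
}
```

It can be used at the end of a pipeline, e.g. tr ", " "\n" < file | drop_blank_lines.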
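Replace ^M with a line break
- in Vim use :%s/\r/\r/
- in the pattern, \r matches the carriage return (^M); in the replacement, \r inserts a line break, like pressing the Enter key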
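Outside Vim, tr can do the same conversion. A sketch with two hypothetical helpers, since which one fits depends on the file: cr_to_lf for files whose lines end in a bare carriage return, strip_cr for Windows-style files where each carriage return is followed by a newline:

```shell
# Hypothetical helpers; both read standard input and write standard output.

# Turn every carriage return (^M) into a newline.
cr_to_lf() {
    tr '\r' '\n'
}

# Delete every carriage return, keeping the newlines that are already there.
strip_cr() {
    tr -d '\r'
}
```

For example, strip_cr < input.txt > clean.txt, where both file names are placeholders.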
Sort the file numerically by its first column
sort -n -k1,1 filename
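A short sketch of what the -n and -k1,1 options do, assuming whitespace-separated columns; sort_by_first_column is a made-up wrapper name:

```shell
# Hypothetical helper: sort lines by the numeric value of the first column.
# Reads the named files, or standard input when no file is given.
sort_by_first_column() {
    # -k1,1 restricts the sort key to the first field only;
    # -n compares that key as a number, so 2 sorts before 10
    sort -n -k1,1 "$@"
}
```

Without -n the comparison is textual and "10" would come before "2".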