I have a very large string s. Each line of s consists of word_1 followed by word_2, an id, and a number:
word_1 word_2 id number
I would like to create a regex that collects in a list all the occurrences of words whose id is RN_ _ _, followed by a word with the id VA_ _ _ _ and then a word with the id VM_ _ _ _. The constraint for extracting the RN_ _ _ _ _, VA_ _ _ _ _ _ and VM_ _ _ _ pattern is that the occurrences must appear one after another, where each _ is a free character of the id string. There can be more than 3 of these free characters, e.g.:
casa casa NCFS000 0.979058
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1
This is the pattern I would like to extract, since the three ids are placed one after another. This would be the desired output, as a list:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
For example, this would be wrong, since they are not one after another:
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
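To make the "one after another" rule concrete, here is a minimal sketch that checks it line by line without a regex. The function name `consecutive_triples` and the variables `good` and `bad` are made up for illustration, and it assumes every line has the shape `word_1 word_2 id [number]` separated by whitespace:

```python
def consecutive_triples(text):
    """Return ('word_2 id', ...) triples for adjacent RN, VA, VM lines."""
    # Split each non-empty line into its whitespace-separated fields.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    triples = []
    # Slide a window of three adjacent lines over the text.
    for a, b, c in zip(rows, rows[1:], rows[2:]):
        if min(len(a), len(b), len(c)) < 3:
            continue  # skip lines that do not have at least word, word, id
        if (a[2].startswith("RN") and b[2].startswith("VA")
                and c[2].startswith("VM")):
            triples.append(tuple("%s %s" % (r[1], r[2]) for r in (a, b, c)))
    return triples


# The two small examples from above.
good = "mejor mejor AQ0CS0 0.873665\nno no RN\nesta estar VASI1S0\nlavando lavar VMP00SM\n. . Fp 1"
bad = "error error RN\nerror error VASI1S0\npues pues CS 0.998047\nerror error VMP00SM"

print(consecutive_triples(good))  # [('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
print(consecutive_triples(bad))   # []
```

The second call returns an empty list because the `pues pues CS` line sits between the VA and VM lines.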
So for the string s:
s = '''
No no RN 0.998045
sabía saber VMII3S0 0.592869
como como CS 0.999289
se se P00CN000 0.465639
ponía poner VMII3S0 0.65
una uno DI0FS0 0.951575
error error RN
actuar accion VMP00SM
lavadora lavadora NCFS000 0.414738
hasta hasta SPS00 0.957698
error error VMP00SM
que que PR0CN000 0.562517
conocí conocer VMIS1S0 1
esta este DD0FS0 0.986779
error error VA00SM
y y CC 0.999962
es ser VSIP3S0 1
que que CS 0.437483
es ser VSIP3S0 1
muy muy RG 1
sencilla sencillo AQ0FS0 1
de de SPS00 0.999984
utilizar utilizar VMN0000 1
! ! Fat 1
Todo todo DI0MS0 0.560961
un uno DI0MS0 0.987295
gustazo gustazo NCMS000 1
error error VA00SM
cuando cuando CS 0.985595
estamos estar VAIP1P0 1
error error VMP00RM
aprendiendo aprender VMG0000 1
para para SPS00 0.999103
emancipar emancipar VMN0000 1
nos nos PP1CP000 1
, , Fc 1
que que CS 0.437483
si si CS 0.99954
error error RN
nos nos PP1CP000 0.935743
ponen poner VMIP3P0 1
facilidad facilidad NCFS000 1
con con SPS00 1
las el DA0FP0 0.970954
error error VMP00RM
tareas tarea NCFP000 1
de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
alla alla VASI1S0
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
'''
This is what I tried:
import re
weird_triple = re.findall(r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VA\w+)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)', s)
print("\n This is the weird triple\n")
print(weird_triple)
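As a quick check, the same pattern can be run on the four-line "wrong" example from earlier. This is only a small reproduction sketch; the names `attempt` and `not_adjacent` are made up here:

```python
import re

# The pattern from the attempt above, unchanged (only split over two lines).
attempt = re.compile(
    r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VA\w+)'
    r'(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)'
)

# The example that should NOT match: a CS line sits between VA and VM.
not_adjacent = (
    "error error RN\n"
    "error error VASI1S0\n"
    "pues pues CS 0.998047\n"
    "error error VMP00SM\n"
)

print(attempt.findall(not_adjacent))
# [('error RN', 'error VASI1S0', 'error VMP00SM')]
```

With `(?s)` the dot also consumes newlines, and the lookahead only stops in front of whitespace followed by `RN`, `VA` or `VM`, so the whole `pues pues CS 0.998047` line is skipped over and the triple is still reported.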
The problem with this approach is that it returns a list of RN_ _ _ _, VA_ _ _ _, VM_ _ _ matches without enforcing the one-after-another order (some ids and words between the parts of the pattern are being matched). Any idea how to fix this in order to obtain:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Thanks in advance guys!
UPDATE: I tried the approaches that other users recommended, but the problem is that if I add another one-after-another pattern like:
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
to the string s, the regexes recommended for this question don't work. They only catch:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
Instead of:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
which is the right result. Any idea how to get the one-after-another pattern output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
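One possible direction, shown only as a sketch: anchor the regex to whole lines, so that the RN, VA and VM ids have to sit on three adjacent lines. The names `TRIPLE` and `excerpt` are made up here, and the pattern assumes each line looks like `word_1 word_2 id [number]` with a single space between word_2 and the id:

```python
import re

# Three consecutive lines; each group captures "word_2 id".
TRIPLE = re.compile(
    r'^\S+[ \t]+(\S+[ \t]+RN\w*)[^\n]*\n'  # line 1: id starts with RN
    r'\S+[ \t]+(\S+[ \t]+VA\w*)[^\n]*\n'   # line 2: id starts with VA
    r'\S+[ \t]+(\S+[ \t]+VM\w*)',          # line 3: id starts with VM
    re.MULTILINE,
)

# The last lines of s, copied from the question.
excerpt = """\
de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
alla alla VASI1S0
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
mejor mejor AQ0CS0 0.873665
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
"""

print(TRIPLE.findall(excerpt))
# [('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'),
#  ('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
```

Because `[ \t]` and `[^\n]` never cross a line break, nothing can be skipped between the three lines, so the `error` block with the `pues pues CS` line in the middle is rejected. Note that this follows the rule as stated (RN, then VA, then VM), so its first tuple starts at the `no RN` line, whereas the second tuple in the desired list above starts at the VA line.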