Saturday, April 25, 2015

Problems preserving the occurrence in a regex?


I have a very large string s. Each line of s consists of word_1 followed by word_2, an id and a number:

word_1 word_2 id number

I would like to create a regex that collects in a list every occurrence of a word whose id is RN_ _ _, followed by a word whose id is VA_ _ _ _, followed by a word whose id is VM_ _ _ _. The constraint is that the RN_ _ _, VA_ _ _ _ and VM_ _ _ _ lines must appear one right after another. Here each _ stands for a free character of the id string, and there can be more than 3 of them, e.g.:

casa casa NCFS000 0.979058
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1

This is the pattern I would like to extract, since the three lines are placed one after another. This would be the desired output, as a list:

 [('no RN', 'estar VASI1S0', 'lavar VMP00SM')]

For example, this would be wrong, since the lines are not one after another:

error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
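
A minimal sketch of the one-after-another rule illustrated by the two examples above, assuming every non-empty line splits on whitespace into word, lemma, id and an optional number. The function name consecutive_triples and its parameters are hypothetical, and it relies on plain line splitting rather than a regex:

```python
def consecutive_triples(text, prefixes=("RN", "VA", "VM")):
    """Return ("lemma id", ...) tuples for ids that start with the given
    prefixes on directly consecutive lines."""
    # One row of fields per line. A blank line becomes an empty row and
    # therefore breaks a run (an assumption, not stated in the question).
    rows = [line.split() for line in text.splitlines()]
    found = []
    for i in range(len(rows) - len(prefixes) + 1):
        window = rows[i:i + len(prefixes)]
        if all(len(row) >= 3 and row[2].startswith(prefix)
               for row, prefix in zip(window, prefixes)):
            # Keep the second and third fields (lemma and id) of each line.
            found.append(tuple(" ".join(row[1:3]) for row in window))
    return found


good = """casa casa NCFS000 0.979058
mejor mejor AQ0CS0 0.873665
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1"""

bad = """error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM"""

print(consecutive_triples(good))  # [('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
print(consecutive_triples(bad))   # []
```

Checking a sliding window of three rows makes adjacency explicit: a line with any other id inside the window rejects the whole candidate.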

So for the s string:

s = '''
    No no RN 0.998045
    sabía saber VMII3S0 0.592869
    como como CS 0.999289
    se se P00CN000 0.465639
    ponía poner VMII3S0 0.65
    una uno DI0FS0 0.951575
    error error RN
    actuar accion VMP00SM
    lavadora lavadora NCFS000 0.414738
    hasta hasta SPS00 0.957698
    error error VMP00SM
    que que PR0CN000 0.562517
    conocí conocer VMIS1S0 1
    esta este DD0FS0 0.986779
    error error VA00SM
    y y CC 0.999962
    es ser VSIP3S0 1
    que que CS 0.437483
    es ser VSIP3S0 1
    muy muy RG 1
    sencilla sencillo AQ0FS0 1
    de de SPS00 0.999984
    utilizar utilizar VMN0000 1
    ! ! Fat 1

    Todo todo DI0MS0 0.560961
    un uno DI0MS0 0.987295
    gustazo gustazo NCMS000 1
    error error VA00SM
    cuando cuando CS 0.985595
    estamos estar VAIP1P0 1
    error error VMP00RM
    aprendiendo aprender VMG0000 1
    para para SPS00 0.999103
    emancipar emancipar VMN0000 1
    nos nos PP1CP000 1
    , , Fc 1
    que que CS 0.437483
    si si CS 0.99954
    error error RN
    nos nos PP1CP000 0.935743
    ponen poner VMIP3P0 1
    facilidad facilidad NCFS000 1
    con con SPS00 1
    las el DA0FP0 0.970954
    error error VMP00RM
    tareas tarea NCFP000 1
    de de SPS00 0.999984
    no no RN 0.998134
    estás estar VAIP2S0 1
    condicionado condicionar VMP00SM 0.491858
    alla alla VASI1S0
    la el DA0FS0 0.972269
    casa casa NCFS000 0.979058
    error error RN
    error error VASI1S0
    pues pues CS 0.998047
    error error VMP00SM
    mejor mejor AQ0CS0 0.873665
    que que PR0CN000 0.562517
    mejor mejor AQ0CS0 0.873665
    no no RN 1
    esta estar VASI1S0 0.908900
    lavando lavar VMP00SM 0.9080972
    . . Fp 1
    '''

This is what I tried:

import re
# Attempt: capture "lemma id" for RN, VA and VM, allowing filler text in between
weird_triple = re.findall(r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VA\w+)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)', s)

print("\n This is the weird triple\n")
print(weird_triple)

The problem with this approach is that it returns a list of RN_ _ _ _, VA_ _ _ _, VM_ _ _ matches, but without respecting the one-after-another order (some ids and words between the three lines are being matched). Any idea how to fix this in order to obtain:
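
To see the failure in isolation, the same pattern can be run on the four-line example that should be rejected. Because (?s) lets . match newlines, the lazy (?:(?!...).)*? groups are free to consume whole lines, so the CS line is skipped over and a triple is still reported. The variable names below are only for illustration:

```python
import re

# The pattern from the attempt, unchanged.
pattern = (r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?'
           r'(\w+\s+VA\w+)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)')

# RN and VA are adjacent, but a CS line sits between VA and VM.
not_consecutive = """error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM"""

result = re.findall(pattern, not_consecutive)
print(result)
# [('error RN', 'error VASI1S0', 'error VMP00SM')]
```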

[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]

Thanks in advance guys!

UPDATE: I tried the approaches that other users recommended, but the problem is that if I add another one-after-another pattern like:

no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858

to the s string, the regexes recommended in the answers to this question don't work. They only catch:

[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]

Instead of:

[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]

which is the correct result. Any idea how to get this one-after-another output:

[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
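
One possible way to enforce adjacency with a regex is to describe three whole lines and join them with a literal newline, so nothing can sit between them. This is only a sketch under assumptions: lines are separated by \n, fields by spaces or tabs, the number is optional, and the names line and TRIPLE are hypothetical. Note also that, reading the rule strictly as RN then VA then VM, the lines added in the update give ('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'), which differs from the second tuple listed above.

```python
import re


def line(prefix):
    # One full line: optional indent, word, captured "lemma id" whose id
    # starts with `prefix`, an optional number, optional trailing blanks.
    return (r'[ \t]*\S+[ \t]+(\S+[ \t]+' + prefix +
            r'\w*)(?:[ \t]+\S+)?[ \t]*')


# Three such lines joined by newlines; ^ and $ anchor to line boundaries
# because of re.MULTILINE.
TRIPLE = re.compile(
    '^' + line('RN') + r'\n' + line('VA') + r'\n' + line('VM') + '$',
    re.MULTILINE,
)

# A shortened excerpt of s, not the full string.
sample = """error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
alla alla VASI1S0
mejor mejor AQ0CS0 0.873665
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
"""

triples = TRIPLE.findall(sample)
print(triples)
# [('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'),
#  ('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
```

Using [ \t] instead of \s inside a line keeps the field separators from matching a newline, which is what stops the pattern from spilling over into other lines.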

